Re: [PATCH v4] mm: Throttle allocators when failing reclaim over memory.high


From: Johannes Weiner <hannes@cmpxchg.org>
To: Chris Down <chris@chrisdown.name>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Tejun Heo <tj@kernel.org>, Roman Gushchin <guro@fb.com>,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org, kernel-team@fb.com,
	Michal Hocko <mhocko@kernel.org>
Subject: Re: [PATCH v4] mm: Throttle allocators when failing reclaim over memory.high
Date: Tue, 23 Jul 2019 16:50:26 -0400
Message-ID: <20190723205026.GB30522@cmpxchg.org>
In-Reply-To: <20190723180700.GA29459@chrisdown.name>

On Tue, Jul 23, 2019 at 02:07:00PM -0400, Chris Down wrote:
> We're trying to use memory.high to limit workloads, but have found that
> containment can frequently fail completely and cause OOM situations
> outside of the cgroup. This happens especially with swap space -- either
> when none is configured, or swap is full. These failures also often
> come without enough warning for either a human or a daemon monitoring
> PSI to react.
> 
> Here is output from a simple program showing how long it takes in μsec
> (column 2) to allocate a megabyte of anonymous memory (column 1) when a
> cgroup is already beyond its memory high setting, and no swap is
> available:
> 
>     [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
>     > --wait -t timeout 300 /root/mdf
>     [...]
>     95  1035
>     96  1038
>     97  1000
>     98  1036
>     99  1048
>     100 1590
>     101 1968
>     102 1776
>     103 1863
>     104 1757
>     105 1921
>     106 1893
>     107 1760
>     108 1748
>     109 1843
>     110 1716
>     111 1924
>     112 1776
>     113 1831
>     114 1766
>     115 1836
>     116 1588
>     117 1912
>     118 1802
>     119 1857
>     120 1731
>     [...]
>     [System OOM in 2-3 seconds]
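
The test program (/root/mdf) is not posted in this thread. For readers
who want to reproduce the numbers, here is a minimal sketch of what such
a probe might look like. The function names and the use of mmap() plus
memset() are assumptions for illustration, not the actual tool:

```c
/* Sketch of an allocation-latency probe; NOT the actual /root/mdf. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define MB (1024UL * 1024UL)

/*
 * Map and touch 1 MiB of anonymous memory and return the elapsed time
 * in microseconds, or -1 if mmap() fails. The mapping is deliberately
 * never unmapped so that the cgroup's usage keeps growing.
 */
static long alloc_one_mb_usec(void)
{
	struct timespec start, end;
	void *p;

	clock_gettime(CLOCK_MONOTONIC, &start);
	p = mmap(NULL, MB, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	memset(p, 1, MB);	/* fault in every page so it gets charged */
	clock_gettime(CLOCK_MONOTONIC, &end);

	return (end.tv_sec - start.tv_sec) * 1000000L +
	       (end.tv_nsec - start.tv_nsec) / 1000L;
}

/* Print "<megabytes allocated so far> <usec for this megabyte>" n times. */
static int run_probe(int n)
{
	for (int i = 1; i <= n; i++) {
		long usec = alloc_one_mb_usec();

		if (usec < 0)
			return -1;
		printf("%d %ld\n", i, usec);
	}
	return 0;
}
```

Calling run_probe() from main() under the systemd-run invocation quoted
above should produce output in the same two-column shape.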
> 
> The delay does go up extremely marginally past the 100MB memory.high
> threshold, as now we spend time scanning before returning to usermode,
> but it's nowhere near enough to contain growth. It also doesn't get
> worse the more pages you have, since it only considers nr_pages.
> 
> The current situation goes against both the expectations of users of
> memory.high, and our intentions as cgroup v2 developers. In
> cgroup-v2.txt, we claim that we will throttle and only under "extreme
> conditions" will memory.high protection be breached. Likewise, cgroup v2
> users generally also expect that memory.high should throttle workloads
> as they exceed their high threshold. However, as seen above, this isn't
> always how it works in practice -- even on banal setups like those with
> no swap, or where swap has become exhausted, we can end up with
> memory.high being breached with nothing left in our arsenal to combat
> runaway growth, since reclaim is futile.
> 
> It's also hard for system monitoring software or users to tell how bad
> the situation is, as "high" events for the memcg may in some cases be
> benign, and in others be catastrophic. The status quo is that we
> fail containment in a way that doesn't provide any advance warning that
> things are about to go horribly wrong (for example, we are about to
> invoke the kernel OOM killer).
> 
> This patch introduces explicit throttling when reclaim is failing to
> keep memcg size contained at the memory.high setting. It does so by
> applying an exponential delay curve derived from the memcg's overage
> compared to memory.high.  In the normal case where the memcg is either
> below or only marginally over its memory.high setting, no throttling
> will be performed.
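
To make the shape of the curve concrete, here is a rough sketch of how a
delay could be derived from the overage. Only the 2*HZ clamp and the
name MEMCG_MAX_HIGH_DELAY_JIFFIES come from the text; the HZ value, the
two shift constants, the function name, and the choice to square the
relative overage are assumptions for illustration and should not be
read as the patch itself:

```c
#include <stdint.h>

#define HZ 1000UL	/* assumed tick rate; the real HZ is a build option */
#define MEMCG_MAX_HIGH_DELAY_JIFFIES (2UL * HZ)	/* clamp named in the text */
#define PRECISION_SHIFT 20	/* hypothetical fixed-point precision */
#define SCALING_SHIFT 14	/* hypothetical steepness of the curve */

/*
 * Return a sleep in jiffies for a memcg using @usage pages while
 * memory.high is set to @high pages.
 */
static unsigned long high_delay_jiffies(uint64_t usage, uint64_t high)
{
	uint64_t overage, penalty;

	if (usage <= high)
		return 0;
	if (high == 0)
		high = 1;	/* avoid dividing by zero */

	/* Overage relative to memory.high, as a fixed-point fraction. */
	overage = ((usage - high) << PRECISION_SHIFT) / high;

	/*
	 * Beyond this point the squared value is already past the clamp;
	 * returning early also keeps the multiplication from overflowing.
	 */
	if (overage >= (1ULL << 18))
		return MEMCG_MAX_HIGH_DELAY_JIFFIES;

	/* Square it: small excursions cost little, large ones a lot. */
	penalty = (overage * overage * HZ) >>
		  (PRECISION_SHIFT + SCALING_SHIFT);

	if (penalty > MEMCG_MAX_HIGH_DELAY_JIFFIES)
		penalty = MEMCG_MAX_HIGH_DELAY_JIFFIES;
	return (unsigned long)penalty;
}
```

With memory.high at 100M (25600 4K pages), this sketch yields a few
jiffies at 1% overage and reaches the clamp at around 18% overage. The
quoted text also says marginal overage is not throttled at all; a floor
below which the sleep is skipped is left out here for brevity.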
> 
> This composes well with system health monitoring and remediation, as
> these allocator delays are factored into PSI's memory pressure
> calculations. This both gives system administrators and applications
> consuming the PSI interface a way to see that the memcg in question is
> struggling and make better-informed decisions, and gives them enough
> time to act. Either can act with significantly more nuance than we can
> provide using the system OOM killer.
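
As an illustration of the consumer side, here is a small sketch of how a
monitoring daemon might parse one line of a cgroup's memory.pressure
file. The line format (avg10/avg60/avg300/total) is the one the PSI
interface exposes; the struct, the function names, and the threshold are
hypothetical:

```c
#include <stdio.h>
#include <string.h>

struct psi_line {
	double avg10, avg60, avg300;	/* percent of time stalled */
	unsigned long long total;	/* cumulative stall time, usec */
};

/*
 * Parse one "some ..." or "full ..." line from a pressure file.
 * @kind selects which of the two is expected.
 * Returns 0 on success, -1 if the line does not match.
 */
static int psi_parse(const char *line, const char *kind,
		     struct psi_line *out)
{
	char name[8];

	if (sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   name, &out->avg10, &out->avg60, &out->avg300,
		   &out->total) != 5)
		return -1;
	return strcmp(name, kind) == 0 ? 0 : -1;
}

/* Hypothetical policy: flag the memcg once avg10 crosses a threshold. */
static int psi_is_struggling(const struct psi_line *p, double threshold)
{
	return p->avg10 >= threshold;
}
```

A daemon would read the line with fgets() from the memcg's
memory.pressure file and decide for itself what to do once
psi_is_struggling() fires, well before the system OOM killer gets
involved.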
> 
> This is a similar idea to memory.oom_control in cgroup v1 which would
> put the cgroup to sleep if the threshold was violated, but it's also
> significantly improved as it results in visible memory pressure, and
> also doesn't schedule indefinitely, which previously made tracing and
> other introspection difficult (i.e. it's clamped at 2*HZ per allocation
> through MEMCG_MAX_HIGH_DELAY_JIFFIES).
> 
> Contrast the previous results with a kernel with this patch:
> 
>     [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
>     > --wait -t timeout 300 /root/mdf
>     [...]
>     95  1002
>     96  1000
>     97  1002
>     98  1003
>     99  1000
>     100 1043
>     101 84724
>     102 330628
>     103 610511
>     104 1016265
>     105 1503969
>     106 2391692
>     107 2872061
>     108 3248003
>     109 4791904
>     110 5759832
>     111 6912509
>     112 8127818
>     113 9472203
>     114 12287622
>     115 12480079
>     116 14144008
>     117 15808029
>     118 16384500
>     119 16383242
>     120 16384979
>     [...]
> 
> As you can see, in the normal case, allocating a megabyte takes around
> 1000 μsec. However, as we exceed our memory.high, things start to
> increase exponentially, but fairly leniently at first. Our first
> megabyte over memory.high takes us about 0.08 seconds, the next 0.33
> seconds, the next 0.61 seconds, and the one after that just over a
> second. This gets worse until we reach our eventual 2*HZ clamp per
> batch, resulting in 16 seconds per megabyte.
> However, this is still making forward progress, so permits tracing or
> further analysis with programs like GDB.
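
The 16 seconds per megabyte figure follows from the clamp if one assumes
4K pages and a charge batch of 32 pages; neither is stated in the text
above, so treat both as assumptions. A small worked version of that
arithmetic, with a hypothetical helper name:

```c
/*
 * Seconds spent sleeping per MiB allocated once every batch hits the
 * clamp. All three inputs are parameters because only the 2-second
 * clamp (2*HZ) is given in the text.
 */
static unsigned long clamped_seconds_per_mib(unsigned long page_size,
					     unsigned long batch_pages,
					     unsigned long clamp_seconds)
{
	unsigned long batch_bytes = page_size * batch_pages;
	unsigned long batches_per_mib = (1024UL * 1024UL) / batch_bytes;

	return batches_per_mib * clamp_seconds;
}
```

With 4096-byte pages and 32 pages per batch, a batch is 128K, so a
megabyte is 8 batches, and 8 batches at 2 seconds each gives the 16
seconds seen in the last rows of the table.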
> 
> We use an exponential curve for our delay penalty for a few reasons:
> 
> 1. We run mem_cgroup_handle_over_high to potentially do reclaim after
>    we've already performed allocations, which means that temporarily
>    going over memory.high by a small amount may be perfectly legitimate,
>    even for compliant workloads. We don't want to unduly penalise such
>    cases.
> 2. An exponential curve (as opposed to a static or linear delay) allows
>    ramping up memory pressure stats more gradually, which can be useful
>    to work out that you have set memory.high too low, without destroying
>    application performance entirely.
> 
> This patch expands on earlier work by Johannes Weiner. Thanks!
> 
> Signed-off-by: Chris Down <chris@chrisdown.name>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: kernel-team@fb.com
> ---

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
