linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Tejun Heo <tj@kernel.org>, Roman Gushchin <guro@fb.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] mm: memcontrol: prevent starvation when writing memory.high
Date: Fri, 15 Jan 2021 11:20:50 -0500	[thread overview]
Message-ID: <YAHA4uBSLlnxxAbu@cmpxchg.org> (raw)
In-Reply-To: <20210113144654.GD22493@dhcp22.suse.cz>

On Wed, Jan 13, 2021 at 03:46:54PM +0100, Michal Hocko wrote:
> On Tue 12-01-21 11:30:11, Johannes Weiner wrote:
> > When a value is written to a cgroup's memory.high control file, the
> > write() context first tries to reclaim the cgroup to size before
> > putting the limit in place for the workload. Concurrent charges from
> > the workload can keep such a write() looping in reclaim indefinitely.
> > 
> > In the past, a write to memory.high would first put the limit in place
> > for the workload, then do targeted reclaim until the new limit has
> > been met - similar to how we do it for memory.max. This wasn't prone
> > to the described starvation issue. However, this sequence could cause
> > excessive latencies in the workload, when allocating threads could be
> > put into long penalty sleeps on the sudden memory.high overage created
> > by the write(), before that had a chance to work it off.
> > 
> > Now that memory_high_write() performs reclaim before enforcing the new
> > limit, reflect that the cgroup may well fail to converge due to
> > concurrent workload activity. Bail out of the loop after a few tries.
> 
> I can see that you have provided some more details in follow up replies
> but I do not see any explicit argument why an excessive time for writer
> is an actual problem. Could you be more specific?

Our writer isn't necessarily time sensitive, but there is a difference
between a) the write taking a few seconds to reclaim down the
requested delta and b) the writer essentially turning into kswapd for
the workload and busy-spinning inside the kernel indefinitely.

We've seen the writer stuck in this function for minutes, long after
the requested delta has been reclaimed, consuming alarming amounts of
CPU cycles - CPU time that should really be accounted to the workload,
not the system software performing the write.

Obviously, we could work around it using timeouts and signals. In
fact, we may have to until the new kernel is deployed everywhere. But
this is the definition of an interface change breaking userspace, so
I'm a bit surprised by your laid-back response.

> > Fixes: 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
> > Cc: <stable@vger.kernel.org> # 5.8+
> 
> Why is this worth backporting to stable? The behavior is different but I
> do not think any of them is harmful.

The referenced patch changed user-visible behavior in a way that is
causing real production problems for us. From stable-kernel-rules:

 - It must fix a real bug that bothers people (not a, "This could be a
   problem..." type thing).

> > Reported-by: Tejun Heo <tj@kernel.org>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> I am not against the patch. The existing interface doesn't provide any
> meaningful feedback to the userspace anyway. User would have to re check
> to see the result of the operation. So how hard we try is really an
> implementation detail.

Yeah, I wish it was a bit more consistent from an interface POV.

Btw, if you have noticed, Roman's patch to enforce memcg->high *after*
trying to reclaim went into the tree at the same exact time as Chris's
series "mm, memcg: reclaim harder before high throttling" (commit
b3ff92916af3b458712110bb83976a23471c12fa). It's likely they overlap.

Chris's patch changes memory.high reclaim on the allocation side from

	reclaim once, sleep if there is still overage

to

	reclaim the overage as long as you make forward progress;
	sleep after 16 no-progress loops if there is still overage

Roman's patch describes a problem where allocating threads go to sleep
when memory.high is lowered by a wider step. This is exceedingly
unlikely after Chris's change.

Because after Chris's change, memory.high is reclaimed on the
allocation side as aggressively as memory.max. The only difference is
that upon failure, one sleeps and the other OOMs.

If Roman's issue were present after Chris's change, then we'd also see
premature OOM kills when memory.max is lowered by a large step. And I
have never seen that happening.

So I suggest instead of my fix here, we revert Roman's patch instead,
as it should no longer be needed. Thoughts?


  reply	other threads:[~2021-01-15 16:23 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-12 16:30 Johannes Weiner
2021-01-12 17:03 ` Roman Gushchin
2021-01-12 19:45   ` Johannes Weiner
2021-01-12 20:12     ` Roman Gushchin
2021-01-12 21:11       ` Johannes Weiner
2021-01-12 21:45         ` Roman Gushchin
2021-01-15 15:34           ` Johannes Weiner
2021-01-12 18:59 ` Shakeel Butt
2021-01-12 19:53   ` Johannes Weiner
2021-01-12 20:28     ` Shakeel Butt
2021-01-13 14:46 ` Michal Hocko
2021-01-15 16:20   ` Johannes Weiner [this message]
2021-01-15 17:03     ` Roman Gushchin
2021-01-15 20:55       ` Johannes Weiner
2021-01-15 21:27         ` Roman Gushchin
2021-01-19 16:47           ` Johannes Weiner
2021-01-18 13:12     ` Michal Hocko
2021-01-13 17:25 ` Michal Koutný
2021-01-13 18:06 ` Roman Gushchin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YAHA4uBSLlnxxAbu@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=guro@fb.com \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox