Re: [v10 3/6] mm, oom: cgroup-aware OOM killer

From: Roman Gushchin <guro@fb.com>
To: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	linux-mm@kvack.org, Michal Hocko <mhocko@kernel.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	Andrew Morton <akpm@linux-foundation.org>,
	Tejun Heo <tj@kernel.org>,
	kernel-team@fb.com, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [v10 3/6] mm, oom: cgroup-aware OOM killer
Date: Thu, 5 Oct 2017 11:44:29 +0100
Message-ID: <20171005104429.GB12982@castle.dhcp.TheFacebook.com>
In-Reply-To: <alpine.DEB.2.10.1710050123180.20389@chino.kir.corp.google.com>

On Thu, Oct 05, 2017 at 01:40:09AM -0700, David Rientjes wrote:
> On Wed, 4 Oct 2017, Johannes Weiner wrote:
> 
> > > By only considering leaf memcgs, does this penalize users if their memcg 
> > > becomes oc->chosen_memcg purely because it has aggregated all of its 
> > > processes to be members of that memcg, which would otherwise be the 
> > > standard behavior?
> > > 
> > > What prevents me from spreading my memcg with N processes attached over N 
> > > child memcgs instead so that memcg_oom_badness() becomes very small for 
> > > each child memcg specifically to avoid being oom killed?
> > 
> > It's no different from forking out multiple mm to avoid being the
> > biggest process.
> >

Hi, David!

> 
> It is, because it can quite clearly be a DoS, and was prevented with 
> Roman's earlier design of iterating usage up the hierarchy and comparing 
> siblings based on that criteria.  I know exactly why he chose that 
> implementation detail early on, and it was to prevent cases such as this 
> and to not let userspace hide from the oom killer.
> 
> > It's up to the parent to enforce limits on that group and prevent you
> > from being able to cause global OOM in the first place, in particular
> > if you delegate to untrusted and potentially malicious users.
> > 
> 
> Let's resolve that global oom is a real condition and getting into that 
> situation is not a userspace problem.  It's the result of overcommitting 
> the system, and is used in the enterprise to address business goals.  If 
> the above is true, and it's up to memcg to prevent global oom in the first 
> place, then this entire patchset is absolutely pointless.  Limit userspace 
> to 95% of memory and when usage is approaching that limit, let userspace 
> attached to the root memcg iterate the hierarchy itself and kill from the 
> largest consumer.
> 
> This patchset exists because overcommit is real, exactly the same as 
> overcommit within memcg hierarchies is real.  99% of the time we don't run 
> into global oom because people aren't using their limits so it just works 
> out.  1% of the time we run into global oom and we need a decision to be 
> made for forward progress.  Using Michal's earlier example of admins and 
> students, a student can easily use all of his limit and also, with v10 of 
> this patchset, 99% of the time avoid being oom killed just by forking N 
> processes over N cgroups.  It's going to oom kill an admin every single 
> time.

Overcommit is real, but configuring the system so that system-wide OOM
happens often is a strange idea. As we all know, a system can barely work
adequately under a global memory shortage: network packets are dropped,
latency is bad, weird kernel issues show up periodically, etc.
I do not see why you can't overcommit on deeper layers of the cgroup
hierarchy and so avoid system-wide OOM.
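
To illustrate overcommitting below the top level, here is a small sketch.
All names and numbers (SYSTEM_RAM_GIB, the workload groups, their limits)
are made up for the example. It only does arithmetic on configured limits
and ignores kernel memory and anything else not charged to a memcg, so it
is an illustration, not a guarantee against global OOM.

```python
# Hypothetical machine and limits, in GiB; none of these values come from a real setup.
SYSTEM_RAM_GIB = 64

# Top-level cgroups: their limits together fit into RAM.
top_level_limits = {"workload_a": 40, "workload_b": 20}

# Children of workload_a: their limits together exceed the parent's limit.
workload_a_children = {"job1": 25, "job2": 25, "job3": 25}


def overcommit_ratio(capacity, limits):
    """Return the sum of the limits divided by the capacity they share."""
    return sum(limits.values()) / capacity


global_ratio = overcommit_ratio(SYSTEM_RAM_GIB, top_level_limits)
nested_ratio = overcommit_ratio(top_level_limits["workload_a"], workload_a_children)

# A ratio above 1.0 means the limits are overcommitted at that level.
print(f"top level:  {global_ratio:.2f}")
print(f"workload_a: {nested_ratio:.2f}")
```

In this setup the jobs under workload_a can run out of memory against the
parent's 40 GiB limit, which is handled as a memcg OOM inside that subtree,
while the top-level limits stay within RAM.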

> 
> I know exactly why earlier versions of this patchset iterated that usage 
> up the tree so you would pick from students, pick from this troublemaking 
> student, and then oom kill from his hierarchy.  Roman has made that point 
> himself.  My suggestion was to add userspace influence to it so that 
> enterprise users and users with business goals can actually define that we 
> really do want 80% of memory to be used by this process or this hierarchy, 
> it's in our best interest.

I'll repeat myself: I believe that there is a range of possible policies,
from completely flat (what Johannes suggested a few weeks ago) to very
hierarchical (as in v8), each with its pros and cons.
(Michal did provide a clear example of bad behavior of the hierarchical approach.)

I think that v10 is a good middle point, and it's good because it doesn't
prevent further development. For example, you could introduce a third state
of the oom_group knob, meaning "evaluate as a whole, but do not kill all".
And this is what would solve your particular case, right?
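
The two ends of that range, and the "evaluate as a whole" idea, can be
sketched with a toy model. This is not the kernel code and not an exact
model of v10: the Memcg class, the eval_as_whole flag, the usage numbers
and both pick_* functions are assumptions made up for illustration, and
badness here is simply usage in MiB.

```python
from dataclasses import dataclass, field


@dataclass
class Memcg:
    """Toy memory cgroup; not the kernel's struct mem_cgroup."""
    name: str
    own_usage: int = 0              # MiB charged to tasks attached directly
    eval_as_whole: bool = False     # hypothetical "evaluate as a whole" state
    children: list = field(default_factory=list)

    def total_usage(self):
        """Usage of this group plus everything below it."""
        return self.own_usage + sum(c.total_usage() for c in self.children)


def candidates(cg):
    """Yield the units compared system-wide: leaves and whole-evaluated groups."""
    if not cg.children or cg.eval_as_whole:
        yield cg
    else:
        for child in cg.children:
            yield from candidates(child)


def pick_flat(root):
    """Flat policy: the largest candidate anywhere in the tree is the victim."""
    return max(candidates(root), key=Memcg.total_usage)


def pick_hierarchical(root):
    """Hierarchical policy: at each level descend into the largest subtree."""
    cg = root
    while cg.children:
        cg = max(cg.children, key=Memcg.total_usage)
    return cg


def build(student_split, as_whole=False):
    """Admin uses 300 MiB; the student uses 800 MiB, optionally over 8 children."""
    admin = Memcg("admin", own_usage=300)
    if student_split:
        kids = [Memcg(f"student/{i}", own_usage=100) for i in range(8)]
        student = Memcg("student", eval_as_whole=as_whole, children=kids)
    else:
        student = Memcg("student", own_usage=800)
    return Memcg("root", children=[admin, student])


print(pick_flat(build(student_split=False)).name)
print(pick_flat(build(student_split=True)).name)
print(pick_flat(build(student_split=True, as_whole=True)).name)
print(pick_hierarchical(build(student_split=True)).name)
```

In this model the flat policy picks the student while he stays in one
group, and picks the admin once the student splits into eight children.
Marking the student's group as evaluated as a whole, or descending the
hierarchy by subtree usage, brings the selection back to the student.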

> 
> Earlier iterations of this patchset did this, and did it correctly.  
> Userspace influence over the decisionmaking makes it a very powerful 
> combination because you _can_ specify what your goals are or choose to 
> leave the priorities as default so you can compare based solely on usage.  
> It was a beautiful solution to the problem.

I did, but then I agreed with Tejun's point that the proposed semantics
would limit us in the future. Really, oom_priorities do not guarantee the
killing order (remember NUMA issues, as well as oom_score_adj), so in
practice the order can even be reversed (e.g. a low-prio cgroup killed
before a high-prio one). We shouldn't make users rely on these priorities
as anything more than hints to the kernel. But the way they were defined
doesn't allow us to change anything later; it's too rigid.

Thanks!
