From: Michal Hocko <mhocko@kernel.org>
To: Roman Gushchin <guro@fb.com>
Cc: linux-mm@kvack.org, Vladimir Davydov <vdavydov.dev@gmail.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
David Rientjes <rientjes@google.com>, Tejun Heo <tj@kernel.org>,
kernel-team@fb.com, cgroups@vger.kernel.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [v6 2/4] mm, oom: cgroup-aware OOM killer
Date: Thu, 24 Aug 2017 13:47:06 +0200
Message-ID: <20170824114706.GG5943@dhcp22.suse.cz>
In-Reply-To: <20170823165201.24086-3-guro@fb.com>

This doesn't apply on top of mmotm cleanly. You are missing
http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org

On Wed 23-08-17 17:51:59, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
>
> This behavior doesn't suit systems with many running
> containers well:
>
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
>
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
>
> 3) Per-process oom_score_adj affects global OOM, so it's a breach
> in the isolation.

Please explain more. I guess you mean that an untrusted memcg could hide
itself from the global OOM killer by reducing its oom scores? Well, you
need CAP_SYS_RESOURCE to reduce the current oom_score{_adj}, as David has
already pointed out. I also agree that we absolutely must not kill an
oom disabled task. I am pretty sure somebody is using OOM_SCORE_ADJ_MIN
as a protection from an untrusted SIGKILL and inconsistent state as a
result. Those applications simply shouldn't behave differently in the
global and container contexts.

If nothing else we have to skip OOM_SCORE_ADJ_MIN tasks during the kill.

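To illustrate the last point, here is a minimal userspace sketch of a
kill-all pass that skips OOM-disabled tasks. It is not kernel code and not
taken from the patch: struct fake_task and kill_group_tasks() are made up
for this example, and "killing" is just setting a flag. Only the value of
OOM_SCORE_ADJ_MIN (-1000) matches the kernel's uapi header.

```c
#include <assert.h>
#include <stddef.h>

/* Same value as OOM_SCORE_ADJ_MIN in include/uapi/linux/oom.h. */
#define OOM_SCORE_ADJ_MIN (-1000)

/* Hypothetical stand-in for a task; this is not struct task_struct. */
struct fake_task {
	int pid;
	int oom_score_adj;
	int killed;
};

/*
 * Mark every task of the group as killed, except those which opted out
 * via OOM_SCORE_ADJ_MIN. Returns the number of tasks marked.
 */
static size_t kill_group_tasks(struct fake_task *tasks, size_t n)
{
	size_t i, nr_killed = 0;

	assert(tasks != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (tasks[i].oom_score_adj == OOM_SCORE_ADJ_MIN)
			continue;	/* OOM-disabled: never kill */
		tasks[i].killed = 1;
		nr_killed++;
	}
	return nr_killed;
}
```

With this rule a task that set OOM_SCORE_ADJ_MIN survives whether the OOM
is global or confined to its container, which is the consistency asked for
above.
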
> To address these issues, cgroup-aware OOM killer is introduced.
>
> Under OOM conditions, it tries to find the biggest memory consumer,
> and free memory by killing corresponding task(s). The difference
> from the "traditional" OOM killer is that it can treat memory cgroups
> as memory consumers as well as single processes.
>
> By default, it will look for the biggest leaf cgroup, and kill
> the largest task inside.

Why? I believe that the semantic should be as simple as: kill the largest
oom killable entity. The entity is either a process or a memcg which
is marked that way. Why should we mix things and select a memcg only to
kill a process inside it? More on that below.

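A rough userspace sketch of that semantic, assuming every candidate
already carries a score in the same units. struct oom_entity and
select_victim() are hypothetical names used only here; nothing in the
patch is structured this way.

```c
#include <assert.h>
#include <stddef.h>

/* An OOM-killable entity is a single task or a memcg marked kill-all. */
enum entity_kind { ENTITY_TASK, ENTITY_MEMCG };

/* Hypothetical candidate record with a precomputed, comparable score. */
struct oom_entity {
	enum entity_kind kind;
	long points;
};

/*
 * Return the entity with the highest score, whatever its kind.
 * On a tie the first candidate wins; returns NULL when n == 0.
 */
static const struct oom_entity *select_victim(const struct oom_entity *e,
					      size_t n)
{
	const struct oom_entity *victim = NULL;
	size_t i;

	assert(e != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (!victim || e[i].points > victim->points)
			victim = &e[i];
	}
	return victim;
}
```

The selection never looks inside a memcg to pick a task: whatever is
selected is what gets killed, in full.
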
> But a user can change this behavior by enabling the per-cgroup
> oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> the whole cgroup as an indivisible memory consumer. If it is
> selected as an OOM victim, all tasks belonging to it will be killed.
>
> Tasks in the root cgroup are treated as independent memory consumers,
> and are compared with other memory consumers (e.g. leaf cgroups).
> The root cgroup doesn't support the oom_kill_all_tasks feature.

If anything, you wouldn't have to treat the root memcg specially. It
will be like any other memcg which doesn't have oom_kill_all_tasks...

[...]
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +			      const nodemask_t *nodemask)
> +{
> +	long points = 0;
> +	int nid;
> +	pg_data_t *pgdat;
> +
> +	for_each_node_state(nid, N_MEMORY) {
> +		if (nodemask && !node_isset(nid, *nodemask))
> +			continue;
> +
> +		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> +				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> +		pgdat = NODE_DATA(nid);
> +		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> +					    NR_SLAB_UNRECLAIMABLE);
> +	}
> +
> +	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> +		(PAGE_SIZE / 1024);
> +	points += memcg_page_state(memcg, MEMCG_SOCK);
> +	points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> +	return points;

I guess I have asked already and we haven't reached any consensus. I do
not like how you treat memcgs and tasks differently. Why can't a memcg's
score be the sum of the scores of all its tasks? How do you want to
compare a memcg score with a task score? This just smells like an outcome
of the weird semantic I have mentioned above, where you try to select the
largest group.

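As a sketch of the alternative asked about here, a memcg score could be
built from the per-task scores so that both are in the same units. This is
a userspace illustration under that assumption, not code from the patch;
struct fake_task and memcg_points_from_tasks() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task with a precomputed badness score. */
struct fake_task {
	long points;
};

/*
 * Score a memcg as the sum of the scores of its tasks, so the result
 * can be compared directly with the score of a single task.
 */
static long memcg_points_from_tasks(const struct fake_task *tasks, size_t n)
{
	long sum = 0;
	size_t i;

	assert(tasks != NULL || n == 0);

	for (i = 0; i < n; i++)
		sum += tasks[i].points;
	return sum;
}
```

For example, a memcg with tasks scoring 10, 20 and 30 gets 60 and would be
preferred over a standalone task scoring 50, with both numbers coming from
the same per-task metric.
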
This is a rather fundamental concern and I believe we should find a
consensus on it before going any further. I believe that users shouldn't
see any difference in the OOM behavior when memcg v2 is used and there
is no kill-all memcg. If there are such memcgs then we should treat only
those specially. But you might have really strong use cases which haven't
been presented, or I've missed their importance.

--
Michal Hocko
SUSE Labs