Date: Mon, 26 Jun 2017 13:55:31 +0200
From: Michal Hocko
Subject: Re: [RFC PATCH v2 0/7] cgroup-aware OOM killer
Message-ID: <20170626115531.GI11534@dhcp22.suse.cz>
References: <1496342115-3974-1-git-send-email-guro@fb.com>
 <20170609163022.GA9332@dhcp22.suse.cz>
 <20170622171003.GB30035@castle>
 <20170623134323.GB5314@dhcp22.suse.cz>
 <20170623183946.GA24014@castle>
In-Reply-To: <20170623183946.GA24014@castle>
To: Roman Gushchin
Cc: linux-mm@kvack.org

On Fri 23-06-17 19:39:46, Roman Gushchin wrote:
> On Fri, Jun 23, 2017 at 03:43:24PM +0200, Michal Hocko wrote:
> > On Thu 22-06-17 18:10:03, Roman Gushchin wrote:
> > > Hi, Michal!
> > >
> > > Thank you very much for the review. I've tried to address your
> > > comments in v3 (sent yesterday), so that is why it took some time to reply.
> >
> > I will try to look at it sometimes next week hopefully
>
> Thanks!
>
> > > > - You seem to completely ignore per task oom_score_adj and override it
> > > >   by the memcg value. This makes some sense but it can lead to an
> > > >   unexpected behavior when somebody relies on the original behavior.
> > > >   E.g. a workload that would corrupt data when killed unexpectedly and
> > > >   so it is protected by OOM_SCORE_ADJ_MIN. Now this assumption will
> > > >   break when running inside a container.
> > > >   I do not have a good answer
> > > >   what is the desirable behavior and maybe there is no universal answer.
> > > >   Maybe you just do not to kill those tasks? But then you have to be
> > > >   careful when selecting a memcg victim. Hairy...
> > >
> > > I do not ignore it completely, but it matters only for root cgroup tasks
> > > and inside a cgroup when oom_kill_all_tasks is off.
> > >
> > > I believe, that cgroup v2 requirement is a good enough. I mean you can't
> > > move from v1 to v2 without changing cgroup settings, and if we will provide
> > > per-cgroup oom_score_adj, it will be enough to reproduce the old behavior.
> > >
> > > Also, if you think it's necessary, I can add a sysctl to turn the cgroup-aware
> > > oom killer off completely and provide compatibility mode.
> > > We can't really save the old system-wide behavior of per-process oom_score_adj,
> > > it makes no sense in the containerized environment.
> >
> > So what you are going to do with those applications that simply cannot
> > be killed and which set OOM_SCORE_ADJ_MIN explicitly. Are they
> > unsupported? How does a user find out? One way around this could be to
> > simply to not kill tasks with OOM_SCORE_ADJ_MIN.
>
> They won't be killed by cgroup OOM, but under some circumstances can be killed
> by the global OOM (e.g. there are no other tasks in the selected cgroup,
> cgroup v2 is used, and per-cgroup oom score adjustment is not set).

Hmm, mem_cgroup_select_oom_victim will happily select a memcg which
contains OOM_SCORE_ADJ_MIN tasks because it ignores the per-task score adj.
So the memcg OOM killer can kill those tasks AFAICS. But that is not all
that important. Because...

> I believe, that per-process oom_score_adj should not play any role outside
> of the containing cgroup, it's violation of isolation.
>
> Right now if tasks with oom_score_adj=-1000 eating all memory in a cgroup,
> they will be looping forever, OOM killer can't fix this.

...
Yes, and that is the price we have to pay for the hard requirement that
the oom killer never kills an OOM_SCORE_ADJ_MIN task. It is hard to change
that without breaking existing userspace which relies on this
configuration for protection from an unexpected SIGKILL.
-- 
Michal Hocko
SUSE Labs