From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=AeVH=PN=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id C24ECC433EF
	for <linux-mm@archiver.kernel.org>; Mon, 25 Oct 2021 09:36:44 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 7EA9660EFF
	for <linux-mm@archiver.kernel.org>; Mon, 25 Oct 2021 09:36:44 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 7EA9660EFF
Authentication-Results: mail.kernel.org; dmarc=fail (p=quarantine dis=none) header.from=suse.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org
Received: by kanga.kvack.org (Postfix)
	id ECD8F6B0073; Mon, 25 Oct 2021 05:36:43 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id E7D0A6B0074; Mon, 25 Oct 2021 05:36:43 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id D6BB96B0075; Mon, 25 Oct 2021 05:36:43 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0082.hostedemail.com [216.40.44.82])
	by kanga.kvack.org (Postfix) with ESMTP id C7BF16B0073
	for <linux-mm@kvack.org>; Mon, 25 Oct 2021 05:36:43 -0400 (EDT)
Received: from smtpin03.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay01.hostedemail.com (Postfix) with ESMTP id 8CA6318081858
	for <linux-mm@kvack.org>; Mon, 25 Oct 2021 09:36:43 +0000 (UTC)
X-FDA: 78734455086.03.8EBB62F
Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28])
	by imf12.hostedemail.com (Postfix) with ESMTP id 19D54100009D
	for <linux-mm@kvack.org>; Mon, 25 Oct 2021 09:36:42 +0000 (UTC)
Received: from relay2.suse.de (relay2.suse.de [149.44.160.134])
	by smtp-out1.suse.de (Postfix) with ESMTP id 1C295218ED;
	Mon, 25 Oct 2021 09:36:42 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1;
	t=1635154602; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc:
	 mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=2yh1aPbBEgmsdiCaRPtj8WUBQDnHjn4se86O/c7XP04=;
	b=rfexHc8fLyz11QO8h/HS7HW23V8dvtW+GFtOPv9Zqmnq759lHqiVr6/ixckPx8A+d8MwAA
	xfHWutRZ8ee9IP8g5bIrp1hQaUEmKOXraW5pgobnqMFHSc5Wpn18cqAX0XUCo/vTski6Wq
	bK10kT90wH1G+wi6v4ISz/1S10y8EJQ=
Received: from suse.cz (unknown [10.100.201.86])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by relay2.suse.de (Postfix) with ESMTPS id E0876A3B81;
	Mon, 25 Oct 2021 09:36:41 +0000 (UTC)
Date: Mon, 25 Oct 2021 11:36:41 +0200
From: Michal Hocko <mhocko@suse.com>
To: Vasily Averin <vvs@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Roman Gushchin <guro@fb.com>, Uladzislau Rezki <urezki@gmail.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Shakeel Butt <shakeelb@google.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, kernel@openvz.org
Subject: Re: [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the
 limit of dying tasks
Message-ID: <YXZ6qaMJBomVfV8O@dhcp22.suse.cz>
References: <YXJ/63kIpTq8AOlD@dhcp22.suse.cz>
 <cover.1634994605.git.vvs@virtuozzo.com>
 <8f5cebbb-06da-4902-91f0-6566fc4b4203@virtuozzo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <8f5cebbb-06da-4902-91f0-6566fc4b4203@virtuozzo.com>
X-Stat-Signature: awxhiwtumj3eh8bd7o3ffhurfx613srd
Authentication-Results: imf12.hostedemail.com;
	dkim=pass header.d=suse.com header.s=susede1 header.b=rfexHc8f;
	spf=pass (imf12.hostedemail.com: domain of mhocko@suse.com designates 195.135.220.28 as permitted sender) smtp.mailfrom=mhocko@suse.com;
	dmarc=pass (policy=quarantine) header.from=suse.com
X-Rspamd-Server: rspam04
X-Rspamd-Queue-Id: 19D54100009D
X-HE-Tag: 1635154602-971918
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Sat 23-10-21 16:20:51, Vasily Averin wrote:
> Memory cgroup charging allows killed or exiting tasks to exceed the hard
> limit. It is assumed that the amount of the memory charged by those
> tasks is bound and most of the memory will get released while the task
> is exiting. This is resembling a heuristic for the global OOM situation
> when tasks get access to memory reserves. There is no global memory
> shortage at the memcg level so the memcg heuristic is more relieved.
> 
> The above assumption is overly optimistic though. E.g. vmalloc can scale
> to really large requests and the heuristic would allow that. We used to
> have an early break in the vmalloc allocator for killed tasks but this
> has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
> the current task is killed""). There are likely other similar code paths
> which do not check for fatal signals in an allocation&charge loop.
> Also there are some kernel objects charged to a memcg which are not
> bound to a process life time.
> 
> It has been observed that it is not really hard to trigger these
> bypasses and cause global OOM situation.
> 
> One potential way to address these runaways would be to limit the amount
> of excess (similar to the global OOM with limited oom reserves). This is
> certainly possible but it is not really clear how much of an excess is
> desirable and still protects from global OOMs as that would have to
> consider the overall memcg configuration.
> 
> This patch is addressing the problem by removing the heuristic
> altogether. Bypass is only allowed for requests which either cannot fail
> or where the failure is not desirable while excess should be still
> limited (e.g. atomic requests). Implementation wise a killed or dying
> task fails to charge if it has passed the OOM killer stage. That should
> give all forms of reclaim chance to restore the limit before the
> failure (ENOMEM) and tell the caller to back off.
> 
> In addition, this patch renames should_force_charge() helper
> to task_is_dying() because now its use is not associated witch forced
> charging.
> 
> This patch depends on pagefault_out_of_memory() to not trigger
> out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM
> and cause a global OOM killer.
> 
> Cc: stable@vger.kernel.org

My view on stable backport is similar to the previous patch. If we want
to have it there then let's wait for some time to see whether there are
any fallouts as this patch depends on the PF_OOM change.

> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/memcontrol.c | 27 ++++++++-------------------
>  1 file changed, 8 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6da5020a8656..87e41c3cac10 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -239,7 +239,7 @@ enum res_type {
>  	     iter != NULL;				\
>  	     iter = mem_cgroup_iter(NULL, iter, NULL))
>  
> -static inline bool should_force_charge(void)
> +static inline bool task_is_dying(void)
>  {
>  	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
>  		(current->flags & PF_EXITING);
> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
> -	ret = should_force_charge() || out_of_memory(&oc);
> +	ret = task_is_dying() || out_of_memory(&oc);
>  
>  unlock:
>  	mutex_unlock(&oom_lock);
> @@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	struct page_counter *counter;
>  	enum oom_status oom_status;
>  	unsigned long nr_reclaimed;
> +	bool passed_oom = false;
>  	bool may_swap = true;
>  	bool drained = false;
>  	unsigned long pflags;
> @@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (gfp_mask & __GFP_ATOMIC)
>  		goto force;
>  
> -	/*
> -	 * Unlike in global OOM situations, memcg is not in a physical
> -	 * memory shortage.  Allow dying and OOM-killed tasks to
> -	 * bypass the last charges so that they can exit quickly and
> -	 * free their memory.
> -	 */
> -	if (unlikely(should_force_charge()))
> -		goto force;
> -
>  	/*
>  	 * Prevent unbounded recursion when reclaim operations need to
>  	 * allocate memory. This might exceed the limits temporarily,
> @@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (gfp_mask & __GFP_RETRY_MAYFAIL)
>  		goto nomem;
>  
> -	if (fatal_signal_pending(current))
> -		goto force;
> +	/* Avoid endless loop for tasks bypassed by the oom killer */
> +	if (passed_oom && task_is_dying())
> +		goto nomem;
>  
>  	/*
>  	 * keep retrying as long as the memcg oom killer is able to make
> @@ -2640,14 +2633,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 */
>  	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
>  		       get_order(nr_pages * PAGE_SIZE));
> -	switch (oom_status) {
> -	case OOM_SUCCESS:
> +	if (oom_status == OOM_SUCCESS) {
> +		passed_oom = true;
>  		nr_retries = MAX_RECLAIM_RETRIES;
>  		goto retry;
> -	case OOM_FAILED:
> -		goto force;
> -	default:
> -		goto nomem;
>  	}
>  nomem:
>  	if (!(gfp_mask & __GFP_NOFAIL))
> -- 
> 2.32.0

-- 
Michal Hocko
SUSE Labs