From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 85419C4707B
	for <linux-mm@archiver.kernel.org>; Thu, 11 Jan 2024 19:38:19 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 19E9F6B00B2; Thu, 11 Jan 2024 14:38:19 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 1284F6B00B4; Thu, 11 Jan 2024 14:38:19 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id F0AAB6B00B5; Thu, 11 Jan 2024 14:38:18 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11])
	by kanga.kvack.org (Postfix) with ESMTP id DB51D6B00B2
	for <linux-mm@kvack.org>; Thu, 11 Jan 2024 14:38:18 -0500 (EST)
Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay01.hostedemail.com (Postfix) with ESMTP id B45841C1320
	for <linux-mm@kvack.org>; Thu, 11 Jan 2024 19:38:18 +0000 (UTC)
X-FDA: 81668041476.05.61CCD11
Received: from out-180.mta0.migadu.com (out-180.mta0.migadu.com [91.218.175.180])
	by imf23.hostedemail.com (Postfix) with ESMTP id C22D114001E
	for <linux-mm@kvack.org>; Thu, 11 Jan 2024 19:38:16 +0000 (UTC)
Authentication-Results: imf23.hostedemail.com;
	dkim=pass header.d=linux.dev header.s=key1 header.b=aKqrocxJ;
	spf=pass (imf23.hostedemail.com: domain of roman.gushchin@linux.dev designates 91.218.175.180 as permitted sender) smtp.mailfrom=roman.gushchin@linux.dev;
	dmarc=pass (policy=none) header.from=linux.dev
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1705001897; a=rsa-sha256;
	cv=none;
	b=KqiofOYjexk1EJ+NPZVZSABSXKQQpxOflx2bYGxfrbljGzrydPWKYLfbIVkYT9ESEd/2vF
	4WFLA9mNrRvX4lqmtFUHksGA6zkILdP8lUmn2sIx+uhaLTmD7J+Uwe+/iR+0AN4+K/L5/+
	0AQLKVTxMEwjaTjT2PwOFujPJl9wk/U=
ARC-Authentication-Results: i=1;
	imf23.hostedemail.com;
	dkim=pass header.d=linux.dev header.s=key1 header.b=aKqrocxJ;
	spf=pass (imf23.hostedemail.com: domain of roman.gushchin@linux.dev designates 91.218.175.180 as permitted sender) smtp.mailfrom=roman.gushchin@linux.dev;
	dmarc=pass (policy=none) header.from=linux.dev
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1705001897;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=AkzHYxZFi5ouVgT0zy5KypNsq0w5WgyR676jLiAHwes=;
	b=mOZc6lnS/kxRXlHj179VhL0HvciMcwAb2P/XU3iTetZ8rOL1yK1CComAY+lTzt7Jxzo/nV
	3uM0xCnxtuKGedRxtrCcI3/Y81rbvojMn2iqRYDUB4SDVObLnSHLlU/iuQuFCWirY3EQ5Q
	+tftImgeh7Zn61zEXRkcbJRgEgsVyig=
Date: Thu, 11 Jan 2024 11:38:09 -0800
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1;
	t=1705001894;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=AkzHYxZFi5ouVgT0zy5KypNsq0w5WgyR676jLiAHwes=;
	b=aKqrocxJT78AHVgmOO5soM2yx71m9o2ggWeLQjfztX924iW2TgDw28QtJUVRZYt8QmtNG6
	87sj4n+fiAcuKSoXUZ860OXIsqOwxI0lmA2YIG6B3SjjCE6lIVUror/SA3V5ivkRQQ9QJC
	/HDzAPXkD3DwzkMT/LiYgwtKl8anBbA=
X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers.
From: Roman Gushchin <roman.gushchin@linux.dev>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@kernel.org>,
	Shakeel Butt <shakeelb@google.com>,
	Muchun Song <muchun.song@linux.dev>, Tejun Heo <tj@kernel.org>,
	Dan Schatzberg <schatzberg.dan@gmail.com>, cgroups@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: memcontrol: don't throttle dying tasks on memory.high
Message-ID: <ZaBDoRr90kPNMrv7@P9FQF9L96D>
References: <20240111132902.389862-1-hannes@cmpxchg.org>
 <ZaAsbwFP-ttYNwIe@P9FQF9L96D>
 <20240111192807.GA424308@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20240111192807.GA424308@cmpxchg.org>
X-Migadu-Flow: FLOW_OUT
X-Rspamd-Server: rspam08
X-Rspamd-Queue-Id: C22D114001E
X-Stat-Signature: 61i61g9kjot4e8wmjif7edj97qfqe86s
X-Rspam-User: 
X-HE-Tag: 1705001896-855324
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Thu, Jan 11, 2024 at 02:28:07PM -0500, Johannes Weiner wrote:
> On Thu, Jan 11, 2024 at 09:59:11AM -0800, Roman Gushchin wrote:
> > On Thu, Jan 11, 2024 at 08:29:02AM -0500, Johannes Weiner wrote:
> > > While investigating hosts with high cgroup memory pressures, Tejun
> > > found culprit zombie tasks that were holding on to a lot of
> > > memory, had SIGKILL pending, but were stuck in memory.high reclaim.
> > > 
> > > In the past, we used to always force-charge allocations from tasks
> > > that were exiting in order to accelerate them dying and freeing up
> > > their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg:
> > > prohibit unconditional exceeding the limit of dying tasks"); it noted
> > > that this can cause (userspace inducible) containment failures, so it
> > > added a mandatory reclaim and OOM kill cycle before forcing charges.
> > > At the time, memory.high enforcement was handled in the userspace
> > > return path, which isn't reached by dying tasks, and so memory.high
> > > was still never enforced by dying tasks.
> > > 
> > > When c9afe31ec443 ("memcg: synchronously enforce memory.high for large
> > > overcharges") added synchronous reclaim for memory.high, it added
> > > unconditional memory.high enforcement for dying tasks as well. The
> > > call stack shows that this path is where the zombie is stuck.
> > > 
> > > We need to accelerate dying tasks getting past memory.high, but we
> > > cannot do it quite the same way as we do for memory.max: memory.max is
> > > enforced strictly, and tasks aren't allowed to move past it without
> > > FIRST reclaiming and OOM killing if necessary. This ensures very small
> > > levels of excess. With memory.high, though, enforcement happens lazily
> > > after the charge, and OOM killing is never triggered. A lot of
> > > concurrent threads could have pushed, or could actively be pushing,
> > > the cgroup into excess. The dying task will enter reclaim on every
> > > allocation attempt, with little hope of restoring balance.
> > > 
> > > To fix this, skip synchronous memory.high enforcement on dying tasks
> > > altogether again. Update memory.high path documentation while at it.
> > 
> > It makes total sense to me.
> > Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
> 
> Thanks
> 
> > However, if tasks can get stuck for a long time in the "high reclaim" state,
> > shouldn't we also handle the case when tasks are being killed during the
> > reclaim? E.g. something like this (completely untested):
> 
> Yes, that's probably a good idea.
> 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c4c422c81f93..9f971fc6aae8 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2465,6 +2465,9 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
> >                     READ_ONCE(memcg->memory.high))
> >                         continue;
> > 
> > +               if (task_is_dying())
> > +                       break;
> > +
> >                 memcg_memory_event(memcg, MEMCG_HIGH);
> > 
> >                 psi_memstall_enter(&pflags);
> 
> I think we can skip this one. The loop is for traversing from the
> charging cgroup to the one that has memory.high set and breached, and
> then reclaim it. It's not expected to run multiple reclaims.

Yes, the next one is probably enough (it's hard for me to say without knowing
exactly where those dying processes are getting stuck; I guess you have the
actual stack traces).
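
To illustrate the point about the loop, here is a rough userspace model of a
walk from the charging cgroup up through its ancestors that skips every level
not over its limit and reclaims only where the limit is breached. This is
not the kernel code: struct sim_cgroup, sim_reclaim_high() and the flat
per-group usage counters are hypothetical simplifications made up for this
sketch.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a cgroup in the hierarchy. */
struct sim_cgroup {
	struct sim_cgroup *parent;	/* NULL for the root */
	unsigned long usage;		/* pages currently charged (assumed flat) */
	unsigned long high;		/* models memory.high, in pages */
};

/*
 * Walk from the charging group towards the root. Groups at or below
 * their limit are skipped; a group over its limit is "reclaimed" by
 * trimming its usage back to the limit. Returns how many groups
 * needed reclaim and stores the number of levels visited in *visited.
 */
static int sim_reclaim_high(struct sim_cgroup *cg, unsigned long *visited)
{
	int reclaimed_groups = 0;

	*visited = 0;
	for (; cg != NULL; cg = cg->parent) {
		(*visited)++;

		if (cg->usage <= cg->high)
			continue;	/* not breached at this level */

		cg->usage = cg->high;	/* stand-in for an actual reclaim pass */
		reclaimed_groups++;
	}
	return reclaimed_groups;
}
```

With a three-level chain where only the middle group is over its limit, the
walk visits all three levels but reclaims just once, which is why a dying
check inside this loop would rarely get the chance to cut anything short.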

> 
> > @@ -2645,6 +2648,9 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
> >         current->memcg_nr_pages_over_high = 0;
> > 
> >  retry_reclaim:
> > +       if (task_is_dying())
> > +               return;
> > +
> >         /*
> >          * The allocating task should reclaim at least the batch size, but for
> >          * subsequent retries we only want to do what's necessary to prevent oom
> 
> Yeah this is the better place for this check.
> 
> How about this?

Looks really good to me!

I actually thought about moving the check into mem_cgroup_handle_over_high(),
and you already did it in this version.
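
For anyone following along, a minimal userspace sketch of why the check
belongs at the retry_reclaim label: the task re-tests its dying state before
every reclaim pass, so a kill that arrives in the middle of the retries ends
the loop at the next iteration. Everything here (struct sim_task,
sim_reclaim(), sim_handle_over_high(), the kill_after knob) is made up for
illustration and only loosely mirrors the shape of the hunk quoted above; it
is not the kernel implementation.

```c
#include <stdbool.h>

/* Hypothetical model of the state involved; not kernel structures. */
struct sim_task {
	bool dying;		/* models what task_is_dying() would report */
	unsigned long usage;	/* pages charged to the cgroup */
	unsigned long high;	/* models memory.high, in pages */
	int kill_after;		/* mark the task dying after this many passes; -1 = never */
};

/* One simulated reclaim pass: frees at most `batch` pages of the excess. */
static unsigned long sim_reclaim(struct sim_task *t, unsigned long batch)
{
	unsigned long over = t->usage > t->high ? t->usage - t->high : 0;
	unsigned long freed = over < batch ? over : batch;

	t->usage -= freed;
	return freed;
}

/* Returns the number of reclaim passes the task performed. */
static int sim_handle_over_high(struct sim_task *t, unsigned long batch,
				int max_retries)
{
	int passes = 0;

retry_reclaim:
	if (t->dying)
		return passes;	/* let the dying task exit and free its memory */

	if (t->usage <= t->high)
		return passes;	/* back under the limit, nothing left to do */

	sim_reclaim(t, batch);
	passes++;

	/* Simulate a SIGKILL arriving while the task is busy reclaiming. */
	if (t->kill_after >= 0 && passes >= t->kill_after)
		t->dying = true;

	if (passes < max_retries)
		goto retry_reclaim;

	return passes;
}
```

A task that is already dying on entry does no reclaim at all, and a task
killed after its second pass stops there even though it is still far over
the limit.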

Thanks!