From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 14 Jan 2025 11:09:55 -0500
From: Johannes Weiner
To: Michal Hocko
Cc: Yosry Ahmed, Rik van Riel, Balbir Singh, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, cgroups@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kernel-team@meta.com, Nhat Pham
Subject: Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
Message-ID: <20250114160955.GA1115056@cmpxchg.org>
References: <20241212115754.38f798b3@fangorn>
 <20241212183012.GB1026@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Hi,

On Mon, Dec 16, 2024 at 04:39:12PM +0100, Michal Hocko wrote:
> On Thu 12-12-24 13:30:12, Johannes Weiner wrote:
> [...]
> > So I'm also inclined to think this needs a reclaim/memcg-side fix.
> > We have a somewhat tumultuous history of policy in that space:
> >
> > commit 7775face207922ea62a4e96b9cd45abfdc7b9840
> > Author: Tetsuo Handa
> > Date:   Tue Mar 5 15:46:47 2019 -0800
> >
> >     memcg: killed threads should not invoke memcg OOM killer
> >
> > allowed dying tasks to simply force all charges and move on. This
> > turned out to be too aggressive; there were instances of exiting,
> > uncontained memcg tasks causing global OOMs. This led to:
> >
> > commit a4ebf1b6ca1e011289677239a2a361fde4a88076
> > Author: Vasily Averin
> > Date:   Fri Nov 5 13:38:09 2021 -0700
> >
> >     memcg: prohibit unconditional exceeding the limit of dying tasks
> >
> > which reverted the bypass rather thoroughly. Now NO dying tasks,
> > *not even OOM victims*, can force charges. I am not sure this is
> > correct, either:
>
> IIRC the reason for going this route was a lack of per-memcg oom
> reserves. Global oom victims are getting some slack because the
> amount of reserves is bounded. This is not the case for memcgs
> though.
>
> > If we return -ENOMEM to an OOM victim in a fault, the fault handler
> > will re-trigger OOM, which will find the existing OOM victim and do
> > nothing, then restart the fault.
>
> IIRC the task will handle the pending SIGKILL if the #PF fails. If
> the charge happens from the exit path then we rely on ENOMEM returned
> from gup as a signal to back off. Do we have any caller that keeps
> retrying on ENOMEM?

We managed to extract a stack trace of the livelocked task:

  obj_cgroup_may_zswap
  zswap_store
  swap_writepage
  shrink_folio_list
  shrink_lruvec
  shrink_node
  do_try_to_free_pages
  try_to_free_mem_cgroup_pages
  charge_memcg
  mem_cgroup_swapin_charge_folio
  __read_swap_cache_async
  swapin_readahead
  do_swap_page
  handle_mm_fault
  do_user_addr_fault
  exc_page_fault
  asm_exc_page_fault
  __get_user
  futex_cleanup
  futex_exit_release
  do_exit
  do_group_exit
  get_signal
  arch_do_signal_or_restart
  exit_to_user_mode_prepare
  syscall_exit_to_user_mode
  do_syscall_64
  entry_SYSCALL_64
  syscall

Both memory.max and memory.zswap.max are hit. I don't see how this
could ever make forward progress - the futex fault will retry until
it succeeds. The only workaround for this state right now is to
manually raise memory.max to let the fault succeed and the exit
complete.
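To make the cycle concrete, here is a minimal userspace sketch of the
livelock (illustrative only - the names and numbers are made up, this
is not kernel code): the exit-path fault charges against a memcg that
is at its limit, the charge is refused, reclaim makes no progress (in
the report above, zswap writeback is refused at memory.zswap.max), and
the fault is simply retried.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for memcg state: usage already at the limit. */
static long usage = 100, limit = 100;

/* Charge refused at the limit; dying tasks get no bypass (a4ebf1b6ca1e). */
static bool try_charge(long pages)
{
	if (usage + pages > limit)
		return false;	/* -ENOMEM */
	usage += pages;
	return true;
}

/* Reclaim frees nothing: writeback is blocked and the reclaimable
 * pages belong to the very task that is spinning here. */
static long reclaim(void) { return 0; }

int main(void)
{
	/* exit path: futex cleanup faults on a user page */
	for (int attempt = 1; attempt <= 5; attempt++) {	/* capped for demo */
		if (try_charge(1)) {
			puts("fault satisfied, exit completes");
			return 0;
		}
		usage -= reclaim();	/* no progress */
		printf("attempt %d: charge fails, fault retried\n", attempt);
	}
	puts("livelocked: no forward progress without raising memory.max");
	return 1;
}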
> > This is a memory deadlock. The page allocator gives OOM victims
> > access to reserves for that reason.
> >
> > Actually, it looks even worse. For some reason we're not triggering
> > OOM from dying tasks:
> >
> > 	ret = task_is_dying() || out_of_memory(&oc);
> >
> > Even though dying tasks are in no way privileged or allowed to exit
> > expediently, why shouldn't they trigger the OOM killer like anybody
> > else trying to allocate memory?
>
> Good question! I suspect this early bail out is based on an
> assumption that a dying task will free up the memory soon so oom
> killer is unnecessary.

Correct. It's not about the kill. The important thing is that at least
one exiting task is getting the extra memory headroom usually afforded
to the OOM victim, to guarantee forward progress in the exit path.

> > As it stands, it seems we have dying tasks getting trapped in an
> > endless fault->reclaim cycle; with no access to the OOM killer and
> > no access to reserves. Presumably this is what's going on here?
>
> As mentioned above this seems really surprising and it would indicate
> that something in the exit path keeps retrying when getting ENOMEM
> from gup or a GFP_ACCOUNT allocation. GFP_NOFAIL requests are allowed
> to over-consume.

I hope the path is clear from the stack trace above.

> > I think we want something like this:
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 53db98d2c4a1..be6b6e72bde5 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1596,11 +1596,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	if (mem_cgroup_margin(memcg) >= (1 << order))
> >  		goto unlock;
> >
> > -	/*
> > -	 * A few threads which were not waiting at mutex_lock_killable() can
> > -	 * fail to bail out. Therefore, check again after holding oom_lock.
> > -	 */
> > -	ret = task_is_dying() || out_of_memory(&oc);
> > +	ret = out_of_memory(&oc);
>
> I am not against this as it would allow to do an async oom_reaper
> memory reclaim in the worst case. This could potentially reintroduce
> the "No victim available" case described by 7775face2079 ("memcg:
> killed threads should not invoke memcg OOM killer") but that seemed
> to be a very specific and artificial usecase IIRC.

+1
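For context on why removing the bail-out is enough to restore forward
progress: out_of_memory() already short-circuits for an exiting task
and makes it the victim directly. Roughly, from mm/oom_kill.c
(abridged paraphrase, not the verbatim source):

	/*
	 * If current has a pending SIGKILL or is exiting, select it
	 * as the victim directly so it can allocate and quickly exit.
	 */
	if (task_will_free_mem(current)) {
		mark_oom_victim(current);
		queue_oom_reaper(current);
		return true;
	}

So with the task_is_dying() check gone, a dying task that hits the
limit is marked an OOM victim and the reaper is queued; combined with
the tsk_is_oom_victim() bypass in try_charge_memcg() below, its next
charge is forced and the exit path can complete.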
> >  unlock:
> >  	mutex_unlock(&oom_lock);
> > @@ -2198,6 +2194,9 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	if (unlikely(current->flags & PF_MEMALLOC))
> >  		goto force;
> >
> > +	if (unlikely(tsk_is_oom_victim(current)))
> > +		goto force;
> > +
> >  	if (unlikely(task_in_memcg_oom(current)))
> >  		goto nomem;
>
> This is more problematic as it doesn't cap a potential runaway and
> eventual global OOM which is not really great. In the past this was
> possible through vmalloc, which didn't bail out early for killed
> tasks. That risk has been mitigated by dd544141b9eb ("vmalloc: back
> off when the current task is OOM-killed"). I would like to keep some
> sort of protection from those runaways, whether that is a limited
> "reserve" for oom victims that would be per-memcg, or not letting
> them consume above the hard limit at all. Fundamentally, a limited
> reserve doesn't solve the underlying problem, it just makes it less
> likely, so the latter would be preferred by me TBH.

Right. There is no way to limit an OOM victim without risking a memory
deadlock.
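To spell that argument out: a fixed per-memcg reserve only helps if
the exit path's remaining charge demand happens to fit inside it, and
nothing bounds that demand. A toy illustration (hypothetical numbers,
not kernel code):

#include <stdio.h>

/* If the pages the exit path still needs to charge exceed a bounded
 * oom reserve, the victim deadlocks exactly as before - the reserve
 * only moves the threshold, it doesn't remove it. */
int main(void)
{
	long reserve = 32;	/* pages an oom victim may overcharge */
	long exit_need = 40;	/* pages the exit path must still charge */

	for (long p = 0; p < exit_need; p++) {
		if (reserve-- == 0) {
			printf("reserve exhausted after %ld pages: deadlock again\n", p);
			return 1;
		}
	}
	puts("exit fit inside the reserve (this time)");
	return 0;
}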