From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 16 Dec 2024 16:39:12 +0100
From: Michal Hocko <mhocko@suse.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yosry Ahmed <yosryahmed@google.com>, Rik van Riel <riel@surriel.com>,
	Balbir Singh <balbirs@nvidia.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>, cgroups@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kernel-team@meta.com, Nhat Pham <nphamcs@gmail.com>
Subject: Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
Message-ID: <Z2BJoDsMeKi4LQGe@tiehlicka>
References: <20241212115754.38f798b3@fangorn>
 <CAJD7tkY=bHv0obOpRiOg4aLMYNkbEjfOtpVSSzNJgVSwkzaNpA@mail.gmail.com>
 <20241212183012.GB1026@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20241212183012.GB1026@cmpxchg.org>
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Thu 12-12-24 13:30:12, Johannes Weiner wrote:
[...]
> So I'm also inclined to think this needs a reclaim/memcg-side fix. We
> have a somewhat tumultous history of policy in that space:
> 
> commit 7775face207922ea62a4e96b9cd45abfdc7b9840
> Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date:   Tue Mar 5 15:46:47 2019 -0800
> 
>     memcg: killed threads should not invoke memcg OOM killer
> 
> allowed dying tasks to simply force all charges and move on. This
> turned out to be too aggressive; there were instances of exiting,
> uncontained memcg tasks causing global OOMs. This lead to that:
> 
> commit a4ebf1b6ca1e011289677239a2a361fde4a88076
> Author: Vasily Averin <vasily.averin@linux.dev>
> Date:   Fri Nov 5 13:38:09 2021 -0700
> 
>     memcg: prohibit unconditional exceeding the limit of dying tasks
> 
> which reverted the bypass rather thoroughly. Now NO dying tasks, *not
> even OOM victims*, can force charges. I am not sure this is correct,
> either:

IIRC the reason for going this route was the lack of per-memcg oom
reserves. Global oom victims get some slack because the amount of
reserves is bounded. This is not the case for memcgs though.

> If we return -ENOMEM to an OOM victim in a fault, the fault handler
> will re-trigger OOM, which will find the existing OOM victim and do
> nothing, then restart the fault.

IIRC the task will handle the pending SIGKILL if the #PF fails. If the
charge happens from the exit path then we rely on ENOMEM returned from
gup as a signal to back off. Do we have any caller that keeps retrying
on ENOMEM?
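
For illustration only, the difference between a caller that treats
-ENOMEM as a back-off signal and one that keeps retrying can be modelled
in userspace C. Nothing below is kernel code: struct fake_memcg,
fake_charge(), caller_backs_off() and caller_retries() are made-up
names, and the retry cap exists only so that the sketch terminates.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a memcg that is already at its hard limit. */
struct fake_memcg {
	unsigned long usage;	/* pages currently charged */
	unsigned long limit;	/* hard limit in pages */
	unsigned long attempts;	/* how many times a charge was tried */
};

/* Toy charge: fails with -ENOMEM whenever the limit would be exceeded. */
static int fake_charge(struct fake_memcg *cg, unsigned long nr_pages)
{
	cg->attempts++;
	if (cg->usage + nr_pages > cg->limit)
		return -ENOMEM;
	cg->usage += nr_pages;
	return 0;
}

/* Expected behaviour: one attempt, then propagate -ENOMEM upwards. */
static int caller_backs_off(struct fake_memcg *cg)
{
	return fake_charge(cg, 1);
}

/*
 * Suspected behaviour: retry on -ENOMEM. Nothing frees memory between
 * attempts, so every attempt fails; max_tries only bounds this sketch.
 */
static int caller_retries(struct fake_memcg *cg, unsigned long max_tries)
{
	int ret = -ENOMEM;

	assert(max_tries > 0);
	while (max_tries-- && ret == -ENOMEM)
		ret = fake_charge(cg, 1);
	return ret;
}
```

In this model the backing-off caller costs a single failed charge, while
the retrying caller burns every attempt it is given without making
progress, which is the pattern that would need to be found in a real
exit path.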

> This is a memory deadlock. The page
> allocator gives OOM victims access to reserves for that reason.

> Actually, it looks even worse. For some reason we're not triggering
> OOM from dying tasks:
> 
>         ret = task_is_dying() || out_of_memory(&oc);
> 
> Even though dying tasks are in no way privileged or allowed to exit
> expediently. Why shouldn't they trigger the OOM killer like anybody
> else trying to allocate memory?

Good question! I suspect this early bail out is based on the assumption
that a dying task will free up its memory soon, so the oom killer is
unnecessary.
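
The bail-out in the quoted line relies on C's short-circuit `||`: when
the left operand is true, the right operand is never evaluated. A
minimal userspace sketch follows; the stubs only borrow the names from
the quoted hunk, they are not the kernel implementations, and the
argument-less out_of_memory() and the bailout_demo() helper are
simplifications made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

static bool dying;			/* stub state: is the task dying? */
static unsigned int oom_invocations;	/* how often the stub OOM killer ran */

/* Stub named after the kernel helper; the real one inspects current. */
static bool task_is_dying(void)
{
	return dying;
}

/* Stub OOM killer: counts calls and pretends it found a victim. */
static bool out_of_memory(void)
{
	oom_invocations++;
	return true;
}

/* Current form: a dying task short-circuits past the OOM killer. */
static bool oom_with_bailout(void)
{
	return task_is_dying() || out_of_memory();
}

/* Form proposed in the quoted diff: always invoke the OOM killer. */
static bool oom_without_bailout(void)
{
	return out_of_memory();
}

/* Walk through both forms for a dying task. */
void bailout_demo(void)
{
	dying = true;
	oom_invocations = 0;

	assert(oom_with_bailout());	/* reports success ... */
	assert(oom_invocations == 0);	/* ... but no OOM kill was attempted */

	assert(oom_without_bailout());
	assert(oom_invocations == 1);	/* the OOM killer actually ran */
}
```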

> As it stands, it seems we have dying tasks getting trapped in an
> endless fault->reclaim cycle; with no access to the OOM killer and no
> access to reserves. Presumably this is what's going on here?

As mentioned above this seems really surprising. It would indicate
that something in the exit path keeps retrying when getting ENOMEM
from gup or from a GFP_ACCOUNT allocation. GFP_NOFAIL requests are
allowed to over-consume.

> I think we want something like this:
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 53db98d2c4a1..be6b6e72bde5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1596,11 +1596,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (mem_cgroup_margin(memcg) >= (1 << order))
>  		goto unlock;
>  
> -	/*
> -	 * A few threads which were not waiting at mutex_lock_killable() can
> -	 * fail to bail out. Therefore, check again after holding oom_lock.
> -	 */
> -	ret = task_is_dying() || out_of_memory(&oc);
> +	ret = out_of_memory(&oc);

I am not against this as it would allow async memory reclaim by the
oom_reaper in the worst case. This could potentially reintroduce the "No
victim available" case described by 7775face2079 ("memcg: killed threads
should not invoke memcg OOM killer") but that seemed to be a very
specific and artificial use case IIRC.

>  
>  unlock:
>  	mutex_unlock(&oom_lock);
> @@ -2198,6 +2194,9 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (unlikely(current->flags & PF_MEMALLOC))
>  		goto force;
>  
> +	if (unlikely(tsk_is_oom_victim(current)))
> +		goto force;
> +
>  	if (unlikely(task_in_memcg_oom(current)))
>  		goto nomem;

This is more problematic as it doesn't cap a potential runaway and an
eventual global OOM, which is not really great. In the past this was
possible through vmalloc, which didn't bail out early for killed tasks.
That risk has been mitigated by dd544141b9eb ("vmalloc: back off when
the current task is OOM-killed"). I would like to keep some sort of
protection from those runaways, whether that is a limited per-memcg
"reserve" for oom victims or not letting them consume above the hard
limit at all. Fundamentally a limited reserve doesn't solve the
underlying problem, it just makes it less likely, so the latter would be
my preference TBH.
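
To make the three options concrete, here is a rough userspace model of
how far an OOM victim could overshoot the hard limit under each of them.
It is a sketch under stated assumptions, not a proposal: the policy
names, struct toy_memcg, the 100-page limit and the 16-page reserve are
all invented for the example, and no such per-memcg reserve exists in
the kernel today.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical policies for charges made by an OOM victim. */
enum victim_policy {
	VICTIM_STRICT,	/* never exceed the hard limit */
	VICTIM_RESERVE,	/* may exceed the limit by at most `reserve` pages */
	VICTIM_FORCE,	/* unconditional force charge, no cap */
};

struct toy_memcg {
	unsigned long usage;	/* pages currently charged */
	unsigned long limit;	/* hard limit in pages */
	unsigned long reserve;	/* extra pages OOM victims may use */
};

/* Charge nr_pages on behalf of an OOM victim; returns 0 or -ENOMEM. */
static int toy_charge_victim(struct toy_memcg *cg, unsigned long nr_pages,
			     enum victim_policy policy)
{
	unsigned long cap = cg->limit;

	if (policy == VICTIM_RESERVE)
		cap += cg->reserve;

	if (policy != VICTIM_FORCE && cg->usage + nr_pages > cap)
		return -ENOMEM;

	cg->usage += nr_pages;
	return 0;
}

/*
 * Start at the limit and charge one page at a time until a charge is
 * refused or `tries` attempts were made; return pages above the limit.
 */
static unsigned long toy_overshoot(enum victim_policy policy,
				   unsigned long tries)
{
	struct toy_memcg cg = { .usage = 100, .limit = 100, .reserve = 16 };

	while (tries-- && toy_charge_victim(&cg, 1, policy) == 0)
		;
	assert(cg.usage >= cg.limit);
	return cg.usage - cg.limit;
}
```

With these made-up numbers and 1000 attempts, the strict policy
overshoots by 0 pages, the reserve policy by exactly the 16-page
reserve, and the force policy by all 1000 pages requested, which is the
runaway case.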

Before we do that it would be really good to understand the source of
those retries. Maybe I am missing something really obvious but those
shouldn't really happen. 

-- 
Michal Hocko
SUSE Labs