From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20241212115754.38f798b3@fangorn> <CAJD7tkY=bHv0obOpRiOg4aLMYNkbEjfOtpVSSzNJgVSwkzaNpA@mail.gmail.com>
 <20241212183012.GB1026@cmpxchg.org> <pr5llphyxbbvv3fgn63crohd7y3vsxdif2emst2ac2p3qvkeg6@ny7d43mgmp3k>
In-Reply-To: <pr5llphyxbbvv3fgn63crohd7y3vsxdif2emst2ac2p3qvkeg6@ny7d43mgmp3k>
From: Yosry Ahmed <yosryahmed@google.com>
Date: Thu, 12 Dec 2024 13:41:06 -0800
Message-ID: <CAJD7tkYMwrLTvcORnXVjQ4s+UMSTZD5jddv78awOPw_DqYFufA@mail.gmail.com>
Subject: Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>, Rik van Riel <riel@surriel.com>, 
	Balbir Singh <balbirs@nvidia.com>, Michal Hocko <mhocko@kernel.org>, 
	Roman Gushchin <roman.gushchin@linux.dev>, Muchun Song <muchun.song@linux.dev>, 
	Andrew Morton <akpm@linux-foundation.org>, cgroups@vger.kernel.org, linux-mm@kvack.org, 
	linux-kernel@vger.kernel.org, kernel-team@meta.com, 
	Nhat Pham <nphamcs@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Thu, Dec 12, 2024 at 1:35 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Dec 12, 2024 at 01:30:12PM -0500, Johannes Weiner wrote:
> > On Thu, Dec 12, 2024 at 09:06:25AM -0800, Yosry Ahmed wrote:
> > > On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel <riel@surriel.com> wrote:
> > > >
> > > > A task already in exit can get stuck trying to allocate pages, if its
> > > > cgroup is at the memory.max limit, the cgroup is using zswap, but
> > > > zswap writeback is disabled, and the remaining memory in the cgroup is
> > > > not compressible.
> > > >
> > > > This seems like an unlikely confluence of events, but it can happen
> > > > quite easily if a cgroup is OOM killed due to exceeding its memory.max
> > > > limit, and all the tasks in the cgroup are trying to exit simultaneously.
> > > >
> > > > When this happens, it can sometimes take hours for tasks to exit,
> > > > as they are all trying to squeeze things into zswap to bring the group's
> > > > memory consumption below memory.max.
> > > >
> > > > Allowing these exiting programs to push some memory from their own
> > > > cgroup into swap allows them to quickly bring the cgroup's memory
> > > > consumption below memory.max, and exit in seconds rather than hours.
> > > >
> > > > Signed-off-by: Rik van Riel <riel@surriel.com>
> > >
> > > Thanks for sending a v2.
> > >
> > > I still think maybe this needs to be fixed on the memcg side, at least
> > > by not making exiting tasks try really hard to reclaim memory to the
> > > point where this becomes a problem. IIUC there could be other reasons
> > > why reclaim may take too long, but maybe not as pathological as this
> > > case to be fair. I will let the memcg maintainers chime in for this.
> > >
> > > If there's a fundamental reason why this cannot be fixed on the memcg
> > > side, I don't object to this change.
> > >
> > > Nhat, any objections on your end? I think your fleet workloads were
> > > the first users of this interface. Does this break their expectations?
> >
> > Yes, I don't think we can do this, unfortunately :( There can be a
> > variety of reasons for why a user might want to prohibit disk swap for
> > a certain cgroup, and we can't assume it's okay to make exceptions.
> >
> > There might also not *be* any disk swap to overflow into after Nhat's
> > virtual swap patches. Presumably zram would still have the issue too.
>
> Very good points.
>
> >
> > So I'm also inclined to think this needs a reclaim/memcg-side fix. We
> > have a somewhat tumultuous history of policy in that space:
> >
> > commit 7775face207922ea62a4e96b9cd45abfdc7b9840
> > Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date:   Tue Mar 5 15:46:47 2019 -0800
> >
> >     memcg: killed threads should not invoke memcg OOM killer
> >
> > allowed dying tasks to simply force all charges and move on. This
> > turned out to be too aggressive; there were instances of exiting,
> > uncontained memcg tasks causing global OOMs. This led to that:
> >
> > commit a4ebf1b6ca1e011289677239a2a361fde4a88076
> > Author: Vasily Averin <vasily.averin@linux.dev>
> > Date:   Fri Nov 5 13:38:09 2021 -0700
> >
> >     memcg: prohibit unconditional exceeding the limit of dying tasks
> >
> > which reverted the bypass rather thoroughly. Now NO dying tasks, *not
> > even OOM victims*, can force charges. I am not sure this is correct,
> > either:
> >
> > If we return -ENOMEM to an OOM victim in a fault, the fault handler
> > will re-trigger OOM, which will find the existing OOM victim and do
> > nothing, then restart the fault. This is a memory deadlock. The page
> > allocator gives OOM victims access to reserves for that reason.
> >
> > Actually, it looks even worse. For some reason we're not triggering
> > OOM from dying tasks:
> >
> >         ret = task_is_dying() || out_of_memory(&oc);
> >
> > Even though dying tasks are in no way privileged or allowed to exit
> > expediently. Why shouldn't they trigger the OOM killer like anybody
> > else trying to allocate memory?
>
> This is a very good point and actually out_of_memory() will mark the
> dying process as oom victim and put it in the oom reaper's list which
> should help further in such situation.
>
> >
> > As it stands, it seems we have dying tasks getting trapped in an
> > endless fault->reclaim cycle; with no access to the OOM killer and no
> > access to reserves. Presumably this is what's going on here?
> >
> > I think we want something like this:
>
> The following patch looks good to me. Let's test this out (hopefully Rik
> will be able to find a live impacted machine) and move forward with this
> fix.

I agree with this too. As Shakeel mentioned, this seemed like a
stopgap and not an actual fix for the underlying problem. Johannes
further outlined how the stopgap can be problematic.

Let's try to fix this on the memcg/reclaim/OOM side, and properly
treat dying tasks instead of forcing them into potentially super slow
reclaim paths. Hopefully Johannes's patch fixes this.
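
For anyone following along, here is a rough userspace sketch of the
short-circuit Johannes pointed out in the quoted
"ret = task_is_dying() || out_of_memory(&oc);" line. It is only a model
under stated assumptions, not kernel code and not Johannes's patch:
struct model_task, model_out_of_memory(), oom_short_circuit(),
oom_always() and model_demo() are all hypothetical names, and the only
behavior modeled is what Shakeel described, namely that out_of_memory()
marks a dying caller as the OOM victim.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of this is the kernel's definition. */
struct model_task {
	bool dying;      /* stands in for task_is_dying() */
	bool oom_victim; /* set once the model OOM killer marks the task */
};

/* Counts how often the model OOM killer was actually consulted. */
int oom_invocations;

/*
 * Model of out_of_memory(): a dying caller is marked as the victim,
 * which is the behavior described in the thread. Victim selection for
 * non-dying callers is deliberately not modeled.
 */
bool model_out_of_memory(struct model_task *t)
{
	oom_invocations++;
	if (t->dying)
		t->oom_victim = true;
	return true;
}

/*
 * Same shape as the quoted line: || short-circuits, so a dying task
 * reports success without ever reaching the OOM killer.
 */
bool oom_short_circuit(struct model_task *t)
{
	return t->dying || model_out_of_memory(t);
}

/* Alternative shape: every caller consults the OOM killer. */
bool oom_always(struct model_task *t)
{
	return model_out_of_memory(t);
}

/* Small walk-through of both shapes for a dying task. */
void model_demo(void)
{
	struct model_task a = { .dying = true, .oom_victim = false };
	struct model_task b = { .dying = true, .oom_victim = false };

	oom_invocations = 0;

	assert(oom_short_circuit(&a)); /* reports success ... */
	assert(oom_invocations == 0);  /* ... without invoking OOM */
	assert(!a.oom_victim);         /* and the task is never marked */

	assert(oom_always(&b));
	assert(oom_invocations == 1);
	assert(b.oom_victim);          /* dying task is now the victim */
}
```

In this model the short-circuit variant leaves the dying task unmarked,
which matches the "no access to the OOM killer" half of the concern
above. Whether the real fix should look like oom_always() is for the
actual patch and testing to decide.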