From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8CB9DE7717D for ; Wed, 11 Dec 2024 17:20:42 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 278AD6B00A0; Wed, 11 Dec 2024 12:20:42 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 22A176B00A1; Wed, 11 Dec 2024 12:20:42 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 119396B00A2; Wed, 11 Dec 2024 12:20:42 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id E834D6B00A0 for ; Wed, 11 Dec 2024 12:20:41 -0500 (EST) Received: from smtpin29.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 96D9B4399F for ; Wed, 11 Dec 2024 17:20:41 +0000 (UTC) X-FDA: 82883342472.29.BABA211 Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147]) by imf08.hostedemail.com (Postfix) with ESMTP id 1C2E2160011 for ; Wed, 11 Dec 2024 17:20:23 +0000 (UTC) Authentication-Results: imf08.hostedemail.com; dkim=none; dmarc=none; spf=pass (imf08.hostedemail.com: domain of riel@shelob.surriel.com designates 96.67.55.147 as permitted sender) smtp.mailfrom=riel@shelob.surriel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1733937629; h=from:from:sender:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=0RtV6dn6xI7zBCuJpQP0FmXIOL/lVgzsWgWSol16mcc=; b=gYzU88Dw2EHezZBYha0h+MDNCu43Z6HnCEGevLpc5H3Nqe9AG6rjPL/rWVMyX2wJ5aSQ/E og7Ravh5EWhh1RWBHW/OV8Be9//4ctYP1QoY3YxDnydWvmTdApKdQF/ijaMbyWHuGRSlYS 6KchWwb41M5gjUdhRjdsn3QGzapftVU= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1733937629; a=rsa-sha256; cv=none; b=htwU6vb0oyzewV/EwAfcgNpW7rxT9LZbz+Q3WPH8R+ZslT/IrsnV+186kH7lk/IPxC4SY5 Tpi71Rfj65gnggQ5fTwMiHAIkaqUH1wDiWMjMoUsUOmn73JusDRTTssVFI6gGSrf6qlosc S8C6vu5KbgdNHxirnFLH1O0lpwPiSAc= ARC-Authentication-Results: i=1; imf08.hostedemail.com; dkim=none; dmarc=none; spf=pass (imf08.hostedemail.com: domain of riel@shelob.surriel.com designates 96.67.55.147 as permitted sender) smtp.mailfrom=riel@shelob.surriel.com Received: from fangorn.home.surriel.com ([10.0.13.7]) by shelob.surriel.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.97.1) (envelope-from ) id 1tLQNF-0000000015w-1wQM; Wed, 11 Dec 2024 12:19:45 -0500 Message-ID: <6bc895883abca3522c9efc0c56189741194581e5.camel@surriel.com> Subject: Re: [PATCH] memcg: allow exiting tasks to write back data to swap From: Rik van Riel To: Yosry Ahmed Cc: Johannes Weiner , Michal Hocko , Roman Gushchin , Shakeel Butt , Muchun Song , Andrew Morton , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@meta.com, Nhat Pham Date: Wed, 11 Dec 2024 12:19:45 -0500 In-Reply-To: References: <20241211105336.380cb545@fangorn> <768a404c6f951e09c4bfc93c84ee1553aa139068.camel@surriel.com> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable User-Agent: Evolution 3.54.1 (3.54.1-1.fc41) MIME-Version: 1.0 X-Stat-Signature: sn9b3mfxm67r5sf7c5jda76cfx4d9x3o X-Rspamd-Queue-Id: 1C2E2160011 X-Rspam-User: X-Rspamd-Server: rspam01 X-HE-Tag: 1733937623-8878 X-HE-Meta: U2FsdGVkX19rurcKV52ErPSZRjIf7tBmP8+eMmo5A4INALhITmO3xCTYEz9OOZM//92NbO3ZBu9E1rcwaGmx6MtFnTRQVTxxlo6K8uSyiaUQEbyDaCBmuSrNjE+ftGt2MRnP0pGX8fPUCjL9mJGXrkKGW/rdRBfzc+UEdjbxJUj60FEfEajdkVAi1/EaCE+MVkW+DrkebemjXBI1UHuf+u8+E3ickQnO+q/N9zixzu3yp7DNPQ/EqomrgV9uX7F4VOL61hMM0oraZ0p78ytBiJBAV6ujbgx7p0wATa7cUP70xjol0phHzPzPnHN+MD9f+/LTQy/oDTHm3q6KmIJQR5GdAmS6iinxVRFazlOqCFO76hhsrOaiK6/DQCvbIf7sF1SFHRrNoojAt8xfuix6jXgu2cT6ZnYzTTdkj48mV5DR3edlvsMzm2LdlltCs3A96bvIta3rjq2p8s5NB7KGXkg5EmtikUZWIWW0OsDztvD3cx2jC1QG/a2RdDkoR2/skdHkdKpq3hRqqSy4/tyuUIsQtdeYjfAQRfNwAJbpx7vmzbGeJ4h7roKKZo5PauTICpiXAwhOzealyTyShTDIxlgJYEOpfrcqBU11P/R/uCgxDTdN8wokI16oGuqKPRN9o+qfCCOt7eounFqqBzVFqaBmympxRRkIFauBc/VdjJ4fuI/rNcsyLyCpWfzvxc+lZX5k696SC2hofhgnQkfauKNeme/rkoilm5sqja64+3Vc+lR2tt2vhSKGlfu1VYJIhm7rGdVtt9E6vIUZEXyM2Ejn92RWiNS+jQh694DmkRsii983Zegnn43mmUaob0orR8fGvr9D45xflHDTgaT04ErHoHYALE3Ed4VQ+gLiq+GMRJlFwj+6675uREpLrL3IReyEZmZj14fZjhk+s/eoBGrtO30BSO+xPYtgRy6m2Ot9OFat9kozoZXqH2U3HfYJBluYC4yRDn++axCiF3l shuYNDkt CMoYzliGTMkFpgtuAo06RiVJRcf1vVHirZHteP244OEPblKx4mPXoxV8N1YRoGM4PMQ5f57eSkuUxNFVhkMA12h8SjlegrIRW9NhMp47m2jVCMxptzzOftf5b1TpyKR1LKFhaeb/8VJkxXYgpQn9kKAgsOdufWRJy1oFbqlLdAikIKO3iKaAUO33lycb3WR+VVTEHYmnbKfdFS/rTPX1QtROT0Krtb4SvvF4c824l1qB1p4fIiCxvcq5Lr10umzzayWljDX/huzhkiePUpsTQyE+Yb8o8K+s2dRQWvDlxv1MY2V1nFL/IoB8S9t/bSkf0oo+62XhiCpI3G9Y= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, 2024-12-11 at 09:00 -0800, Yosry Ahmed wrote: > On Wed, Dec 11, 2024 at 8:34=E2=80=AFAM Rik van Riel > wrote: > >=20 > > On Wed, 2024-12-11 at 08:26 -0800, Yosry Ahmed wrote: > > > On Wed, Dec 11, 2024 at 7:54=E2=80=AFAM Rik van Riel > > > wrote: > > > >=20 > > > > +++ b/mm/memcontrol.c > > > > @@ -5371,6 +5371,15 @@ bool > > > > mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg) > > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (!zswap_is_enabled()) > > > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 return true; > > > >=20 > > > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* > > > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * Always allow exiting = tasks to push data to swap. A > > > > process in > > > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * the middle of exit ca= nnot get OOM killed, but may > > > > need > > > > to push > > > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * uncompressible data t= o swap in order to get the > > > > cgroup > > > > memory > > > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * use below the limit, = and make progress with the > > > > exit. > > > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 */ > > > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if ((current->flags & PF_EXIT= ING) && memcg =3D=3D > > > > mem_cgroup_from_task(current)) > > > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0 return true; > > > > + > > >=20 > > > I have a few questions: > > > (a) If the task is being OOM killed it should be able to charge > > > memory > > > beyond memory.max, so why do we need to get the usage down below > > > the > > > limit? > > >=20 > > If it is a kernel directed memcg OOM kill, that is > > true. > >=20 > > However, if the exit comes from somewhere else, > > like a userspace oomd kill, we might not hit that > > code path. >=20 > Why do we treat dying tasks differently based on the source of the > kill? >=20 Are you saying we should fail allocations for every dying task, and add a check for PF_EXITING in here? if (unlikely(task_in_memcg_oom(current))) goto nomem; > > However, we don't know until the attempted zswap write > > whether the memory is compressible, and whether doing > > a bunch of zswap writes will help us bring our memcg > > down below its memory.max limit. >=20 > If we are at memory.max (or memory.zswap.max), we can't compress > pages > into zswap anyway, regardless of their compressibility. >=20 Wait, this is news to me. This seems like something we should fix, rather than live with, since compressing the data to a smaller size could bring us below memory.max. Is this "cannot compress when at memory.max" behavior intentional, or just a side effect of how things happen to be? Won't the allocations made from zswap_store ignore the memory.max limit because PF_MEMALLOC is set? > > >=20 > > > (b) Should we use mem_cgroup_is_descendant() or mm_match_memcg() > > > in > > > case we are reclaiming from an ancestor and we hit the limit of > > > that > > > ancestor? > > >=20 > > I don't know if we need or want to reclaim from any > > other memcgs than those of the exiting process itself. > >=20 > > A small blast radius seems like it could be desirable, > > but I'm open to other ideas :) >=20 > The exiting process is part of all the ancestor cgroups by the > hierarchy. >=20 > If we have the following hierarchy: > root > =C2=A0=C2=A0 | > =C2=A0 A > =C2=A0=C2=A0 | > =C2=A0 B >=20 > Then a process in cgroup B could be getting OOM killed due to hitting > the limit of A, not B. In which case, reclaiming from A helps us get > below the limit. We can check if the cgroup is an ancestor and it hit > its limit, but maybe that's an overkill. Since we're dealing with a corner case anyway, I suppose there's no harm using mm_match_cgroup, which also happens to be cleaner than the code I have right now. --=20 All Rights Reversed.