From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6ADDFE77182 for ; Wed, 11 Dec 2024 16:35:08 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 0627A6B009A; Wed, 11 Dec 2024 11:35:08 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 010AD6B009B; Wed, 11 Dec 2024 11:35:07 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E1B166B009C; Wed, 11 Dec 2024 11:35:07 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id C2B776B009A for ; Wed, 11 Dec 2024 11:35:07 -0500 (EST) Received: from smtpin08.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 7CA9C160F68 for ; Wed, 11 Dec 2024 16:35:07 +0000 (UTC) X-FDA: 82883226762.08.7A71480 Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147]) by imf14.hostedemail.com (Postfix) with ESMTP id 02D55100014 for ; Wed, 11 Dec 2024 16:34:39 +0000 (UTC) Authentication-Results: imf14.hostedemail.com; dkim=none; spf=pass (imf14.hostedemail.com: domain of riel@shelob.surriel.com designates 96.67.55.147 as permitted sender) smtp.mailfrom=riel@shelob.surriel.com; dmarc=none ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1733934890; h=from:from:sender:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=lXK83SY2csqG5HRslsjcEkHsrb2CoOVYQDZ+nZAVIo8=; b=GYNhxGJQHa60l1Wjv3ImENdmwTou8pGYR80IrpyHrBDn7F9DZWTss4fgQpdRa49S28+FmB djLGI9wCUPOnCJvSQd7grgHDMfTYpE4MmK8mQNiGY6E5aqLjkBij1/qcJaBfgQprMEu1C+ KMzN+b3PnVVM6oWt8tKIgCopLoa4FU0= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1733934890; a=rsa-sha256; cv=none; b=PWdSfQTOcqZY+FX7Mftpdxa0rfzCO8qQmsfDIbrPBufWZTOrnMDhr8kgET4z0M4PY2LpDa BzQSyLDV6iCNr/B+C65uL73qcoh9CgPDnQpHHG/QIWkrUsxv2qD4PLWcCAoPjYgbiI1UGw nUJcqAK+t8wEiWQpgf3y0yc5+U/8vTw= ARC-Authentication-Results: i=1; imf14.hostedemail.com; dkim=none; spf=pass (imf14.hostedemail.com: domain of riel@shelob.surriel.com designates 96.67.55.147 as permitted sender) smtp.mailfrom=riel@shelob.surriel.com; dmarc=none Received: from fangorn.home.surriel.com ([10.0.13.7]) by shelob.surriel.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.97.1) (envelope-from ) id 1tLPfa-000000000aJ-3MFM; Wed, 11 Dec 2024 11:34:38 -0500 Message-ID: <768a404c6f951e09c4bfc93c84ee1553aa139068.camel@surriel.com> Subject: Re: [PATCH] memcg: allow exiting tasks to write back data to swap From: Rik van Riel To: Yosry Ahmed Cc: Johannes Weiner , Michal Hocko , Roman Gushchin , Shakeel Butt , Muchun Song , Andrew Morton , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@meta.com, Nhat Pham Date: Wed, 11 Dec 2024 11:34:38 -0500 In-Reply-To: References: <20241211105336.380cb545@fangorn> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable User-Agent: Evolution 3.54.1 (3.54.1-1.fc41) MIME-Version: 1.0 X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: 02D55100014 X-Stat-Signature: b1tccj9duhr3f1kap8izhfyit41619c4 X-Rspam-User: X-HE-Tag: 1733934879-309757 X-HE-Meta: U2FsdGVkX1+1XGEWav7mRxVllnUVr9tXRl1K+XCeKCNU44MyE+NQyg2O/PgyRiakDULNofWGNivlIejq6mjMJXKFhxkRn0Xam10qWJ+NwVVHMNMbL9rXXSQ+AAMG+L2Hn1EFU2KMhvvWg1hNlvdZXDRW2ygNOFOTj0L+nj1xbwH2X99CAa0tx8wtEoBVQqbmaKUjPoMpl4phSVM36MfI86JJNlWNVz6NoPZOP2MdhuXlgbKCWMmXI/W6xjGWayAbCsQkAiRSc+W42hi04WoZBOOiJOLySvFtl0el5zs6oXYhGFSI7HRig/FzyjH38QjOQxA93yibDxJr8Ko9S92DQDlvrmwhm1e85DP/le1BakEPxexU2VlAHIjZqLQ+vyF9Y4puQ4pVSy8fZBJm+GQOqZnCU+rjllUDZLlfLn7VdnL8eUsXu9p+B/D09hIsSNHDUHg1uzhYOaCrIBqaGLURSUTlegVwYGHooAKwrbgjwrIavUcHAbBaW02DLeziyefTv6fdSKqVfjKhw4SIMY+6oTgjtGFRxe09+qOBo6IThoe3ZxdOxWPdfYjB0sDr2WF+QKqOoYMWjWUeEK0QOXKXvGPYrTjdM7SJ6pLtHwtdkoMVon2pZ6a2jdOdkH/Eb9iHi69JdwVArrhVIH16bpf+wtmwusRqHHv+DH3ll3LHW44WbKsuXjE9KxER0PJq41nc2sY2ZKkogvo3SvdfwoPSstckErGvzbZEVrrQHQTA+AdG/o6HSORCmWyoTznSWW4+KlaModEji4U6foxRF1ihvgnY9bkj+3ek0rLmUcmxCI/FITFirmozKcQY5CUF5a5WVXE8wHB8aR4RWQJRvV2aTsG7XdQ19JPhGGu4DmmPOlF5iR7PoLWzYs/TcHXGLoVT7Kh9s1+ZqZ5SOLVXSf+GElk9A3BteJYhBMZelgo6Yn1uLVqL/C/V+jyMcMBJNj+r/m72PDy9EYscKuTrgDs SBHU4Wxm P7tPbYUzKr2Sl1GVrorqwtGmjfOJQjY6MwMpm/6l5NMRZF0JqCxPnf6QrsatS7eypCiKaMth3+C4sW6C8cNB5XQYB3U0R7/5doS5U2dOEau52qgLAwHVJRJu7S+Tbnryw2IcP+sE3qa8o46n+Dlv5lEvLy8f0B7IvT/YicIzjZc2SOm7sQ5bEPzZ7j/MTQ8gpGrSSgQ+IFLhcCEEysfbskE8Jv9oZIoZ4LZnM0j8mCFJtXkYmVhOI5DSbdu8LFsYAt+gv/tuUKmY/wTdAvXFGGMnr0QDk4uEoAwNdZ4vjWn/v8DJszGGN8kFKx57PC+MKuRtx6JDFkzM3HfU= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, 2024-12-11 at 08:26 -0800, Yosry Ahmed wrote: > On Wed, Dec 11, 2024 at 7:54=E2=80=AFAM Rik van Riel > wrote: > >=20 > > +++ b/mm/memcontrol.c > > @@ -5371,6 +5371,15 @@ bool > > mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg) > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (!zswap_is_enabled()) > > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 return true; > >=20 > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * Always allow exiting task= s to push data to swap. A > > process in > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * the middle of exit cannot= get OOM killed, but may need > > to push > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * uncompressible data to sw= ap in order to get the cgroup > > memory > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 * use below the limit, and = make progress with the exit. > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 */ > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if ((current->flags & PF_EXITING)= && memcg =3D=3D > > mem_cgroup_from_task(current)) > > +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 return true; > > + >=20 > I have a few questions: > (a) If the task is being OOM killed it should be able to charge > memory > beyond memory.max, so why do we need to get the usage down below the > limit? >=20 If it is a kernel directed memcg OOM kill, that is true. However, if the exit comes from somewhere else, like a userspace oomd kill, we might not hit that code path. > Looking at the other thread with Michal, it looks like it's because > we > have to go into reclaim first before we get to the point of force > charging for dying tasks, and we spend too much time in reclaim. Is > that correct? >=20 > If that's the case, I am wondering if the real problem is that we > check=C2=A0 mem_cgroup_zswap_writeback_enabled() too late in the process. > Reclaim ages the LRUs, isolates pages, unmaps them, allocates swap > entries, only to realize it cannot swap in swap_writepage(). >=20 > Should we check for this in can_reclaim_anon_pages()? If zswap > writeback is disabled and we are already at the memcg limit (or zswap > limit for that matter), we should avoid scanning anon memory to begin > with. The problem is that if we race with memory being freed we may > have some extra OOM kills, but I am not sure how common this case > would be. However, we don't know until the attempted zswap write whether the memory is compressible, and whether doing a bunch of zswap writes will help us bring our memcg down below its memory.max limit. >=20 > (b) Should we use mem_cgroup_is_descendant() or mm_match_memcg() in > case we are reclaiming from an ancestor and we hit the limit of that > ancestor? >=20 I don't know if we need or want to reclaim from any other memcgs than those of the exiting process itself. A small blast radius seems like it could be desirable, but I'm open to other ideas :) > (c) mem_cgroup_from_task() should be called in an RCU read section > (or > we need something like rcu_access_point() if we are not dereferencing > the pointer). >=20 I'll add this in v2. --=20 All Rights Reversed.