From: Yosry Ahmed
Date: Wed, 11 Dec 2024 09:00:41 -0800
Subject: Re: [PATCH] memcg: allow exiting tasks to write back data to swap
To: Rik van Riel
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
 Muchun Song, Andrew Morton, cgroups@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, kernel-team@meta.com, Nhat Pham
References: <20241211105336.380cb545@fangorn>
 <768a404c6f951e09c4bfc93c84ee1553aa139068.camel@surriel.com>
In-Reply-To: <768a404c6f951e09c4bfc93c84ee1553aa139068.camel@surriel.com>

On Wed, Dec 11, 2024 at 8:34 AM Rik van Riel wrote:
>
> On Wed, 2024-12-11 at 08:26 -0800, Yosry Ahmed wrote:
> > On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel wrote:
> > >
> > > +++ b/mm/memcontrol.c
> > > @@ -5371,6 +5371,15 @@ bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> > >         if (!zswap_is_enabled())
> > >                 return true;
> > >
> > > +       /*
> > > +        * Always allow exiting tasks to push data to swap. A process in
> > > +        * the middle of exit cannot get OOM killed, but may need to push
> > > +        * uncompressible data to swap in order to get the cgroup memory
> > > +        * use below the limit, and make progress with the exit.
> > > +        */
> > > +       if ((current->flags & PF_EXITING) && memcg == mem_cgroup_from_task(current))
> > > +               return true;
> > > +
> >
> > I have a few questions:
> > (a) If the task is being OOM killed it should be able to charge memory
> > beyond memory.max, so why do we need to get the usage down below the
> > limit?
> >
> If it is a kernel directed memcg OOM kill, that is
> true.
>
> However, if the exit comes from somewhere else,
> like a userspace oomd kill, we might not hit that
> code path.

Why do we treat dying tasks differently based on the source of the kill?

>
> > Looking at the other thread with Michal, it looks like it's because we
> > have to go into reclaim first before we get to the point of force
> > charging for dying tasks, and we spend too much time in reclaim. Is
> > that correct?
> >
> > If that's the case, I am wondering if the real problem is that we
> > check mem_cgroup_zswap_writeback_enabled() too late in the process.
> > Reclaim ages the LRUs, isolates pages, unmaps them, allocates swap
> > entries, only to realize it cannot swap in swap_writepage().
> >
> > Should we check for this in can_reclaim_anon_pages()? If zswap
> > writeback is disabled and we are already at the memcg limit (or zswap
> > limit for that matter), we should avoid scanning anon memory to begin
> > with. The problem is that if we race with memory being freed we may
> > have some extra OOM kills, but I am not sure how common this case
> > would be.
>
> However, we don't know until the attempted zswap write
> whether the memory is compressible, and whether doing
> a bunch of zswap writes will help us bring our memcg
> down below its memory.max limit.

If we are at memory.max (or memory.zswap.max), we can't compress pages
into zswap anyway, regardless of their compressibility.

So what I am saying is: if we are already at the limit (pages cannot go
into zswap) and writeback is disabled (pages cannot go into swapfiles),
then we should probably avoid scanning the anon LRUs at all, instead of
wasting cycles isolating, unmapping, and trying to reclaim pages only to
fail at the last step.

> >
> > (b) Should we use mem_cgroup_is_descendant() or mm_match_memcg() in
> > case we are reclaiming from an ancestor and we hit the limit of that
> > ancestor?
> >
> I don't know if we need or want to reclaim from any
> other memcgs than those of the exiting process itself.
>
> A small blast radius seems like it could be desirable,
> but I'm open to other ideas :)

By the hierarchy, the exiting process is also part of every ancestor
cgroup.

If we have the following hierarchy:

root
  |
  A
  |
  B

then a process in cgroup B could be getting OOM killed because it hit
the limit of A, not B. In that case, reclaiming from A helps us get
below the limit. We could check whether the cgroup is an ancestor that
hit its limit, but maybe that is overkill.

>
> > (c) mem_cgroup_from_task() should be called in an RCU read section (or
> > we need something like rcu_access_pointer() if we are not dereferencing
> > the pointer).
> >
> I'll add this in v2.
>
> --
> All Rights Reversed.
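
To make the two suggestions above a bit more concrete, here is a rough
userspace sketch of the decisions I have in mind. It is not kernel code
and not a proposed patch: struct toy_memcg and every toy_* helper are
hypothetical names made up for illustration, and the accounting is
simplified to plain counters. It only models (1) the ancestor walk that
something like mem_cgroup_is_descendant() would give us for point (b),
and (2) bailing out of anon scanning early when pages can go neither
into zswap nor to a swapfile.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct mem_cgroup; not the kernel type. */
struct toy_memcg {
	struct toy_memcg *parent;	/* NULL for the root cgroup */
	unsigned long usage;		/* pages charged to this subtree */
	unsigned long max;		/* models memory.max */
	unsigned long zswap_usage;	/* pages currently held in zswap */
	unsigned long zswap_max;	/* models memory.zswap.max */
	bool writeback;			/* models memory.zswap.writeback */
};

/* Walk parent pointers: true if memcg is root itself or sits below it. */
bool toy_is_descendant(const struct toy_memcg *memcg,
		       const struct toy_memcg *root)
{
	for (; memcg; memcg = memcg->parent)
		if (memcg == root)
			return true;
	return false;
}

/*
 * Point (b): let an exiting task write back when the memcg under
 * reclaim is the task's own memcg or any of its ancestors.
 */
bool toy_exiting_task_may_write_back(const struct toy_memcg *target,
				     const struct toy_memcg *task_memcg,
				     bool exiting)
{
	return exiting && toy_is_descendant(task_memcg, target);
}

/*
 * Early bail-out: if zswap cannot take more pages and writeback to a
 * swapfile is disabled, scanning the anon LRUs cannot free anything.
 */
bool toy_can_reclaim_anon(const struct toy_memcg *memcg)
{
	bool zswap_full = memcg->usage >= memcg->max ||
			  memcg->zswap_usage >= memcg->zswap_max;

	return memcg->writeback || !zswap_full;
}
```

With the root/A/B hierarchy from above, a task exiting in B passes the
check when reclaim targets A, while a plain pointer comparison against
the task's own memcg would not. The early check is racy in the real
kernel (memory can be freed concurrently), which is the extra-OOM-kill
concern mentioned earlier; the sketch ignores that.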