References: <20241212115754.38f798b3@fangorn> <20241212183012.GB1026@cmpxchg.org>
From: Yosry Ahmed
Date: Thu, 12 Dec 2024 13:41:06 -0800
Subject: Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
To: Shakeel Butt
Cc: Johannes Weiner, Rik van Riel, Balbir Singh, Michal Hocko,
    Roman Gushchin, Muchun Song, Andrew Morton, cgroups@vger.kernel.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@meta.com,
    Nhat Pham

On Thu, Dec 12, 2024 at 1:35 PM Shakeel Butt wrote:
>
> On Thu, Dec 12, 2024 at 01:30:12PM -0500, Johannes Weiner wrote:
> > On Thu, Dec 12, 2024 at 09:06:25AM -0800, Yosry Ahmed wrote:
> > > On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel wrote:
> > > >
> > > > A task already in exit can get stuck trying to allocate pages, if its
> > > > cgroup is at the memory.max limit, the cgroup is using zswap, but
> > > > zswap writeback is disabled, and the remaining memory in the cgroup
> > > > is not
> > > > compressible.
> > > >
> > > > This seems like an unlikely confluence of events, but it can happen
> > > > quite easily if a cgroup is OOM killed due to exceeding its memory.max
> > > > limit, and all the tasks in the cgroup are trying to exit simultaneously.
> > > >
> > > > When this happens, it can sometimes take hours for tasks to exit,
> > > > as they are all trying to squeeze things into zswap to bring the group's
> > > > memory consumption below memory.max.
> > > >
> > > > Allowing these exiting programs to push some memory from their own
> > > > cgroup into swap allows them to quickly bring the cgroup's memory
> > > > consumption below memory.max, and exit in seconds rather than hours.
> > > >
> > > > Signed-off-by: Rik van Riel
> > >
> > > Thanks for sending a v2.
> > >
> > > I still think maybe this needs to be fixed on the memcg side, at least
> > > by not making exiting tasks try really hard to reclaim memory to the
> > > point where this becomes a problem. IIUC there could be other reasons
> > > why reclaim may take too long, but maybe not as pathological as this
> > > case to be fair. I will let the memcg maintainers chime in for this.
> > >
> > > If there's a fundamental reason why this cannot be fixed on the memcg
> > > side, I don't object to this change.
> > >
> > > Nhat, any objections on your end? I think your fleet workloads were
> > > the first users of this interface. Does this break their expectations?
> >
> > Yes, I don't think we can do this, unfortunately :( There can be a
> > variety of reasons for why a user might want to prohibit disk swap for
> > a certain cgroup, and we can't assume it's okay to make exceptions.
> >
> > There might also not *be* any disk swap to overflow into after Nhat's
> > virtual swap patches. Presumably zram would still have the issue too.
>
> Very good points.
> >
> > So I'm also inclined to think this needs a reclaim/memcg-side fix.
> > We have a somewhat tumultuous history of policy in that space:
> >
> >     commit 7775face207922ea62a4e96b9cd45abfdc7b9840
> >     Author: Tetsuo Handa
> >     Date:   Tue Mar 5 15:46:47 2019 -0800
> >
> >         memcg: killed threads should not invoke memcg OOM killer
> >
> > allowed dying tasks to simply force all charges and move on. This
> > turned out to be too aggressive; there were instances of exiting,
> > uncontained memcg tasks causing global OOMs. This led to that:
> >
> >     commit a4ebf1b6ca1e011289677239a2a361fde4a88076
> >     Author: Vasily Averin
> >     Date:   Fri Nov 5 13:38:09 2021 -0700
> >
> >         memcg: prohibit unconditional exceeding the limit of dying tasks
> >
> > which reverted the bypass rather thoroughly. Now NO dying tasks, *not
> > even OOM victims*, can force charges. I am not sure this is correct,
> > either:
> >
> > If we return -ENOMEM to an OOM victim in a fault, the fault handler
> > will re-trigger OOM, which will find the existing OOM victim and do
> > nothing, then restart the fault. This is a memory deadlock. The page
> > allocator gives OOM victims access to reserves for that reason.
> >
> > Actually, it looks even worse. For some reason we're not triggering
> > OOM from dying tasks:
> >
> >         ret = task_is_dying() || out_of_memory(&oc);
> >
> > Even though dying tasks are in no way privileged or allowed to exit
> > expediently. Why shouldn't they trigger the OOM killer like anybody
> > else trying to allocate memory?
>
> This is a very good point and actually out_of_memory() will mark the
> dying process as oom victim and put it in the oom reaper's list which
> should help further in such a situation.
> >
> > As it stands, it seems we have dying tasks getting trapped in an
> > endless fault->reclaim cycle; with no access to the OOM killer and no
> > access to reserves. Presumably this is what's going on here?
> >
> > I think we want something like this:
>
> The following patch looks good to me.
> Let's test this out (hopefully Rik
> will be able to find a live impacted machine) and move forward with this
> fix.

I agree with this too. As Shakeel mentioned, this seemed like a
stopgap and not an actual fix for the underlying problem. Johannes
further outlined how the stopgap can be problematic.

Let's try to fix this on the memcg/reclaim/OOM side, and properly
treat dying tasks instead of forcing them into potentially super slow
reclaim paths. Hopefully Johannes's patch fixes this.
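
For anyone following along, the short-circuit in the line Johannes
quotes (ret = task_is_dying() || out_of_memory(&oc);) can be
illustrated with a tiny user-space model. This is only a sketch under
stated assumptions: struct oom_model and the model_* functions are
hypothetical stand-ins made up for illustration, not the kernel's
task_is_dying() or out_of_memory(), and the real code paths do far more.

```c
#include <stdbool.h>

/*
 * Hypothetical user-space model, NOT kernel code.
 * task_dying stands in for what task_is_dying() would report;
 * oom_invocations counts how often the modelled OOM killer ran.
 */
struct oom_model {
	bool task_dying;
	int oom_invocations;
};

/* Stand-in for task_is_dying(): just reports the modelled state. */
static bool model_task_is_dying(const struct oom_model *m)
{
	return m->task_dying;
}

/* Stand-in for out_of_memory(): records the call, claims it acted. */
static bool model_out_of_memory(struct oom_model *m)
{
	m->oom_invocations++;
	return true;
}

/*
 * Same shape as the quoted line: because || short-circuits, the
 * right-hand side is never evaluated when the task is dying.
 */
static bool model_memcg_oom(struct oom_model *m)
{
	return model_task_is_dying(m) || model_out_of_memory(m);
}
```

With task_dying set, model_memcg_oom() returns true while
oom_invocations stays at 0: the caller is told the OOM was handled
even though the modelled OOM killer never ran. With task_dying clear,
the counter goes to 1.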