Date: Thu, 12 Dec 2024 13:30:12 -0500
From: Johannes Weiner
To: Yosry Ahmed
Cc: Rik van Riel, Balbir Singh, Michal Hocko, Roman Gushchin,
 Shakeel Butt, Muchun Song, Andrew Morton, cgroups@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@meta.com,
 Nhat Pham
Subject: Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
Message-ID: <20241212183012.GB1026@cmpxchg.org>
References: <20241212115754.38f798b3@fangorn>

On Thu, Dec 12, 2024 at 09:06:25AM -0800, Yosry Ahmed wrote:
> On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel wrote:
> >
> > A task already in exit can get stuck trying to allocate pages, if its
> > cgroup is at the memory.max limit, the cgroup is using zswap, but
> > zswap writeback is disabled, and the remaining memory in the cgroup is
> > not compressible.
> >
> > This seems like an unlikely confluence of events, but it can happen
> > quite easily if a cgroup is OOM killed due to exceeding its memory.max
> > limit, and all the tasks in the cgroup are trying to exit simultaneously.
> >
> > When this happens, it can sometimes take hours for tasks to exit,
> > as they are all trying to squeeze things into zswap to bring the group's
> > memory consumption below memory.max.
> >
> > Allowing these exiting programs to push some memory from their own
> > cgroup into swap allows them to quickly bring the cgroup's memory
> > consumption below memory.max, and exit in seconds rather than hours.
> >
> > Signed-off-by: Rik van Riel
>
> Thanks for sending a v2.
>
> I still think maybe this needs to be fixed on the memcg side, at least
> by not making exiting tasks try really hard to reclaim memory to the
> point where this becomes a problem. IIUC there could be other reasons
> why reclaim may take too long, but maybe not as pathological as this
> case to be fair. I will let the memcg maintainers chime in for this.
>
> If there's a fundamental reason why this cannot be fixed on the memcg
> side, I don't object to this change.
>
> Nhat, any objections on your end? I think your fleet workloads were
> the first users of this interface. Does this break their expectations?

Yes, I don't think we can do this, unfortunately :( There can be a
variety of reasons for why a user might want to prohibit disk swap for
a certain cgroup, and we can't assume it's okay to make exceptions.
There might also not *be* any disk swap to overflow into after Nhat's
virtual swap patches. Presumably zram would still have the issue too.

So I'm also inclined to think this needs a reclaim/memcg-side fix. We
have a somewhat tumultuous history of policy in that space:

    commit 7775face207922ea62a4e96b9cd45abfdc7b9840
    Author: Tetsuo Handa
    Date:   Tue Mar 5 15:46:47 2019 -0800

        memcg: killed threads should not invoke memcg OOM killer

allowed dying tasks to simply force all charges and move on. This
turned out to be too aggressive; there were instances of exiting,
uncontained memcg tasks causing global OOMs. This led to that:

    commit a4ebf1b6ca1e011289677239a2a361fde4a88076
    Author: Vasily Averin
    Date:   Fri Nov 5 13:38:09 2021 -0700

        memcg: prohibit unconditional exceeding the limit of dying tasks

which reverted the bypass rather thoroughly. Now NO dying tasks, *not
even OOM victims*, can force charges. I am not sure this is correct,
either:

If we return -ENOMEM to an OOM victim in a fault, the fault handler
will re-trigger OOM, which will find the existing OOM victim and do
nothing, then restart the fault. This is a memory deadlock. The page
allocator gives OOM victims access to reserves for that reason.

Actually, it looks even worse. For some reason we're not triggering
OOM from dying tasks:

        ret = task_is_dying() || out_of_memory(&oc);

Even though dying tasks are in no way privileged or allowed to exit
expediently. Why shouldn't they trigger the OOM killer like anybody
else trying to allocate memory?

As it stands, it seems we have dying tasks getting trapped in an
endless fault->reclaim cycle; with no access to the OOM killer and no
access to reserves. Presumably this is what's going on here?

I think we want something like this:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 53db98d2c4a1..be6b6e72bde5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1596,11 +1596,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (mem_cgroup_margin(memcg) >= (1 << order))
 		goto unlock;
 
-	/*
-	 * A few threads which were not waiting at mutex_lock_killable() can
-	 * fail to bail out. Therefore, check again after holding oom_lock.
-	 */
-	ret = task_is_dying() || out_of_memory(&oc);
+	ret = out_of_memory(&oc);
 
 unlock:
 	mutex_unlock(&oom_lock);
@@ -2198,6 +2194,9 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (unlikely(current->flags & PF_MEMALLOC))
 		goto force;
 
+	if (unlikely(tsk_is_oom_victim(current)))
+		goto force;
+
 	if (unlikely(task_in_memcg_oom(current)))
 		goto nomem;
 
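
To make the intended change in behavior easier to see, here is a rough
userspace sketch of the charge/fault retry cycle under the two policies.
It is a toy model, not kernel code: every name in it (toy_task,
toy_policy, toy_try_charge, toy_fault_loop) is hypothetical, reclaim is
assumed to never make progress, and the "OOM killer" is assumed to
simply mark the calling task as the victim.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; none of these names are kernel APIs. */
struct toy_task {
	bool dying;      /* stands in for task_is_dying() */
	bool oom_victim; /* stands in for tsk_is_oom_victim(current) */
};

enum toy_policy {
	TOY_CURRENT,  /* dying tasks skip the OOM killer, nobody may force */
	TOY_PROPOSED, /* OOM victims force the charge, dying tasks may OOM */
};

/*
 * One charge attempt against a cgroup that sits at its limit.
 * Assumption: reclaim never frees anything in this model.
 * Returns true if the charge goes through.
 */
static bool toy_try_charge(struct toy_task *t, enum toy_policy p)
{
	/* Proposed: an OOM victim bypasses the limit ("goto force"). */
	if (p == TOY_PROPOSED && t->oom_victim)
		return true;

	/* Current: a dying task never reaches the OOM killer. */
	if (p == TOY_CURRENT && t->dying)
		return false; /* charge fails, the fault gets restarted */

	/*
	 * The OOM killer runs. Assumption of this model: it picks the
	 * calling task as the victim. The charge itself still fails.
	 */
	t->oom_victim = true;
	t->dying = true;
	return false;
}

/*
 * Restart the fault until the charge succeeds or max_faults is hit.
 * Returns the number of attempts used, or -1 if it never succeeded.
 */
static int toy_fault_loop(struct toy_task t, enum toy_policy p, int max_faults)
{
	assert(max_faults > 0);

	for (int i = 1; i <= max_faults; i++) {
		if (toy_try_charge(&t, p))
			return i;
	}
	return -1;
}
```

In this model, an exiting task under TOY_CURRENT never gets its charge
through no matter how often the fault is restarted, while under
TOY_PROPOSED it first triggers the OOM killer and then, as the victim,
forces the charge on the next attempt. Whether the real code paths line
up this neatly would need to be checked against mm/memcontrol.c and
mm/oom_kill.c.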