Date: Thu, 11 Jan 2024 14:28:07 -0500
From: Johannes Weiner <hannes@cmpxchg.org>
To: Roman Gushchin
Cc: Andrew Morton, Michal Hocko, Shakeel Butt, Muchun Song, Tejun Heo,
    Dan Schatzberg, cgroups@vger.kernel.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: memcontrol: don't throttle dying tasks on memory.high
Message-ID: <20240111192807.GA424308@cmpxchg.org>
References: <20240111132902.389862-1-hannes@cmpxchg.org>
In-Reply-To:

On Thu, Jan 11, 2024 at 09:59:11AM -0800, Roman Gushchin wrote:
> On Thu, Jan 11, 2024 at 08:29:02AM -0500, Johannes Weiner wrote:
> > While investigating hosts with high cgroup memory pressures, Tejun
> > found culprit zombie tasks that were holding on to a lot of memory,
> > had SIGKILL pending, but were stuck in memory.high reclaim.
> >
> > In the past, we used to always force-charge allocations from tasks
> > that were exiting in order to accelerate them dying and freeing up
> > their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg:
> > prohibit unconditional exceeding the limit of dying tasks"); it noted
> > that this can cause (userspace inducible) containment failures, so it
> > added a mandatory reclaim and OOM kill cycle before forcing charges.
> > At the time, memory.high enforcement was handled in the userspace
> > return path, which isn't reached by dying tasks, and so memory.high
> > was still never enforced by dying tasks.
> >
> > When c9afe31ec443 ("memcg: synchronously enforce memory.high for large
> > overcharges") added synchronous reclaim for memory.high, it added
> > unconditional memory.high enforcement for dying tasks as well. The
> > callstack shows that this is the path where the zombie is stuck.
> >
> > We need to accelerate dying tasks getting past memory.high, but we
> > cannot do it quite the same way as we do for memory.max: memory.max is
> > enforced strictly, and tasks aren't allowed to move past it without
> > FIRST reclaiming and OOM killing if necessary. This ensures very small
> > levels of excess. With memory.high, though, enforcement happens lazily
> > after the charge, and OOM killing is never triggered. A lot of
> > concurrent threads could have pushed, or could actively be pushing,
> > the cgroup into excess. The dying task will enter reclaim on every
> > allocation attempt, with little hope of restoring balance.
> >
> > To fix this, skip synchronous memory.high enforcement on dying tasks
> > altogether again. Update memory.high path documentation while at it.
>
> It makes total sense to me.
> Acked-by: Roman Gushchin

Thanks

> However if tasks can get stuck for a long time in the "high reclaim"
> state, shouldn't we also handle the case when tasks are being killed
> during the reclaim? E.g. something like this (completely untested):

Yes, that's probably a good idea.
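(For context: task_is_dying() is an existing helper in mm/memcontrol.c.
Paraphrasing from memory, and modulo kernel version details, it boils
down to something like the sketch below - i.e. "dying" covers OOM
victims, tasks with a fatal signal pending, and tasks already exiting:)

	/* mm/memcontrol.c -- rough paraphrase, not a verbatim quote */
	static bool task_is_dying(void)
	{
		return tsk_is_oom_victim(current) ||
			fatal_signal_pending(current) ||
			(current->flags & PF_EXITING);
	}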
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c4c422c81f93..9f971fc6aae8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2465,6 +2465,9 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
>  		    READ_ONCE(memcg->memory.high))
>  			continue;
>  
> +		if (task_is_dying())
> +			break;
> +
>  		memcg_memory_event(memcg, MEMCG_HIGH);
>  
>  		psi_memstall_enter(&pflags);

I think we can skip this one. The loop walks up from the charging
cgroup to the ancestor that has memory.high set and breached, and then
reclaims it. It's not expected to run multiple reclaims.

> @@ -2645,6 +2648,9 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
>  	current->memcg_nr_pages_over_high = 0;
>  
>  retry_reclaim:
> +	if (task_is_dying())
> +		return;
> +
>  	/*
>  	 * The allocating task should reclaim at least the batch size, but for
>  	 * subsequent retries we only want to do what's necessary to prevent oom

Yeah, this is the better place for this check. How about this?

---
From 6124a13cb073f5ff06b9c1309505bc937d65d6e5 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 11 Jan 2024 07:18:47 -0500
Subject: [PATCH] mm: memcontrol: don't throttle dying tasks on memory.high

While investigating hosts with high cgroup memory pressures, Tejun
found culprit zombie tasks that were holding on to a lot of memory,
had SIGKILL pending, but were stuck in memory.high reclaim.

In the past, we used to always force-charge allocations from tasks
that were exiting in order to accelerate them dying and freeing up
their rss. This changed for memory.max in a4ebf1b6ca1e ("memcg:
prohibit unconditional exceeding the limit of dying tasks"); it noted
that this can cause (userspace inducible) containment failures, so it
added a mandatory reclaim and OOM kill cycle before forcing charges.
At the time, memory.high enforcement was handled in the userspace
return path, which isn't reached by dying tasks, and so memory.high
was still never enforced by dying tasks.

When c9afe31ec443 ("memcg: synchronously enforce memory.high for large
overcharges") added synchronous reclaim for memory.high, it added
unconditional memory.high enforcement for dying tasks as well. The
callstack shows that this is the path where the zombie is stuck.

We need to accelerate dying tasks getting past memory.high, but we
cannot do it quite the same way as we do for memory.max: memory.max is
enforced strictly, and tasks aren't allowed to move past it without
FIRST reclaiming and OOM killing if necessary. This ensures very small
levels of excess. With memory.high, though, enforcement happens lazily
after the charge, and OOM killing is never triggered. A lot of
concurrent threads could have pushed, or could actively be pushing,
the cgroup into excess. The dying task will enter reclaim on every
allocation attempt, with little hope of restoring balance.

To fix this, skip synchronous memory.high enforcement on dying tasks
altogether again. Update memory.high path documentation while at it.
Fixes: c9afe31ec443 ("memcg: synchronously enforce memory.high for large overcharges")
Reported-by: Tejun Heo
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 73692cd8c142..7be7a2f4e536 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2603,8 +2603,9 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
 }
 
 /*
- * Scheduled by try_charge() to be executed from the userland return path
- * and reclaims memory over the high limit.
+ * Reclaims memory over the high limit. Called directly from
+ * try_charge() (context permitting), as well as from the userland
+ * return path where reclaim is always able to block.
  */
 void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 {
@@ -2623,6 +2624,17 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	current->memcg_nr_pages_over_high = 0;
 
 retry_reclaim:
+	/*
+	 * Bail if the task is already exiting. Unlike memory.max,
+	 * memory.high enforcement isn't as strict, and there is no
+	 * OOM killer involved, which means the excess could already
+	 * be much bigger (and still growing) than it could for
+	 * memory.max; the dying task could get stuck in fruitless
+	 * reclaim for a long time, which isn't desirable.
+	 */
+	if (task_is_dying())
+		goto out;
+
 	/*
 	 * The allocating task should reclaim at least the batch size, but for
 	 * subsequent retries we only want to do what's necessary to prevent oom
@@ -2673,6 +2685,9 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	}
 
 	/*
+	 * Reclaim didn't manage to push usage below the limit, slow
+	 * this allocating task down.
+	 *
 	 * If we exit early, we're guaranteed to die (since
 	 * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
 	 * need to account for any ill-begotten jiffies to pay them off later.
@@ -2867,11 +2882,17 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		}
 	} while ((memcg = parent_mem_cgroup(memcg)));
 
+	/*
+	 * Reclaim is set up above to be called from the userland
+	 * return path. But also attempt synchronous reclaim to avoid
+	 * excessive overrun while the task is still inside the
+	 * kernel. If this is successful, the return path will see it
+	 * when it rechecks the overage and simply bail out.
+	 */
 	if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
 	    !(current->flags & PF_MEMALLOC) &&
-	    gfpflags_allow_blocking(gfp_mask)) {
+	    gfpflags_allow_blocking(gfp_mask))
 		mem_cgroup_handle_over_high(gfp_mask);
-	}
 
 	return 0;
 }
-- 
2.43.0
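(One more piece of context for readers following along: the "userland
return path" the comments above refer to is the arch-independent
resume_user_mode_work() hook, which runs just before a task reenters
userspace. The abridged sketch below is paraphrased and the exact
contents vary by kernel version, but it shows where the lazy
memory.high reclaim is picked up and why it can always block there:)

	/* include/linux/resume_user_mode.h -- abridged sketch */
	static inline void resume_user_mode_work(struct pt_regs *regs)
	{
		clear_thread_flag(TIF_NOTIFY_RESUME);
		smp_mb__after_atomic();

		if (unlikely(task_work_pending(current)))
			task_work_run();

		/* Lazy memory.high enforcement; reclaim may block here. */
		mem_cgroup_handle_over_high(GFP_KERNEL);
		blkcg_maybe_throttle_current();

		rseq_handle_notify_resume(NULL, regs);
	}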