From: Yosry Ahmed
Date: Wed, 11 Dec 2024 08:26:15 -0800
Subject: Re: [PATCH] memcg: allow exiting tasks to write back data to swap
To: Rik van Riel
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@meta.com, Nhat Pham
References: <20241211105336.380cb545@fangorn>
In-Reply-To: <20241211105336.380cb545@fangorn>

On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel wrote:
>
> A task already in exit can get stuck trying to allocate pages, if its
> cgroup is at the memory.max limit, the cgroup is using zswap, but
> zswap writeback is disabled, and the remaining memory in the cgroup is
> not compressible.
>
> This seems like an unlikely confluence of events, but it can happen
> quite easily if a cgroup is OOM killed due to exceeding its memory.max
> limit, and all the tasks in the cgroup are trying to exit simultaneously.
>
> When this happens, it can sometimes take hours for tasks to exit,
> as they are all trying to squeeze things into zswap to bring the group's
> memory consumption below memory.max.
>
> Allowing these exiting programs to push some memory from their own
> cgroup into swap allows them to quickly bring the cgroup's memory
> consumption below memory.max, and exit in seconds rather than hours.
>
> Loading this fix as a live patch on a system where a workload got stuck
> exiting allowed the workload to exit within a fraction of a second.
>
> Signed-off-by: Rik van Riel
> ---
>  mm/memcontrol.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b3503d12aaf..03d77e93087e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5371,6 +5371,15 @@ bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
>  	if (!zswap_is_enabled())
>  		return true;
>
> +	/*
> +	 * Always allow exiting tasks to push data to swap. A process in
> +	 * the middle of exit cannot get OOM killed, but may need to push
> +	 * uncompressible data to swap in order to get the cgroup memory
> +	 * use below the limit, and make progress with the exit.
> +	 */
> +	if ((current->flags & PF_EXITING) && memcg == mem_cgroup_from_task(current))
> +		return true;
> +

I have a few questions:

(a) If the task is being OOM killed it should be able to charge memory
beyond memory.max, so why do we need to get the usage down below the
limit?

Looking at the other thread with Michal, it looks like it's because we
have to go into reclaim first before we get to the point of force
charging for dying tasks, and we spend too much time in reclaim. Is
that correct?

If that's the case, I am wondering if the real problem is that we
check mem_cgroup_zswap_writeback_enabled() too late in the process.
Reclaim ages the LRUs, isolates pages, unmaps them, allocates swap
entries, only to realize it cannot swap in swap_writepage().
Should we check for this in can_reclaim_anon_pages()? If zswap
writeback is disabled and we are already at the memcg limit (or zswap
limit for that matter), we should avoid scanning anon memory to begin
with. The problem is that if we race with memory being freed we may
have some extra OOM kills, but I am not sure how common this case
would be.

(b) Should we use mem_cgroup_is_descendant() or mm_match_cgroup() in
case we are reclaiming from an ancestor and we hit the limit of that
ancestor?

(c) mem_cgroup_from_task() should be called in an RCU read-side
critical section (or we need something like rcu_access_pointer() if we
are not dereferencing the pointer).

>  	for (; memcg; memcg = parent_mem_cgroup(memcg))
>  		if (!READ_ONCE(memcg->zswap_writeback))
>  			return false;
> --
> 2.47.0
>
>
>
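
To make point (b) concrete, here is a rough userspace model of the
check, not kernel code. It assumes the exact-match test from the patch
is swapped for a descendant test, so that an exiting task is also
exempted when reclaim targets an ancestor of its cgroup. The structs
and the model_* names are hypothetical stand-ins, the
zswap_is_enabled() early return is left out, and RCU (point (c)) is
not modeled at all.

```c
/* Userspace model only: simplified stand-ins, not the kernel's types. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct memcg_model {
	struct memcg_model *parent;	/* NULL for the root cgroup */
	bool zswap_writeback;		/* models memcg->zswap_writeback */
};

struct task_model {
	bool exiting;			/* models current->flags & PF_EXITING */
	struct memcg_model *memcg;	/* models mem_cgroup_from_task(current) */
};

/* True if memcg is root itself or sits anywhere below root. */
bool model_is_descendant(const struct memcg_model *memcg,
			 const struct memcg_model *root)
{
	for (; memcg; memcg = memcg->parent)
		if (memcg == root)
			return true;
	return false;
}

/*
 * Models the patched check, with the exact-match test replaced by
 * the descendant test raised in (b). This is an assumption about
 * how the suggestion could look, not the posted patch.
 */
bool model_writeback_enabled(const struct memcg_model *memcg,
			     const struct task_model *task)
{
	assert(task != NULL);

	/* Exiting task whose cgroup is memcg or below it: allow swap. */
	if (task->exiting && memcg && model_is_descendant(task->memcg, memcg))
		return true;

	/* Otherwise any level with writeback disabled vetoes it. */
	for (; memcg; memcg = memcg->parent)
		if (!memcg->zswap_writeback)
			return false;
	return true;
}
```

With a parent that has writeback disabled and a child that does not,
a running task in the child is refused, while an exiting task in the
child is allowed whether reclaim targets the child or the parent. An
exiting task gets no exemption for a sibling cgroup it does not
belong to.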