Date: Tue, 14 Jan 2025 20:42:18 +0100
From: Michal Hocko
To: Johannes Weiner
Cc: Rik van Riel, Yosry Ahmed, Balbir Singh, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton,
	cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, kernel-team@meta.com, Nhat Pham
Subject: Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
References: <20241212183012.GB1026@cmpxchg.org>
	<20250114160955.GA1115056@cmpxchg.org>
	<193d98b0d5d2b14da1b96953fcb5d91b2a35bf21.camel@surriel.com>
	<20250114192322.GB1115056@cmpxchg.org>
In-Reply-To: <20250114192322.GB1115056@cmpxchg.org>

On Tue 14-01-25 14:23:22, Johannes Weiner wrote:
> On Tue, Jan 14, 2025 at 07:13:07PM +0100, Michal Hocko wrote:
> > Anyway, have you tried to reproduce with
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 7b3503d12aaf..9c30c442e3b0 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1627,7 +1627,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	 * A few threads which were not waiting at mutex_lock_killable() can
> >  	 * fail to bail out. Therefore, check again after holding oom_lock.
> >  	 */
> > -	ret = task_is_dying() || out_of_memory(&oc);
> > +	ret = out_of_memory(&oc);
> >  
> >  unlock:
> >  	mutex_unlock(&oom_lock);
> > 
> > proposed by Johannes earlier? This should help to trigger the oom reaper
> > to free up some memory.
> 
> Yes, I was wondering about that too.
> 
> If the OOM reaper can be our reliable way of forward progress, we
> don't need any reserve or headroom beyond memory.max.
> 
> IIRC it can fail if somebody is holding mmap_sem for writing. The exit
> path at some point takes that, but also around the time it frees up
> all its memory voluntarily, so that should be fine. Are you aware of
> other scenarios where it can fail?

Setting MMF_OOM_SKIP is the final moment when oom reaper can act. This
is after exit_mm_release which releases futex. Also get_user callers
shouldn't be holding exclusive mmap_lock as that would deadlock when PF
path takes the read lock, right?

> What if everything has been swapped out already and there is nothing
> to reap? IOW, only unreclaimable/kernel memory remaining in the group.

Yes, this is possible. It is also possible that the oom victim depletes
oom reserves globally and fails the allocation, resulting in the same
problem. Reserves do buy some time but do not solve the underlying
issue.

> It still seems to me that allowing the OOM victim (and only the OOM
> victim) to bypass memory.max is the only guarantee to progress.
> 
> I'm not really concerned about side effects. Any runaway allocation in
> the exit path (like the vmalloc one you referenced before) is a much
> bigger concern for exceeding the physical OOM reserves in the page
> allocator. What's a containment failure for cgroups would be a memory
> deadlock at the system level. It's a class of kernel bug that needs
> fixing, not something we can really work around in the cgroup code.

I do agree that a memory deadlock is not really a proper way to deal
with the issue.
I have to admit that my understanding was based on ENOMEM being properly
propagated out of in-kernel user page faults. It seems I was wrong about
that. On the other hand, wouldn't that be a proper way to deal with the
issue? Relying on allocations never failing is quite fragile.
-- 
Michal Hocko
SUSE Labs