From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20211111234203.1824138-1-almasrymina@google.com>
 <20211111234203.1824138-3-almasrymina@google.com>
From: Mina Almasry
Date: Tue, 16 Nov 2021 02:17:09 -0800
Subject: Re: [PATCH v3 2/4] mm/oom: handle remote ooms
To: Michal Hocko
Cc: "Theodore Ts'o", Greg Thelen, Shakeel Butt, Andrew Morton,
 Hugh Dickins, Roman Gushchin, Johannes Weiner, Tejun Heo,
 Vladimir Davydov, Muchun Song, riel@surriel.com, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"

On Tue, Nov 16, 2021 at 1:28 AM Michal Hocko wrote:
>
> On Mon 15-11-21 16:58:19, Mina Almasry wrote:
> > On Mon, Nov 15, 2021 at 2:58 AM Michal Hocko wrote:
> > >
> > > On Fri 12-11-21 09:59:22, Mina Almasry wrote:
> > > > On Fri, Nov 12, 2021 at 12:36 AM Michal Hocko wrote:
> > > > >
> > > > > On Fri 12-11-21 00:12:52, Mina Almasry wrote:
> > > > > > On Thu, Nov 11, 2021 at 11:52 PM Michal Hocko wrote:
> > > > > > >
> > > > > > > On Thu 11-11-21 15:42:01, Mina Almasry wrote:
> > > > > > > > On remote ooms (OOMs due to remote charging), the oom-killer will attempt
> > > > > > > > to find a task to kill in the memcg under oom, if the oom-killer
> > > > > > > > is unable to find one, the oom-killer should simply return ENOMEM to the
> > > > > > > > allocating process.
> > > > > > >
> > > > > > > This really begs for some justification.
> > > > > > >
> > > > > >
> > > > > > I'm thinking (and I can add to the commit message in v4) that we have
> > > > > > 2 reasonable options when the oom-killer gets invoked and finds
> > > > > > nothing to kill: (1) return ENOMEM, (2) kill the allocating task. I'm
> > > > > > thinking returning ENOMEM allows the application to gracefully handle
> > > > > > the failure to remote charge and continue operation.
> > > > > >
> > > > > > For example, in the network service use case that I mentioned in the
> > > > > > RFC proposal, it's beneficial for the network service to get an ENOMEM
> > > > > > and continue to service network requests for other clients running on
> > > > > > the machine, rather than get oom-killed when hitting the remote memcg
> > > > > > limit. But, this is not a hard requirement, the network service could
> > > > > > fork a process that does the remote charging to guard against the
> > > > > > remote charge bringing down the entire process.
> > > > >
> > > > > This all belongs to the changelog so that we can discuss all potential
> > > > > implication and do not rely on any implicit assumptions.
> > > >
> > > > Understood. Maybe I'll wait to collect more feedback and upload v4
> > > > with a thorough explanation of the thought process.
> > > >
> > > > > E.g. why does
> > > > > it even make sense to kill a task in the origin cgroup?
> > > > >
> > > >
> > > > The behavior I saw returning ENOMEM for this edge case was that the
> > > > code was forever looping the pagefault, and I was (seemingly
> > > > incorrectly) under the impression that a suggestion to forever loop
> > > > the pagefault would be completely fundamentally unacceptable.
> > >
> > > Well, I have to say I am not entirely sure what is the best way to
> > > handle this situation. Another option would be to treat this similar to
> > > ENOSPACE situation. This would result into SIGBUS IIRC.
> > >
> > > The main problem with OOM killer is that it will not resolve the
> > > underlying problem in most situations. Shmem files would likely stay
> > > laying around and their charge along with them. Killing the allocating
> > > task has problems on its own because this could be just a DoS vector by
> > > other unrelated tasks sharing the shmem mount point without a graceful
> > > fallback. Retrying the page fault is hard to detect. SIGBUS might be
> > > something that helps with the latest.
> > > The question is how to communicate
> > > this requirement down to the memcg code to know that the memory reclaim
> > > should happen (Should it? How hard we should try?) but do not invoke the
> > > oom killer. The more I think about this the nastier this is.
> >
> > So actually I thought the ENOSPC suggestion was interesting so I took
> > the liberty to prototype it. The changes required:
> >
> > 1. In out_of_memory() we return false if !oc->chosen &&
> > is_remote_oom(). This gets bubbled up to try_charge_memcg() as
> > mem_cgroup_oom() returning OOM_FAILED.
> > 2. In try_charge_memcg(), if we get an OOM_FAILED we again check
> > is_remote_oom(), if it is a remote oom, return ENOSPC.
> > 3. The calling code would return ENOSPC to the user in the no-fault
> > path, and SIGBUS the user in the fault path with no changes.
>
> I think this should be implemented at the caller side rather than
> somehow hacked into the memcg core. It is the caller to know what to do.
> The caller can use gfp flags to control the reclaim behavior.
>

Hmm, I'm struggling a bit to envision this. Would it be acceptable to
handle this at the call sites where we're doing a remote charge, such
as shmem_add_to_page_cache()? That is: if we get ENOMEM from
mem_cgroup_charge(), and we know we're doing a remote charge (because
current's memcg != the super block memcg), then we return ENOSPC from
shmem_add_to_page_cache(). I believe that will return ENOSPC to
userspace in the non-pagefault path and SIGBUS in the pagefault path.
Or did you have something else in mind?

> > To be honest I think this is very workable, as is Shakeel's suggestion
> > of MEMCG_OOM_NO_VICTIM. Since this is an opt-in feature, we can
> > document the behavior and if the userspace doesn't want to get killed
> > they can catch the sigbus and handle it gracefully. If not, the
> > userspace just gets killed if we hit this edge case.
>
> I am not sure about the MEMCG_OOM_NO_VICTIM approach. It sounds really
> hackish to me.
> I will get back to Shakeel's email as time permits. The
> primary problem I have with this, though, is that the kernel oom killer
> cannot really do anything sensible if the limit is reached and there
> is nothing reclaimable left in this case. The tmpfs backed memory will
> simply stay around and there are no means to recover without userspace
> intervention.
> --
> Michal Hocko
> SUSE Labs

On Tue, Nov 16, 2021 at 1:39 AM Michal Hocko wrote:
>
> On Tue 16-11-21 10:28:25, Michal Hocko wrote:
> > On Mon 15-11-21 16:58:19, Mina Almasry wrote:
> [...]
> > > To be honest I think this is very workable, as is Shakeel's suggestion
> > > of MEMCG_OOM_NO_VICTIM. Since this is an opt-in feature, we can
> > > document the behavior and if the userspace doesn't want to get killed
> > > they can catch the sigbus and handle it gracefully. If not, the
> > > userspace just gets killed if we hit this edge case.
> >
> > I am not sure about the MEMCG_OOM_NO_VICTIM approach. It sounds really
> > hackish to me. I will get back to Shakeel's email as time permits. The
> > primary problem I have with this, though, is that the kernel oom killer
> > cannot really do anything sensible if the limit is reached and there
> > is nothing reclaimable left in this case. The tmpfs backed memory will
> > simply stay around and there are no means to recover without userspace
> > intervention.
>
> And just a small clarification. Tmpfs is fundamentally problematic from
> the OOM handling POV. The nuance here is that the OOM happens in a
> different memcg and thus a different resource domain. If you kill a task
> in the target memcg then you effectively DoS that workload. If you kill
> the allocating task then it is DoSed by anybody allowed to write to that
> shmem. All that without a graceful fallback.

I don't know if this addresses your concern, but I'm limiting use of
memcg= to processes that can enter that memcg. They would therefore be
able to allocate memory in that memcg anyway by entering it, so if they
wanted to intentionally DoS that memcg they could already do it without
this feature.
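
To make the caller-side question above concrete, here is a minimal
userspace model of the errno mapping being discussed for
shmem_add_to_page_cache(). It is only a sketch under stated assumptions,
not kernel code: struct memcg_model, struct charge_ctx,
is_remote_charge() and shmem_add_model() are hypothetical names invented
for illustration, and the charge result is passed in as a parameter
instead of coming from a real mem_cgroup_charge() call.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg; not the kernel's struct mem_cgroup. */
struct memcg_model {
	int id;
};

/* Hypothetical context for one charge attempt at the call site. */
struct charge_ctx {
	const struct memcg_model *current_memcg; /* memcg of the allocating task */
	const struct memcg_model *sb_memcg;      /* memcg named by memcg=, or NULL */
};

/* A charge is "remote" when the mount names a memcg other than current's. */
static bool is_remote_charge(const struct charge_ctx *ctx)
{
	return ctx->sb_memcg != NULL && ctx->sb_memcg != ctx->current_memcg;
}

/*
 * Model of the proposed call-site behavior. charge_err is the result of
 * the (modelled) charge: 0 on success or a negative errno. A failed
 * remote charge is reported as -ENOSPC; everything else passes through.
 */
static int shmem_add_model(const struct charge_ctx *ctx, int charge_err)
{
	if (charge_err == -ENOMEM && is_remote_charge(ctx))
		return -ENOSPC;
	return charge_err;
}
```

In this model a local charge failure still surfaces as -ENOMEM, so only
mounts that opted in via memcg= would ever see the ENOSPC/SIGBUS
behavior.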