Date: Tue, 16 Feb 2021 09:12:33 +0100
From: Michal Hocko
To: Eiichi Tsukata
Cc: corbet@lwn.net, mike.kravetz@oracle.com, mcgrof@kernel.org,
	keescook@chromium.org, yzaikin@google.com, akpm@linux-foundation.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	felipe.franciosi@nutanix.com
Subject: Re: [RFC PATCH] mm, oom: introduce vm.sacrifice_hugepage_on_oom
References: <20210216030713.79101-1-eiichi.tsukata@nutanix.com>
In-Reply-To: <20210216030713.79101-1-eiichi.tsukata@nutanix.com>

On Tue 16-02-21 03:07:13, Eiichi Tsukata wrote:
> Hugepages can be preallocated to avoid unpredictable allocation latency.
> If we run into a 4k page shortage, the kernel can trigger OOM even though
> there are free hugepages.
> When OOM is triggered by the user address page
> fault handler, we can use an oom notifier to free hugepages in user
> space, but if it is triggered by a memory allocation for the kernel,
> there is no way to handle it synchronously in user space.

Can you expand some more on what kind of problem you see? Hugetlb pages
are, by definition, a preallocated, unreclaimable and admin-controlled
pool of pages. Under those conditions it is expected and required that
the sizing be done very carefully. Why is that a problem in your
particular setup/scenario?

If the sizing is really done properly and a random process can then
trigger OOM, this can lead to malfunctioning of those workloads which
do depend on the hugetlb pool, right? So isn't this a kind of DoS
scenario?

> This patch introduces a new sysctl vm.sacrifice_hugepage_on_oom. If
> enabled, it first tries to free a hugepage if available before invoking
> the oom-killer. The default value is disabled so as not to change the
> current behavior.

Why is this interface not hugepage size aware? It is quite different to
release a GB huge page or a 2MB one. Or is it expected to release the
smallest one?

To the implementation...

[...]
> +static int sacrifice_hugepage(void)
> +{
> +	int ret;
> +
> +	spin_lock(&hugetlb_lock);
> +	ret = free_pool_huge_page(&default_hstate, &node_states[N_MEMORY], 0);

... no, it is going to release the default huge page. This will be 2MB
in most cases, but that is not given.

Unless I am mistaken, this will also free up reserved hugetlb pages.
That would mean that a page fault could SIGBUS, which is very likely not
something we want, right? You also want to use the oom nodemask rather
than a full one.

Overall, I am not really happy about this feature even when the above is
fixed, but let's hear more about the actual problem first.

-- 
Michal Hocko
SUSE Labs
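The reservation concern raised above can be sketched with a toy model (plain Python, not kernel code; the class and method names are made up for illustration). Hugetlb reservations exist so that a page fault on an already-mmapped region is guaranteed to succeed; a sacrifice path that frees any pool page without checking reservations can break that guarantee, which is the SIGBUS scenario described in the reply:

```python
# Toy model of a hugetlb pool with reservations. Illustrative only:
# none of these names are actual kernel symbols.

class HugetlbPool:
    def __init__(self, nr_pages):
        self.free = nr_pages   # pages sitting in the preallocated pool
        self.reserved = 0      # pages already promised to existing mappings

    def reserve(self, n):
        # mmap() of a hugetlb region reserves its pages up front, so that
        # later faults cannot fail.
        if self.free < self.reserved + n:
            raise MemoryError("ENOMEM at mmap time")
        self.reserved += n

    def fault(self):
        # A page fault consumes one reserved page; the reservation is
        # supposed to make this infallible.
        assert self.reserved > 0 and self.free > 0, "SIGBUS"
        self.reserved -= 1
        self.free -= 1

    def sacrifice_unchecked(self):
        # What an OOM-path "free one pool page" call does if it ignores
        # reservations: give a page back regardless of promises made.
        if self.free > 0:
            self.free -= 1

pool = HugetlbPool(nr_pages=2)
pool.reserve(2)             # both pages promised to a mapping
pool.sacrifice_unchecked()  # OOM path releases one page anyway
got_sigbus = False
try:
    pool.fault()            # first fault still succeeds
    pool.fault()            # second fault: reservation broken
except AssertionError:
    got_sigbus = True
print(got_sigbus)  # True: a guaranteed fault failed
```

A reservation-aware version would only release surplus pages (free minus reserved), which is the distinction the review is pointing at.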