From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 20 Jul 2022 14:26:51 +0200
From: Michal Hocko
To: Mina Almasry
Cc: Yosry Ahmed, Roman Gushchin, Yafang Shao, Alexei Starovoitov,
 Shakeel Butt, Matthew Wilcox, Christoph Hellwig, "David S. Miller",
 Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Martin KaFai Lau, bpf,
 Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg, David Rientjes,
 Joonsoo Kim, Andrew Morton, Vlastimil Babka
Subject: Re: cgroup specific sticky resources (was: Re: [PATCH bpf-next
 0/5] bpf: BPF specific memory allocator.)
References: <20220712043914.pxmbm7vockuvpmmh@macbook-pro-3.dhcp.thefacebook.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-linux-mm@kvack.org
Precedence: bulk

On Tue 19-07-22 11:46:41, Mina Almasry wrote:
[...]
> An interface like cgroup.sticky.[bpf/tmpfs/..] would work for us
> similar to tmpfs memcg= mount option. I would maybe rename it to
> cgroup.charge_for.[bpf/tmpfs/etc] or something.
>
> With regards to OOM, my proposal on this patchset is to return ENOSPC
> to the caller if we hit the limit of the remote memcg and there is
> nothing to kill:
> https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/

That would imply SIGBUS on the #PF path. Is this really how we want to
tell userspace that something it is not aware of, such as a limit in a
completely different resource domain, has triggered?

> There is some precedent to doing this in the kernel. If a hugetlb
> allocation hits the hugetlb_cgroup limit, we return ENOSPC to the
> caller (and SIGBUS in the charge path). The reason there being that we
> don't support oom-kill or reclaim or swap for hugetlb pages.

Following hugetlb is not really a great idea because hugetlb has always
been quite special and its users are aware of that. The same doesn't
really apply to other resources like tmpfs.

> I think it is also reasonable to prevent removing the memcg if there
> is cgroup.charge_for.[bpf/tmpfs/etc] still alive. Currently we prevent
> removing the memcg if there are tasks attached.
> So we can also prevent removing the memcg if there are bpf/tmpfs
> charge sources pending.

I can imagine some way of keeping cgroups active even without tasks, but
so far I haven't really seen a good way to achieve that.

The cgroup.sticky.[bpf/tmpfs/..] interface is really weird if you ask
me. For one thing, I have a hard time imagining how to identify those
resources. Identifying tmpfs by path is really strange because the same
mount point can be referenced through many paths. Not to mention that
the path can be remounted/redirected to anything after the
configuration, which would just lead to a lot of confusion. Exposing
internal ids is also far from great. It would also put an additional
burden on the kernel implementation to ensure there is no overlap in
resources among different cgroups. Also, how many of those sticky
resources do we want to grow?

To me this has way too many red flags; it sounds like an interface
which would break really easily. The more I think about it, the more I
agree with Tejun that corner cases are just waiting to jump out at us.
-- 
Michal Hocko
SUSE Labs
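
The path-identification concern above can be illustrated from userspace. The following is only a minimal sketch, not anything proposed in the thread: it uses a symlink as an unprivileged stand-in for a bind mount or remount, and the helper names `resource_key` and `demo` are hypothetical. It shows that two different path strings can name the same underlying object, and that a path recorded at configuration time can later point somewhere else.

```python
import os
import tempfile


def resource_key(path):
    """Identify a directory by (device, inode) instead of by its path string."""
    st = os.stat(path)  # os.stat follows symlinks
    return (st.st_dev, st.st_ino)


def demo():
    """Return (same_before, same_after) for an aliased and then redirected path."""
    with tempfile.TemporaryDirectory() as base:
        real = os.path.join(base, "mnt")
        alias = os.path.join(base, "alias")
        os.mkdir(real)
        # Assumption: a symlink stands in for a bind mount (which needs root).
        os.symlink(real, alias)
        same_before = resource_key(real) == resource_key(alias)

        # Redirect the alias after "configuration": the path string is
        # unchanged, but it now refers to a different object.
        other = os.path.join(base, "other")
        os.mkdir(other)
        os.remove(alias)
        os.symlink(other, alias)
        same_after = resource_key(real) == resource_key(alias)
        return same_before, same_after
```

A configuration keyed on the string `alias` would silently start referring to `other` after the redirect, which is the kind of confusion described above; keying on (device, inode) avoids that but amounts to exposing internal ids.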