Date: Wed, 4 Sep 2024 18:46:00 +0200
From: Michal Hocko
To: Kent Overstreet
Cc: Andrew Morton, Christoph Hellwig, Yafang Shao, jack@suse.cz,
 Vlastimil Babka, Dave Chinner, Christian Brauner, Alexander Viro,
 Paul Moore, James Morris, "Serge E. Hallyn",
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-bcachefs@vger.kernel.org, linux-security-module@vger.kernel.org,
 linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/2 v2] remove PF_MEMALLOC_NORECLAIM
References: <20240902095203.1559361-1-mhocko@kernel.org>
 <20240902145252.1d2590dbed417d223b896a00@linux-foundation.org>

On Wed 04-09-24 12:05:56, Kent Overstreet wrote:
> On Wed, Sep 04, 2024 at 09:14:29AM GMT, Michal Hocko wrote:
> > On Tue 03-09-24 19:53:41, Kent Overstreet wrote:
> > [...]
> > > However, if we agreed that GFP_NOFAIL meant "only fail if it is not
> > > possible to satisfy this allocation" (and I have been arguing that that
> > > is the only sane meaning) - then that could lead to a lot of error paths
> > > getting simpler.
> > >
> > > Because there are a lot of places where there's essentially no good
> > > reason to bubble up an -ENOMEM to userspace; if we're actually out of
> > > memory the current allocation is just one out of many and not
> > > particularly special, better to let the oom killer handle it...
> >
> > This is exactly the GFP_KERNEL semantic for low order allocations or
> > kvmalloc for that matter. They simply never fail except in a couple of
> > corner cases - e.g. the allocating task is an oom victim and all of the
> > oom memory reserves have been consumed. This is what we call "not
> > possible to allocate".
>
> *nod*
>
> Which does beg the question of why GFP_NOFAIL exists.

Exactly for the reason that even a rare failure is not acceptable and
there is no way to handle it other than to keep retrying.
Typical code was
	while (!(ptr = kmalloc()))
		;

Or the failure would be much more catastrophic than the retry loop
taking an unbounded amount of time.

> > > So the error paths would be more along the lines of "there's a bug, or
> > > userspace has requested something crazy, just shut down gracefully".
> >
> > How do you expect that to be done? Who is going to go over all those
> > GFP_NOFAIL users? And what kind of guidelines should they follow? It is
> > clear that they believe they cannot handle the failure gracefully
> > therefore they have requested GFP_NOFAIL. Many of them do not have
> > a return value to return.
>
> They can't handle the allocation failure and continue normal operation,
> but that's entirely different from not being able to handle the
> allocation failure at all - it's not hard to do an emergency shutdown,
> that's a normal thing for filesystems to do.
>
> And if you scan for GFP_NOFAIL uses in the kernel, a decent number
> already do just that.

It's been quite some time since I last looked. And I am not saying
that all the existing ones really require something as strong as the
GFP_NOFAIL semantic. If they could be dropped then great! The fewer we
have the better.

But the point is there are some which _do_ need this. We have discussed
that in another email thread where you have heard why XFS and EXT4 do
that and why they are not going to change that model.

For those users we absolutely need a predictable and well defined
behavior because they know what they are doing.

[...]

> But as a matter of policy going forward, yes we should be saying that
> even GFP_NOFAIL allocations should be checking for -ENOMEM.

I argue that such a NOFAIL has no well defined semantic and legit
users are forced to do
	while (!(ptr = kmalloc(GFP_NOFAIL)))
		;
or
	BUG_ON(!(ptr = kmalloc(GFP_NOFAIL)));

So it has no real reason to exist.

We at the allocator level have 2 choices. Either we tell users they
will not get GFP_NOFAIL and they just do the above, or we provide NOFAIL
which really guarantees that there is no failure even if that means the
allocation takes an unbounded amount of time. The latter has a slight
advantage because a) you can identify those callers more easily and b)
the allocator can do some heuristics to help those allocations.

We can still discuss how to handle unsupported cases (like GFP_ATOMIC |
__GFP_NOFAIL or kmalloc($UNCHECKED_USER_INPUT_THAT_IS_TOO_LARGE,
__GFP_NOFAIL)) but the fact of the Linux kernel is that we have legit
users and we need to optimize for them.

> > Yes, we need to define some reasonable maximum supported sizes. For the
> > page allocator this has been order > 1 and considering we have had a
> > warning about those requests for years without a single report then we
> > can assume we do not have such abusers. For kvmalloc the story is
> > different. Current INT_MAX is just not any practical limit. Past
> > experience says that anything based on the amount of memory just doesn't
> > work (e.g. hash table sizes that used to do that scaling and there are
> > other examples). So we should be practical here and look at existing
> > users and see what they really need and put a cap above that.
>
> Not following what you're saying about hash tables? Hash tables scale
> roughly with the amount of system memory/workingset.

I do not have the sha handy but I do remember the dcache hashtable
scaling with the amount of memory in the past and that led to GBs of
memory allocated on TB systems. This is not the case anymore, I just
wanted to mention that scaling with the amount of memory can go really
wrong easily.

> But it seems to me that the limit should be lower if you're on e.g. a 2
> GB machine (not failing with a warning, just failing immediately rather
> than oom killing a bunch of stuff first) - and it's going to need to be
> raised above INT_MAX as large memory machines keep growing, I keep
> hitting it in bcachefs fsck code.

Do we have an actual usecase that would require more than a couple of
MB? The amount of memory wouldn't play any actual role then.
-- 
Michal Hocko
SUSE Labs
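
The two allocator-level choices described in the message can be pictured
with a small userspace sketch. This is only an illustration under stated
assumptions, not kernel code: try_alloc(), caller_retry_loop() and
alloc_nofail() are hypothetical names, and the failure injection via
failures_left stands in for real memory pressure.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for a fallible allocator: the next
 * `failures_left` calls return NULL, later calls fall through to
 * malloc(). `attempts` counts every call.
 */
static int failures_left;
static int attempts;

static void *try_alloc(size_t size)
{
	attempts++;
	if (failures_left > 0) {
		failures_left--;
		return NULL;
	}
	return malloc(size);
}

/*
 * Choice 1: the allocator offers no NOFAIL mode, so each caller that
 * cannot handle failure open-codes its own retry loop.
 */
static void *caller_retry_loop(size_t size)
{
	void *ptr;

	while (!(ptr = try_alloc(size)))
		;	/* unbounded: loops until an attempt succeeds */
	return ptr;
}

/*
 * Choice 2: the allocator provides the guarantee behind one entry
 * point. Callers are then easy to identify, and the retry path is
 * the single place where heuristics could be added (none are
 * modeled in this sketch).
 */
static void *alloc_nofail(size_t size)
{
	void *ptr;

	for (;;) {
		ptr = try_alloc(size);
		if (ptr)
			return ptr;
	}
}
```

Both variants may take an unbounded amount of time. The difference shown
here is only where the loop lives: spread across callers, or kept in one
place that the allocator controls.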