Date: Sat, 31 Aug 2024 11:46:17 -0400
From: Kent Overstreet
To: Theodore Ts'o
Cc: Dave Chinner, Michal Hocko, Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Dave Chinner
Subject: Re: [PATCH] bcachefs: Switch to memalloc_flags_do() for vmalloc allocations
References: <20240828140638.3204253-1-kent.overstreet@linux.dev> <20240830033905.GC9627@mit.edu>
In-Reply-To: <20240830033905.GC9627@mit.edu>

On Thu, Aug 29, 2024 at 11:39:05PM GMT, Theodore Ts'o wrote:
> On Fri, Aug 30, 2024 at 12:27:11AM +1000, Dave Chinner wrote:
> >
> > We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years
> > now. This was the default Irix kernel allocator behaviour (it had a
> > forwards progress guarantee and would never fail allocation unless
> > told it could do so). We've been using the same "guaranteed not to
> > fail" semantics on Linux since the original port started 25 years
> > ago via open-coded loops.
>
> Ext3/ext4 doesn't have quite the history as XFS --- it's only been
> around for 23 years --- but we've also used __GFP_NOFAIL or its
> moral equivalent, e.g.:
>
> > do {
> > 	p = kmalloc(size);
> > } while (!p);
>
> For the entire existence of ext3.
>
> > Put simply: __GFP_NOFAIL will be rendered completely useless if it
> > can fail due to external scoped memory allocation contexts. This
> > will force us to revert all __GFP_NOFAIL allocations back to
> > open-coded will-not-fail loops.
>
> The same will be true for ext4. And as Dave has said, the MM
> developers want to have visibility to when file systems have basically
> said, "if you can't allow us to allocate memory, our only alternative
> is to cause user data loss, crash the kernel, or loop forever; we will
> choose the latter". The MM developers tried to make __GFP_NOFAIL go
> away several years ago, and ext4 put the retry loop back. As a result,
> the compromise was that the MM developers restored __GFP_NOFAIL, and
> the file systems developers have done their best to reduce the use of
> __GFP_NOFAIL as much as possible.
>
> So if you try to break the GFP_NOFAIL promise, both xfs and ext4 will
> go back to the retry loop. And the MM devs will be sad, and they will
> forcibly revert your change to *their* code, even if that means
> breaking bcachefs.
> Because otherwise, you will be breaking ext4 and xfs, and so we will
> go back to using a retry loop, which will be worse for Linux users.

GFP_NOFAIL may be better than the retry loop, but it's still not good.

Consider what happens when you have a GFP_NOFAIL in a critical IO path,
when the system is almost out of memory; yes, that allocation will
succeed _eventually_, but without any latency bounds. When you're
thrashing or being fork bombed, that allocation is contending with
everything else.

Much like a lock in a critical path, where the work done under the lock
grows when the system is loaded, it's a contention point subject to
catastrophic failure.

Much better to preallocate, e.g. with a mempool, or have some other kind
of fallback.

It might work to do __GFP_NOFAIL|__GFP_HIGH in critical paths, but I've
never seen that investigated or tried.

And this is an area filesystem people really need to be thinking about.
Block layer gets this right, filesystems do not, and I suspect this is a
key contributor to our performance and behaviour sucking when we're
thrashing.

bcachefs puts a lot of effort into making sure we can run in bounded
memory, because I put a lot of emphasis on consistent performance and
bounded latency, not just winning benchmarks. There are only two
__GFP_NOFAIL allocations in bcachefs, and I'll likely remove both of
them when I get around to it.
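
For illustration only, here is a userspace sketch of the "preallocate,
e.g. with a mempool" idea above. It is not the kernel mempool API and not
bcachefs code: struct reserve_pool, pool_init(), pool_alloc(),
pool_free(), pool_exit(), try_alloc() and the simulate_oom flag are all
hypothetical names made up for this example. The point it tries to show
is that the only allocations allowed to fail happen once, at setup time,
so the critical path can fall back to a reserve instead of looping on the
allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical test hook: when true, the normal allocator "fails". */
static bool simulate_oom;

static void *try_alloc(size_t size)
{
	return simulate_oom ? NULL : malloc(size);
}

/* Hypothetical userspace stand-in for a mempool-style reserve. */
struct reserve_pool {
	void **elems;      /* preallocated elements held in reserve */
	size_t nr;         /* elements currently in the reserve */
	size_t min_nr;     /* reserve size established at init time */
	size_t elem_size;
};

static void pool_exit(struct reserve_pool *p)
{
	while (p->nr)
		free(p->elems[--p->nr]);
	free(p->elems);
	p->elems = NULL;
}

/* Setup path: failure is acceptable here, so we simply report it. */
static bool pool_init(struct reserve_pool *p, size_t min_nr, size_t elem_size)
{
	p->nr = 0;
	p->min_nr = min_nr;
	p->elem_size = elem_size;
	p->elems = calloc(min_nr, sizeof(*p->elems));
	if (!p->elems)
		return false;
	while (p->nr < min_nr) {
		void *e = malloc(elem_size);
		if (!e) {
			pool_exit(p);
			return false;
		}
		p->elems[p->nr++] = e;
	}
	return true;
}

/*
 * Critical path: try the normal allocator once, then take from the
 * reserve. This sketch returns NULL when the reserve is empty; a real
 * pool would have the caller wait for an element to be freed instead.
 */
static void *pool_alloc(struct reserve_pool *p)
{
	void *e = try_alloc(p->elem_size);

	if (e)
		return e;
	return p->nr ? p->elems[--p->nr] : NULL;
}

/* Freed elements refill the reserve before going back to the allocator. */
static void pool_free(struct reserve_pool *p, void *e)
{
	assert(e);
	if (p->nr < p->min_nr)
		p->elems[p->nr++] = e;
	else
		free(e);
}
```

With a reserve sized for the number of elements one operation needs, the
wait is bounded by how long in-flight operations take to return their
elements, not by how long reclaim takes under memory pressure.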