From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id F310BC48297
	for <linux-mm@archiver.kernel.org>; Mon, 12 Feb 2024 19:30:13 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 320306B0075; Mon, 12 Feb 2024 14:30:13 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 2D0786B007B; Mon, 12 Feb 2024 14:30:13 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 1BFDE6B007D; Mon, 12 Feb 2024 14:30:13 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17])
	by kanga.kvack.org (Postfix) with ESMTP id 0D5CF6B0075
	for <linux-mm@kvack.org>; Mon, 12 Feb 2024 14:30:13 -0500 (EST)
Received: from smtpin02.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay05.hostedemail.com (Postfix) with ESMTP id C787540B05
	for <linux-mm@kvack.org>; Mon, 12 Feb 2024 19:30:12 +0000 (UTC)
X-FDA: 81784142664.02.A47008C
Received: from out-170.mta1.migadu.com (out-170.mta1.migadu.com [95.215.58.170])
	by imf16.hostedemail.com (Postfix) with ESMTP id 0F84A180005
	for <linux-mm@kvack.org>; Mon, 12 Feb 2024 19:30:09 +0000 (UTC)
Authentication-Results: imf16.hostedemail.com;
	dkim=pass header.d=linux.dev header.s=key1 header.b="qlAJEgm/";
	spf=pass (imf16.hostedemail.com: domain of kent.overstreet@linux.dev designates 95.215.58.170 as permitted sender) smtp.mailfrom=kent.overstreet@linux.dev;
	dmarc=pass (policy=none) header.from=linux.dev
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1707766210;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=rn+g3POrCt16v7JX8rOl/Zs5Xf7fDMrrf9BRGk2OghE=;
	b=fOweH7k2zx6H1nx5xkUmt+A5YhmzpKRV14z+IEWGCm5Tq2nVndIhaErYVSfnh0VZF0Lyi5
	htRloLgkStWpQZGukl8LaBDH2gP6vkN/GdOP5g1PD1PbdMlc/AYx4nuWpcC2oEkUuNEBYV
	IpZpn4/YYK17Jae+XxS8IAWyPAlJqYo=
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1707766210; a=rsa-sha256;
	cv=none;
	b=DdAQ5YOWrGjsYO7rCA9+RtvhcgehbuMVisMaGSKo4c9QRZ2iB2aKCuIMv1y4W2PXXSbG7l
	gUrX2gip6Udt2JCZWzetSRhgz9HSHGjxWfUR4om/g4G60wdFhWkwYH6IdZKDD99HBMZvpw
	kG50GGJR81uoaxHBDPnJ/TyzKsddS4g=
ARC-Authentication-Results: i=1;
	imf16.hostedemail.com;
	dkim=pass header.d=linux.dev header.s=key1 header.b="qlAJEgm/";
	spf=pass (imf16.hostedemail.com: domain of kent.overstreet@linux.dev designates 95.215.58.170 as permitted sender) smtp.mailfrom=kent.overstreet@linux.dev;
	dmarc=pass (policy=none) header.from=linux.dev
Date: Mon, 12 Feb 2024 14:30:02 -0500
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1;
	t=1707766207;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=rn+g3POrCt16v7JX8rOl/Zs5Xf7fDMrrf9BRGk2OghE=;
	b=qlAJEgm/CtrwPB/nNjb/vD8B55ZNOWsm5jV92wK+E5nqfANbsMzyK1sF+vJMb50sKllkDV
	soOrFT4O6IJ7Ik/CqAfSvHUWFGw1xEuigZ7G+ivDvjSoDAMh8kTuZdpM0mIWsj5f1+wh6M
	lJ2xYc5vs/HXrMgJ8ngbq2VC/Igqep4=
X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers.
From: Kent Overstreet <kent.overstreet@linux.dev>
To: Dave Chinner <david@fromorbit.com>
Cc: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>, 
	Michal Hocko <mhocko@suse.com>, Matthew Wilcox <willy@infradead.org>, 
	lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, 
	linux-block@vger.kernel.org, linux-ide@vger.kernel.org, linux-scsi@vger.kernel.org, 
	linux-nvme@lists.infradead.org, Kent Overstreet <kent.overstreet@gmail.com>
Subject: Re: [LSF/MM/BPF TOPIC] Removing GFP_NOFS
Message-ID: <cepmpv7vdq7i6277wheqqnqsniqnkomvh7sn3535rcacvorkuu@5caayyz44qzr>
References: <ZZcgXI46AinlcBDP@casper.infradead.org>
 <ZZzP6731XwZQnz0o@dread.disaster.area>
 <3ba0dffa-beea-478f-bb6e-777b6304fb69@kernel.org>
 <ZcUQfzfQ9R8X0s47@tiehlicka>
 <3aa399bb-5007-4d12-88ae-ed244e9a653f@kernel.org>
 <ZclyYBO0vcQHZ5dV@dread.disaster.area>
 <5p4zwxtfqwm3wgvzwqfg6uwy5m3lgpfypij4fzea63gu67ve4t@77to5kukmiic>
 <ZcmgFThkhh9HYsXh@dread.disaster.area>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <ZcmgFThkhh9HYsXh@dread.disaster.area>
X-Migadu-Flow: FLOW_OUT
X-Rspamd-Queue-Id: 0F84A180005
X-Rspam-User: 
X-Rspamd-Server: rspam11
X-Stat-Signature: geb9cgbb3wue13hsuchsmsa6n1fyix5w
X-HE-Tag: 1707766209-313191
X-HE-Meta: U2FsdGVkX19hOSkgV3bv4yfFn1obCayTEc3L7jb79M8prjgyOvx4GjF1V0Gov4ljne1bn9pXG2ZWi+dWsQY4lMvBsoatH+Uczw+H/2qddTcd/lsvPzIJN448UXj947Lbbv8vnQWnbo134pwn3BvMkYSlDE75Lw5IRnY36SlQfoLCRwt0DC8KdN0b1TaQ8wMca3i0e8OaFtJffmoeYi3F4zBR/LavalwhHJLFuCAn6kID5rdWolp6q/JZgcnLvj57BaCVsVW/VgohZ/Ba0W/r+SMCymvS9KysRt5I2Q+aDexFzHgWXxYKSiwZ+zM+UgipD++BLsmGu4hcmUEYFOsAqQm1cQyYDdQ1LXqwjh4toPe5tc3KfBmgM15brrh5yh7mWHQn+/9MUrATU/Tos5/+MJ1WSJFcE30gmVM3+TXz44QqhptmjTQ5H2iCuWszCE1gu0DfAHre2UTaPqYCW62j1HZy4mzGCYg6d4BfqjKPcc2H06zYF+tpsXvew7TASVEOYBDYZLRYqmyIGvGjd7CFCAJsVGJ2vhyv9poIaSRSYO02qreiEnpE5vNGd3WUoOuhWd3N9WrZlUdviW1k0OWd0G6+7+79qU2qTMNopCuZMHTQET9aQIZ7UvdkZmsc9qqz1DlysmtHAvEJpJ7y46vGeF1OvTVkYlg5dyl6JLRjgcT1bA/G+eoOP2lQNViC1CUJRs8Op7zTGplDDKtVFDJ9c/MT18AIREylpNPDB6ISKq0azgY7w+smvPVmiu7B70hTvgK6rDP1xap9jwo3RJ/DMVFHANxbq9zdUUy+LI4TqYgGu25WjC5aMjk4EMaNQzzlvQ5z6e637XTJlYC/FXMC8WzaXruR/4+CopJ5I8oIGCW1K+63qdZBOLbapFUxnoY4RwyV2fqrukQHE5cgH6nNSy/N89neUapKnZyPQTnqTy8=
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Mon, Feb 12, 2024 at 03:35:33PM +1100, Dave Chinner wrote:
> On Sun, Feb 11, 2024 at 09:06:33PM -0500, Kent Overstreet wrote:
> > That's because in general most code in the IO path knows how to make
> > effective use of biosets and mempools (which may take some work! you
> > have to ensure that you're always able to make forward progress when
> > memory is limited, and in particular that you don't double allocate from
> > the same mempool if you're blocking the first allocation from
> > completing/freeing).
> 
> Yes, I understand this, and that's my point: NOIO context tends to
> be able to use mempools and other mechanisms to prevent memory
> allocation failure, not NOFAIL.
> 
> The IO layers are request based and that enables one-in, one-out
> allocation pools that can guarantee single IO progress. That's all
> the IO layers need to guarantee to the filesystems so that forwards
> progress can always be made under memory pressure.
> 
> However, filesystems cannot guarantee "one in, one out" allocation
> behaviour. A transaction can require a largely unbounded number of
> memory allocations to succeed to make progress through to
> completion, and so things like mempools -cannot be used- to prevent
> memory allocation failures whilst providing a forwards progress
> guarantee.

I don't see that that's actually true. There's no requirement that
arbitrarily large IOs must be done atomically, within a single
transaction: there's been at most talk of eventually doing atomic writes
through the pagecache, but the people on that can't even finish atomic
writes through the block layer, so who knows when that'll happen.

I generally haven't been running into filesystem operations that require
an unbounded number of memory allocations (reflink is a bit of an
exception in the current bcachefs code, and even that is just a
limitation I could solve if I really wanted to...)

> Hence a NOFAIL scope is useful at the filesystem layer for
> filesystem objects to ensure forwards progress under memory
> pressure, but it is completely unnecessary once we transition to the
> IO layer where forwards progress guarantees ensure memory allocation
> failures don't impede progress.
> 
> IOWs, we only need NOFAIL at the NOFS layers, not at the NOIO
> layers. The entry points to the block layer should transition the
> task to NOIO context and restore the previous context on exit. Then
> it becomes relatively trivial to apply context based filtering of
> allocation behaviour....
> 
> > > i.e NOFAIL scopes are not relevant outside the subsystem that sets
> > > it.  Hence we likely need helpers to clear and restore NOFAIL when
> > > we cross allocation context boundaries. e.g. as we cross from
> > > filesystem to block layer in the IO stack via submit_bio(). Maybe
> > > they should be doing something like:
> > > 
> > > 	nofail_flags = memalloc_nofail_clear();
> > 
> > NOFAIL is not a scoped thing at all, period; it is very much a
> > _callsite_ specific thing, and it depends on whether that callsite has a
> > fallback.
> 
> *cough*
> 
> As I've already stated, NOFAIL allocation has been scoped in XFS for
> the past 20 years.
> 
> Every memory allocation inside a transaction *must* be NOFAIL unless
> otherwise specified because memory allocation failure inside a dirty
> transaction is a fatal error.

Say you start to incrementally mempoolify your allocations inside a
transaction - those mempools aren't going to do anything if there's a
scoped NOFAIL, and sorting that out is going to get messy fast.
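
As a rough illustration of that interaction, here is a userspace sketch
(not kernel code; every name in it is hypothetical, and it assumes a
NOFAIL scope overrides the callsite's willingness to see a failure). The
pool can only hand out its reserve if the underlying allocation is
allowed to return NULL:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Userspace model only: every name here is hypothetical, none of this
 * is a kernel API.
 */
#define RESERVE_NR 2

struct model_state {
	bool scoped_nofail;	/* models a task-wide NOFAIL scope */
	int  pressure;		/* upcoming allocation attempts that fail */
	int  retries;		/* retries performed inside the allocator */
};

struct model_pool {
	void *reserve[RESERVE_NR];
	int   nr;		/* reserved elements still available */
	int   reserve_hits;	/* allocations served from the reserve */
};

/* Models the underlying allocator: fails while s->pressure > 0. */
static void *model_alloc(struct model_state *s, size_t size)
{
	for (;;) {
		if (s->pressure == 0)
			return malloc(size);
		s->pressure--;
		if (!s->scoped_nofail)
			return NULL;	/* caller may use its fallback */
		s->retries++;		/* the scope forces a retry instead */
	}
}

/* Models a mempool: try the allocator, then fall back to the reserve. */
static void *model_pool_alloc(struct model_state *s, struct model_pool *p,
			      size_t size)
{
	void *obj = model_alloc(s, size);

	if (!obj && p->nr > 0) {
		obj = p->reserve[--p->nr];
		p->reserve_hits++;
	}
	return obj;
}

/* One allocation under pressure; returns how often the reserve was used. */
static int model_run(bool scoped_nofail, int *retries)
{
	struct model_state s = { .scoped_nofail = scoped_nofail, .pressure = 3 };
	struct model_pool p = { .nr = 0 };
	void *obj;

	for (int i = 0; i < RESERVE_NR; i++) {
		p.reserve[i] = malloc(64);
		assert(p.reserve[i]);
		p.nr++;
	}

	obj = model_pool_alloc(&s, &p, 64);
	assert(obj);
	*retries = s.retries;

	free(obj);
	while (p.nr > 0)
		free(p.reserve[--p.nr]);
	return p.reserve_hits;
}
```

In this model, model_run(false, &r) is served from the reserve without
any retries, while model_run(true, &r) retries inside the allocator
until the pressure clears and never touches the reserve.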

> However, that scoping has never been
> passed to the NOIO contexts below the filesystem - it's scoped
> purely within the filesystem itself and doesn't pass on to other
> subsystems the filesystem calls into.

How is that managed?
> 
> > The most obvious example being, as mentioned previously, mempools.
> 
> Yes, they require one-in, one-out guarantees to avoid starvation and
> ENOMEM situations. And, as we've known since mempools were
> invented, most filesystems cannot provide these guarantees.
> 
> > > > - NOWAIT - as said already, we need to make sure we're not turning an
> > > > allocation that relied on too-small-to-fail into a null pointer exception or
> > > > BUG_ON(!page).
> > > 
> > > Agreed. NOWAIT is removing allocation failure constraints and I
> > > don't think that can be made to work reliably. Error injection
> > > cannot prove the absence of errors and so we can never be certain
> > > the code will always operate correctly and not crash when an
> > > unexpected allocation failure occurs.
> > 
> > You saying we don't know how to test code?
> 
> Yes, that's exactly what I'm saying.
> 
> I'm also saying that designing algorithms that aren't fail safe is
> poor design. If you get it wrong and nothing bad can happen as a
> result, then the design is fine.
> 
> But if the result of missing something accidentally is that the
> system is guaranteed to crash when that is hit, then failure is
> guaranteed and no amount of testing will prevent that failure from
> occurring.
> 
> And we suck at testing, so we absolutely need to design fail
> safe algorithms and APIs...

GFP_NOFAIL doesn't magically make your algorithm fail safe, though.

Suren and I are trying to get memory allocation profiling into 6.9, and
I'll be posting the improved fault injection immediately afterwards -
this is what I used to use to make sure every allocation failure path in
the bcachefs predecessor was tested. Hopefully that'll make things
easier...
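
For readers unfamiliar with the approach, a minimal userspace sketch of
testing every allocation failure path follows. This is a generic "fail
the Nth allocation" scheme, not the fault injection code referred to
above; all names in it are hypothetical. The driver fails each
allocation in turn and checks that the code under test unwinds cleanly:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace sketch; all names are hypothetical, not a kernel API. */
static int fail_nth;	/* 1-based index of the allocation to fail, 0 = none */
static int alloc_count;	/* allocations attempted in the current run */

static void *inject_alloc(size_t size)
{
	if (++alloc_count == fail_nth)
		return NULL;
	return malloc(size);
}

struct pair {
	char *a;
	char *b;
};

/* Code under test: two allocations, must unwind if either one fails. */
static int pair_init(struct pair *p)
{
	p->a = inject_alloc(16);
	if (!p->a)
		return -1;
	p->b = inject_alloc(16);
	if (!p->b) {
		free(p->a);
		p->a = NULL;
		return -1;
	}
	return 0;
}

static void pair_exit(struct pair *p)
{
	free(p->a);
	free(p->b);
	p->a = p->b = NULL;
}

/*
 * Fail each allocation in turn until a run completes without hitting
 * the injected failure. Returns the number of failure paths exercised.
 */
static int exercise_failure_paths(void)
{
	int tested = 0;

	for (int n = 1; ; n++) {
		struct pair p = { 0 };

		fail_nth = n;
		alloc_count = 0;
		if (pair_init(&p) == 0) {
			/* n is past the last allocation: nothing failed */
			pair_exit(&p);
			break;
		}
		/* the failure path must leave nothing allocated */
		assert(!p.a && !p.b);
		tested++;
	}
	fail_nth = 0;
	return tested;
}
```

The loop needs no prior knowledge of how many allocations the code under
test performs: it stops at the first run in which the injected failure
is never reached.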