From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id DA436D1357E
	for <linux-mm@archiver.kernel.org>; Sun, 27 Oct 2024 19:58:20 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 68F786B009D; Sun, 27 Oct 2024 15:58:20 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 618906B009E; Sun, 27 Oct 2024 15:58:20 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 4916C6B00A0; Sun, 27 Oct 2024 15:58:20 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14])
	by kanga.kvack.org (Postfix) with ESMTP id 27B566B009D
	for <linux-mm@kvack.org>; Sun, 27 Oct 2024 15:58:20 -0400 (EDT)
Received: from smtpin08.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay01.hostedemail.com (Postfix) with ESMTP id 5E72E1C6263
	for <linux-mm@kvack.org>; Sun, 27 Oct 2024 19:57:54 +0000 (UTC)
X-FDA: 82720442910.08.701AC18
Received: from out-179.mta0.migadu.com (out-179.mta0.migadu.com [91.218.175.179])
	by imf13.hostedemail.com (Postfix) with ESMTP id 77D062000B
	for <linux-mm@kvack.org>; Sun, 27 Oct 2024 19:57:54 +0000 (UTC)
Authentication-Results: imf13.hostedemail.com;
	dkim=pass header.d=linux.dev header.s=key1 header.b="dR/OPLl4";
	spf=pass (imf13.hostedemail.com: domain of kent.overstreet@linux.dev designates 91.218.175.179 as permitted sender) smtp.mailfrom=kent.overstreet@linux.dev;
	dmarc=pass (policy=none) header.from=linux.dev
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1730059044;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=Jl9RE+0N8YfJ1XW1XaPbXD0GLkglTkP7Mz1u51Q5s38=;
	b=tv7B9VR2eqbJ/xPXcBLQg/r8aqrgXjig4DjXqOcobJZCzqRHz/KycqD86VICLNqREA2WQs
	lgOjo4CJoGJQ1ha92ld/c+h5JVJtPPFcOuFbFfUmRZJz1obJDFr5QQ/qrzdGLjaDfZBoSF
	g1fkDcwf/Hb6pzHFBeQnYdIqJKq9ZyI=
ARC-Authentication-Results: i=1;
	imf13.hostedemail.com;
	dkim=pass header.d=linux.dev header.s=key1 header.b="dR/OPLl4";
	spf=pass (imf13.hostedemail.com: domain of kent.overstreet@linux.dev designates 91.218.175.179 as permitted sender) smtp.mailfrom=kent.overstreet@linux.dev;
	dmarc=pass (policy=none) header.from=linux.dev
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1730059044; a=rsa-sha256;
	cv=none;
	b=OHSp1GMSA3pjHMa6mo6aCJn6Rti4xBHzeqkUM6D0HdQvPFjCAZOSVUkPEZeTBEkLstEK7i
	A4fH05kJqlsIXh6Y7eObO9BFKHgkRD4jXg67QU3k+2b5Lk3fnQl0ZZ5WpjxFQSdpZ7TAUV
	lAp8so751QEbwP7+/31FbrooVIBy3SQ=
Date: Sun, 27 Oct 2024 15:58:11 -0400
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1;
	t=1730059095;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=Jl9RE+0N8YfJ1XW1XaPbXD0GLkglTkP7Mz1u51Q5s38=;
	b=dR/OPLl47aKrncYYRCAx9bV0BcETFA8FZrlmxQdGTsYgqFEzVPEqyiB5XKMIDaMiXyiS9D
	7byBKaC4s9E+jfLi4BPsGm+MPtuOFNbOOZKqM3sNx9M8w5DuFDuoAEnEt0mtDLaApSZjBB
	BDyOVHUWpJz82smsUY0VTbL4gTF8JKA=
X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers.
From: Kent Overstreet <kent.overstreet@linux.dev>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>, 
	linux-bcachefs@vger.kernel.org, linux-mm@kvack.org, Vlastimil Babka <vbabka@suse.cz>, 
	Andrew Morton <akpm@linux-foundation.org>, Uladzislau Rezki <urezki@gmail.com>, 
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [PATCH] mm: Drop INT_MAX limit from kvmalloc()
Message-ID: <la4btdzbgouyyyggx2h67keldt4ak5mhbn3huxygwpa5tooh5w@dutiusw5jz6n>
References: <zfr4sh3wtzi4viladlgwyon6voqhcj5fe3fv2nc3hqyyxdw5wd@onyj3r2m3usz>
 <CAHk-=wi=PrbZnwnvhKEF6UUQNCZdNsUbr+hk-jOWGr-q4Mmz=Q@mail.gmail.com>
 <m46mlvv57oypstekojhkdwpts6mi4r63l4kugs4lpry3d2r7dq@kbmied6nzsc3>
 <CAHk-=wg8iZWDbX_MAujGtqHJYQneKBwDJnVAY94siNr1gLmtaA@mail.gmail.com>
 <CAHk-=wiW_QZ9VS3Ho2Ff8ZDp-T6SomDFrVPWG3abs45LMZpxZQ@mail.gmail.com>
 <6eo3gekf6twbnzhpsi2emz2s6sgtof6iba2rvbor7himmejoq5@qbfwtpbpvqoe>
 <CAHk-=wga3FXReWVhU2eid8+sXhBF1QgP1iMJu1jnSX6fapoyXQ@mail.gmail.com>
 <ikaf72w2oap3crjrybbd5jp267slnb7dygz4m62dfw3edu2ppj@f7dv2qdx3yga>
 <CAHk-=wggLED-UmxS0gXrtOXBVWexCkyudYCr5zXNh8p5PB9bng@mail.gmail.com>
 <CAHk-=wjQDCtzBUiRPuKfyjGFiR9JZi82ENyzdvKei4W3pxt=tA@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAHk-=wjQDCtzBUiRPuKfyjGFiR9JZi82ENyzdvKei4W3pxt=tA@mail.gmail.com>
X-Migadu-Flow: FLOW_OUT
X-Rspam-User: 
X-Stat-Signature: 7eosnw7umtpeohirq37nx93f91a6ym6i
X-Rspamd-Queue-Id: 77D062000B
X-Rspamd-Server: rspam11
X-HE-Tag: 1730059074-693847
X-HE-Meta: U2FsdGVkX19kNBrrY6el08Z+JZRbv6kUmBR2yPPxxCWjEowDTT8GgYNVbAQQyDsrgQHxanN2KgF+dhwPJTg2jev/SQkAPyRRTaC2cJAbPhkotKrONke80440WtvkoSXnj9Otqn40GrUuFtS0y6AWLmQy5RvgSYkYDELc6A/1Oq6Xc4xFEBRR2IuX3tHghsQiI7d55KlbuOoEdDBNy2MtkG5AOZD0A7DF7dHjO9hBY6nx30qQJaNnDAQlhQVHrv515yUEC16fIPvFTnBZxykibsQtLeGpLjt19WoFKF++xIUWHSE6HYrYoZ+xY/2a7RrZ9ItV9RcFjeKW2eGhg7jOr/Q2cJsgeF8+dt/0LQs+7fr5/GTFilR6w+lvgZ9cUtE6v6Kar90eVFqgwVVcRXUZqr/U97xGrm6icl+Jkhq/onXJRj01HIHaBAnsoULNA4TdWVTkx9S3QBuW0AtsAxO+2O1vCaFtpO0tQbn4khi9wAXsXjCvF4wBs5Wj8npFVlqqIZBxvx+JQafo0Zik9tcka6mWUSzWsB1+oRHqmQcGHldIQ3qDIzC3D2J9hGT1TILWRNesZqefSzTx/RWbwLMrwJeBfOs5SFDEeEclaOwnNk9aF58ix7AZAbF/VluxxSEtXu7zaG7+z3XMMfiDDbBO0gsH86RoO67r55hzjLgX12/O7D5ugBq6xrKOoV5X+6kuwe0gqgvdcs00+yNThKdilAvCiaB4cd7qZf6oweAmBBBtksXyxKdYbEdN1HOSgeNer3zL4H4aJypT81kUiFhMGqXoi6RiZZdumwwhc7ZxAW31er4SmxBIWwX38Nb4+Id85l1+MrLQhKhsMj6tHJNh9SPIAZiKja8FK2LRG9gNwGQ8ohF4M3/FJkakZbfB8aeg4dGcYBFCqpT6lBrXAx0MlyZsBF02Qh2Xe9AL0B7rKXEN+7nDF3hWRRHFNgZJxYnXfak2eTjvoNWtozs15G9
 ZTB7A8dy
 FS6EcRQgvPOlglKoc//YftYQeRx2lbUGVjvDEH7AM2KZcdyrDlL96kRfJ+uAV0TudAC9xyhucBUyabYPSC5OEIdn6CcAZ0R9YlQNrXpNtzRuB/Bl5LhPo2qhla6PXcOufmv9+2aEWvc7+IamgkORsR3lkGRV7RKaEZsCc0L2oE+SNoV7TM7XKNH6Ie9YaJ50Ea0Uy
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Sun, Oct 20, 2024 at 02:21:50PM -0700, Linus Torvalds wrote:
> There's a very real reason many places don't use filesystems that do
> fsck any more.

So fsck has been on my mind a lot lately - it seems like all I'm working
on is fsck related things - and incidentally, "filesystems that don't
need fsck" is a myth.

The reason is that filesystems have mutable global state, and that state
has to be consistent for the filesystem to work correctly: at a minimum,
global usage counters that have to be correct for -ENOSPC to work, and
allocation/free space maps that have to be correct to not double
allocate.
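
As a rough illustration of why that global state has to be checked, here
is a toy model (not bcachefs code; the function name, arguments and data
layout are all invented for this sketch) that recomputes bucket usage
from the extents that reference buckets and compares it against a free
space map and a global "buckets used" counter:

```python
# Toy model only: every name and structure here is hypothetical.
def check_alloc_info(extents, free_map, used_counter):
    """extents: list of (start, length) bucket ranges referenced by metadata.
    free_map: list of bools, True means the bucket is marked free.
    used_counter: the global count of buckets in use.
    Returns a list of human-readable inconsistencies."""
    errors = []
    refs = [0] * len(free_map)

    # Count how many extents reference each bucket.
    for start, length in extents:
        for bucket in range(start, start + length):
            refs[bucket] += 1

    for bucket, n in enumerate(refs):
        if n > 1:
            errors.append(f"bucket {bucket}: referenced {n} times (double allocation)")
        if n > 0 and free_map[bucket]:
            errors.append(f"bucket {bucket}: in use but marked free")
        if n == 0 and not free_map[bucket]:
            errors.append(f"bucket {bucket}: marked allocated but unreferenced")

    # The global counter must match what the references actually say,
    # otherwise -ENOSPC decisions are made on bad numbers.
    actual_used = sum(1 for n in refs if n > 0)
    if actual_used != used_counter:
        errors.append(f"used counter is {used_counter}, should be {actual_used}")
    return errors
```

Calling `check_alloc_info([(0, 2), (1, 2)], [False, False, True, True], 2)`
reports the doubly referenced bucket, the in-use bucket that is marked
free, and the stale counter. A real fsck pass has to do the equivalent
over the whole filesystem, which is where the scaling problem comes from.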

The only way out of this is to do a pure logging filesystem, e.g. nilfs,
and there's a reason those never took off - compacting overhead is too
high, they don't scale with real world usage (and you still have to give
up on POSIX -ENOSPC, not that that's any real loss).

Even in distributed land, they may get away without a traditional
precise fsck, but they get away with that by leaving the heavy lifting
(precise allocation information) to something like a traditional local
filesystem that does have that. And they still need, at a minimum, a
global GC operation - but GC is just a toy version of fsck to the
filesystem developer; it'll have the same algorithmic complexity as
traditional fsck but without having to be precise.

(Incidentally, the main "check allocations" fsck pass in bcachefs is
directly descended from the runtime GC code in bcache, even if barely
recognizable now).

And a filesystem needs to be able to cope with extreme damage to be
considered fit for purpose - we need to degrade gracefully if there's
corruption, not tell the user "oops, your filesystem is inaccessible" if
something got scribbled over. I consider it flatly unacceptable to not
be able to recover a filesystem if there's data on it. If you blew away
the superblock and all the backup superblocks by running mkfs, /that's/
pretty much unrecoverable because there's too much in the superblock we
really need, but literally anything else we should be able to recover
from - and automatically is the goal.

So there's a lot of interesting challenges in fsck.

- Scaling: fsck is pretty much the limiting factor on filesystem
  scalability. If it wasn't for fsck, bcachefs would probably scale up to
  an exabyte fairly trivially. Making fsck scale to exabyte range
  filesystems is going to take a _lot_ of clever sharding and clever
  algorithms.

- Continuing to run gracefully in the presence of damage wherever
  possible, instead of forcing fsck to be run right away.

  If allocation info is corrupt in ways that might make us double
  allocate, that's a problem, and if interior btree nodes are toast
  that requires expensive repair, but we should be able to continue
  running with most other types of corruption. That hasn't been
  the priority while in development - in development, we want to fail
  fast and noisily so that bugs are reported and the filesystem is left
  in a state where we can see what happened - but this is an area I'm
  starting to work on now.

- Making sure that fsck never makes things worse

  You really don't want fsck to ever delete anything; this could be
  absolutely tragic in the event of any sort of transient error (or
  bug). We've still got a bit of work to do here with pointers to
  indirect extents, and there's probably some other cases that need to
  be looked at - I think XFS is ahead of bcachefs here, I know
  Darrick has a notion of "tainted" metadata, the idea being that if a
  pointer to an indirect extent points to a missing extent, we don't
  delete it, we just flag it as tainted: don't log more fsck errors,
  just return -EIO when reading from it; then if we're able to recover
  the indirect extent later we can just clear the tainted flag.
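
The "tainted" idea in the last item can be sketched roughly as follows.
This is only one reading of the description above, not XFS or bcachefs
code; `IndirectPointer`, `fsck_check_pointer` and `read_through` are
hypothetical names, and indirect extents are modelled as a plain dict:

```python
import errno

# Hypothetical model: a pointer that refers to an indirect extent by key.
class IndirectPointer:
    def __init__(self, target):
        self.target = target
        self.tainted = False

def fsck_check_pointer(ptr, indirect_extents):
    """Check one pointer. Returns True if a new fsck error should be logged.
    The pointer is never deleted, only flagged."""
    if ptr.target in indirect_extents:
        # The extent exists (possibly recovered later): clear the flag.
        ptr.tainted = False
        return False
    if ptr.tainted:
        # Already known to be bad: don't log the same error again.
        return False
    ptr.tainted = True
    return True

def read_through(ptr, indirect_extents):
    """Reads through a tainted or dangling pointer fail with EIO."""
    if ptr.tainted or ptr.target not in indirect_extents:
        raise OSError(errno.EIO, "indirect extent missing")
    return indirect_extents[ptr.target]
```

The point of the design is that the repair is reversible: if the missing
extent shows up again after a transient error, the next check clears the
flag and reads start working, with nothing having been thrown away in
between.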

We've got some fun tricks for getting back online as quickly as possible
even in the event of catastrophic damage. If alloc info is suspect, we
can do a very quick pass that walks all pointers and just marks a "bucket
is currently allocated, don't use" bitmap and defer repairing or
rebuilding the actual alloc info until we're online, in the background.
And if interior btree nodes are toast and we need to scan (which
shouldn't ever happen, but users are users and hardware is hardware, and
I haven't done btrfs dup style replication because you can't trust SSDs
to lay writes out on different erase units) there's a bitmap in the
superblock of ranges that have btree nodes so the scan pass on a modern
filesystem shouldn't take too long.
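
A minimal sketch of the "bucket is currently allocated, don't use"
bitmap trick described above, assuming pointers can be reduced to plain
bucket numbers. The function names and the first-fit allocator are made
up for illustration and are not how bcachefs implements it:

```python
# Hypothetical sketch: one bit per bucket, set if any pointer references it.
def mark_buckets_in_use(pointers, nr_buckets):
    """Walk all pointers once and build the in-use bitmap.
    Out-of-range pointers are skipped here; real repair code would
    have to report them."""
    in_use = bytearray((nr_buckets + 7) // 8)
    for bucket in pointers:
        if 0 <= bucket < nr_buckets:
            in_use[bucket >> 3] |= 1 << (bucket & 7)
    return in_use

def allocate_bucket(in_use, nr_buckets):
    """First-fit allocation that trusts only the bitmap, not the
    (suspect) alloc info. Returns a bucket number, or None if full."""
    for bucket in range(nr_buckets):
        if not in_use[bucket >> 3] & (1 << (bucket & 7)):
            in_use[bucket >> 3] |= 1 << (bucket & 7)
            return bucket
    return None
```

With `in_use = mark_buckets_in_use([0, 1, 3], 5)`, successive calls to
`allocate_bucket(in_use, 5)` hand out buckets 2 and 4 and then return
`None`. Nothing here repairs the real alloc info; the bitmap is just
enough to avoid double allocation while that work is deferred to the
background.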