From: Kent Overstreet
To: Linus Torvalds
Cc: Lorenzo Stoakes, linux-bcachefs@vger.kernel.org, linux-mm@kvack.org, Vlastimil Babka, Andrew Morton, Uladzislau Rezki, Christoph Hellwig
Subject: Re: [PATCH] mm: Drop INT_MAX limit from kvmalloc()
Date: Sun, 27 Oct 2024 15:58:11 -0400
References: <6eo3gekf6twbnzhpsi2emz2s6sgtof6iba2rvbor7himmejoq5@qbfwtpbpvqoe>

On Sun, Oct 20, 2024 at 02:21:50PM -0700, Linus Torvalds wrote:
> There's a very real reason many places don't use filesystems that do
> fsck any more.

So fsck has been on my mind a lot lately - seems like all I'm working on lately is fsck related things - and incidentally, "filesystems that don't need fsck" is a myth.

The reason is that filesystems have mutable global state, and that state has to be consistent for the filesystem to work correctly: at a minimum, global usage counters that have to be correct for -ENOSPC to work, and allocation/free space maps that have to be correct to not double allocate.

The only way out of this is to do a pure logging filesystem, e.g. nilfs, and there's a reason those never took off - compacting overhead is too high, and they don't scale with real world usage (and you still have to give up on POSIX -ENOSPC, not that that's any real loss).

Even in distributed land, they may get away without a traditional precise fsck, but they get away with that by leaving the heavy lifting (precise allocation information) to something like a traditional local filesystem that does have that.

And they still need, at a minimum, a global GC operation - but GC is just a toy version of fsck to the filesystem developer; it'll have the same algorithmic complexity as traditional fsck but without having to be precise. (Incidentally, the main check allocations fsck pass in bcachefs is directly descended from the runtime GC code in bcache, even if barely recognizable now.)

And a filesystem needs to be able to cope with extreme damage to be considered fit for purpose - we need to degrade gracefully if there's corruption, not tell the user "oops, your filesystem is inaccessible" if something got scribbled over. I consider it flatly unacceptable to not be able to recover a filesystem if there's data on it.
If you blew away the superblock and all the backup superblocks by running mkfs, /that's/ pretty much unrecoverable because there's too much in the superblock we really need, but literally anything else we should be able to recover from - and automatically is the goal.

So there's a lot of interesting challenges in fsck.

- Scaling: fsck is pretty much the limiting factor on filesystem scalability. If it weren't for fsck, bcachefs would probably scale up to an exabyte fairly trivially. Making fsck scale to exabyte range filesystems is going to take a _lot_ of clever sharding and clever algorithms.

- Continuing to run gracefully in the presence of damage wherever possible, instead of forcing fsck to be run right away.

  If allocation info is corrupt in the wrong ways such that we might double allocate, that's a problem, or if interior btree nodes are toast, that requires expensive repair, but we should be able to continue running with most other types of corruption.

  That hasn't been the priority while in development - in development, we want to fail fast and noisily so that bugs are reported and the filesystem is left in a state where we can see what happened - but this is an area I'm starting to work on now.

- Making sure that fsck never makes things worse.

  You really don't want fsck to ever delete anything; this could be absolutely tragic in the event of any sort of transient error (or bug). We've still got a bit of work to do here with pointers to indirect extents, and there's probably some other cases that need to be looked at. I think XFS is ahead of bcachefs here: I know Darrick has a notion of "tainted" metadata, the idea being that if a pointer to an indirect extent points to a missing extent, we don't delete it, we just flag it as tainted: don't log more fsck errors, just return -EIO when reading from it; then if we're able to recover the indirect extent later we can just clear the tainted flag.
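
To make the "tainted" idea concrete, here's a minimal userspace sketch in C. Everything in it is hypothetical and for illustration only: `struct reflink_ptr`, `struct indirect_extent`, `fsck_check_ptrs()` and `read_via_ptr()` are made-up names, not bcachefs or XFS code. The point is just the shape of it: fsck flags a dangling pointer instead of deleting it, reads through a tainted pointer return -EIO, and a later pass clears the flag once the target is back.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration; not bcachefs or XFS structures. */
struct indirect_extent {
	bool present;	/* false while the extent is missing */
	int  data;	/* stand-in for the extent's contents */
};

struct reflink_ptr {
	size_t idx;	/* which indirect extent this pointer targets */
	bool   tainted;	/* set by fsck when the target is missing */
};

static bool target_ok(const struct reflink_ptr *p,
		      const struct indirect_extent *ext, size_t nr_ext)
{
	return p->idx < nr_ext && ext[p->idx].present;
}

/*
 * fsck pass: never deletes a pointer. A pointer whose target is missing
 * is flagged as tainted; a pointer whose target has been recovered gets
 * its flag cleared. Returns the number of pointers left tainted.
 */
size_t fsck_check_ptrs(struct reflink_ptr *ptrs, size_t nr_ptrs,
		       const struct indirect_extent *ext, size_t nr_ext)
{
	size_t nr_tainted = 0;

	for (size_t i = 0; i < nr_ptrs; i++) {
		ptrs[i].tainted = !target_ok(&ptrs[i], ext, nr_ext);
		nr_tainted += ptrs[i].tainted;
	}
	return nr_tainted;
}

/* Read path: a tainted or dangling pointer yields -EIO, nothing else. */
int read_via_ptr(const struct reflink_ptr *p,
		 const struct indirect_extent *ext, size_t nr_ext, int *out)
{
	assert(out != NULL);

	if (p->tainted || !target_ok(p, ext, nr_ext))
		return -EIO;
	*out = ext[p->idx].data;
	return 0;
}
```

The design choice this illustrates is that the repair is reversible: nothing is thrown away on the first pass, so a transient error only costs an -EIO until the next check.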

We've got some fun tricks for getting back online as quickly as possible even in the event of catastrophic damage. If alloc info is suspect, we can do a very quick pass that walks all pointers and just marks a "bucket is currently allocated, don't use" bitmap, and defer repairing or rebuilding the actual alloc info until we're online, in the background.

And if interior btree nodes are toast and we need to scan (which shouldn't ever happen, but users are users and hardware is hardware, and I haven't done btrfs dup style replication because you can't trust SSDs to lay writes out on different erase units), there's a bitmap in the superblock of ranges that have btree nodes, so the scan pass on a modern filesystem shouldn't take too long.
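
As a rough illustration of the quick "don't use" pass, here's a small userspace sketch in C. The names (`struct ptr`, `mark_allocated_buckets()`, `bucket_in_use()`, `alloc_bucket()`) are hypothetical and this is not the bcachefs implementation; it assumes each pointer can be reduced to the index of the bucket it references. The pass only sets bits, it doesn't try to rebuild counts or any other alloc info, and the allocator just refuses any bucket whose bit is set.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical: a pointer reduced to the bucket it references. */
struct ptr {
	size_t bucket;
};

bool bucket_in_use(const unsigned long *bitmap, size_t bucket)
{
	return bitmap[bucket / BITS_PER_WORD] &
	       (1UL << (bucket % BITS_PER_WORD));
}

static void mark_bucket(unsigned long *bitmap, size_t bucket)
{
	bitmap[bucket / BITS_PER_WORD] |= 1UL << (bucket % BITS_PER_WORD);
}

/*
 * Quick pass: walk every pointer and mark its bucket as "currently
 * allocated, don't use". Pointers past the end of the device are
 * skipped here; dealing with them is left to the full repair later.
 */
void mark_allocated_buckets(unsigned long *bitmap, size_t nr_buckets,
			    const struct ptr *ptrs, size_t nr_ptrs)
{
	size_t words = (nr_buckets + BITS_PER_WORD - 1) / BITS_PER_WORD;

	assert(bitmap != NULL);
	memset(bitmap, 0, words * sizeof(*bitmap));

	for (size_t i = 0; i < nr_ptrs; i++)
		if (ptrs[i].bucket < nr_buckets)
			mark_bucket(bitmap, ptrs[i].bucket);
}

/*
 * Allocation while alloc info is still suspect: hand out the first
 * bucket not marked in the bitmap and mark it. Returns -1 if none.
 */
long alloc_bucket(unsigned long *bitmap, size_t nr_buckets)
{
	for (size_t b = 0; b < nr_buckets; b++) {
		if (!bucket_in_use(bitmap, b)) {
			mark_bucket(bitmap, b);
			return (long)b;
		}
	}
	return -1;
}
```

The bitmap is deliberately conservative: it can only say "don't touch this", which is enough to avoid double allocation while the precise alloc info is rebuilt in the background.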