From: Linus Torvalds
Date: Sat, 27 Jan 2024 11:44:45 -0800
Subject: Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems
To: Matthew Wilcox
Cc: James Bottomley, Amir Goldstein, Steven Rostedt, Greg Kroah-Hartman, lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Christian Brauner, Al Viro

On Sat, 27 Jan 2024 at 10:06, Matthew Wilcox wrote:
>
> I'd suggest that eventfs and shiftfs are not "simple filesystems".
> They're synthetic filesystems that want to do very different things
> from block filesystems and network filesystems. We have a lot of
> infrastructure in place to help authors of, say, bcachefs, but not a lot
> of infrastructure for synthetic filesystems (procfs, overlayfs, sysfs,
> debugfs, etc).

Indeed.

I think it's worth pointing out three very _fundamental_ design issues
here, which all mean that a "regular filesystem" is in many ways much
simpler than a virtual one:

 (a) this is what the VFS has literally primarily been designed for.

When you look at a lot of VFS issues, they almost all come from just
very basic "this is what a filesystem needs" issues, and particularly
performance issues. And when you talk "performance", the #1 thing is
caching. In fact, I'd argue that #2 is caching too. Caching is just
*so* important, and it really shows in the VFS. Think about just about
any part of the VFS, and it's all about caching filesystem data. It's
why the dentry cache exists, it's why the page / folios exist, it's
what 99% of all the VFS code is about.

And that performance / caching issue isn't just why most of the VFS
code exists, it's literally also the reason for most of the design
decisions. The dentry cache is a hugely complicated beast, and a *lot*
of the complications are directly related to one thing, and one thing
only: performance. It's why locking is so incredibly baroque.

Yes, there are other complications. The whole notion of "bind mounts"
is a huge complication that arguably isn't performance-related, and
it's why we have that combination of "vfsmount" and "dentry" that we
together call a "path". And that tends to confuse low-level filesystem
people, because the other thing the VFS layer does is to try to shield
the low-level filesystem from higher-level concepts like that, so that
the low-level filesystem literally doesn't have to know about "oh,
this same filesystem is mounted in five different places".
The VFS layer takes care of that, and the filesystem doesn't need to
know.

So part of it is that the VFS has been designed for regular
filesystems, but the *other* part of the puzzle is on the other side:

 (b) regular filesystems have been designed to be filesystems.

Ok, that may sound like a stupid truism, but when it comes to the
discussion of virtual filesystems and relative simplicity, it's quite
a big deal. The fact is, a regular filesystem has literally been
designed from the ground up to do regular filesystem things. And that
matters.

Yes, yes, many filesystems then made various bad design decisions, and
the world isn't perfect. But basically things like "read a directory"
and "read and write files" and "rename things" are all things that the
filesystem was *designed* for.

So the VFS layer was designed for real filesystems, and real
filesystems were designed to do filesystem operations, so they are not
just made to fit together, they are also all made to expose all the
normal read/write/open/stat/whatever system calls.

 (c) none of the above is generally true of virtual filesystems

Sure, *some* virtual filesystems are designed to act like a filesystem
from the ground up. Something like "tmpfs" is obviously a virtual
filesystem, but it's "virtual" only in the sense that it doesn't have
much of a backing store. It's still designed primarily to *be* a
filesystem, and the only operations that happen on it are filesystem
operations.

So ignore 'tmpfs' here, and think about all the other virtual
filesystems we have. And realize that they aren't really designed to
be filesystems per se - they are literally designed to be something
entirely different, and the filesystem interface is then only a
secondary thing - it's a window into a strange non-filesystem world
where normal filesystem operations don't even exist, even if sometimes
there can be some kind of convoluted transformation for them.

So you have "simple" things like just plain read-only files in /proc,
and despite being about as simple as they come, they fail miserably at
the most fundamental part of a file: you can't even 'stat()' them and
get sane file size data from them.

And "caching" - which was the #1 reason for most of the filesystem
code - ends up being much less so, although it turns out that it's
still hugely important because of the abstraction interface it allows.

So all those dentries, and all the complicated lookup code, end up
still being quite important to make the virtual filesystem look like a
filesystem at all: it's what gives you the 'getcwd()' system call,
it's what still gives you the whole bind mount thing, it really ends
up giving a lot of "structure" to the virtual filesystem that would be
an absolute nightmare without it. But it's a structure that is really
designed for something else.

Because the non-filesystem virtual part that a virtual filesystem is
actually trying to expose _as_ a filesystem to user space usually has
lifetime rules (and other rules) that are *entirely* unrelated to any
filesystem activity. A user can "chdir()" into a directory that
describes a process, but the lifetime of that process is then entirely
unrelated to that, and it can go away as a process, while the
directory still has to virtually exist.

That's part of what the VFS code gives a virtual filesystem: the
dentries etc end up being those things that hang around even when the
virtual part that they described may have disappeared. And you *need*
that, just to get sane UNIX 'home directory' semantics.

I think people often don't think of how much that VFS infrastructure
protects them from.

But it's also why virtual filesystems are generally a complete mess:
you have these two pieces, and they are really doing two *COMPLETELY*
different things.

It's why I told Steven so forcefully that tracefs must not mess around
with VFS internals.
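
The 'stat()' point above is easy to check from user space. The following is only a minimal sketch, assuming a Linux system with procfs mounted at /proc; the helper name reported_and_actual_size() and the choice of /proc/self/status are illustrative assumptions, not anything from this thread. It compares the size that stat() reports with the number of bytes a read actually returns:

```python
import os


def reported_and_actual_size(path):
    """Return (st_size as reported by stat(), number of bytes actually read)."""
    reported = os.stat(path).st_size
    with open(path, "rb") as f:
        actual = len(f.read())
    return reported, actual


if __name__ == "__main__":
    # /proc/self/status is just an example of a read-only procfs file whose
    # contents are generated when it is read (assumption: Linux with /proc).
    path = "/proc/self/status"
    if os.path.exists(path):
        reported, actual = reported_and_actual_size(path)
        print(f"{path}: stat() reports {reported} bytes, read() returned {actual}")
    else:
        print("no procfs mounted here; nothing to demonstrate")
```

For a regular file on a disk filesystem or tmpfs the two numbers agree; for generated procfs files like this one the reported size is typically 0 even though the read returns data.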

A virtual filesystem either needs to be a "real filesystem" aka tmpfs
and just leave it *all* to the VFS layer, or it needs to just treat
the dentries as a separate cache that the virtual filesystem is *not*
in charge of, and trust the VFS layer to do the filesystem parts.

But no. You should *not* look at a virtual filesystem as a guide to
how to write a filesystem, or how to use the VFS. Look at a real FS. A
simple one, and preferably one that is built from the ground up to
look like a POSIX one, so that you don't end up getting confused by
all the nasty hacks to make it all look ok.

IOW, while FAT is a simple filesystem, don't look at that one, just
because then you end up with all the complications that come from
decades of non-UNIX filesystem history.

I'd say "look at minix or sysv filesystems", except those may be
simple but they also end up being so legacy that they aren't good
examples. You shouldn't use buffer-heads for anything new. But they
are still probably good examples for one thing: if you want to
understand the real power of dentries, look at either of the minix or
sysv 'namei.c' files. Just *look* at how simple they are. Ignore the
internal implementation of how a directory entry is then looked up on
disk - because that's obviously filesystem-specific - and instead just
look at the interface.

                Linus
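
As a rough user-space analogy for the split described in this message (a toy model, not kernel code, and not how the real dentry cache is implemented): the filesystem only answers "what does this name in this directory map to?", while a generic layer in front of it does all the caching, including remembering names that do not exist. Every class and method name below (ToyDirFS, ToyDcache, lookup, resolve) is hypothetical, and locking, lifetimes, and mounts are ignored entirely.

```python
class ToyDirFS:
    """Stand-in for a filesystem's directory lookup (hypothetical)."""

    def __init__(self, entries):
        self.entries = entries  # {(dir_ino, name): ino}
        self.lookups = 0        # how often the "on-disk" directory was consulted

    def lookup(self, dir_ino, name):
        self.lookups += 1
        return self.entries.get((dir_ino, name))  # None means "no such name"


class ToyDcache:
    """Generic name cache sitting in front of any object with lookup()."""

    def __init__(self, fs):
        self.fs = fs
        self.cache = {}  # {(dir_ino, name): ino, or None for a negative entry}

    def lookup(self, dir_ino, name):
        key = (dir_ino, name)
        if key not in self.cache:  # miss: ask the filesystem exactly once
            self.cache[key] = self.fs.lookup(dir_ino, name)
        return self.cache[key]     # hit: positive or negative entry

    def resolve(self, path, root_ino=1):
        """Walk a slash-separated path one component at a time."""
        ino = root_ino
        for name in filter(None, path.split("/")):
            ino = self.lookup(ino, name)
            if ino is None:
                return None
        return ino
```

With a ToyDirFS holding {(1, "etc"): 2, (2, "passwd"): 3}, resolving "/etc/passwd" twice through a ToyDcache consults the filesystem only on the first walk. The filesystem side stays a single small function, which is the property the minix and sysv 'namei.c' files are being recommended for.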