From: Linus Torvalds
Date: Wed, 1 Nov 2023 10:10:03 -1000
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing
To: Jan Kara
Cc: Dave Chinner, Jeff Layton, Amir Goldstein, Kent Overstreet,
    Christian Brauner, Alexander Viro, John Stultz, Thomas Gleixner,
    Stephen Boyd, Chandan Babu R, "Darrick J. Wong", "Theodore Ts'o",
    Andreas Dilger, Chris Mason, Josef Bacik, David Sterba,
    Hugh Dickins, Andrew Morton, Jan Kara, David Howells,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-xfs@vger.kernel.org, linux-ext4@vger.kernel.org,
    linux-btrfs@vger.kernel.org, linux-mm@kvack.org,
    linux-nfs@vger.kernel.org
References: <2ef9ac6180e47bc9cc8edef20648a000367c4ed2.camel@kernel.org>
    <6df5ea54463526a3d898ed2bd8a005166caa9381.camel@kernel.org>
    <3d6a4c21626e6bbb86761a6d39e0fafaf30a4a4d.camel@kernel.org>
    <20231101101648.zjloqo5su6bbxzff@quack3>
In-Reply-To: <20231101101648.zjloqo5su6bbxzff@quack3>

On Wed, 1 Nov 2023 at 00:16, Jan Kara wrote:
>
> OK, but is this compatible with the current XFS behavior? AFAICS currently
> XFS sets sb->s_time_gran to 1 so timestamps currently stored on disk will
> have some mostly random garbage in low bits of the ctime.

I really *really* don't think we can use ctime as a "i_version"
replacement. The whole fine-granularity patches were well-intentioned,
but I do think they were broken.

Note that we can't use ctime as a "i_version" replacement for other
reasons too - you have filesystems like FAT - which people do want to
export - that have a single-second (or is it 2s?) granularity in
reality, even though they report a 1ns value in s_time_gran.

But here's a suggestion that people may hate, but that might just work
in practice:

 - get rid of i_version entirely

 - use the "known good" part of ctime as the upper bits of the change
   counter (and by "known good" I mean tv_sec - or possibly even
   "tv_sec / 2" if that dim FAT memory of mine is right)

 - make the rule be that ctime is *never* updated for atime updates
   (maybe that's already true, I didn't check - maybe it needs a new
   mount flag for nfsd)

 - have a per-inode in-memory and vfs-internal (entirely invisible to
   filesystems) "ctime modification counter" that is *NOT* a timestamp,
   and is *NOT* i_version

 - make the rule be that the "ctime modification counter" is always
   zero, *EXCEPT* if

    (a) I_VERSION_QUERIED is set

   AND

    (b) the ctime modification doesn't modify the "known good" part of
        ctime

so now the "statx change cookie" ends up being "high bits tv_sec of
ctime, low bits ctime modification cookie", and the end result of that
is:

 - if all the reads happen after the last write (common case), then the
   low bits will be zero, because I_VERSION_QUERIED wasn't set when
   ctime was modified

 - if you do a write *after* a modification, the ctime cookie is
   guaranteed to change, because either the known good (sec/2sec) part
   of ctime is new, *or* the counter gets updated

 - if the nfs server reboots, the in-memory counter will be cleared
   again, and so the change cookie will cause client cache
   invalidations, but *only* for those "ctime changed in the same
   second _after_ somebody did a read".

 - any long-time caches of files that don't get modified are all fine,
   because they will have those low bits zero and depend on just the
   stable part of ctime that works across filesystems. So there should
   be no nasty thundering herd issues on long-lived caches on lots of
   clients if the server reboots, or atime updates every 24 hours or
   anything like that.

and note that *NONE* of this requires any filesystem involvement
(except for the rule of "no atime changes ever impact ctime", which may
or may not already be true).

The filesystem does *not* know about that modification counter, there's
no new on-disk stable information.

It's entirely possible that I'm missing something obvious, but the
above sounds to me like the only time you'd have stale invalidations is
really the (unusual) case of having writes after cached reads, and then
a reboot.

We'd get rid of "inode_maybe_inc_iversion()" entirely, and instead
replace it with logic in inode_set_ctime_current() that basically does

 - if the stable part of ctime changes, clear the new 32-bit counter

 - if I_VERSION_QUERIED isn't set, clear the new 32-bit counter

 - otherwise, increment the new 32-bit counter

and then the STATX_CHANGE_COOKIE code basically just returns

    (stable part of ctime << 32) + new 32-bit counter

(and again, the "stable part of ctime" is either just tv_sec, or it's
"tv_sec >> 1" or whatever).

The above does not expose *any* changes to timestamps to users, and
should work across a wide variety of filesystems, without requiring any
special code from the filesystem itself.

And now please all jump on me and say "No, Linus, that won't work,
because XYZ".

Because it is *entirely* possible that I missed something truly
fundamental, and the above is completely broken for some obvious reason
that I just didn't think of.

               Linus