Date: Tue, 12 Nov 2024 19:02:33 +1100
From: Dave Chinner <david@fromorbit.com>
To: Jens Axboe
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hannes@cmpxchg.org,
	clm@meta.com, linux-kernel@vger.kernel.org, willy@infradead.org,
	kirill@shutemov.name, linux-btrfs@vger.kernel.org,
	linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED
References: <20241111234842.2024180-1-axboe@kernel.dk>
	<20241111234842.2024180-11-axboe@kernel.dk>
	<0487b852-6e2b-4879-adf1-88ba75bdecc0@kernel.dk>
In-Reply-To: <0487b852-6e2b-4879-adf1-88ba75bdecc0@kernel.dk>
On Mon, Nov 11, 2024 at 06:27:46PM -0700, Jens Axboe wrote:
> On 11/11/24 5:57 PM, Dave Chinner wrote:
> > On Mon, Nov 11, 2024 at 04:37:37PM -0700, Jens Axboe wrote:
> >> If RWF_UNCACHED is set for a write, mark new folios being written with
> >> uncached. This is done by passing in the fact that it's an uncached write
> >> through the folio pointer. We can only get there when IOCB_UNCACHED was
> >> allowed, which can only happen if the file system opts in. Opting in means
> >> they need to check for the LSB in the folio pointer to know if it's an
> >> uncached write or not. If it is, then FGP_UNCACHED should be used if
> >> creating new folios is necessary.
> >>
> >> Uncached writes will drop any folios they create upon writeback
> >> completion, but leave folios that may exist in that range alone. Since
> >> ->write_begin() doesn't currently take any flags, and to avoid needing
> >> to change the callback kernel wide, use the foliop being passed in to
> >> ->write_begin() to signal if this is an uncached write or not. File
> >> systems can then use that to mark newly created folios as uncached.
> >>
> >> Add a helper, generic_uncached_write(), that generic_file_write_iter()
> >> calls upon successful completion of an uncached write.
> >
> > This doesn't implement an "uncached" write operation. This
> > implements a cache write-through operation.
>
> It's uncached in the sense that the range gets pruned on writeback
> completion.

That's not the definition of "uncached". Direct IO is, by
definition, "uncached" because it bypasses the cache and is not
coherent with the contents of the cache.

This IO, however, is moving the data coherently through the cache
(both on read and write). The cached folios are transient - i.e.
-temporarily resident- in the cache whilst the IO is in progress -
but this behaviour does not make it "uncached IO".

Calling it "uncached IO" is simply wrong from any direction I look
at it....

> For write-through, I'd consider that just the fact that it
> gets kicked off once dirtied rather than wait for writeback to get
> kicked at some point.
>
> So I'd say write-through is a subset of that.

I think the post-IO invalidation that these IOs do is largely
irrelevant to how the page cache processes the write.
Indeed, from userspace, the functionality in this patchset would be
implemented like this:

oneshot_data_write(fd, buf, len, off)
{
	/* write into page cache */
	pwrite(fd, buf, len, off);

	/* force the write through the page cache */
	sync_file_range(fd, off, len,
			SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);

	/* Invalidate the single use data in the cache now it is on disk */
	posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}

Allowing the application to control writeback and invalidation
granularity is a much more flexible solution to the problem here;
when IO is sequential, delayed allocation will be allowed to ensure
large contiguous extents are created, and that will greatly reduce
file fragmentation on XFS, btrfs, bcachefs and ext4. For random
writes, it'll submit async IOs in batches...

Given that io_uring already supports sync_file_range() and
posix_fadvise(), I'm wondering why we need a new IO API to perform
this specific write-through behaviour in a way that is less flexible
than what applications can already implement through existing
APIs....

> > the same problems you are trying to work around in this series
> > with "uncached" writes.
> >
> > IOWs, what we really want is page cache write-through as an
> > automatic feature for buffered writes.
> I don't know who "we" is here - what I really want is for the write to
> get kicked off, but also reclaimed as part of completion. I don't want
> kswapd to do that, as it's inefficient.

"we" as in the general cohort of filesystem and mm developers who
interact closely with the page cache all the time. There was a fair
bit of talk about write-through and other transparent page cache IO
path improvements at LSFMM this year.

> > That also gives us a common place for adding cache write-through
> > trigger logic (think writebehind trigger logic similar to readahead)
> > and this is also a place where we could automatically tag mapping
> > ranges for reclaim on writeback completion....
> I appreciate that you seemingly like the concept, but not that you are
> also seemingly trying to commandeer this to be something else. Unless
> you like the automatic reclaiming as well, it's not clear to me.

I'm not trying to commandeer anything. Having thought about it more,
I think this new API is unnecessary for custom written applications
to perform fine grained control of page cache residency of one-shot
data. We already have APIs that allow applications to do exactly
what this patchset is doing.

Rather than choosing to modify whatever benchmark is being used to
use existing APIs, a choice was made to modify both the application
and the kernel to implement a whole new API....

I think that was the -wrong choice-.

I think this partially because the kernel modifications don't really
move us towards the goal of transparent mode switching in the page
cache. Read-through should be a mode that the readahead control
activates, not something triggered by a special read() syscall flag.
We already have access patterns and fadvise modes guiding this.
Write-through should be controlled in a similar way.

And making the data being read and written behave as transient page
cache objects should be done via an existing fadvise mode, too,
because the model you have implemented here exactly matches the
definition of FADV_NOREUSE:

	POSIX_FADV_NOREUSE
		The specified data will be accessed only once.

Having a new per-IO flag that effectively collides existing control
functionality into a single inflexible API bit doesn't really make a
whole lot of sense to me.
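As a rough sketch of that point (untested, error handling omitted,
and the oneshot_data_write_uring() helper name is hypothetical), the
same pwrite/sync_file_range/posix_fadvise sequence could be queued
as linked SQEs using the existing liburing helpers:

#define _GNU_SOURCE
#include <liburing.h>
#include <fcntl.h>

static void oneshot_data_write_uring(struct io_uring *ring, int fd,
				     const void *buf, unsigned len,
				     off_t off)
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;

	/* write into the page cache */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_write(sqe, fd, buf, len, off);
	sqe->flags |= IOSQE_IO_LINK;

	/* force the range through the page cache */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_sync_file_range(sqe, fd, len, off,
			SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
	sqe->flags |= IOSQE_IO_LINK;

	/* drop the now-clean single-use data from the cache */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_fadvise(sqe, fd, off, len, POSIX_FADV_DONTNEED);

	io_uring_submit(ring);

	/* reap the three completions */
	for (int i = 0; i < 3; i++) {
		io_uring_wait_cqe(ring, &cqe);
		io_uring_cqe_seen(ring, cqe);
	}
}

IOSQE_IO_LINK orders the three requests, so the fadvise() only runs
after the sync_file_range() has completed and the data is clean.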
IOWs, I'm not questioning whether we need rw-through modes and/or
IO-transient residency for page cache based IO - it's been on our
radar for a while. I'm more concerned that the chosen API in this
patchset is a poor one as it cannot replace any of the existing
controls we already have for these sorts of application directed
page cache manipulations...

-Dave.
-- 
Dave Chinner
david@fromorbit.com