Date: Wed, 13 Nov 2024 00:36:18 +1100
From: Dave Chinner <david@fromorbit.com>
To: "Kirill A. Shutemov"
Shutemov" Cc: Jens Axboe , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hannes@cmpxchg.org, clm@meta.com, linux-kernel@vger.kernel.org, willy@infradead.org, linux-btrfs@vger.kernel.org, linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org Subject: Re: [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED Message-ID: References: <20241111234842.2024180-1-axboe@kernel.dk> <20241111234842.2024180-11-axboe@kernel.dk> <0487b852-6e2b-4879-adf1-88ba75bdecc0@kernel.dk> <2sjhov4poma4o4efvwe2xk474iorxwvf4ifqa5oee74744ke2e@lipjana3f5ti> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <2sjhov4poma4o4efvwe2xk474iorxwvf4ifqa5oee74744ke2e@lipjana3f5ti> X-Stat-Signature: sh9jgbkx6kqxgggqo9fz9i8ypgewcsox X-Rspamd-Queue-Id: B4A0C80024 X-Rspamd-Server: rspam08 X-Rspam-User: X-HE-Tag: 1731418505-965870 X-HE-Meta: U2FsdGVkX1+HYWrzCvjl5jt13Gsg5W7/5iEyd4d+neioqQ5TsJ72Om5xRfz4q3qtXtv98Q3jkuCNlJlr9fhhHu5sxv4vSG3D7fzIg8s1q/TVFtV6urjvByAbrzs69/8dsiA/z8rqTIsiWewVqJomSxeS4aySbXPcE8Yva3ZoWxevJ69imyCIL0JJuCvBpJ9V3C6mQ/C/IiqY95y1tp5Mk5LsWcgTRS732lb0HM2WYt9KvjP1JxE5cZ/v/DIx5xUTTxmlbNeWQHoey482MYJcJCloh51xgKXCz9rVWE+dR8X+NDdgXr7joU9H2R6YB52SCt1RAXWHusvjeA1r+cpoNWLAj0JbHhs/A6XjGfdTnyfwVXmRrSRzmwLFIsIK2nBCiij+JZ0EqXZ/XdPtNjVhjQyER6Fdocxz8bKUzyjXSHk2qg6TeS7xnJLnFkexZ34CKcdXcP19ndLXBGMVoK/8LL/t1EPS+M0XMgw7gkWFJHrk7r0u+VxqaCv44Ch4BC63PsfVbF74zdXG9h23Yr/2K3pZ1z0/mkx3KpjEquanuHbY1cMlGXEJe/ZFYRj6SjlVSOtZSvKl2heRRa9bCSe3bqeoTJHXBkOKiIJxai4gPJQVRpdVI6XFs6EmvM+/tuaiMu2nt2GnWcKvQZGkvshC4Wik6xkq8oyqVsl3Px0W6od8UDSKjXecn0kL9zbsJfX/vGCf9TM00BVb1P+tdQP5xtolI0vrljK+mvNjJGjGaXdck23KfH0LDiRAUZFsRTcpKnuV+Yfp70nY/qYgYkl4E6CiBLihgNbQSd3ZscngSBMp5L1QPg/JgInrolRkhIc8rvk0QxzJCb1p+mH9AjvSW6VJ+uyBOPHudSOiaWaP66+4hdin3bYJv4gMaqcGW6wk+ogCaBW6VMF1ugCzoBXq4h9WQ5F/jH9B+LpXz5XOp96l8uQa2MYmZx3xmjeBAgTYH5cHS7a8wnYWS+wlWuB jjmBwLug WKrjjUH5Q7fX5YzkzG02b38WlB1LM0pv9IoAdqjQgy5A9asLJ7EkfVY/i6O/HsPi7REpcy5wd/Bex0MK5uuqixbnoYR1WjMTYacUfgDvfx0cN0n09cSXqQiJv6ArJ9nQC9up3qzf1S7xTebWL0GBClutu44G78qXOtZ46nHShlNXetwoV/Pq+mkBXfQA27P8IgKfNO35f1GUfsGnUpHuMDDY2JY7OWGxwqOUfqTGbRbLRQZb/+XWHzfMaxBjFsPYkBIjO6KFsr3p5IzSCDY10xxQAJPwhq68VGk4aL7yjg+vS+ATWgJD/HSlHv9SsmaKCrS+jIMtRJJ6Sp2AIRqiJNN2EnPKMF6ubyoHZc8yGhyT9rgejBSlD6tEZt+Ismm8TkPHmiimm8wLmTHoCkSfp1NgXP0I8MkkPqqubXE9sqxMFXBUFprTg2lPDB6+2BfDYZkAwyvYiNDWg4v1Z3s7pQaKg4Q== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Nov 12, 2024 at 11:50:46AM +0200, Kirill A. Shutemov wrote: > On Tue, Nov 12, 2024 at 07:02:33PM +1100, Dave Chinner wrote: > > I think the post-IO invalidation that these IOs do is largely > > irrelevant to how the page cache processes the write. 
> Consider a scenario of a nightly backup of the data. The same data
> is in cache because the actual workload needs it. You don't want
> the backup task to invalidate the data from the cache. Your snippet
> would do that.

The code I presented was essentially just a demonstration of what
"uncached IO" is doing: it is actually cached IO, and it can be done
from userspace right now. Yes, it's not exactly the same cache
invalidation semantics, but that's not the point.

The point was that the existing APIs are *much more flexible* than
this proposal, and we don't actually need new kernel functionality
for applications to see the same benchmark results as Jens has
presented. All they need to do is be modified to use existing APIs.

The additional point to that end is that FADV_NOREUSE should be
hooked up to the conditional cache invalidation mechanism Jens added
to the page cache IO paths. Then we have all the functionality of
this patch set individually selectable by userspace applications
without needing a new IO API to be rolled out. i.e. the snippet then
becomes:

	/* don't cache after IO */
	posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
	....
	pwrite(fd, buf, len, off);
	/* write through */
	sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);

Note how this doesn't need to block in sync_file_range() before
doing the invalidation anymore? We've separated the cache control
behaviour from the writeback behaviour. We can now do both write
back and write through buffered writes that clean up the page cache
after IO completion has occurred - write-through is not restricted
to uncached writes, nor is the cache purge after writeback
completion. IOWs, we can do:

	/* don't cache after IO */
	posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
	....
	off = pos; count = 4096;
	while (off < pos + len) {
		ret = pwrite(fd, buf, count, off);
		/* get more data and put it in buf */
		off += ret;
	}
	/* write through */
	sync_file_range(fd, pos, len, SYNC_FILE_RANGE_WRITE);

And now we only do one set of writeback on the file range, instead
of one per IO. And we still get the page cache being released on
writeback IO completion.

This is a *much* better API for IO and page cache control. It is not
constrained to individual IOs, so applications can allow the page
cache to write-combine data from multiple syscalls into a single
physical extent allocation and writeback IO. This is much more
efficient for modern filesystems - the "writeback per IO" model
forces filesystems like XFS and ext4 to work like ext3 did, and
defeats buffered write IO optimisations like delayed allocation. If
we are going to do small "allocation and write IO" patterns, we may
as well be using direct IO as it is optimised for that sort of
behaviour.

So let's consider the backup application example. IMO, backup
applications really don't want to use this new uncached IO mechanism
for either reading or writing data.

Backup programs do sequential data read IO as they walk the backup
set - if they are doing buffered IO then we -really- want readahead
to be active. However, uncached IO turns off readahead, which is the
equivalent of the backup application doing:

	posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
	while (len > 0) {
		ret = pread(fd, buf, len, off);
		posix_fadvise(fd, off, ret, POSIX_FADV_DONTNEED);
		/* do stuff with buf */
		off += ret;
		len -= ret;
	}

Sequential buffered read IO after setting FADV_RANDOM absolutely
*sucks* from a performance perspective.

This is when FADV_NOREUSE is useful. We can leave readahead turned
on, and when we do the first read from the page cache after
readahead completes, we can then apply the NOREUSE policy. i.e. if
the data we are reading has not been accessed, then turf it after
reading if NOREUSE is set. If the data was already resident in
cache, then leave it there as per a normal read.

IOWs, if we separate the cache control from the read IO itself,
there is no need to turn off readahead to implement "drop cache
on-read" semantics. We just need to know if the folio has been
accessed or not to determine what to do with it.

Let's also consider the backup data file - that is written
sequentially. It's going to be large and we don't know its size
ahead of time. If we are using buffered writes we want delayed
allocation to optimise the file layout and hence writeback IO
throughput. We also want to drop the page cache when writeback
eventually happens, but we really don't want writeback to happen on
every write. IOWs, backup programs can take advantage of "drop cache
when clean" semantics, but can't really take any significant
advantage from per-IO write-through semantics.

IOWs, backup applications really want per-file NOREUSE write
semantics that are separately controlled w.r.t. cache write-through
behaviour - something like the sketch below.
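To make that concrete, here is a rough sketch of the backup data
file writer under those semantics. It assumes FADV_NOREUSE drops
clean folios once writeback completes, as argued above;
next_backup_chunk() and BATCH are hypothetical stand-ins for the
application's own data source and batching policy:

	/* BATCH and next_backup_chunk() are hypothetical stand-ins */
	const off_t BATCH = 16 * 1024 * 1024;
	char buf[1024 * 1024];
	off_t off = 0, synced = 0;
	ssize_t n;

	int fd = open(backup_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	/* per-file policy: drop the page cache once the data is clean */
	posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);

	while ((n = next_backup_chunk(buf, sizeof(buf))) > 0) {
		/* buffered, sequential: delayed allocation can build
		 * large contiguous extents across these writes */
		pwrite(fd, buf, n, off);
		off += n;

		/* kick writeback in large batches, without waiting */
		if (off - synced >= BATCH) {
			sync_file_range(fd, synced, off - synced,
					SYNC_FILE_RANGE_WRITE);
			synced = off;
		}
	}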
One of the points I tried to make was that the uncached IO proposal
smashes multiple disparate semantics into a single per-IO control
bit. The backup application example above shows exactly how that API
isn't actually very good for the applications that could benefit
from the functionality this patchset adds to the page cache to
support that single control bit...

-Dave.
-- 
Dave Chinner
david@fromorbit.com