From: Dave Chinner <david@fromorbit.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Jens Axboe <axboe@kernel.dk>,
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
hannes@cmpxchg.org, clm@meta.com, linux-kernel@vger.kernel.org,
willy@infradead.org, linux-btrfs@vger.kernel.org,
linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 10/16] mm/filemap: make buffered writes work with RWF_UNCACHED
Date: Wed, 13 Nov 2024 00:36:18 +1100
Message-ID: <ZzNZ0iqx8EMlGVf0@dread.disaster.area>
In-Reply-To: <2sjhov4poma4o4efvwe2xk474iorxwvf4ifqa5oee74744ke2e@lipjana3f5ti>
On Tue, Nov 12, 2024 at 11:50:46AM +0200, Kirill A. Shutemov wrote:
> On Tue, Nov 12, 2024 at 07:02:33PM +1100, Dave Chinner wrote:
> > I think the post-IO invalidation that these IOs do is largely
> > irrelevant to how the page cache processes the write. Indeed,
> > from userspace, the functionality in this patchset would be
> > implemented like this:
> >
> > oneshot_data_write(fd, buf, len, off)
> > {
> > /* write into page cache */
> > pwrite(fd, buf, len, off);
> >
> > /* force the write through the page cache */
> > sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
> >
> > /* Invalidate the single use data in the cache now it is on disk */
> > posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
> > }
> >
> > Allowing the application to control writeback and invalidation
> > granularity is a much more flexible solution to the problem here;
> > when IO is sequential, delayed allocation will be allowed to ensure
> > large contiguous extents are created and that will greatly reduce
> > file fragmentation on XFS, btrfs, bcachefs and ext4. For random
> > writes, it'll submit async IOs in batches...
> >
> > Given that io_uring already supports sync_file_range() and
> > posix_fadvise(), I'm wondering why we need an new IO API to perform
> > this specific write-through behaviour in a way that is less flexible
> > than what applications can already implement through existing
> > APIs....
>
> Attaching the hint to the IO operation allows kernel to keep the data in
> page cache if it is there for other reason. You cannot do it with a
> separate syscall.
Sure we can. FADV_NOREUSE is attached to the struct file - that's
available to every IO that is done on that file. Hence we know,
before we start any IO on that file, whether we need to preserve
only the existing page cache or all the data we access.
Having a file marked like this doesn't affect any other application
that is accessing the same inode. It just means that the specific
fd opened by a specific process will not perturb the long term
residency of the page cache on that inode.
> Consider a scenario of a nightly backup of the data. The same data is in
> cache because the actual workload needs it. You don't want backup task to
> invalidate the data from cache. Your snippet would do that.
The code I presented was essentially just a demonstration of what
"uncached IO" is doing: it is actually cached IO, and it can be
done from userspace right now. Yes, it's not exactly the same
cache invalidation semantics, but that's not the point.
The point was that the existing APIs are *much more flexible* than
this proposal, and we don't actually need new kernel functionality
for applications to see the same benchmark results as Jens has
presented. All they need to do is be modified to use existing APIs.
The additional point to that end is that FADV_NOREUSE should be
hooked up to the conditional cache invalidation mechanism Jens added
to the page cache IO paths. Then we have all the functionality of
this patch set individually selectable by userspace applications
without needing a new IO API to be rolled out. i.e. the snippet
then becomes:

/* don't cache after IO */
posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
....
pwrite(fd, buf, len, off);
/* write through */
sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
Note how this doesn't need to block in sync_file_range() before
doing the invalidation anymore? We've separated the cache control
behaviour from the writeback behaviour. We can now do both write
back and write through buffered writes that clean up the page cache
after IO completion has occurred - write-through is not restricted
to uncached writes, nor is the cache purge after writeback
completion.
IOWs, we can do:
/* don't cache after IO */
posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
....
off = pos;
count = 4096;
while (off < pos + len) {
	ret = pwrite(fd, buf, count, off);
	/* get more data and put it in buf */
	off += ret;
}
/* write through */
sync_file_range(fd, pos, len, SYNC_FILE_RANGE_WRITE);
And now we only do one set of writeback on the file range, instead
of one per IO. And we still get the page cache being released on
writeback IO completion.
This is a *much* better API for IO and page cache control. It is not
constrained to individual IOs, so applications can allow the page
cache to write-combine data from multiple syscalls into a single
physical extent allocation and writeback IO.
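As a compilable (hedged) sketch of that write-combining pattern -
the function name, path, and chunk sizes are invented for the
demo, and it assumes Linux's sync_file_range() and a libc that
defines POSIX_FADV_NOREUSE:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Sketch of the pattern above: many buffered writes, then one
 * write-through submission over the whole range instead of one
 * per IO. Returns total bytes written or -1 on error.
 */
ssize_t batched_write_through(const char *path, size_t chunk, int nchunks)
{
	char buf[4096];
	off_t pos = 0, off;
	ssize_t total = 0;
	int fd, i;

	if (chunk > sizeof(buf))
		return -1;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	/* drop the cache once the data is clean, per-fd */
	posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);

	memset(buf, 'x', chunk);
	for (off = pos, i = 0; i < nchunks; i++) {
		ssize_t ret = pwrite(fd, buf, chunk, off);

		if (ret < 0) {
			close(fd);
			return -1;
		}
		off += ret;
		total += ret;
	}

	/* one writeback pass for the whole range, not one per IO */
	sync_file_range(fd, pos, total, SYNC_FILE_RANGE_WRITE);
	close(fd);
	return total;
}
```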
This is much more efficient for modern filesystems - the "writeback
per IO" model forces filesystems like XFS and ext4 to work like ext3
did, and defeats buffered write IO optimisations like delayed
allocation. If we are going to do small "allocation and write IO"
patterns, we may as well be using direct IO as it is optimised for
that sort of behaviour.
So let's consider the backup application example. IMO, backup
applications really don't want to use this new uncached IO
mechanism for either reading or writing data.
Backup programs do sequential data read IO as they walk the backup set -
if they are doing buffered IO then we -really- want readahead to be
active.
However, uncached IO turns off readahead, which is the equivalent of
the backup application doing:
posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
while (len > 0) {
	ret = pread(fd, buf, len, off);
	posix_fadvise(fd, off, ret, POSIX_FADV_DONTNEED);
	/* do stuff with buf */
	off += ret;
	len -= ret;
}
Sequential buffered read IO after setting FADV_RANDOM absolutely
*sucks* from a performance perspective.
This is when FADV_NOREUSE is useful. We can leave readahead turned
on, and when we do the first read from the page cache after
readahead completes, we can then apply the NOREUSE policy. i.e. if
the data we are reading has not been accessed, then turf it after
reading if NOREUSE is set. If the data was already resident in
cache, then leave it there as per a normal read.
IOWs, if we separate the cache control from the read IO itself,
there is no need to turn off readahead to implement "drop cache
on-read" semantics. We just need to know if the folio has been
accessed or not to determine what to do with it.
Let's also consider the backup data file - that is written
sequentially. It's going to be large and we don't know its size
ahead of time. If we are using buffered writes we want delayed
allocation to optimise the file layout and hence writeback IO
throughput. We also want to drop the page cache when writeback
eventually happens, but we really don't want writeback to happen on
every write.
IOWs, backup programs can take advantage of "drop cache when clean"
semantics, but can't really take any significant advantage from
per-IO write-through semantics. They really want per-file NOREUSE
write semantics that are separately controlled w.r.t. cache
write-through behaviour.
One of the points I tried to make was that the uncached IO proposal
smashes multiple disparate semantics into a single per-IO control
bit. The backup application example above shows exactly how that API
isn't actually very good for the applications that could benefit
from the functionality this patchset adds to the page cache to
support that single control bit...
-Dave.
--
Dave Chinner
david@fromorbit.com