From: Jan Kara <jack@suse.cz>
To: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>, Jan Kara <jack@suse.cz>,
Matthew Wilcox <willy@infradead.org>, Qu Wenruo <wqu@suse.com>,
linux-btrfs@vger.kernel.org, djwong@kernel.org,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-block@vger.kernel.org, linux-mm@kvack.org,
martin.petersen@oracle.com, jack@suse.com
Subject: Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO
Date: Tue, 21 Oct 2025 11:22:20 +0200
Message-ID: <rlu3rbmpktq5f3vgex3zlfjhivyohkhr5whpdmv3lscsgcjs7r@4zqutcey7kib>
In-Reply-To: <aPc7HVRJYXA1hT8h@infradead.org>

On Tue 21-10-25 00:49:49, Christoph Hellwig wrote:
> On Mon, Oct 20, 2025 at 09:00:50PM +0200, David Hildenbrand wrote:
> > Just FYI, because it might be interesting in this context.
> >
> > For anonymous memory we have this working by only writing the folio out if
> > it is completely unmapped and there are no unexpected folio references/pins
> > (see pageout()), and only allowing to write to such a folio ("reuse") if
> > SWP_STABLE_WRITES is not set (see do_swap_page()).
> >
> > So once we start writeback the folio has no writable page table mappings
> > (unmapped) and no GUP pins. Consequently, when trying to write to it we can
> > just fallback to creating a page copy without causing trouble with GUP pins.
>
> Yeah. But anonymous is the easy case, the pain is direct I/O to file
> mappings. Maybe the right answer is to just fail pinning them and fall
> back to (dontcache) buffered I/O.

I agree file mappings are more painful but we can also have interesting
cases with anon pages:

P - anon page

Thread 1                        Thread 2
setup DIO read to P             setup DIO write from P

And now you can get checksum failures for the write unless the write is
bounced (falling back to dontcache). Similarly with reads:

Thread 1                        Thread 2
setup DIO read to P             setup DIO read to P

you can get a read checksum mismatch unless both reads are bounced (bouncing
one of the reads is not enough because the memcpy from the bounce page to
the final buffer may break checksum computation of the IO going directly).
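
To make the two races above concrete, here is a minimal user-space sketch in
Python. It is not kernel code: the function names, the `racing_store` hook and
the use of CRC32 are all assumptions made for illustration, and the
interleaving is forced by hand instead of using real threads. It only shows
why a checksum computed over a shared page can stop matching the data, and why
a private bounce copy avoids that.

```python
import zlib


def dio_write(src, bounce, racing_store=None):
    """Model a checksummed DIO write from the user buffer `src`.

    The checksum is computed first and the data is transferred afterwards.
    `racing_store` (hypothetical hook) runs in between and stands for another
    IO, such as a DIO read into the same page, modifying `src`.
    Returns True if the transferred data still matches the checksum.
    """
    data = bytes(src) if bounce else src      # bounce = private snapshot
    csum = zlib.crc32(data)
    if racing_store is not None:
        racing_store(src)
    return zlib.crc32(bytes(data)) == csum    # what the device would see


def dio_read(dest, disk_data, bounce, racing_store=None):
    """Model a checksummed DIO read of `disk_data` into the user buffer `dest`.

    Returns True if verification on completion succeeds.
    """
    expected = zlib.crc32(disk_data)          # checksum stored with the data
    target = bytearray(len(disk_data)) if bounce else dest
    target[:] = disk_data                     # device fills the target
    if racing_store is not None:
        racing_store(dest)                    # e.g. memcpy of another bounced read
    ok = zlib.crc32(target) == expected       # verification on completion
    if bounce:
        dest[:] = target                      # copy out of the bounce buffer
    return ok


def scribble(buf):
    """Stand-in for the racing IO: overwrite the start of the shared page."""
    buf[:4] = b"XXXX"
```

In this model, `dio_write(page, bounce=False, racing_store=scribble)` and
`dio_read(page, disk, bounce=False, racing_store=scribble)` both report a
mismatch, while the same calls with `bounce=True` verify correctly. In the read
case `scribble` plays the role of the other read's memcpy from its bounce page,
which is why bouncing only one of the two reads does not help the one going
directly.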

So to avoid checksum failures even if the user screws up and the buffers
overlap, we need to bounce every IO, even to/from anon memory. Or we need to
block one of the IOs until the other one completes. A scheme that could work:
we'd try to acquire a kind of exclusive pin on all the pages (page lock?). If
we succeed, we run the IO directly. If we don't succeed, we wait for the
exclusive pins to be released, acquire a standard pin (to block exclusive
pinning) and *then* submit uncached IO. But it is all rather complex and
I'm not sure it's worth it...
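
The pin bookkeeping of that scheme can be sketched as a small single-threaded
model. Everything here is hypothetical: `Page`, `try_pin_exclusive()`,
`pin_shared()` and friends are illustration names, not existing kernel
interfaces, and the actual waiting is left out (the model just refuses a
standard pin while an exclusive pin is still held).

```python
from dataclasses import dataclass


@dataclass
class Page:
    exclusive: bool = False   # hypothetical exclusive pin is held
    shared: int = 0           # number of standard pins


def try_pin_exclusive(pages):
    """All-or-nothing: succeed only if no page has any pin at all."""
    if any(p.exclusive or p.shared > 0 for p in pages):
        return False
    for p in pages:
        p.exclusive = True
    return True


def unpin_exclusive(pages):
    """Called when the direct (unbounced) IO completes."""
    for p in pages:
        p.exclusive = False


def pin_shared(pages):
    """Take standard pins; only legal once exclusive pins are released."""
    if any(p.exclusive for p in pages):
        raise RuntimeError("must wait for exclusive pins to be released")
    for p in pages:
        p.shared += 1


def unpin_shared(pages):
    """Called when the bounced (uncached) IO completes."""
    for p in pages:
        p.shared -= 1
```

A walk through two overlapping IOs on one page: the first
`try_pin_exclusive([page])` succeeds and that IO runs directly; the second
fails, so that IO waits for `unpin_exclusive([page])`, then calls
`pin_shared([page])` and is submitted bounced. While its standard pin is held,
any further `try_pin_exclusive([page])` fails, so no new direct IO can start on
the page until `unpin_shared([page])`.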

For file mappings things get even more complex because you can do:

P - file mapping page

Thread 1                        Thread 2
setup DIO write from P          setup buffered write from Q to P

and you get checksum failures for the DIO write. So if we don't bounce the
DIO, we'd also have to teach buffered IO to avoid corrupting buffers of DIO
in flight.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR