linux-mm.kvack.org archive mirror
* Re: Any way to ensure minimal folio size and alignment for iomap based direct IO?
       [not found] <9598a140-aa45-4d73-9cd2-0c7ca6e4020a@gmx.com>
@ 2025-09-15 13:03 ` Matthew Wilcox
  2025-09-15 18:12   ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2025-09-15 13:03 UTC
  To: Qu Wenruo; +Cc: linux-fsdevel, linux-btrfs, linux-mm

On Mon, Sep 15, 2025 at 07:32:53PM +0930, Qu Wenruo wrote:
> - No fs block is allowed to cross (large) folio boundaries
>   This ensures that the btrfs checksum routine needs no multi-shot calls
>   for a single data block, and ensures we can use a lot of
>   bio_advance_iter_single() calls to move to the next block.

That's true for pagecache I/O, yes.

> But things are going crazy for iomap based direct IOs.
> 
> I'm getting the following bio during my local tests, which is using 8K fs
> block size with 4K page size:
> 
> [  130.957366] root=5 inode=2464 logical=15974400 length=8192 index=0
> bv_offset=0 bv_len=4096 is not aligned to 8192
> [  130.957376] i=0 page=0xffff8cc616e96000 offset=0 size=4096
> [  130.961977] i=1 page=0xffff8cc61730e000 offset=0 size=4096
> 
> The bio initially looks fine, the length is 8K, properly aligned.
> 
> But the dump of the bio shows it's not the case, instead of a large folio,
> it's two page sized folios.
> 
> This will not pass the btrfs requirement, but weirdly the alignment check
> for the iov_iter at check_direct_IO() shows no problem.
> 
> But unfortunately I can not find any folio allocation for the direct IO
> routine except the zero_page...
> 
> Any clue on the iomap part, or is the btrfs requirement incompatible with
> iomap in the first place?

It's nothing to do with iomap.  We can't make the assumption that
userspace is using large folios for, eg, anonymous memory.  Or if
the memory is backed by page cache, we can't assume that the file
that's mmaped is on a similarly-aligned block device.



* Re: Any way to ensure minimal folio size and alignment for iomap based direct IO?
  2025-09-15 13:03 ` Any way to ensure minimal folio size and alignment for iomap based direct IO? Matthew Wilcox
@ 2025-09-15 18:12   ` Pankaj Raghav (Samsung)
  2025-09-15 21:46     ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-09-15 18:12 UTC
  To: Matthew Wilcox
  Cc: Qu Wenruo, linux-fsdevel, linux-btrfs, linux-mm, mcgrof,
	p.raghav, kernel

> > But unfortunately I can not find any folio allocation for the direct IO
> > routine except the zero_page...
> > 
> > Any clue on the iomap part, or is the btrfs requirement incompatible with
> > iomap in the first place?
> 
> It's nothing to do with iomap.  We can't make the assumption that
> userspace is using large folios for, eg, anonymous memory.  Or if
> the memory is backed by page cache, we can't assume that the file
> that's mmaped is on a similarly-aligned block device.

Just to add to willy's point, XFS did not have this requirement when we
upstreamed block size > page size support. The only thing XFS does is
pad the direct I/O with zeroes if the I/O is smaller than the block size.

Is it very difficult to add multi-shot checksum calls for a data block
in btrfs? Does it break certain reliability guarantees?

Another crazy idea would be to either fall back to buffered I/O if this
condition is not met or allocate a new folio and copy the contents so that
it meets the condition of single large folio that matches the block
size (like we do in bio_copy_user_iov() when we cannot map).
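To illustrate the copy idea above, here is a minimal userspace sketch in C.
It is not kernel code: struct seg is a hypothetical stand-in for a bio_vec,
and bounce_copy() is an invented helper name. It only shows the shape of
"allocate one block-aligned buffer and copy the scattered pieces into it".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a bio_vec: one contiguous piece of the I/O. */
struct seg {
	const void *base;
	size_t len;
};

/*
 * Copy nr_segs segments into one freshly allocated buffer aligned to
 * block_size. Returns NULL on allocation failure or if the total length
 * is zero or not a multiple of block_size. The caller frees the result.
 */
static void *bounce_copy(const struct seg *segs, size_t nr_segs,
			 size_t block_size, size_t *out_len)
{
	size_t total = 0, off = 0;
	char *buf;

	assert(block_size != 0);

	for (size_t i = 0; i < nr_segs; i++)
		total += segs[i].len;
	if (total == 0 || total % block_size != 0)
		return NULL;

	/* C11 aligned_alloc: size must be a multiple of the alignment. */
	buf = aligned_alloc(block_size, total);
	if (!buf)
		return NULL;

	for (size_t i = 0; i < nr_segs; i++) {
		memcpy(buf + off, segs[i].base, segs[i].len);
		off += segs[i].len;
	}
	*out_len = total;
	return buf;
}
```

With two 4K segments and an 8K block size, the result is a single 8K-aligned
buffer holding both pieces back to back. The cost is an extra copy per I/O,
which is the usual trade-off of a bounce buffer.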

--
Pankaj



* Re: Any way to ensure minimal folio size and alignment for iomap based direct IO?
  2025-09-15 18:12   ` Pankaj Raghav (Samsung)
@ 2025-09-15 21:46     ` Qu Wenruo
  2025-09-15 23:05       ` Matthew Wilcox
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2025-09-15 21:46 UTC
  To: Pankaj Raghav (Samsung), Matthew Wilcox
  Cc: linux-fsdevel, linux-btrfs, linux-mm, mcgrof, p.raghav



On 2025/9/16 03:42, Pankaj Raghav (Samsung) wrote:
>>> But unfortunately I can not find any folio allocation for the direct IO
>>> routine except the zero_page...
>>>
>>> Any clue on the iomap part, or is the btrfs requirement incompatible with
>>> iomap in the first place?
>>
>> It's nothing to do with iomap.  We can't make the assumption that
>> userspace is using large folios for, eg, anonymous memory.  Or if
>> the memory is backed by page cache, we can't assume that the file
>> that's mmaped is on a similarly-aligned block device.
> 
> Just to add to willy's point, XFS did not have this requirement when we
> upstreamed block size > page size support. The only thing XFS does is
> pad the direct I/O with zeroes if the I/O is smaller than the block size.
> 
> Is it very difficult to add multi-shot checksum calls for a data block
> in btrfs? Does it break certain reliability guarantees?

I'd say it's not impossible, but still not an easy thing to do.

E.g. at data read time we need to verify the checksum. Currently we're 
able to do the checksum for one block in one go, then advance the bio iter.

But with a multi-shot approach, we have to update the shash several times
before we can determine if the result is correct.

There is even a compression algorithm, lzo, which cannot support a
multi-shot interface.

Thankfully compression is only possible for buffered IO, so it's not 
involved in this case.
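As a rough illustration of one-shot versus multi-shot checksumming, the
userspace sketch below uses a plain bitwise CRC32C as a stand-in checksum.
The function names and struct seg are invented for this example and are not
the kernel shash API; the point is only that feeding a block piece by piece
yields the same digest as hashing it in one go, at the price of carrying
state between calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a bio_vec: one contiguous piece of a block. */
struct seg {
	const void *base;
	size_t len;
};

/* Feed len bytes into a running CRC32C state (reflected poly 0x82F63B78). */
static uint32_t crc32c_update(uint32_t state, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--) {
		state ^= *p++;
		for (int k = 0; k < 8; k++)
			state = (state >> 1) ^
				(0x82F63B78u & (0u - (state & 1u)));
	}
	return state;
}

/* One-shot: the whole block is contiguous in memory. */
static uint32_t crc32c_oneshot(const void *block, size_t len)
{
	return ~crc32c_update(~0u, block, len);
}

/* Multi-shot: the block is split across several segments. */
static uint32_t crc32c_multishot(const struct seg *segs, size_t nr_segs)
{
	uint32_t state = ~0u;

	assert(segs != NULL || nr_segs == 0);

	for (size_t i = 0; i < nr_segs; i++)
		state = crc32c_update(state, segs[i].base, segs[i].len);
	return ~state;
}
```

For an 8K block backed by two 4K pages, crc32c_multishot() over the two
halves returns the same value as crc32c_oneshot() over a contiguous copy.
The result is only known after the last piece has been fed in, which is the
extra bookkeeping described above.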

> 
> Another crazy idea would be to either fall back to buffered I/O if this
> condition is not met or allocate a new folio and copy the contents so that
> it meets the condition of single large folio that matches the block
> size (like we do in bio_copy_user_iov() when we cannot map).

I'd prefer to reject the direct IO completely, but also fine with 
falling back to buffered IO.

However then the problem is why the read iov_iter passes the alignment 
check, but we still get the bio not meeting the large folio requirement?


Anyway, the direction is clear: double-check the iov_iter alignment and, in
the worst case, introduce multi-shot checksum verification code.

Thanks a lot for the help,
Qu

> 
> --
> Pankaj




* Re: Any way to ensure minimal folio size and alignment for iomap based direct IO?
  2025-09-15 21:46     ` Qu Wenruo
@ 2025-09-15 23:05       ` Matthew Wilcox
  2025-09-15 23:20         ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2025-09-15 23:05 UTC
  To: Qu Wenruo
  Cc: Pankaj Raghav (Samsung),
	linux-fsdevel, linux-btrfs, linux-mm, mcgrof, p.raghav

On Tue, Sep 16, 2025 at 07:16:48AM +0930, Qu Wenruo wrote:
> > Is it very difficult to add multi-shot checksum calls for a data block
> > in btrfs? Does it break certain reliability guarantees?
> 
> I'd say it's not impossible, but still not an easy thing to do.
> 
> E.g. at data read time we need to verify the checksum. Currently we're able
> to do the checksum for one block in one go, then advance the bio iter.
> 
> But with a multi-shot approach, we have to update the shash several times
> before we can determine if the result is correct.
> 
> There is even a compression algorithm, lzo, which cannot support a
> multi-shot interface.
> 
> Thankfully compression is only possible for buffered IO, so it's not
> involved in this case.

Would it be acceptable to vmap() the pages and do the checksum on the
virtual address?

> However then the problem is why the read iov_iter passes the alignment
> check, but we still get the bio not meeting the large folio requirement?

The virtual address _is_ aligned.  It's just not backed with large
folios, for whatever reason.
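The difference between the two checks can be modelled in a few lines of
userspace C. This is only a sketch under stated assumptions: both function
names are invented, the address used in the usage note is hypothetical, and
the second check looks only at segment lengths, not at each segment's offset
within its folio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * What an alignment check on the user buffer can see: the virtual address
 * and the length. Nothing here says how the memory is backed.
 */
static bool iov_is_aligned(uintptr_t addr, size_t len, size_t block_size)
{
	assert(block_size != 0);
	return addr % block_size == 0 && len % block_size == 0;
}

/*
 * What the "no block crosses a folio boundary" rule needs: every
 * physically contiguous segment must cover whole blocks, so that segment
 * boundaries only ever fall on block boundaries.
 */
static bool blocks_fit_in_segs(const size_t *seg_lens, size_t nr_segs,
			       size_t block_size)
{
	assert(block_size != 0);

	for (size_t i = 0; i < nr_segs; i++) {
		if (seg_lens[i] % block_size != 0)
			return false;
	}
	return true;
}
```

An 8K request at an 8K-aligned address such as 0x10002000 passes
iov_is_aligned(), yet with the two 4096-byte segments from the dump earlier
in the thread blocks_fit_in_segs() fails for an 8K block size. The same
segments pass for a 4K block size.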




* Re: Any way to ensure minimal folio size and alignment for iomap based direct IO?
  2025-09-15 23:05       ` Matthew Wilcox
@ 2025-09-15 23:20         ` Qu Wenruo
  0 siblings, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2025-09-15 23:20 UTC
  To: Matthew Wilcox
  Cc: Pankaj Raghav (Samsung),
	linux-fsdevel, linux-btrfs, linux-mm, mcgrof, p.raghav



On 2025/9/16 08:35, Matthew Wilcox wrote:
> On Tue, Sep 16, 2025 at 07:16:48AM +0930, Qu Wenruo wrote:
>>> Is it very difficult to add multi-shot checksum calls for a data block
>>> in btrfs? Does it break certain reliability guarantees?
>>
>> I'd say it's not impossible, but still not an easy thing to do.
>>
>> E.g. at data read time we need to verify the checksum. Currently we're able
>> to do the checksum for one block in one go, then advance the bio iter.
>>
>> But with a multi-shot approach, we have to update the shash several times
>> before we can determine if the result is correct.
>>
>> There is even a compression algorithm, lzo, which cannot support a
>> multi-shot interface.
>>
>> Thankfully compression is only possible for buffered IO, so it's not
>> involved in this case.
> 
> Would it be acceptable to vmap() the pages and do the checksum on the
> virtual address?

That may not be any better than multi-shot runs, as we still need to
advance the iter by a sub-block sized length and map each piece.

Considering we need to do sub-block handling anyway, I'll just come up 
with a helper to handle the iteration.
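One possible shape for such an iteration helper, sketched in userspace C.
This is not the btrfs implementation: struct seg, walk_blocks(), piece_fn
and the counting callback are all hypothetical names. The helper walks the
segments, never hands out a piece that crosses a block boundary, and tells
the callback when a block is complete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio_vec: one contiguous piece of the I/O. */
struct seg {
	const void *base;
	size_t len;
};

/* Called once per sub-block piece; block_done marks the last piece. */
typedef void (*piece_fn)(void *ctx, const void *data, size_t len,
			 bool block_done);

/* Walk all segments block by block. Returns the number of full blocks. */
static size_t walk_blocks(const struct seg *segs, size_t nr_segs,
			  size_t block_size, piece_fn fn, void *ctx)
{
	size_t filled = 0;	/* bytes of the current block handed out */
	size_t blocks = 0;

	assert(block_size != 0 && fn != NULL);

	for (size_t i = 0; i < nr_segs; i++) {
		const char *p = segs[i].base;
		size_t left = segs[i].len;

		while (left) {
			size_t n = block_size - filled;
			bool done;

			if (n > left)
				n = left;
			filled += n;
			done = (filled == block_size);
			fn(ctx, p, n, done);
			if (done) {
				filled = 0;
				blocks++;
			}
			p += n;
			left -= n;
		}
	}
	return blocks;
}

/* Example callback: count pieces, bytes and completed blocks. */
struct piece_stats {
	size_t pieces, bytes, blocks_done;
};

static void count_piece(void *ctx, const void *data, size_t len,
			bool block_done)
{
	struct piece_stats *st = ctx;

	(void)data;
	st->pieces++;
	st->bytes += len;
	if (block_done)
		st->blocks_done++;
}
```

A real callback would feed each piece into the checksum state and compare
the digest when block_done is set. When every segment already covers whole
blocks, each block arrives as a single piece, so the common case stays
one-shot.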

> 
>> However then the problem is why the read iov_iter passes the alignment
>> check, but we still get the bio not meeting the large folio requirement?
> 
> The virtual address _is_ aligned.  It's just not backed with large
> folios, for whatever reason.
> 

Oh, that explains the problem.

So even if we do extra checks inside btrfs to ensure all the pages of the
iter are backed by large folios, it will still be very problematic for
user space programs, as they have no control over the underlying page
layout and will hit random DIO failures or fallbacks, which is not
acceptable for end users.

Thanks a lot for the definitive answer,
Qu


