* [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
@ 2025-03-20 11:41 Luis Chamberlain
2025-03-20 12:11 ` Matthew Wilcox
` (3 more replies)
0 siblings, 4 replies; 25+ messages in thread
From: Luis Chamberlain @ 2025-03-20 11:41 UTC (permalink / raw)
To: linux-fsdevel, linux-mm, linux-block
Cc: lsf-pc, david, leon, hch, kbusch, sagi, axboe, joro, brauner,
hare, willy, djwong, john.g.garry, ritesh.list, p.raghav,
gost.dev, da.gomez, Luis Chamberlain
We've been constrained to a max single 512 KiB IO for a while now on x86_64.
This is due to the number of DMA segments and the segment size. With LBS the
segments can be much bigger without using huge pages, and so on a 64 KiB
block size filesystem you can now see 2 MiB IOs when using buffered IO.
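(For reference, the 512 KiB figure is just segment arithmetic: with the
128 DMA segments the nvme-pci driver allows and plain 4 KiB pages,
128 * 4 KiB = 512 KiB per IO.)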
But direct IO is still crippled, because allocations are from anonymous
memory, and unless you are using mTHP you won't get large folios. mTHP
is also non-deterministic, and so you end up in a worse situation for
direct IO if you want to rely on large folios, as you may *sometimes*
end up with large folios and sometimes you might not. IO patterns can
therefore be erratic.
As I just posted in a simple RFC [0], I believe the two step DMA API
helps resolve this. Provided we move the block integrity stuff to the
new DMA API as well, the only patches really needed to support larger
IOs for direct IO for NVMe are:
iomap: use BLK_MAX_BLOCK_SIZE for the iomap zero page
blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit
The other two nvme-pci patches in that series are to just help with
experimentation now and they can be ignored.
It does beg a few questions:
- How are we computing the new max single IO anyway? Are we really
bounded only by what devices support?
- Do we believe this is the step in the right direction?
- Is 2 MiB a sensible max block sector size limit for the next few years?
- What other considerations should we have?
- Do we want something more deterministic for large folios for direct IO?
[0] https://lkml.kernel.org/r/20250320111328.2841690-1-mcgrof@kernel.org
Luis
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-20 11:41 [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64 Luis Chamberlain
@ 2025-03-20 12:11 ` Matthew Wilcox
2025-03-20 13:29 ` Daniel Gomez
2025-03-20 13:47 ` Daniel Gomez
` (2 subsequent siblings)
3 siblings, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2025-03-20 12:11 UTC (permalink / raw)
To: Luis Chamberlain
Cc: linux-fsdevel, linux-mm, linux-block, lsf-pc, david, leon, hch,
kbusch, sagi, axboe, joro, brauner, hare, djwong, john.g.garry,
ritesh.list, p.raghav, gost.dev, da.gomez
On Thu, Mar 20, 2025 at 04:41:11AM -0700, Luis Chamberlain wrote:
> We've been constrained to a max single 512 KiB IO for a while now on x86_64.
...
> It does beg a few questions:
>
> - How are we computing the new max single IO anyway? Are we really
> bounded only by what devices support?
> - Do we believe this is the step in the right direction?
> - Is 2 MiB a sensible max block sector size limit for the next few years?
> - What other considerations should we have?
> - Do we want something more deterministic for large folios for direct IO?
Is the 512KiB limit one that real programs actually hit? Would we
see any benefit from increasing it? A high end NVMe device has a
bandwidth limit around 10GB/s, so that's reached around 20k IOPS,
which is almost laughably low.
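(10 GB/s divided by 512 KiB per IO works out to roughly 19,000-20,000
IOPS.)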
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-20 12:11 ` Matthew Wilcox
@ 2025-03-20 13:29 ` Daniel Gomez
2025-03-20 14:31 ` Matthew Wilcox
0 siblings, 1 reply; 25+ messages in thread
From: Daniel Gomez @ 2025-03-20 13:29 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Luis Chamberlain, linux-fsdevel, linux-mm, linux-block, lsf-pc,
david, leon, hch, kbusch, sagi, axboe, joro, brauner, hare,
djwong, john.g.garry, ritesh.list, p.raghav, gost.dev, da.gomez
On Thu, Mar 20, 2025 at 12:11:47PM +0100, Matthew Wilcox wrote:
> On Thu, Mar 20, 2025 at 04:41:11AM -0700, Luis Chamberlain wrote:
> > We've been constrained to a max single 512 KiB IO for a while now on x86_64.
> ...
> > It does beg a few questions:
> >
> > - How are we computing the new max single IO anyway? Are we really
> > bounded only by what devices support?
> > - Do we believe this is the step in the right direction?
> > - Is 2 MiB a sensible max block sector size limit for the next few years?
> > - What other considerations should we have?
> > - Do we want something more deterministic for large folios for direct IO?
>
> Is the 512KiB limit one that real programs actually hit? Would we
> see any benefit from increasing it? A high end NVMe device has a
> bandwidth limit around 10GB/s, so that's reached around 20k IOPS,
> which is almost laughably low.
Current devices do more than that. A quick search gives me 14GB/s and 2.5M IOPS
for gen5 devices:
https://semiconductor.samsung.com/ssd/enterprise-ssd/pm1743/
A gen6 device goes even further.
Daniel
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-20 11:41 [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64 Luis Chamberlain
2025-03-20 12:11 ` Matthew Wilcox
@ 2025-03-20 13:47 ` Daniel Gomez
2025-03-20 14:54 ` Christoph Hellwig
2025-03-20 14:18 ` Christoph Hellwig
2025-03-20 18:46 ` Ritesh Harjani
3 siblings, 1 reply; 25+ messages in thread
From: Daniel Gomez @ 2025-03-20 13:47 UTC (permalink / raw)
To: Luis Chamberlain
Cc: linux-fsdevel, linux-mm, linux-block, lsf-pc, david, leon, hch,
kbusch, sagi, axboe, joro, brauner, hare, willy, djwong,
john.g.garry, ritesh.list, p.raghav, gost.dev, da.gomez
On Thu, Mar 20, 2025 at 04:41:11AM +0100, Luis Chamberlain wrote:
> We've been constrained to a max single 512 KiB IO for a while now on x86_64.
> This is due to the number of DMA segments and the segment size. With LBS the
> segments can be much bigger without using huge pages, and so on a 64 KiB
> block size filesystem you can now see 2 MiB IOs when using buffered IO.
Actually, with a 64k filesystem block size, buffered I/O can go up to
8 MiB per I/O, since we can describe up to 128 segments of 64k each.
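(That is, 128 segments * 64 KiB per segment = 8 MiB.)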
Daniel
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-20 11:41 [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64 Luis Chamberlain
2025-03-20 12:11 ` Matthew Wilcox
2025-03-20 13:47 ` Daniel Gomez
@ 2025-03-20 14:18 ` Christoph Hellwig
2025-03-20 15:37 ` Bart Van Assche
2025-03-20 21:40 ` Luis Chamberlain
2025-03-20 18:46 ` Ritesh Harjani
3 siblings, 2 replies; 25+ messages in thread
From: Christoph Hellwig @ 2025-03-20 14:18 UTC (permalink / raw)
To: Luis Chamberlain
Cc: linux-fsdevel, linux-mm, linux-block, lsf-pc, david, leon, hch,
kbusch, sagi, axboe, joro, brauner, hare, willy, djwong,
john.g.garry, ritesh.list, p.raghav, gost.dev, da.gomez
On Thu, Mar 20, 2025 at 04:41:11AM -0700, Luis Chamberlain wrote:
> We've been constrained to a max single 512 KiB IO for a while now on x86_64.
No, we absolutely haven't. I'm regularly seeing multi-MB I/O on both
SCSI and NVMe setups.
> This is due to the number of DMA segments and the segment size.
In nvme the max_segment_size is UINT_MAX, and for most SCSI HBAs it is
fairly large as well.
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-20 13:29 ` Daniel Gomez
@ 2025-03-20 14:31 ` Matthew Wilcox
0 siblings, 0 replies; 25+ messages in thread
From: Matthew Wilcox @ 2025-03-20 14:31 UTC (permalink / raw)
To: Daniel Gomez
Cc: Luis Chamberlain, linux-fsdevel, linux-mm, linux-block, lsf-pc,
david, leon, hch, kbusch, sagi, axboe, joro, brauner, hare,
djwong, john.g.garry, ritesh.list, p.raghav, gost.dev, da.gomez
On Thu, Mar 20, 2025 at 02:29:56PM +0100, Daniel Gomez wrote:
> On Thu, Mar 20, 2025 at 12:11:47PM +0100, Matthew Wilcox wrote:
> > On Thu, Mar 20, 2025 at 04:41:11AM -0700, Luis Chamberlain wrote:
> > > We've been constrained to a max single 512 KiB IO for a while now on x86_64.
> > ...
> > > It does beg a few questions:
> > >
> > > - How are we computing the new max single IO anyway? Are we really
> > > bounded only by what devices support?
> > > - Do we believe this is the step in the right direction?
> > > - Is 2 MiB a sensible max block sector size limit for the next few years?
> > > - What other considerations should we have?
> > > - Do we want something more deterministic for large folios for direct IO?
> >
> > Is the 512KiB limit one that real programs actually hit? Would we
> > see any benefit from increasing it? A high end NVMe device has a
> > bandwidth limit around 10GB/s, so that's reached around 20k IOPS,
> > which is almost laughably low.
>
> Current devices do more than that. A quick search gives me 14GB/s and 2.5M IOPS
> for gen5 devices:
>
> https://semiconductor.samsung.com/ssd/enterprise-ssd/pm1743/
>
> An gen6 goes even further.
That kind of misses my point. You don't need to exceed 512KiB I/Os to
be bandwidth limited. So what's the ROI of all this work? Who benefits?
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-20 13:47 ` Daniel Gomez
@ 2025-03-20 14:54 ` Christoph Hellwig
2025-03-21 9:14 ` Daniel Gomez
0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2025-03-20 14:54 UTC (permalink / raw)
To: Daniel Gomez
Cc: Luis Chamberlain, linux-fsdevel, linux-mm, linux-block, lsf-pc,
david, leon, hch, kbusch, sagi, axboe, joro, brauner, hare,
willy, djwong, john.g.garry, ritesh.list, p.raghav, gost.dev,
da.gomez
On Thu, Mar 20, 2025 at 02:47:22PM +0100, Daniel Gomez wrote:
> On Thu, Mar 20, 2025 at 04:41:11AM +0100, Luis Chamberlain wrote:
> > We've been constrained to a max single 512 KiB IO for a while now on x86_64.
> > This is due to the number of DMA segments and the segment size. With LBS the
> > segments can be much bigger without using huge pages, and so on a 64 KiB
> > block size filesystem you can now see 2 MiB IOs when using buffered IO.
>
> Actually up to 8 MiB I/O with 64k filesystem block size with buffered I/O
> as we can describe up to 128 segments at 64k size.
Block layer segments are in no way limited to the logical block size.
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-20 14:18 ` Christoph Hellwig
@ 2025-03-20 15:37 ` Bart Van Assche
2025-03-20 15:58 ` Keith Busch
2025-03-20 21:40 ` Luis Chamberlain
1 sibling, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2025-03-20 15:37 UTC (permalink / raw)
To: Christoph Hellwig, Luis Chamberlain
Cc: linux-fsdevel, linux-mm, linux-block, lsf-pc, david, leon,
kbusch, sagi, axboe, joro, brauner, hare, willy, djwong,
john.g.garry, ritesh.list, p.raghav, gost.dev, da.gomez
On 3/20/25 7:18 AM, Christoph Hellwig wrote:
> On Thu, Mar 20, 2025 at 04:41:11AM -0700, Luis Chamberlain wrote:
>> We've been constrained to a max single 512 KiB IO for a while now on x86_64.
>
> No, we absolutely haven't. I'm regularly seeing multi-MB I/O on both
> SCSI and NVMe setup.
Is NVME_MAX_KB_SZ the current maximum I/O size for PCIe NVMe
controllers? From drivers/nvme/host/pci.c:
/*
* These can be higher, but we need to ensure that any command doesn't
* require an sg allocation that needs more than a page of data.
*/
#define NVME_MAX_KB_SZ 8192
#define NVME_MAX_SEGS 128
#define NVME_MAX_META_SEGS 15
#define NVME_MAX_NR_ALLOCATIONS 5
>> This is due to the number of DMA segments and the segment size.
>
> In nvme the max_segment_size is UINT_MAX, and for most SCSI HBAs it is
> fairly large as well.
I have a question for NVMe device manufacturers. It has been known for a
long time that submitting large I/Os with the NVMe SGL format requires
less CPU time than the NVMe PRP format. Is this sufficient to
motivate NVMe device manufacturers to implement the SGL format? All SCSI
controllers I know of, including UFS controllers, support something that
is much closer to the NVMe SGL format than to the NVMe PRP format.
Thanks,
Bart.
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-20 15:37 ` Bart Van Assche
@ 2025-03-20 15:58 ` Keith Busch
2025-03-20 16:13 ` Kanchan Joshi
` (2 more replies)
0 siblings, 3 replies; 25+ messages in thread
From: Keith Busch @ 2025-03-20 15:58 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Luis Chamberlain, linux-fsdevel, linux-mm,
linux-block, lsf-pc, david, leon, sagi, axboe, joro, brauner,
hare, willy, djwong, john.g.garry, ritesh.list, p.raghav,
gost.dev, da.gomez
On Thu, Mar 20, 2025 at 08:37:05AM -0700, Bart Van Assche wrote:
> On 3/20/25 7:18 AM, Christoph Hellwig wrote:
> > On Thu, Mar 20, 2025 at 04:41:11AM -0700, Luis Chamberlain wrote:
> > > We've been constrained to a max single 512 KiB IO for a while now on x86_64.
> >
> > No, we absolutely haven't. I'm regularly seeing multi-MB I/O on both
> > SCSI and NVMe setup.
>
> Is NVME_MAX_KB_SZ the current maximum I/O size for PCIe NVMe
> controllers? From drivers/nvme/host/pci.c:
Yes, this is the driver's limit. The device's limit may be lower or
higher.
I allocate out of hugetlbfs to reliably send direct IO at this size
because the nvme driver's segment count is limited to 128. The driver
doesn't impose a segment size limit, though. If each segment is only 4k
(a common occurrence), I guess that's where Luis is getting the 512K
limit?
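For anyone who wants to try this, here is a minimal sketch of that kind
of huge-page-backed direct IO -- not what I actually run, just an
illustration. It assumes huge pages have been reserved
(vm.nr_hugepages) and uses /dev/nvme0n1 purely as a placeholder scratch
device; a raw write like this destroys whatever is on it.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define IO_SIZE (2UL << 20)	/* 2 MiB, backed by a single huge page */

int main(void)
{
	/*
	 * MAP_HUGETLB draws from the same pool as hugetlbfs and gives a
	 * physically contiguous buffer, so the resulting bio needs far
	 * fewer DMA segments than 4 KiB anonymous pages would.
	 */
	void *buf = mmap(NULL, IO_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}
	memset(buf, 0xa5, IO_SIZE);

	/*
	 * O_DIRECT requires the buffer, offset and length to be
	 * logical-block aligned; a huge page satisfies that trivially.
	 */
	int fd = open("/dev/nvme0n1", O_WRONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	ssize_t ret = pwrite(fd, buf, IO_SIZE, 0);
	printf("pwrite returned %zd\n", ret);

	close(fd);
	munmap(buf, IO_SIZE);
	return 0;
}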
> /*
> * These can be higher, but we need to ensure that any command doesn't
> * require an sg allocation that needs more than a page of data.
> */
> #define NVME_MAX_KB_SZ 8192
> #define NVME_MAX_SEGS 128
> #define NVME_MAX_META_SEGS 15
> #define NVME_MAX_NR_ALLOCATIONS 5
>
> > > This is due to the number of DMA segments and the segment size.
> >
> > In nvme the max_segment_size is UINT_MAX, and for most SCSI HBAs it is
> > fairly large as well.
>
> I have a question for NVMe device manufacturers. It is known since a
> long time that submitting large I/Os with the NVMe SGL format requires
> less CPU time compared to the NVMe PRP format. Is this sufficient to
> motivate NVMe device manufacturers to implement the SGL format? All SCSI
> controllers I know of, including UFS controllers, support something that
> is much closer to the NVMe SGL format rather than the NVMe PRP format.
SGL support does seem less common than you'd think. It is more efficient
when you have physically contiguous pages, or when an IOMMU has mapped
discontiguous pages into a DMA-contiguous IOVA. If you don't have that,
PRP is a little more efficient for memory and CPU usage. But in the
context of large folios, yeah, SGL is the better option.
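(A rough way to see it: a PRP entry covers one memory page, so a
physically contiguous 2 MiB buffer still needs roughly 512 PRP entries
at a 4 KiB memory page size, while a single SGL data block descriptor
can describe the whole 2 MiB range with one address/length pair.)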
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-20 15:58 ` Keith Busch
@ 2025-03-20 16:13 ` Kanchan Joshi
2025-03-20 16:38 ` Christoph Hellwig
2025-03-20 21:46 ` Luis Chamberlain
2 siblings, 0 replies; 25+ messages in thread
From: Kanchan Joshi @ 2025-03-20 16:13 UTC (permalink / raw)
To: Keith Busch, Bart Van Assche
Cc: Christoph Hellwig, Luis Chamberlain, linux-fsdevel, linux-mm,
linux-block, lsf-pc, david, leon, sagi, axboe, joro, brauner,
hare, willy, djwong, john.g.garry, ritesh.list, p.raghav,
gost.dev, da.gomez
On 3/20/2025 9:28 PM, Keith Busch wrote:
> On Thu, Mar 20, 2025 at 08:37:05AM -0700, Bart Van Assche wrote:
>> On 3/20/25 7:18 AM, Christoph Hellwig wrote:
>>> On Thu, Mar 20, 2025 at 04:41:11AM -0700, Luis Chamberlain wrote:
>>>> We've been constrained to a max single 512 KiB IO for a while now on x86_64.
>>> No, we absolutely haven't. I'm regularly seeing multi-MB I/O on both
>>> SCSI and NVMe setup.
>> Is NVME_MAX_KB_SZ the current maximum I/O size for PCIe NVMe
>> controllers? From drivers/nvme/host/pci.c:
> Yes, this is the driver's limit. The device's limit may be lower or
> higher.
>
> I allocate out of hugetlbfs to reliably send direct IO at this size
> because the nvme driver's segment count is limited to 128. The driver
> doesn't impose a segment size limit, though. If each segment is only 4k
> (a common occurance), I guess that's where Luis is getting the 512K
> limit?
Even if we hit that segment count limit (128), the I/O can still go
through, as the block layer will split it while the application still
thinks it's a single I/O.
But if we don't want this internal split (for LBS), or if we are using
the passthrough path, we will see a failure.
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-20 15:58 ` Keith Busch
2025-03-20 16:13 ` Kanchan Joshi
@ 2025-03-20 16:38 ` Christoph Hellwig
2025-03-20 21:50 ` Luis Chamberlain
2025-03-20 21:46 ` Luis Chamberlain
2 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2025-03-20 16:38 UTC (permalink / raw)
To: Keith Busch
Cc: Bart Van Assche, Christoph Hellwig, Luis Chamberlain,
linux-fsdevel, linux-mm, linux-block, lsf-pc, david, leon, sagi,
axboe, joro, brauner, hare, willy, djwong, john.g.garry,
ritesh.list, p.raghav, gost.dev, da.gomez
On Thu, Mar 20, 2025 at 09:58:47AM -0600, Keith Busch wrote:
> I allocate out of hugetlbfs to reliably send direct IO at this size
> because the nvme driver's segment count is limited to 128.
It also works pretty well for buffered I/O for file systems supporting
larger folios. I can trivially create 1MB folios in the page cache
on XFS and then do I/O on them.
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-20 11:41 [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64 Luis Chamberlain
` (2 preceding siblings ...)
2025-03-20 14:18 ` Christoph Hellwig
@ 2025-03-20 18:46 ` Ritesh Harjani
2025-03-20 21:30 ` Darrick J. Wong
3 siblings, 1 reply; 25+ messages in thread
From: Ritesh Harjani @ 2025-03-20 18:46 UTC (permalink / raw)
To: Luis Chamberlain, linux-fsdevel, linux-mm, linux-block
Cc: lsf-pc, david, leon, hch, kbusch, sagi, axboe, joro, brauner,
hare, willy, djwong, john.g.garry, p.raghav, gost.dev, da.gomez,
Luis Chamberlain
Luis Chamberlain <mcgrof@kernel.org> writes:
> We've been constrained to a max single 512 KiB IO for a while now on x86_64.
> This is due to the number of DMA segments and the segment size. With LBS the
> segments can be much bigger without using huge pages, and so on a 64 KiB
> block size filesystem you can now see 2 MiB IOs when using buffered IO.
> But direct IO is still crippled, because allocations are from anonymous
> memory, and unless you are using mTHP you won't get large folios. mTHP
> is also non-deterministic, and so you end up in a worse situation for
> direct IO if you want to rely on large folios, as you may *sometimes*
> end up with large folios and sometimes you might not. IO patterns can
> therefore be erratic.
>
> As I just posted in a simple RFC [0], I believe the two step DMA API
> helps resolve this. Provided we move the block integrity stuff to the
> new DMA API as well, the only patches really needed to support larger
> IOs for direct IO for NVMe are:
>
> iomap: use BLK_MAX_BLOCK_SIZE for the iomap zero page
> blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit
Maybe some naive questions, however I would like some help from people
who could confirm if my understanding here is correct or not.
Given that we now support large folios in buffered I/O directly on raw
block devices, applications must carefully serialize direct I/O and
buffered I/O operations on these devices, right?
IIUC, until now, mixing buffered I/O and direct I/O (for doing I/O on
/dev/xxx) on separate boundaries (blocksize == pagesize) worked fine,
since direct I/O would only invalidate its corresponding page in the
page cache. This assumes that both direct I/O and buffered I/O use the
same blocksize and pagesize (e.g. both using 4K or both using 64K).
However with large folios now introduced in the buffered I/O path for
block devices, direct I/O may end up invalidating an entire large folio,
which could span across a region where an ongoing direct I/O operation
is taking place. That means, with large folio support in block devices,
application developers must now ensure that direct I/O and buffered I/O
operations on block devices are properly serialized, correct?
I was looking at the POSIX page [1] and I don't think the POSIX standard
defines the semantics for operations on block devices. So it is really up
to the individual OS implementation, correct?
And IIUC, what Linux recommends is to never mix any kind of direct-io
and buffered-io when doing I/O on raw block devices, but I cannot find
this recommendation in any Documentation. So can someone please point me
to where we recommend this?
[1]: https://pubs.opengroup.org/onlinepubs/9799919799/
-ritesh
>
> The other two nvme-pci patches in that series are to just help with
> experimentation now and they can be ignored.
>
> It does beg a few questions:
>
> - How are we computing the new max single IO anyway? Are we really
> bounded only by what devices support?
> - Do we believe this is the step in the right direction?
> - Is 2 MiB a sensible max block sector size limit for the next few years?
> - What other considerations should we have?
> - Do we want something more deterministic for large folios for direct IO?
>
> [0] https://lkml.kernel.org/r/20250320111328.2841690-1-mcgrof@kernel.org
>
> Luis
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-20 18:46 ` Ritesh Harjani
@ 2025-03-20 21:30 ` Darrick J. Wong
2025-03-21 2:13 ` Ritesh Harjani
0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2025-03-20 21:30 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Luis Chamberlain, linux-fsdevel, linux-mm, linux-block, lsf-pc,
david, leon, hch, kbusch, sagi, axboe, joro, brauner, hare,
willy, john.g.garry, p.raghav, gost.dev, da.gomez
On Fri, Mar 21, 2025 at 12:16:28AM +0530, Ritesh Harjani wrote:
> Luis Chamberlain <mcgrof@kernel.org> writes:
>
> > We've been constrained to a max single 512 KiB IO for a while now on x86_64.
> > This is due to the number of DMA segments and the segment size. With LBS the
> > segments can be much bigger without using huge pages, and so on a 64 KiB
> > block size filesystem you can now see 2 MiB IOs when using buffered IO.
> > But direct IO is still crippled, because allocations are from anonymous
> > memory, and unless you are using mTHP you won't get large folios. mTHP
> > is also non-deterministic, and so you end up in a worse situation for
> > direct IO if you want to rely on large folios, as you may *sometimes*
> > end up with large folios and sometimes you might not. IO patterns can
> > therefore be erratic.
> >
> > As I just posted in a simple RFC [0], I believe the two step DMA API
> > helps resolve this. Provided we move the block integrity stuff to the
> > new DMA API as well, the only patches really needed to support larger
> > IOs for direct IO for NVMe are:
> >
> > iomap: use BLK_MAX_BLOCK_SIZE for the iomap zero page
> > blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit
>
> Maybe some naive questions, however I would like some help from people
> who could confirm if my understanding here is correct or not.
>
> Given that we now support large folios in buffered I/O directly on raw
> block devices, applications must carefully serialize direct I/O and
> buffered I/O operations on these devices, right?
>
> IIUC. until now, mixing buffered I/O and direct I/O (for doing I/O on
> /dev/xxx) on separate boundaries (blocksize == pagesize) worked fine,
> since direct I/O would only invalidate its corresponding page in the
> page cache. This assumes that both direct I/O and buffered I/O use the
> same blocksize and pagesize (e.g. both using 4K or both using 64K).
> However with large folios now introduced in the buffered I/O path for
> block devices, direct I/O may end up invalidating an entire large folio,
> which could span across a region where an ongoing direct I/O operation
I don't understand the question. Should this read ^^^ "buffered"?
As in, directio submits its write bio, meanwhile another thread
initiates a buffered write nearby, the write gets a 2MB folio, and
then the post-write invalidation knocks down the entire large folio?
Even though the two ranges written are (say) 256k apart?
--D
> is taking place. That means, with large folio support in block devices,
> application developers must now ensure that direct I/O and buffered I/O
> operations on block devices are properly serialized, correct?
>
> I was looking at posix page [1] and I don't think posix standard defines
> the semantics for operations on block devices. So it is really upto the
> individual OS implementation, correct?
>
> And IIUC, what Linux recommends is to never mix any kind of direct-io
> and buffered-io when doing I/O on raw block devices, but I cannot find
> this recommendation in any Documentation? So can someone please point me
> one where we recommend this?
>
> [1]: https://pubs.opengroup.org/onlinepubs/9799919799/
>
>
> -ritesh
>
> >
> > The other two nvme-pci patches in that series are to just help with
> > experimentation now and they can be ignored.
> >
> > It does beg a few questions:
> >
> > - How are we computing the new max single IO anyway? Are we really
> > bounded only by what devices support?
> > - Do we believe this is the step in the right direction?
> > - Is 2 MiB a sensible max block sector size limit for the next few years?
> > - What other considerations should we have?
> > - Do we want something more deterministic for large folios for direct IO?
> >
> > [0] https://lkml.kernel.org/r/20250320111328.2841690-1-mcgrof@kernel.org
> >
> > Luis
>
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-20 14:18 ` Christoph Hellwig
2025-03-20 15:37 ` Bart Van Assche
@ 2025-03-20 21:40 ` Luis Chamberlain
1 sibling, 0 replies; 25+ messages in thread
From: Luis Chamberlain @ 2025-03-20 21:40 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-fsdevel, linux-mm, linux-block, lsf-pc, david, leon,
kbusch, sagi, axboe, joro, brauner, hare, willy, djwong,
john.g.garry, ritesh.list, p.raghav, gost.dev, da.gomez
On Thu, Mar 20, 2025 at 03:18:46PM +0100, Christoph Hellwig wrote:
> On Thu, Mar 20, 2025 at 04:41:11AM -0700, Luis Chamberlain wrote:
> > We've been constrained to a max single 512 KiB IO for a while now on x86_64.
>
> No, we absolutely haven't. I'm regularly seeing multi-MB I/O on both
> SCSI and NVMe setup.
Sorry, you're right, I should have been clearer. This is only an issue without
large folios for buffered IO, or without scatter list chaining support.
Or put another way, block drivers which don't support scatter list
chaining will end up with a different max IO possible for direct IO and
io-uring cmd.
> > This is due to the number of DMA segments and the segment size.
>
> In nvme the max_segment_size is UINT_MAX, and for most SCSI HBAs it is
> fairly large as well.
For direct IO or io-uring cmd, when large folios may not be used, the
segments will be constrained to the page size.
Luis
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-20 15:58 ` Keith Busch
2025-03-20 16:13 ` Kanchan Joshi
2025-03-20 16:38 ` Christoph Hellwig
@ 2025-03-20 21:46 ` Luis Chamberlain
2 siblings, 0 replies; 25+ messages in thread
From: Luis Chamberlain @ 2025-03-20 21:46 UTC (permalink / raw)
To: Keith Busch
Cc: Bart Van Assche, Christoph Hellwig, linux-fsdevel, linux-mm,
linux-block, lsf-pc, david, leon, sagi, axboe, joro, brauner,
hare, willy, djwong, john.g.garry, ritesh.list, p.raghav,
gost.dev, da.gomez
On Thu, Mar 20, 2025 at 09:58:47AM -0600, Keith Busch wrote:
> I allocate out of hugetlbfs to reliably send direct IO at this size
> because the nvme driver's segment count is limited to 128. The driver
> doesn't impose a segment size limit, though. If each segment is only 4k
> (a common occurance), I guess that's where Luis is getting the 512K
> limit?
Right, for direct IO we are not getting the large folios we benefit
from with buffered IO, so in effect large IOs can be possible with
buffered IO but not with direct IO.
Yes, direct IO with huge pages can help you overcome this, as you noted.
mTHP can also be enabled and used for direct IO or io-uring cmd, but
mTHP is not deterministic for your allocations even if you have a min
order filesystem. The min order requirement is only useful in the
buffered IO case.
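As a sketch of what "non-deterministic" means here, assuming a 64 KiB
mTHP size has been enabled under
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/: even an madvise()d
buffer may or may not come back as large folios, so the segment count
the resulting direct IO needs varies from run to run.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define BUF_SIZE (1UL << 20)	/* 1 MiB IO buffer */

int main(void)
{
	void *buf;

	/*
	 * Align to 64 KiB so the kernel *can* back this with 64 KiB
	 * mTHP folios; posix_memalign only aligns, it guarantees
	 * nothing about folio size.
	 */
	if (posix_memalign(&buf, 64 * 1024, BUF_SIZE))
		return 1;

	/*
	 * MADV_HUGEPAGE is only a hint: depending on the sysfs policy
	 * and on memory fragmentation, the region may be backed by
	 * large folios, by 4 KiB pages, or by a mix of both.
	 */
	if (madvise(buf, BUF_SIZE, MADV_HUGEPAGE))
		perror("madvise");

	memset(buf, 0, BUF_SIZE);	/* fault the memory in */

	/*
	 * Whatever we got is what direct IO will have to DMA-map, so
	 * the segment count (and hence the max single IO) is not
	 * predictable the way it is for page cache folios on an LBS
	 * filesystem.
	 */
	printf("buffer at %p ready for O_DIRECT use\n", buf);
	free(buf);
	return 0;
}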
Luis
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-20 16:38 ` Christoph Hellwig
@ 2025-03-20 21:50 ` Luis Chamberlain
0 siblings, 0 replies; 25+ messages in thread
From: Luis Chamberlain @ 2025-03-20 21:50 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Bart Van Assche, linux-fsdevel, linux-mm,
linux-block, lsf-pc, david, leon, sagi, axboe, joro, brauner,
hare, willy, djwong, john.g.garry, ritesh.list, p.raghav,
gost.dev, da.gomez
On Thu, Mar 20, 2025 at 05:38:04PM +0100, Christoph Hellwig wrote:
> On Thu, Mar 20, 2025 at 09:58:47AM -0600, Keith Busch wrote:
> > I allocate out of hugetlbfs to reliably send direct IO at this size
> > because the nvme driver's segment count is limited to 128.
>
> It also works pretty well for buffered I/O for file systems supporting
> larger folios. I can trivially create 1MB folios in the page cache
> on XFS and then do I/O on them.
Right, but try DIO or io-uring cmd. The two-step DMA API seems to help us
bridge this gap and provide parity.
Luis
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-20 21:30 ` Darrick J. Wong
@ 2025-03-21 2:13 ` Ritesh Harjani
2025-03-21 3:05 ` Darrick J. Wong
2025-03-21 16:38 ` Keith Busch
0 siblings, 2 replies; 25+ messages in thread
From: Ritesh Harjani @ 2025-03-21 2:13 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Luis Chamberlain, linux-fsdevel, linux-mm, linux-block, lsf-pc,
david, leon, hch, kbusch, sagi, axboe, joro, brauner, hare,
willy, john.g.garry, p.raghav, gost.dev, da.gomez
"Darrick J. Wong" <djwong@kernel.org> writes:
> On Fri, Mar 21, 2025 at 12:16:28AM +0530, Ritesh Harjani wrote:
>> Luis Chamberlain <mcgrof@kernel.org> writes:
>>
>> > We've been constrained to a max single 512 KiB IO for a while now on x86_64.
>> > This is due to the number of DMA segments and the segment size. With LBS the
>> > segments can be much bigger without using huge pages, and so on a 64 KiB
>> > block size filesystem you can now see 2 MiB IOs when using buffered IO.
>> > But direct IO is still crippled, because allocations are from anonymous
>> > memory, and unless you are using mTHP you won't get large folios. mTHP
>> > is also non-deterministic, and so you end up in a worse situation for
>> > direct IO if you want to rely on large folios, as you may *sometimes*
>> > end up with large folios and sometimes you might not. IO patterns can
>> > therefore be erratic.
>> >
>> > As I just posted in a simple RFC [0], I believe the two step DMA API
>> > helps resolve this. Provided we move the block integrity stuff to the
>> > new DMA API as well, the only patches really needed to support larger
>> > IOs for direct IO for NVMe are:
>> >
>> > iomap: use BLK_MAX_BLOCK_SIZE for the iomap zero page
>> > blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit
>>
>> Maybe some naive questions, however I would like some help from people
>> who could confirm if my understanding here is correct or not.
>>
>> Given that we now support large folios in buffered I/O directly on raw
>> block devices, applications must carefully serialize direct I/O and
>> buffered I/O operations on these devices, right?
>>
>> IIUC. until now, mixing buffered I/O and direct I/O (for doing I/O on
>> /dev/xxx) on separate boundaries (blocksize == pagesize) worked fine,
>> since direct I/O would only invalidate its corresponding page in the
>> page cache. This assumes that both direct I/O and buffered I/O use the
>> same blocksize and pagesize (e.g. both using 4K or both using 64K).
>> However with large folios now introduced in the buffered I/O path for
>> block devices, direct I/O may end up invalidating an entire large folio,
>> which could span across a region where an ongoing direct I/O operation
>
> I don't understand the question. Should this read ^^^ "buffered"?
oops, yes.
> As in, directio submits its write bio, meanwhile another thread
> initiates a buffered write nearby, the write gets a 2MB folio, and
> then the post-write invalidation knocks down the entire large folio?
> Even though the two ranges written are (say) 256k apart?
>
Yes, Darrick. That is my question.
i.e. w/o large folios in block devices, one could do direct-io &
buffered-io in parallel even right next to each other (assuming 4k pagesize).
|4k-direct-io | 4k-buffered-io |
However, with large folios now supported in the buffered-io path for block
devices, the application cannot submit such a direct-io + buffered-io
pattern in parallel, since direct-io can end up invalidating the folio
spanning over its 4k range, on which buffered-io is in progress.
So now applications need to be careful not to submit any direct-io &
buffered-io in parallel with such patterns on a raw block device,
correct? That is what I would like to confirm.
> --D
>
>> is taking place. That means, with large folio support in block devices,
>> application developers must now ensure that direct I/O and buffered I/O
>> operations on block devices are properly serialized, correct?
>>
>> I was looking at posix page [1] and I don't think posix standard defines
>> the semantics for operations on block devices. So it is really upto the
>> individual OS implementation, correct?
>>
>> And IIUC, what Linux recommends is to never mix any kind of direct-io
>> and buffered-io when doing I/O on raw block devices, but I cannot find
>> this recommendation in any Documentation? So can someone please point me
>> one where we recommend this?
And this ^^^
-ritesh
>>
>> [1]: https://pubs.opengroup.org/onlinepubs/9799919799/
>>
>>
>> -ritesh
>>
>> >
>> > The other two nvme-pci patches in that series are to just help with
>> > experimentation now and they can be ignored.
>> >
>> > It does beg a few questions:
>> >
>> > - How are we computing the new max single IO anyway? Are we really
>> > bounded only by what devices support?
>> > - Do we believe this is the step in the right direction?
>> > - Is 2 MiB a sensible max block sector size limit for the next few years?
>> > - What other considerations should we have?
>> > - Do we want something more deterministic for large folios for direct IO?
>> >
>> > [0] https://lkml.kernel.org/r/20250320111328.2841690-1-mcgrof@kernel.org
>> >
>> > Luis
>>
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-21 2:13 ` Ritesh Harjani
@ 2025-03-21 3:05 ` Darrick J. Wong
2025-03-21 4:56 ` Theodore Ts'o
2025-03-21 16:38 ` Keith Busch
1 sibling, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2025-03-21 3:05 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Luis Chamberlain, linux-fsdevel, linux-mm, linux-block, lsf-pc,
david, leon, hch, kbusch, sagi, axboe, joro, brauner, hare,
willy, john.g.garry, p.raghav, gost.dev, da.gomez
On Fri, Mar 21, 2025 at 07:43:09AM +0530, Ritesh Harjani wrote:
> "Darrick J. Wong" <djwong@kernel.org> writes:
>
> > On Fri, Mar 21, 2025 at 12:16:28AM +0530, Ritesh Harjani wrote:
> >> Luis Chamberlain <mcgrof@kernel.org> writes:
> >>
> >> > We've been constrained to a max single 512 KiB IO for a while now on x86_64.
> >> > This is due to the number of DMA segments and the segment size. With LBS the
> >> > segments can be much bigger without using huge pages, and so on a 64 KiB
> >> > block size filesystem you can now see 2 MiB IOs when using buffered IO.
> >> > But direct IO is still crippled, because allocations are from anonymous
> >> > memory, and unless you are using mTHP you won't get large folios. mTHP
> >> > is also non-deterministic, and so you end up in a worse situation for
> >> > direct IO if you want to rely on large folios, as you may *sometimes*
> >> > end up with large folios and sometimes you might not. IO patterns can
> >> > therefore be erratic.
> >> >
> >> > As I just posted in a simple RFC [0], I believe the two step DMA API
> >> > helps resolve this. Provided we move the block integrity stuff to the
> >> > new DMA API as well, the only patches really needed to support larger
> >> > IOs for direct IO for NVMe are:
> >> >
> >> > iomap: use BLK_MAX_BLOCK_SIZE for the iomap zero page
> >> > blkdev: lift BLK_MAX_BLOCK_SIZE to page cache limit
> >>
> >> Maybe some naive questions, however I would like some help from people
> >> who could confirm if my understanding here is correct or not.
> >>
> >> Given that we now support large folios in buffered I/O directly on raw
> >> block devices, applications must carefully serialize direct I/O and
> >> buffered I/O operations on these devices, right?
> >>
> >> IIUC. until now, mixing buffered I/O and direct I/O (for doing I/O on
> >> /dev/xxx) on separate boundaries (blocksize == pagesize) worked fine,
> >> since direct I/O would only invalidate its corresponding page in the
> >> page cache. This assumes that both direct I/O and buffered I/O use the
> >> same blocksize and pagesize (e.g. both using 4K or both using 64K).
> >> However with large folios now introduced in the buffered I/O path for
> >> block devices, direct I/O may end up invalidating an entire large folio,
> >> which could span across a region where an ongoing direct I/O operation
> >
> > I don't understand the question. Should this read ^^^ "buffered"?
>
> oops, yes.
>
> > As in, directio submits its write bio, meanwhile another thread
> > initiates a buffered write nearby, the write gets a 2MB folio, and
> > then the post-write invalidation knocks down the entire large folio?
> > Even though the two ranges written are (say) 256k apart?
> >
>
> Yes, Darrick. That is my question.
>
> i.e. w/o large folios in block devices one could do direct-io &
> buffered-io in parallel even just next to each other (assuming 4k pagesize).
>
> |4k-direct-io | 4k-buffered-io |
>
>
> However with large folios now supported in buffered-io path for block
> devices, the application cannot submit such direct-io + buffered-io
> pattern in parallel. Since direct-io can end up invalidating the folio
> spanning over it's 4k range, on which buffered-io is in progress.
>
> So now applications need to be careful to not submit any direct-io &
> buffered-io in parallel with such above patterns on a raw block device,
> correct? That is what I would like to confirm.
I think that's correct, and kind of horrifying if true. I wonder if
->invalidate_folio might be a reasonable way to clear the uptodate bits
on the relevant parts of a large folio without having to split or remove
it?
--D
> > --D
> >
> >> is taking place. That means, with large folio support in block devices,
> >> application developers must now ensure that direct I/O and buffered I/O
> >> operations on block devices are properly serialized, correct?
> >>
> >> I was looking at posix page [1] and I don't think posix standard defines
> >> the semantics for operations on block devices. So it is really upto the
> >> individual OS implementation, correct?
> >>
> >> And IIUC, what Linux recommends is to never mix any kind of direct-io
> >> and buffered-io when doing I/O on raw block devices, but I cannot find
> >> this recommendation in any Documentation? So can someone please point me
> >> one where we recommend this?
>
> And this ^^^
>
>
> -ritesh
>
> >>
> >> [1]: https://pubs.opengroup.org/onlinepubs/9799919799/
> >>
> >>
> >> -ritesh
> >>
> >> >
> >> > The other two nvme-pci patches in that series are to just help with
> >> > experimentation now and they can be ignored.
> >> >
> >> > It does beg a few questions:
> >> >
> >> > - How are we computing the new max single IO anyway? Are we really
> >> > bounded only by what devices support?
> >> > - Do we believe this is the step in the right direction?
> >> > - Is 2 MiB a sensible max block sector size limit for the next few years?
> >> > - What other considerations should we have?
> >> > - Do we want something more deterministic for large folios for direct IO?
> >> >
> >> > [0] https://lkml.kernel.org/r/20250320111328.2841690-1-mcgrof@kernel.org
> >> >
> >> > Luis
> >>
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-21 3:05 ` Darrick J. Wong
@ 2025-03-21 4:56 ` Theodore Ts'o
2025-03-21 5:00 ` Christoph Hellwig
0 siblings, 1 reply; 25+ messages in thread
From: Theodore Ts'o @ 2025-03-21 4:56 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Ritesh Harjani, Luis Chamberlain, linux-fsdevel, linux-mm,
linux-block, lsf-pc, david, leon, hch, kbusch, sagi, axboe, joro,
brauner, hare, willy, john.g.garry, p.raghav, gost.dev, da.gomez
On Thu, Mar 20, 2025 at 08:05:26PM -0700, Darrick J. Wong wrote:
> > So now applications need to be careful to not submit any direct-io &
> > buffered-io in parallel with such above patterns on a raw block device,
> > correct? That is what I would like to confirm.
>
> I think that's correct, and kind of horrifying if true. I wonder if
> ->invalidate_folio might be a reasonable way to clear the uptodate bits
> on the relevant parts of a large folio without having to split or remove
> it?
FWIW, I've always recommended not mixing DIO and buffered I/O, either
for filesystems or block devices.
> > >> And IIUC, what Linux recommends is to never mix any kind of direct-io
> > >> and buffered-io when doing I/O on raw block devices, but I cannot find
> > >> this recommendation in any Documentation? So can someone please point me
> > >> one where we recommend this?
> >
> > And this ^^^
From the open(2) man page, in the NOTES section:
Applications should avoid mixing O_DIRECT and normal I/O to the
same file, and especially to overlapping byte regions in the
same file. Even when the filesystem correctly handles the
coherency issues in this situation, overall I/O throughput is
likely to be slower than using either mode alone. Likewise,
applications should avoid mixing mmap(2) of files with direct I/O
to the same files.
As I recall, in the early days Linux's safety for DIO and buffered I/O
was best effort, and on other Unix systems the recommendation to "don't
mix the streams" was far stronger. Even if it works reliably for
Linux, it's still something I recommend that people avoid if at all
possible.
- Ted
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-21 4:56 ` Theodore Ts'o
@ 2025-03-21 5:00 ` Christoph Hellwig
2025-03-21 18:39 ` Ritesh Harjani
0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2025-03-21 5:00 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Darrick J. Wong, Ritesh Harjani, Luis Chamberlain, linux-fsdevel,
linux-mm, linux-block, lsf-pc, david, leon, hch, kbusch, sagi,
axboe, joro, brauner, hare, willy, john.g.garry, p.raghav,
gost.dev, da.gomez
On Fri, Mar 21, 2025 at 12:56:04AM -0400, Theodore Ts'o wrote:
> As I recall, in the eary days Linux's safety for DIO and Bufered I/O
> was best efforts, and other Unix system the recommendation to "don't
> mix the streams" was far stronger. Even if it works reliably for
> Linux, it's still something I recommend that people avoid if at all
> possible.
It still is a best effort, just a much better effort now. It's still
pretty easy to break the coherency.
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-20 14:54 ` Christoph Hellwig
@ 2025-03-21 9:14 ` Daniel Gomez
0 siblings, 0 replies; 25+ messages in thread
From: Daniel Gomez @ 2025-03-21 9:14 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Daniel Gomez, Luis Chamberlain, linux-fsdevel, linux-mm,
linux-block, lsf-pc, david, leon, kbusch, sagi, axboe, joro,
brauner, hare, willy, djwong, john.g.garry, ritesh.list,
p.raghav, gost.dev
On Thu, Mar 20, 2025 at 03:54:49PM +0100, Christoph Hellwig wrote:
> On Thu, Mar 20, 2025 at 02:47:22PM +0100, Daniel Gomez wrote:
> > On Thu, Mar 20, 2025 at 04:41:11AM +0100, Luis Chamberlain wrote:
> > > We've been constrained to a max single 512 KiB IO for a while now on x86_64.
> > > This is due to the number of DMA segments and the segment size. With LBS the
> > > segments can be much bigger without using huge pages, and so on a 64 KiB
> > > block size filesystem you can now see 2 MiB IOs when using buffered IO.
> >
> > Actually up to 8 MiB I/O with 64k filesystem block size with buffered I/O
> > as we can describe up to 128 segments at 64k size.
>
> Block layer segments are in no way limited to the logical block size.
You are right, but that was not what I meant. I'll use a 16 KiB fs
example, as with 64 KiB you hit the current NVMe 8 MiB driver limit
(NVME_MAX_KB_SZ):
"on a 16 KiB block size filesystem, using buffered I/O will always allow
at least 2 MiB I/O, though higher I/O may be possible".
And yes, we can do 8 MiB I/O with direct I/O as well. It's just not
reliable unless huge pages are used. The maximum reliable supported I/O
size is 512 KiB.
With buffered I/O, a larger fs block size guarantees a specific minimum
achievable I/O size, i.e. 2 MiB for 16 KiB, 4 MiB for 32 KiB and 8 MiB
for 64 KiB.
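(In each case that is just the 128-segment cap times the filesystem
block size, which is the smallest folio the page cache will use on such
a filesystem.)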
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-21 2:13 ` Ritesh Harjani
2025-03-21 3:05 ` Darrick J. Wong
@ 2025-03-21 16:38 ` Keith Busch
2025-03-21 17:21 ` Ritesh Harjani
1 sibling, 1 reply; 25+ messages in thread
From: Keith Busch @ 2025-03-21 16:38 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-mm,
linux-block, lsf-pc, david, leon, hch, sagi, axboe, joro,
brauner, hare, willy, john.g.garry, p.raghav, gost.dev, da.gomez
On Fri, Mar 21, 2025 at 07:43:09AM +0530, Ritesh Harjani wrote:
> i.e. w/o large folios in block devices one could do direct-io &
> buffered-io in parallel even just next to each other (assuming 4k pagesize).
>
> |4k-direct-io | 4k-buffered-io |
>
>
> However with large folios now supported in buffered-io path for block
> devices, the application cannot submit such direct-io + buffered-io
> pattern in parallel. Since direct-io can end up invalidating the folio
> spanning over it's 4k range, on which buffered-io is in progress.
Why would buffered io span more than the 4k range here? You're talking
to the raw block device in both cases, so they have the exact same
logical block size alignment. Why is buffered io allocating beyond
the logical size granularity?
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-21 16:38 ` Keith Busch
@ 2025-03-21 17:21 ` Ritesh Harjani
2025-03-21 18:55 ` Keith Busch
0 siblings, 1 reply; 25+ messages in thread
From: Ritesh Harjani @ 2025-03-21 17:21 UTC (permalink / raw)
To: Keith Busch
Cc: Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-mm,
linux-block, lsf-pc, david, leon, hch, sagi, axboe, joro,
brauner, hare, willy, john.g.garry, p.raghav, gost.dev, da.gomez
Keith Busch <kbusch@kernel.org> writes:
> On Fri, Mar 21, 2025 at 07:43:09AM +0530, Ritesh Harjani wrote:
>> i.e. w/o large folios in block devices one could do direct-io &
>> buffered-io in parallel even just next to each other (assuming 4k pagesize).
>>
>> |4k-direct-io | 4k-buffered-io |
>>
>>
>> However with large folios now supported in buffered-io path for block
>> devices, the application cannot submit such direct-io + buffered-io
>> pattern in parallel. Since direct-io can end up invalidating the folio
>> spanning over it's 4k range, on which buffered-io is in progress.
>
> Why would buffered io span more than the 4k range here? You're talking
> to the raw block device in both cases, so they have the exact same
> logical block size alignment. Why is buffered io allocating beyond
> the logical size granularity?
This can happen in the following two cases:
1. The system's page size is 64k. Then even though the logical block size
granularity for buffered-io is set to 4k (blockdev --setbsz 4k
/dev/sdc), it will still instantiate a 64k page in the page cache.
2. The recent case where (correct me if I am wrong) we now
have large folio support for block devices. So here again we can
instantiate a large folio in the page cache where buffered-io is in
progress, correct? (Say a previous read causes a readahead and installs a
large folio in that region.) Or even iomap_write_iter() these days tries
to first allocate a chunk of size mapping_max_folio_size().
However, with large folio support now in block devices, I am not sure
whether an application can retain much benefit from doing buffered-io (if
it happens to mix buffered-io and direct-io carefully over a logical
boundary), because direct-io can end up invalidating the entire
large folio, if there is one, in the region where the direct-io
operation is taking place. This may still be useful if only
buffered-io is being performed on the block device.
-ritesh
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-21 5:00 ` Christoph Hellwig
@ 2025-03-21 18:39 ` Ritesh Harjani
0 siblings, 0 replies; 25+ messages in thread
From: Ritesh Harjani @ 2025-03-21 18:39 UTC (permalink / raw)
To: Christoph Hellwig, Theodore Ts'o
Cc: Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-mm,
linux-block, lsf-pc, david, leon, hch, kbusch, sagi, axboe, joro,
brauner, hare, willy, john.g.garry, p.raghav, gost.dev, da.gomez
Christoph Hellwig <hch@lst.de> writes:
> On Fri, Mar 21, 2025 at 12:56:04AM -0400, Theodore Ts'o wrote:
>> As I recall, in the eary days Linux's safety for DIO and Bufered I/O
>> was best efforts, and other Unix system the recommendation to "don't
>> mix the streams" was far stronger. Even if it works reliably for
>> Linux, it's still something I recommend that people avoid if at all
>> possible.
>
> It still is a best effort, just a much better effort now. It's still
> pretty easy to break the coherent.
Thanks Ted & Christoph for the info. Do you think we should document
this recommendation, maybe somewhere in the kernel Documentation where
we can also list the possible cases where the coherency could break?
(I am not too well aware of those cases though).
One case which I recently came across was where the application was not
setting --setbsz properly on a block device where the system's pagesize is
64k. This, if I understand correctly, will install 1 buffer_head for a 64k
page for any buffered-io operation. Then, if someone mixes a 4k
buffered-io write right next to a 4k direct-io write, it definitely
ends up problematic, because the 4k buffered-io write will
end up doing a read-modify-write over a 64k page (1 buffer_head). This
means we now have the entire 64k page dirty while there is also a
direct-io write operation in that region, so the two writes end up
overlapping, hence causing coherency issues.
Such cases, I believe, are easy to miss. And now, with large folios
being used in block devices, I am not sure there is much value in
applications mixing buffered I/O and direct I/O, since a direct I/O write
will just end up invalidating the entire large folio, which
could negate any benefit of using buffered I/O alongside it on the
same block device.
-ritesh
* Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
2025-03-21 17:21 ` Ritesh Harjani
@ 2025-03-21 18:55 ` Keith Busch
0 siblings, 0 replies; 25+ messages in thread
From: Keith Busch @ 2025-03-21 18:55 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Darrick J. Wong, Luis Chamberlain, linux-fsdevel, linux-mm,
linux-block, lsf-pc, david, leon, hch, sagi, axboe, joro,
brauner, hare, willy, john.g.garry, p.raghav, gost.dev, da.gomez
On Fri, Mar 21, 2025 at 10:51:42PM +0530, Ritesh Harjani wrote:
> Keith Busch <kbusch@kernel.org> writes:
>
> > On Fri, Mar 21, 2025 at 07:43:09AM +0530, Ritesh Harjani wrote:
> >> i.e. w/o large folios in block devices one could do direct-io &
> >> buffered-io in parallel even just next to each other (assuming 4k pagesize).
> >>
> >> |4k-direct-io | 4k-buffered-io |
> >>
> >>
> >> However with large folios now supported in buffered-io path for block
> >> devices, the application cannot submit such direct-io + buffered-io
> >> pattern in parallel. Since direct-io can end up invalidating the folio
> >> spanning over it's 4k range, on which buffered-io is in progress.
> >
> > Why would buffered io span more than the 4k range here? You're talking
> > to the raw block device in both cases, so they have the exact same
> > logical block size alignment. Why is buffered io allocating beyond
> > the logical size granularity?
>
> This can happen in following 2 cases -
> 1. System's page size is 64k. Then even though the logical block size
> granularity for buffered-io is set to 4k (blockdev --setbsz 4k
> /dev/sdc), it still will instantiate a 64k page in the page cache.
But that already happens without large folio support, so I wasn't
considering that here.
> 2. Second is the recent case where (correct me if I am wrong) we now
> have large folio support for block devices. So here again we can
> instantiate a large folio in the page cache where buffered-io is in
> progress correct? (say a previous read causes a readahead and installs a
> large folio in that region). Or even iomap_write_iter() these days tries
> to first allocate a chunk of size mapping_max_folio_size().
Okay, I am also not sure what happens here with speculative
allocations.