From: Keith Busch <kbusch@kernel.org>
To: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>,
	Luis Chamberlain <mcgrof@kernel.org>,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-block@vger.kernel.org, lsf-pc@lists.linux-foundation.org,
	david@fromorbit.com, leon@kernel.org, sagi@grimberg.me,
	axboe@kernel.dk, joro@8bytes.org, brauner@kernel.org,
	hare@suse.de, willy@infradead.org, djwong@kernel.org,
	john.g.garry@oracle.com, ritesh.list@gmail.com,
	p.raghav@samsung.com, gost.dev@samsung.com, da.gomez@samsung.com
Subject: Re: [LSF/MM/BPF TOPIC] breaking the 512 KiB IO boundary on x86_64
Date: Thu, 20 Mar 2025 09:58:47 -0600
Message-ID: <Z9w7Nz-CxWSqj__H@kbusch-mbp.dhcp.thefacebook.com>
In-Reply-To: <a40a704f-22c8-4ae9-9800-301c9865cee4@acm.org>

On Thu, Mar 20, 2025 at 08:37:05AM -0700, Bart Van Assche wrote:
> On 3/20/25 7:18 AM, Christoph Hellwig wrote:
> > On Thu, Mar 20, 2025 at 04:41:11AM -0700, Luis Chamberlain wrote:
> > > We've been constrained to a maximum single IO of 512 KiB for a while now on x86_64.
> > 
> > No, we absolutely haven't.  I'm regularly seeing multi-MB I/O on both
> > SCSI and NVMe setups.
> 
> Is NVME_MAX_KB_SZ the current maximum I/O size for PCIe NVMe
> controllers? From drivers/nvme/host/pci.c:

Yes, this is the driver's limit. The device's limit may be lower or
higher.

I allocate out of hugetlbfs to reliably send direct IO at this size
because the nvme driver's segment count is limited to 128. The driver
doesn't impose a segment size limit, though. If each segment is only 4k
(a common occurrence), that works out to 128 * 4 KiB = 512 KiB, which I
guess is where Luis is getting the 512K limit?
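
As a rough userspace sketch of that arithmetic (the two #defines are
copied from the quote below; the helper and the example segment sizes
are just my illustration, not driver code):

#include <stdio.h>

#define NVME_MAX_KB_SZ	8192	/* driver cap, from drivers/nvme/host/pci.c */
#define NVME_MAX_SEGS	128

/* Effective IO ceiling: whichever bites first, segment count or byte cap */
static unsigned long max_io_kb(unsigned long seg_size_kb)
{
	unsigned long by_segs = NVME_MAX_SEGS * seg_size_kb;

	return by_segs < NVME_MAX_KB_SZ ? by_segs : NVME_MAX_KB_SZ;
}

int main(void)
{
	/* every segment a lone 4 KiB page: 128 * 4 KiB = 512 KiB */
	printf("4k segments: %lu KiB\n", max_io_kb(4));
	/* hugetlbfs-backed 2 MiB contiguous segments: the 8 MiB driver cap is reachable */
	printf("2M segments: %lu KiB\n", max_io_kb(2048));
	return 0;
}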

> /*
>  * These can be higher, but we need to ensure that any command doesn't
>  * require an sg allocation that needs more than a page of data.
>  */
> #define NVME_MAX_KB_SZ	8192
> #define NVME_MAX_SEGS	128
> #define NVME_MAX_META_SEGS 15
> #define NVME_MAX_NR_ALLOCATIONS	5
> 
> > > This is due to the number of DMA segments and the segment size.
> > 
> > In nvme the max_segment_size is UINT_MAX, and for most SCSI HBAs it is
> > fairly large as well.
> 
> I have a question for NVMe device manufacturers. It has been known for a
> long time that submitting large I/Os with the NVMe SGL format requires
> less CPU time than the NVMe PRP format. Is this sufficient to motivate
> NVMe device manufacturers to implement the SGL format? All SCSI
> controllers I know of, including UFS controllers, support something that
> is much closer to the NVMe SGL format than to the NVMe PRP format.

SGL support does seem less common than you'd think. It is more efficient
when you have physically contiguous pages, or when an IOMMU has mapped
discontiguous pages into a DMA-contiguous IOVA. If you don't have that,
PRP is a little more efficient in memory and CPU usage. But in the
context of large folios, yeah, SGL is the better option.
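
To make that trade-off concrete, here is a rough descriptor-size sketch.
The structs are simplified stand-ins for a PRP entry (a 64-bit page
pointer) and an NVMe SGL data block descriptor (16 bytes), not the
driver's actual definitions, and PRP list chaining is ignored:

#include <stdint.h>
#include <stdio.h>

struct prp_entry { uint64_t addr; };                  /* 8 bytes, one per 4 KiB page */
struct sgl_desc  { uint64_t addr; uint32_t len;
		   uint8_t rsvd[3]; uint8_t type; };  /* 16 bytes per contiguous range */

int main(void)
{
	size_t io_bytes = 8UL << 20;	/* an 8 MiB transfer */

	/* PRP: one entry per 4 KiB page, no matter how contiguous the buffer is */
	size_t prp = (io_bytes / 4096) * sizeof(struct prp_entry);

	/* SGL, buffer contiguous in DMA space (huge pages or IOMMU IOVA): one descriptor */
	size_t sgl_contig = sizeof(struct sgl_desc);

	/* SGL, buffer fragmented into 4 KiB pages: twice the per-entry cost of PRP */
	size_t sgl_frag = (io_bytes / 4096) * sizeof(struct sgl_desc);

	printf("PRP list:        %zu bytes\n", prp);
	printf("SGL, contiguous: %zu bytes\n", sgl_contig);
	printf("SGL, 4k frags:   %zu bytes\n", sgl_frag);
	return 0;
}

The contiguous case is where SGL wins outright; the fragmented case is
the "little more efficient" PRP situation described above.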


Thread overview: 25+ messages
2025-03-20 11:41 Luis Chamberlain
2025-03-20 12:11 ` Matthew Wilcox
2025-03-20 13:29   ` Daniel Gomez
2025-03-20 14:31     ` Matthew Wilcox
2025-03-20 13:47 ` Daniel Gomez
2025-03-20 14:54   ` Christoph Hellwig
2025-03-21  9:14     ` Daniel Gomez
2025-03-20 14:18 ` Christoph Hellwig
2025-03-20 15:37   ` Bart Van Assche
2025-03-20 15:58     ` Keith Busch [this message]
2025-03-20 16:13       ` Kanchan Joshi
2025-03-20 16:38       ` Christoph Hellwig
2025-03-20 21:50         ` Luis Chamberlain
2025-03-20 21:46       ` Luis Chamberlain
2025-03-20 21:40   ` Luis Chamberlain
2025-03-20 18:46 ` Ritesh Harjani
2025-03-20 21:30   ` Darrick J. Wong
2025-03-21  2:13     ` Ritesh Harjani
2025-03-21  3:05       ` Darrick J. Wong
2025-03-21  4:56         ` Theodore Ts'o
2025-03-21  5:00           ` Christoph Hellwig
2025-03-21 18:39             ` Ritesh Harjani
2025-03-21 16:38       ` Keith Busch
2025-03-21 17:21         ` Ritesh Harjani
2025-03-21 18:55           ` Keith Busch
