From: Caleb Sander Mateos <csander@purestorage.com>
To: Keith Busch <kbusch@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@kernel.dk>,
Sagi Grimberg <sagi@grimberg.me>,
Andrew Morton <akpm@linux-foundation.org>,
Kanchan Joshi <joshi.k@samsung.com>,
linux-nvme@lists.infradead.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 3/3] nvme/pci: make PRP list DMA pools per-NUMA-node
Date: Thu, 24 Apr 2025 08:46:16 -0700
Message-ID: <CADUfDZqA0v-i1=TSkW6HkUNdN-_954Kq0hJS4H4cgbPb5o9EgA@mail.gmail.com>
In-Reply-To: <aApbYhyeYcCifoYI@kbusch-mbp.dhcp.thefacebook.com>
On Thu, Apr 24, 2025 at 8:40 AM Keith Busch <kbusch@kernel.org> wrote:
>
> On Thu, Apr 24, 2025 at 04:12:49PM +0200, Christoph Hellwig wrote:
> > On Tue, Apr 22, 2025 at 04:09:52PM -0600, Caleb Sander Mateos wrote:
> > > NVMe commands with more than 4 KB of data allocate PRP list pages from
> > > the per-nvme_device dma_pool prp_page_pool or prp_small_pool.
> >
> > That's not actually true. We can transfer all of the MDTS without a
> > single pool allocation when using SGLs.
>
> Let's just change it to say discontiguous data, then.
>
> Though even with PRPs, you could transfer up to 8k without allocating a
> list, if its address is 4k aligned.
Right, it depends on the alignment of the data pages. Christoph is
correct that commands using SGLs don't need to allocate PRP list
pages; however, not all NVMe controllers support SGLs for data
transfers.
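
To make the rule concrete, here is a sketch of the page-count check
being described (illustrative, not the driver's actual helper; it
assumes a 4 KiB controller page size):

#include <linux/kernel.h>
#include <linux/types.h>

#define NVME_PAGE_SIZE 4096u

/*
 * PRP1 covers from the (possibly unaligned) start of the buffer to
 * the end of the first page; PRP2 may point directly at one more
 * page. Only a transfer spanning three or more pages forces a PRP
 * list allocation.
 */
static bool nvme_needs_prp_list(dma_addr_t addr, u32 len)
{
        u32 offset = addr & (NVME_PAGE_SIZE - 1);
        u32 pages = DIV_ROUND_UP(offset + len, NVME_PAGE_SIZE);

        return pages > 2;
}

So a 4 KiB-aligned 8 KiB transfer spans exactly two pages and needs
no list, while the same 8 KiB starting 512 bytes into a page spans
three pages and does.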
>
> > > Each call
> > > to dma_pool_alloc() and dma_pool_free() takes the per-dma_pool spinlock.
> > > These device-global spinlocks are a significant source of contention
> > > when many CPUs are submitting to the same NVMe devices. On a workload
> > > issuing 32 KB reads from 16 CPUs (8 hypertwin pairs) across 2 NUMA nodes
> > > to 23 NVMe devices, we observed 2.4% of CPU time spent in
> > > _raw_spin_lock_irqsave called from dma_pool_alloc and dma_pool_free.
> > >
> > > Ideally, the dma_pools would be per-hctx to minimize
> > > contention. But that could impose considerable resource costs in a
> > > system with many NVMe devices and CPUs.
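
For reference, the shape of the compromise this patch takes instead
is roughly the following (a sketch with illustrative names, assuming
the node-aware pool constructor added by patch 1/3, not the exact
patch code):

#include <linux/dmapool.h>
#include <linux/topology.h>

struct nvme_prp_pools {
        struct dma_pool *page_pool;  /* full PRP list pages */
        struct dma_pool *small_pool; /* small lists for short transfers */
};

/*
 * One pool pair per NUMA node rather than per device (or per hctx):
 * submitters on different nodes no longer contend on one spinlock,
 * while the pool count scales with nodes instead of devices * CPUs.
 */
static struct dma_pool *nvme_prp_page_pool(struct nvme_prp_pools pools[])
{
        return pools[numa_node_id()].page_pool;
}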
> >
> > Should we try to simply do a slab allocation first and only allocate
> > from the dmapool when that fails? That should give you all the
> > scalability from the slab allocator with very few downsides.
>
> The dmapool allocates dma coherent memory, and it's mapped for the
> remainder of the pool's lifetime. Allocating slab memory and dma
> mapping per-io would be pretty costly in comparison, I think.
I'm not very familiar with the DMA subsystem, but this was my impression too.
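
To spell out the comparison (a sketch with illustrative names, not
actual driver code):

#include <linux/dma-mapping.h>
#include <linux/dmapool.h>
#include <linux/slab.h>

/* dmapool path: the backing memory was allocated and DMA-mapped once
 * at pool creation, so the per-IO cost is essentially just the
 * (contended) pool spinlock. */
static void prp_via_dmapool(struct dma_pool *pool)
{
        dma_addr_t dma;
        void *prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &dma);

        /* ... fill PRP list, submit command, complete ... */
        dma_pool_free(pool, prp_list, dma);
}

/* slab path: the allocation itself scales well, but every IO would
 * also pay for a streaming DMA map/unmap (potentially IOMMU work and
 * cache maintenance). Error handling omitted. */
static void prp_via_slab(struct device *dev)
{
        void *prp_list = kmalloc(PAGE_SIZE, GFP_ATOMIC);
        dma_addr_t dma;

        /* ... fill PRP list ... */
        dma = dma_map_single(dev, prp_list, PAGE_SIZE, DMA_TO_DEVICE);
        /* ... submit command, wait for completion ... */
        dma_unmap_single(dev, dma, PAGE_SIZE, DMA_TO_DEVICE);
        kfree(prp_list);
}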
Thanks,
Caleb