From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 24 Apr 2025 09:40:18 -0600
From: Keith Busch <kbusch@kernel.org>
To: Christoph Hellwig
Cc: Caleb Sander Mateos, Jens Axboe, Sagi Grimberg, Andrew Morton,
	Kanchan Joshi, linux-nvme@lists.infradead.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 3/3] nvme/pci: make PRP list DMA pools per-NUMA-node
References: <20250422220952.2111584-1-csander@purestorage.com>
 <20250422220952.2111584-4-csander@purestorage.com>
 <20250424141249.GA18970@lst.de>
In-Reply-To: <20250424141249.GA18970@lst.de>

On Thu, Apr 24, 2025 at 04:12:49PM +0200, Christoph Hellwig wrote:
> On Tue, Apr 22, 2025 at 04:09:52PM -0600, Caleb Sander Mateos wrote:
> > NVMe commands with more than 4 KB of data allocate PRP list pages from
> > the per-nvme_device dma_pool prp_page_pool or prp_small_pool.
>
> That's not actually true. We can transfer all of the MDTS without a
> single pool allocation when using SGLs.

Let's just change it to say discontiguous data, then.
Though even with PRPs, you could transfer up to 8k without allocating a
list, if its address is 4k aligned.

> > Each call to dma_pool_alloc() and dma_pool_free() takes the
> > per-dma_pool spinlock. These device-global spinlocks are a
> > significant source of contention when many CPUs are submitting to the
> > same NVMe devices. On a workload issuing 32 KB reads from 16 CPUs
> > (8 hypertwin pairs) across 2 NUMA nodes to 23 NVMe devices, we
> > observed 2.4% of CPU time spent in _raw_spin_lock_irqsave called from
> > dma_pool_alloc and dma_pool_free.
> >
> > Ideally, the dma_pools would be per-hctx to minimize contention. But
> > that could impose considerable resource costs in a system with many
> > NVMe devices and CPUs.
>
> Should we try to simply do a slab allocation first and only allocate
> from the dmapool when that fails? That should give you all the
> scalability of the slab allocator with very few downsides.

The dmapool allocates DMA-coherent memory, and it stays mapped for the
remainder of the pool's lifetime. Allocating slab memory and DMA mapping
it per-I/O would be pretty costly in comparison, I think.