From: Caleb Sander Mateos <csander@purestorage.com>
Date: Thu, 24 Apr 2025 08:46:16 -0700
Subject: Re: [PATCH v5 3/3] nvme/pci: make PRP list DMA pools per-NUMA-node
To: Keith Busch
Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, Andrew Morton, Kanchan Joshi, linux-nvme@lists.infradead.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20250422220952.2111584-1-csander@purestorage.com> <20250422220952.2111584-4-csander@purestorage.com> <20250424141249.GA18970@lst.de>
On Thu, Apr 24, 2025 at 8:40 AM Keith Busch wrote:
>
> On Thu, Apr 24, 2025 at 04:12:49PM +0200, Christoph Hellwig wrote:
> > On Tue, Apr 22, 2025 at 04:09:52PM -0600, Caleb Sander Mateos wrote:
> > > NVMe commands with more than 4 KB of data allocate PRP list pages from
> > > the per-nvme_device dma_pool prp_page_pool or prp_small_pool.
> >
> > That's not actually true. We can transfer all of the MDTS without a
> > single pool allocation when using SGLs.
>
> Let's just change it to say discontiguous data, then.
> Though even with PRPs, you could transfer up to 8 KB without allocating a
> list, if its address is 4 KB aligned.

Right, it depends on the alignment of the data pages. Christoph is correct
that commands using SGLs don't need to allocate PRP list pages; however,
not all NVMe controllers support SGLs for data transfers.

> > > Each call
> > > to dma_pool_alloc() and dma_pool_free() takes the per-dma_pool spinlock.
> > > These device-global spinlocks are a significant source of contention
> > > when many CPUs are submitting to the same NVMe devices. On a workload
> > > issuing 32 KB reads from 16 CPUs (8 hypertwin pairs) across 2 NUMA nodes
> > > to 23 NVMe devices, we observed 2.4% of CPU time spent in
> > > _raw_spin_lock_irqsave called from dma_pool_alloc and dma_pool_free.
> > >
> > > Ideally, the dma_pools would be per-hctx to minimize
> > > contention. But that could impose considerable resource costs in a
> > > system with many NVMe devices and CPUs.
> >
> > Should we try to simply do a slab allocation first and only allocate
> > from the dmapool when that fails? That should give you all the
> > scalability of the slab allocator with very few downsides.
>
> The dmapool allocates dma-coherent memory, and it's mapped for the
> remainder of the lifetime of the pool. Allocating slab memory and DMA
> mapping per-IO would be pretty costly in comparison, I think.

I'm not very familiar with the DMA subsystem, but this was my impression too.

Thanks,
Caleb