Date: Mon, 4 Nov 2024 10:58:31 +0100
From: Christoph Hellwig
To: Robin Murphy
Cc: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski, Jérôme Glisse,
	Andrew Morton, Jonathan Corbet, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	linux-rdma@vger.kernel.org, iommu@lists.linux.dev,
	linux-nvme@lists.infradead.org, linux-pci@vger.kernel.org,
	kvm@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
Message-ID: <20241104095831.GA28751@lst.de>
In-Reply-To: <3567312e-5942-4037-93dc-587f25f0778c@arm.com>
On Thu, Oct 31, 2024 at 09:17:45PM +0000, Robin Murphy wrote:
> The hilarious amount of work that iommu_dma_map_sg() does is pretty
> much entirely for the benefit of v4l2 and dma-buf importers who
> *depend* on being able to linearise a scatterlist in DMA address
> space. TBH I doubt there are many actual scatter-gather-capable
> devices with significant enough limitations to meaningfully benefit
> from DMA segment combining these days - I've often thought that by
> now it might be a good idea to turn that behaviour off by default and
> add an attribute for callers to explicitly request it.

Even when devices are not limited they often perform significantly
better when IOVA space is not completely fragmented.
While the dma_map_sg code is a bit gross due to the fact that it has
to deal with unaligned segments, the coalescing itself often is a big
win.  Note that dma_map_sg also has two other very useful features:
batching of the iotlb flushing, and support for P2P, which to be
efficient also requires batching the lookups.

>> This uniqueness has been a long standing pain point as the
>> scatterlist API is mandatory, but expensive to use.
>
> Huh? When and where has anything ever called it mandatory? Nobody's
> getting sent to DMA jail for open-coding:

You don't get sent to jail.  But you do not get batched iotlb sync,
you don't get properly working P2P, and you don't get IOVA coalescing.

>> Several approaches have been explored to expand the DMA API with
>> additional scatterlist-like structures (BIO, rlist), instead split
>> up the DMA API to allow callers to bring their own data structure.
>
> And this line of reasoning is still "2 + 2 = Thursday" - what is to
> say those two notions in any way related? We literally already have
> one generic DMA operation which doesn't operate on struct page, yet
> needed nothing "split up" to be possible.

Yeah, I don't really get the struct page argument.  In fact if we look
at the nitty-gritty details of dma_map_page it doesn't really need a
page at all.  I've been looking at cleaning some of this up and
providing a dma_map_phys/paddr which would be quite handy in a few
places.  Not because we don't have a struct page for the memory, but
because converting to/from it all the time is not very efficient.

>> 2. VFIO PCI live migration code is building a very large "page
>> list" for the device. Instead of allocating a scatter list entry
>> per allocated page it can just allocate an array of 'struct page *',
>> saving a large amount of memory.
>
> VFIO already assumes a coherent device with (realistically) an IOMMU
> which it explicitly manages - why is it even pretending to need a
> generic DMA API?
AFAIK that isn't really vfio as we know it, but the control device for
live migration.  But Leon or Jason might fill in more.

The point is that quite a few devices have these page list based APIs
(RDMA where mlx5 comes from, NVMe with PRPs, AHCI, GPUs).

>> 3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
>> list without having to allocate then populate an intermediate SG
>> table.
>
> As above, given that a bio_vec still deals in struct pages, that
> could seemingly already be done by just mapping the pages, so how is
> it proving any benefit of a fragile new interface?

Because we only need to preallocate the tiny constant sized
dma_iova_state as part of the request instead of an additional
scatterlist that requires sizeof(struct page *) + sizeof(dma_addr_t) +
3 * sizeof(unsigned int) per segment, including a memory allocation
per I/O for that.

> My big concern here is that a thin and vaguely-defined wrapper
> around the IOMMU API is itself a step which smells strongly of
> "abuse and design mistake", given that the basic notion of
> allocating DMA addresses in advance clearly cannot generalise. Thus
> it really demands some considered justification beyond "We must do
> something; This is something; Therefore we must do this." to be
> convincing.

At least for the block code we have a nice little core wrapper that is
very easy to use, and provides a great reduction in memory use and
allocations.  The HMM use case I'll let others talk about.