From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 63B7DE67480 for ; Thu, 31 Oct 2024 21:18:10 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D7A336B0082; Thu, 31 Oct 2024 17:18:09 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id D27DB6B0089; Thu, 31 Oct 2024 17:18:09 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id BC7FF6B008A; Thu, 31 Oct 2024 17:18:09 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 9AE3F6B0082 for ; Thu, 31 Oct 2024 17:18:09 -0400 (EDT) Received: from smtpin21.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id EF868C0911 for ; Thu, 31 Oct 2024 21:18:08 +0000 (UTC) X-FDA: 82735159416.21.AC29918 Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by imf08.hostedemail.com (Postfix) with ESMTP id 68C98160018 for ; Thu, 31 Oct 2024 21:17:48 +0000 (UTC) Authentication-Results: imf08.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=arm.com; spf=pass (imf08.hostedemail.com: domain of robin.murphy@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=robin.murphy@arm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1730409355; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=rUeLLd6vzqTSlz6rrGkZUS4jkWPYe2JbJGcnjlOi4+k=; b=WIpXQrKembludxJEOOKVciciCvDIGzhtjLEeZylXzWRpZr4X6nGV4PeicaEAw9vWERTQdA xx08ThaW1a590dCYZ+YXIDA2XF5Ak0s8vggKzMIie3rrYfDDBJ5eWosZhRmQabE02+ohAq cKEyV0In00BD8g3R78MfRyIHhc1LWH0= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1730409355; a=rsa-sha256; cv=none; b=F7G9g15QpD3eTssWcvFZbladHh5tNRlTbbly2IpL0kIZtht7Uc107mPM3rkbYo8d/5NKGo dd5Yv5pziBluqeu3nBuHK+e7Ka40cj4pui64i58O0VFB3R5rmiAcxziQs6aiVAWgJgXxHR lCfTuRtK+Lm7Cr/8Isy6zhyn4979qz8= ARC-Authentication-Results: i=1; imf08.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=arm.com; spf=pass (imf08.hostedemail.com: domain of robin.murphy@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=robin.murphy@arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 478E21063; Thu, 31 Oct 2024 14:18:35 -0700 (PDT) Received: from [10.1.196.40] (e121345-lin.cambridge.arm.com [10.1.196.40]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 279FA3F73B; Thu, 31 Oct 2024 14:17:51 -0700 (PDT) Message-ID: <3567312e-5942-4037-93dc-587f25f0778c@arm.com> Date: Thu, 31 Oct 2024 21:17:45 +0000 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v1 00/17] Provide a new two step DMA mapping API To: Leon Romanovsky , Jens Axboe , Jason Gunthorpe , Joerg Roedel , Will Deacon , Christoph Hellwig , Sagi Grimberg Cc: Keith Busch , Bjorn Helgaas , Logan Gunthorpe , Yishai Hadas , Shameer Kolothum , Kevin Tian , Alex Williamson , Marek Szyprowski , =?UTF-8?B?SsOpcsO0bWUgR2xpc3Nl?= , Andrew Morton , Jonathan Corbet , linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-block@vger.kernel.org, linux-rdma@vger.kernel.org, iommu@lists.linux.dev, linux-nvme@lists.infradead.org, linux-pci@vger.kernel.org, kvm@vger.kernel.org, linux-mm@kvack.org References: From: Robin Murphy Content-Language: en-GB In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Rspam-User: X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: 68C98160018 X-Stat-Signature: x8w7gun8eabsnrfma6r6jipnde1c9ruh X-HE-Tag: 1730409468-469568 X-HE-Meta: U2FsdGVkX19sfia1yJ8KANLmrkQh6ZttZXKx2K2BcE8g10dr9mOvjDtDXvCsLTBi5mZFJ81novJr/f6J6YfZ7OZFvigLkoK8h9gBObdxaHmZoLL4okuAoq6Y2t33oUeeRhk4nvBionkIelX25t1bFd/chkMnW7Fmh9clIKs6EZziLFEA4TvhBTmbE4Dr3+NpJY84GZzVeYosptsBUDtaQUFZW4bZWogkAp7GWaJd4tLI/kKmxc+8SHFP+ZJpzM1R1A5hBoMLRMIFDbQ3pzcmJxTovZjYRRxytpwHX+Lup5yeQJjOITnamQRr0UnrxVAcOX9olJzbz4qSm3Xo+rDT2TUo+V43LKJuYER6KI9Xqf20CJ44HNFCLf0BBkQq2ex02s0V1DnLuo9r5KlD0HF3HQmXLOY/NCVfrt5xp2ELvcf5vQ6v2LnKr0+BrjHl6UHBKet1LbslH5+bampLVY9ZPaZwEKaLvW6RMp8k3I8DPD9DQ2hdCCSrUe9LSXdMZ/cRn7i5EeebYZPRsmXytGFSF0WmBcfXTBOqSOsbhPzERsfGyV3Q6Wqnxc2HBWjXdrSfm38DE3ooQIPA92qWSsv+p+ZuSwYPB427v5Au6qoFeGiWMicXckgqSCuZMf2n2gO5keJFa7yQbjAnYrHaifwfZ3DjmyLTuJGhAn8niayAI+wZqWdeLOQsvp9+W6MtOt2UyMILedQmgV1T3uOf3nuiysql7wHdWMqUCh3v8tDWCJlelwLVgJDpxVkSkAMA4hN1naXLXdL0JsbXJGRIhUZWiykZZ5JgHPxBUSt3PGp+Q5ZcfrWA9G/5Kns2QrbW1Y8S9ZnAm9KDWaeHZeVl1yMHtNFKhvRqzpXkGTV/3izcfUrMrPyoFI/+ZRSexkjBxQltuLEtdgABlJOYjyASFh8w5gV+lkIyjqBvEYYM17cFvyls85gDoFDJu5SrtdIIy8VQVd3xP+Z+wnei/PzE0Vp B5sBvguC l1mQaTE9stytgVEd8xim74R/M6GMHFd4qEKXWOC9Sd4zBYSFSUxcWdZyXoBL+8jK0bRzLjywxgcnhvPF2uZNweO6s3jETVZFFwa/qA5Tz6QplBURqKQg/QLXf4AOj+qKPqgiN7sS7hh9IjOg8soGev0j7aafACV+4iHcRtQHGH7UiHm8= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 30/10/2024 3:12 pm, Leon Romanovsky wrote: > Changelog: > v1: > * Squashed two VFIO patches into one > * Added Acked-by/Reviewed-by tags > * Fix docs spelling errors > * Simplified dma_iova_sync() API > * Added extra check in dma_iova_destroy() if mapped size to make code more clear > * Fixed checkpatch warnings in p2p patch > * Changed implementation of VFIO mlx5 mlx5vf_add_migration_pages() to > be more general > * Reduced the number of changes in VFIO patch > v0: https://lore.kernel.org/all/cover.1730037276.git.leon@kernel.org > > ---------------------------------------------------------------------------- > The code can be downloaded from: > https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git tag:dma-split-oct-30 > > ---------------------------------------------------------------------------- > Currently the only efficient way to map a complex memory description through > the DMA API is by using the scatterlist APIs. It's really not efficient... In most cases they're just wrappers for a bunch of dma_map_page() etc. calls for the convenience of callers who are using a scatterlist for their own reasons anyway. Even with iommu-dma, I expect that approach would likely perform better for most users as well, given that typical individual segment sizes are much more likely to be in scope of the IOVA caches. The hilarious amount of work that iommu_dma_map_sg() does is pretty much entirely for the benefit of v4l2 and dma-buf importers who *depend* on being able to linearise a scatterlist in DMA address space. TBH I doubt there are many actual scatter-gather-capable devices with significant enough limitations to meaningfully benefit from DMA segment combining these days - I've often thought that by now it might be a good idea to turn that behaviour off by default and add an attribute for callers to explicitly request it. > The SG APIs are unique in that > they efficiently combine the two fundamental operations of sizing and allocating > a large IOVA window from the IOMMU and processing all the per-address > swiotlb/flushing/p2p/map details. Except that's obviously not unique when the page APIs also combine the exact same operations? :/ > This uniqueness has been a long standing pain point as the scatterlist API > is mandatory, but expensive to use. Huh? When and where has anything ever called it mandatory? Nobody's getting sent to DMA jail for open-coding: for_each_sg(...) my_dma_addr = dma_map_page(..., sg_page()); if they do know the map_sg operation is unnecessarily expensive for their needs. > It prevents any kind of optimization or > feature improvement (such as avoiding struct page for P2P) due to the impossibility > of improving the scatterlist. > > Several approaches have been explored to expand the DMA API with additional > scatterlist-like structures (BIO, rlist), instead split up the DMA API > to allow callers to bring their own data structure. And this line of reasoning is still "2 + 2 = Thursday" - what is to say those two notions in any way related? We literally already have one generic DMA operation which doesn't operate on struct page, yet needed nothing "split up" to be possible. Fair enough if callers want some alternative interfaces for mapping memory as well, but to be a common DMA API it has to be usable everywhere and cover all the DMA operations that the current page-based APIs provide, otherwise those callers obviously can't stop using struct pages. What precludes a straightforward dma_map_phys() etc. to parallel the existing API? What's the justification for an IOMMU-specific design when surely if anyone can benefit from more memory-efficient structures across drivers and subsystems it's the little embedded platforms, not the big servers already happy to spend tens to hundreds of megabytes on IOMMU pagetables? > The API is split up into parts: > - Allocate IOVA space: > To do any pre-allocation required. This is done based on the caller > supplying some details about how much IOMMU address space it would need > in worst case. > - Map and unmap relevant structures to pre-allocated IOVA space: > Perform the actual mapping into the pre-allocated IOVA. This is very > similar to dma_map_page(). > > In this and the next series [1], examples of three different users are converted > to the new API to show the benefits and its versatility. Each user has a unique > flow: > 1. RDMA ODP is an example of "SVA mirroring" using HMM that needs to > dynamically map/unmap large numbers of single pages. This becomes > significantly faster in the IOMMU case as the map/unmap is now just > a page table walk, the IOVA allocation is pre-computed once. Significant > amounts of memory are saved as there is no longer a need to store the > dma_addr_t of each page. I particularly enjoy the comment in patch #11 calling out how this "unique flow" is fundamentally incompatible with the API it's supposed to show off and has to rely on a sketchy hack to abuse its "versatility". Great stuff. > 2. VFIO PCI live migration code is building a very large "page list" > for the device. Instead of allocating a scatter list entry per allocated > page it can just allocate an array of 'struct page *', saving a large > amount of memory. VFIO already assumes a coherent device with (realistically) an IOMMU which it explicitly manages - why is it even pretending to need a generic DMA API? > 3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter > list without having to allocate then populate an intermediate SG table. As above, given that a bio_vec still deals in struct pages, that could seemingly already be done by just mapping the pages, so how is it proving any benefit of a fragile new interface? Heck, not that I really want to encourage it, but we also already have network drivers who don't have the space to stash both a DMA address *and* a page address in their descriptors, and economise on shadow storage by instead grovelling into the default IOMMU domain with iova_to_phys(). I mean, I'd _kinda_ like to send them to DMA jail, but it's not an absolutely unreasonable trick to play... also DMA jail doesn't exist. > To make the use of the new API easier, HMM and block subsystems are extended > to hide the optimization details from the caller. Among these optimizations: > * Memory reduction as in most real use cases there is no need to store mapped > DMA addresses and unmap them. > * Reducing the function call overhead by removing the need to call function > pointers and use direct calls instead. > > This step is first along a path to provide alternatives to scatterlist and > solve some of the abuses and design mistakes, for instance in DMABUF's P2P > support. My big concern here is that a thin and vaguely-defined wrapper around the IOMMU API is itself a step which smells strongly of "abuse and design mistake", given that the basic notion of allocating DMA addresses in advance clearly cannot generalise. Thus it really demands some considered justification beyond "We must do something; This is something; Therefore we must do this." to be convincing. Thanks, Robin. > > Thanks > > [1] This still points to v0, as the change is just around handling dma_iova_sync(): > https://lore.kernel.org/all/cover.1730037261.git.leon@kernel.org > > Christoph Hellwig (6): > PCI/P2PDMA: Refactor the p2pdma mapping helpers > dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h > iommu: generalize the batched sync after map interface > iommu/dma: Factor out a iommu_dma_map_swiotlb helper > dma-mapping: add a dma_need_unmap helper > docs: core-api: document the IOVA-based API > > Leon Romanovsky (11): > dma-mapping: Add check if IOVA can be used > dma: Provide an interface to allow allocate IOVA > dma-mapping: Implement link/unlink ranges API > mm/hmm: let users to tag specific PFN with DMA mapped bit > mm/hmm: provide generic DMA managing logic > RDMA/umem: Store ODP access mask information in PFN > RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page > linkage > RDMA/umem: Separate implicit ODP initialization from explicit ODP > vfio/mlx5: Explicitly use number of pages instead of allocated length > vfio/mlx5: Rewrite create mkey flow to allow better code reuse > vfio/mlx5: Convert vfio to use DMA link API > > Documentation/core-api/dma-api.rst | 70 ++++ > drivers/infiniband/core/umem_odp.c | 250 +++++---------- > drivers/infiniband/hw/mlx5/mlx5_ib.h | 12 +- > drivers/infiniband/hw/mlx5/odp.c | 65 ++-- > drivers/infiniband/hw/mlx5/umr.c | 12 +- > drivers/iommu/dma-iommu.c | 459 +++++++++++++++++++++++---- > drivers/iommu/iommu.c | 65 ++-- > drivers/pci/p2pdma.c | 38 +-- > drivers/vfio/pci/mlx5/cmd.c | 373 +++++++++++----------- > drivers/vfio/pci/mlx5/cmd.h | 35 +- > drivers/vfio/pci/mlx5/main.c | 87 +++-- > include/linux/dma-map-ops.h | 54 ---- > include/linux/dma-mapping.h | 85 +++++ > include/linux/hmm-dma.h | 32 ++ > include/linux/hmm.h | 16 + > include/linux/iommu.h | 4 + > include/linux/pci-p2pdma.h | 84 +++++ > include/rdma/ib_umem_odp.h | 25 +- > kernel/dma/direct.c | 44 +-- > kernel/dma/mapping.c | 20 ++ > mm/hmm.c | 231 +++++++++++++- > 21 files changed, 1377 insertions(+), 684 deletions(-) > create mode 100644 include/linux/hmm-dma.h >