Date: Tue, 5 Nov 2024 15:53:57 -0400
From: Jason Gunthorpe
To: Christoph Hellwig
Cc: Robin Murphy, Leon Romanovsky, Jens Axboe, Joerg Roedel, Will Deacon,
    Sagi Grimberg, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
    Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
    Marek Szyprowski, Jérôme Glisse, Andrew Morton, Jonathan Corbet,
    linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-block@vger.kernel.org, linux-rdma@vger.kernel.org,
    iommu@lists.linux.dev, linux-nvme@lists.infradead.org,
    linux-pci@vger.kernel.org, kvm@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
Message-ID: <20241105195357.GI35848@ziepe.ca>
References: <3567312e-5942-4037-93dc-587f25f0778c@arm.com>
    <20241104095831.GA28751@lst.de>
In-Reply-To: <20241104095831.GA28751@lst.de>

On Mon, Nov 04, 2024 at 10:58:31AM +0100, Christoph Hellwig wrote:
> On Thu, Oct 31, 2024 at 09:17:45PM +0000, Robin Murphy wrote:
> > The hilarious amount of work that iommu_dma_map_sg() does is pretty much
> > entirely for the benefit of v4l2 and dma-buf importers who *depend* on
> > being able to linearise a scatterlist in DMA address space. TBH I doubt
> > there are many actual scatter-gather-capable devices with significant
> > enough limitations to meaningfully benefit from DMA segment combining these
> > days - I've often thought that by now it might be a good idea to turn that
> > behaviour off by default and add an attribute for callers to explicitly
> > request it.
>
> Even when devices are not limited they often perform significantly better
> when IOVA space is not completely fragmented. While the dma_map_sg code
> is a bit gross due to the fact that it has to deal with unaligned segments,
> the coalescing itself often is a big win.

RDMA is like this too. Almost all the MR HW gets big wins if the entire
scatter list is IOVA contiguous. One of the future steps I'd like to see
on top of this is to fine tune the IOVA allocation backing MRs to
exactly match the HW needs. Having proper alignment and contiguity can
be a huge reduction in device overhead: a 100MB MR may need to store
200K of mapping information on-device, but with a properly aligned IOVA
this can be reduced to only 16 bytes.

Avoiding a double translation tax when the iommu HW is enabled is
potentially significant. We have some RDMA workloads with VMs where the
NIC is holding ~1GB of memory just for translations, but the iommu is
active as the S2, i.e. we are paying a double tax on translation. It
could be a very interesting trade off to reduce the NIC side to nothing
and rely on the CPU IOMMU with nested translation instead.

> Note that dma_map_sg also has two other very useful features: batching
> of the iotlb flushing, and support for P2P, which to be efficient also
> requires batching the lookups.

This is the main point and, I think, the uniqueness Leon is talking
about. We don't get those properties through any other API and this one
series preserves them.

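To put numbers on the 100MB MR example above, here is a small user-space
sketch of the arithmetic. The 4 KiB page size and the 8-byte per-page
entry are assumptions chosen because they reproduce the 200K figure; the
mail does not state them. struct contig_desc and both helpers are
hypothetical names for illustration, not anything from the series or
from real MR hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed, not stated in the mail: 4 KiB pages, one 8-byte entry per page. */
#define ASSUMED_PAGE_SIZE  4096u
#define ASSUMED_ENTRY_SIZE 8u

/* Hypothetical descriptor for one contiguous, aligned IOVA range. */
struct contig_desc {
	uint64_t iova_base;
	uint64_t length;
};

/* On-device bytes if every page of the MR needs its own entry. */
static inline uint64_t per_page_table_bytes(uint64_t mr_bytes)
{
	assert(mr_bytes % ASSUMED_PAGE_SIZE == 0);
	return (mr_bytes / ASSUMED_PAGE_SIZE) * ASSUMED_ENTRY_SIZE;
}

/* On-device bytes if the whole MR is a single (base, length) range. */
static inline uint64_t contiguous_table_bytes(void)
{
	return sizeof(struct contig_desc);
}
```

Under those assumptions, per_page_table_bytes(100ull << 20) gives 25600
entries of 8 bytes, i.e. 200 KiB, while contiguous_table_bytes() is the
16 bytes of one base/length pair.
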

In fact I would say that is the entire point of this series: preserve
everything special about dma_map_sg() compared to dma_map_page() but
don't require a scatterlist.

> >> Several approaches have been explored to expand the DMA API with additional
> >> scatterlist-like structures (BIO, rlist), instead split up the DMA API
> >> to allow callers to bring their own data structure.
> >
> > And this line of reasoning is still "2 + 2 = Thursday" - what is to say
> > those two notions in any way related? We literally already have one generic
> > DMA operation which doesn't operate on struct page, yet needed nothing
> > "split up" to be possible.
>
> Yeah, I don't really get the struct page argument. In fact if we look
> at the nitty-gritty details of dma_map_page it doesn't really need a
> page at all.

Today, if you want to map a P2P address you must have a struct page,
because page->pgmap is the only source of information on the P2P
topology.

So the logic is: to get P2P without struct page we need a way to have
all the features of dma_map_sg() but without a mandatory scatterlist,
because we cannot remove struct page from scatterlist.

This series gets to the first step - no scatterlist. There will need to
be another series to provide an alternative to page->pgmap to get the
P2P information. Then we really won't have struct page dependence in the
DMA API.

I actually once looked at how to enhance dma_map_resource() to support
P2P and it was not very nice; the unmap side became quite complex. I
think this is a more elegant solution than what I was sketching.

> >> for the device. Instead of allocating a scatter list entry per allocated
> >> page it can just allocate an array of 'struct page *', saving a large
> >> amount of memory.
> >
> > VFIO already assumes a coherent device with (realistically) an IOMMU which
> > it explicitly manages - why is it even pretending to need a generic DMA
> > API?
>
> AFAIK that does isn't really vfio as we know it but the control device
> for live migration. But Leon or Jason might fill in more.

Yes, this is the control side of the VFIO live migration driver that
needs rather a lot of memory to store the migration blob. There is
definitely an iommu, and the VF function is definitely translating, but
that doesn't mean the PF function is using dma-iommu.c; it is often in
iommu passthrough/identity and using DMA direct.

It was done as an alternative example of how to use the API. Again there
are more improvements possible there; the driver does not take advantage
of contiguity or alignment when programming the HW.

> Because we only need to preallocate the tiny constant sized dma_iova_state
> as part of the request instead of an additional scatterlist that requires
> sizeof(struct page *) + sizeof(dma_addr_t) + 3 * sizeof(unsigned int)
> per segment, including a memory allocation per I/O for that.

Right, eliminating scatterlist entirely on fast paths is a big point. I
recall Chuck was keen on the same thing for NFSoRDMA as well.

> At least for the block code we have a nice little core wrapper that is
> very easy to use, and provides a great reduction of memory use and
> allocations. The HMM use case I'll let others talk about.

I saw the Intel XE team make a complicated integration with the DMA API
that wasn't so good. They were looking at an earlier version of this and
I think the feedback was positive. It should make a big difference, but
we will need to see what they come up with and possibly tweak things.

Jason
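
For scale, the per-segment scatterlist cost quoted above (a page
pointer, a dma_addr_t and three unsigned ints) can be compared with a
plain array of page pointers in a few lines of user-space C. This is
only a sketch: struct mock_sg_entry is a hypothetical stand-in with
illustrative field names, not the kernel's struct scatterlist, and a
64-bit dma_addr_t is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one scatterlist entry; not the kernel struct. */
struct mock_sg_entry {
	void *page;              /* stands in for struct page * */
	uint64_t dma_address;    /* stands in for dma_addr_t, assumed 64-bit */
	unsigned int offset;     /* the three unsigned ints from the quote */
	unsigned int length;
	unsigned int dma_length;
};

/* Bytes needed to describe nsegs segments with one entry per segment. */
static inline size_t sg_table_bytes(size_t nsegs)
{
	return nsegs * sizeof(struct mock_sg_entry);
}

/* Bytes needed for a bare array of page pointers, one per page. */
static inline size_t page_array_bytes(size_t npages)
{
	return npages * sizeof(void *);
}
```

On a typical LP64 target the mock entry holds 28 bytes of fields and is
padded to 32, so describing N pages costs about four times as much as
the N * 8 bytes of a bare pointer array, before counting the per-I/O
allocation mentioned in the quote.
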