Date: Thu, 7 Mar 2024 17:01:16 -0400
From: Jason Gunthorpe <jgg@ziepe.ca>
To: Christoph Hellwig
Cc: Leon Romanovsky, Robin Murphy, Marek Szyprowski, Joerg Roedel,
	Will Deacon, Chaitanya Kulkarni, Jonathan Corbet, Jens Axboe,
	Keith Busch, Sagi Grimberg, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse, Andrew Morton,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, linux-rdma@vger.kernel.org,
	iommu@lists.linux.dev, linux-nvme@lists.infradead.org,
	kvm@vger.kernel.org, linux-mm@kvack.org, Bart Van Assche,
	Damien Le Moal, Amir Goldstein, josef@toxicpanda.com,
	"Martin K. Petersen", daniel@iogearbox.net, Dan Williams,
	jack@suse.com, Zhu Yanjun
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Message-ID: <20240307210116.GQ9225@ziepe.ca>
References: <47afacda-3023-4eb7-b227-5f725c3187c2@arm.com>
	<20240305122935.GB36868@unreal> <20240306144416.GB19711@lst.de>
	<20240306154328.GM9225@ziepe.ca> <20240306162022.GB28427@lst.de>
	<20240306174456.GO9225@ziepe.ca> <20240306221400.GA8663@lst.de>
	<20240307000036.GP9225@ziepe.ca> <20240307150505.GA28978@lst.de>
In-Reply-To: <20240307150505.GA28978@lst.de>

On Thu, Mar 07, 2024 at 04:05:05PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 06, 2024 at 08:00:36PM -0400, Jason Gunthorpe wrote:
> > > I don't think you can do without dma_addr_t storage.  In most cases
> > > you can just store the dma_addr_t in the LE/BE encoded hardware
> > > SGL, so no extra storage should be needed though.
> >
> > RDMA (and often DRM too) generally doesn't work like that; the driver
> > copies the page table into the device, and then the only reason to
> > keep dma_addr_t storage is to pass it to the dma unmap API.
> > Optionally eliminating long-term dma_addr_t storage would be a
> > worthwhile memory saving for large, long-lived user space memory
> > registrations.
>
> It's just kinda hard to do.  For aligned IOMMU mapping you'd only
> have one dma_addr_t mapping (or maybe a few if P2P regions are
> involved), so this probably doesn't matter.  For direct mappings
> you'd have a few, but maybe the better answer is to use THP
> more aggressively and reduce the number of segments.

Right, those things have all been done, and 100GB of huge pages is
still using a fair amount of memory for storing dma_addr_t's.

It is hard to do perfectly, but I think it is not so bad if we focus on
the direct-only case and on simple systems that can exclude swiotlb
early on.

> > > > So are you thinking something more like a driver flow of:
> > > >
> > > >    .. extent IO and get # aligned pages and know if there is P2P ..
> > > >    dma_init_io(state, num_pages, p2p_flag)
> > > >    if (dma_io_single_range(state)) {
> > > >         // #2, #4
> > > >         for each io()
> > > >              dma_link_aligned_pages(state, io range)
> > > >         hw_sgl = (state->iova, state->len)
> > > >    } else {
> > >
> > > I think what you have as dma_io_single_range should come before
> > > the dma_init_io.  If we know we can't coalesce, it really just is a
> > > dma_map_{single,page,bvec} loop; no need for any extra state.
> >
> > I imagine dma_io_single_range() to just check a flag in state.
> >
> > I still want to call dma_init_io() for the non-coalescing cases
> > because all the flows, regardless of composition, should be about as
> > fast as dma_map_sg is today.
>
> If all flows include multiple non-coalesced regions, that just makes
> things very complicated, and that's exactly what I'd want to avoid.

I don't see how to avoid it unless we say RDMA shouldn't use this API,
which is kind of the whole point from my perspective.  I want an API
that can handle all the same complexity as dma_map_sg() without forcing
the use of scatterlist; instead, "bring your own data structure".  This
is the essence of what we discussed.  An API that is inferior to
dma_map_sg() is really problematic to use with RDMA.

> > That means we need to always pre-allocate the IOVA in any case where
> > the IOMMU might be active, even on a non-coalescing flow.
> >
> > IOW, dma_init_io() always pre-allocates IOVA if the iommu is going to
> > be used, and we can't just call today's dma_map_page() in a loop on
> > the non-coalescing side and pay the overhead of Nx IOVA allocations.
> >
> > In large part this is for RDMA, where a single P2P page in a large
> > multi-gigabyte user memory registration shouldn't drastically harm
> > the registration performance by falling down to doing dma_map_page,
> > and an IOVA allocation, on a 4k page-by-page basis.
> But that P2P page needs to be handled very differently, as with it
> we can't actually use a single IOVA range.  So I'm not sure how that
> is even supposed to work.  If you have
>
>    +-------+-----+-------+
>    | local | P2P | local |
>    +-------+-----+-------+
>
> you need at least 3 hw SGL entries, as the IOVA won't be contiguous.

Sure, 3 SGL entries is fine; that isn't what I'm pointing at.

I'm saying that today, if you give such a scatterlist to dma_map_sg(),
it scans it and computes the IOVA space needed, allocates one IOVA
space, then subdivides that single space into the 3 HW SGLs you show.
If you don't preserve that, then we are calling dma_map_page() 4k at a
time, which is nowhere close to the same outcome as what dma_map_sg()
did: I may not get contiguous IOVA, I may not get 3 SGLs, and we call
into the IOVA allocator a huge number of times.

It needs to work following the same basic structure of dma_map_sg(),
unfolding that logic into helpers so that the driver can provide the
data structure:

 - Scan the io ranges and figure out how much IOVA is needed
   (dma_io_summarize_range)
 - Allocate the IOVA (dma_init_io)
 - Scan the io ranges again and generate the final HW SGL
   (dma_io_link_page)
 - Finish the iommu batch (dma_io_done_mapping)

And you can make that pattern work for all the other cases too.  So I
don't see this as particularly worse: calling some other API instead of
dma_map_page() is not really a complexity for the driver, and neither
is calling dma_init_io() every time.  The DMA API side is a bit more,
but not substantively different logic from what dma_map_sg() already
does.

Otherwise, what is the alternative?  How do I keep these complex things
working in RDMA and remove scatterlist?

> > The other thing that got hand-waved here is how dma_init_io() knows
> > which of the states we are looking at.  I imagine we probably want
> > to do something like:
> >
> >    struct dma_io_summarize summary = {};
> >    for each io()
> >         dma_io_summarize_range(&summary, io range)
> >    dma_init_io(dev, &state, &summary);
> >    if (state->single_range) {
> >    } else {
> >    }
> >    dma_io_done_mapping(&state); <-- flush IOTLB once
>
> That's why I really just want 2 cases.  If the caller guarantees the
> range is coalescable and there is an IOMMU, use the iommu-API-like
> API; else just iterate over map_single/page.

But how does the caller even know if it is coalescable?  Other than the
trivial case of a single CPU range, that is a complicated detail based
on what pages are inside the range combined with the capability of the
device doing DMA.  I don't see a simple way for the caller to figure
this out; you need to sweep every page and collect some information on
it.  The above is there to abstract that detail.

It was simpler before the confidential compute stuff :(

Jason