Date: Mon, 17 Apr 2023 16:59:46 -0300
From: Jason Gunthorpe <jgg@ziepe.ca>
To: linux-mm@kvack.org
Cc: lsf-pc@lists.linuxfoundation.org, linux-mm@kvack.org,
    iommu@lists.linux.dev, linux-rdma@vger.kernel.org, Matthew Wilcox,
    Christoph Hellwig, Joao Martins, John Hubbard, Logan Gunthorpe,
    Ming Lei, linux-block@vger.kernel.org, netdev@vger.kernel.org,
    dri-devel@lists.freedesktop.org, nvdimm@lists.linux.dev,
    "T.J. Mercier", Zhu Yanjun, Dan Williams, Mike Rapoport,
    Bart Van Assche, Chaitanya Kulkarni
Subject: Re: [LSF/MM/BPF proposal]: Physr discussion

On Tue, Feb 28, 2023 at 12:59:41PM -0800, T.J. Mercier wrote:
> On Sat, Jan 21, 2023 at 7:03 AM Jason Gunthorpe wrote:
> >
> > I would like to have a session at LSF to talk about Matthew's
> > physr discussion starter:
> >
> > https://lore.kernel.org/linux-mm/YdyKWeU0HTv8m7wD@casper.infradead.org/
> >
> > I have become interested in this with some immediacy because of
> > IOMMUFD and this other discussion with Christoph:
> >
> > https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/
> >
> > Which results in, more or less, we have no way to do P2P DMA
> > operations without struct page - and from the RDMA side solving this
> > well at the DMA API means advancing at least some part of the physr
> > idea. [..]
I got fairly far along this and had to put it aside for some other
tasks, but here is what I came up with so far:

https://github.com/jgunthorpe/linux/commits/rlist

 PCI/P2PDMA: Do not store bus_off in the pci_p2pdma_map_state
 PCI/P2PDMA: Split out the information about the providing device from pgmap
 PCI/P2PDMA: Move the DMA API helpers to p2pdma_provider
 lib/rlist: Introduce range list
 lib/rlist: Introduce rlist cpu range iterator
 PCI/P2PDMA: Store the p2pdma_provider structs in an xarray
 lib/rlist: Introduce rlist_dma
 dma: Add DMA direct support for rlist mapping
 dma: Generic rlist dma_map_ops
 dma: Add DMA API support for mapping a rlist_cpu to a rlist_dma
 iommu/dma: Implement native rlist dma_map_ops
 dma: Use generic_dma.*_rlist in simple dma_map_ops implementations
 dma: Use generic_dma.*_rlist when map_sg just does map_page for n=1
 dma: Use generic_dma.*_rlist when iommu_area_alloc() is used
 dma/dummy: Add rlist
 s390/dma: Use generic_dma.*_rlist
 mm/gup: Create a wrapper for pin_user_pages to return a rlist
 dmabuf: WIP DMABUF exports the backing memory through rcpu
 RDMA/mlx5: Use rdma_umem_for_each_dma_block()
 RMDA/mlx: Use rdma_umem_for_each_dma_block() instead of sg_dma_address
 RDMA/mlx5: Use the length of the MR not the umem
 RDMA/umem: Add ib_umem_length() instead of open coding
 RDMA: Add IB DMA API wrappers for rlist
 RDMA: Switch ib_umem to rlist
 cover-letter: RFC Create an alternative to scatterlist in the DMA API

It is huge and scary. It is not quite nice enough to post but should
be an interesting starting point for LSF/MM. At least it broadly shows
all the touching required and why this is such a nasty problem.

The draft cover letter explaining what the series does:

cover-letter: RFC Create an alternative to scatterlist in the DMA API

This was kicked off by Matthew with his phyr idea from this thread:

https://lore.kernel.org/linux-mm/YdyKWeU0HTv8m7wD@casper.infradead.org/

However, I have become interested in this with some immediacy because
of IOMMUFD and this other discussion with Christoph:

https://lore.kernel.org/kvm/4-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/

Which results in, more or less, the conclusion that we have no way to
do P2P DMA operations without struct page. This becomes complicated
when we touch RDMA, which relies heavily on scatterlist for its
internal implementations, so being unable to use scatterlist to store
only dma_addr_t's means RDMA needs a complete scatterlist replacement
that can.

So, my objective is to enable the DMA API to "DMA map" something that
is not a scatterlist, may or may not contain struct pages, but can
still contain P2P DMA physical addresses. With this tool, transform
the RDMA subsystem to use the new DMA API, and then go into DMABUF
and stop creating scatterlists without any CPU pages. From that point
we could implement DMABUF in VFIO (as above) and use the DMABUF to
feed the MMIO pages into IOMMUFD to restore the PCI P2P support in
VMs without creating the follow_pte security problem that VFIO has.

After going through the thread again, and making some sketches, I've
come up with this suggestion as a path forward, explored very roughly
in this RFC:

1) Create something I've called a 'range list CPU iterator'. This is
   an API that abstractly iterates over CPU physical memory ranges.
   It has useful helpers to iterate over things in 'struct
   page/folio *', physical ranges, copy to/from, and so on.

   It has the necessary extra bits beyond the physr sketch to support
   P2P in the DMA API based on what was done for the pgmap based
   stuff, ie we need to know the provider of the non-struct page
   memory to get the struct device to compute the p2p distance and
   compute the pci_offset.

   The immediate idea is that this is an iterator, not a data
   structure, so it can iterate over different kinds of storage. This
   frees us from having to immediately consolidate all the different
   storage schemes in the kernel and lets that work happen over time.
   I imagine we would want to have this work with struct page * (for
   GUP) and bio_vec (for storage/net) and something else for the
   "kitchen sink" with DMABUF/etc. We may also want to allow it to
   wrapper scatterlist to provide for a more gradual code migration.

   Things are organized so that sometime in the future this could
   collapse down into something that is not a multi-storage iterator,
   but perhaps just a single storage type that everyone is happy
   with. In the meantime we can use the API to progress all the other
   related infrastructure.

   Fundamentally this tries to avoid the scatterlist mistake of
   leaking too much of the storage implementation detail to the user.
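   As a thumbnail of the idea, the iterator would hand back
   (physical range, provider) tuples, something like the below -
   every name here is invented for illustration, this is not the
   code in the branch above:

     /*
      * Illustrative sketch only - made-up names, not the rlist
      * branch code. Each step yields one physically contiguous CPU
      * range; 'provider' identifies the P2P source when there is no
      * struct page behind the range.
      */
     struct rlist_cpu_iter {
             phys_addr_t phys;                 /* start of current range */
             size_t len;                       /* length of current range */
             struct p2pdma_provider *provider; /* NULL for normal memory */
     };

     #define rlist_cpu_for_each(rl, iter)                       \
             for (rlist_cpu_iter_start(rl, iter);               \
                  !rlist_cpu_iter_done(iter);                   \
                  rlist_cpu_iter_next(iter))

     /* Helpers of the kind described above */
     struct page *rlist_cpu_iter_page(struct rlist_cpu_iter *iter);
     size_t rlist_cpu_copy_from(struct rlist_cpu *rl, void *dst,
                                size_t offset, size_t len);

   The point is that the caller only ever sees physically contiguous
   ranges plus the P2P provider; which storage sits underneath stays
   hidden behind the iterator.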
2) Create a general storage called the "range list". This is intended
   to be a general catch-all like scatterlist is, and it is optimized
   towards page list users, so it is quite good for what RDMA wants.

3) Create a "range list DMA iterator" which is the dma_addr_t version
   of #1. This needs to have all the goodies to break up the ranges
   into things HW would like, such as page lists, or restricted
   scatter/gather records.

I've been able to draft several optimizations in the DMA mapping path
that should help offset some of the CPU cost of the more abstracted
iterators:

 - DMA direct can directly re-use the CPU list with no iteration or
   memory allocation.

 - The IOMMU path can do only one iteration by pre-recording whether
   the CPU list was all page-aligned when it was created.

The following patches go deeper into my thinking, present fairly
complete drafts of what things could look like, and more broadly
explore the whole idea. At the end of the series we have:

 - rlist, rlist_cpu, rlist_dma implementations
 - An rlist implementation for every dma_map_ops
 - Good rlist implementations for DMA direct and dma-iommu.c
 - A pin_user_pages() wrapper
 - RDMA umem converted and compiling with some RDMA drivers
 - Compile tested only :)

It is a huge amount of work, and I'd like to get a sense of what
people think before going deeper into a more final, tested
implementation. I know this is not quite what Matthew and Christoph
have talked about.
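To give a feel for how I imagine the three pieces chaining together
for an RDMA-style user, a purely illustrative fragment - again, every
function name below is a stand-in invented for this sketch, not the
branch's actual API:

  /*
   * Illustrative only - rlist_pin_user_pages(), dma_map_rlist() and
   * rlist_unpin_user_pages() are invented names for this sketch.
   */
  static int map_user_buffer(struct device *dev, unsigned long uaddr,
                             size_t len, struct rlist_dma *dma)
  {
          struct rlist_cpu cpu;
          int ret;

          /* #2: pin the user pages into the range list storage */
          ret = rlist_pin_user_pages(&cpu, uaddr, len, FOLL_WRITE);
          if (ret)
                  return ret;

          /*
           * Map the whole list in one call; on DMA direct this can
           * re-use the CPU list with no iteration or allocation.
           */
          ret = dma_map_rlist(dev, &cpu, DMA_BIDIRECTIONAL, dma);
          if (ret)
                  rlist_unpin_user_pages(&cpu);
          return ret;
  }

The driver would then walk the rlist_dma with the #3 iterator to
build whatever page list or scatter/gather format its HW wants.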