From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [RFC PATCH v2 08/11] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
To: Logan Gunthorpe, linux-kernel@vger.kernel.org,
 linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
 linux-pci@vger.kernel.org, linux-mm@kvack.org,
 iommu@lists.linux-foundation.org
Cc: Minturn Dave B, John Hubbard, Dave Hansen, Ira Weiny, Matthew Wilcox,
 Christian König, Jason Gunthorpe, Jason Ekstrand, Daniel Vetter,
 Dan Williams, Stephen Bates, Jakowski Andrzej, Christoph Hellwig,
 Xiong Jianxin
References: <20210311233142.7900-1-logang@deltatee.com>
 <20210311233142.7900-9-logang@deltatee.com>
 <45701356-ee41-1ad2-0e06-ca74af87b05a@deltatee.com>
From: Robin Murphy
Message-ID: <76cc1c82-3cf4-92d3-992f-5c876ed30523@arm.com>
Date: Fri, 12 Mar 2021 19:47:08 +0000
In-Reply-To: <45701356-ee41-1ad2-0e06-ca74af87b05a@deltatee.com>

On 2021-03-12 17:03, Logan Gunthorpe wrote:
>
>
> On 2021-03-12 8:52 a.m., Robin Murphy wrote:
>> On 2021-03-11 23:31, Logan Gunthorpe wrote:
>>> When a PCI P2PDMA page is seen, set the IOVA length of the segment
>>> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
>>> apply the appropriate bus address to the segment. The IOVA is not
>>> created if the scatterlist only consists of P2PDMA pages.
>>
>> This misled me at first, but I see the implementation does actually
>> appear to accommodate the case of working ACS where P2P *would* still
>> need to be mapped at the IOMMU.
>
> Yes, that's correct.

>>>   static int __finalise_sg(struct device *dev, struct scatterlist *sg,
>>> int nents,
>>> -        dma_addr_t dma_addr)
>>> +        dma_addr_t dma_addr, unsigned long attrs)
>>>   {
>>>       struct scatterlist *s, *cur = sg;
>>>       unsigned long seg_mask = dma_get_seg_boundary(dev);
>>> @@ -864,6 +865,20 @@ static int __finalise_sg(struct device *dev,
>>> struct scatterlist *sg, int nents,
>>>           sg_dma_address(s) = DMA_MAPPING_ERROR;
>>>           sg_dma_len(s) = 0;
>>>
>>> +        if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
>>> +            if (i > 0)
>>> +                cur = sg_next(cur);
>>> +
>>> +            sg_dma_address(cur) = sg_phys(s) + s->offset -
>>
>> Are you sure about that? ;)
>
> Do you see a bug? I don't follow you...

sg_phys() already accounts for the offset, so you're adding it twice.

>>> +                pci_p2pdma_bus_offset(sg_page(s));
>>
>> Can the bus offset make P2P addresses overlap with regions of mem space
>> that we might use for regular IOVA allocation? That would be very bad...
>
> No. IOMMU drivers already disallow all PCI addresses from being used as
> IOVA addresses. See, for example, dmar_init_reserved_ranges(). It would
> be a huge problem for a whole lot of other reasons if it didn't.

I know we reserve the outbound windows (largely *because* some host
bridges will consider those addresses as attempts at unsupported P2P and
prevent them working), I just wanted to confirm that this bus offset is
always something small that stays within the relevant window, rather
than something that might make a BAR appear in a completely different
place for P2P purposes. If so, that's good.
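
(Spelling out the sg_phys() point from above, in case it's not obvious:
sg_phys(s) is already page_to_phys(sg_page(s)) + s->offset, so unless
I'm misreading the intent, I'd expect that assignment to simply want to
be something like

	sg_dma_address(cur) = sg_phys(s) -
			pci_p2pdma_bus_offset(sg_page(s));

i.e. without adding s->offset a second time. That's only a sketch based
on the helper as it appears in this patch, though, so do double-check
it against what you actually intend the bus address to be.)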

>>> @@ -960,11 +975,12 @@ static int iommu_dma_map_sg(struct device *dev,
>>> struct scatterlist *sg,
>>>       struct iommu_dma_cookie *cookie = domain->iova_cookie;
>>>       struct iova_domain *iovad = &cookie->iovad;
>>>       struct scatterlist *s, *prev = NULL;
>>> +    struct dev_pagemap *pgmap = NULL;
>>>       int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
>>>       dma_addr_t iova;
>>>       size_t iova_len = 0;
>>>       unsigned long mask = dma_get_seg_boundary(dev);
>>> -    int i;
>>> +    int i, map = -1, ret = 0;
>>>
>>>       if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
>>>           iommu_deferred_attach(dev, domain))
>>> @@ -993,6 +1009,23 @@ static int iommu_dma_map_sg(struct device *dev,
>>> struct scatterlist *sg,
>>>           s_length = iova_align(iovad, s_length + s_iova_off);
>>>           s->length = s_length;
>>>
>>> +        if (is_pci_p2pdma_page(sg_page(s))) {
>>> +            if (sg_page(s)->pgmap != pgmap) {
>>> +                pgmap = sg_page(s)->pgmap;
>>> +                map = pci_p2pdma_dma_map_type(dev, pgmap);
>>> +            }
>>> +
>>> +            if (map < 0) {
>>
>> It rather feels like it should be the job of whoever creates the list in
>> the first place not to put unusable pages in it, especially since the
>> p2pdma_map_type looks to be a fairly coarse-grained and static thing.
>> The DMA API isn't responsible for validating normal memory pages, so
>> what makes P2P special?
>
> Yes, that would be ideal, but there are some difficulties there. For the
> driver to check the pages, it would need to loop through the entire SG
> one more time on every transaction, regardless of whether there are
> P2PDMA pages, or not. So that will have a performance impact even when
> the feature isn't being used. I don't think that'll be acceptable for
> many drivers.
>
> The other possibility is for GUP to do it when it gets the pages from
> userspace. But GUP doesn't have all the information to do this at the
> moment. We'd have to pass the struct device that will eventually map the
> pages through all the nested functions in the GUP to do that test at
> that time. This might not be a bad option (that I half looked into), but
> I'm not sure how acceptable it would be to the GUP developers.

Urgh, yes, if a page may or may not be valid for p2p depending on which
device is trying to map it, then it probably is most reasonable to
figure that out at this point. It's a little unfortunate having to cope
with failure so late, but oh well.
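
(For the sake of comparison, the extra up-front pass we'd be asking the
driver to make would presumably look something like the sketch below,
using the helpers this series introduces - untested, and the local
variable names are just placeholders:

	/* hypothetical pre-validation pass before calling dma_map_sg() */
	for_each_sg(sgl, s, nents, i) {
		if (is_pci_p2pdma_page(sg_page(s)) &&
		    pci_p2pdma_dma_map_type(dev, sg_page(s)->pgmap) < 0)
			return -EREMOTEIO;
	}

which, as you say, is an extra walk of the whole list even when there
are no P2P pages involved, so I can see why doing it in the mapping
loop itself is preferable.)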

> But even if we do verify the pages ahead of time, we still need the same
> infrastructure in dma_map_sg(); it could only now be a BUG if the driver
> sent invalid pages instead of an error return.

The hope was that we could save doing even that - e.g. if you pass a
dodgy page into dma_map_page(), maybe page_to_phys() will crash, maybe
you'll just end up with a DMA address that won't work, but either way it
doesn't care in its own right - but it seems that's moot.

>>> +                ret = -EREMOTEIO;
>>> +                goto out_restore_sg;
>>> +            }
>>> +
>>> +            if (map) {
>>> +                s->length = 0;
>>
>> I'm not really thrilled about the idea of passing zero-length segments
>> to iommu_map_sg(). Yes, it happens to trick the concatenation logic in
>> the current implementation into doing what you want, but it feels fragile.
>
> We're not passing zero length segments to iommu_map_sg() (or any
> function). This loop is just scanning to calculate the length of the
> required IOVA. __finalise_sg() (which is intimately tied to this loop)
> then needs a way to determine which segments were P2P segments. The
> existing code already overwrites s->length with an aligned length and
> stores the original length in sg_dma_len. So we're not relying on
> tricking any logic here.

Yes, we temporarily shuffle in page-aligned quantities to satisfy the
needs of the iommu_map_sg() call, before unpacking things again in
__finalise_sg(). It's some disgusting trickery that I'm particularly
proud of. My point is that if you have a mix of both p2p and normal
segments - which seems to be a case you want to support - then the
length of 0 that you set to flag p2p segments here will be seen by
iommu_map_sg() (as it walks the list to map the other segments) before
you then use it as a key to override the DMA address in the final step.
It's not a concern if you have a p2p-only list and short-circuit
straight to that step (in which case all the shuffling was wasted effort
anyway), but since it's not entirely clear what a segment with zero
length would mean in general, it seems like a good idea to avoid passing
the list across a public boundary in that state, if possible.

Robin.

>>>   }
>>>
>>>   static void iommu_dma_unmap_sg(struct device *dev, struct
>>> scatterlist *sg,
>>>           int nents, enum dma_data_direction dir, unsigned long attrs)
>>>   {
>>> -    dma_addr_t start, end;
>>> +    dma_addr_t end, start = DMA_MAPPING_ERROR;
>>>       struct scatterlist *tmp;
>>>       int i;
>>>
>>> @@ -1054,14 +1090,20 @@ static void iommu_dma_unmap_sg(struct device
>>> *dev, struct scatterlist *sg,
>>>        * The scatterlist segments are mapped into a single
>>>        * contiguous IOVA allocation, so this is incredibly easy.
>>>        */
>>> -    start = sg_dma_address(sg);
>>> -    for_each_sg(sg_next(sg), tmp, nents - 1, i) {
>>> +    for_each_sg(sg, tmp, nents, i) {
>>> +        if (sg_is_pci_p2pdma(tmp))
>>
>> Since the flag is associated with the DMA address which will no longer
>> be valid, shouldn't it be cleared? The circumstances in which leaving it
>> around could cause a problem are tenuous, but definitely possible.
>
> Yes, that's a good idea.
>
> Thanks for the review!
>
> Logan
>
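
(FWIW, on the flag-clearing point above, I'd imagine that just means the
P2P skip in that loop growing a call to whatever the inverse of the
marking helper is, along the lines of

	for_each_sg(sg, tmp, nents, i) {
		if (sg_is_pci_p2pdma(tmp)) {
			/* helper name assumed - whatever undoes the P2P marking */
			sg_unmark_pci_p2pdma(tmp);
			continue;
		}
		/* existing start/end bookkeeping for the non-P2P segments */
	}

but I haven't looked at how the marking is implemented earlier in the
series, so treat the helper name above as a guess.)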