Date: Tue, 28 Jan 2025 11:16:10 -0400
From: Jason Gunthorpe <jgg@ziepe.ca>
To: Thomas Hellström
Cc: Yonatan Maman, kherbst@redhat.com, lyude@redhat.com, dakr@redhat.com,
	airlied@gmail.com, simona@ffwll.ch, leon@kernel.org,
	jglisse@redhat.com, akpm@linux-foundation.org, GalShalom@nvidia.com,
	dri-devel@lists.freedesktop.org, nouveau@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-mm@kvack.org, linux-tegra@vger.kernel.org
Subject: Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
Message-ID: <20250128151610.GC1524382@ziepe.ca>
References: <20241201103659.420677-1-ymaman@nvidia.com>
	<20241201103659.420677-2-ymaman@nvidia.com>
	<7282ac68c47886caa2bc2a2813d41a04adf938e1.camel@linux.intel.com>
	<20250128132034.GA1524382@ziepe.ca>
On Tue, Jan 28, 2025 at 03:48:54PM +0100, Thomas Hellström wrote:
> On Tue, 2025-01-28 at 09:20 -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 28, 2025 at 09:51:52AM +0100, Thomas Hellström wrote:
> >
> > > How would the pgmap device know whether P2P is actually possible
> > > without knowing the client device, (like calling
> > > pci_p2pdma_distance) and also if looking into access control,
> > > whether it is allowed?
> > The DMA API will do this, this happens after this patch is put on top
> > of Leon's DMA API patches. The mapping operation will fail and it will
> > likely be fatal to whatever is going on.
> >
> > get_dma_pfn_for_device() returns a new PFN, but that is not a DMA
> > mapped address, it is just a PFN that has another struct page under
> > it.
> >
> > There is an implicit assumption here that P2P will work and we don't
> > need a 3rd case to handle non-working P2P..
>
> OK. We will have the case where we want pfnmaps with driver-private
> fast interconnects to return "interconnect possible, don't migrate"
> whereas possibly other gpus and other devices would return
> "interconnect unsuitable, do migrate", so (as I understand it)
> something requiring a more flexible interface than this.

I'm not sure this doesn't handle that case?

Here we are talking about having DEVICE_PRIVATE struct page mappings.
On a GPU this should represent GPU local memory that is non-coherent
with the CPU, and not mapped into the CPU.

This series supports three cases:

1) pgmap->owner == range->dev_private_owner

   This is "driver private fast interconnect". In this case HMM should
   immediately return the page. The calling driver understands the
   private parts of the pgmap and computes the private interconnect
   address. This requires organizing your driver so that all private
   interconnect has the same pgmap->owner.

2) The page is DEVICE_PRIVATE and get_dma_pfn_for_device() exists.

   The exporting driver has the option to return a P2P struct page that
   can be used for PCI P2P without any migration. In a PCI GPU context
   this means the GPU has mapped its local memory to a PCI address. The
   assumption is that P2P always works and so this address can be DMA'd
   from.

3) Migrate back to CPU memory - then everything works.

Is that not enough? Where do you want something different?
> > > but leaves any dma-mapping or pfn mangling to be done after the
> > > call to hmm_range_fault(), since hmm_range_fault() really only
> > > needs to know whether it has to migrate to system or not.
> >
> > See above, this is already the case..
>
> Well what I meant was at hmm_range_fault() time only consider whether
> to migrate or not. Afterwards at dma-mapping time you'd expose the
> alternative pfns that could be used for dma-mapping.

That sounds like you are talking about multipath, we are not really
ready to tackle general multipath yet at the DMA API level, IMHO.

If you are just talking about your private multi-path, then that is
already handled..

> We were actually looking at a solution where the pagemap implements
> something along
>
> bool devmem_allowed(pagemap, client); //for hmm_range_fault
>
> plus dma_map() and dma_unmap() methods.

This sounds like dmabuf philosophy, and I don't think we should go in
this direction. The hmm caller should always be responsible for dma
mapping and we need to improve the DMA API to make this work better,
not build side hacks like this.

You can read my feelings and reasoning on this topic within this huge
thread:

https://lore.kernel.org/dri-devel/20250108132358.GP5556@nvidia.com/

> In this way you'd don't need to expose special p2p dma pages and the

Removing the "special p2p dma pages" has to be done by improving the
DMA API to understand how to map physical addresses without struct
page. We are working toward this, slowly.

pgmap->ops->dma_map/unmap() ideas just repeat the DMABUF mistake of
mis-using the DMA API for P2P cases. Today you cannot correctly DMA map
P2P memory without the struct page.

> interface could also handle driver-private interconnects, where
> dma_maps and dma_unmap() methods become trivial.

We already handle private interconnect.
> > > One benefit of using this alternative
> > > approach is that struct hmm_range can be subclassed by the caller
> > > and for example cache device pairs for which p2p is allowed.
> >
> > If you want to directly address P2P non-uniformity I'd rather do it
> > directly in the core code than using a per-driver callback. Every
> > driver needs exactly the same logic for such a case.
>
> Yeah, and that would look something like the above

No, it would look like the core HMM code calling pci distance on the
P2P page returned from get_dma_pfn_for_device() and if P2P was
impossible then proceed to option #3 fault to CPU.

> although initially we intended to keep these methods in drm
> allocator around its pagemaps, but could of course look into doing
> this directly in dev_pagemap ops.  But still would probably need
> some guidance into what's considered acceptable, and I don't think
> the solution proposed in this patch meets our needs.

I'm still not sure what you are actually trying to achieve?

Jason