* [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2024-12-01 10:36 [RFC 0/5] GPU Direct RDMA (P2P DMA) for Device Private Pages Yonatan Maman
@ 2024-12-01 10:36 ` Yonatan Maman
2025-01-28 8:51 ` Thomas Hellström
2024-12-01 10:36 ` [RFC 2/5] nouveau/dmem: HMM P2P DMA for private dev pages Yonatan Maman
` (3 subsequent siblings)
4 siblings, 1 reply; 26+ messages in thread
From: Yonatan Maman @ 2024-12-01 10:36 UTC (permalink / raw)
To: kherbst, lyude, dakr, airlied, simona, jgg, leon, jglisse, akpm,
Ymaman, GalShalom, dri-devel, nouveau, linux-kernel, linux-rdma,
linux-mm, linux-tegra
From: Yonatan Maman <Ymaman@Nvidia.com>
hmm_range_fault() by default triggers a page fault on device private
pages when the HMM_PFN_REQ_FAULT flag is set, migrating them to RAM. In some
cases, such as with RDMA devices, the migration overhead between the
device (e.g., GPU) and the CPU, and vice versa, significantly degrades
performance. Thus, enabling Peer-to-Peer (P2P) DMA access for device
private pages can be crucial for minimizing data transfer overhead.
Introduce an API to support P2P DMA for device private pages. It includes:
- Leveraging struct dev_pagemap_ops for a P2P page callback. This callback
maps the page for P2P DMA and returns the corresponding PCI P2P page.
- Utilizing hmm_range_fault for initializing P2P DMA. The API
also adds the HMM_PFN_ALLOW_P2P flag option for the
hmm_range_fault caller to request P2P. If set, hmm_range_fault
first attempts to establish the P2P connection, provided the owner device
supports P2P, using the P2P page. On failure or lack of support,
hmm_range_fault continues with the regular flow of migrating the
page to RAM.
This change does not affect previous use-cases of hmm_range_fault,
because both the caller and the page owner must explicitly request and
support it to initialize a P2P connection.
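For illustration only, a hedged caller-side sketch (not part of this patch;
the mmu notifier setup and the usual -EBUSY retry loop are omitted) of
opting in to P2P via the new flag:

static int example_fault_with_p2p(struct mmu_interval_notifier *notifier,
                                  unsigned long start, unsigned long npages,
                                  unsigned long *pfns)
{
        struct hmm_range range = {
                .notifier = notifier,
                .start = start,
                .end = start + (npages << PAGE_SHIFT),
                .hmm_pfns = pfns,
                .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_ALLOW_P2P,
                .dev_private_owner = NULL,      /* we own no private pages here */
        };
        int ret;

        range.notifier_seq = mmu_interval_read_begin(notifier);
        mmap_read_lock(notifier->mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(notifier->mm);

        /*
         * On success, output pfns carrying HMM_PFN_ALLOW_P2P refer to PCI
         * P2PDMA pages provided by the owning driver and can be DMA mapped
         * without first migrating the data to system RAM.
         */
        return ret;
}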
Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
Signed-off-by: Gal Shalom <GalShalom@Nvidia.com>
---
include/linux/hmm.h | 3 ++-
include/linux/memremap.h | 8 ++++++
mm/hmm.c | 57 +++++++++++++++++++++++++++++++++-------
3 files changed, 57 insertions(+), 11 deletions(-)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 62980ca8f3c5..017f22cef893 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -26,6 +26,7 @@ struct mmu_interval_notifier;
* HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output transformation
* to mark that page is already DMA mapped
+ * HMM_PFN_ALLOW_P2P - Allow returning PCI P2PDMA page
*
* On input:
* 0 - Return the current state of the page, do not fault it.
@@ -41,7 +42,7 @@ enum hmm_pfn_flags {
HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
/* Sticky flag, carried from Input to Output */
+ HMM_PFN_ALLOW_P2P = 1UL << (BITS_PER_LONG - 6),
HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 7),
HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 8),
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 3f7143ade32c..cdf5189be5e9 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -89,6 +89,14 @@ struct dev_pagemap_ops {
*/
vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+ /*
+ * Used for private (un-addressable) device memory only. Return a
+ * corresponding PFN for a page that can be mapped to device
+ * (e.g using dma_map_page)
+ */
+ int (*get_dma_pfn_for_device)(struct page *private_page,
+ unsigned long *dma_pfn);
+
/*
* Handle the memory failure happens on a range of pfns. Notify the
* processes who are using these pfns, and try to recover the data on
diff --git a/mm/hmm.c b/mm/hmm.c
index a852d8337c73..1c080bc00ee8 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -226,6 +226,51 @@ static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;
}
+static bool hmm_handle_device_private(struct hmm_range *range,
+ unsigned long pfn_req_flags,
+ swp_entry_t entry,
+ unsigned long *hmm_pfn)
+{
+ struct page *page = pfn_swap_entry_to_page(entry);
+ struct dev_pagemap *pgmap = page->pgmap;
+ int ret;
+ pfn_req_flags &= range->pfn_flags_mask;
+ pfn_req_flags |= range->default_flags;
+
+ /*
+ * Don't fault in device private pages owned by the caller,
+ * just report the PFN.
+ */
+ if (pgmap->owner == range->dev_private_owner) {
+ *hmm_pfn = swp_offset_pfn(entry);
+ goto found;
+ }
+
+ /*
+ * If the caller requested P2P and the page's pagemap supports it,
+ * translate the private page to the matching P2P page. If that
+ * fails, continue with the regular flow.
+ */
+ if (pfn_req_flags & HMM_PFN_ALLOW_P2P &&
+ pgmap->ops->get_dma_pfn_for_device) {
+ ret = pgmap->ops->get_dma_pfn_for_device(page, hmm_pfn);
+ if (!ret) {
+ *hmm_pfn |= HMM_PFN_ALLOW_P2P;
+ goto found;
+ }
+ }
+
+ return false;
+
+found:
+ *hmm_pfn |= HMM_PFN_VALID;
+ if (is_writable_device_private_entry(entry))
+ *hmm_pfn |= HMM_PFN_WRITE;
+ return true;
+}
+
static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
unsigned long end, pmd_t *pmdp, pte_t *ptep,
unsigned long *hmm_pfn)
@@ -249,17 +294,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
if (!pte_present(pte)) {
swp_entry_t entry = pte_to_swp_entry(pte);
- /*
- * Don't fault in device private pages owned by the caller,
- * just report the PFN.
- */
if (is_device_private_entry(entry) &&
- pfn_swap_entry_to_page(entry)->pgmap->owner ==
- range->dev_private_owner) {
- cpu_flags = HMM_PFN_VALID;
- if (is_writable_device_private_entry(entry))
- cpu_flags |= HMM_PFN_WRITE;
- *hmm_pfn = (*hmm_pfn & HMM_PFN_DMA_MAPPED) | swp_offset_pfn(entry) | cpu_flags;
+ hmm_handle_device_private(range, pfn_req_flags, entry, hmm_pfn)) {
+ *hmm_pfn = *hmm_pfn & HMM_PFN_DMA_MAPPED;
return 0;
}
--
2.34.1
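For illustration, a hedged sketch of what an exporting driver's
get_dma_pfn_for_device() implementation could look like. All names below
(example_mem, private_base_pfn, p2p_base_pfn, p2p_enabled) are invented;
the real nouveau implementation is in patch 2/5 of this series.

/* Illustrative per-pagemap bookkeeping; every field name is made up. */
struct example_mem {
        unsigned long private_base_pfn; /* first pfn of the private pagemap */
        unsigned long p2p_base_pfn;     /* first pfn of the matching P2PDMA pages */
        bool p2p_enabled;               /* device memory exposed through a BAR? */
};

static int example_get_dma_pfn_for_device(struct page *private_page,
                                           unsigned long *dma_pfn)
{
        struct example_mem *mem = private_page->pgmap->owner;
        unsigned long offset;

        if (!mem->p2p_enabled)
                return -EOPNOTSUPP;

        /* Same offset into device memory, different struct page pool. */
        offset = page_to_pfn(private_page) - mem->private_base_pfn;
        *dma_pfn = mem->p2p_base_pfn + offset;
        return 0;
}

The callback would then be wired into the driver's dev_pagemap_ops next to
its existing migrate_to_ram handler.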
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2024-12-01 10:36 ` [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages Yonatan Maman
@ 2025-01-28 8:51 ` Thomas Hellström
2025-01-28 13:20 ` Jason Gunthorpe
0 siblings, 1 reply; 26+ messages in thread
From: Thomas Hellström @ 2025-01-28 8:51 UTC (permalink / raw)
To: Yonatan Maman, kherbst, lyude, dakr, airlied, simona, jgg, leon,
jglisse, akpm, GalShalom, dri-devel, nouveau, linux-kernel,
linux-rdma, linux-mm, linux-tegra
Hi, Yonatan
On Sun, 2024-12-01 at 12:36 +0200, Yonatan Maman wrote:
> From: Yonatan Maman <Ymaman@Nvidia.com>
>
> hmm_range_fault() by default triggered a page fault on device private
> when HMM_PFN_REQ_FAULT flag was set. pages, migrating them to RAM. In
> some
> cases, such as with RDMA devices, the migration overhead between the
> device (e.g., GPU) and the CPU, and vice-versa, significantly
> degrades
> performance. Thus, enabling Peer-to-Peer (P2P) DMA access for device
> private page might be crucial for minimizing data transfer overhead.
>
> Introduced an API to support P2P DMA for device private
> pages,includes:
> - Leveraging the struct pagemap_ops for P2P Page Callbacks. This
> callback
> involves mapping the page for P2P DMA and returning the
> corresponding
> PCI_P2P page.
>
> - Utilizing hmm_range_fault for initializing P2P DMA. The API
> also adds the HMM_PFN_REQ_TRY_P2P flag option for the
> hmm_range_fault caller to initialize P2P. If set, hmm_range_fault
> attempts initializing the P2P connection first, if the owner
> device
> supports P2P, using p2p_page. In case of failure or lack of
> support,
> hmm_range_fault will continue with the regular flow of migrating
> the
> page to RAM.
>
> This change does not affect previous use-cases of hmm_range_fault,
> because both the caller and the page owner must explicitly request
> and
> support it to initialize P2P connection.
>
> Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
> Signed-off-by: Gal Shalom <GalShalom@Nvidia.com>
> ---
> include/linux/hmm.h | 3 ++-
> include/linux/memremap.h | 8 ++++++
> mm/hmm.c | 57 +++++++++++++++++++++++++++++++++-----
> --
> 3 files changed, 57 insertions(+), 11 deletions(-)
It appears we're working on a very similar thing. (In fact the original
proposals were sent out the same day.) I have a couple of questions.
>
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 62980ca8f3c5..017f22cef893 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -26,6 +26,7 @@ struct mmu_interval_notifier;
> * HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output
> transformation
> * to mark that page is already DMA mapped
> + * HMM_PFN_ALLOW_P2P - Allow returning PCI P2PDMA page
> *
> * On input:
> * 0 - Return the current state of the page, do not
> fault it.
> @@ -41,7 +42,7 @@ enum hmm_pfn_flags {
> HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
>
> /* Sticky flag, carried from Input to Output */
> + HMM_PFN_ALLOW_P2P = 1UL << (BITS_PER_LONG - 6),
> HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 7),
>
> HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 8),
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 3f7143ade32c..cdf5189be5e9 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -89,6 +89,14 @@ struct dev_pagemap_ops {
> */
> vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
>
> + /*
> + * Used for private (un-addressable) device memory only.
> Return a
> + * corresponding PFN for a page that can be mapped to device
> + * (e.g using dma_map_page)
> + */
> + int (*get_dma_pfn_for_device)(struct page *private_page,
> + unsigned long *dma_pfn);
> +
> /*
> * Handle the memory failure happens on a range of pfns.
> Notify the
> * processes who are using these pfns, and try to recover
> the data on
> diff --git a/mm/hmm.c b/mm/hmm.c
> index a852d8337c73..1c080bc00ee8 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -226,6 +226,51 @@ static inline unsigned long
> pte_to_hmm_pfn_flags(struct hmm_range *range,
> return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) :
> HMM_PFN_VALID;
> }
>
> +static bool hmm_handle_device_private(struct hmm_range *range,
> + unsigned long pfn_req_flags,
> + swp_entry_t entry,
> + unsigned long *hmm_pfn)
> +{
> + struct page *page = pfn_swap_entry_to_page(entry);
> + struct dev_pagemap *pgmap = page->pgmap;
> + int ret;
>
> + pfn_req_flags &= range->pfn_flags_mask;
> + pfn_req_flags |= range->default_flags;
> +
> + /*
> + * Don't fault in device private pages owned by the caller,
> + * just report the PFN.
> + */
> + if (pgmap->owner == range->dev_private_owner) {
> + *hmm_pfn = swp_offset_pfn(entry);
> + goto found;
> + }
> +
> + /*
> + * P2P for supported pages, and according to caller request
> + * translate the private page to the match P2P page if it
> fails
> + * continue with the regular flow
> + */
> + if (pfn_req_flags & HMM_PFN_ALLOW_P2P &&
> + pgmap->ops->get_dma_pfn_for_device) {
> + ret = pgmap->ops->get_dma_pfn_for_device(page,
> hmm_pfn);
How would the pgmap device know whether P2P is actually possible
without knowing the client device (like calling pci_p2pdma_distance()),
and, if looking into access control, whether it is allowed?
I wonder whether you could consider using something that is a little
more generic and would also fit our use-case. Here the caller provides
a callback as to whether devmem access is allowed, but leaves any
dma-mapping or pfn mangling to be done after the call to hmm_range_fault(),
since hmm_range_fault() really only needs to know whether it has to
migrate to system or not. One benefit of using this alternative
approach is that struct hmm_range can be subclassed by the caller and,
for example, cache device pairs for which p2p is allowed.
The current version (after the feedback from Jason) looks like this. It
looks like your use-case could easily be made to fit this one but, as
I understand it, not vice versa. (I could send this as a separate patch
if needed.)
Thanks,
Thomas
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 126a36571667..8ac1f4125e30 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -76,6 +76,21 @@ static inline unsigned int hmm_pfn_to_map_order(unsigned long hmm_pfn)
 	return (hmm_pfn >> HMM_PFN_ORDER_SHIFT) & 0x1F;
 }
 
+struct hmm_range;
+
+/**
+ * struct hmm_range_ops - Functions for detailed cross-device access.
+ */
+struct hmm_range_ops {
+	/**
+	 * @devmem_allow: Whether to allow cross-device access to device_private pages.
+	 * @hrange: Pointer to a struct hmm_range. Typically subclassed by the caller
+	 * to provide needed information.
+	 * @page: The page being queried.
+	 */
+	bool (*devmem_allow)(struct hmm_range *hrange, struct page *page);
+};
+
 /*
  * struct hmm_range - track invalidation lock on virtual address range
  *
@@ -87,6 +102,7 @@ static inline unsigned int hmm_pfn_to_map_order(unsigned long hmm_pfn)
  * @default_flags: default flags for the range (write, read, ... see hmm doc)
  * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
  * @dev_private_owner: owner of device private pages
+ * @ops: Pointer to a struct hmm_range_ops or NULL if no ops provided.
  */
 struct hmm_range {
 	struct mmu_interval_notifier *notifier;
@@ -97,6 +113,7 @@ struct hmm_range {
 	unsigned long default_flags;
 	unsigned long pfn_flags_mask;
 	void *dev_private_owner;
+	const struct hmm_range_ops *ops;
 };
 
 /*
diff --git a/mm/hmm.c b/mm/hmm.c
index 7e0229ae4a5a..ea4e08caa14a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -220,6 +220,15 @@ static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
 	return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;
 }
 
+static bool hmm_devmem_allow(struct hmm_range *range, struct page *page)
+{
+	if (likely(page->pgmap->owner == range->dev_private_owner))
+		return true;
+	if (likely(!range->ops))
+		return false;
+	return range->ops->devmem_allow(range, page);
+}
+
 static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			      unsigned long end, pmd_t *pmdp, pte_t *ptep,
 			      unsigned long *hmm_pfn)
@@ -245,11 +254,10 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	/*
 	 * Don't fault in device private pages owned by the caller,
-	 * just report the PFN.
+	 * or that are accessible to the caller. Just report the PFN.
 	 */
 	if (is_device_private_entry(entry) &&
-	    pfn_swap_entry_to_page(entry)->pgmap->owner ==
-	    range->dev_private_owner) {
+	    hmm_devmem_allow(range, pfn_swap_entry_to_page(entry))) {
 		cpu_flags = HMM_PFN_VALID;
 		if (is_writable_device_private_entry(entry))
 			cpu_flags |= HMM_PFN_WRITE;
--
2.48.1
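For context, a hedged sketch of how a caller might use this proposed hook,
subclassing struct hmm_range as described above. Everything not in the diff
(example_device, example_link_usable()) is invented for illustration.

struct example_device;
bool example_link_usable(struct example_device *client, void *owner);

/* Caller-side subclass carrying whatever devmem_allow() needs. */
struct example_hmm_range {
        struct hmm_range base;
        struct example_device *client;  /* hypothetical client device */
};

static bool example_devmem_allow(struct hmm_range *hrange, struct page *page)
{
        struct example_hmm_range *erange =
                container_of(hrange, struct example_hmm_range, base);

        /* e.g. consult a cached table of device pairs with a usable link */
        return example_link_usable(erange->client, page->pgmap->owner);
}

static const struct hmm_range_ops example_range_ops = {
        .devmem_allow = example_devmem_allow,
};

The caller would point base.ops at example_range_ops before calling
hmm_range_fault().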
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-01-28 8:51 ` Thomas Hellström
@ 2025-01-28 13:20 ` Jason Gunthorpe
2025-01-28 14:48 ` Thomas Hellström
0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2025-01-28 13:20 UTC (permalink / raw)
To: Thomas Hellström
Cc: Yonatan Maman, kherbst, lyude, dakr, airlied, simona, leon,
jglisse, akpm, GalShalom, dri-devel, nouveau, linux-kernel,
linux-rdma, linux-mm, linux-tegra
On Tue, Jan 28, 2025 at 09:51:52AM +0100, Thomas Hellström wrote:
> How would the pgmap device know whether P2P is actually possible
> without knowing the client device, (like calling pci_p2pdma_distance)
> and also if looking into access control, whether it is allowed?
The DMA API will do this, this happens after this patch is put on top
of Leon's DMA API patches. The mapping operation will fail and it will
likely be fatal to whatever is going on.
get_dma_pfn_for_device() returns a new PFN, but that is not a DMA
mapped address, it is just a PFN that has another struct page under
it.
There is an implicit assumption here that P2P will work and we don't
need a 3rd case to handle non-working P2P..
> but leaves any dma- mapping or pfn mangling to be done after the
> call to hmm_range_fault(), since hmm_range_fault() really only needs
> to know whether it has to migrate to system or not.
See above, this is already the case..
> One benefit of using this alternative
> approach is that struct hmm_range can be subclassed by the caller and
> for example cache device pairs for which p2p is allowed.
If you want to directly address P2P non-uniformity I'd rather do it
directly in the core code than using a per-driver callback. Every
driver needs exactly the same logic for such a case.
Jason
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-01-28 13:20 ` Jason Gunthorpe
@ 2025-01-28 14:48 ` Thomas Hellström
2025-01-28 15:16 ` Jason Gunthorpe
0 siblings, 1 reply; 26+ messages in thread
From: Thomas Hellström @ 2025-01-28 14:48 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Yonatan Maman, kherbst, lyude, dakr, airlied, simona, leon,
jglisse, akpm, GalShalom, dri-devel, nouveau, linux-kernel,
linux-rdma, linux-mm, linux-tegra
On Tue, 2025-01-28 at 09:20 -0400, Jason Gunthorpe wrote:
> On Tue, Jan 28, 2025 at 09:51:52AM +0100, Thomas Hellström wrote:
>
> > How would the pgmap device know whether P2P is actually possible
> > without knowing the client device, (like calling
> > pci_p2pdma_distance)
> > and also if looking into access control, whether it is allowed?
>
> The DMA API will do this, this happens after this patch is put on top
> of Leon's DMA API patches. The mapping operation will fail and it
> will
> likely be fatal to whatever is going on.
>
> get_dma_pfn_for_device() returns a new PFN, but that is not a DMA
> mapped address, it is just a PFN that has another struct page under
> it.
>
> There is an implicit assumption here that P2P will work and we don't
> need a 3rd case to handle non-working P2P..
OK. We will have the case where we want pfnmaps with driver-private
fast interconnects to return "interconnect possible, don't migrate"
whereas possibly other gpus and other devices would return
"interconnect unsuitable, do migrate", so (as I understand it)
something requiring a more flexible interface than this.
>
> > but leaves any dma- mapping or pfn mangling to be done after the
> > call to hmm_range_fault(), since hmm_range_fault() really only
> > needs
> > to know whether it has to migrate to system or not.
>
> See above, this is already the case..
Well what I meant was at hmm_range_fault() time only consider whether
to migrate or not. Afterwards at dma-mapping time you'd expose the
alternative pfns that could be used for dma-mapping.
We were actually looking at a solution where the pagemap implements
something along
bool devmem_allowed(pagemap, client); //for hmm_range_fault
plus dma_map() and dma_unmap() methods.
In this way you don't need to expose special p2p dma pages, and the
interface could also handle driver-private interconnects, where the
dma_map() and dma_unmap() methods become trivial.
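Roughly the following shape, as a hedged sketch of that idea (the names are
invented and this is not proposed code):

struct example_devmem_ops {
        /* consumed by hmm_range_fault(): is cross-device access allowed? */
        bool (*devmem_allowed)(struct dev_pagemap *pgmap, struct device *client);
        /* used by the caller afterwards, instead of exposing P2P struct pages */
        dma_addr_t (*dma_map)(struct dev_pagemap *pgmap, struct device *client,
                              struct page *page);
        void (*dma_unmap)(struct dev_pagemap *pgmap, struct device *client,
                          dma_addr_t addr);
};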
>
> > One benefit of using this alternative
> > approach is that struct hmm_range can be subclassed by the caller
> > and
> > for example cache device pairs for which p2p is allowed.
>
> If you want to directly address P2P non-uniformity I'd rather do it
> directly in the core code than using a per-driver callback. Every
> driver needs exactly the same logic for such a case.
Yeah, and that would look something like the above, although initially
we intended to keep these methods in drm allocator around its pagemaps,
but could of course look into doing this directly in dev_pagemap ops.
But still would probably need some guidance into what's considered
acceptable, and I don't think the solution proposed in this patch meets
our needs.
Thanks,
Thomas
>
> Jason
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-01-28 14:48 ` Thomas Hellström
@ 2025-01-28 15:16 ` Jason Gunthorpe
2025-01-28 16:32 ` Thomas Hellström
0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2025-01-28 15:16 UTC (permalink / raw)
To: Thomas Hellström
Cc: Yonatan Maman, kherbst, lyude, dakr, airlied, simona, leon,
jglisse, akpm, GalShalom, dri-devel, nouveau, linux-kernel,
linux-rdma, linux-mm, linux-tegra
On Tue, Jan 28, 2025 at 03:48:54PM +0100, Thomas Hellström wrote:
> On Tue, 2025-01-28 at 09:20 -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 28, 2025 at 09:51:52AM +0100, Thomas Hellström wrote:
> >
> > > How would the pgmap device know whether P2P is actually possible
> > > without knowing the client device, (like calling
> > > pci_p2pdma_distance)
> > > and also if looking into access control, whether it is allowed?
> >
> > The DMA API will do this, this happens after this patch is put on top
> > of Leon's DMA API patches. The mapping operation will fail and it
> > will
> > likely be fatal to whatever is going on.
> >
> > get_dma_pfn_for_device() returns a new PFN, but that is not a DMA
> > mapped address, it is just a PFN that has another struct page under
> > it.
> >
> > There is an implicit assumption here that P2P will work and we don't
> > need a 3rd case to handle non-working P2P..
>
> OK. We will have the case where we want pfnmaps with driver-private
> fast interconnects to return "interconnect possible, don't migrate"
> whereas possibly other gpus and other devices would return
> "interconnect unsuitable, do migrate", so (as I understand it)
> something requiring a more flexible interface than this.
I'm not sure this doesn't handle that case?
Here we are talking about having DEVICE_PRIVATE struct page
mappings. On a GPU this should represent GPU local memory that is
non-coherent with the CPU, and not mapped into the CPU.
This series supports three cases:
1) pgmap->owner == range->dev_private_owner
This is "driver private fast interconnect" in this case HMM should
immediately return the page. The calling driver understands the
private parts of the pgmap and computes the private interconnect
address.
This requires organizing your driver so that all private
interconnect has the same pgmap->owner.
2) The page is DEVICE_PRIVATE and get_dma_pfn_for_device() exists.
The exporting driver has the option to return a P2P struct page
that can be used for PCI P2P without any migration. In a PCI GPU
context this means the GPU has mapped its local memory to a PCI
address. The assumption is that P2P always works and so this
address can be DMA'd from.
3) Migrate back to CPU memory - then everything works.
Is that not enough? Where do you want something different?
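For reference, the three cases condensed into a sketch that mirrors
hmm_handle_device_private() from the patch at the top of this thread
(simplified, nothing new):

static bool example_resolve_device_private(struct hmm_range *range,
                                           struct page *page,
                                           unsigned long req_flags,
                                           unsigned long *hmm_pfn)
{
        struct dev_pagemap *pgmap = page->pgmap;

        /* Case 1: same private interconnect, hand back the private PFN. */
        if (pgmap->owner == range->dev_private_owner) {
                *hmm_pfn = page_to_pfn(page) | HMM_PFN_VALID;
                return true;
        }

        /* Case 2: the exporter offers a PCI P2PDMA alias for this page. */
        if ((req_flags & HMM_PFN_ALLOW_P2P) &&
            pgmap->ops->get_dma_pfn_for_device &&
            !pgmap->ops->get_dma_pfn_for_device(page, hmm_pfn)) {
                *hmm_pfn |= HMM_PFN_VALID | HMM_PFN_ALLOW_P2P;
                return true;
        }

        /* Case 3: the caller falls back to faulting the page to system RAM. */
        return false;
}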
> > > but leaves any dma- mapping or pfn mangling to be done after the
> > > call to hmm_range_fault(), since hmm_range_fault() really only
> > > needs
> > > to know whether it has to migrate to system or not.
> >
> > See above, this is already the case..
>
> Well what I meant was at hmm_range_fault() time only consider whether
> to migrate or not. Afterwards at dma-mapping time you'd expose the
> alternative pfns that could be used for dma-mapping.
That sounds like you are talking about multipath, we are not really
ready to tackle general multipath yet at the DMA API level, IMHO.
If you are just talking about your private multi-path, then that is
already handled..
> We were actually looking at a solution where the pagemap implements
> something along
>
> bool devmem_allowed(pagemap, client); //for hmm_range_fault
>
> plus dma_map() and dma_unmap() methods.
This sounds like dmabuf philosophy, and I don't think we should go in
this direction. The hmm caller should always be responsible for dma
mapping and we need to improve the DMA API to make this work better,
not build side hacks like this.
You can read my feelings and reasoning on this topic within this huge thread:
https://lore.kernel.org/dri-devel/20250108132358.GP5556@nvidia.com/
> In this way you'd don't need to expose special p2p dma pages and the
Removing the "special p2p dma pages" has to be done by improving the
DMA API to understand how to map phsyical addresses without struct
page. We are working toward this, slowly.
pgmap->ops->dma_map/unmap() ideas just repeat the DMABUF mistake
of mis-using the DMA API for P2P cases. Today you cannot correctly DMA
map P2P memory without the struct page.
> interface could also handle driver-private interconnects, where
> dma_maps and dma_unmap() methods become trivial.
We already handle private interconnect.
> > > One benefit of using this alternative
> > > approach is that struct hmm_range can be subclassed by the caller
> > > and
> > > for example cache device pairs for which p2p is allowed.
> >
> > If you want to directly address P2P non-uniformity I'd rather do it
> > directly in the core code than using a per-driver callback. Every
> > driver needs exactly the same logic for such a case.
>
> Yeah, and that would look something like the above
No, it would look like the core HMM code calling pci distance on the
P2P page returned from get_dma_pfn_for_device() and if P2P was
impossible then proceed to option #3 fault to CPU.
> although initially we intended to keep these methods in drm
> allocator around its pagemaps, but could of course look into doing
> this directly in dev_pagemap ops. But still would probably need
> some guidance into what's considered acceptable, and I don't think
> the solution proposed in this patch meets our needs.
I'm still not sure what you are actually trying to achieve?
Jason
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-01-28 15:16 ` Jason Gunthorpe
@ 2025-01-28 16:32 ` Thomas Hellström
2025-01-28 17:21 ` Jason Gunthorpe
0 siblings, 1 reply; 26+ messages in thread
From: Thomas Hellström @ 2025-01-28 16:32 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Yonatan Maman, kherbst, lyude, dakr, airlied, simona, leon,
jglisse, akpm, GalShalom, dri-devel, nouveau, linux-kernel,
linux-rdma, linux-mm, linux-tegra
On Tue, 2025-01-28 at 11:16 -0400, Jason Gunthorpe wrote:
> On Tue, Jan 28, 2025 at 03:48:54PM +0100, Thomas Hellström wrote:
> > On Tue, 2025-01-28 at 09:20 -0400, Jason Gunthorpe wrote:
> > > On Tue, Jan 28, 2025 at 09:51:52AM +0100, Thomas Hellström wrote:
> > >
> > > > How would the pgmap device know whether P2P is actually
> > > > possible
> > > > without knowing the client device, (like calling
> > > > pci_p2pdma_distance)
> > > > and also if looking into access control, whether it is allowed?
> > >
> > > The DMA API will do this, this happens after this patch is put on
> > > top
> > > of Leon's DMA API patches. The mapping operation will fail and it
> > > will
> > > likely be fatal to whatever is going on.
> > >
> > > get_dma_pfn_for_device() returns a new PFN, but that is not a DMA
> > > mapped address, it is just a PFN that has another struct page
> > > under
> > > it.
> > >
> > > There is an implicit assumption here that P2P will work and we
> > > don't
> > > need a 3rd case to handle non-working P2P..
> >
> > OK. We will have the case where we want pfnmaps with driver-private
> > fast interconnects to return "interconnect possible, don't migrate"
> > whereas possibly other gpus and other devices would return
> > "interconnect unsuitable, do migrate", so (as I understand it)
> > something requiring a more flexible interface than this.
>
> I'm not sure this doesn't handle that case?
>
> Here we are talking about having DEVICE_PRIVATE struct page
> mappings. On a GPU this should represent GPU local memory that is
> non-coherent with the CPU, and not mapped into the CPU.
>
> This series supports three case:
>
> 1) pgmap->owner == range->dev_private_owner
> This is "driver private fast interconnect" in this case HMM
> should
> immediately return the page. The calling driver understands the
> private parts of the pgmap and computes the private interconnect
> address.
>
> This requires organizing your driver so that all private
> interconnect has the same pgmap->owner.
Yes, although that makes this map static, since pgmap->owner has to be
set at pgmap creation time, and during initial discussions we were
looking at something dynamic here. However I think we can probably do
with a per-driver owner for now and get back if that's not sufficient.
>
> 2) The page is DEVICE_PRIVATE and get_dma_pfn_for_device() exists.
> The exporting driver has the option to return a P2P struct page
> that can be used for PCI P2P without any migration. In a PCI GPU
> context this means the GPU has mapped its local memory to a PCI
> address. The assumption is that P2P always works and so this
> address can be DMA'd from.
So do I understand it correctly, that the driver then needs to set up
one device_private struct page and one pcie_p2p struct page for each
page of device memory participating in this way?
>
> 3) Migrate back to CPU memory - then eveything works.
>
> Is that not enough? Where do you want something different?
>
> > > > but leaves any dma- mapping or pfn mangling to be done after
> > > > the
> > > > call to hmm_range_fault(), since hmm_range_fault() really only
> > > > needs
> > > > to know whether it has to migrate to system or not.
> > >
> > > See above, this is already the case..
> >
> > Well what I meant was at hmm_range_fault() time only consider
> > whether
> > to migrate or not. Afterwards at dma-mapping time you'd expose the
> > alternative pfns that could be used for dma-mapping.
>
> That sounds like you are talking about multipath, we are not really
> ready to tackle general multipath yet at the DMA API level, IMHO.
>
> If you are just talking about your private multi-path, then that is
> already handled..
No, the issue I'm having with this is really why would
hmm_range_fault() need the new pfn when it could easily be obtained
from the device-private pfn by the hmm_range_fault() caller? The only
thing hmm_range_fault() needs to know is, again, whether to migrate or
not. But I guess if the plan is to have hmm_range_fault() call
pci_p2pdma_distance() on it, and we don't want the exporter to do that,
it makes sense.
>
> > We were actually looking at a solution where the pagemap implements
> > something along
> >
> > bool devmem_allowed(pagemap, client); //for hmm_range_fault
> >
> > plus dma_map() and dma_unmap() methods.
>
> This sounds like dmabuf philosophy, and I don't think we should go in
> this direction. The hmm caller should always be responsible for dma
> mapping and we need to improve the DMA API to make this work better,
> not build side hacks like this.
>
> You can read my feelings and reasoning on this topic within this huge
> thread:
>
> https://lore.kernel.org/dri-devel/20250108132358.GP5556@nvidia.com/
>
> > In this way you'd don't need to expose special p2p dma pages and
> > the
>
> Removing the "special p2p dma pages" has to be done by improving the
> DMA API to understand how to map phsyical addresses without struct
> page. We are working toward this, slowly.
> pgmap->ops->dma_map/unmap() ideas just repeat the DMABUF mistake
> of mis-using the DMA API for P2P cases. Today you cannot correctly
> DMA
> map P2P memory without the struct page.
Yeah, I don't want to drag hmm into that discussion, although
admittedly the idea of pgmap->ops->dma_map/unmap mimics the dma-buf
behaviour.
So anyway what we'll do is to try to use an interconnect-common owner
for now and revisit the problem if that's not sufficient so we can come
up with an acceptable solution.
/Thomas
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-01-28 16:32 ` Thomas Hellström
@ 2025-01-28 17:21 ` Jason Gunthorpe
2025-01-29 13:38 ` Simona Vetter
0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2025-01-28 17:21 UTC (permalink / raw)
To: Thomas Hellström
Cc: Yonatan Maman, kherbst, lyude, dakr, airlied, simona, leon,
jglisse, akpm, GalShalom, dri-devel, nouveau, linux-kernel,
linux-rdma, linux-mm, linux-tegra
On Tue, Jan 28, 2025 at 05:32:23PM +0100, Thomas Hellström wrote:
> > This series supports three case:
> >
> > 1) pgmap->owner == range->dev_private_owner
> > This is "driver private fast interconnect" in this case HMM
> > should
> > immediately return the page. The calling driver understands the
> > private parts of the pgmap and computes the private interconnect
> > address.
> >
> > This requires organizing your driver so that all private
> > interconnect has the same pgmap->owner.
>
> Yes, although that makes this map static, since pgmap->owner has to be
> set at pgmap creation time. and we were during initial discussions
> looking at something dynamic here. However I think we can probably do
> with a per-driver owner for now and get back if that's not sufficient.
The pgmap->owner doesn't *have* to be fixed; certainly during early boot, before
you hand out any page references it can be changed. I wouldn't be
surprised if this is useful to some requirements to build up the
private interconnect topology?
> > 2) The page is DEVICE_PRIVATE and get_dma_pfn_for_device() exists.
> > The exporting driver has the option to return a P2P struct page
> > that can be used for PCI P2P without any migration. In a PCI GPU
> > context this means the GPU has mapped its local memory to a PCI
> > address. The assumption is that P2P always works and so this
> > address can be DMA'd from.
>
> So do I understand it correctly, that the driver then needs to set up
> one device_private struct page and one pcie_p2p struct page for each
> page of device memory participating in this way?
Yes, for now. I hope to remove the p2p page eventually.
> > If you are just talking about your private multi-path, then that is
> > already handled..
>
> No, the issue I'm having with this is really why would
> hmm_range_fault() need the new pfn when it could easily be obtained
> from the device-private pfn by the hmm_range_fault() caller?
That isn't the API of HMM: the caller uses hmm to get PFNs it can use.
Deliberately returning PFNs the caller cannot use is nonsensical to
its purpose :)
> So anyway what we'll do is to try to use an interconnect-common owner
> for now and revisit the problem if that's not sufficient so we can come
> up with an acceptable solution.
That is the intention for sure. The idea was that the drivers under
the private pages would somehow generate unique owners for shared
private interconnect segments.
I wouldn't say this is the end all of the idea, if there are better
ways to handle accepting private pages they can certainly be
explored..
Jason
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-01-28 17:21 ` Jason Gunthorpe
@ 2025-01-29 13:38 ` Simona Vetter
2025-01-29 13:47 ` Jason Gunthorpe
0 siblings, 1 reply; 26+ messages in thread
From: Simona Vetter @ 2025-01-29 13:38 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Thomas Hellström, Yonatan Maman, kherbst, lyude, dakr,
airlied, simona, leon, jglisse, akpm, GalShalom, dri-devel,
nouveau, linux-kernel, linux-rdma, linux-mm, linux-tegra
On Tue, Jan 28, 2025 at 01:21:23PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 28, 2025 at 05:32:23PM +0100, Thomas Hellström wrote:
> > > This series supports three case:
> > >
> > > 1) pgmap->owner == range->dev_private_owner
> > > This is "driver private fast interconnect" in this case HMM
> > > should
> > > immediately return the page. The calling driver understands the
> > > private parts of the pgmap and computes the private interconnect
> > > address.
> > >
> > > This requires organizing your driver so that all private
> > > interconnect has the same pgmap->owner.
> >
> > Yes, although that makes this map static, since pgmap->owner has to be
> > set at pgmap creation time. and we were during initial discussions
> > looking at something dynamic here. However I think we can probably do
> > with a per-driver owner for now and get back if that's not sufficient.
>
> The pgmap->owner doesn't *have* to fixed, certainly during early boot before
> you hand out any page references it can be changed. I wouldn't be
> surprised if this is useful to some requirements to build up the
> private interconnect topology?
The trouble I'm seeing is device probe and the fundamental issue that you
never know when you're done. And so if we entirely rely on pgmap->owner to
figure out the driver private interconnect topology, that's going to be
messy. That's why I'm also leaning towards both comparing owners and
having an additional check whether the interconnect is actually there or
not yet.
You can fake that by doing these checks after hmm_range_fault returned,
and if you get a bunch of unsuitable pages, toss it back to
hmm_range_fault asking for an unconditional migration to system memory for
those. But that's kinda not great and I think goes at least against the
spirit of how you want to handle pci p2p in step 2 below?
Cheers, Sima
> > > 2) The page is DEVICE_PRIVATE and get_dma_pfn_for_device() exists.
> > > The exporting driver has the option to return a P2P struct page
> > > that can be used for PCI P2P without any migration. In a PCI GPU
> > > context this means the GPU has mapped its local memory to a PCI
> > > address. The assumption is that P2P always works and so this
> > > address can be DMA'd from.
> >
> > So do I understand it correctly, that the driver then needs to set up
> > one device_private struct page and one pcie_p2p struct page for each
> > page of device memory participating in this way?
>
> Yes, for now. I hope to remove the p2p page eventually.
>
> > > If you are just talking about your private multi-path, then that is
> > > already handled..
> >
> > No, the issue I'm having with this is really why would
> > hmm_range_fault() need the new pfn when it could easily be obtained
> > from the device-private pfn by the hmm_range_fault() caller?
>
> That isn't the API of HMM, the caller uses hmm to get PFNs it can use.
>
> Deliberately returning PFNs the caller cannot use is nonsensical to
> it's purpose :)
>
> > So anyway what we'll do is to try to use an interconnect-common owner
> > for now and revisit the problem if that's not sufficient so we can come
> > up with an acceptable solution.
>
> That is the intention for sure. The idea was that the drivers under
> the private pages would somehow generate unique owners for shared
> private interconnect segments.
>
> I wouldn't say this is the end all of the idea, if there are better
> ways to handle accepting private pages they can certainly be
> explored..
>
> Jason
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-01-29 13:38 ` Simona Vetter
@ 2025-01-29 13:47 ` Jason Gunthorpe
2025-01-29 17:09 ` Thomas Hellström
2025-01-30 10:50 ` Simona Vetter
0 siblings, 2 replies; 26+ messages in thread
From: Jason Gunthorpe @ 2025-01-29 13:47 UTC (permalink / raw)
To: Thomas Hellström, Yonatan Maman, kherbst, lyude, dakr,
airlied, simona, leon, jglisse, akpm, GalShalom, dri-devel,
nouveau, linux-kernel, linux-rdma, linux-mm, linux-tegra
On Wed, Jan 29, 2025 at 02:38:58PM +0100, Simona Vetter wrote:
> > The pgmap->owner doesn't *have* to fixed, certainly during early boot before
> > you hand out any page references it can be changed. I wouldn't be
> > surprised if this is useful to some requirements to build up the
> > private interconnect topology?
>
> The trouble I'm seeing is device probe and the fundemantal issue that you
> never know when you're done. And so if we entirely rely on pgmap->owner to
> figure out the driver private interconnect topology, that's going to be
> messy. That's why I'm also leaning towards both comparing owners and
> having an additional check whether the interconnect is actually there or
> not yet.
Honestly, I'd rather invest more effort into being able to update
owner for those special corner cases than to slow down the fast path
in hmm_range_fault..
The notion is that owner should represent a contiguous region of
connectivity. IMHO you can always create this, but I suppose there
could be corner cases where you need to split/merge owners.
But again, this isn't set in stone, if someone has a better way to
match the private interconnects without going to driver callbacks then
try that too.
I think driver callbacks inside hmm_range_fault should be the last resort..
> You can fake that by doing these checks after hmm_range_fault returned,
> and if you get a bunch of unsuitable pages, toss it back to
> hmm_range_fault asking for an unconditional migration to system memory for
> those. But that's kinda not great and I think goes at least against the
> spirit of how you want to handle pci p2p in step 2 below?
Right, hmm_range_fault should return pages that can be used and you
should not call it twice.
Jason
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-01-29 13:47 ` Jason Gunthorpe
@ 2025-01-29 17:09 ` Thomas Hellström
2025-01-30 10:50 ` Simona Vetter
1 sibling, 0 replies; 26+ messages in thread
From: Thomas Hellström @ 2025-01-29 17:09 UTC (permalink / raw)
To: Jason Gunthorpe, Yonatan Maman, kherbst, lyude, dakr, airlied,
simona, leon, jglisse, akpm, GalShalom, dri-devel, nouveau,
linux-kernel, linux-rdma, linux-mm, linux-tegra
On Wed, 2025-01-29 at 09:47 -0400, Jason Gunthorpe wrote:
> On Wed, Jan 29, 2025 at 02:38:58PM +0100, Simona Vetter wrote:
>
> > > The pgmap->owner doesn't *have* to fixed, certainly during early
> > > boot before
> > > you hand out any page references it can be changed. I wouldn't be
> > > surprised if this is useful to some requirements to build up the
> > > private interconnect topology?
> >
> > The trouble I'm seeing is device probe and the fundemantal issue
> > that you
> > never know when you're done. And so if we entirely rely on pgmap-
> > >owner to
> > figure out the driver private interconnect topology, that's going
> > to be
> > messy. That's why I'm also leaning towards both comparing owners
> > and
> > having an additional check whether the interconnect is actually
> > there or
> > not yet.
>
> Hoenstely, I'd rather invest more effort into being able to update
> owner for those special corner cases than to slow down the fast path
> in hmm_range_fault..
Just a comment on the performance concern here. This can be crafted in
a way that only if the driver provides a callback, there is a (small)
hit. If there is no callback at that point, we're looking at a
migration to ram. If there is a callback it's typically followed by an
address computation and page-table setup. Compared to those, the
callback performance impact is probably unmeasureable.
/Thomas
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-01-29 13:47 ` Jason Gunthorpe
2025-01-29 17:09 ` Thomas Hellström
@ 2025-01-30 10:50 ` Simona Vetter
2025-01-30 13:23 ` Jason Gunthorpe
1 sibling, 1 reply; 26+ messages in thread
From: Simona Vetter @ 2025-01-30 10:50 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Thomas Hellström, Yonatan Maman, kherbst, lyude, dakr,
airlied, simona, leon, jglisse, akpm, GalShalom, dri-devel,
nouveau, linux-kernel, linux-rdma, linux-mm, linux-tegra
On Wed, Jan 29, 2025 at 09:47:57AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 29, 2025 at 02:38:58PM +0100, Simona Vetter wrote:
>
> > > The pgmap->owner doesn't *have* to fixed, certainly during early boot before
> > > you hand out any page references it can be changed. I wouldn't be
> > > surprised if this is useful to some requirements to build up the
> > > private interconnect topology?
> >
> > The trouble I'm seeing is device probe and the fundemantal issue that you
> > never know when you're done. And so if we entirely rely on pgmap->owner to
> > figure out the driver private interconnect topology, that's going to be
> > messy. That's why I'm also leaning towards both comparing owners and
> > having an additional check whether the interconnect is actually there or
> > not yet.
>
> Hoenstely, I'd rather invest more effort into being able to update
> owner for those special corner cases than to slow down the fast path
> in hmm_range_fault..
I'm not sure how you want to make the owner mutable.
The only design that I think is solid is to evict all device private
memory, unregister the dev_pagemap and register a new one with the updated
owner. I think any other approach boils down to the same issue, except we
pretend it's easier and just ignore all the race conditions.
And I've looked at the lifetime fun of unregistering a dev_pagemap for
device hotunplug and pretty firmly concluded it's unfixable and that I
should run away to do something else :-P
An optional callback (or redoing hmm_range_fault or whacking the
migration helpers a few times) looks a lot less scary to me than making
pgmap->owner mutable in some fashion.
Cheers, Sima
> The notion is that owner should represent a contiguous region of
> connectivity. IMHO you can always create this, but I suppose there
> could be corner cases where you need to split/merge owners.
>
> But again, this isn't set in stone, if someone has a better way to
> match the private interconnects without going to driver callbacks then
> try that too.
>
> I think driver callbacks inside hmm_range_fault should be the last resort..
>
> > You can fake that by doing these checks after hmm_range_fault returned,
> > and if you get a bunch of unsuitable pages, toss it back to
> > hmm_range_fault asking for an unconditional migration to system memory for
> > those. But that's kinda not great and I think goes at least against the
> > spirit of how you want to handle pci p2p in step 2 below?
>
> Right, hmm_range_fault should return pages that can be used and you
> should not call it twice.
>
> Jason
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-01-30 10:50 ` Simona Vetter
@ 2025-01-30 13:23 ` Jason Gunthorpe
2025-01-30 16:09 ` Simona Vetter
0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2025-01-30 13:23 UTC (permalink / raw)
To: Thomas Hellström, Yonatan Maman, kherbst, lyude, dakr,
airlied, simona, leon, jglisse, akpm, GalShalom, dri-devel,
nouveau, linux-kernel, linux-rdma, linux-mm, linux-tegra
On Thu, Jan 30, 2025 at 11:50:27AM +0100, Simona Vetter wrote:
> On Wed, Jan 29, 2025 at 09:47:57AM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 29, 2025 at 02:38:58PM +0100, Simona Vetter wrote:
> >
> > > > The pgmap->owner doesn't *have* to fixed, certainly during early boot before
> > > > you hand out any page references it can be changed. I wouldn't be
> > > > surprised if this is useful to some requirements to build up the
> > > > private interconnect topology?
> > >
> > > The trouble I'm seeing is device probe and the fundemantal issue that you
> > > never know when you're done. And so if we entirely rely on pgmap->owner to
> > > figure out the driver private interconnect topology, that's going to be
> > > messy. That's why I'm also leaning towards both comparing owners and
> > > having an additional check whether the interconnect is actually there or
> > > not yet.
> >
> > Hoenstely, I'd rather invest more effort into being able to update
> > owner for those special corner cases than to slow down the fast path
> > in hmm_range_fault..
>
> I'm not sure how you want to make the owner mutable.
You'd probably have to use a system where you never free them until
all the page maps are destroyed.
You could also use an integer instead of a pointer to indicate the
cluster of interconnect, I think there are many options..
> And I've looked at the lifetime fun of unregistering a dev_pagemap for
> device hotunplug and pretty firmly concluded it's unfixable and that I
> should run away to do something else :-P
? It is supposed to work, it blocks until all the pages are freed, but
AFAIK there is no fundamental lifetime issue. The driver is
responsible to free all its usage.
> An optional callback is a lot less scary to me here (or redoing
> hmm_range_fault or whacking the migration helpers a few times) looks a lot
> less scary than making pgmap->owner mutable in some fashion.
It's extra for every single 4k page on every user :\
And what are you going to do better inside this callback?
Jason
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-01-30 13:23 ` Jason Gunthorpe
@ 2025-01-30 16:09 ` Simona Vetter
2025-01-30 17:42 ` Jason Gunthorpe
0 siblings, 1 reply; 26+ messages in thread
From: Simona Vetter @ 2025-01-30 16:09 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Thomas Hellström, Yonatan Maman, kherbst, lyude, dakr,
airlied, simona, leon, jglisse, akpm, GalShalom, dri-devel,
nouveau, linux-kernel, linux-rdma, linux-mm, linux-tegra
On Thu, Jan 30, 2025 at 09:23:17AM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 30, 2025 at 11:50:27AM +0100, Simona Vetter wrote:
> > On Wed, Jan 29, 2025 at 09:47:57AM -0400, Jason Gunthorpe wrote:
> > > On Wed, Jan 29, 2025 at 02:38:58PM +0100, Simona Vetter wrote:
> > >
> > > > > The pgmap->owner doesn't *have* to fixed, certainly during early boot before
> > > > > you hand out any page references it can be changed. I wouldn't be
> > > > > surprised if this is useful to some requirements to build up the
> > > > > private interconnect topology?
> > > >
> > > > The trouble I'm seeing is device probe and the fundemantal issue that you
> > > > never know when you're done. And so if we entirely rely on pgmap->owner to
> > > > figure out the driver private interconnect topology, that's going to be
> > > > messy. That's why I'm also leaning towards both comparing owners and
> > > > having an additional check whether the interconnect is actually there or
> > > > not yet.
> > >
> > > Hoenstely, I'd rather invest more effort into being able to update
> > > owner for those special corner cases than to slow down the fast path
> > > in hmm_range_fault..
> >
> > I'm not sure how you want to make the owner mutable.
>
> You'd probably have to use a system where you never free them until
> all the page maps are destroyed.
>
> You could also use an integer instead of a pointer to indicate the
> cluster of interconnect, I think there are many options..
Hm yeah I guess an integer allocator of the atomic_inc kind plus "surely
32bit is enough" could work. But I don't think it's needed, if we can
reliably just unregister the entire dev_pagemap and then just set up a new
one. Plus that avoids thinking about which barriers we might need where
exactly all over mm code that looks at the owner field.
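A minimal sketch of that kind of allocator, purely for illustration (not
proposed code):

/* Hand out a fresh interconnect-cluster id; "surely 32bit is enough". */
static atomic_t example_cluster_ids = ATOMIC_INIT(0);

static unsigned int example_alloc_cluster_id(void)
{
        return (unsigned int)atomic_inc_return(&example_cluster_ids);
}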
> > And I've looked at the lifetime fun of unregistering a dev_pagemap for
> > device hotunplug and pretty firmly concluded it's unfixable and that I
> > should run away to do something else :-P
>
> ? It is supposed to work, it blocks until all the pages are freed, but
> AFAIK ther is no fundamental life time issue. The driver is
> responsible to free all its usage.
Hm I looked at it again, and I guess with the fixes to make migration to
system memory work reliably in Matt Brost's latest series it should indeed
work reliably. The devm_ version still freaks me out because of how easily
people misuse these for things that are memory allocations.
> > An optional callback is a lot less scary to me here (or redoing
> > hmm_range_fault or whacking the migration helpers a few times) looks a lot
> > less scary than making pgmap->owner mutable in some fashion.
>
> It extra for every single 4k page on every user :\
>
> And what are you going to do better inside this callback?
Having more comfy illusions :-P
Slightly more seriously, I can grab some locks and make life easier,
whereas sprinkling locking or even barriers over pgmap->owner in core mm
is not going to fly. Plus more flexibility, e.g. when the interconnect
doesn't work for atomics, or for some other funny reason only works for
some of the traffic, and you need to decide more dynamically where memory
is ok to sit. Or checking p2pdma connectivity and all that stuff.
But we can also do all that stuff by checking afterwards or migrating
memory around as needed. At least for drivers who cooperate and all set
the same owner, which I think is Thomas' current plan.
Also note that fundamentally you can't protect against the hotunplug or
driver unload case for hardware access. So some access will go to nowhere
when that happens, until we've torn down all the mappings and migrated
memory out.
-Sima
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-01-30 16:09 ` Simona Vetter
@ 2025-01-30 17:42 ` Jason Gunthorpe
2025-01-31 16:59 ` Simona Vetter
0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2025-01-30 17:42 UTC (permalink / raw)
To: Thomas Hellström, Yonatan Maman, kherbst, lyude, dakr,
airlied, simona, leon, jglisse, akpm, GalShalom, dri-devel,
nouveau, linux-kernel, linux-rdma, linux-mm, linux-tegra
On Thu, Jan 30, 2025 at 05:09:44PM +0100, Simona Vetter wrote:
> > You could also use an integer instead of a pointer to indicate the
> > cluster of interconnect, I think there are many options..
>
> Hm yeah I guess an integer allocater of the atomic_inc kind plus "surely
> 32bit is enough" could work. But I don't think it's needed, if we can
> reliable just unregister the entire dev_pagemap and then just set up a new
> one. Plus that avoids thinking about which barriers we might need where
> exactly all over mm code that looks at the owner field.
IMHO that is the best answer if it works for the driver.
> > ? It is supposed to work, it blocks until all the pages are freed, but
> > AFAIK ther is no fundamental life time issue. The driver is
> > responsible to free all its usage.
>
> Hm I looked at it again, and I guess with the fixes to make migration to
> system memory work reliable in Matt Brost's latest series it should indeed
> work reliable. The devm_ version still freaks me out because of how easily
> people misuse these for things that are memory allocations.
I also don't like the devm stuff, especially in costly places like
this. Oh well.
> > > An optional callback is a lot less scary to me here (or redoing
> > > hmm_range_fault or whacking the migration helpers a few times) looks a lot
> > > less scary than making pgmap->owner mutable in some fashion.
> >
> > It extra for every single 4k page on every user :\
> >
> > And what are you going to do better inside this callback?
>
> Having more comfy illusions :-P
Exactly!
> Slightly more seriously, I can grab some locks and make life easier,
Yes, but then see my concern about performance again. Now you are
locking/unlocking every 4k? And then it still races since it can
change after hmm_range_fault returns. That's not small, and not really
better.
> whereas sprinkling locking or even barriers over pgmap->owner in core mm
> is not going to fly. Plus more flexibility, e.g. when the interconnect
> doesn't work for atomics or some other funny reason it only works for some
> of the traffic, where you need to more dynamically decide where memory is
> ok to sit.
Sure, an asymmetric mess could be problematic, and we might need more
later, but let's get to that first..
> Or checking p2pdma connectivity and all that stuff.
Should be done in the core code, don't want drivers open coding this
stuff.
> Also note that fundamentally you can't protect against the hotunplug or
> driver unload case for hardware access. So some access will go to nowhere
> when that happens, until we've torn down all the mappings and migrated
> memory out.
I think a literal (safe!) hot unplug must always use the page map
revoke, and that should be safely locked between hmm_range_fault and
the notifier.
If the underlying fabric is losing operations during an unplanned hot
unplug I expect things will need resets to recover..
Jason
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-01-30 17:42 ` Jason Gunthorpe
@ 2025-01-31 16:59 ` Simona Vetter
2025-02-03 15:08 ` Jason Gunthorpe
0 siblings, 1 reply; 26+ messages in thread
From: Simona Vetter @ 2025-01-31 16:59 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Thomas Hellström, Yonatan Maman, kherbst, lyude, dakr,
airlied, simona, leon, jglisse, akpm, GalShalom, dri-devel,
nouveau, linux-kernel, linux-rdma, linux-mm, linux-tegra
On Thu, Jan 30, 2025 at 01:42:17PM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 30, 2025 at 05:09:44PM +0100, Simona Vetter wrote:
> > > > An optional callback is a lot less scary to me here (or redoing
> > > > hmm_range_fault or whacking the migration helpers a few times) looks a lot
> > > > less scary than making pgmap->owner mutable in some fashion.
> > >
> > > It extra for every single 4k page on every user :\
> > >
> > > And what are you going to do better inside this callback?
> >
> > Having more comfy illusions :-P
>
> Exactly!
>
> > Slightly more seriously, I can grab some locks and make life easier,
>
> Yes, but then see my concern about performance again. Now you are
> locking/unlocking every 4k? And then it still races since it can
> change after hmm_range_fault returns. That's not small, and not really
> better.
Hm yeah, I think that's the death argument for the callback. Consider me
convinced on that being a bad idea.
> > whereas sprinkling locking or even barriers over pgmap->owner in core mm
> > is not going to fly. Plus more flexibility, e.g. when the interconnect
> > doesn't work for atomics or some other funny reason it only works for some
> > of the traffic, where you need to more dynamically decide where memory is
> > ok to sit.
>
> Sure, an asymmetric mess could be problematic, and we might need more
> later, but let's get to that first..
>
> > Or checking p2pdma connectivity and all that stuff.
>
> Should be done in the core code, don't want drivers open coding this
> stuff.
Yeah so after amdkfd and nouveau I agree that letting drivers mess this up
isn't great. But see below, I'm not sure whether going all the way to core
code is the right approach, at least for gpu internal needs.
> > Also note that fundamentally you can't protect against the hotunplug or
> > driver unload case for hardware access. So some access will go to nowhere
> > when that happens, until we've torn down all the mappings and migrated
> > memory out.
>
> I think a literal (safe!) hot unplug must always use the page map
> revoke, and that should be safely locked between hmm_range_fault and
> the notifier.
>
> If the underlying fabric is losing operations during an unplanned hot
> unplug I expect things will need resets to recover..
So one aspect where I don't like the pgmap->owner approach much is that
it's a big thing to get right, and it feels a bit to me that we don't yet
know the right questions.
A bit related is that we'll have to do some driver-specific migration
after hmm_range_fault anyway for allocation policies. With coherent
interconnect that'd be up to numactl, but for driver private it's all up
to the driver. And once we have that, we can also migrate memory around
that's misplaced for functional and not just performance reasons.
The plan I discussed with Thomas a while back at least for gpus was to
have that as a drm_devpagemap library, which would have a common owner (or
maybe per driver or so as Thomas suggested). Then it would still not be in
drivers, but also a bit easier to mess around with for experiments. And
once we have some clear things that hmm_range_fault should do instead for
everyone, we can lift them up.
Doing this at a pagemap level should also be much more efficient, since I
think we can make the assumption that access limitations are uniform for a
given dev_pagemap (and if they're not if e.g. not the entire vram is bus
visible, drivers can handle that by splitting things up).
But upfront speccing all this out doesn't seem like a good idea to me,
because I honestly don't know what we all need.
Cheers, Sima
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-01-31 16:59 ` Simona Vetter
@ 2025-02-03 15:08 ` Jason Gunthorpe
2025-02-04 9:32 ` Thomas Hellström
0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2025-02-03 15:08 UTC (permalink / raw)
To: Thomas Hellström, Yonatan Maman, kherbst, lyude, dakr,
airlied, simona, leon, jglisse, akpm, GalShalom, dri-devel,
nouveau, linux-kernel, linux-rdma, linux-mm, linux-tegra
On Fri, Jan 31, 2025 at 05:59:26PM +0100, Simona Vetter wrote:
> So one aspect where I don't like the pgmap->owner approach much is that
> it's a big thing to get right, and it feels a bit to me that we don't yet
> know the right questions.
Well, I would say it isn't really complete yet. No driver has yet
attempted to use a private interconnect with this scheme. Probably it
needs more work.
> A bit related is that we'll have to do some driver-specific migration
> after hmm_range_fault anyway for allocation policies. With coherent
> interconnect that'd be up to numactl, but for driver private it's all up
> to the driver. And once we have that, we can also migrate memory around
> that's misplaced for functional and not just performance reasons.
Are you sure? This doesn't seem to be what any hmm_range_fault() user
should be doing. hmm_range_fault() is to help mirror the page table
to a secondary, that is all. Migration policy shouldn't be part of it,
just mirroring doesn't necessarily mean any access was performed, for
instance.
And mirroring doesn't track any access done by non-faulting cases either.
> The plan I discussed with Thomas a while back at least for gpus was to
> have that as a drm_devpagemap library,
I would not be happy to see this. Please improve pagemap directly if
you think you need more things.
> which would have a common owner (or
> maybe per driver or so as Thomas suggested).
Neither really match the expected design here. The owner should be
entirely based on reachability. Devices that cannot reach each other
directly should have different owners.
> But upfront speccing all this out doesn't seem like a good idea to me,
> because I honestly don't know what we all need.
This is why it is currently just void *owner :)
Jason
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-02-03 15:08 ` Jason Gunthorpe
@ 2025-02-04 9:32 ` Thomas Hellström
2025-02-04 13:26 ` Jason Gunthorpe
0 siblings, 1 reply; 26+ messages in thread
From: Thomas Hellström @ 2025-02-04 9:32 UTC (permalink / raw)
To: Jason Gunthorpe, Yonatan Maman, kherbst, lyude, dakr, airlied,
simona, leon, jglisse, akpm, GalShalom, dri-devel, nouveau,
linux-kernel, linux-rdma, linux-mm, linux-tegra
On Mon, 2025-02-03 at 11:08 -0400, Jason Gunthorpe wrote:
> On Fri, Jan 31, 2025 at 05:59:26PM +0100, Simona Vetter wrote:
>
> > So one aspect where I don't like the pgmap->owner approach much is
> > that
> > it's a big thing to get right, and it feels a bit to me that we
> > don't yet
> > know the right questions.
>
> Well, I would say it isn't really complete yet. No driver has yet
> attempted to use a private interconnect with this scheme. Probably
> it
> needs more work.
>
> > A bit related is that we'll have to do some driver-specific
> > migration
> > after hmm_range_fault anyway for allocation policies. With coherent
> > interconnect that'd be up to numactl, but for driver private it's
> > all up
> > to the driver. And once we have that, we can also migrate memory
> > around
> > that's misplaced for functional and not just performance reasons.
>
> Are you sure? This doesn't seem to be what any hmm_range_fault() user
> should be doing. hmm_range_fault() is to help mirror the page table
> to a secondary, that is all. Migration policy shouldn't be part of
> it,
> just mirroring doesn't necessarily mean any access was performed, for
> instance.
>
> And mirroring doesn't track any access done by non-faulting cases
> either.
>
> > The plan I discussed with Thomas a while back at least for gpus was
> > to
> > have that as a drm_devpagemap library,
>
> I would not be happy to see this. Please improve pagemap directly if
> you think you need more things.
These are mainly helpers to migrate and populate a range of cpu memory
space (struct mm_struct) with GPU device_private memory, migrate to
system on gpu memory shortage and implement the migrate_to_vram pagemap
op, tied to gpu device memory allocations, so I don't think there is
anything we should be exposing at the dev_pagemap level at this point?
>
> > which would have a common owner (or
> > maybe per driver or so as Thomas suggested).
>
> Neither really match the expected design here. The owner should be
> entirely based on reachability. Devices that cannot reach each other
> directly should have different owners.
Actually what I'm putting together is a small helper to allocate and
assign an "owner" based on devices that are previously registered to a
"registry". The caller has to indicate using a callback function for
each struct device pair whether there is a fast interconnect available,
and this is expected to be done at pagemap creation time, so I think
this aligns with the above. Initially a "registry" (which is a list of
device-owner pairs) will be driver-local, but could easily have a wider
scope.
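
A minimal sketch of the registry idea described above, with all names
(drm_pagemap_owner_entry, drm_pagemap_owner_registry, drm_pagemap_get_owner,
devices_connected) invented for illustration; nothing below is from the
posted series:

#include <linux/device.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/slab.h>

/* Hypothetical sketch only: derive a pgmap "owner" token from pairwise
 * interconnect reachability. None of these names exist in the kernel.
 */
struct drm_pagemap_owner_entry {
	struct list_head link;
	struct device *dev;
	void *owner;
};

struct drm_pagemap_owner_registry {
	struct mutex lock;
	struct list_head entries;
	/* true if @a and @b share a fast interconnect */
	bool (*devices_connected)(struct device *a, struct device *b);
};

/* Called once at pagemap creation time: reuse the owner of any already
 * registered device this one can reach, otherwise the new entry itself
 * becomes a fresh owner token. Assumes connectivity is symmetric and
 * transitive across the registered devices.
 */
static void *drm_pagemap_get_owner(struct drm_pagemap_owner_registry *reg,
				   struct device *dev)
{
	struct drm_pagemap_owner_entry *e, *new;

	new = kzalloc(sizeof(*new), GFP_KERNEL);
	if (!new)
		return NULL;
	new->dev = dev;
	new->owner = new;	/* default: only reachable by itself */

	mutex_lock(&reg->lock);
	list_for_each_entry(e, &reg->entries, link) {
		if (reg->devices_connected(dev, e->dev)) {
			new->owner = e->owner;
			break;
		}
	}
	list_add_tail(&new->link, &reg->entries);
	mutex_unlock(&reg->lock);

	return new->owner;
}

The returned token would then be stored in pgmap->owner at pagemap creation
and passed as range->dev_private_owner by every device sharing the
interconnect, which matches the reachability rule discussed above.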
This means we handle access control, unplug checks and similar at
migration time, typically before hmm_range_fault(), and the role of
hmm_range_fault() will be to hand over pfns whose backing memory is directly
accessible to the device, or else migrate to system.
Device unplug would then be handled by refusing migrations to the
device (gpu drivers would probably use drm_dev_enter()), and then evict
all device memory after a drm_dev_unplug(). This of course relies on
that eviction is more or less failsafe.
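
To make that unplug handling concrete, a rough sketch could look like the
following; example_migrate_to_devmem() and example_do_migrate() are invented
placeholders, while drm_dev_enter()/drm_dev_exit()/drm_dev_unplug() are the
existing DRM APIs mentioned above:

#include <drm/drm_drv.h>
#include <linux/mm_types.h>

/* Refuse migration to device memory once the device is unplugged.
 * drm_dev_enter() fails after drm_dev_unplug() has been called, so a
 * fault racing with unplug simply leaves the range in system memory.
 */
static int example_migrate_to_devmem(struct drm_device *drm,
				     struct vm_area_struct *vas,
				     unsigned long start, unsigned long end)
{
	int idx, err;

	if (!drm_dev_enter(drm, &idx))
		return -ENODEV;

	err = example_do_migrate(drm, vas, start, end); /* hypothetical */

	drm_dev_exit(idx);
	return err;
}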
/Thomas
>
> > But upfront speccing all this out doesn't seem like a good idea to me,
> > because I honestly don't know what we all need.
>
> This is why it is currently just void *owner :)
Again, with the above I think we are good for now, but having
experimented a lot with the callback, I'm still not convinced by the
performance argument, for the following reasons.
1) Existing users would never use the callback. They can still rely on
the owner check, only if that fails we check for callback existence.
2) By simply caching the result from the last checked dev_pagemap, most
callback calls could typically be eliminated.
3) As mentioned before, a callback call would typically always be
followed by either migration to ram or a page-table update. Compared to
these, the callback overhead would IMO be unnoticeable.
4) pcie_p2p is already planning a dev_pagemap callback?
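
For point 2 above, a sketch of what such per-pagemap caching could look like;
the access_allowed() pagemap op and devmem_access_allowed() helper are purely
hypothetical and only illustrate the cost argument:

#include <linux/hmm.h>
#include <linux/memremap.h>

/* Remember the verdict for the last dev_pagemap that was checked, so a
 * (hypothetical) access-check callback runs once per pagemap rather than
 * once per 4k page inside the hmm_range_fault() walk.
 */
struct p2p_check_cache {
	struct dev_pagemap *pgmap;
	bool allowed;
};

static bool devmem_access_allowed(struct hmm_range *range,
				  struct dev_pagemap *pgmap,
				  struct p2p_check_cache *cache)
{
	/* existing users keep working off the plain owner match */
	if (pgmap->owner == range->dev_private_owner)
		return true;

	if (cache->pgmap == pgmap)
		return cache->allowed;

	cache->pgmap = pgmap;
	/* ->access_allowed() does not exist today; it stands in for
	 * whatever per-pagemap check a callback-based design would add.
	 */
	cache->allowed = pgmap->ops->access_allowed &&
			 pgmap->ops->access_allowed(pgmap,
						    range->dev_private_owner);
	return cache->allowed;
}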
/Thomas
>
> Jason
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-02-04 9:32 ` Thomas Hellström
@ 2025-02-04 13:26 ` Jason Gunthorpe
2025-02-04 14:29 ` Thomas Hellström
0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2025-02-04 13:26 UTC (permalink / raw)
To: Thomas Hellström
Cc: Yonatan Maman, kherbst, lyude, dakr, airlied, simona, leon,
jglisse, akpm, GalShalom, dri-devel, nouveau, linux-kernel,
linux-rdma, linux-mm, linux-tegra
On Tue, Feb 04, 2025 at 10:32:32AM +0100, Thomas Hellström wrote:
> > I would not be happy to see this. Please improve pagemap directly if
> > you think you need more things.
>
> These are mainly helpers to migrate and populate a range of cpu memory
> space (struct mm_struct) with GPU device_private memory, migrate to
> system on gpu memory shortage and implement the migrate_to_vram pagemap
> op, tied to gpu device memory allocations, so I don't think there is
> anything we should be exposing at the dev_pagemap level at this point?
Maybe that belongs in mm/hmm then?
> > Neither really match the expected design here. The owner should be
> > entirely based on reachability. Devices that cannot reach each other
> > directly should have different owners.
>
> Actually what I'm putting together is a small helper to allocate and
> assign an "owner" based on devices that are previously registered to a
> "registry". The caller has to indicate using a callback function for
> each struct device pair whether there is a fast interconnect available,
> and this is expected to be done at pagemap creation time, so I think
> this aligns with the above. Initially a "registry" (which is a list of
> device-owner pairs) will be driver-local, but could easily have a wider
> scope.
Yeah, that seems like a workable idea
> This means we handle access control, unplug checks and similar at
> migration time, typically before hmm_range_fault(), and the role of
> hmm_range_fault() will be to over pfns whose backing memory is directly
> accessible to the device, else migrate to system.
Yes, that sounds right
> 1) Existing users would never use the callback. They can still rely on
> the owner check, only if that fails we check for callback existence.
> 2) By simply caching the result from the last checked dev_pagemap, most
> callback calls could typically be eliminated.
But then you are not in the locked region so your cache is racy and
invalid.
> 3) As mentioned before, a callback call would typically always be
> followed by either migration to ram or a page-table update. Compared to
> these, the callback overhead would IMO be unnoticeable.
Why? Surely the normal case should be a callback saying the memory can
be accessed?
> 4) pcie_p2p is already planning a dev_pagemap callback?
Yes, but it is not a racy validation callback, and it already is
creating a complicated lifecycle problem inside the exporting
driver.
Jason
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-02-04 13:26 ` Jason Gunthorpe
@ 2025-02-04 14:29 ` Thomas Hellström
2025-02-04 19:16 ` Jason Gunthorpe
0 siblings, 1 reply; 26+ messages in thread
From: Thomas Hellström @ 2025-02-04 14:29 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Yonatan Maman, kherbst, lyude, dakr, airlied, simona, leon,
jglisse, akpm, GalShalom, dri-devel, nouveau, linux-kernel,
linux-rdma, linux-mm, linux-tegra
On Tue, 2025-02-04 at 09:26 -0400, Jason Gunthorpe wrote:
> On Tue, Feb 04, 2025 at 10:32:32AM +0100, Thomas Hellström wrote:
> >
>
> > 1) Existing users would never use the callback. They can still rely
> > on
> > the owner check, only if that fails we check for callback
> > existence.
> > 2) By simply caching the result from the last checked dev_pagemap,
> > most
> > callback calls could typically be eliminated.
>
> But then you are not in the locked region so your cache is racy and
> invalid.
I'm not sure I follow? If a device private pfn handed back to the
caller is dependent on dev_pagemap A having a fast interconnect to the
client, then subsequent pfns in the same hmm_range_fault() call must be
able to make the same assumption (pagemap A having a fast
interconnect), else the whole result is invalid?
>
> > 3) As mentioned before, a callback call would typically always be
> > followed by either migration to ram or a page-table update.
> > Compared to
> > these, the callback overhead would IMO be unnoticeable.
>
> Why? Surely the normal case should be a callback saying the memory
> can
> be accessed?
Sure, but at least on the xe driver, that means page-table repopulation
since the hmm_range_fault() typically originated from a page-fault.
>
> > 4) pcie_p2p is already planning a dev_pagemap callback?
>
> Yes, but it is not a racy validation callback, and it already is
> creating a complicated lifecycle problem inside the exporting
> driver.
Yeah, I bet there are various reasons against a callback. I just don't
see the performance argument being a main concern.
>
> Jason
/Thomas
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-02-04 14:29 ` Thomas Hellström
@ 2025-02-04 19:16 ` Jason Gunthorpe
2025-02-04 22:01 ` Thomas Hellström
0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2025-02-04 19:16 UTC (permalink / raw)
To: Thomas Hellström
Cc: Yonatan Maman, kherbst, lyude, dakr, airlied, simona, leon,
jglisse, akpm, GalShalom, dri-devel, nouveau, linux-kernel,
linux-rdma, linux-mm, linux-tegra
On Tue, Feb 04, 2025 at 03:29:48PM +0100, Thomas Hellström wrote:
> On Tue, 2025-02-04 at 09:26 -0400, Jason Gunthorpe wrote:
> > On Tue, Feb 04, 2025 at 10:32:32AM +0100, Thomas Hellström wrote:
> > >
> >
> > > 1) Existing users would never use the callback. They can still rely
> > > on
> > > the owner check, only if that fails we check for callback
> > > existence.
> > > 2) By simply caching the result from the last checked dev_pagemap,
> > > most
> > > callback calls could typically be eliminated.
> >
> > But then you are not in the locked region so your cache is racy and
> > invalid.
>
> I'm not sure I follow? If a device private pfn handed back to the
> caller is dependent on dev_pagemap A having a fast interconnect to the
> client, then subsequent pfns in the same hmm_range_fault() call must be
> able to make the same assumption (pagemap A having a fast
> interconnect), else the whole result is invalid?
But what is the receiver going to do with this device private page?
Relock it again and check again if it is actually OK? Yuk.
> > > 3) As mentioned before, a callback call would typically always be
> > > followed by either migration to ram or a page-table update.
> > > Compared to
> > > these, the callback overhead would IMO be unnoticeable.
> >
> > Why? Surely the normal case should be a callback saying the memory
> > can
> > be accessed?
>
> Sure, but at least on the xe driver, that means page-table repopulation
> since the hmm_range_fault() typically originated from a page-fault.
Yes, I expect all hmm_range_fault()'s to be on page fault paths, and
we'd like it to be as fast as we can in the CPU present case..
Jason
* Re: [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
2025-02-04 19:16 ` Jason Gunthorpe
@ 2025-02-04 22:01 ` Thomas Hellström
0 siblings, 0 replies; 26+ messages in thread
From: Thomas Hellström @ 2025-02-04 22:01 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Yonatan Maman, kherbst, lyude, dakr, airlied, simona, leon,
jglisse, akpm, GalShalom, dri-devel, nouveau, linux-kernel,
linux-rdma, linux-mm, linux-tegra
On Tue, 2025-02-04 at 15:16 -0400, Jason Gunthorpe wrote:
> On Tue, Feb 04, 2025 at 03:29:48PM +0100, Thomas Hellström wrote:
> > On Tue, 2025-02-04 at 09:26 -0400, Jason Gunthorpe wrote:
> > > On Tue, Feb 04, 2025 at 10:32:32AM +0100, Thomas Hellström wrote:
> > > >
> > >
> > > > 1) Existing users would never use the callback. They can still
> > > > rely
> > > > on
> > > > the owner check, only if that fails we check for callback
> > > > existence.
> > > > 2) By simply caching the result from the last checked
> > > > dev_pagemap,
> > > > most
> > > > callback calls could typically be eliminated.
> > >
> > > But then you are not in the locked region so your cache is racy
> > > and
> > > invalid.
> >
> > I'm not sure I follow? If a device private pfn handed back to the
> > caller is dependent on dev_pagemap A having a fast interconnect to
> > the
> > client, then subsequent pfns in the same hmm_range_fault() call
> > must be
> > able to make the same assumption (pagemap A having a fast
> > interconnect), else the whole result is invalid?
>
> But what is the receiver going to do with this device private page?
> Relock it again and check again if it is actually OK? Yuk.
I'm still lost as to what would be the possible race-condition that
can't be handled in the usual way using mmu invalidations + notifier
seqno bump? Is it the fast interconnect being taken down?
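
For reference, the "usual way" being pointed at is the documented
hmm_range_fault() mirroring loop (see Documentation/mm/hmm.rst); roughly the
following, where the function name example_fault_and_program and the
driver_lock parameter are stand-ins for the driver's own code and page-table
lock:

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

static int example_fault_and_program(struct mm_struct *mm,
				     struct hmm_range *range,
				     struct mutex *driver_lock)
{
	int ret;

again:
	range->notifier_seq = mmu_interval_read_begin(range->notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;
		return ret;
	}

	mutex_lock(driver_lock);
	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
		/* an invalidation ran in between: seqno bumped, redo */
		mutex_unlock(driver_lock);
		goto again;
	}
	/* safe to program device page tables from range->hmm_pfns here */
	mutex_unlock(driver_lock);
	return 0;
}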
/Thomas
>
> > > > 3) As mentioned before, a callback call would typically always
> > > > be
> > > > followed by either migration to ram or a page-table update.
> > > > Compared to
> > > > these, the callback overhead would IMO be unnoticeable.
> > >
> > > Why? Surely the normal case should be a callback saying the
> > > memory
> > > can
> > > be accessed?
> >
> > Sure, but at least on the xe driver, that means page-table
> > repopulation
> > since the hmm_range_fault() typically originated from a page-fault.
>
> Yes, I expect all hmm_range_fault()'s to be on page fault paths, and
> we'd like it to be as fast as we can in the CPU present case..
>
> Jason
* [RFC 2/5] nouveau/dmem: HMM P2P DMA for private dev pages
2024-12-01 10:36 [RFC 0/5] GPU Direct RDMA (P2P DMA) for Device Private Pages Yonatan Maman
2024-12-01 10:36 ` [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages Yonatan Maman
@ 2024-12-01 10:36 ` Yonatan Maman
2024-12-01 10:36 ` [RFC 3/5] IB/core: P2P DMA for device private pages Yonatan Maman
` (2 subsequent siblings)
4 siblings, 0 replies; 26+ messages in thread
From: Yonatan Maman @ 2024-12-01 10:36 UTC (permalink / raw)
To: kherbst, lyude, dakr, airlied, simona, jgg, leon, jglisse, akpm,
Ymaman, GalShalom, dri-devel, nouveau, linux-kernel, linux-rdma,
linux-mm, linux-tegra
From: Yonatan Maman <Ymaman@Nvidia.com>
Enabling Peer-to-Peer DMA (P2P DMA) access in GPU-centric applications
is crucial for minimizing data transfer overhead (e.g., for RDMA use-
case).
This change aims to enable that capability for Nouveau over HMM device
private pages. P2P DMA for private device pages allows the GPU to
directly exchange data with other devices (e.g., NICs) without needing
to traverse system RAM.
To fully support Peer-to-Peer for device private pages, the following
changes are made:
- Introduce struct nouveau_dmem_hmm_p2p within struct nouveau_dmem
to manage BAR1 PCI P2P memory. p2p_start_addr holds the virtual
address allocated with pci_alloc_p2pmem(), and p2p_size represents
the allocated size of the PCI P2P memory.
- nouveau_dmem_init - Ensure BAR1 accessibility and assign struct
pages (PCI_P2P_PAGE) for all BAR1 pages. Introduce
nouveau_alloc_bar1_pci_p2p_mem in nouveau_dmem to expose BAR1 for
use as P2P memory via pci_p2pdma_add_resource and implement static
allocation and assignment of struct pages using pci_alloc_p2pmem.
This function will be called from nouveau_dmem_init, and failure
triggers a warning message instead of driver failure.
- nouveau_dmem_fini - Ensure BAR1 PCI P2P memory is properly
destroyed during driver cleanup. Introduce
nouveau_destroy_bar1_pci_p2p_mem to handle freeing of PCI P2P
memory associated with Nouveau BAR1. Modify nouveau_dmem_fini to
call nouveau_destroy_bar1_pci_p2p_mem.
- Implement Nouveau `p2p_page` callback function - Implement BAR1
mapping for the chunk using `io_mem_reserve` if no mapping exists.
Retrieve the pre-allocated P2P virtual address and size from
`hmm_p2p`. Calculate the page offset within BAR1 and return the
corresponding P2P page.
Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
Reviewed-by: Gal Shalom <GalShalom@Nvidia.com>
---
drivers/gpu/drm/nouveau/nouveau_dmem.c | 110 +++++++++++++++++++++++++
1 file changed, 110 insertions(+)
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 1a072568cef6..003e74895ff4 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -40,6 +40,9 @@
#include <linux/hmm.h>
#include <linux/memremap.h>
#include <linux/migrate.h>
+#include <linux/pci-p2pdma.h>
+#include <nvkm/core/pci.h>
/*
* FIXME: this is ugly right now we are using TTM to allocate vram and we pin
@@ -77,9 +80,15 @@ struct nouveau_dmem_migrate {
struct nouveau_channel *chan;
};
+struct nouveau_dmem_hmm_p2p {
+ size_t p2p_size;
+ void *p2p_start_addr;
+};
+
struct nouveau_dmem {
struct nouveau_drm *drm;
struct nouveau_dmem_migrate migrate;
+ struct nouveau_dmem_hmm_p2p hmm_p2p;
struct list_head chunks;
struct mutex mutex;
struct page *free_pages;
@@ -158,6 +167,60 @@ static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page *spage,
return 0;
}
+static int nouveau_dmem_bar1_mapping(struct nouveau_bo *nvbo,
+ unsigned long long *bus_addr)
+{
+ int ret;
+ struct ttm_resource *mem = nvbo->bo.resource;
+
+ if (mem->bus.offset) {
+ *bus_addr = mem->bus.offset;
+ return 0;
+ }
+
+ if (PFN_UP(nvbo->bo.base.size) > PFN_UP(nvbo->bo.resource->size))
+ return -EINVAL;
+
+ ret = ttm_bo_reserve(&nvbo->bo, false, false, NULL);
+ if (ret)
+ return ret;
+
+ ret = nvbo->bo.bdev->funcs->io_mem_reserve(nvbo->bo.bdev, mem);
+ *bus_addr = mem->bus.offset;
+
+ ttm_bo_unreserve(&nvbo->bo);
+ return ret;
+}
+
+static int nouveau_dmem_get_dma_pfn(struct page *private_page,
+ unsigned long *dma_pfn)
+{
+ int ret;
+ unsigned long long offset_in_chunk;
+ unsigned long long chunk_bus_addr;
+ unsigned long long bar1_base_addr;
+ struct nouveau_drm *drm = page_to_drm(private_page);
+ struct nouveau_bo *nvbo = nouveau_page_to_chunk(private_page)->bo;
+ struct nvkm_device *nv_device = nvxx_device(drm);
+ size_t p2p_size = drm->dmem->hmm_p2p.p2p_size;
+
+ bar1_base_addr = nv_device->func->resource_addr(nv_device, 1);
+ offset_in_chunk =
+ (page_to_pfn(private_page) << PAGE_SHIFT) -
+ nouveau_page_to_chunk(private_page)->pagemap.range.start;
+
+ ret = nouveau_dmem_bar1_mapping(nvbo, &chunk_bus_addr);
+ if (ret)
+ return ret;
+
+ *dma_pfn = chunk_bus_addr + offset_in_chunk;
+ if (!p2p_size || *dma_pfn > bar1_base_addr + p2p_size ||
+ *dma_pfn < bar1_base_addr)
+ return -ENOMEM;
+
+ return 0;
+}
+
static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
{
struct nouveau_drm *drm = page_to_drm(vmf->page);
@@ -221,6 +284,7 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
static const struct dev_pagemap_ops nouveau_dmem_pagemap_ops = {
.page_free = nouveau_dmem_page_free,
.migrate_to_ram = nouveau_dmem_migrate_to_ram,
+ .get_dma_pfn_for_device = nouveau_dmem_get_dma_pfn,
};
static int
@@ -413,14 +477,31 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
kvfree(dma_addrs);
}
+static void nouveau_destroy_bar1_pci_p2p_mem(struct nouveau_drm *drm,
+ struct pci_dev *pdev,
+ void *p2p_start_addr,
+ size_t p2p_size)
+{
+ if (p2p_size)
+ pci_free_p2pmem(pdev, p2p_start_addr, p2p_size);
+
+ NV_INFO(drm, "PCI P2P memory freed(%p)\n", p2p_start_addr);
+}
+
void
nouveau_dmem_fini(struct nouveau_drm *drm)
{
struct nouveau_dmem_chunk *chunk, *tmp;
+ struct nvkm_device *nv_device = nvxx_device(drm);
if (drm->dmem == NULL)
return;
+ nouveau_destroy_bar1_pci_p2p_mem(drm,
+ nv_device->func->pci(nv_device)->pdev,
+ drm->dmem->hmm_p2p.p2p_start_addr,
+ drm->dmem->hmm_p2p.p2p_size);
+
mutex_lock(&drm->dmem->mutex);
list_for_each_entry_safe(chunk, tmp, &drm->dmem->chunks, list) {
@@ -586,10 +667,28 @@ nouveau_dmem_migrate_init(struct nouveau_drm *drm)
return -ENODEV;
}
+static int nouveau_alloc_bar1_pci_p2p_mem(struct nouveau_drm *drm,
+ struct pci_dev *pdev, size_t size,
+ void **pp2p_start_addr)
+{
+ int ret;
+
+ ret = pci_p2pdma_add_resource(pdev, 1, size, 0);
+ if (ret)
+ return ret;
+
+ *pp2p_start_addr = pci_alloc_p2pmem(pdev, size);
+
+ NV_INFO(drm, "PCI P2P memory allocated(%p)\n", *pp2p_start_addr);
+ return 0;
+}
+
void
nouveau_dmem_init(struct nouveau_drm *drm)
{
int ret;
+ struct nvkm_device *nv_device = nvxx_device(drm);
+ size_t bar1_size;
/* This only make sense on PASCAL or newer */
if (drm->client.device.info.family < NV_DEVICE_INFO_V0_PASCAL)
@@ -610,6 +709,17 @@ nouveau_dmem_init(struct nouveau_drm *drm)
kfree(drm->dmem);
drm->dmem = NULL;
}
+
+ /* Expose BAR1 for HMM P2P Memory */
+ bar1_size = nv_device->func->resource_size(nv_device, 1);
+ ret = nouveau_alloc_bar1_pci_p2p_mem(drm,
+ nv_device->func->pci(nv_device)->pdev,
+ bar1_size,
+ &drm->dmem->hmm_p2p.p2p_start_addr);
+ drm->dmem->hmm_p2p.p2p_size = (ret) ? 0 : bar1_size;
+ if (ret)
+ NV_WARN(drm,
+ "PCI P2P memory allocation failed, HMM P2P won't be supported\n");
}
static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
--
2.34.1
* [RFC 3/5] IB/core: P2P DMA for device private pages
2024-12-01 10:36 [RFC 0/5] GPU Direct RDMA (P2P DMA) for Device Private Pages Yonatan Maman
2024-12-01 10:36 ` [RFC 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages Yonatan Maman
2024-12-01 10:36 ` [RFC 2/5] nouveau/dmem: HMM P2P DMA for private dev pages Yonatan Maman
@ 2024-12-01 10:36 ` Yonatan Maman
2024-12-01 10:36 ` [RFC 4/5] RDMA/mlx5: Add fallback for P2P DMA errors Yonatan Maman
2024-12-01 10:36 ` [RFC 5/5] RDMA/mlx5: Enabling ATS for ODP memory Yonatan Maman
4 siblings, 0 replies; 26+ messages in thread
From: Yonatan Maman @ 2024-12-01 10:36 UTC (permalink / raw)
To: kherbst, lyude, dakr, airlied, simona, jgg, leon, jglisse, akpm,
Ymaman, GalShalom, dri-devel, nouveau, linux-kernel, linux-rdma,
linux-mm, linux-tegra
From: Yonatan Maman <Ymaman@Nvidia.com>
Add a Peer-to-Peer (P2P) DMA request to the hmm_range_fault() call,
utilizing the capabilities introduced in mm/hmm. By setting
range.default_flags to HMM_PFN_REQ_FAULT | HMM_PFN_REQ_TRY_P2P, HMM
attempts to initiate P2P DMA connections for device private pages
(instead of page fault handling).
This enhancement utilizes P2P DMA to reduce performance overhead
during data migration between devices (e.g., GPU) and system memory,
providing performance benefits for GPU-centric applications that
utilize RDMA and device private pages.
Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
Signed-off-by: Gal Shalom <GalShalom@Nvidia.com>
---
drivers/infiniband/core/umem_odp.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 51d518989914..4c2465b9bdda 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -332,6 +332,10 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
range.default_flags |= HMM_PFN_REQ_WRITE;
}
+ if (access_mask & HMM_PFN_ALLOW_P2P)
+ range.default_flags |= HMM_PFN_ALLOW_P2P;
+
+ range.pfn_flags_mask = HMM_PFN_ALLOW_P2P;
range.hmm_pfns = &(umem_odp->map.pfn_list[pfn_start_idx]);
timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
--
2.34.1
* [RFC 4/5] RDMA/mlx5: Add fallback for P2P DMA errors
2024-12-01 10:36 [RFC 0/5] GPU Direct RDMA (P2P DMA) for Device Private Pages Yonatan Maman
` (2 preceding siblings ...)
2024-12-01 10:36 ` [RFC 3/5] IB/core: P2P DMA for device private pages Yonatan Maman
@ 2024-12-01 10:36 ` Yonatan Maman
2024-12-01 10:36 ` [RFC 5/5] RDMA/mlx5: Enabling ATS for ODP memory Yonatan Maman
4 siblings, 0 replies; 26+ messages in thread
From: Yonatan Maman @ 2024-12-01 10:36 UTC (permalink / raw)
To: kherbst, lyude, dakr, airlied, simona, jgg, leon, jglisse, akpm,
Ymaman, GalShalom, dri-devel, nouveau, linux-kernel, linux-rdma,
linux-mm, linux-tegra
From: Yonatan Maman <Ymaman@Nvidia.com>
Handle P2P DMA mapping errors when the transaction requires traversing
an inaccessible host bridge that is not in the allowlist:
- In `populate_mtt`, if a P2P mapping fails, the `HMM_PFN_ALLOW_P2P` flag
is cleared only for the PFNs that returned a mapping error.
- In `pagefault_real_mr`, if a P2P mapping error occurs, the mapping is
retried with the `HMM_PFN_ALLOW_P2P` flag only for the PFNs that didn't
fail, ensuring a fallback to standard DMA(host memory) for the rest,
if possible.
Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
Signed-off-by: Gal Shalom <GalShalom@Nvidia.com>
---
drivers/infiniband/hw/mlx5/odp.c | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index fbb2a5670c32..f7a1291ec7d1 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -169,6 +169,7 @@ static int populate_mtt(__be64 *pas, size_t start, size_t nentries,
struct pci_p2pdma_map_state p2pdma_state = {};
struct ib_device *dev = odp->umem.ibdev;
size_t i;
+ int ret = 0;
if (flags & MLX5_IB_UPD_XLT_ZAP)
return 0;
@@ -184,8 +185,11 @@ static int populate_mtt(__be64 *pas, size_t start, size_t nentries,
dma_addr = hmm_dma_map_pfn(dev->dma_device, &odp->map,
start + i, &p2pdma_state);
- if (ib_dma_mapping_error(dev, dma_addr))
- return -EFAULT;
+ if (ib_dma_mapping_error(dev, dma_addr)) {
+ odp->map.pfn_list[start + i] &= ~(HMM_PFN_ALLOW_P2P);
+ ret = -EFAULT;
+ continue;
+ }
dma_addr |= MLX5_IB_MTT_READ;
if ((pfn & HMM_PFN_WRITE) && !downgrade)
@@ -194,7 +198,7 @@ static int populate_mtt(__be64 *pas, size_t start, size_t nentries,
pas[i] = cpu_to_be64(dma_addr);
odp->npages++;
}
- return 0;
+ return ret;
}
int mlx5_odp_populate_xlt(void *xlt, size_t idx, size_t nentries,
@@ -696,6 +700,10 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
if (odp->umem.writable && !downgrade)
access_mask |= HMM_PFN_WRITE;
+ /*
+ * try fault with HMM_PFN_ALLOW_P2P flag
+ */
+ access_mask |= HMM_PFN_ALLOW_P2P;
np = ib_umem_odp_map_dma_and_lock(odp, user_va, bcnt, access_mask, fault);
if (np < 0)
return np;
@@ -705,6 +713,16 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
* ib_umem_odp_map_dma_and_lock already checks this.
*/
ret = mlx5r_umr_update_xlt(mr, start_idx, np, page_shift, xlt_flags);
+ if (ret == -EFAULT) {
+ /*
+ * Indicate P2P Mapping Error, retry with no HMM_PFN_ALLOW_P2P
+ */
+ access_mask &= ~HMM_PFN_ALLOW_P2P;
+ np = ib_umem_odp_map_dma_and_lock(odp, user_va, bcnt, access_mask, fault);
+ if (np < 0)
+ return np;
+ ret = mlx5r_umr_update_xlt(mr, start_idx, np, page_shift, xlt_flags);
+ }
mutex_unlock(&odp->umem_mutex);
if (ret < 0) {
--
2.34.1
* [RFC 5/5] RDMA/mlx5: Enabling ATS for ODP memory
2024-12-01 10:36 [RFC 0/5] GPU Direct RDMA (P2P DMA) for Device Private Pages Yonatan Maman
` (3 preceding siblings ...)
2024-12-01 10:36 ` [RFC 4/5] RDMA/mlx5: Add fallback for P2P DMA errors Yonatan Maman
@ 2024-12-01 10:36 ` Yonatan Maman
4 siblings, 0 replies; 26+ messages in thread
From: Yonatan Maman @ 2024-12-01 10:36 UTC (permalink / raw)
To: kherbst, lyude, dakr, airlied, simona, jgg, leon, jglisse, akpm,
Ymaman, GalShalom, dri-devel, nouveau, linux-kernel, linux-rdma,
linux-mm, linux-tegra
From: Yonatan Maman <Ymaman@Nvidia.com>
ATS (Address Translation Services) is mainly utilized to optimize PCI
Peer-to-Peer transfers and prevent bus failures. This change enables
ATS for ODP memory, to optimize P2P DMA for ODP memory (e.g., P2P DMA
for device private pages in ODP memory).
Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
Signed-off-by: Gal Shalom <GalShalom@Nvidia.com>
---
drivers/infiniband/hw/mlx5/mlx5_ib.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 1bae5595c729..702d155f5048 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -1705,9 +1705,9 @@ static inline bool rt_supported(int ts_cap)
static inline bool mlx5_umem_needs_ats(struct mlx5_ib_dev *dev,
struct ib_umem *umem, int access_flags)
{
- if (!MLX5_CAP_GEN(dev->mdev, ats) || !umem->is_dmabuf)
- return false;
- return access_flags & IB_ACCESS_RELAXED_ORDERING;
+ if (MLX5_CAP_GEN(dev->mdev, ats) && (umem->is_dmabuf || umem->is_odp))
+ return access_flags & IB_ACCESS_RELAXED_ORDERING;
+ return false;
}
int set_roce_addr(struct mlx5_ib_dev *dev, u32 port_num,
--
2.34.1