* [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
2024-10-15 15:23 [PATCH v1 0/4] GPU Direct RDMA (P2P DMA) for Device Private Pages Yonatan Maman
@ 2024-10-15 15:23 ` Yonatan Maman
2024-10-16 4:49 ` Christoph Hellwig
2024-10-16 5:10 ` Alistair Popple
2024-10-15 15:23 ` [PATCH v1 2/4] nouveau/dmem: HMM P2P DMA for private dev pages Yonatan Maman
` (3 subsequent siblings)
4 siblings, 2 replies; 28+ messages in thread
From: Yonatan Maman @ 2024-10-15 15:23 UTC (permalink / raw)
To: nouveau, linux-kernel, linux-rdma, linux-mm, herbst, lyude, dakr,
airlied, simona, jgg, leon, jglisse, akpm, dri-devel, apopple,
bskeggs
Cc: Yonatan Maman, Gal Shalom
From: Yonatan Maman <Ymaman@Nvidia.com>
hmm_range_fault() natively triggers a page fault on device private
pages, migrating them to RAM. In some cases, such as with RDMA devices,
the migration overhead between the device (e.g., GPU) and the CPU, and
vice-versa, significantly damages performance. Thus, enabling Peer-to-
Peer (P2P) DMA access for device private page might be crucial for
minimizing data transfer overhead.
This change introduces an API to support P2P connections for device
private pages by implementing the following:
- Leveraging the struct pagemap_ops for P2P Page Callbacks. This
callback involves mapping the page to MMIO and returning the
corresponding PCI_P2P page.
- Utilizing hmm_range_fault for Initializing P2P Connections. The API
also adds the HMM_PFN_REQ_TRY_P2P flag option for the
hmm_range_fault caller to initialize P2P. If set, hmm_range_fault
attempts initializing the P2P connection first, if the owner device
supports P2P, using p2p_page. In case of failure or lack of support,
hmm_range_fault will continue with the regular flow of migrating the
page to RAM.
This change does not affect previous use-cases of hmm_range_fault,
because both the caller and the page owner must explicitly request and
support it to initialize P2P connection.
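An illustrative caller-side sketch (hypothetical consumer, not part of
this series; note the flag is spelled HMM_PFN_REQ_ALLOW_P2P in the diff
below), showing how an hmm_range_fault() user opts in:

static int example_mirror_range(struct mm_struct *mm,
				struct mmu_interval_notifier *notifier,
				unsigned long start, unsigned long end,
				unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier	   = notifier,
		.start		   = start,
		.end		   = end,
		.hmm_pfns	   = pfns,
		/* opt in to receiving PCI P2PDMA pages */
		.default_flags	   = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_ALLOW_P2P,
		/* the private pages belong to another device */
		.dev_private_owner = NULL,
	};
	int ret;

	/* usual mirror sequence; EBUSY/retry handling omitted for brevity */
	range.notifier_seq = mmu_interval_read_begin(notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);

	/* on success, pfns[] may now contain PCI P2PDMA pages instead of
	 * forcing a migration of the device private pages to RAM */
	return ret;
}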
Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
Reviewed-by: Gal Shalom <GalShalom@Nvidia.com>
---
include/linux/hmm.h | 2 ++
include/linux/memremap.h | 7 +++++++
mm/hmm.c | 28 ++++++++++++++++++++++++++++
3 files changed, 37 insertions(+)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 126a36571667..7154f5ed73a1 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -41,6 +41,8 @@ enum hmm_pfn_flags {
/* Input flags */
HMM_PFN_REQ_FAULT = HMM_PFN_VALID,
HMM_PFN_REQ_WRITE = HMM_PFN_WRITE,
+ /* allow returning PCI P2PDMA pages */
+ HMM_PFN_REQ_ALLOW_P2P = 1,
HMM_PFN_FLAGS = 0xFFUL << HMM_PFN_ORDER_SHIFT,
};
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 3f7143ade32c..0ecfd3d191fa 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -89,6 +89,13 @@ struct dev_pagemap_ops {
*/
vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+ /*
+ * Used for private (un-addressable) device memory only. Return a
+ * corresponding struct page, that can be mapped to device
+ * (e.g using dma_map_page)
+ */
+ struct page *(*get_dma_page_for_device)(struct page *private_page);
+
/*
* Handle the memory failure happens on a range of pfns. Notify the
* processes who are using these pfns, and try to recover the data on
diff --git a/mm/hmm.c b/mm/hmm.c
index 7e0229ae4a5a..987dd143d697 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -230,6 +230,8 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
unsigned long cpu_flags;
pte_t pte = ptep_get(ptep);
uint64_t pfn_req_flags = *hmm_pfn;
+ struct page *(*get_dma_page_handler)(struct page *private_page);
+ struct page *dma_page;
if (pte_none_mostly(pte)) {
required_fault =
@@ -257,6 +259,32 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
return 0;
}
+ /*
+ * P2P for supported pages, and according to caller request
+ * translate the private page to the match P2P page if it fails
+ * continue with the regular flow
+ */
+ if (is_device_private_entry(entry)) {
+ get_dma_page_handler =
+ pfn_swap_entry_to_page(entry)
+ ->pgmap->ops->get_dma_page_for_device;
+ if ((hmm_vma_walk->range->default_flags &
+ HMM_PFN_REQ_ALLOW_P2P) &&
+ get_dma_page_handler) {
+ dma_page = get_dma_page_handler(
+ pfn_swap_entry_to_page(entry));
+ if (!IS_ERR(dma_page)) {
+ cpu_flags = HMM_PFN_VALID;
+ if (is_writable_device_private_entry(
+ entry))
+ cpu_flags |= HMM_PFN_WRITE;
+ *hmm_pfn = page_to_pfn(dma_page) |
+ cpu_flags;
+ return 0;
+ }
+ }
+ }
+
required_fault =
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
if (!required_fault) {
--
2.34.1
* Re: [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
2024-10-15 15:23 ` [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages Yonatan Maman
@ 2024-10-16 4:49 ` Christoph Hellwig
2024-10-16 15:04 ` Yonatan Maman
2024-10-16 15:44 ` Jason Gunthorpe
2024-10-16 5:10 ` Alistair Popple
1 sibling, 2 replies; 28+ messages in thread
From: Christoph Hellwig @ 2024-10-16 4:49 UTC (permalink / raw)
To: Yonatan Maman
Cc: nouveau, linux-kernel, linux-rdma, linux-mm, herbst, lyude, dakr,
airlied, simona, jgg, leon, jglisse, akpm, dri-devel, apopple,
bskeggs, Gal Shalom
The subject does not make sense. All P2P is on ZONE_DEVICE pages.
It seems like this is about device private memory?
On Tue, Oct 15, 2024 at 06:23:45PM +0300, Yonatan Maman wrote:
> From: Yonatan Maman <Ymaman@Nvidia.com>
>
> hmm_range_fault() natively triggers a page fault on device private
> pages, migrating them to RAM.
That "natively" above doesn't make sense to me.
> In some cases, such as with RDMA devices,
> the migration overhead between the device (e.g., GPU) and the CPU, and
> vice-versa, significantly damages performance. Thus, enabling Peer-to-
s/damages/degrades/
> Peer (P2P) DMA access for device private page might be crucial for
> minimizing data transfer overhead.
>
> This change introduces an API to support P2P connections for device
> private pages by implementing the following:
"This change.. " or "This patch.." is pointless, just explain what you
are doing.
>
> - Leveraging the struct pagemap_ops for P2P Page Callbacks. This
> callback involves mapping the page to MMIO and returning the
> corresponding PCI_P2P page.
While P2P uses the same underlying PCIe TLPs as MMIO, it is not
MMIO by definition, as memory mapped I/O is by definition about
the CPU memory mapping so that load and store instructions cause
the I/O. It also uses very different concepts in Linux.
> - Utilizing hmm_range_fault for Initializing P2P Connections. The API
There is no concept of a "connection" in PCIe data transfers.
> also adds the HMM_PFN_REQ_TRY_P2P flag option for the
> hmm_range_fault caller to initialize P2P. If set, hmm_range_fault
> attempts initializing the P2P connection first, if the owner device
> supports P2P, using p2p_page. In case of failure or lack of support,
> hmm_range_fault will continue with the regular flow of migrating the
> page to RAM.
What is the need for the flag? As far as I can tell from reading
the series, the P2P mapping is entirely transparent to the callers
of hmm_range_fault.
> + /*
> + * Used for private (un-addressable) device memory only. Return a
> + * corresponding struct page, that can be mapped to device
> + * (e.g using dma_map_page)
> + */
> + struct page *(*get_dma_page_for_device)(struct page *private_page);
We are talking about P2P memory here. How do you manage to get a page
that dma_map_page can be used on? All P2P memory needs to use the P2P
aware dma_map_sg as the pages for P2P memory are just fake zone device
pages.
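For reference, a rough sketch of the P2P-aware path (illustrative names,
not from this series): the scatterlist mapping code knows how to handle
is_pci_p2pdma_page() entries, while dma_map_page() knows nothing about
them.

/* assumes <linux/scatterlist.h> and <linux/dma-mapping.h> */
static int example_dma_map_p2p_pages(struct device *dma_dev,
				     struct page **pages, unsigned int npages,
				     struct sg_table *sgt)
{
	int ret;

	ret = sg_alloc_table_from_pages(sgt, pages, npages, 0,
					(unsigned long)npages << PAGE_SHIFT,
					GFP_KERNEL);
	if (ret)
		return ret;

	/* fails (e.g. -EREMOTEIO) if the device cannot reach the peer */
	return dma_map_sgtable(dma_dev, sgt, DMA_BIDIRECTIONAL, 0);
}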
> + * P2P for supported pages, and according to caller request
> + * translate the private page to the match P2P page if it fails
> + * continue with the regular flow
> + */
> + if (is_device_private_entry(entry)) {
> + get_dma_page_handler =
> + pfn_swap_entry_to_page(entry)
> + ->pgmap->ops->get_dma_page_for_device;
> + if ((hmm_vma_walk->range->default_flags &
> + HMM_PFN_REQ_ALLOW_P2P) &&
> + get_dma_page_handler) {
> + dma_page = get_dma_page_handler(
> + pfn_swap_entry_to_page(entry));
This is really messy. You probably really want to share a branch
with the private page handling for the owner so that you only need
a single is_device_private_entry and can use a local variable to
shortcut finding the page. Probably best done with a little helper:
Then this becomes:
static bool hmm_handle_device_private(struct hmm_range *range,
swp_entry_t entry, unsigned long *hmm_pfn)
{
struct page *page = pfn_swap_entry_to_page(entry);
struct dev_pagemap *pgmap = page->pgmap;
if (pgmap->owner == range->dev_private_owner) {
*hmm_pfn = swp_offset_pfn(entry);
goto found;
}
if (pgmap->ops->get_dma_page_for_device) {
*hmm_pfn =
page_to_pfn(pgmap->ops->get_dma_page_for_device(page));
goto found;
}
return false;
found:
*hmm_pfn |= HMM_PFN_VALID;
if (is_writable_device_private_entry(entry))
*hmm_pfn |= HMM_PFN_WRITE;
return true;
}
which also makes it clear that returning a page from the method is
not that great, a PFN might work a lot better, e.g.
unsigned long (*device_private_dma_pfn)(struct page *page);
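As a sketch only, the call site in hmm_vma_handle_pte() would then
collapse to something like:

	if (!pte_present(pte)) {
		swp_entry_t entry = pte_to_swp_entry(pte);

		/*
		 * One is_device_private_entry() check covers both the
		 * dev_private_owner fast path and the P2P translation.
		 */
		if (is_device_private_entry(entry) &&
		    hmm_handle_device_private(range, entry, hmm_pfn))
			return 0;

		/* otherwise fall through to hmm_pte_need_fault() as before */
	}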
* Re: [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
2024-10-16 4:49 ` Christoph Hellwig
@ 2024-10-16 15:04 ` Yonatan Maman
2024-10-16 15:44 ` Jason Gunthorpe
1 sibling, 0 replies; 28+ messages in thread
From: Yonatan Maman @ 2024-10-16 15:04 UTC (permalink / raw)
To: Christoph Hellwig
Cc: nouveau, linux-kernel, linux-rdma, linux-mm, herbst, lyude, dakr,
airlied, simona, jgg, leon, jglisse, akpm, dri-devel, apopple,
bskeggs, Gal Shalom
On 16/10/2024 7:49, Christoph Hellwig wrote:
> The subject does not make sense. All P2P is on ZONE_DEVICE pages.
> It seems like this is about device private memory?
>
Correct, thanks, I will change it to `mm/hmm: HMM API to enable P2P DMA
for device private pages`, on v2.
> On Tue, Oct 15, 2024 at 06:23:45PM +0300, Yonatan Maman wrote:
>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>
>> hmm_range_fault() natively triggers a page fault on device private
>> pages, migrating them to RAM.
>
> That "natively" above doesn't make sense to me.
What I meant to convey is that hmm_range_fault() by default triggers a
page fault on device private pages, which results in them being migrated
to RAM. In this patch, we are trying to provide a different (more
optimized) P2P handling flow instead of migration.
I will clarify this on v2.
>
>> In some cases, such as with RDMA devices,
>> the migration overhead between the device (e.g., GPU) and the CPU, and
>> vice-versa, significantly damages performance. Thus, enabling Peer-to-
>
> s/damages/degrades/
Fixed (v2), thanks
>
>> Peer (P2P) DMA access for device private page might be crucial for
>> minimizing data transfer overhead.
>>
>> This change introduces an API to support P2P connections for device
>> private pages by implementing the following:
>
> "This change.. " or "This patch.." is pointless, just explain what you
> are doing.
>
Fixed (v2), thanks
>>
>> - Leveraging the struct pagemap_ops for P2P Page Callbacks. This
>> callback involves mapping the page to MMIO and returning the
>> corresponding PCI_P2P page.
>
> While P2P uses the same underlying PCIe TLPs as MMIO, it is not
> MMIO by definition, as memory mapped I/O is by definition about
> the CPU memory mapping so that load and store instructions cause
> the I/O. It also uses very different concepts in Linux.
>
Got it. I just wanted to emphasize that this function could be used to
dynamically map the page for MMIO. However, I understand the potential
confusion between MMIO and P2P, so I’ll remove that part.
>> - Utilizing hmm_range_fault for Initializing P2P Connections. The API
>
> There is no concept of a "connection" in PCIe data transfers.
>
what about "initializing P2P Transfers" or "initializing P2P DMA"?
>> also adds the HMM_PFN_REQ_TRY_P2P flag option for the
>> hmm_range_fault caller to initialize P2P. If set, hmm_range_fault
>> attempts initializing the P2P connection first, if the owner device
>> supports P2P, using p2p_page. In case of failure or lack of support,
>> hmm_range_fault will continue with the regular flow of migrating the
>> page to RAM.
>
> What is the need for the flag? As far as I can tell from reading
> the series, the P2P mapping is entirely transparent to the callers
> of hmm_range_fault.
>
This flag mirrors the GUP flag (FOLL_PCI_P2PDMA). The purpose of this
flag is to ensure compatibility with existing flows that utilize
hmm_range_fault but may not inherently support PCI P2P.
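For comparison, the GUP-side opt-in (illustrative fragment only):

static int example_pin_p2p(unsigned long addr, int npages,
			   struct page **pages)
{
	/* without FOLL_PCI_P2PDMA, gup will not return P2PDMA pages */
	return pin_user_pages_fast(addr, npages,
				   FOLL_WRITE | FOLL_PCI_P2PDMA, pages);
}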
>> + /*
>> + * Used for private (un-addressable) device memory only. Return a
>> + * corresponding struct page, that can be mapped to device
>> + * (e.g using dma_map_page)
>> + */
>> + struct page *(*get_dma_page_for_device)(struct page *private_page);
>
> We are talking about P2P memory here. How do you manage to get a page
> that dma_map_page can be used on? All P2P memory needs to use the P2P
> aware dma_map_sg as the pages for P2P memory are just fake zone device
> pages.
>
According to our experiments dma_map_page works with PCI_P2P
pages, but I will double check that. If not, we can either return this
information using HMM flags or add support to dma_map_page.
>
>> + * P2P for supported pages, and according to caller request
>> + * translate the private page to the match P2P page if it fails
>> + * continue with the regular flow
>> + */
>> + if (is_device_private_entry(entry)) {
>> + get_dma_page_handler =
>> + pfn_swap_entry_to_page(entry)
>> + ->pgmap->ops->get_dma_page_for_device;
>> + if ((hmm_vma_walk->range->default_flags &
>> + HMM_PFN_REQ_ALLOW_P2P) &&
>> + get_dma_page_handler) {
>> + dma_page = get_dma_page_handler(
>> + pfn_swap_entry_to_page(entry));
>
> This is really messy. You probably really want to share a branch
> with the private page handling for the owner so that you only need
> a single is_device_private_entry and can use a local variable to
> shortcut finding the page. Probably best done with a little helper:
>
> Then this becomes:
>
> static bool hmm_handle_device_private(struct hmm_range *range,
> swp_entry_t entry, unsigned long *hmm_pfn)
> {
> struct page *page = pfn_swap_entry_to_page(entry);
> struct dev_pagemap *pgmap = page->pgmap;
>
> if (pgmap->owner == range->dev_private_owner) {
> *hmm_pfn = swp_offset_pfn(entry);
> goto found;
> }
>
> if (pgmap->ops->get_dma_page_for_device) {
> *hmm_pfn =
> page_to_pfn(pgmap->ops->get_dma_page_for_device(page));
> goto found;
> }
>
> return false;
>
> found:
> *hmm_pfn |= HMM_PFN_VALID;
> if (is_writable_device_private_entry(entry))
> *hmm_pfn |= HMM_PFN_WRITE;
> return true;
> }
>
> which also makes it clear that returning a page from the method is
> not that great, a PFN might work a lot better, e.g.
>
> unsigned long (*device_private_dma_pfn)(struct page *page);
I agree, I will fix that (v2).
* Re: [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
2024-10-16 4:49 ` Christoph Hellwig
2024-10-16 15:04 ` Yonatan Maman
@ 2024-10-16 15:44 ` Jason Gunthorpe
2024-10-16 16:41 ` Christoph Hellwig
1 sibling, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2024-10-16 15:44 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Yonatan Maman, nouveau, linux-kernel, linux-rdma, linux-mm,
herbst, lyude, dakr, airlied, simona, leon, jglisse, akpm,
dri-devel, apopple, bskeggs, Gal Shalom
On Tue, Oct 15, 2024 at 09:49:30PM -0700, Christoph Hellwig wrote:
> > + /*
> > + * Used for private (un-addressable) device memory only. Return a
> > + * corresponding struct page, that can be mapped to device
> > + * (e.g using dma_map_page)
> > + */
> > + struct page *(*get_dma_page_for_device)(struct page *private_page);
>
> We are talking about P2P memory here. How do you manage to get a page
> that dma_map_page can be used on? All P2P memory needs to use the P2P
> aware dma_map_sg as the pages for P2P memory are just fake zone device
> pages.
FWIW, I've been expecting this series to be rebased on top of Leon's
new DMA API series so it doesn't have this issue.. I think this series
is at RFC quality, but shows more of the next steps.
I'm guessing they got their testing done so far on a system without an
iommu translation?
> which also makes it clear that returning a page from the method is
> not that great, a PFN might work a lot better, e.g.
>
> unsigned long (*device_private_dma_pfn)(struct page *page);
Ideally I think we should not have the struct page * at all through
these APIs if we can avoid it..
Jason
* Re: [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
2024-10-16 15:44 ` Jason Gunthorpe
@ 2024-10-16 16:41 ` Christoph Hellwig
2024-10-16 17:44 ` Jason Gunthorpe
0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2024-10-16 16:41 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Christoph Hellwig, Yonatan Maman, nouveau, linux-kernel,
linux-rdma, linux-mm, herbst, lyude, dakr, airlied, simona, leon,
jglisse, akpm, dri-devel, apopple, bskeggs, Gal Shalom
On Wed, Oct 16, 2024 at 12:44:28PM -0300, Jason Gunthorpe wrote:
> > We are talking about P2P memory here. How do you manage to get a page
> > that dma_map_page can be used on? All P2P memory needs to use the P2P
> > aware dma_map_sg as the pages for P2P memory are just fake zone device
> > pages.
>
> FWIW, I've been expecting this series to be rebased on top of Leon's
> new DMA API series so it doesn't have this issue..
That's not going to make a difference at this level.
> I'm guessing they got their testing done so far on a system without an
> iommu translation?
IOMMU or not doesn't matter much for P2P. The important difference is
through the host bridge or through a switch. dma_map_page will work
for P2P through the host bridge (assuming the host bridge even supports
it as it also lacks the error handling for when not), but it lacks the
handling for P2P through a switch.
>
> > which also makes it clear that returning a page from the method is
> > not that great, a PFN might work a lot better, e.g.
> >
> > unsigned long (*device_private_dma_pfn)(struct page *page);
>
> Ideally I think we should not have the struct page * at all through
> these APIs if we can avoid it..
The input page is the device private page that we have at hand
anyway. Until that scheme is completely redone it is the right kind
of parameter.
* Re: [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
2024-10-16 16:41 ` Christoph Hellwig
@ 2024-10-16 17:44 ` Jason Gunthorpe
2024-10-17 11:58 ` Christoph Hellwig
0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2024-10-16 17:44 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Yonatan Maman, nouveau, linux-kernel, linux-rdma, linux-mm,
herbst, lyude, dakr, airlied, simona, leon, jglisse, akpm,
dri-devel, apopple, bskeggs, Gal Shalom
On Wed, Oct 16, 2024 at 09:41:03AM -0700, Christoph Hellwig wrote:
> On Wed, Oct 16, 2024 at 12:44:28PM -0300, Jason Gunthorpe wrote:
> > > We are talking about P2P memory here. How do you manage to get a page
> > > that dma_map_page can be used on? All P2P memory needs to use the P2P
> > > aware dma_map_sg as the pages for P2P memory are just fake zone device
> > > pages.
> >
> > FWIW, I've been expecting this series to be rebased on top of Leon's
> > new DMA API series so it doesn't have this issue..
>
> That's not going to make a difference at this level.
I'm not sure what you are asking then.
Patch 2 does pci_p2pdma_add_resource() and so a valid struct page with
a P2P ZONE_DEVICE type exists, and that gets returned back to the
hmm/odp code.
Today odp calls dma_map_page() which only works by chance in limited
cases. With Leon's revision it will call hmm_dma_map_pfn() ->
dma_iova_link() which does call pci_p2pdma_map_type() and should do
the right thing.
> > I'm guessing they got their testing done so far on a system without an
> > iommu translation?
>
> IOMMU or not doesn't matter much for P2P. The important difference is
> through the host bridge or through a switch. dma_map_page will work
> for P2P through the host bridge (assuming the host bridge even supports
> it as it also lacks the error handling for when not), but it lacks the
> handling for P2P through a switch.
On most x86 systems the BAR/bus address of the P2P memory is the same
as the CPU address, so without an IOMMU translation dma_map_page()
will return the CPU/host physical address which is the same as the
BAR/bus address and that will take the P2P switch path for testing.
Jason
* Re: [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
2024-10-16 17:44 ` Jason Gunthorpe
@ 2024-10-17 11:58 ` Christoph Hellwig
2024-10-17 13:05 ` Jason Gunthorpe
0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2024-10-17 11:58 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Christoph Hellwig, Yonatan Maman, nouveau, linux-kernel,
linux-rdma, linux-mm, herbst, lyude, dakr, airlied, simona, leon,
jglisse, akpm, dri-devel, apopple, bskeggs, Gal Shalom
On Wed, Oct 16, 2024 at 02:44:45PM -0300, Jason Gunthorpe wrote:
> > > FWIW, I've been expecting this series to be rebased on top of Leon's
> > > new DMA API series so it doesn't have this issue..
> >
> > That's not going to make a difference at this level.
>
> I'm not sure what you are asking then.
>
> Patch 2 does pci_p2pdma_add_resource() and so a valid struct page with
> a P2P ZONE_DEVICE type exists, and that gets returned back to the
> hmm/odp code.
>
> Today odp calls dma_map_page() which only works by chance in limited
> cases. With Leon's revision it will call hmm_dma_map_pfn() ->
> dma_iova_link() which does call pci_p2pdma_map_type() and should do
> the right thing.
Again none of this affects the code posted here. It reshuffles the
callers but has no direct effect on the patches posted here.
(and the current DMA series lacks P2P support, I'm trying to figure
out how to properly handle it at the moment).
> > IOMMU or not doesn't matter much for P2P. The important difference is
> > through the host bridge or through a switch. dma_map_page will work
> > for P2P through the host bridge (assuming the host bridge even supports
> > it as it also lacks the error handling for when not), but it lacks the
> > handling for P2P through a switch.
>
> On most x86 systems the BAR/bus address of the P2P memory is the same
> as the CPU address, so without an IOMMU translation dma_map_page()
> will return the CPU/host physical address which is the same as the
> BAR/bus address and that will take the P2P switch path for testing.
Maybe. Either way the use of dma_map_page is incorrect.
>
> Jason
---end quoted text---
* Re: [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
2024-10-17 11:58 ` Christoph Hellwig
@ 2024-10-17 13:05 ` Jason Gunthorpe
2024-10-17 13:12 ` Christoph Hellwig
0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2024-10-17 13:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Yonatan Maman, nouveau, linux-kernel, linux-rdma, linux-mm,
herbst, lyude, dakr, airlied, simona, leon, jglisse, akpm,
dri-devel, apopple, bskeggs, Gal Shalom
On Thu, Oct 17, 2024 at 04:58:12AM -0700, Christoph Hellwig wrote:
> On Wed, Oct 16, 2024 at 02:44:45PM -0300, Jason Gunthorpe wrote:
> > > > FWIW, I've been expecting this series to be rebased on top of Leon's
> > > > new DMA API series so it doesn't have this issue..
> > >
> > > That's not going to make a difference at this level.
> >
> > I'm not sure what you are asking then.
> >
> > Patch 2 does pci_p2pdma_add_resource() and so a valid struct page with
> > a P2P ZONE_DEVICE type exists, and that gets returned back to the
> > hmm/odp code.
> >
> > Today odp calls dma_map_page() which only works by chance in limited
> > cases. With Leon's revision it will call hmm_dma_map_pfn() ->
> > dma_iova_link() which does call pci_p2pdma_map_type() and should do
> > the right thing.
>
> Again none of this affects the code posted here. It reshuffles the
> callers but has no direct effect on the patches posted here.
I didn't realize till last night that Leon's series did not have P2P
support.
What I'm trying to say is that this is a multi-series project.
A followup based on Leon's initial work will get the ODP DMA mapping
path able to support ZONE_DEVICE P2P pages.
Once that is done, this series sits on top. This series is only about
hmm and effectively allows hmm_range_fault() to return a ZONE_DEVICE
P2P page.
Yonatan should explain this better in the cover letter and mark it as
an RFC series.
So, I know we are still figuring out the P2P support on the DMA API
side, but my expectation for hmm is that hmm_range_fault() returning a
ZONE_DEVICE P2P page is going to be what we want.
> (and the current DMA series lacks P2P support, I'm trying to figure
> out how to properly handle it at the moment).
Yes, I see, I looked through those patches last night and there is a
gap there.
Broadly I think whatever flow NVMe uses for P2P will apply to ODP as
well.
Thanks,
Jason
* Re: [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
2024-10-17 13:05 ` Jason Gunthorpe
@ 2024-10-17 13:12 ` Christoph Hellwig
2024-10-17 13:46 ` Jason Gunthorpe
0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2024-10-17 13:12 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Christoph Hellwig, Yonatan Maman, nouveau, linux-kernel,
linux-rdma, linux-mm, herbst, lyude, dakr, airlied, simona, leon,
jglisse, akpm, dri-devel, apopple, bskeggs, Gal Shalom
On Thu, Oct 17, 2024 at 10:05:39AM -0300, Jason Gunthorpe wrote:
> Broadly I think whatever flow NVMe uses for P2P will apply to ODP as
> well.
ODP is a lot simpler than NVMe for P2P actually :(
* Re: [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
2024-10-17 13:12 ` Christoph Hellwig
@ 2024-10-17 13:46 ` Jason Gunthorpe
2024-10-17 13:49 ` Christoph Hellwig
0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2024-10-17 13:46 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Yonatan Maman, nouveau, linux-kernel, linux-rdma, linux-mm,
herbst, lyude, dakr, airlied, simona, leon, jglisse, akpm,
dri-devel, apopple, bskeggs, Gal Shalom
On Thu, Oct 17, 2024 at 06:12:55AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 17, 2024 at 10:05:39AM -0300, Jason Gunthorpe wrote:
> > Broadly I think whatever flow NVMe uses for P2P will apply to ODP as
> > well.
>
> ODP is a lot simpler than NVMe for P2P actually :(
What is your thinking there? I'm looking at the latest patches and I
would expect dma_iova_init() to accept a phys so it can call
pci_p2pdma_map_type() once for the whole transaction. It is a slow
operation.
Based on the result of pci_p2pdma_map_type() it would have to take one
of three paths: direct, iommu, or acs/switch.
It feels like dma_map_page() should become a new function that takes
in the state and then it can do direct or acs based on the type held
in the state.
ODP would have to refresh the state for each page, but could follow
the same code structure.
Jason
* Re: [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
2024-10-17 13:46 ` Jason Gunthorpe
@ 2024-10-17 13:49 ` Christoph Hellwig
2024-10-17 14:05 ` Jason Gunthorpe
0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2024-10-17 13:49 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Christoph Hellwig, Yonatan Maman, nouveau, linux-kernel,
linux-rdma, linux-mm, herbst, lyude, dakr, airlied, simona, leon,
jglisse, akpm, dri-devel, apopple, bskeggs, Gal Shalom
On Thu, Oct 17, 2024 at 10:46:44AM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 17, 2024 at 06:12:55AM -0700, Christoph Hellwig wrote:
> > On Thu, Oct 17, 2024 at 10:05:39AM -0300, Jason Gunthorpe wrote:
> > > Broadly I think whatever flow NVMe uses for P2P will apply to ODP as
> > > well.
> >
> > ODP is a lot simpler than NVMe for P2P actually :(
>
> What is your thinking there? I'm looking at the latest patches and I
> would expect dma_iova_init() to accept a phys so it can call
> pci_p2pdma_map_type() once for the whole transaction. It is a slow
> operation.
You can't do it for the whole transaction. Here is my suggestion
for ODP:
http://git.infradead.org/?p=users/hch/misc.git;a=shortlog;h=refs/heads/dma-split-wip
For NVMe I need to figure out a way to split bios on a per P2P
type boundary as we don't have any space to record if something is a bus
mapped address.
* Re: [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
2024-10-17 13:49 ` Christoph Hellwig
@ 2024-10-17 14:05 ` Jason Gunthorpe
2024-10-17 14:19 ` Christoph Hellwig
0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2024-10-17 14:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Yonatan Maman, nouveau, linux-kernel, linux-rdma, linux-mm,
herbst, lyude, dakr, airlied, simona, leon, jglisse, akpm,
dri-devel, apopple, bskeggs, Gal Shalom
On Thu, Oct 17, 2024 at 06:49:30AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 17, 2024 at 10:46:44AM -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 17, 2024 at 06:12:55AM -0700, Christoph Hellwig wrote:
> > > On Thu, Oct 17, 2024 at 10:05:39AM -0300, Jason Gunthorpe wrote:
> > > > Broadly I think whatever flow NVMe uses for P2P will apply to ODP as
> > > > well.
> > >
> > > ODP is a lot simpler than NVMe for P2P actually :(
> >
> > What is your thinking there? I'm looking at the latest patches and I
> > would expect dma_iova_init() to accept a phys so it can call
> > pci_p2pdma_map_type() once for the whole transaction. It is a slow
> > operation.
>
> You can't do it for the whole transaction. Here is my suggestion
> for ODP:
>
> http://git.infradead.org/?p=users/hch/misc.git;a=shortlog;h=refs/heads/dma-split-wip
OK, this looks very promising. I sketched something similar to the
pci-p2pdma changes a while back too.
BTW this:
iommu: generalize the batched sync after map interface
I am hoping to move in a direction of adding a gather to the map, just like
unmap. So eventually instead of open coding iotlb_sync_map() you'd
flush the gather and it would do it.
> For NVMe I need to figure out a way to split bios on a per P2P
> type boundary as we don't have any space to record if something is a bus
> mapped address.
Yeah this came up before :\ Can't precompute the p2p type during bio
creation, splitting based on pgmap would be good enough.
Jason
* Re: [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
2024-10-17 14:05 ` Jason Gunthorpe
@ 2024-10-17 14:19 ` Christoph Hellwig
0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2024-10-17 14:19 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Christoph Hellwig, Yonatan Maman, nouveau, linux-kernel,
linux-rdma, linux-mm, herbst, lyude, dakr, airlied, simona, leon,
jglisse, akpm, dri-devel, apopple, bskeggs, Gal Shalom
On Thu, Oct 17, 2024 at 11:05:07AM -0300, Jason Gunthorpe wrote:
> BTW this:
>
> iommu: generalize the batched sync after map interface
>
> I am hoping to in a direction of adding a gather to the map, just like
> unmap. So eventually instead of open coding iotlb_sync_map() you'd
> flush the gather and it would do it.
I don't really care either way, I just need something to not regress
map behavior vs map_sg.
* Re: [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
2024-10-15 15:23 ` [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages Yonatan Maman
2024-10-16 4:49 ` Christoph Hellwig
@ 2024-10-16 5:10 ` Alistair Popple
2024-10-16 15:45 ` Jason Gunthorpe
1 sibling, 1 reply; 28+ messages in thread
From: Alistair Popple @ 2024-10-16 5:10 UTC (permalink / raw)
To: Yonatan Maman
Cc: nouveau, linux-kernel, linux-rdma, linux-mm, herbst, lyude, dakr,
airlied, simona, jgg, leon, jglisse, akpm, dri-devel, bskeggs,
Gal Shalom
Yonatan Maman <ymaman@nvidia.com> writes:
> From: Yonatan Maman <Ymaman@Nvidia.com>
>
> hmm_range_fault() natively triggers a page fault on device private
> pages, migrating them to RAM. In some cases, such as with RDMA devices,
> the migration overhead between the device (e.g., GPU) and the CPU, and
> vice-versa, significantly damages performance. Thus, enabling Peer-to-
> Peer (P2P) DMA access for device private page might be crucial for
> minimizing data transfer overhead.
>
> This change introduces an API to support P2P connections for device
> private pages by implementing the following:
>
> - Leveraging the struct pagemap_ops for P2P Page Callbacks. This
> callback involves mapping the page to MMIO and returning the
> corresponding PCI_P2P page.
>
> - Utilizing hmm_range_fault for Initializing P2P Connections. The API
> also adds the HMM_PFN_REQ_TRY_P2P flag option for the
> hmm_range_fault caller to initialize P2P. If set, hmm_range_fault
> attempts initializing the P2P connection first, if the owner device
> supports P2P, using p2p_page. In case of failure or lack of support,
> hmm_range_fault will continue with the regular flow of migrating the
> page to RAM.
>
> This change does not affect previous use-cases of hmm_range_fault,
> because both the caller and the page owner must explicitly request and
> support it to initialize P2P connection.
>
> Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
> Reviewed-by: Gal Shalom <GalShalom@Nvidia.com>
> ---
> include/linux/hmm.h | 2 ++
> include/linux/memremap.h | 7 +++++++
> mm/hmm.c | 28 ++++++++++++++++++++++++++++
> 3 files changed, 37 insertions(+)
>
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 126a36571667..7154f5ed73a1 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -41,6 +41,8 @@ enum hmm_pfn_flags {
> /* Input flags */
> HMM_PFN_REQ_FAULT = HMM_PFN_VALID,
> HMM_PFN_REQ_WRITE = HMM_PFN_WRITE,
> + /* allow returning PCI P2PDMA pages */
> + HMM_PFN_REQ_ALLOW_P2P = 1,
>
> HMM_PFN_FLAGS = 0xFFUL << HMM_PFN_ORDER_SHIFT,
> };
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 3f7143ade32c..0ecfd3d191fa 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -89,6 +89,13 @@ struct dev_pagemap_ops {
> */
> vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
>
> + /*
> + * Used for private (un-addressable) device memory only. Return a
> + * corresponding struct page, that can be mapped to device
> + * (e.g using dma_map_page)
> + */
> + struct page *(*get_dma_page_for_device)(struct page *private_page);
It would be nice to add some documentation about this feature to
Documentation/mm/hmm.rst. In particular some notes on the page
lifetime/refcounting rules.
On that note how is the refcounting of the returned p2pdma page expected
to work? We don't want the driver calling hmm_range_fault() to be able
to pin the page with eg. get_page(), so the returned p2pdma page should
have a zero refcount to enforce that.
> +
> /*
> * Handle the memory failure happens on a range of pfns. Notify the
> * processes who are using these pfns, and try to recover the data on
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 7e0229ae4a5a..987dd143d697 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -230,6 +230,8 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> unsigned long cpu_flags;
> pte_t pte = ptep_get(ptep);
> uint64_t pfn_req_flags = *hmm_pfn;
> + struct page *(*get_dma_page_handler)(struct page *private_page);
> + struct page *dma_page;
>
> if (pte_none_mostly(pte)) {
> required_fault =
> @@ -257,6 +259,32 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> return 0;
> }
>
> + /*
> + * P2P for supported pages, and according to caller request
> + * translate the private page to the match P2P page if it fails
> + * continue with the regular flow
> + */
> + if (is_device_private_entry(entry)) {
> + get_dma_page_handler =
> + pfn_swap_entry_to_page(entry)
> + ->pgmap->ops->get_dma_page_for_device;
> + if ((hmm_vma_walk->range->default_flags &
> + HMM_PFN_REQ_ALLOW_P2P) &&
> + get_dma_page_handler) {
> + dma_page = get_dma_page_handler(
> + pfn_swap_entry_to_page(entry));
> + if (!IS_ERR(dma_page)) {
> + cpu_flags = HMM_PFN_VALID;
> + if (is_writable_device_private_entry(
> + entry))
> + cpu_flags |= HMM_PFN_WRITE;
> + *hmm_pfn = page_to_pfn(dma_page) |
> + cpu_flags;
> + return 0;
> + }
> + }
> + }
> +
> required_fault =
> hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
> if (!required_fault) {
* Re: [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
2024-10-16 5:10 ` Alistair Popple
@ 2024-10-16 15:45 ` Jason Gunthorpe
2024-10-17 1:58 ` Alistair Popple
0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2024-10-16 15:45 UTC (permalink / raw)
To: Alistair Popple
Cc: Yonatan Maman, nouveau, linux-kernel, linux-rdma, linux-mm,
herbst, lyude, dakr, airlied, simona, leon, jglisse, akpm,
dri-devel, bskeggs, Gal Shalom
On Wed, Oct 16, 2024 at 04:10:53PM +1100, Alistair Popple wrote:
> On that note how is the refcounting of the returned p2pdma page expected
> to work? We don't want the driver calling hmm_range_fault() to be able
> to pin the page with eg. get_page(), so the returned p2pdma page should
> have a zero refcount to enforce that.
I think that is just the rule for hmm stuff in general, don't touch
the refcount.
We don't need to enforce it, we don't know what else the driver will
want to use that P2P page for after all. It might stick it in a VMA
for some unrelated reason.
Jason
* Re: [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
2024-10-16 15:45 ` Jason Gunthorpe
@ 2024-10-17 1:58 ` Alistair Popple
2024-10-17 11:53 ` Jason Gunthorpe
0 siblings, 1 reply; 28+ messages in thread
From: Alistair Popple @ 2024-10-17 1:58 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Yonatan Maman, nouveau, linux-kernel, linux-rdma, linux-mm,
herbst, lyude, dakr, airlied, simona, leon, jglisse, akpm,
dri-devel, bskeggs, Gal Shalom
Jason Gunthorpe <jgg@ziepe.ca> writes:
> On Wed, Oct 16, 2024 at 04:10:53PM +1100, Alistair Popple wrote:
>> On that note how is the refcounting of the returned p2pdma page expected
>> to work? We don't want the driver calling hmm_range_fault() to be able
>> to pin the page with eg. get_page(), so the returned p2pdma page should
>> have a zero refcount to enforce that.
>
> I think that is just the rule for hmm stuff in general, don't touch
> the refcount.
Actually I think the rule should be don't look at the page at
all. hmm_range_fault() is about mirroring PTEs, no assumption should
even be made about the existence or otherwise of a struct page.
> We don't need to enforce it, we don't know what else the driver will
> want to use that P2P page for after all. It might stick it in a VMA
> for some unrelated reason.
And wouldn't that touch the refcount and therefore be wrong? How is the
driver which handed out the P2PDMA page meant to ensure it can
free/evict the underlying device private memory in this case? Normally
this would happen by migration, which would trigger an MMU notifier on
the virtual address. But if the P2PDMA page has been mapped into another
VMA how does it drop the reference on the P2PDMA page?
- Alistair
> Jason
* Re: [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
2024-10-17 1:58 ` Alistair Popple
@ 2024-10-17 11:53 ` Jason Gunthorpe
0 siblings, 0 replies; 28+ messages in thread
From: Jason Gunthorpe @ 2024-10-17 11:53 UTC (permalink / raw)
To: Alistair Popple
Cc: Yonatan Maman, nouveau, linux-kernel, linux-rdma, linux-mm,
herbst, lyude, dakr, airlied, simona, leon, jglisse, akpm,
dri-devel, bskeggs, Gal Shalom
On Thu, Oct 17, 2024 at 12:58:48PM +1100, Alistair Popple wrote:
> Actually I think the rule should be don't look at the page at
> all. hmm_range_fault() is about mirroring PTEs, no assumption should
> even be made about the existence or otherwise of a struct page.
We are not there yet..
> > We don't need to enforce it, we don't know what else the driver will
> > want to use that P2P page for after all. It might stick it in a VMA
> > for some unrelated reason.
>
> And wouldn't that touch the refcount and therefore be wrong?
I mean the originating driver would do that.
Jason
* [PATCH v1 2/4] nouveau/dmem: HMM P2P DMA for private dev pages
2024-10-15 15:23 [PATCH v1 0/4] GPU Direct RDMA (P2P DMA) for Device Private Pages Yonatan Maman
2024-10-15 15:23 ` [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages Yonatan Maman
@ 2024-10-15 15:23 ` Yonatan Maman
2024-10-16 5:12 ` Alistair Popple
2024-10-15 15:23 ` [PATCH v1 3/4] IB/core: P2P DMA for device private pages Yonatan Maman
` (2 subsequent siblings)
4 siblings, 1 reply; 28+ messages in thread
From: Yonatan Maman @ 2024-10-15 15:23 UTC (permalink / raw)
To: nouveau, linux-kernel, linux-rdma, linux-mm, herbst, lyude, dakr,
airlied, simona, jgg, leon, jglisse, akpm, dri-devel, apopple,
bskeggs
Cc: Yonatan Maman, Gal Shalom
From: Yonatan Maman <Ymaman@Nvidia.com>
Enabling Peer-to-Peer DMA (P2P DMA) access in GPU-centric applications
is crucial for minimizing data transfer overhead (e.g., for RDMA use-
case).
This change aims to enable that capability for Nouveau over HMM device
private pages. P2P DMA for private device pages allows the GPU to
directly exchange data with other devices (e.g., NICs) without needing
to traverse system RAM.
To fully support Peer-to-Peer for device private pages, the following
changes are made:
- Introduce struct nouveau_dmem_hmm_p2p within struct nouveau_dmem
to manage BAR1 PCI P2P memory. p2p_start_addr holds the virtual
address allocated with pci_alloc_p2pmem(), and p2p_size represents
the allocated size of the PCI P2P memory.
- nouveau_dmem_init - Ensure BAR1 accessibility and assign struct
pages (PCI_P2P_PAGE) for all BAR1 pages. Introduce
nouveau_alloc_bar1_pci_p2p_mem in nouveau_dmem to expose BAR1 for
use as P2P memory via pci_p2pdma_add_resource and implement static
allocation and assignment of struct pages using pci_alloc_p2pmem.
This function will be called from nouveau_dmem_init, and failure
triggers a warning message instead of driver failure.
- nouveau_dmem_fini - Ensure BAR1 PCI P2P memory is properly
destroyed during driver cleanup. Introduce
nouveau_destroy_bar1_pci_p2p_mem to handle freeing of PCI P2P
memory associated with Nouveau BAR1. Modify nouveau_dmem_fini to
call nouveau_destroy_bar1_pci_p2p_mem.
- Implement Nouveau `p2p_page` callback function - Implement BAR1
mapping for the chunk using `io_mem_reserve` if no mapping exists.
Retrieve the pre-allocated P2P virtual address and size from
`hmm_p2p`. Calculate the page offset within BAR1 and return the
corresponding P2P page.
Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
Reviewed-by: Gal Shalom <GalShalom@Nvidia.com>
---
drivers/gpu/drm/nouveau/nouveau_dmem.c | 117 ++++++++++++++++++++++++-
1 file changed, 115 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 1a072568cef6..13fb8671f212 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -40,6 +40,9 @@
#include <linux/hmm.h>
#include <linux/memremap.h>
#include <linux/migrate.h>
+#include <linux/pci-p2pdma.h>
+#include <nvkm/core/pci.h>
+
/*
* FIXME: this is ugly right now we are using TTM to allocate vram and we pin
@@ -77,9 +80,15 @@ struct nouveau_dmem_migrate {
struct nouveau_channel *chan;
};
+struct nouveau_dmem_hmm_p2p {
+ size_t p2p_size;
+ void *p2p_start_addr;
+};
+
struct nouveau_dmem {
struct nouveau_drm *drm;
struct nouveau_dmem_migrate migrate;
+ struct nouveau_dmem_hmm_p2p hmm_p2p;
struct list_head chunks;
struct mutex mutex;
struct page *free_pages;
@@ -158,6 +167,61 @@ static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page *spage,
return 0;
}
+static int nouveau_dmem_bar1_mapping(struct nouveau_bo *nvbo,
+ unsigned long long *bus_addr)
+{
+ int ret;
+ struct ttm_resource *mem = nvbo->bo.resource;
+
+ if (mem->bus.offset) {
+ *bus_addr = mem->bus.offset;
+ return 0;
+ }
+
+ if (PFN_UP(nvbo->bo.base.size) > PFN_UP(nvbo->bo.resource->size))
+ return -EINVAL;
+
+ ret = ttm_bo_reserve(&nvbo->bo, false, false, NULL);
+ if (ret)
+ return ret;
+
+ ret = nvbo->bo.bdev->funcs->io_mem_reserve(nvbo->bo.bdev, mem);
+ *bus_addr = mem->bus.offset;
+
+ ttm_bo_unreserve(&nvbo->bo);
+ return ret;
+}
+
+static struct page *nouveau_dmem_get_dma_page(struct page *private_page)
+{
+ int ret;
+ unsigned long long offset_in_chunk, offset_in_bar1;
+ unsigned long long chunk_bus_addr, page_bus_addr;
+ unsigned long long bar1_base_addr;
+ struct nouveau_drm *drm = page_to_drm(private_page);
+ struct nouveau_bo *nvbo = nouveau_page_to_chunk(private_page)->bo;
+ struct nvkm_device *nv_device = nvxx_device(drm);
+ void *p2p_start_addr = drm->dmem->hmm_p2p.p2p_start_addr;
+ size_t p2p_size = drm->dmem->hmm_p2p.p2p_size;
+
+ bar1_base_addr = nv_device->func->resource_addr(nv_device, 1);
+ offset_in_chunk =
+ (page_to_pfn(private_page) << PAGE_SHIFT) -
+ nouveau_page_to_chunk(private_page)->pagemap.range.start;
+
+ ret = nouveau_dmem_bar1_mapping(nvbo, &chunk_bus_addr);
+ if (ret)
+ return ERR_PTR(ret);
+
+ page_bus_addr = chunk_bus_addr + offset_in_chunk;
+ if (!p2p_size || page_bus_addr > bar1_base_addr + p2p_size ||
+ page_bus_addr < bar1_base_addr)
+ return ERR_PTR(-ENOMEM);
+
+ offset_in_bar1 = page_bus_addr - bar1_base_addr;
+ return virt_to_page(p2p_start_addr + offset_in_bar1);
+}
+
static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
{
struct nouveau_drm *drm = page_to_drm(vmf->page);
@@ -219,8 +283,9 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
}
static const struct dev_pagemap_ops nouveau_dmem_pagemap_ops = {
- .page_free = nouveau_dmem_page_free,
- .migrate_to_ram = nouveau_dmem_migrate_to_ram,
+ .page_free = nouveau_dmem_page_free,
+ .migrate_to_ram = nouveau_dmem_migrate_to_ram,
+ .get_dma_page_for_device = nouveau_dmem_get_dma_page,
};
static int
@@ -413,14 +478,31 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
kvfree(dma_addrs);
}
+static void nouveau_destroy_bar1_pci_p2p_mem(struct nouveau_drm *drm,
+ struct pci_dev *pdev,
+ void *p2p_start_addr,
+ size_t p2p_size)
+{
+ if (p2p_size)
+ pci_free_p2pmem(pdev, p2p_start_addr, p2p_size);
+
+ NV_INFO(drm, "PCI P2P memory freed(%p)\n", p2p_start_addr);
+}
+
void
nouveau_dmem_fini(struct nouveau_drm *drm)
{
struct nouveau_dmem_chunk *chunk, *tmp;
+ struct nvkm_device *nv_device = nvxx_device(drm);
if (drm->dmem == NULL)
return;
+ nouveau_destroy_bar1_pci_p2p_mem(drm,
+ nv_device->func->pci(nv_device)->pdev,
+ drm->dmem->hmm_p2p.p2p_start_addr,
+ drm->dmem->hmm_p2p.p2p_size);
+
mutex_lock(&drm->dmem->mutex);
list_for_each_entry_safe(chunk, tmp, &drm->dmem->chunks, list) {
@@ -586,10 +668,30 @@ nouveau_dmem_migrate_init(struct nouveau_drm *drm)
return -ENODEV;
}
+static int nouveau_alloc_bar1_pci_p2p_mem(struct nouveau_drm *drm,
+ struct pci_dev *pdev, size_t size,
+ void **pp2p_start_addr,
+ size_t *pp2p_size)
+{
+ int ret;
+
+ ret = pci_p2pdma_add_resource(pdev, 1, size, 0);
+ if (ret)
+ return ret;
+
+ *pp2p_start_addr = pci_alloc_p2pmem(pdev, size);
+ *pp2p_size = (*pp2p_start_addr) ? size : 0;
+
+ NV_INFO(drm, "PCI P2P memory allocated(%p)\n", *pp2p_start_addr);
+ return 0;
+}
+
void
nouveau_dmem_init(struct nouveau_drm *drm)
{
int ret;
+ struct nvkm_device *nv_device = nvxx_device(drm);
+ size_t bar1_size;
/* This only make sense on PASCAL or newer */
if (drm->client.device.info.family < NV_DEVICE_INFO_V0_PASCAL)
@@ -610,6 +712,17 @@ nouveau_dmem_init(struct nouveau_drm *drm)
kfree(drm->dmem);
drm->dmem = NULL;
}
+
+ /* Expose BAR1 for HMM P2P Memory */
+ bar1_size = nv_device->func->resource_size(nv_device, 1);
+ ret = nouveau_alloc_bar1_pci_p2p_mem(drm,
+ nv_device->func->pci(nv_device)->pdev,
+ bar1_size,
+ &drm->dmem->hmm_p2p.p2p_start_addr,
+ &drm->dmem->hmm_p2p.p2p_size);
+ if (ret)
+ NV_WARN(drm,
+ "PCI P2P memory allocation failed, HMM P2P won't be supported\n");
}
static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
--
2.34.1
* Re: [PATCH v1 2/4] nouveau/dmem: HMM P2P DMA for private dev pages
2024-10-15 15:23 ` [PATCH v1 2/4] nouveau/dmem: HMM P2P DMA for private dev pages Yonatan Maman
@ 2024-10-16 5:12 ` Alistair Popple
2024-10-16 15:18 ` Yonatan Maman
0 siblings, 1 reply; 28+ messages in thread
From: Alistair Popple @ 2024-10-16 5:12 UTC (permalink / raw)
To: Yonatan Maman
Cc: nouveau, linux-kernel, linux-rdma, linux-mm, herbst, lyude, dakr,
airlied, simona, jgg, leon, jglisse, akpm, dri-devel, bskeggs,
Gal Shalom
Yonatan Maman <ymaman@nvidia.com> writes:
> From: Yonatan Maman <Ymaman@Nvidia.com>
>
> Enabling Peer-to-Peer DMA (P2P DMA) access in GPU-centric applications
> is crucial for minimizing data transfer overhead (e.g., for RDMA use-
> case).
>
> This change aims to enable that capability for Nouveau over HMM device
> private pages. P2P DMA for private device pages allows the GPU to
> directly exchange data with other devices (e.g., NICs) without needing
> to traverse system RAM.
>
> To fully support Peer-to-Peer for device private pages, the following
> changes are made:
>
> - Introduce struct nouveau_dmem_hmm_p2p within struct nouveau_dmem
> to manage BAR1 PCI P2P memory. p2p_start_addr holds the virtual
> address allocated with pci_alloc_p2pmem(), and p2p_size represents
> the allocated size of the PCI P2P memory.
>
> - nouveau_dmem_init - Ensure BAR1 accessibility and assign struct
> pages (PCI_P2P_PAGE) for all BAR1 pages. Introduce
> nouveau_alloc_bar1_pci_p2p_mem in nouveau_dmem to expose BAR1 for
> use as P2P memory via pci_p2pdma_add_resource and implement static
> allocation and assignment of struct pages using pci_alloc_p2pmem.
> This function will be called from nouveau_dmem_init, and failure
> triggers a warning message instead of driver failure.
>
> - nouveau_dmem_fini - Ensure BAR1 PCI P2P memory is properly
> destroyed during driver cleanup. Introduce
> nouveau_destroy_bar1_pci_p2p_mem to handle freeing of PCI P2P
> memory associated with Nouveau BAR1. Modify nouveau_dmem_fini to
> call nouveau_destroy_bar1_pci_p2p_mem.
>
> - Implement Nouveau `p2p_page` callback function - Implement BAR1
> mapping for the chunk using `io_mem_reserve` if no mapping exists.
> Retrieve the pre-allocated P2P virtual address and size from
> `hmm_p2p`. Calculate the page offset within BAR1 and return the
> corresponding P2P page.
>
> Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
> Reviewed-by: Gal Shalom <GalShalom@Nvidia.com>
> ---
> drivers/gpu/drm/nouveau/nouveau_dmem.c | 117 ++++++++++++++++++++++++-
> 1 file changed, 115 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index 1a072568cef6..13fb8671f212 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -40,6 +40,9 @@
> #include <linux/hmm.h>
> #include <linux/memremap.h>
> #include <linux/migrate.h>
> +#include <linux/pci-p2pdma.h>
> +#include <nvkm/core/pci.h>
> +
>
> /*
> * FIXME: this is ugly right now we are using TTM to allocate vram and we pin
> @@ -77,9 +80,15 @@ struct nouveau_dmem_migrate {
> struct nouveau_channel *chan;
> };
>
> +struct nouveau_dmem_hmm_p2p {
> + size_t p2p_size;
> + void *p2p_start_addr;
> +};
> +
> struct nouveau_dmem {
> struct nouveau_drm *drm;
> struct nouveau_dmem_migrate migrate;
> + struct nouveau_dmem_hmm_p2p hmm_p2p;
> struct list_head chunks;
> struct mutex mutex;
> struct page *free_pages;
> @@ -158,6 +167,61 @@ static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page *spage,
> return 0;
> }
>
> +static int nouveau_dmem_bar1_mapping(struct nouveau_bo *nvbo,
> + unsigned long long *bus_addr)
> +{
> + int ret;
> + struct ttm_resource *mem = nvbo->bo.resource;
> +
> + if (mem->bus.offset) {
> + *bus_addr = mem->bus.offset;
> + return 0;
> + }
> +
> + if (PFN_UP(nvbo->bo.base.size) > PFN_UP(nvbo->bo.resource->size))
> + return -EINVAL;
> +
> + ret = ttm_bo_reserve(&nvbo->bo, false, false, NULL);
> + if (ret)
> + return ret;
> +
> + ret = nvbo->bo.bdev->funcs->io_mem_reserve(nvbo->bo.bdev, mem);
> + *bus_addr = mem->bus.offset;
> +
> + ttm_bo_unreserve(&nvbo->bo);
> + return ret;
> +}
> +
> +static struct page *nouveau_dmem_get_dma_page(struct page *private_page)
> +{
> + int ret;
> + unsigned long long offset_in_chunk, offset_in_bar1;
> + unsigned long long chunk_bus_addr, page_bus_addr;
> + unsigned long long bar1_base_addr;
> + struct nouveau_drm *drm = page_to_drm(private_page);
> + struct nouveau_bo *nvbo = nouveau_page_to_chunk(private_page)->bo;
> + struct nvkm_device *nv_device = nvxx_device(drm);
> + void *p2p_start_addr = drm->dmem->hmm_p2p.p2p_start_addr;
> + size_t p2p_size = drm->dmem->hmm_p2p.p2p_size;
> +
> + bar1_base_addr = nv_device->func->resource_addr(nv_device, 1);
> + offset_in_chunk =
> + (page_to_pfn(private_page) << PAGE_SHIFT) -
> + nouveau_page_to_chunk(private_page)->pagemap.range.start;
> +
> + ret = nouveau_dmem_bar1_mapping(nvbo, &chunk_bus_addr);
> + if (ret)
> + return ERR_PTR(ret);
> +
> + page_bus_addr = chunk_bus_addr + offset_in_chunk;
> + if (!p2p_size || page_bus_addr > bar1_base_addr + p2p_size ||
> + page_bus_addr < bar1_base_addr)
> + return ERR_PTR(-ENOMEM);
> +
> + offset_in_bar1 = page_bus_addr - bar1_base_addr;
> + return virt_to_page(p2p_start_addr + offset_in_bar1);
This conversion looks a bit complicated. Once you have page_bus_addr I
think you can just return pfn_to_page(page_bus_addr >> PAGE_SHIFT)
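Sketch of that simplification (untested):

	page_bus_addr = chunk_bus_addr + offset_in_chunk;
	if (!p2p_size || page_bus_addr < bar1_base_addr ||
	    page_bus_addr >= bar1_base_addr + p2p_size)
		return ERR_PTR(-ENOMEM);

	/* pci_p2pdma_add_resource() created struct pages for this range */
	return pfn_to_page(page_bus_addr >> PAGE_SHIFT);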
> +}
> +
> static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
> {
> struct nouveau_drm *drm = page_to_drm(vmf->page);
> @@ -219,8 +283,9 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
> }
>
> static const struct dev_pagemap_ops nouveau_dmem_pagemap_ops = {
> - .page_free = nouveau_dmem_page_free,
> - .migrate_to_ram = nouveau_dmem_migrate_to_ram,
> + .page_free = nouveau_dmem_page_free,
> + .migrate_to_ram = nouveau_dmem_migrate_to_ram,
> + .get_dma_page_for_device = nouveau_dmem_get_dma_page,
> };
>
> static int
> @@ -413,14 +478,31 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
> kvfree(dma_addrs);
> }
>
> +static void nouveau_destroy_bar1_pci_p2p_mem(struct nouveau_drm *drm,
> + struct pci_dev *pdev,
> + void *p2p_start_addr,
> + size_t p2p_size)
> +{
> + if (p2p_size)
> + pci_free_p2pmem(pdev, p2p_start_addr, p2p_size);
> +
> + NV_INFO(drm, "PCI P2P memory freed(%p)\n", p2p_start_addr);
> +}
> +
> void
> nouveau_dmem_fini(struct nouveau_drm *drm)
> {
> struct nouveau_dmem_chunk *chunk, *tmp;
> + struct nvkm_device *nv_device = nvxx_device(drm);
>
> if (drm->dmem == NULL)
> return;
>
> + nouveau_destroy_bar1_pci_p2p_mem(drm,
> + nv_device->func->pci(nv_device)->pdev,
> + drm->dmem->hmm_p2p.p2p_start_addr,
> + drm->dmem->hmm_p2p.p2p_size);
> +
> mutex_lock(&drm->dmem->mutex);
>
> list_for_each_entry_safe(chunk, tmp, &drm->dmem->chunks, list) {
> @@ -586,10 +668,30 @@ nouveau_dmem_migrate_init(struct nouveau_drm *drm)
> return -ENODEV;
> }
>
> +static int nouveau_alloc_bar1_pci_p2p_mem(struct nouveau_drm *drm,
> + struct pci_dev *pdev, size_t size,
> + void **pp2p_start_addr,
> + size_t *pp2p_size)
> +{
> + int ret;
> +
> + ret = pci_p2pdma_add_resource(pdev, 1, size, 0);
> + if (ret)
> + return ret;
> +
> + *pp2p_start_addr = pci_alloc_p2pmem(pdev, size);
> + *pp2p_size = (*pp2p_start_addr) ? size : 0;
Why return the size here? Personally I think it would be clearer to have
the caller directly initialise/clear whatever struct values it needs.
> +
> + NV_INFO(drm, "PCI P2P memory allocated(%p)\n", *pp2p_start_addr);
> + return 0;
> +}
> +
> void
> nouveau_dmem_init(struct nouveau_drm *drm)
> {
> int ret;
> + struct nvkm_device *nv_device = nvxx_device(drm);
> + size_t bar1_size;
>
> /* This only make sense on PASCAL or newer */
> if (drm->client.device.info.family < NV_DEVICE_INFO_V0_PASCAL)
> @@ -610,6 +712,17 @@ nouveau_dmem_init(struct nouveau_drm *drm)
> kfree(drm->dmem);
> drm->dmem = NULL;
> }
> +
> + /* Expose BAR1 for HMM P2P Memory */
> + bar1_size = nv_device->func->resource_size(nv_device, 1);
> + ret = nouveau_alloc_bar1_pci_p2p_mem(drm,
> + nv_device->func->pci(nv_device)->pdev,
> + bar1_size,
> + &drm->dmem->hmm_p2p.p2p_start_addr,
> + &drm->dmem->hmm_p2p.p2p_size);
> + if (ret)
> + NV_WARN(drm,
> + "PCI P2P memory allocation failed, HMM P2P won't be supported\n");
> }
>
> static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 2/4] nouveau/dmem: HMM P2P DMA for private dev pages
2024-10-16 5:12 ` Alistair Popple
@ 2024-10-16 15:18 ` Yonatan Maman
0 siblings, 0 replies; 28+ messages in thread
From: Yonatan Maman @ 2024-10-16 15:18 UTC (permalink / raw)
To: Alistair Popple
Cc: nouveau, linux-kernel, linux-rdma, linux-mm, herbst, lyude, dakr,
airlied, simona, jgg, leon, jglisse, akpm, dri-devel, bskeggs,
Gal Shalom
On 16/10/2024 8:12, Alistair Popple wrote:
>
> Yonatan Maman <ymaman@nvidia.com> writes:
>
>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>
>> Enabling Peer-to-Peer DMA (P2P DMA) access in GPU-centric applications
>> is crucial for minimizing data transfer overhead (e.g., for RDMA use-
>> case).
>>
>> This change aims to enable that capability for Nouveau over HMM device
>> private pages. P2P DMA for private device pages allows the GPU to
>> directly exchange data with other devices (e.g., NICs) without needing
>> to traverse system RAM.
>>
>> To fully support Peer-to-Peer for device private pages, the following
>> changes are made:
>>
>> - Introduce struct nouveau_dmem_hmm_p2p within struct nouveau_dmem
>> to manage BAR1 PCI P2P memory. p2p_start_addr holds the virtual
>> address allocated with pci_alloc_p2pmem(), and p2p_size represents
>> the allocated size of the PCI P2P memory.
>>
>> - nouveau_dmem_init - Ensure BAR1 accessibility and assign struct
>> pages (PCI_P2P_PAGE) for all BAR1 pages. Introduce
>> nouveau_alloc_bar1_pci_p2p_mem in nouveau_dmem to expose BAR1 for
>> use as P2P memory via pci_p2pdma_add_resource and implement static
>> allocation and assignment of struct pages using pci_alloc_p2pmem.
>> This function will be called from nouveau_dmem_init, and failure
>> triggers a warning message instead of driver failure.
>>
>> - nouveau_dmem_fini - Ensure BAR1 PCI P2P memory is properly
>> destroyed during driver cleanup. Introduce
>> nouveau_destroy_bar1_pci_p2p_mem to handle freeing of PCI P2P
>> memory associated with Nouveau BAR1. Modify nouveau_dmem_fini to
>> call nouveau_destroy_bar1_pci_p2p_mem.
>>
>> - Implement Nouveau `p2p_page` callback function - Implement BAR1
>> mapping for the chunk using `io_mem_reserve` if no mapping exists.
>> Retrieve the pre-allocated P2P virtual address and size from
>> `hmm_p2p`. Calculate the page offset within BAR1 and return the
>> corresponding P2P page.
>>
>> Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
>> Reviewed-by: Gal Shalom <GalShalom@Nvidia.com>
>> ---
>> drivers/gpu/drm/nouveau/nouveau_dmem.c | 117 ++++++++++++++++++++++++-
>> 1 file changed, 115 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> index 1a072568cef6..13fb8671f212 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
>> @@ -40,6 +40,9 @@
>> #include <linux/hmm.h>
>> #include <linux/memremap.h>
>> #include <linux/migrate.h>
>> +#include <linux/pci-p2pdma.h>
>> +#include <nvkm/core/pci.h>
>> +
>>
>> /*
>> * FIXME: this is ugly right now we are using TTM to allocate vram and we pin
>> @@ -77,9 +80,15 @@ struct nouveau_dmem_migrate {
>> struct nouveau_channel *chan;
>> };
>>
>> +struct nouveau_dmem_hmm_p2p {
>> + size_t p2p_size;
>> + void *p2p_start_addr;
>> +};
>> +
>> struct nouveau_dmem {
>> struct nouveau_drm *drm;
>> struct nouveau_dmem_migrate migrate;
>> + struct nouveau_dmem_hmm_p2p hmm_p2p;
>> struct list_head chunks;
>> struct mutex mutex;
>> struct page *free_pages;
>> @@ -158,6 +167,61 @@ static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page *spage,
>> return 0;
>> }
>>
>> +static int nouveau_dmem_bar1_mapping(struct nouveau_bo *nvbo,
>> + unsigned long long *bus_addr)
>> +{
>> + int ret;
>> + struct ttm_resource *mem = nvbo->bo.resource;
>> +
>> + if (mem->bus.offset) {
>> + *bus_addr = mem->bus.offset;
>> + return 0;
>> + }
>> +
>> + if (PFN_UP(nvbo->bo.base.size) > PFN_UP(nvbo->bo.resource->size))
>> + return -EINVAL;
>> +
>> + ret = ttm_bo_reserve(&nvbo->bo, false, false, NULL);
>> + if (ret)
>> + return ret;
>> +
>> + ret = nvbo->bo.bdev->funcs->io_mem_reserve(nvbo->bo.bdev, mem);
>> + *bus_addr = mem->bus.offset;
>> +
>> + ttm_bo_unreserve(&nvbo->bo);
>> + return ret;
>> +}
>> +
>> +static struct page *nouveau_dmem_get_dma_page(struct page *private_page)
>> +{
>> + int ret;
>> + unsigned long long offset_in_chunk, offset_in_bar1;
>> + unsigned long long chunk_bus_addr, page_bus_addr;
>> + unsigned long long bar1_base_addr;
>> + struct nouveau_drm *drm = page_to_drm(private_page);
>> + struct nouveau_bo *nvbo = nouveau_page_to_chunk(private_page)->bo;
>> + struct nvkm_device *nv_device = nvxx_device(drm);
>> + void *p2p_start_addr = drm->dmem->hmm_p2p.p2p_start_addr;
>> + size_t p2p_size = drm->dmem->hmm_p2p.p2p_size;
>> +
>> + bar1_base_addr = nv_device->func->resource_addr(nv_device, 1);
>> + offset_in_chunk =
>> + (page_to_pfn(private_page) << PAGE_SHIFT) -
>> + nouveau_page_to_chunk(private_page)->pagemap.range.start;
>> +
>> + ret = nouveau_dmem_bar1_mapping(nvbo, &chunk_bus_addr);
>> + if (ret)
>> + return ERR_PTR(ret);
>> +
>> + page_bus_addr = chunk_bus_addr + offset_in_chunk;
>> + if (!p2p_size || page_bus_addr > bar1_base_addr + p2p_size ||
>> + page_bus_addr < bar1_base_addr)
>> + return ERR_PTR(-ENOMEM);
>> +
>> + offset_in_bar1 = page_bus_addr - bar1_base_addr;
>> + return virt_to_page(p2p_start_addr + offset_in_bar1);
>
> This conversion looks a bit complicated. Once you have page_bus_addr I
> think you can just return pfn_to_page(page_bus_addr >> PAGE_SHIFT)
>
Agreed, I will fix that in v2.
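For reference, here is a minimal sketch of what that simplification
could look like (an illustration only, not the actual v2; the helper
name is made up, and it assumes every bus address inside the registered
BAR1 P2P range is backed by a struct page created by
pci_p2pdma_add_resource()):

/* Illustrative sketch: map a BAR1 bus address straight to its P2PDMA page. */
static struct page *bar1_bus_addr_to_page(unsigned long long page_bus_addr,
					  unsigned long long bar1_base_addr,
					  size_t p2p_size)
{
	/* Reject addresses outside the range registered for P2PDMA. */
	if (!p2p_size || page_bus_addr < bar1_base_addr ||
	    page_bus_addr >= bar1_base_addr + p2p_size)
		return ERR_PTR(-ENOMEM);

	return pfn_to_page(page_bus_addr >> PAGE_SHIFT);
}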
>> +}
>> +
>> static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
>> {
>> struct nouveau_drm *drm = page_to_drm(vmf->page);
>> @@ -219,8 +283,9 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
>> }
>>
>> static const struct dev_pagemap_ops nouveau_dmem_pagemap_ops = {
>> - .page_free = nouveau_dmem_page_free,
>> - .migrate_to_ram = nouveau_dmem_migrate_to_ram,
>> + .page_free = nouveau_dmem_page_free,
>> + .migrate_to_ram = nouveau_dmem_migrate_to_ram,
>> + .get_dma_page_for_device = nouveau_dmem_get_dma_page,
>> };
>>
>> static int
>> @@ -413,14 +478,31 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
>> kvfree(dma_addrs);
>> }
>>
>> +static void nouveau_destroy_bar1_pci_p2p_mem(struct nouveau_drm *drm,
>> + struct pci_dev *pdev,
>> + void *p2p_start_addr,
>> + size_t p2p_size)
>> +{
>> + if (p2p_size)
>> + pci_free_p2pmem(pdev, p2p_start_addr, p2p_size);
>> +
>> + NV_INFO(drm, "PCI P2P memory freed(%p)\n", p2p_start_addr);
>> +}
>> +
>> void
>> nouveau_dmem_fini(struct nouveau_drm *drm)
>> {
>> struct nouveau_dmem_chunk *chunk, *tmp;
>> + struct nvkm_device *nv_device = nvxx_device(drm);
>>
>> if (drm->dmem == NULL)
>> return;
>>
>> + nouveau_destroy_bar1_pci_p2p_mem(drm,
>> + nv_device->func->pci(nv_device)->pdev,
>> + drm->dmem->hmm_p2p.p2p_start_addr,
>> + drm->dmem->hmm_p2p.p2p_size);
>> +
>> mutex_lock(&drm->dmem->mutex);
>>
>> list_for_each_entry_safe(chunk, tmp, &drm->dmem->chunks, list) {
>> @@ -586,10 +668,30 @@ nouveau_dmem_migrate_init(struct nouveau_drm *drm)
>> return -ENODEV;
>> }
>>
>> +static int nouveau_alloc_bar1_pci_p2p_mem(struct nouveau_drm *drm,
>> + struct pci_dev *pdev, size_t size,
>> + void **pp2p_start_addr,
>> + size_t *pp2p_size)
>> +{
>> + int ret;
>> +
>> + ret = pci_p2pdma_add_resource(pdev, 1, size, 0);
>> + if (ret)
>> + return ret;
>> +
>> + *pp2p_start_addr = pci_alloc_p2pmem(pdev, size);
>> + *pp2p_size = (*pp2p_start_addr) ? size : 0;
>
> Why return the size here? Personally I think it would be clearer to have
> the caller directly initialise/clear whatever struct values it needs.
>
Agreed, I will fix that in v2.
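For illustration, one possible shape of that rework (a sketch only, not
the actual v2; the helper returns just the mapping and
nouveau_dmem_init() owns the hmm_p2p bookkeeping):

/* Illustrative sketch: allocation helper without output parameters. */
static void *nouveau_alloc_bar1_pci_p2p_mem(struct pci_dev *pdev, size_t size)
{
	int ret;

	ret = pci_p2pdma_add_resource(pdev, 1, size, 0);
	if (ret)
		return NULL;

	/* May return NULL if the P2P allocation fails. */
	return pci_alloc_p2pmem(pdev, size);
}

and in nouveau_dmem_init(), with pdev and bar1_size set up as in the
patch above:

	void *p2p = nouveau_alloc_bar1_pci_p2p_mem(pdev, bar1_size);

	if (p2p) {
		drm->dmem->hmm_p2p.p2p_start_addr = p2p;
		drm->dmem->hmm_p2p.p2p_size = bar1_size;
	} else {
		NV_WARN(drm, "PCI P2P memory allocation failed, HMM P2P won't be supported\n");
	}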
>> +
>> + NV_INFO(drm, "PCI P2P memory allocated(%p)\n", *pp2p_start_addr);
>> + return 0;
>> +}
>> +
>> void
>> nouveau_dmem_init(struct nouveau_drm *drm)
>> {
>> int ret;
>> + struct nvkm_device *nv_device = nvxx_device(drm);
>> + size_t bar1_size;
>>
>> /* This only make sense on PASCAL or newer */
>> if (drm->client.device.info.family < NV_DEVICE_INFO_V0_PASCAL)
>> @@ -610,6 +712,17 @@ nouveau_dmem_init(struct nouveau_drm *drm)
>> kfree(drm->dmem);
>> drm->dmem = NULL;
>> }
>> +
>> + /* Expose BAR1 for HMM P2P Memory */
>> + bar1_size = nv_device->func->resource_size(nv_device, 1);
>> + ret = nouveau_alloc_bar1_pci_p2p_mem(drm,
>> + nv_device->func->pci(nv_device)->pdev,
>> + bar1_size,
>> + &drm->dmem->hmm_p2p.p2p_start_addr,
>> + &drm->dmem->hmm_p2p.p2p_size);
>> + if (ret)
>> + NV_WARN(drm,
>> + "PCI P2P memory allocation failed, HMM P2P won't be supported\n");
>> }
>>
>> static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v1 3/4] IB/core: P2P DMA for device private pages
2024-10-15 15:23 [PATCH v1 0/4] GPU Direct RDMA (P2P DMA) for Device Private Pages Yonatan Maman
2024-10-15 15:23 ` [PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages Yonatan Maman
2024-10-15 15:23 ` [PATCH v1 2/4] nouveau/dmem: HMM P2P DMA for private dev pages Yonatan Maman
@ 2024-10-15 15:23 ` Yonatan Maman
2024-10-15 15:23 ` [PATCH v1 4/4] RDMA/mlx5: Enabling ATS for ODP memory Yonatan Maman
2024-10-16 4:23 ` [PATCH v1 0/4] GPU Direct RDMA (P2P DMA) for Device Private Pages Christoph Hellwig
4 siblings, 0 replies; 28+ messages in thread
From: Yonatan Maman @ 2024-10-15 15:23 UTC (permalink / raw)
To: nouveau, linux-kernel, linux-rdma, linux-mm, herbst, lyude, dakr,
airlied, simona, jgg, leon, jglisse, akpm, dri-devel, apopple,
bskeggs
Cc: Yonatan Maman, Gal Shalom
From: Yonatan Maman <Ymaman@Nvidia.com>
Request Peer-to-Peer (P2P) DMA in hmm_range_fault calls, using the
capability introduced in mm/hmm. By setting range.default_flags to
HMM_PFN_REQ_FAULT | HMM_PFN_REQ_ALLOW_P2P, HMM attempts to set up a P2P
DMA mapping for device private pages instead of faulting them back to
system RAM.
Using P2P DMA avoids migrating data between the device (e.g., GPU) and
system memory, reducing data transfer overhead for GPU-centric
applications that use RDMA together with device private pages.
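For illustration, a minimal sketch of how an hmm_range_fault() caller
opts in to P2P pages (variable names here are placeholders; only the
flag usage follows this series):

/*
 * Illustrative sketch: a range walk that allows P2PDMA pages. Callers
 * that do not set HMM_PFN_REQ_ALLOW_P2P keep the existing behaviour of
 * faulting device private pages back to system RAM. Locking and the
 * mmu_interval_notifier retry loop are omitted for brevity.
 */
struct hmm_range range = {
	.notifier	   = &notifier,	/* caller's mmu_interval_notifier */
	.start		   = start,
	.end		   = end,
	.hmm_pfns	   = pfns,	/* caller-allocated PFN array */
	.default_flags	   = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_ALLOW_P2P,
	.dev_private_owner = owner,	/* device owning the private pages */
};
int ret = hmm_range_fault(&range);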
Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
Reviewed-by: Gal Shalom <GalShalom@Nvidia.com>
---
drivers/infiniband/core/umem_odp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index e9fa22d31c23..1f6498d26df4 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -381,7 +381,7 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
pfn_start_idx = (range.start - ib_umem_start(umem_odp)) >> PAGE_SHIFT;
num_pfns = (range.end - range.start) >> PAGE_SHIFT;
if (fault) {
- range.default_flags = HMM_PFN_REQ_FAULT;
+ range.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_ALLOW_P2P;
if (access_mask & ODP_WRITE_ALLOWED_BIT)
range.default_flags |= HMM_PFN_REQ_WRITE;
--
2.34.1
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v1 4/4] RDMA/mlx5: Enabling ATS for ODP memory
2024-10-15 15:23 [PATCH v1 0/4] GPU Direct RDMA (P2P DMA) for Device Private Pages Yonatan Maman
` (2 preceding siblings ...)
2024-10-15 15:23 ` [PATCH v1 3/4] IB/core: P2P DMA for device private pages Yonatan Maman
@ 2024-10-15 15:23 ` Yonatan Maman
2024-10-16 4:23 ` [PATCH v1 0/4] GPU Direct RDMA (P2P DMA) for Device Private Pages Christoph Hellwig
4 siblings, 0 replies; 28+ messages in thread
From: Yonatan Maman @ 2024-10-15 15:23 UTC (permalink / raw)
To: nouveau, linux-kernel, linux-rdma, linux-mm, herbst, lyude, dakr,
airlied, simona, jgg, leon, jglisse, akpm, dri-devel, apopple,
bskeggs
Cc: Yonatan Maman, Gal Shalom
From: Yonatan Maman <Ymaman@Nvidia.com>
ATS (Address Translation Services) is mainly used to optimize PCI
Peer-to-Peer transfers and to prevent bus failures. This change enables
ATS for ODP memory, optimizing P2P DMA for ODP-backed regions (e.g.,
P2P DMA for device private pages exposed through ODP).
Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
Reviewed-by: Gal Shalom <GalShalom@Nvidia.com>
---
drivers/infiniband/hw/mlx5/mlx5_ib.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 23fd72f7f63d..55ccbc7d9aa3 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -1701,9 +1701,9 @@ static inline bool rt_supported(int ts_cap)
static inline bool mlx5_umem_needs_ats(struct mlx5_ib_dev *dev,
struct ib_umem *umem, int access_flags)
{
- if (!MLX5_CAP_GEN(dev->mdev, ats) || !umem->is_dmabuf)
- return false;
- return access_flags & IB_ACCESS_RELAXED_ORDERING;
+ if (MLX5_CAP_GEN(dev->mdev, ats) && (umem->is_dmabuf || umem->is_odp))
+ return access_flags & IB_ACCESS_RELAXED_ORDERING;
+ return false;
}
int set_roce_addr(struct mlx5_ib_dev *dev, u32 port_num,
--
2.34.1
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 0/4] GPU Direct RDMA (P2P DMA) for Device Private Pages
2024-10-15 15:23 [PATCH v1 0/4] GPU Direct RDMA (P2P DMA) for Device Private Pages Yonatan Maman
` (3 preceding siblings ...)
2024-10-15 15:23 ` [PATCH v1 4/4] RDMA/mlx5: Enabling ATS for ODP memory Yonatan Maman
@ 2024-10-16 4:23 ` Christoph Hellwig
2024-10-16 15:16 ` Yonatan Maman
4 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2024-10-16 4:23 UTC (permalink / raw)
To: Yonatan Maman
Cc: nouveau, linux-kernel, linux-rdma, linux-mm, herbst, lyude, dakr,
airlied, simona, jgg, leon, jglisse, akpm, dri-devel, apopple,
bskeggs
On Tue, Oct 15, 2024 at 06:23:44PM +0300, Yonatan Maman wrote:
> From: Yonatan Maman <Ymaman@Nvidia.com>
>
> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
> GPU-centric applications that utilize RDMA and private device pages. This
> enhancement is crucial for minimizing data transfer overhead by allowing
> the GPU to directly expose device private page data to devices such as
> NICs, eliminating the need to traverse system RAM, which is the native
> method for exposing device private page data.
Please tone down your marketing language and explain your factual
changes. If you make performance claims back them by numbers.
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: [PATCH v1 0/4] GPU Direct RDMA (P2P DMA) for Device Private Pages
2024-10-16 4:23 ` [PATCH v1 0/4] GPU Direct RDMA (P2P DMA) for Device Private Pages Christoph Hellwig
@ 2024-10-16 15:16 ` Yonatan Maman
2024-10-16 22:22 ` Alistair Popple
2024-10-18 7:26 ` Zhu Yanjun
0 siblings, 2 replies; 28+ messages in thread
From: Yonatan Maman @ 2024-10-16 15:16 UTC (permalink / raw)
To: Christoph Hellwig
Cc: nouveau, linux-kernel, linux-rdma, linux-mm, herbst, lyude, dakr,
airlied, simona, jgg, leon, jglisse, akpm, dri-devel, apopple,
bskeggs
On 16/10/2024 7:23, Christoph Hellwig wrote:
> On Tue, Oct 15, 2024 at 06:23:44PM +0300, Yonatan Maman wrote:
>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>
>> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
>> GPU-centric applications that utilize RDMA and private device pages. This
>> enhancement is crucial for minimizing data transfer overhead by allowing
>> the GPU to directly expose device private page data to devices such as
>> NICs, eliminating the need to traverse system RAM, which is the native
>> method for exposing device private page data.
>
> Please tone down your marketing language and explain your factual
> changes. If you make performance claims back them by numbers.
>
Got it, thanks! I'll fix that. Regarding performance, we're seeing
over 10x higher bandwidth and 10x lower latency with perftest-rdma,
especially with a high rate of GPU memory access.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 0/4] GPU Direct RDMA (P2P DMA) for Device Private Pages
2024-10-16 15:16 ` Yonatan Maman
@ 2024-10-16 22:22 ` Alistair Popple
2024-10-18 7:26 ` Zhu Yanjun
1 sibling, 0 replies; 28+ messages in thread
From: Alistair Popple @ 2024-10-16 22:22 UTC (permalink / raw)
To: Yonatan Maman
Cc: Christoph Hellwig, nouveau, linux-kernel, linux-rdma, linux-mm,
herbst, lyude, dakr, airlied, simona, jgg, leon, jglisse, akpm,
dri-devel, bskeggs
Yonatan Maman <ymaman@nvidia.com> writes:
> On 16/10/2024 7:23, Christoph Hellwig wrote:
>> On Tue, Oct 15, 2024 at 06:23:44PM +0300, Yonatan Maman wrote:
>>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>>
>>> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
>>> GPU-centric applications that utilize RDMA and private device pages. This
>>> enhancement is crucial for minimizing data transfer overhead by allowing
>>> the GPU to directly expose device private page data to devices such as
>>> NICs, eliminating the need to traverse system RAM, which is the native
>>> method for exposing device private page data.
>> Please tone down your marketing language and explain your factual
>> changes. If you make performance claims back them by numbers.
>>
>
> Got it, thanks! I'll fix that. Regarding performance, we’re achieving
> over 10x higher bandwidth and 10x lower latency using perftest-rdma,
> especially (with a high rate of GPU memory access).
The performance claims still sound a bit vague. Please make sure you
include actual perftest-rdma performance numbers from before and after
applying this series when you repost.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 0/4] GPU Direct RDMA (P2P DMA) for Device Private Pages
2024-10-16 15:16 ` Yonatan Maman
2024-10-16 22:22 ` Alistair Popple
@ 2024-10-18 7:26 ` Zhu Yanjun
2024-10-20 15:26 ` Yonatan Maman
1 sibling, 1 reply; 28+ messages in thread
From: Zhu Yanjun @ 2024-10-18 7:26 UTC (permalink / raw)
To: Yonatan Maman, Christoph Hellwig
Cc: nouveau, linux-kernel, linux-rdma, linux-mm, herbst, lyude, dakr,
airlied, simona, jgg, leon, jglisse, akpm, dri-devel, apopple,
bskeggs
On 2024/10/16 17:16, Yonatan Maman wrote:
>
>
> On 16/10/2024 7:23, Christoph Hellwig wrote:
>> On Tue, Oct 15, 2024 at 06:23:44PM +0300, Yonatan Maman wrote:
>>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>>
>>> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
>>> GPU-centric applications that utilize RDMA and private device pages.
>>> This
>>> enhancement is crucial for minimizing data transfer overhead by allowing
>>> the GPU to directly expose device private page data to devices such as
>>> NICs, eliminating the need to traverse system RAM, which is the native
>>> method for exposing device private page data.
>>
>> Please tone down your marketing language and explain your factual
>> changes. If you make performance claims back them by numbers.
>>
>
> Got it, thanks! I'll fix that. Regarding performance, we’re achieving
> over 10x higher bandwidth and 10x lower latency using perftest-rdma,
> especially (with a high rate of GPU memory access).
If I understand this patch series correctly, it is based on ODP (On
Demand Paging). A non-ODP approach also exists; as the following links
show, it is implemented for efa, irdma and mlx5.
1. iRDMA
https://lore.kernel.org/all/20230217011425.498847-1-yanjun.zhu@intel.com/
2. efa
https://lore.kernel.org/lkml/20211007114018.GD2688930@ziepe.ca/t/
3. mlx5
https://lore.kernel.org/all/1608067636-98073-5-git-send-email-jianxin.xiong@intel.com/
Since both methods are implemented on mlx5, have you compared the test
results of the two methods on mlx5?
The most important results are latency and bandwidth. Please let us
know the test results.
Thanks a lot.
Zhu Yanjun
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v1 0/4] GPU Direct RDMA (P2P DMA) for Device Private Pages
2024-10-18 7:26 ` Zhu Yanjun
@ 2024-10-20 15:26 ` Yonatan Maman
0 siblings, 0 replies; 28+ messages in thread
From: Yonatan Maman @ 2024-10-20 15:26 UTC (permalink / raw)
To: Zhu Yanjun, Christoph Hellwig
Cc: nouveau, linux-kernel, linux-rdma, linux-mm, herbst, lyude, dakr,
airlied, simona, jgg, leon, jglisse, akpm, dri-devel, apopple,
bskeggs
On 18/10/2024 10:26, Zhu Yanjun wrote:
>
> On 2024/10/16 17:16, Yonatan Maman wrote:
>>
>>
>> On 16/10/2024 7:23, Christoph Hellwig wrote:
>>> On Tue, Oct 15, 2024 at 06:23:44PM +0300, Yonatan Maman wrote:
>>>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>>>
>>>> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
>>>> GPU-centric applications that utilize RDMA and private device pages.
>>>> This
>>>> enhancement is crucial for minimizing data transfer overhead by
>>>> allowing
>>>> the GPU to directly expose device private page data to devices such as
>>>> NICs, eliminating the need to traverse system RAM, which is the native
>>>> method for exposing device private page data.
>>>
>>> Please tone down your marketing language and explain your factual
>>> changes. If you make performance claims back them by numbers.
>>>
>>
>> Got it, thanks! I'll fix that. Regarding performance, we’re achieving
>> over 10x higher bandwidth and 10x lower latency using perftest-rdma,
>> especially (with a high rate of GPU memory access).
>
> If I got this patch series correctly, this is based on ODP (On Demand
> Paging). And a way also exists which is based on non-ODP. From the
> following links, this way is implemented on efa, irdma and mlx5.
> 1. iRDMA
> https://lore.kernel.org/all/20230217011425.498847-1-yanjun.zhu@intel.com/
>
> 2. efa
> https://lore.kernel.org/lkml/20211007114018.GD2688930@ziepe.ca/t/
>
> 3. mlx5
> https://lore.kernel.org/all/1608067636-98073-5-git-send-email-
> jianxin.xiong@intel.com/
>
> Because these 2 methods are both implemented on mlx5, have you compared
> the test results with the 2 methods on mlx5?
>
> The most important results should be latency and bandwidth. Please let
> us know the test results.
>
> Thanks a lot.
> Zhu Yanjun
>
This patch set aims to support GPU Direct RDMA for HMM ODP memory.
Compared to the dma-buf method, we achieve the same performance
(bandwidth and latency) for GPU-intensive test cases (no CPU accesses
during the test).
^ permalink raw reply [flat|nested] 28+ messages in thread