* [PATCH v4 0/3] Enable P2PDMA in Userspace RDMA
From: Martin Oliveira @ 2024-07-08 16:57 UTC
To: linux-kernel, linux-mm, linux-rdma
Cc: Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
Jason Gunthorpe, Leon Romanovsky, Logan Gunthorpe,
Martin Oliveira, Michael Guralnik, Mike Marciniszyn,
Shiraz Saleem, Tejun Heo, John Hubbard, Dan Williams,
David Sloan
This version of this patch series fixes the unhandled WARN issue from
v3. Everything else remains the same. Thanks to everyone who provided
reviews and feedback!
Martin
Original cover letter:
This patch series enables P2PDMA memory to be used in userspace RDMA
transfers. With this series, P2PDMA memory mmapped into userspace (i.e.,
only NVMe CMBs at the moment) can be used with ibv_reg_mr() (or
similar) interfaces. This can be tested by passing a sysfs p2pmem
allocator to the --mmap flag of the perftest tools.
This requires addressing two issues:
* Stop exporting kernfs VMAs with page_mkwrite, which is incompatible
with FOLL_LONGTERM and is redundant since the default fault code has the
same behavior as kernfs_vma_page_mkwrite() (i.e., call
file_update_time()).
* Remove the restriction on FOLL_LONGTERM with FOLL_PCI_P2PDMA, which was
initially put in place out of caution, on the assumption that P2PDMA
would have similar problems to fsdax with unmap_mapping_range(). Since
P2PDMA only uses unmap_mapping_range() on device unbind and immediately
waits for all page reference counts to go to zero after calling it, it
is believed to be safe from reuse and user access faults. See
[1] for more discussion.
This was tested using a Mellanox ConnectX-6 SmartNIC (MT28908 Family),
using the mlx5_core driver, as well as an NVMe CMB.
Thanks,
Martin
[1]: https://lore.kernel.org/linux-mm/87cypuvh2i.fsf@nvdebian.thelocal/T/
--
Changes in v4:
- Actually handle the WARN if someone sets ->page_mkwrite
Changes in v3:
- Change to WARN_ON() if an implementation of kernfs sets
.page_mkwrite() (Suggested by Christoph)
- Removed fast-gup patch
Changes in v2:
- Remove page_mkwrite() for all kernfs, instead of creating a
different vm_ops for p2pdma.
Martin Oliveira (3):
kernfs: remove page_mkwrite() from vm_operations_struct
mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
RDMA/umem: add support for P2P RDMA
drivers/infiniband/core/umem.c | 3 +++
fs/kernfs/file.c | 25 ++-----------------------
mm/gup.c | 5 -----
3 files changed, 5 insertions(+), 28 deletions(-)
base-commit: 22a40d14b572deb80c0648557f4bd502d7e83826
--
2.34.1
Martin Oliveira (3):
kernfs: remove page_mkwrite() from vm_operations_struct
mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
RDMA/umem: add support for P2P RDMA
drivers/infiniband/core/umem.c | 3 +++
fs/kernfs/file.c | 40 ++++++++++------------------------
mm/gup.c | 5 -----
3 files changed, 14 insertions(+), 34 deletions(-)
base-commit: 256abd8e550ce977b728be79a74e1729438b4948
--
2.34.1
* [PATCH v4 1/3] kernfs: remove page_mkwrite() from vm_operations_struct
From: Martin Oliveira @ 2024-07-08 16:57 UTC
To: linux-kernel, linux-mm, linux-rdma
Cc: Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
Jason Gunthorpe, Leon Romanovsky, Logan Gunthorpe,
Martin Oliveira, Michael Guralnik, Mike Marciniszyn,
Shiraz Saleem, Tejun Heo, John Hubbard, Dan Williams,
David Sloan
The .page_mkwrite operator of kernfs just calls file_update_time().
This is the same thing the fault code does if .page_mkwrite is
not set.
Furthermore, having the page_mkwrite() operator causes
writable_file_mapping_allowed() to fail due to
vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
enabling P2PDMA over RDMA.
There are no users of .page_mkwrite and no known valid use cases, so
just remove .page_mkwrite from kernfs_vm_ops and WARN if an mmap()
implementation sets .page_mkwrite.
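For illustration only, the new check in kernfs_fop_mmap() can be modelled
in userspace roughly as below. This is a minimal sketch, not kernel code:
struct model_vm_ops, model_kernfs_check_vm_ops() and the two stub callbacks
are hypothetical names, and returning -EINVAL is an assumption of the
sketch. The real code additionally emits WARN_ON_ONCE() in the
page_mkwrite case.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the two vm_operations_struct hooks of interest. */
struct model_vm_ops {
	void (*close)(void);
	int (*page_mkwrite)(void);
};

/*
 * Models the post-patch check: a NULL vm_ops is fine, and a mapping whose
 * vm_ops sets either .close or .page_mkwrite is refused.
 * The -EINVAL return value is an assumption of this sketch.
 */
static int model_kernfs_check_vm_ops(const struct model_vm_ops *ops)
{
	if (!ops)
		return 0;
	if (ops->close)		/* close cannot be wrapped */
		return -EINVAL;
	if (ops->page_mkwrite)	/* the kernel also warns once here */
		return -EINVAL;
	return 0;
}

/* Stub callbacks, used only to populate the hooks in examples. */
static void model_close(void) { }
static int model_mkwrite(void) { return 0; }
```

With this model, a vm_ops that sets neither hook passes, while one that
sets .close or .page_mkwrite is rejected in the same way.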
Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
---
fs/kernfs/file.c | 40 +++++++++++-----------------------------
1 file changed, 11 insertions(+), 29 deletions(-)
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 8502ef68459b..fb2b77bf0c04 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -386,28 +386,6 @@ static vm_fault_t kernfs_vma_fault(struct vm_fault *vmf)
 	return ret;
 }

-static vm_fault_t kernfs_vma_page_mkwrite(struct vm_fault *vmf)
-{
-	struct file *file = vmf->vma->vm_file;
-	struct kernfs_open_file *of = kernfs_of(file);
-	vm_fault_t ret;
-
-	if (!of->vm_ops)
-		return VM_FAULT_SIGBUS;
-
-	if (!kernfs_get_active(of->kn))
-		return VM_FAULT_SIGBUS;
-
-	ret = 0;
-	if (of->vm_ops->page_mkwrite)
-		ret = of->vm_ops->page_mkwrite(vmf);
-	else
-		file_update_time(file);
-
-	kernfs_put_active(of->kn);
-	return ret;
-}
-
 static int kernfs_vma_access(struct vm_area_struct *vma, unsigned long addr,
 			     void *buf, int len, int write)
 {
@@ -432,7 +410,6 @@ static int kernfs_vma_access(struct vm_area_struct *vma, unsigned long addr,
 static const struct vm_operations_struct kernfs_vm_ops = {
 	.open = kernfs_vma_open,
 	.fault = kernfs_vma_fault,
-	.page_mkwrite = kernfs_vma_page_mkwrite,
 	.access = kernfs_vma_access,
 };

@@ -475,12 +452,17 @@ static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
 	if (of->mmapped && of->vm_ops != vma->vm_ops)
 		goto out_put;

-	/*
-	 * It is not possible to successfully wrap close.
-	 * So error if someone is trying to use close.
-	 */
-	if (vma->vm_ops && vma->vm_ops->close)
-		goto out_put;
+	if (vma->vm_ops) {
+		/*
+		 * It is not possible to successfully wrap close.
+		 * So error if someone is trying to use close.
+		 */
+		if (vma->vm_ops->close)
+			goto out_put;
+
+		if (WARN_ON_ONCE(vma->vm_ops->page_mkwrite))
+			goto out_put;
+	}

 	rc = 0;
 	if (!of->mmapped) {
--
2.34.1
* [PATCH v4 2/3] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
From: Martin Oliveira @ 2024-07-08 16:57 UTC
To: linux-kernel, linux-mm, linux-rdma
Cc: Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
Jason Gunthorpe, Leon Romanovsky, Logan Gunthorpe,
Martin Oliveira, Michael Guralnik, Mike Marciniszyn,
Shiraz Saleem, Tejun Heo, John Hubbard, Dan Williams,
David Sloan
This check existed originally due to concerns that P2PDMA needed to copy
fsdax until pgmap refcounts were fixed (see [1]).
The P2PDMA infrastructure will only call unmap_mapping_range() when the
underlying device is unbound, and immediately after unmapping it waits
for the reference of all ZONE_DEVICE pages to be released before
continuing. This does not allow for a page to be reused and no user
access fault is therefore possible. It does not have the same problem as
fsdax.
The one minor concern with FOLL_LONGTERM pins is they will block device
unbind until userspace releases them all.
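As a rough illustration of what the removed check means for callers, the
argument validation can be modelled in userspace as below. This is a
sketch only: the MODEL_FOLL_* bit values are made up for the example and
are not taken from the kernel headers, and both helper functions are
hypothetical names that cover just this one flag combination.

```c
#include <stdbool.h>

/* Illustrative bit values only; not taken from the kernel headers. */
#define MODEL_FOLL_LONGTERM	(1u << 0)
#define MODEL_FOLL_PCI_P2PDMA	(1u << 1)

/* Before this patch: a long-term pin of P2PDMA memory is rejected. */
static bool model_gup_args_valid_before(unsigned int gup_flags)
{
	if ((gup_flags & MODEL_FOLL_LONGTERM) &&
	    (gup_flags & MODEL_FOLL_PCI_P2PDMA))
		return false;
	return true;
}

/* After this patch: the pair is no longer treated as invalid. */
static bool model_gup_args_valid_after(unsigned int gup_flags)
{
	(void)gup_flags;	/* no restriction left for this pair */
	return true;
}
```

Either flag on its own is accepted by both versions; only the combination
changes from rejected to accepted.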
Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
[1]: https://lkml.kernel.org/r/Yy4Ot5MoOhsgYLTQ@ziepe.ca
---
mm/gup.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index ca0f5cedce9b..6922e1c38d75 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2614,11 +2614,6 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
 	if (WARN_ON_ONCE((gup_flags & (FOLL_GET | FOLL_PIN)) && !pages))
 		return false;

-	/* We want to allow the pgmap to be hot-unplugged at all times */
-	if (WARN_ON_ONCE((gup_flags & FOLL_LONGTERM) &&
-			 (gup_flags & FOLL_PCI_P2PDMA)))
-		return false;
-
 	*gup_flags_p = gup_flags;
 	return true;
 }
--
2.34.1
* [PATCH v4 3/3] RDMA/umem: add support for P2P RDMA
From: Martin Oliveira @ 2024-07-08 16:57 UTC
To: linux-kernel, linux-mm, linux-rdma
Cc: Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
Jason Gunthorpe, Leon Romanovsky, Logan Gunthorpe,
Martin Oliveira, Michael Guralnik, Mike Marciniszyn,
Shiraz Saleem, Tejun Heo, John Hubbard, Dan Williams,
David Sloan, Jason Gunthorpe
If the device supports P2PDMA, add the FOLL_PCI_P2PDMA flag.
This allows ibv_reg_mr() and friends to use P2PDMA memory that has been
mmapped into userspace for MRs in IB and RDMA transactions.
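The flag selection can be pictured with the small userspace model below.
It is a sketch under stated assumptions: the SKETCH_FOLL_* values are
illustrative rather than the kernel's, sketch_umem_gup_flags() is a
hypothetical helper, and it assumes umem pins always start from a
long-term pin flag before the writable and P2PDMA bits are added.

```c
#include <stdbool.h>

/* Illustrative bit values only; not taken from the kernel headers. */
#define SKETCH_FOLL_WRITE	(1u << 0)
#define SKETCH_FOLL_LONGTERM	(1u << 1)
#define SKETCH_FOLL_PCI_P2PDMA	(1u << 2)

/*
 * Models how the pin flags are assembled: start from a long-term pin
 * (an assumption of this sketch), add the write flag for writable umems,
 * and add the P2PDMA flag only when the device reports support.
 */
static unsigned int sketch_umem_gup_flags(bool writable, bool p2p_supported)
{
	unsigned int gup_flags = SKETCH_FOLL_LONGTERM;

	if (writable)
		gup_flags |= SKETCH_FOLL_WRITE;
	if (p2p_supported)
		gup_flags |= SKETCH_FOLL_PCI_P2PDMA;

	return gup_flags;
}
```

A device without P2PDMA support ends up with the same flags as before
this patch; only supporting devices gain the extra bit.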
Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/infiniband/core/umem.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 07c571c7b699..b59bb6e1475e 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -208,6 +208,9 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
 	if (umem->writable)
 		gup_flags |= FOLL_WRITE;

+	if (ib_dma_pci_p2p_dma_supported(device))
+		gup_flags |= FOLL_PCI_P2PDMA;
+
 	while (npages) {
 		cond_resched();
 		pinned = pin_user_pages_fast(cur_base,
--
2.34.1
* Re: [PATCH v4 1/3] kernfs: remove page_mkwrite() from vm_operations_struct
From: Greg Kroah-Hartman @ 2024-07-09 5:24 UTC
To: Martin Oliveira
Cc: linux-kernel, linux-mm, linux-rdma, Andrew Morton,
Artemy Kovalyov, Jason Gunthorpe, Leon Romanovsky,
Logan Gunthorpe, Michael Guralnik, Mike Marciniszyn,
Shiraz Saleem, Tejun Heo, John Hubbard, Dan Williams,
David Sloan
On Mon, Jul 08, 2024 at 10:57:12AM -0600, Martin Oliveira wrote:
> The .page_mkwrite operator of kernfs just calls file_update_time().
> This is the same behaviour that the fault code does if .page_mkwrite is
> not set.
>
> Furthermore, having the page_mkwrite() operator causes
> writable_file_mapping_allowed() to fail due to
> vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
> enabling P2PDMA over RDMA.
>
> There are no users of .page_mkwrite and no known valid use cases, so
> just remove the .page_mkwrite from kernfs_ops and WARN if an mmap()
> implementation sets .page_mkwrite.
>
> Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
> ---
> fs/kernfs/file.c | 40 +++++++++++-----------------------------
> 1 file changed, 11 insertions(+), 29 deletions(-)
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Re: [PATCH v4 1/3] kernfs: remove page_mkwrite() from vm_operations_struct
From: Matthew Wilcox @ 2024-07-09 14:34 UTC
To: Martin Oliveira
Cc: linux-kernel, linux-mm, linux-rdma, Andrew Morton,
Artemy Kovalyov, Greg Kroah-Hartman, Jason Gunthorpe,
Leon Romanovsky, Logan Gunthorpe, Michael Guralnik,
Mike Marciniszyn, Shiraz Saleem, Tejun Heo, John Hubbard,
Dan Williams, David Sloan
On Mon, Jul 08, 2024 at 10:57:12AM -0600, Martin Oliveira wrote:
> -	/*
> -	 * It is not possible to successfully wrap close.
> -	 * So error if someone is trying to use close.
> -	 */
> -	if (vma->vm_ops && vma->vm_ops->close)
> -		goto out_put;
> +	if (vma->vm_ops) {
> +		/*
> +		 * It is not possible to successfully wrap close.
> +		 * So error if someone is trying to use close.
> +		 */
> +		if (vma->vm_ops->close)
> +			goto out_put;
> +
> +		if (WARN_ON_ONCE(vma->vm_ops->page_mkwrite))
> +			goto out_put;
> +	}
This is stupid. Warn for both or neither.