* [PATCH v3 0/3] Enable P2PDMA in Userspace RDMA
@ 2024-07-04 16:37 Martin Oliveira
2024-07-04 16:37 ` [PATCH v3 1/3] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Martin Oliveira @ 2024-07-04 16:37 UTC (permalink / raw)
To: linux-kernel, linux-mm, linux-rdma
Cc: Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
Jason Gunthorpe, Leon Romanovsky, Logan Gunthorpe,
Martin Oliveira, Michael Guralnik, Mike Marciniszyn,
Shiraz Saleem, Tejun Heo, John Hubbard, Dan Williams,
David Sloan
This patch series enables P2PDMA memory to be used in userspace RDMA
transfers. With this series, P2PDMA memory mmaped into userspace (ie.
only NVMe CMBs, at the moment) can then be used with ibv_reg_mr() (or
similar) interfaces. This can be tested by passing a sysfs p2pmem
allocator to the --mmap flag of the perftest tools.
This requires addressing two issues:
* Stop exporting kernfs VMAs with page_mkwrite, which is incompatible
with FOLL_LONGTERM and is redudant since the default fault code has the
same behavior as kernfs_vma_page_mkwrite() (i.e., call
file_update_time()).
* Remove the restriction on FOLL_LONGTREM with FOLL_PCI_P2PDMA which was
initially put in place due to excessive caution with assuming P2PDMA
would have similar problems to fsdax with unmap_mapping_range(). Seeing
P2PDMA only uses unmap_mapping_range() on device unbind and immediately
waits for all page reference counts to go to zero after calling it, it
is actually believed to be safe from reuse and user access faults. See
[1] for more discussion.
This was tested using a Mellanox ConnectX-6 SmartNIC (MT28908 Family),
using the mlx5_core driver, as well as an NVMe CMB.
Thanks,
Martin
[1]: https://lore.kernel.org/linux-mm/87cypuvh2i.fsf@nvdebian.thelocal/T/
--
Changes in v3:
- Change to WARN_ON() if an implementaion of kernfs sets
.page_mkwrite() (Suggested by Christoph)
- Removed fast-gup patch
Changes in v2:
- Remove page_mkwrite() for all kernfs, instead of creating a
different vm_ops for p2pdma.
Martin Oliveira (3):
kernfs: remove page_mkwrite() from vm_operations_struct
mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
RDMA/umem: add support for P2P RDMA
drivers/infiniband/core/umem.c | 3 +++
fs/kernfs/file.c | 25 ++-----------------------
mm/gup.c | 5 -----
3 files changed, 5 insertions(+), 28 deletions(-)
base-commit: 22a40d14b572deb80c0648557f4bd502d7e83826
--
2.34.1
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v3 1/3] kernfs: remove page_mkwrite() from vm_operations_struct
2024-07-04 16:37 [PATCH v3 0/3] Enable P2PDMA in Userspace RDMA Martin Oliveira
@ 2024-07-04 16:37 ` Martin Oliveira
2024-07-04 16:48 ` Greg Kroah-Hartman
2024-07-04 17:02 ` Matthew Wilcox
2024-07-04 16:37 ` [PATCH v3 2/3] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA Martin Oliveira
2024-07-04 16:37 ` [PATCH v3 3/3] RDMA/umem: add support for P2P RDMA Martin Oliveira
2 siblings, 2 replies; 9+ messages in thread
From: Martin Oliveira @ 2024-07-04 16:37 UTC (permalink / raw)
To: linux-kernel, linux-mm, linux-rdma
Cc: Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
Jason Gunthorpe, Leon Romanovsky, Logan Gunthorpe,
Martin Oliveira, Michael Guralnik, Mike Marciniszyn,
Shiraz Saleem, Tejun Heo, John Hubbard, Dan Williams,
David Sloan
The .page_mkwrite operator of kernfs just calls file_update_time().
This is the same behaviour that the fault code does if .page_mkwrite is
not set.
Furthermore, having the page_mkwrite() operator causes
writable_file_mapping_allowed() to fail due to
vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
enabling P2PDMA over RDMA.
There are no users of .page_mkwrite and no known valid use cases, so
just remove the .page_mkwrite from kernfs_ops and WARN_ON() if an mmap()
implementation sets .page_mkwrite.
Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
---
fs/kernfs/file.c | 25 ++-----------------------
1 file changed, 2 insertions(+), 23 deletions(-)
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 8502ef68459b..90603664de7f 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -386,28 +386,6 @@ static vm_fault_t kernfs_vma_fault(struct vm_fault *vmf)
return ret;
}
-static vm_fault_t kernfs_vma_page_mkwrite(struct vm_fault *vmf)
-{
- struct file *file = vmf->vma->vm_file;
- struct kernfs_open_file *of = kernfs_of(file);
- vm_fault_t ret;
-
- if (!of->vm_ops)
- return VM_FAULT_SIGBUS;
-
- if (!kernfs_get_active(of->kn))
- return VM_FAULT_SIGBUS;
-
- ret = 0;
- if (of->vm_ops->page_mkwrite)
- ret = of->vm_ops->page_mkwrite(vmf);
- else
- file_update_time(file);
-
- kernfs_put_active(of->kn);
- return ret;
-}
-
static int kernfs_vma_access(struct vm_area_struct *vma, unsigned long addr,
void *buf, int len, int write)
{
@@ -432,7 +410,6 @@ static int kernfs_vma_access(struct vm_area_struct *vma, unsigned long addr,
static const struct vm_operations_struct kernfs_vm_ops = {
.open = kernfs_vma_open,
.fault = kernfs_vma_fault,
- .page_mkwrite = kernfs_vma_page_mkwrite,
.access = kernfs_vma_access,
};
@@ -482,6 +459,8 @@ static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
if (vma->vm_ops && vma->vm_ops->close)
goto out_put;
+ WARN_ON(vma->vm_ops && vma->vm_ops->page_mkwrite);
+
rc = 0;
if (!of->mmapped) {
of->mmapped = true;
--
2.34.1
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v3 2/3] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
2024-07-04 16:37 [PATCH v3 0/3] Enable P2PDMA in Userspace RDMA Martin Oliveira
2024-07-04 16:37 ` [PATCH v3 1/3] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
@ 2024-07-04 16:37 ` Martin Oliveira
2024-07-04 16:37 ` [PATCH v3 3/3] RDMA/umem: add support for P2P RDMA Martin Oliveira
2 siblings, 0 replies; 9+ messages in thread
From: Martin Oliveira @ 2024-07-04 16:37 UTC (permalink / raw)
To: linux-kernel, linux-mm, linux-rdma
Cc: Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
Jason Gunthorpe, Leon Romanovsky, Logan Gunthorpe,
Martin Oliveira, Michael Guralnik, Mike Marciniszyn,
Shiraz Saleem, Tejun Heo, John Hubbard, Dan Williams,
David Sloan
This check existed originally due to concerns that P2PDMA needed to copy
fsdax until pgmap refcounts were fixed (see [1]).
The P2PDMA infrastructure will only call unmap_mapping_range() when the
underlying device is unbound, and immediately after unmapping it waits
for the reference of all ZONE_DEVICE pages to be released before
continuing. This does not allow for a page to be reused and no user
access fault is therefore possible. It does not have the same problem as
fsdax.
The one minor concern with FOLL_LONGTERM pins is they will block device
unbind until userspace releases them all.
Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
[1]: https://lkml.kernel.org/r/Yy4Ot5MoOhsgYLTQ@ziepe.ca
---
mm/gup.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index ca0f5cedce9b..6922e1c38d75 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2614,11 +2614,6 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
if (WARN_ON_ONCE((gup_flags & (FOLL_GET | FOLL_PIN)) && !pages))
return false;
- /* We want to allow the pgmap to be hot-unplugged at all times */
- if (WARN_ON_ONCE((gup_flags & FOLL_LONGTERM) &&
- (gup_flags & FOLL_PCI_P2PDMA)))
- return false;
-
*gup_flags_p = gup_flags;
return true;
}
--
2.34.1
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v3 3/3] RDMA/umem: add support for P2P RDMA
2024-07-04 16:37 [PATCH v3 0/3] Enable P2PDMA in Userspace RDMA Martin Oliveira
2024-07-04 16:37 ` [PATCH v3 1/3] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
2024-07-04 16:37 ` [PATCH v3 2/3] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA Martin Oliveira
@ 2024-07-04 16:37 ` Martin Oliveira
2 siblings, 0 replies; 9+ messages in thread
From: Martin Oliveira @ 2024-07-04 16:37 UTC (permalink / raw)
To: linux-kernel, linux-mm, linux-rdma
Cc: Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
Jason Gunthorpe, Leon Romanovsky, Logan Gunthorpe,
Martin Oliveira, Michael Guralnik, Mike Marciniszyn,
Shiraz Saleem, Tejun Heo, John Hubbard, Dan Williams,
David Sloan, Jason Gunthorpe
If the device supports P2PDMA, add the FOLL_PCI_P2PDMA flag
This allows ibv_reg_mr() and friends to use P2PDMA memory that has been
mmaped into userspace for MRs in IB and RDMA transactions.
Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/infiniband/core/umem.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 07c571c7b699..b59bb6e1475e 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -208,6 +208,9 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
if (umem->writable)
gup_flags |= FOLL_WRITE;
+ if (ib_dma_pci_p2p_dma_supported(device))
+ gup_flags |= FOLL_PCI_P2PDMA;
+
while (npages) {
cond_resched();
pinned = pin_user_pages_fast(cur_base,
--
2.34.1
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v3 1/3] kernfs: remove page_mkwrite() from vm_operations_struct
2024-07-04 16:37 ` [PATCH v3 1/3] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
@ 2024-07-04 16:48 ` Greg Kroah-Hartman
2024-07-04 17:02 ` Matthew Wilcox
1 sibling, 0 replies; 9+ messages in thread
From: Greg Kroah-Hartman @ 2024-07-04 16:48 UTC (permalink / raw)
To: Martin Oliveira
Cc: linux-kernel, linux-mm, linux-rdma, Andrew Morton,
Artemy Kovalyov, Jason Gunthorpe, Leon Romanovsky,
Logan Gunthorpe, Michael Guralnik, Mike Marciniszyn,
Shiraz Saleem, Tejun Heo, John Hubbard, Dan Williams,
David Sloan
On Thu, Jul 04, 2024 at 10:37:22AM -0600, Martin Oliveira wrote:
> The .page_mkwrite operator of kernfs just calls file_update_time().
> This is the same behaviour that the fault code does if .page_mkwrite is
> not set.
>
> Furthermore, having the page_mkwrite() operator causes
> writable_file_mapping_allowed() to fail due to
> vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
> enabling P2PDMA over RDMA.
>
> There are no users of .page_mkwrite and no known valid use cases, so
> just remove the .page_mkwrite from kernfs_ops and WARN_ON() if an mmap()
> implementation sets .page_mkwrite.
>
> Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v3 1/3] kernfs: remove page_mkwrite() from vm_operations_struct
2024-07-04 16:37 ` [PATCH v3 1/3] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
2024-07-04 16:48 ` Greg Kroah-Hartman
@ 2024-07-04 17:02 ` Matthew Wilcox
2024-07-04 20:43 ` Martin Oliveira
1 sibling, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2024-07-04 17:02 UTC (permalink / raw)
To: Martin Oliveira
Cc: linux-kernel, linux-mm, linux-rdma, Andrew Morton,
Artemy Kovalyov, Greg Kroah-Hartman, Jason Gunthorpe,
Leon Romanovsky, Logan Gunthorpe, Michael Guralnik,
Mike Marciniszyn, Shiraz Saleem, Tejun Heo, John Hubbard,
Dan Williams, David Sloan
On Thu, Jul 04, 2024 at 10:37:22AM -0600, Martin Oliveira wrote:
> @@ -482,6 +459,8 @@ static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
> if (vma->vm_ops && vma->vm_ops->close)
> goto out_put;
>
> + WARN_ON(vma->vm_ops && vma->vm_ops->page_mkwrite);
> +
> rc = 0;
> if (!of->mmapped) {
> of->mmapped = true;
Seems to me we should actually _handle_ that, not do something wrong.
eg:
if (vma->vm_ops) {
if (vma->vm_ops->close)
goto out_put;
if (WARN_ON(vma->vm_ops->page_mkwrite))
goto out_put;
}
or maybe this doesn't need to be a WARN at all? After all, there
isn't one for having a ->close method, so why is page_mkwrite special?
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v3 1/3] kernfs: remove page_mkwrite() from vm_operations_struct
2024-07-04 17:02 ` Matthew Wilcox
@ 2024-07-04 20:43 ` Martin Oliveira
2024-07-05 7:20 ` Christoph Hellwig
0 siblings, 1 reply; 9+ messages in thread
From: Martin Oliveira @ 2024-07-04 20:43 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-kernel, linux-mm, linux-rdma, Andrew Morton,
Artemy Kovalyov, Greg Kroah-Hartman, Jason Gunthorpe,
Leon Romanovsky, Logan Gunthorpe, Michael Guralnik,
Mike Marciniszyn, Tejun Heo, John Hubbard, Dan Williams,
David Sloan
On 2024-07-04 11:02, Matthew Wilcox wrote:> Seems to me we should actually _handle_ that, not do something wrong.
> eg:
>
> if (vma->vm_ops) {
> if (vma->vm_ops->close)
> goto out_put;
> if (WARN_ON(vma->vm_ops->page_mkwrite))
> goto out_put;
> }
Good point.
> or maybe this doesn't need to be a WARN at all? After all, there
> isn't one for having a ->close method, so why is page_mkwrite special?
Hmm yeah, they should probably be treated the same.
Maybe ->close should be converted to WARN as well? It would be easier to
catch an error this way than chasing the EINVAL, but I'm OK either way.
Thanks,
Martin
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v3 1/3] kernfs: remove page_mkwrite() from vm_operations_struct
2024-07-04 20:43 ` Martin Oliveira
@ 2024-07-05 7:20 ` Christoph Hellwig
2024-07-08 16:31 ` Martin Oliveira
0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2024-07-05 7:20 UTC (permalink / raw)
To: Martin Oliveira
Cc: Matthew Wilcox, linux-kernel, linux-mm, linux-rdma,
Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
Jason Gunthorpe, Leon Romanovsky, Logan Gunthorpe,
Michael Guralnik, Mike Marciniszyn, Tejun Heo, John Hubbard,
Dan Williams, David Sloan
On Thu, Jul 04, 2024 at 02:43:04PM -0600, Martin Oliveira wrote:
> On 2024-07-04 11:02, Matthew Wilcox wrote:> Seems to me we should actually _handle_ that, not do something wrong.
> > eg:
> >
> > if (vma->vm_ops) {
> > if (vma->vm_ops->close)
> > goto out_put;
> > if (WARN_ON(vma->vm_ops->page_mkwrite))
> > goto out_put;
> > }
>
> Good point.
Btw, sorry if I mislead you with my WARN_ON_ONCE suggestion. That
was always intended in addition to the error handling, not instead.
(In fact there are very few reasons to use WARN_ON* without actually
handling the error as well).
>
> > or maybe this doesn't need to be a WARN at all? After all, there
> > isn't one for having a ->close method, so why is page_mkwrite special?
>
> Hmm yeah, they should probably be treated the same.
>
> Maybe ->close should be converted to WARN as well? It would be easier to
> catch an error this way than chasing the EINVAL, but I'm OK either way.
Yes, doing the same for ->close or anything unimplemented would be
nice. But it's not really in scope for this series.
kernfs really should be using it's own ops instead of abusing
file_operations, but that's even more out of scope..
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v3 1/3] kernfs: remove page_mkwrite() from vm_operations_struct
2024-07-05 7:20 ` Christoph Hellwig
@ 2024-07-08 16:31 ` Martin Oliveira
0 siblings, 0 replies; 9+ messages in thread
From: Martin Oliveira @ 2024-07-08 16:31 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Matthew Wilcox, linux-kernel, linux-mm, linux-rdma,
Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
Jason Gunthorpe, Leon Romanovsky, Logan Gunthorpe,
Michael Guralnik, Mike Marciniszyn, Tejun Heo, John Hubbard,
Dan Williams, David Sloan
On 2024-07-05 01:20, Christoph Hellwig wrote:
> Btw, sorry if I mislead you with my WARN_ON_ONCE suggestion. That
> was always intended in addition to the error handling, not instead.
> (In fact there are very few reasons to use WARN_ON* without actually
> handling the error as well).
Yeah, I should have caught that. Thanks for the feedback, Christoph!
I'll submit a new version later today.
> Yes, doing the same for ->close or anything unimplemented would be
> nice. But it's not really in scope for this series.
>
> kernfs really should be using it's own ops instead of abusing
> file_operations, but that's even more out of scope..
Ok, I'll add the ->page_mkwrite with the WARN but leave the ->close
the way it is.
Martin
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2024-07-08 16:31 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-04 16:37 [PATCH v3 0/3] Enable P2PDMA in Userspace RDMA Martin Oliveira
2024-07-04 16:37 ` [PATCH v3 1/3] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
2024-07-04 16:48 ` Greg Kroah-Hartman
2024-07-04 17:02 ` Matthew Wilcox
2024-07-04 20:43 ` Martin Oliveira
2024-07-05 7:20 ` Christoph Hellwig
2024-07-08 16:31 ` Martin Oliveira
2024-07-04 16:37 ` [PATCH v3 2/3] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA Martin Oliveira
2024-07-04 16:37 ` [PATCH v3 3/3] RDMA/umem: add support for P2P RDMA Martin Oliveira
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox