linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/4] Enable P2PDMA in Userspace RDMA
@ 2024-08-08 18:33 Martin Oliveira
  2024-08-08 18:33 ` [PATCH v5 1/4] kernfs: add a WARN_ON_ONCE if ->close is set Martin Oliveira
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Martin Oliveira @ 2024-08-08 18:33 UTC (permalink / raw)
  To: linux-rdma, linux-kernel, linux-mm
  Cc: Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
	Jason Gunthorpe, Leon Romanovsky, Logan Gunthorpe,
	Michael Guralnik, Mike Marciniszyn, Shiraz Saleem, Tejun Heo,
	John Hubbard, Dan Williams, David Sloan, Martin Oliveira

In the last version of this series, there was a discrepancy on how
->close() and ->page_mkwrite() were handled, as just the latter had a
WARN. Matthew requested that they be the same, so this version adds an
extra patch to add a WARN to the ->close() handling. Everything else
remains the same.

On a different note, I was wondering which tree should take this series.

Thanks to everyone who has provided feedback!

Thanks,
Martin

Original cover letter:

This patch series enables P2PDMA memory to be used in userspace RDMA
transfers. With this series, P2PDMA memory mmaped into userspace (ie.
only NVMe CMBs, at the moment) can then be used with ibv_reg_mr() (or
similar) interfaces. This can be tested by passing a sysfs p2pmem
allocator to the --mmap flag of the perftest tools.

This requires addressing two issues:

* Stop exporting kernfs VMAs with page_mkwrite, which is incompatible
with FOLL_LONGTERM and is redudant since the default fault code has the
same behavior as kernfs_vma_page_mkwrite() (i.e., call
file_update_time()).

* Remove the restriction on FOLL_LONGTREM with FOLL_PCI_P2PDMA which was
initially put in place due to excessive caution with assuming P2PDMA
would have similar problems to fsdax with unmap_mapping_range(). Seeing
P2PDMA only uses unmap_mapping_range() on device unbind and immediately
waits for all page reference counts to go to zero after calling it, it
is actually believed to be safe from reuse and user access faults. See
[1] for more discussion.

This was tested using a Mellanox ConnectX-6 SmartNIC (MT28908 Family),
using the mlx5_core driver, as well as an NVMe CMB.

Thanks,
Martin

[1]: https://lore.kernel.org/linux-mm/87cypuvh2i.fsf@nvdebian.thelocal/T/

--

Changes in v5:
  - Add a WARN in the ->close() handling (per Matthew)

Changes in v4:
  - Actually handle the WARN if someone sets ->page_mkwrite

Changes in v3:
  - Change to WARN_ON() if an implementaion of kernfs sets
    .page_mkwrite() (Suggested by Christoph)
  - Removed fast-gup patch

Changes in v2:
  - Remove page_mkwrite() for all kernfs, instead of creating a
    different vm_ops for p2pdma.


Martin Oliveira (4):
  kernfs: upgrade ->close check on mmap to WARN
  kernfs: remove page_mkwrite() from vm_operations_struct
  mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
  RDMA/umem: add support for P2P RDMA

 drivers/infiniband/core/umem.c |  3 +++
 fs/kernfs/file.c               | 40 ++++++++++------------------------
 mm/gup.c                       |  5 -----
 3 files changed, 14 insertions(+), 34 deletions(-)

-- 
2.43.0



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v5 1/4] kernfs: add a WARN_ON_ONCE if ->close is set
  2024-08-08 18:33 [PATCH v5 0/4] Enable P2PDMA in Userspace RDMA Martin Oliveira
@ 2024-08-08 18:33 ` Martin Oliveira
  2024-08-09  5:37   ` Greg Kroah-Hartman
  2024-08-12  6:38   ` Christoph Hellwig
  2024-08-08 18:33 ` [PATCH v5 2/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 19+ messages in thread
From: Martin Oliveira @ 2024-08-08 18:33 UTC (permalink / raw)
  To: linux-rdma, linux-kernel, linux-mm
  Cc: Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
	Jason Gunthorpe, Leon Romanovsky, Logan Gunthorpe,
	Michael Guralnik, Mike Marciniszyn, Shiraz Saleem, Tejun Heo,
	John Hubbard, Dan Williams, David Sloan, Martin Oliveira

The next patch is going to remove .page_mkwrite from kernfs and will
WARN if an mmap implementation sets .page_mkwrite.

In preparation for that change, and to make it consistent, add a WARN to
the ->close check.

Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
---
 fs/kernfs/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 8502ef68459b..72cc51dcf870 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -479,7 +479,7 @@ static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
 	 * It is not possible to successfully wrap close.
 	 * So error if someone is trying to use close.
 	 */
-	if (vma->vm_ops && vma->vm_ops->close)
+	if (WARN_ON_ONCE(vma->vm_ops && vma->vm_ops->close))
 		goto out_put;
 
 	rc = 0;
-- 
2.43.0



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v5 2/4] kernfs: remove page_mkwrite() from vm_operations_struct
  2024-08-08 18:33 [PATCH v5 0/4] Enable P2PDMA in Userspace RDMA Martin Oliveira
  2024-08-08 18:33 ` [PATCH v5 1/4] kernfs: add a WARN_ON_ONCE if ->close is set Martin Oliveira
@ 2024-08-08 18:33 ` Martin Oliveira
  2024-08-12  6:38   ` Christoph Hellwig
  2024-08-08 18:33 ` [PATCH v5 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA Martin Oliveira
  2024-08-08 18:33 ` [PATCH v5 4/4] RDMA/umem: add support for P2P RDMA Martin Oliveira
  3 siblings, 1 reply; 19+ messages in thread
From: Martin Oliveira @ 2024-08-08 18:33 UTC (permalink / raw)
  To: linux-rdma, linux-kernel, linux-mm
  Cc: Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
	Jason Gunthorpe, Leon Romanovsky, Logan Gunthorpe,
	Michael Guralnik, Mike Marciniszyn, Shiraz Saleem, Tejun Heo,
	John Hubbard, Dan Williams, David Sloan, Martin Oliveira

The .page_mkwrite operator of kernfs just calls file_update_time().
This is the same behaviour that the fault code does if .page_mkwrite is
not set.

Furthermore, having the page_mkwrite() operator causes
writable_file_mapping_allowed() to fail due to
vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
enabling P2PDMA over RDMA.

There are no users of .page_mkwrite and no known valid use cases, so
just remove the .page_mkwrite from kernfs_ops and WARN if an mmap()
implementation sets .page_mkwrite.

Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
---
 fs/kernfs/file.c | 40 +++++++++++-----------------------------
 1 file changed, 11 insertions(+), 29 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 72cc51dcf870..a747ed95063f 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -386,28 +386,6 @@ static vm_fault_t kernfs_vma_fault(struct vm_fault *vmf)
 	return ret;
 }
 
-static vm_fault_t kernfs_vma_page_mkwrite(struct vm_fault *vmf)
-{
-	struct file *file = vmf->vma->vm_file;
-	struct kernfs_open_file *of = kernfs_of(file);
-	vm_fault_t ret;
-
-	if (!of->vm_ops)
-		return VM_FAULT_SIGBUS;
-
-	if (!kernfs_get_active(of->kn))
-		return VM_FAULT_SIGBUS;
-
-	ret = 0;
-	if (of->vm_ops->page_mkwrite)
-		ret = of->vm_ops->page_mkwrite(vmf);
-	else
-		file_update_time(file);
-
-	kernfs_put_active(of->kn);
-	return ret;
-}
-
 static int kernfs_vma_access(struct vm_area_struct *vma, unsigned long addr,
 			     void *buf, int len, int write)
 {
@@ -432,7 +410,6 @@ static int kernfs_vma_access(struct vm_area_struct *vma, unsigned long addr,
 static const struct vm_operations_struct kernfs_vm_ops = {
 	.open		= kernfs_vma_open,
 	.fault		= kernfs_vma_fault,
-	.page_mkwrite	= kernfs_vma_page_mkwrite,
 	.access		= kernfs_vma_access,
 };
 
@@ -475,12 +452,17 @@ static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
 	if (of->mmapped && of->vm_ops != vma->vm_ops)
 		goto out_put;
 
-	/*
-	 * It is not possible to successfully wrap close.
-	 * So error if someone is trying to use close.
-	 */
-	if (WARN_ON_ONCE(vma->vm_ops && vma->vm_ops->close))
-		goto out_put;
+	if (vma->vm_ops) {
+		/*
+		 * It is not possible to successfully wrap close.
+		 * So error if someone is trying to use close.
+		 */
+		if (WARN_ON_ONCE(vma->vm_ops->close))
+			goto out_put;
+
+		if (WARN_ON_ONCE(vma->vm_ops->page_mkwrite))
+			goto out_put;
+	}
 
 	rc = 0;
 	if (!of->mmapped) {
-- 
2.43.0



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v5 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
  2024-08-08 18:33 [PATCH v5 0/4] Enable P2PDMA in Userspace RDMA Martin Oliveira
  2024-08-08 18:33 ` [PATCH v5 1/4] kernfs: add a WARN_ON_ONCE if ->close is set Martin Oliveira
  2024-08-08 18:33 ` [PATCH v5 2/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
@ 2024-08-08 18:33 ` Martin Oliveira
  2024-08-12  6:39   ` Christoph Hellwig
  2024-08-08 18:33 ` [PATCH v5 4/4] RDMA/umem: add support for P2P RDMA Martin Oliveira
  3 siblings, 1 reply; 19+ messages in thread
From: Martin Oliveira @ 2024-08-08 18:33 UTC (permalink / raw)
  To: linux-rdma, linux-kernel, linux-mm
  Cc: Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
	Jason Gunthorpe, Leon Romanovsky, Logan Gunthorpe,
	Michael Guralnik, Mike Marciniszyn, Shiraz Saleem, Tejun Heo,
	John Hubbard, Dan Williams, David Sloan, Martin Oliveira

This check existed originally due to concerns that P2PDMA needed to copy
fsdax until pgmap refcounts were fixed (see [1]).

The P2PDMA infrastructure will only call unmap_mapping_range() when the
underlying device is unbound, and immediately after unmapping it waits
for the reference of all ZONE_DEVICE pages to be released before
continuing. This does not allow for a page to be reused and no user
access fault is therefore possible. It does not have the same problem as
fsdax.

The one minor concern with FOLL_LONGTERM pins is they will block device
unbind until userspace releases them all.

Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>

[1]: https://lkml.kernel.org/r/Yy4Ot5MoOhsgYLTQ@ziepe.ca
---
 mm/gup.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 54d0dc3831fb..df267b7890aa 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2547,11 +2547,6 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
 	if (WARN_ON_ONCE((gup_flags & (FOLL_GET | FOLL_PIN)) && !pages))
 		return false;
 
-	/* We want to allow the pgmap to be hot-unplugged at all times */
-	if (WARN_ON_ONCE((gup_flags & FOLL_LONGTERM) &&
-			 (gup_flags & FOLL_PCI_P2PDMA)))
-		return false;
-
 	*gup_flags_p = gup_flags;
 	return true;
 }
-- 
2.43.0



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v5 4/4] RDMA/umem: add support for P2P RDMA
  2024-08-08 18:33 [PATCH v5 0/4] Enable P2PDMA in Userspace RDMA Martin Oliveira
                   ` (2 preceding siblings ...)
  2024-08-08 18:33 ` [PATCH v5 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA Martin Oliveira
@ 2024-08-08 18:33 ` Martin Oliveira
  3 siblings, 0 replies; 19+ messages in thread
From: Martin Oliveira @ 2024-08-08 18:33 UTC (permalink / raw)
  To: linux-rdma, linux-kernel, linux-mm
  Cc: Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
	Jason Gunthorpe, Leon Romanovsky, Logan Gunthorpe,
	Michael Guralnik, Mike Marciniszyn, Shiraz Saleem, Tejun Heo,
	John Hubbard, Dan Williams, David Sloan, Martin Oliveira,
	Jason Gunthorpe

If the device supports P2PDMA, add the FOLL_PCI_P2PDMA flag

This allows ibv_reg_mr() and friends to use P2PDMA memory that has been
mmaped into userspace for MRs in IB and RDMA transactions.

Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/core/umem.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 07c571c7b699..b59bb6e1475e 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -208,6 +208,9 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
 	if (umem->writable)
 		gup_flags |= FOLL_WRITE;
 
+	if (ib_dma_pci_p2p_dma_supported(device))
+		gup_flags |= FOLL_PCI_P2PDMA;
+
 	while (npages) {
 		cond_resched();
 		pinned = pin_user_pages_fast(cur_base,
-- 
2.43.0



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 1/4] kernfs: add a WARN_ON_ONCE if ->close is set
  2024-08-08 18:33 ` [PATCH v5 1/4] kernfs: add a WARN_ON_ONCE if ->close is set Martin Oliveira
@ 2024-08-09  5:37   ` Greg Kroah-Hartman
  2024-08-09 15:41     ` Martin Oliveira
  2024-08-12  6:38     ` Christoph Hellwig
  2024-08-12  6:38   ` Christoph Hellwig
  1 sibling, 2 replies; 19+ messages in thread
From: Greg Kroah-Hartman @ 2024-08-09  5:37 UTC (permalink / raw)
  To: Martin Oliveira
  Cc: linux-rdma, linux-kernel, linux-mm, Andrew Morton,
	Artemy Kovalyov, Jason Gunthorpe, Leon Romanovsky,
	Logan Gunthorpe, Michael Guralnik, Mike Marciniszyn,
	Shiraz Saleem, Tejun Heo, John Hubbard, Dan Williams,
	David Sloan

On Thu, Aug 08, 2024 at 12:33:37PM -0600, Martin Oliveira wrote:
> The next patch is going to remove .page_mkwrite from kernfs and will
> WARN if an mmap implementation sets .page_mkwrite.
> 
> In preparation for that change, and to make it consistent, add a WARN to
> the ->close check.
> 
> Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
> ---
>  fs/kernfs/file.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> index 8502ef68459b..72cc51dcf870 100644
> --- a/fs/kernfs/file.c
> +++ b/fs/kernfs/file.c
> @@ -479,7 +479,7 @@ static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
>  	 * It is not possible to successfully wrap close.
>  	 * So error if someone is trying to use close.
>  	 */
> -	if (vma->vm_ops && vma->vm_ops->close)
> +	if (WARN_ON_ONCE(vma->vm_ops && vma->vm_ops->close))

So you just rebooted a machine that hits this, loosing data everywhere.
Not nice :(



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 1/4] kernfs: add a WARN_ON_ONCE if ->close is set
  2024-08-09  5:37   ` Greg Kroah-Hartman
@ 2024-08-09 15:41     ` Martin Oliveira
  2024-08-11  8:31       ` Greg Kroah-Hartman
  2024-08-12  6:38     ` Christoph Hellwig
  1 sibling, 1 reply; 19+ messages in thread
From: Martin Oliveira @ 2024-08-09 15:41 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-rdma, linux-kernel, linux-mm, Andrew Morton,
	Artemy Kovalyov, Jason Gunthorpe, Leon Romanovsky,
	Logan Gunthorpe, Michael Guralnik, Mike Marciniszyn,
	Shiraz Saleem, Tejun Heo, John Hubbard, Dan Williams,
	David Sloan

On 2024-08-08 23:37, Greg Kroah-Hartman wrote:
>> diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
>> index 8502ef68459b..72cc51dcf870 100644
>> --- a/fs/kernfs/file.c
>> +++ b/fs/kernfs/file.c
>> @@ -479,7 +479,7 @@ static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
>>        * It is not possible to successfully wrap close.
>>        * So error if someone is trying to use close.
>>        */
>> -     if (vma->vm_ops && vma->vm_ops->close)
>> +     if (WARN_ON_ONCE(vma->vm_ops && vma->vm_ops->close))
> 
> So you just rebooted a machine that hits this, loosing data everywhere.
> Not nice :(

Well, apologies for that, but there's no way to know what every single machine
out there is doing...

However if this machine is using ->close when that's clearly marked as
unsupported then shouldn't we fix that?


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 1/4] kernfs: add a WARN_ON_ONCE if ->close is set
  2024-08-09 15:41     ` Martin Oliveira
@ 2024-08-11  8:31       ` Greg Kroah-Hartman
  0 siblings, 0 replies; 19+ messages in thread
From: Greg Kroah-Hartman @ 2024-08-11  8:31 UTC (permalink / raw)
  To: Martin Oliveira
  Cc: linux-rdma, linux-kernel, linux-mm, Andrew Morton,
	Artemy Kovalyov, Jason Gunthorpe, Leon Romanovsky,
	Logan Gunthorpe, Michael Guralnik, Mike Marciniszyn,
	Shiraz Saleem, Tejun Heo, John Hubbard, Dan Williams,
	David Sloan

On Fri, Aug 09, 2024 at 09:41:48AM -0600, Martin Oliveira wrote:
> On 2024-08-08 23:37, Greg Kroah-Hartman wrote:
> >> diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> >> index 8502ef68459b..72cc51dcf870 100644
> >> --- a/fs/kernfs/file.c
> >> +++ b/fs/kernfs/file.c
> >> @@ -479,7 +479,7 @@ static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
> >>        * It is not possible to successfully wrap close.
> >>        * So error if someone is trying to use close.
> >>        */
> >> -     if (vma->vm_ops && vma->vm_ops->close)
> >> +     if (WARN_ON_ONCE(vma->vm_ops && vma->vm_ops->close))
> > 
> > So you just rebooted a machine that hits this, loosing data everywhere.
> > Not nice :(
> 
> Well, apologies for that, but there's no way to know what every single machine
> out there is doing...
> 
> However if this machine is using ->close when that's clearly marked as
> unsupported then shouldn't we fix that?

return an error properly?


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 1/4] kernfs: add a WARN_ON_ONCE if ->close is set
  2024-08-09  5:37   ` Greg Kroah-Hartman
  2024-08-09 15:41     ` Martin Oliveira
@ 2024-08-12  6:38     ` Christoph Hellwig
  1 sibling, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2024-08-12  6:38 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Martin Oliveira, linux-rdma, linux-kernel, linux-mm,
	Andrew Morton, Artemy Kovalyov, Jason Gunthorpe, Leon Romanovsky,
	Logan Gunthorpe, Michael Guralnik, Mike Marciniszyn,
	Shiraz Saleem, Tejun Heo, John Hubbard, Dan Williams,
	David Sloan

On Fri, Aug 09, 2024 at 07:37:49AM +0200, Greg Kroah-Hartman wrote:
> >  	 * It is not possible to successfully wrap close.
> >  	 * So error if someone is trying to use close.
> >  	 */
> > -	if (vma->vm_ops && vma->vm_ops->close)
> > +	if (WARN_ON_ONCE(vma->vm_ops && vma->vm_ops->close))
> 
> So you just rebooted a machine that hits this, loosing data everywhere.
> Not nice :(

Huh.  if you are stupid enough to set panic_on_warn you get to keep
the pieces.  And our file systems are reliable to not use data on
an unclean shutdown anyway.

Pleaee stop these BS arguments.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 1/4] kernfs: add a WARN_ON_ONCE if ->close is set
  2024-08-08 18:33 ` [PATCH v5 1/4] kernfs: add a WARN_ON_ONCE if ->close is set Martin Oliveira
  2024-08-09  5:37   ` Greg Kroah-Hartman
@ 2024-08-12  6:38   ` Christoph Hellwig
  1 sibling, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2024-08-12  6:38 UTC (permalink / raw)
  To: Martin Oliveira
  Cc: linux-rdma, linux-kernel, linux-mm, Andrew Morton,
	Artemy Kovalyov, Greg Kroah-Hartman, Jason Gunthorpe,
	Leon Romanovsky, Logan Gunthorpe, Michael Guralnik,
	Mike Marciniszyn, Shiraz Saleem, Tejun Heo, John Hubbard,
	Dan Williams, David Sloan

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 2/4] kernfs: remove page_mkwrite() from vm_operations_struct
  2024-08-08 18:33 ` [PATCH v5 2/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
@ 2024-08-12  6:38   ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2024-08-12  6:38 UTC (permalink / raw)
  To: Martin Oliveira
  Cc: linux-rdma, linux-kernel, linux-mm, Andrew Morton,
	Artemy Kovalyov, Greg Kroah-Hartman, Jason Gunthorpe,
	Leon Romanovsky, Logan Gunthorpe, Michael Guralnik,
	Mike Marciniszyn, Shiraz Saleem, Tejun Heo, John Hubbard,
	Dan Williams, David Sloan

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
  2024-08-08 18:33 ` [PATCH v5 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA Martin Oliveira
@ 2024-08-12  6:39   ` Christoph Hellwig
  2024-08-12 23:12     ` Jason Gunthorpe
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2024-08-12  6:39 UTC (permalink / raw)
  To: Martin Oliveira
  Cc: linux-rdma, linux-kernel, linux-mm, Andrew Morton,
	Artemy Kovalyov, Greg Kroah-Hartman, Jason Gunthorpe,
	Leon Romanovsky, Logan Gunthorpe, Michael Guralnik,
	Mike Marciniszyn, Shiraz Saleem, Tejun Heo, John Hubbard,
	Dan Williams, David Sloan

On Thu, Aug 08, 2024 at 12:33:39PM -0600, Martin Oliveira wrote:
> This check existed originally due to concerns that P2PDMA needed to copy
> fsdax until pgmap refcounts were fixed (see [1]).
> 
> The P2PDMA infrastructure will only call unmap_mapping_range() when the
> underlying device is unbound, and immediately after unmapping it waits
> for the reference of all ZONE_DEVICE pages to be released before
> continuing. This does not allow for a page to be reused and no user
> access fault is therefore possible. It does not have the same problem as
> fsdax.
> 
> The one minor concern with FOLL_LONGTERM pins is they will block device
> unbind until userspace releases them all.

This is unfortunately not really minor unless we have a well documented
way to force this :(



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
  2024-08-12  6:39   ` Christoph Hellwig
@ 2024-08-12 23:12     ` Jason Gunthorpe
  2024-08-13  5:41       ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2024-08-12 23:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin Oliveira, linux-rdma, linux-kernel, linux-mm,
	Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
	Leon Romanovsky, Logan Gunthorpe, Michael Guralnik,
	Mike Marciniszyn, Shiraz Saleem, Tejun Heo, John Hubbard,
	Dan Williams, David Sloan

On Sun, Aug 11, 2024 at 11:39:22PM -0700, Christoph Hellwig wrote:
> On Thu, Aug 08, 2024 at 12:33:39PM -0600, Martin Oliveira wrote:
> > This check existed originally due to concerns that P2PDMA needed to copy
> > fsdax until pgmap refcounts were fixed (see [1]).
> > 
> > The P2PDMA infrastructure will only call unmap_mapping_range() when the
> > underlying device is unbound, and immediately after unmapping it waits
> > for the reference of all ZONE_DEVICE pages to be released before
> > continuing. This does not allow for a page to be reused and no user
> > access fault is therefore possible. It does not have the same problem as
> > fsdax.
> > 
> > The one minor concern with FOLL_LONGTERM pins is they will block device
> > unbind until userspace releases them all.
> 
> This is unfortunately not really minor unless we have a well documented
> way to force this :(

It is not that different from blocking driver unbind while FDs are
open which a lot of places do in various ways?

Jason


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
  2024-08-12 23:12     ` Jason Gunthorpe
@ 2024-08-13  5:41       ` Christoph Hellwig
  2024-08-13 16:05         ` Jason Gunthorpe
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2024-08-13  5:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Martin Oliveira, linux-rdma, linux-kernel,
	linux-mm, Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
	Leon Romanovsky, Logan Gunthorpe, Michael Guralnik,
	Mike Marciniszyn, Shiraz Saleem, Tejun Heo, John Hubbard,
	Dan Williams, David Sloan

On Mon, Aug 12, 2024 at 08:12:49PM -0300, Jason Gunthorpe wrote:
> > This is unfortunately not really minor unless we have a well documented
> > way to force this :(
> 
> It is not that different from blocking driver unbind while FDs are
> open which a lot of places do in various ways?

Where do we block driver unbind with an open resource?  The whole
concept is that open resources will pin the in-memory object (and
modulo for a modular driver), but never an unbind or hardware
unplug, of which unbind really just is a simulation.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
  2024-08-13  5:41       ` Christoph Hellwig
@ 2024-08-13 16:05         ` Jason Gunthorpe
  2024-08-13 17:03           ` Dan Williams
  2024-08-14  4:26           ` Christoph Hellwig
  0 siblings, 2 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2024-08-13 16:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin Oliveira, linux-rdma, linux-kernel, linux-mm,
	Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
	Leon Romanovsky, Logan Gunthorpe, Michael Guralnik,
	Mike Marciniszyn, Shiraz Saleem, Tejun Heo, John Hubbard,
	Dan Williams, David Sloan

On Mon, Aug 12, 2024 at 10:41:20PM -0700, Christoph Hellwig wrote:
> On Mon, Aug 12, 2024 at 08:12:49PM -0300, Jason Gunthorpe wrote:
> > > This is unfortunately not really minor unless we have a well documented
> > > way to force this :(
> > 
> > It is not that different from blocking driver unbind while FDs are
> > open which a lot of places do in various ways?
> 
> Where do we block driver unbind with an open resource?  

I keep seeing it in different subsystems, safe driver unbind is really
hard. :\ eg I think VFIO has some waits in it

> The whole concept is that open resources will pin the in-memory
> object (and modulo for a modular driver), but never an unbind or
> hardware unplug, of which unbind really just is a simulation.

Yes, ideally, but not every part of the kernel hits that ideal in my
experience. It is alot of work and some places don't have any good
solutions, like here.

Jason


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
  2024-08-13 16:05         ` Jason Gunthorpe
@ 2024-08-13 17:03           ` Dan Williams
  2024-08-14  0:02             ` Jason Gunthorpe
  2024-08-14  4:26           ` Christoph Hellwig
  1 sibling, 1 reply; 19+ messages in thread
From: Dan Williams @ 2024-08-13 17:03 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Hellwig
  Cc: Martin Oliveira, linux-rdma, linux-kernel, linux-mm,
	Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
	Leon Romanovsky, Logan Gunthorpe, Michael Guralnik,
	Mike Marciniszyn, Shiraz Saleem, Tejun Heo, John Hubbard,
	Dan Williams, David Sloan

Jason Gunthorpe wrote:
> On Mon, Aug 12, 2024 at 10:41:20PM -0700, Christoph Hellwig wrote:
> > On Mon, Aug 12, 2024 at 08:12:49PM -0300, Jason Gunthorpe wrote:
> > > > This is unfortunately not really minor unless we have a well documented
> > > > way to force this :(
> > > 
> > > It is not that different from blocking driver unbind while FDs are
> > > open which a lot of places do in various ways?
> > 
> > Where do we block driver unbind with an open resource?  
> 
> I keep seeing it in different subsystems, safe driver unbind is really
> hard. :\ eg I think VFIO has some waits in it
> 
> > The whole concept is that open resources will pin the in-memory
> > object (and modulo for a modular driver), but never an unbind or
> > hardware unplug, of which unbind really just is a simulation.
> 
> Yes, ideally, but not every part of the kernel hits that ideal in my
> experience. It is alot of work and some places don't have any good
> solutions, like here.

...but there is a distinction between transient and permanent waits,
right? The difficult aspect of FOLL_LONGTERM is the holder has no idea
someone is trying to cleanup and may never drop its pin.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
  2024-08-13 17:03           ` Dan Williams
@ 2024-08-14  0:02             ` Jason Gunthorpe
  0 siblings, 0 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2024-08-14  0:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Martin Oliveira, linux-rdma, linux-kernel,
	linux-mm, Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
	Leon Romanovsky, Logan Gunthorpe, Michael Guralnik,
	Mike Marciniszyn, Shiraz Saleem, Tejun Heo, John Hubbard,
	David Sloan

On Tue, Aug 13, 2024 at 10:03:55AM -0700, Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Mon, Aug 12, 2024 at 10:41:20PM -0700, Christoph Hellwig wrote:
> > > On Mon, Aug 12, 2024 at 08:12:49PM -0300, Jason Gunthorpe wrote:
> > > > > This is unfortunately not really minor unless we have a well documented
> > > > > way to force this :(
> > > > 
> > > > It is not that different from blocking driver unbind while FDs are
> > > > open which a lot of places do in various ways?
> > > 
> > > Where do we block driver unbind with an open resource?  
> > 
> > I keep seeing it in different subsystems, safe driver unbind is really
> > hard. :\ eg I think VFIO has some waits in it
> > 
> > > The whole concept is that open resources will pin the in-memory
> > > object (and modulo for a modular driver), but never an unbind or
> > > hardware unplug, of which unbind really just is a simulation.
> > 
> > Yes, ideally, but not every part of the kernel hits that ideal in my
> > experience. It is alot of work and some places don't have any good
> > solutions, like here.
> 
> ...but there is a distinction between transient and permanent waits,
> right? The difficult aspect of FOLL_LONGTERM is the holder has no idea
> someone is trying to cleanup and may never drop its pin.

It is the quite similar to userspace holding a FD open while a driver
is trying to unbind. The FD holder has possibly no idea things are
waiting on it.

Nice subsystems allow the FD to keep existing while the driver is
unplugged, but still many have to wait for the FD to close as
disconnecting an active driver from it's FD requires some pretty
careful design.

Jason


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
  2024-08-13 16:05         ` Jason Gunthorpe
  2024-08-13 17:03           ` Dan Williams
@ 2024-08-14  4:26           ` Christoph Hellwig
  2024-08-15 16:38             ` Jason Gunthorpe
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2024-08-14  4:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Martin Oliveira, linux-rdma, linux-kernel,
	linux-mm, Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
	Leon Romanovsky, Logan Gunthorpe, Michael Guralnik,
	Mike Marciniszyn, Shiraz Saleem, Tejun Heo, John Hubbard,
	Dan Williams, David Sloan

On Tue, Aug 13, 2024 at 01:05:02PM -0300, Jason Gunthorpe wrote:
> > Where do we block driver unbind with an open resource?  
> 
> I keep seeing it in different subsystems, safe driver unbind is really
> hard. :\ eg I think VFIO has some waits in it

Well, waits for what?  Usuaully an unbind has to wait for something,
but waiting for unbound references is always a bug.  And I can't
think of any sensible subsystem doing this.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
  2024-08-14  4:26           ` Christoph Hellwig
@ 2024-08-15 16:38             ` Jason Gunthorpe
  0 siblings, 0 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2024-08-15 16:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin Oliveira, linux-rdma, linux-kernel, linux-mm,
	Andrew Morton, Artemy Kovalyov, Greg Kroah-Hartman,
	Leon Romanovsky, Logan Gunthorpe, Michael Guralnik,
	Mike Marciniszyn, Shiraz Saleem, Tejun Heo, John Hubbard,
	Dan Williams, David Sloan

On Tue, Aug 13, 2024 at 09:26:55PM -0700, Christoph Hellwig wrote:
> On Tue, Aug 13, 2024 at 01:05:02PM -0300, Jason Gunthorpe wrote:
> > > Where do we block driver unbind with an open resource?  
> > 
> > I keep seeing it in different subsystems, safe driver unbind is really
> > hard. :\ eg I think VFIO has some waits in it
> 
> Well, waits for what? 

Looks like it waits for the fds to close at least. It has weird
handshake where it notifies userspace that the device is closing and
the userspace needs to close the fd. See vfio_unregister_group_dev()

In the past VFIO couldn't do anything else because it had VMAs open
that point to the device's mmio and it would be wrong to unbind the
driver and leave those uncleaned. VFIO now knows how to zap VMA so
maybe this could be improved, but it looks like a sizable uplift.

> Usuaully an unbind has to wait for something, but waiting for
> unbound references is always a bug.  And I can't think of any
> sensible subsystem doing this.

I've seen several cases over the years.. It can be hard in many cases
to deal with the issue. Like the VMA's above for instance. rdma also
had to learn how to do zap before it could do fast unbind.

Some places just have bugs where they don't wait and Weird Stuff
happens. There is a strange misconception that module refcounts
protect against unbind :\

Not saying this is good design, or desirable, just pointing out we
seem to tolerate this in enough other places.

In this case the only resolution I could think of would be to have the
rdma stack somehow register a notifier on the pgmap. Then we could
revoke the RDMA access synchronously during destruction and return
those refcounts.

That does seems a bit complicated though..

Jason


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-08-15 16:38 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-08 18:33 [PATCH v5 0/4] Enable P2PDMA in Userspace RDMA Martin Oliveira
2024-08-08 18:33 ` [PATCH v5 1/4] kernfs: add a WARN_ON_ONCE if ->close is set Martin Oliveira
2024-08-09  5:37   ` Greg Kroah-Hartman
2024-08-09 15:41     ` Martin Oliveira
2024-08-11  8:31       ` Greg Kroah-Hartman
2024-08-12  6:38     ` Christoph Hellwig
2024-08-12  6:38   ` Christoph Hellwig
2024-08-08 18:33 ` [PATCH v5 2/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
2024-08-12  6:38   ` Christoph Hellwig
2024-08-08 18:33 ` [PATCH v5 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA Martin Oliveira
2024-08-12  6:39   ` Christoph Hellwig
2024-08-12 23:12     ` Jason Gunthorpe
2024-08-13  5:41       ` Christoph Hellwig
2024-08-13 16:05         ` Jason Gunthorpe
2024-08-13 17:03           ` Dan Williams
2024-08-14  0:02             ` Jason Gunthorpe
2024-08-14  4:26           ` Christoph Hellwig
2024-08-15 16:38             ` Jason Gunthorpe
2024-08-08 18:33 ` [PATCH v5 4/4] RDMA/umem: add support for P2P RDMA Martin Oliveira

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox