* [PATCH v2 0/4] Enable P2PDMA in Userspace RDMA
@ 2024-06-11 18:27 Martin Oliveira
  2024-06-11 18:27 ` [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Martin Oliveira @ 2024-06-11 18:27 UTC (permalink / raw)
  To: linux-rdma, linux-kernel, linux-mm
  Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
	Andrew Morton, Logan Gunthorpe, Martin Oliveira,
	Mike Marciniszyn, Shiraz Saleem, Michael Guralnik,
	Artemy Kovalyov

This patch series enables P2PDMA memory to be used in userspace RDMA
transfers. With this series, P2PDMA memory mmapped into userspace (i.e.
only NVMe CMBs, at the moment) can then be used with ibv_reg_mr() (or
similar) interfaces. This can be tested by passing a sysfs p2pmem
allocator to the --mmap flag of the perftest tools.
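
For reference, the userspace flow this enables looks roughly like the
sketch below (untested; the sysfs allocator path, PCI address, device
index and buffer length are placeholders, not values from this series):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <infiniband/verbs.h>

int main(void)
{
	/* placeholder path to a p2pmem allocator exposed in sysfs */
	const char *path =
		"/sys/bus/pci/devices/0000:03:00.0/p2pmem/allocate";
	size_t len = 2 * 1024 * 1024;	/* placeholder length */

	int fd = open(path, O_RDWR);
	if (fd < 0)
		return 1;

	/* NVMe CMB memory mmapped into userspace (ZONE_DEVICE pages) */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED)
		return 1;

	struct ibv_device **devs = ibv_get_device_list(NULL);
	struct ibv_context *ctx = ibv_open_device(devs[0]);
	struct ibv_pd *pd = ibv_alloc_pd(ctx);

	/*
	 * With this series, the longterm pin taken here can carry
	 * FOLL_PCI_P2PDMA and succeed on the CMB pages.
	 */
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
				       IBV_ACCESS_LOCAL_WRITE |
				       IBV_ACCESS_REMOTE_READ |
				       IBV_ACCESS_REMOTE_WRITE);
	if (!mr) {
		perror("ibv_reg_mr");
		return 1;
	}

	/* ... post work requests using mr->lkey / mr->rkey ... */

	ibv_dereg_mr(mr);
	ibv_dealloc_pd(pd);
	ibv_close_device(ctx);
	ibv_free_device_list(devs);
	munmap(buf, len);
	close(fd);
	return 0;
}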

This requires addressing three issues:

* Stop exporting kernfs VMAs with page_mkwrite, which is incompatible
with FOLL_LONGTERM and is redundant since the default fault code has
the same behavior as kernfs_vma_page_mkwrite() (i.e., calling
file_update_time()).

* Fix folio_fast_pin_allowed() path to take into account ZONE_DEVICE pages.

* Remove the restriction on FOLL_LONGTERM with FOLL_PCI_P2PDMA, which
was initially put in place out of excessive caution, on the assumption
that P2PDMA would have the same unmap_mapping_range() problems as
fsdax. Since P2PDMA only uses unmap_mapping_range() on device unbind
and immediately waits for all page reference counts to drop to zero
after calling it, it is believed to be safe from page reuse and user
access faults. See [1] for more discussion.

This was tested with a Mellanox ConnectX-6 SmartNIC (MT28908 Family)
using the mlx5_core driver, together with an NVMe CMB.

Thanks,
Martin

[1]: https://lore.kernel.org/linux-mm/87cypuvh2i.fsf@nvdebian.thelocal/T/

--

Changes in v2:
  - Remove page_mkwrite() for all kernfs, instead of creating a
    different vm_ops for p2pdma.

--

Martin Oliveira (4):
  kernfs: remove page_mkwrite() from vm_operations_struct
  mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed()
  mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
  RDMA/umem: add support for P2P RDMA

 drivers/infiniband/core/umem.c |  3 +++
 fs/kernfs/file.c               | 26 +++-----------------------
 mm/gup.c                       |  9 ++++-----
 3 files changed, 10 insertions(+), 28 deletions(-)


base-commit: 83a7eefedc9b56fe7bfeff13b6c7356688ffa670
-- 
2.43.0




* [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct
  2024-06-11 18:27 [PATCH v2 0/4] Enable P2PDMA in Userspace RDMA Martin Oliveira
@ 2024-06-11 18:27 ` Martin Oliveira
  2024-06-12  8:09   ` Greg Kroah-Hartman
                     ` (4 more replies)
  2024-06-11 18:27 ` [PATCH v2 2/4] mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed() Martin Oliveira
                   ` (2 subsequent siblings)
  3 siblings, 5 replies; 13+ messages in thread
From: Martin Oliveira @ 2024-06-11 18:27 UTC (permalink / raw)
  To: linux-rdma, linux-kernel, linux-mm
  Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
	Andrew Morton, Logan Gunthorpe, Martin Oliveira,
	Mike Marciniszyn, Shiraz Saleem, Michael Guralnik,
	Artemy Kovalyov

The .page_mkwrite operator of kernfs just calls file_update_time().
This is the same behaviour that the fault code does if .page_mkwrite is
not set.

Furthermore, having the page_mkwrite() operator causes
writable_file_mapping_allowed() to fail due to
vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
enabling P2PDMA over RDMA.

There are no users of .page_mkwrite and no known valid use cases, so
just remove the .page_mkwrite from kernfs_ops and return -EINVAL if an
mmap() implementation sets .page_mkwrite.
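
For reference, the default behaviour referred to above is roughly the
following (heavily simplified sketch of the shared-fault dirty handling
in mm/memory.c; not part of this patch):

	/*
	 * When a writable shared mapping is dirtied, the core fault code
	 * either calls the VMA's ->page_mkwrite hook or, if there is
	 * none, updates the file time itself -- which is all
	 * kernfs_vma_page_mkwrite() ever did.
	 */
	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
		ret = vma->vm_ops->page_mkwrite(vmf);
	else
		file_update_time(vma->vm_file);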

Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
---
 fs/kernfs/file.c | 26 +++-----------------------
 1 file changed, 3 insertions(+), 23 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 8502ef68459b9..a198cb0718772 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -386,28 +386,6 @@ static vm_fault_t kernfs_vma_fault(struct vm_fault *vmf)
 	return ret;
 }
 
-static vm_fault_t kernfs_vma_page_mkwrite(struct vm_fault *vmf)
-{
-	struct file *file = vmf->vma->vm_file;
-	struct kernfs_open_file *of = kernfs_of(file);
-	vm_fault_t ret;
-
-	if (!of->vm_ops)
-		return VM_FAULT_SIGBUS;
-
-	if (!kernfs_get_active(of->kn))
-		return VM_FAULT_SIGBUS;
-
-	ret = 0;
-	if (of->vm_ops->page_mkwrite)
-		ret = of->vm_ops->page_mkwrite(vmf);
-	else
-		file_update_time(file);
-
-	kernfs_put_active(of->kn);
-	return ret;
-}
-
 static int kernfs_vma_access(struct vm_area_struct *vma, unsigned long addr,
 			     void *buf, int len, int write)
 {
@@ -432,7 +410,6 @@ static int kernfs_vma_access(struct vm_area_struct *vma, unsigned long addr,
 static const struct vm_operations_struct kernfs_vm_ops = {
 	.open		= kernfs_vma_open,
 	.fault		= kernfs_vma_fault,
-	.page_mkwrite	= kernfs_vma_page_mkwrite,
 	.access		= kernfs_vma_access,
 };
 
@@ -482,6 +459,9 @@ static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
 	if (vma->vm_ops && vma->vm_ops->close)
 		goto out_put;
 
+	if (vma->vm_ops->page_mkwrite)
+		goto out_put;
+
 	rc = 0;
 	if (!of->mmapped) {
 		of->mmapped = true;
-- 
2.43.0




* [PATCH v2 2/4] mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed()
  2024-06-11 18:27 [PATCH v2 0/4] Enable P2PDMA in Userspace RDMA Martin Oliveira
  2024-06-11 18:27 ` [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
@ 2024-06-11 18:27 ` Martin Oliveira
  2024-06-15  2:40   ` John Hubbard
  2024-06-11 18:27 ` [PATCH v2 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA Martin Oliveira
  2024-06-11 18:27 ` [PATCH v2 4/4] RDMA/umem: add support for P2P RDMA Martin Oliveira
  3 siblings, 1 reply; 13+ messages in thread
From: Martin Oliveira @ 2024-06-11 18:27 UTC (permalink / raw)
  To: linux-rdma, linux-kernel, linux-mm
  Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
	Andrew Morton, Logan Gunthorpe, Martin Oliveira,
	Mike Marciniszyn, Shiraz Saleem, Michael Guralnik,
	Artemy Kovalyov

folio_fast_pin_allowed() does not support ZONE_DEVICE pages because
currently it is impossible for that type of page to be used with
FOLL_LONGTERM. When this changes in a subsequent patch, this path will
attempt to read the mapping of a ZONE_DEVICE page which is not valid.

Instead, allow ZONE_DEVICE pages explicitly seeing they shouldn't pose
any problem with the fast path.

Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
---
 mm/gup.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index ca0f5cedce9b2..00d0a77112f4f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2847,6 +2847,10 @@ static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
 	if (folio_test_hugetlb(folio))
 		return true;
 
+	/* It makes no sense to access the mapping of ZONE_DEVICE pages */
+	if (folio_is_zone_device(folio))
+		return true;
+
 	/*
 	 * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
 	 * cannot proceed, which means no actions performed under RCU can
-- 
2.43.0




* [PATCH v2 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
  2024-06-11 18:27 [PATCH v2 0/4] Enable P2PDMA in Userspace RDMA Martin Oliveira
  2024-06-11 18:27 ` [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
  2024-06-11 18:27 ` [PATCH v2 2/4] mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed() Martin Oliveira
@ 2024-06-11 18:27 ` Martin Oliveira
  2024-06-15  2:45   ` John Hubbard
  2024-06-11 18:27 ` [PATCH v2 4/4] RDMA/umem: add support for P2P RDMA Martin Oliveira
  3 siblings, 1 reply; 13+ messages in thread
From: Martin Oliveira @ 2024-06-11 18:27 UTC (permalink / raw)
  To: linux-rdma, linux-kernel, linux-mm
  Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
	Andrew Morton, Logan Gunthorpe, Martin Oliveira,
	Mike Marciniszyn, Shiraz Saleem, Michael Guralnik,
	Artemy Kovalyov

This check existed originally due to concerns that P2PDMA needed to copy
fsdax until pgmap refcounts were fixed (see [1]).

The P2PDMA infrastructure will only call unmap_mapping_range() when the
underlying device is unbound, and immediately after unmapping it waits
for the reference of all ZONE_DEVICE pages to be released before
continuing. This does not allow for a page to be reused and no user
access fault is therefore possible. It does not have the same problem as
fsdax.

The one minor concern with FOLL_LONGTERM pins is they will block device
unbind until userspace releases them all.
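
To make the ordering concrete, the unbind path looks conceptually like
the sketch below (illustrative only; the real code is the generic
dev_pagemap teardown plus drivers/pci/p2pdma.c, and the variable names
here are placeholders):

	/* 1) tear down every userspace mapping of the P2PDMA memory */
	unmap_mapping_range(p2pdma_mapping, 0, 0, 1);

	/*
	 * 2) then block until every ZONE_DEVICE page reference is gone,
	 *    including any FOLL_LONGTERM pins still held by userspace
	 */
	percpu_ref_kill(&pgmap->ref);
	wait_for_completion(&pgmap->done);

	/*
	 * Only after this point can the BAR space be released and
	 * reused, so a longterm pin can never observe a reused page.
	 */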

Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>

[1]: https://lkml.kernel.org/r/Yy4Ot5MoOhsgYLTQ@ziepe.ca
---
 mm/gup.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 00d0a77112f4f..28060e41788d0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2614,11 +2614,6 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
 	if (WARN_ON_ONCE((gup_flags & (FOLL_GET | FOLL_PIN)) && !pages))
 		return false;
 
-	/* We want to allow the pgmap to be hot-unplugged at all times */
-	if (WARN_ON_ONCE((gup_flags & FOLL_LONGTERM) &&
-			 (gup_flags & FOLL_PCI_P2PDMA)))
-		return false;
-
 	*gup_flags_p = gup_flags;
 	return true;
 }
-- 
2.43.0




* [PATCH v2 4/4] RDMA/umem: add support for P2P RDMA
  2024-06-11 18:27 [PATCH v2 0/4] Enable P2PDMA in Userspace RDMA Martin Oliveira
                   ` (2 preceding siblings ...)
  2024-06-11 18:27 ` [PATCH v2 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA Martin Oliveira
@ 2024-06-11 18:27 ` Martin Oliveira
  3 siblings, 0 replies; 13+ messages in thread
From: Martin Oliveira @ 2024-06-11 18:27 UTC (permalink / raw)
  To: linux-rdma, linux-kernel, linux-mm
  Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
	Andrew Morton, Logan Gunthorpe, Martin Oliveira,
	Mike Marciniszyn, Shiraz Saleem, Michael Guralnik,
	Artemy Kovalyov, Jason Gunthorpe

If the device supports P2PDMA, add the FOLL_PCI_P2PDMA flag to the GUP
flags.

This allows ibv_reg_mr() and friends to use P2PDMA memory that has been
mmapped into userspace for MRs in IB and RDMA transactions.

Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/core/umem.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 07c571c7b6999..b59bb6e1475e2 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -208,6 +208,9 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
 	if (umem->writable)
 		gup_flags |= FOLL_WRITE;
 
+	if (ib_dma_pci_p2p_dma_supported(device))
+		gup_flags |= FOLL_PCI_P2PDMA;
+
 	while (npages) {
 		cond_resched();
 		pinned = pin_user_pages_fast(cur_base,
-- 
2.43.0




* Re: [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct
  2024-06-11 18:27 ` [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
@ 2024-06-12  8:09   ` Greg Kroah-Hartman
  2024-06-12 17:29   ` Tejun Heo
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Greg Kroah-Hartman @ 2024-06-12  8:09 UTC (permalink / raw)
  To: Martin Oliveira
  Cc: linux-rdma, linux-kernel, linux-mm, Jason Gunthorpe,
	Leon Romanovsky, Tejun Heo, Andrew Morton, Logan Gunthorpe,
	Mike Marciniszyn, Shiraz Saleem, Michael Guralnik,
	Artemy Kovalyov

On Tue, Jun 11, 2024 at 12:27:29PM -0600, Martin Oliveira wrote:
> The .page_mkwrite operator of kernfs just calls file_update_time().
> This is the same behaviour that the fault code does if .page_mkwrite is
> not set.
> 
> Furthermore, having the page_mkwrite() operator causes
> writable_file_mapping_allowed() to fail due to
> vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
> enabling P2PDMA over RDMA.
> 
> There are no users of .page_mkwrite and no known valid use cases, so
> just remove the .page_mkwrite from kernfs_ops and return -EINVAL if an
> mmap() implementation sets .page_mkwrite.
> 
> Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
> ---
>  fs/kernfs/file.c | 26 +++-----------------------
>  1 file changed, 3 insertions(+), 23 deletions(-)

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>



* Re: [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct
  2024-06-11 18:27 ` [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
  2024-06-12  8:09   ` Greg Kroah-Hartman
@ 2024-06-12 17:29   ` Tejun Heo
  2024-06-13  5:15   ` Dan Carpenter
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2024-06-12 17:29 UTC (permalink / raw)
  To: Martin Oliveira
  Cc: linux-rdma, linux-kernel, linux-mm, Jason Gunthorpe,
	Leon Romanovsky, Greg Kroah-Hartman, Andrew Morton,
	Logan Gunthorpe, Mike Marciniszyn, Shiraz Saleem,
	Michael Guralnik, Artemy Kovalyov

On Tue, Jun 11, 2024 at 12:27:29PM -0600, Martin Oliveira wrote:
> The .page_mkwrite operator of kernfs just calls file_update_time().
> This is the same behaviour that the fault code does if .page_mkwrite is
> not set.
> 
> Furthermore, having the page_mkwrite() operator causes
> writable_file_mapping_allowed() to fail due to
> vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
> enabling P2PDMA over RDMA.
> 
> There are no users of .page_mkwrite and no known valid use cases, so
> just remove the .page_mkwrite from kernfs_ops and return -EINVAL if an
> mmap() implementation sets .page_mkwrite.
> 
> Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun



* Re: [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct
  2024-06-11 18:27 ` [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
  2024-06-12  8:09   ` Greg Kroah-Hartman
  2024-06-12 17:29   ` Tejun Heo
@ 2024-06-13  5:15   ` Dan Carpenter
  2024-06-13  5:44   ` Christoph Hellwig
  2024-06-15  2:32   ` John Hubbard
  4 siblings, 0 replies; 13+ messages in thread
From: Dan Carpenter @ 2024-06-13  5:15 UTC (permalink / raw)
  To: oe-kbuild, Martin Oliveira, linux-rdma, linux-kernel, linux-mm
  Cc: lkp, oe-kbuild-all, Jason Gunthorpe, Leon Romanovsky,
	Greg Kroah-Hartman, Tejun Heo, Andrew Morton,
	Linux Memory Management List, Logan Gunthorpe, Martin Oliveira,
	Mike Marciniszyn, Shiraz Saleem, Michael Guralnik,
	Artemy Kovalyov

Hi Martin,

kernel test robot noticed the following build warnings:

url:    https://github.com/intel-lab-lkp/linux/commits/Martin-Oliveira/kernfs-remove-page_mkwrite-from-vm_operations_struct/20240612-023130
base:   83a7eefedc9b56fe7bfeff13b6c7356688ffa670
patch link:    https://lore.kernel.org/r/20240611182732.360317-2-martin.oliveira%40eideticom.com
patch subject: [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct
config: i386-randconfig-141-20240612 (https://download.01.org/0day-ci/archive/20240613/202406130357.6NmgCbMP-lkp@intel.com/config)
compiler: gcc-12 (Ubuntu 12.3.0-9ubuntu2) 12.3.0

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
| Closes: https://lore.kernel.org/r/202406130357.6NmgCbMP-lkp@intel.com/

smatch warnings:
fs/kernfs/file.c:462 kernfs_fop_mmap() error: we previously assumed 'vma->vm_ops' could be null (see line 459)

vim +462 fs/kernfs/file.c

c637b8acbe079e Tejun Heo           2013-12-11  416  static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
414985ae23c031 Tejun Heo           2013-11-28  417  {
c525aaddc366df Tejun Heo           2013-12-11  418  	struct kernfs_open_file *of = kernfs_of(file);
414985ae23c031 Tejun Heo           2013-11-28  419  	const struct kernfs_ops *ops;
414985ae23c031 Tejun Heo           2013-11-28  420  	int rc;
414985ae23c031 Tejun Heo           2013-11-28  421  
9b2db6e1894577 Tejun Heo           2013-12-10  422  	/*
9b2db6e1894577 Tejun Heo           2013-12-10  423  	 * mmap path and of->mutex are prone to triggering spurious lockdep
9b2db6e1894577 Tejun Heo           2013-12-10  424  	 * warnings and we don't want to add spurious locking dependency
9b2db6e1894577 Tejun Heo           2013-12-10  425  	 * between the two.  Check whether mmap is actually implemented
9b2db6e1894577 Tejun Heo           2013-12-10  426  	 * without grabbing @of->mutex by testing HAS_MMAP flag.  See the
c810729fe6471a Ahelenia Ziemiańska 2023-12-21  427  	 * comment in kernfs_fop_open() for more details.
9b2db6e1894577 Tejun Heo           2013-12-10  428  	 */
df23fc39bce03b Tejun Heo           2013-12-11  429  	if (!(of->kn->flags & KERNFS_HAS_MMAP))
9b2db6e1894577 Tejun Heo           2013-12-10  430  		return -ENODEV;
9b2db6e1894577 Tejun Heo           2013-12-10  431  
414985ae23c031 Tejun Heo           2013-11-28  432  	mutex_lock(&of->mutex);
414985ae23c031 Tejun Heo           2013-11-28  433  
414985ae23c031 Tejun Heo           2013-11-28  434  	rc = -ENODEV;
c637b8acbe079e Tejun Heo           2013-12-11  435  	if (!kernfs_get_active(of->kn))
414985ae23c031 Tejun Heo           2013-11-28  436  		goto out_unlock;
414985ae23c031 Tejun Heo           2013-11-28  437  
324a56e16e44ba Tejun Heo           2013-12-11  438  	ops = kernfs_ops(of->kn);
414985ae23c031 Tejun Heo           2013-11-28  439  	rc = ops->mmap(of, vma);
b44b2140265ddf Tejun Heo           2014-04-20  440  	if (rc)
b44b2140265ddf Tejun Heo           2014-04-20  441  		goto out_put;
414985ae23c031 Tejun Heo           2013-11-28  442  
414985ae23c031 Tejun Heo           2013-11-28  443  	/*
414985ae23c031 Tejun Heo           2013-11-28  444  	 * PowerPC's pci_mmap of legacy_mem uses shmem_zero_setup()
414985ae23c031 Tejun Heo           2013-11-28  445  	 * to satisfy versions of X which crash if the mmap fails: that
414985ae23c031 Tejun Heo           2013-11-28  446  	 * substitutes a new vm_file, and we don't then want bin_vm_ops.
414985ae23c031 Tejun Heo           2013-11-28  447  	 */
414985ae23c031 Tejun Heo           2013-11-28  448  	if (vma->vm_file != file)
414985ae23c031 Tejun Heo           2013-11-28  449  		goto out_put;
414985ae23c031 Tejun Heo           2013-11-28  450  
414985ae23c031 Tejun Heo           2013-11-28  451  	rc = -EINVAL;
414985ae23c031 Tejun Heo           2013-11-28  452  	if (of->mmapped && of->vm_ops != vma->vm_ops)
414985ae23c031 Tejun Heo           2013-11-28  453  		goto out_put;
414985ae23c031 Tejun Heo           2013-11-28  454  
414985ae23c031 Tejun Heo           2013-11-28  455  	/*
414985ae23c031 Tejun Heo           2013-11-28  456  	 * It is not possible to successfully wrap close.
414985ae23c031 Tejun Heo           2013-11-28  457  	 * So error if someone is trying to use close.
414985ae23c031 Tejun Heo           2013-11-28  458  	 */
414985ae23c031 Tejun Heo           2013-11-28 @459  	if (vma->vm_ops && vma->vm_ops->close)
                                                            ^^^^^^^^^^^
If ->vm_ops is NULL

414985ae23c031 Tejun Heo           2013-11-28  460  		goto out_put;
414985ae23c031 Tejun Heo           2013-11-28  461  
927bb8d619fea4 Martin Oliveira     2024-06-11 @462  	if (vma->vm_ops->page_mkwrite)
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^
then we're in trouble

927bb8d619fea4 Martin Oliveira     2024-06-11  463  		goto out_put;
927bb8d619fea4 Martin Oliveira     2024-06-11  464  
414985ae23c031 Tejun Heo           2013-11-28  465  	rc = 0;
05d8f255867e31 Neel Natu           2024-01-27  466  	if (!of->mmapped) {
a1d82aff5df760 Tejun Heo           2016-12-27  467  		of->mmapped = true;
bdb2fd7fc56e19 Tejun Heo           2022-08-27  468  		of_on(of)->nr_mmapped++;
414985ae23c031 Tejun Heo           2013-11-28  469  		of->vm_ops = vma->vm_ops;
05d8f255867e31 Neel Natu           2024-01-27  470  	}
414985ae23c031 Tejun Heo           2013-11-28  471  	vma->vm_ops = &kernfs_vm_ops;
414985ae23c031 Tejun Heo           2013-11-28  472  out_put:
c637b8acbe079e Tejun Heo           2013-12-11  473  	kernfs_put_active(of->kn);
414985ae23c031 Tejun Heo           2013-11-28  474  out_unlock:
414985ae23c031 Tejun Heo           2013-11-28  475  	mutex_unlock(&of->mutex);
414985ae23c031 Tejun Heo           2013-11-28  476  
414985ae23c031 Tejun Heo           2013-11-28  477  	return rc;
414985ae23c031 Tejun Heo           2013-11-28  478  }

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki




* Re: [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct
  2024-06-11 18:27 ` [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
                     ` (2 preceding siblings ...)
  2024-06-13  5:15   ` Dan Carpenter
@ 2024-06-13  5:44   ` Christoph Hellwig
  2024-06-15  2:32   ` John Hubbard
  4 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2024-06-13  5:44 UTC (permalink / raw)
  To: Martin Oliveira
  Cc: linux-rdma, linux-kernel, linux-mm, Jason Gunthorpe,
	Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo, Andrew Morton,
	Logan Gunthorpe, Mike Marciniszyn, Shiraz Saleem,
	Michael Guralnik, Artemy Kovalyov

On Tue, Jun 11, 2024 at 12:27:29PM -0600, Martin Oliveira wrote:
> +	if (vma->vm_ops->page_mkwrite)
> +		goto out_put;
> +

I'd probably make this a WARN_ON so that driver authors trying to
add a page_mkwrite in the vm_ops passed to kernfs get a big fat
warning instead of spending a couple hours trying to track down
what is going wrong :)
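
Concretely, something along these lines (untested sketch, keeping the
NULL check on vm_ops that the test robot flagged):

	if (WARN_ON(vma->vm_ops && vma->vm_ops->page_mkwrite))
		goto out_put;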




* Re: [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct
  2024-06-11 18:27 ` [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
                     ` (3 preceding siblings ...)
  2024-06-13  5:44   ` Christoph Hellwig
@ 2024-06-15  2:32   ` John Hubbard
  4 siblings, 0 replies; 13+ messages in thread
From: John Hubbard @ 2024-06-15  2:32 UTC (permalink / raw)
  To: Martin Oliveira, linux-rdma, linux-kernel, linux-mm
  Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
	Andrew Morton, Logan Gunthorpe, Mike Marciniszyn, Shiraz Saleem,
	Michael Guralnik, Artemy Kovalyov

On 6/11/24 11:27 AM, Martin Oliveira wrote:
> The .page_mkwrite operator of kernfs just calls file_update_time().
> This is the same behaviour that the fault code does if .page_mkwrite is
> not set.
> 
> Furthermore, having the page_mkwrite() operator causes
> writable_file_mapping_allowed() to fail due to
> vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
> enabling P2PDMA over RDMA.
> 
> There are no users of .page_mkwrite and no known valid use cases, so
> just remove the .page_mkwrite from kernfs_ops and return -EINVAL if an
> mmap() implementation sets .page_mkwrite.

Hi Martin and Logan!

First of all, I admire this approach to solving one of the gup+filesystem
interaction problems, by coming in from the other direction. Neat. :)


> 
> Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
> ---
>   fs/kernfs/file.c | 26 +++-----------------------
>   1 file changed, 3 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> index 8502ef68459b9..a198cb0718772 100644
> --- a/fs/kernfs/file.c
> +++ b/fs/kernfs/file.c
> @@ -386,28 +386,6 @@ static vm_fault_t kernfs_vma_fault(struct vm_fault *vmf)
>   	return ret;
>   }
>   
> -static vm_fault_t kernfs_vma_page_mkwrite(struct vm_fault *vmf)
> -{
> -	struct file *file = vmf->vma->vm_file;
> -	struct kernfs_open_file *of = kernfs_of(file);
> -	vm_fault_t ret;
> -
> -	if (!of->vm_ops)
> -		return VM_FAULT_SIGBUS;
> -
> -	if (!kernfs_get_active(of->kn))
> -		return VM_FAULT_SIGBUS;
> -
> -	ret = 0;
> -	if (of->vm_ops->page_mkwrite)
> -		ret = of->vm_ops->page_mkwrite(vmf);
> -	else
> -		file_update_time(file);
> -
> -	kernfs_put_active(of->kn);
> -	return ret;
> -}
> -
>   static int kernfs_vma_access(struct vm_area_struct *vma, unsigned long addr,
>   			     void *buf, int len, int write)
>   {
> @@ -432,7 +410,6 @@ static int kernfs_vma_access(struct vm_area_struct *vma, unsigned long addr,
>   static const struct vm_operations_struct kernfs_vm_ops = {
>   	.open		= kernfs_vma_open,
>   	.fault		= kernfs_vma_fault,
> -	.page_mkwrite	= kernfs_vma_page_mkwrite,
>   	.access		= kernfs_vma_access,
>   };
>   
> @@ -482,6 +459,9 @@ static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
>   	if (vma->vm_ops && vma->vm_ops->close)
>   		goto out_put;
>   
> +	if (vma->vm_ops->page_mkwrite)

As the kernel test bot results imply, you probably want to do it like this:

    	if (vma->vm_ops && vma->vm_ops->page_mkwrite)


> +		goto out_put;
> +
>   	rc = 0;
>   	if (!of->mmapped) {
>   		of->mmapped = true;

thanks,
-- 
John Hubbard
NVIDIA




* Re: [PATCH v2 2/4] mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed()
  2024-06-11 18:27 ` [PATCH v2 2/4] mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed() Martin Oliveira
@ 2024-06-15  2:40   ` John Hubbard
  2024-06-26 22:23     ` Martin Oliveira
  0 siblings, 1 reply; 13+ messages in thread
From: John Hubbard @ 2024-06-15  2:40 UTC (permalink / raw)
  To: Martin Oliveira, linux-rdma, linux-kernel, linux-mm
  Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
	Andrew Morton, Logan Gunthorpe, Mike Marciniszyn, Shiraz Saleem,
	Michael Guralnik, Artemy Kovalyov

On 6/11/24 11:27 AM, Martin Oliveira wrote:
> folio_fast_pin_allowed() does not support ZONE_DEVICE pages because

s/folio_fast_pin_allowed/gup_fast_folio_allowed/ ?

> currently it is impossible for that type of page to be used with
> FOLL_LONGTERM. When this changes in a subsequent patch, this path will
> attempt to read the mapping of a ZONE_DEVICE page which is not valid.
> 
> Instead, allow ZONE_DEVICE pages explicitly seeing they shouldn't pose
> any problem with the fast path.
> 
> Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
> ---
>   mm/gup.c | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index ca0f5cedce9b2..00d0a77112f4f 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2847,6 +2847,10 @@ static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
>   	if (folio_test_hugetlb(folio))
>   		return true;
>   
> +	/* It makes no sense to access the mapping of ZONE_DEVICE pages */

This comment is very difficult, because it states that one cannot
do something, right before explicitly enabling something else. And the
reader is given little help on connecting the two.

And there are several subtypes of ZONE_DEVICE. Is it really true that
none of them can be mapped to user space? For p2p BAR1 mappings, those
actually go to user space, yes? Confused, need help. :)

> +	if (folio_is_zone_device(folio))
> +		return true;
> +
>   	/*
>   	 * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
>   	 * cannot proceed, which means no actions performed under RCU can

thanks,
-- 
John Hubbard
NVIDIA




* Re: [PATCH v2 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
  2024-06-11 18:27 ` [PATCH v2 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA Martin Oliveira
@ 2024-06-15  2:45   ` John Hubbard
  0 siblings, 0 replies; 13+ messages in thread
From: John Hubbard @ 2024-06-15  2:45 UTC (permalink / raw)
  To: Martin Oliveira, linux-rdma, linux-kernel, linux-mm
  Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
	Andrew Morton, Logan Gunthorpe, Mike Marciniszyn, Shiraz Saleem,
	Michael Guralnik, Artemy Kovalyov

On 6/11/24 11:27 AM, Martin Oliveira wrote:
> This check existed originally due to concerns that P2PDMA needed to copy
> fsdax until pgmap refcounts were fixed (see [1]).
> 
> The P2PDMA infrastructure will only call unmap_mapping_range() when the
> underlying device is unbound, and immediately after unmapping it waits
> for the reference of all ZONE_DEVICE pages to be released before
> continuing. This does not allow for a page to be reused and no user
> access fault is therefore possible. It does not have the same problem as
> fsdax.

This sounds great. I'm adding Dan Williams to Cc, in hopes of getting an
ack from him on this point.

> 
> The one minor concern with FOLL_LONGTERM pins is they will block device
> unbind until userspace releases them all.

That seems like a completely reasonable consequence of what you are
doing here, IMHO.

> 
> Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
> 
> [1]: https://lkml.kernel.org/r/Yy4Ot5MoOhsgYLTQ@ziepe.ca
> ---
>   mm/gup.c | 5 -----
>   1 file changed, 5 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 00d0a77112f4f..28060e41788d0 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2614,11 +2614,6 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
>   	if (WARN_ON_ONCE((gup_flags & (FOLL_GET | FOLL_PIN)) && !pages))
>   		return false;
>   
> -	/* We want to allow the pgmap to be hot-unplugged at all times */
> -	if (WARN_ON_ONCE((gup_flags & FOLL_LONGTERM) &&
> -			 (gup_flags & FOLL_PCI_P2PDMA)))
> -		return false;
> -

I am not immediately seeing anything wrong with this... :)


>   	*gup_flags_p = gup_flags;
>   	return true;
>   }

thanks,
-- 
John Hubbard
NVIDIA




* Re: [PATCH v2 2/4] mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed()
  2024-06-15  2:40   ` John Hubbard
@ 2024-06-26 22:23     ` Martin Oliveira
  0 siblings, 0 replies; 13+ messages in thread
From: Martin Oliveira @ 2024-06-26 22:23 UTC (permalink / raw)
  To: John Hubbard, linux-rdma, linux-kernel, linux-mm
  Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
	Andrew Morton, Logan Gunthorpe, Mike Marciniszyn, Shiraz Saleem,
	Michael Guralnik, Artemy Kovalyov, david.sloan

Hi John,

Thanks for your comments and sorry for the delayed response, I was off the
last few days.

On 2024-06-14 20:40, John Hubbard wrote:
> s/folio_fast_pin_allowed/gup_fast_folio_allowed/ ?

Nice catch! That function was renamed after the original work
we did on this series.

> This comment is very difficult, because it states that one cannot
> do something, right before explicitly enable something else. And the
> reader is given little help on connecting the two.
> 
> And there are several subtypes of ZONE_DEVICE. Is it really true that
> none of them can be mapped to user space? For p2p BAR1 mappings, those
> actually go to user space, yes? Confused, need help. :)
This is a fair point; I had only looked at p2p and can't say anything
about the other subtypes of ZONE_DEVICE.

For p2p, yes, they will go to userspace; however, folio->mapping was NULL
in those cases, and hence the fast path would reject them.

In any case, I think we could drop this patch for now and revisit when/if
required; the regular gup path was not measurably slower.

Thanks,
Martin


