* [PATCH v2 0/4] Enable P2PDMA in Userspace RDMA
@ 2024-06-11 18:27 Martin Oliveira
2024-06-11 18:27 ` [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: Martin Oliveira @ 2024-06-11 18:27 UTC (permalink / raw)
To: linux-rdma, linux-kernel, linux-mm
Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Andrew Morton, Logan Gunthorpe, Martin Oliveira,
Mike Marciniszyn, Shiraz Saleem, Michael Guralnik,
Artemy Kovalyov
This patch series enables P2PDMA memory to be used in userspace RDMA
transfers. With this series, P2PDMA memory mmapped into userspace (i.e.,
only NVMe CMBs, at the moment) can be used with ibv_reg_mr() (or
similar) interfaces. This can be tested by passing a sysfs p2pmem
allocator to the --mmap flag of the perftest tools.
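As a rough illustration of that test recipe, the fragment below looks for
a p2pmem allocator under sysfs and shows how it would be handed to
perftest. The sysfs layout and the perftest invocation are assumptions
for illustration only, not verified against any particular kernel or
perftest version:

```shell
# Sketch only: the p2pmem sysfs path and perftest flags are assumptions.
found=""
for alloc in /sys/bus/pci/devices/*/p2pmem/allocate; do
    [ -e "$alloc" ] || continue
    found="$alloc"
    echo "candidate p2pmem allocator: $alloc"
done
if [ -z "$found" ]; then
    echo "no p2pmem allocator found on this system"
fi
# A run would then look something like:
echo 'example: ib_write_bw --mmap=<allocator-path>'
```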
This requires addressing three issues:
* Stop exporting kernfs VMAs with page_mkwrite, which is incompatible
with FOLL_LONGTERM and is redundant since the default fault code has the
same behaviour as kernfs_vma_page_mkwrite() (i.e., calling
file_update_time()).
* Fix folio_fast_pin_allowed() path to take into account ZONE_DEVICE pages.
* Remove the restriction on FOLL_LONGTERM with FOLL_PCI_P2PDMA, which was
initially put in place out of excessive caution, on the assumption that
P2PDMA would have similar problems to fsdax with unmap_mapping_range().
Since P2PDMA only uses unmap_mapping_range() on device unbind and
immediately waits for all page reference counts to drop to zero after
calling it, it is believed to be safe from page reuse and user access
faults. See [1] for more discussion.
This was tested using a Mellanox ConnectX-6 SmartNIC (MT28908 Family),
using the mlx5_core driver, as well as an NVMe CMB.
Thanks,
Martin
[1]: https://lore.kernel.org/linux-mm/87cypuvh2i.fsf@nvdebian.thelocal/T/
--
Changes in v2:
- Remove page_mkwrite() for all kernfs, instead of creating a
different vm_ops for p2pdma.
--
Martin Oliveira (4):
kernfs: remove page_mkwrite() from vm_operations_struct
mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed()
mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
RDMA/umem: add support for P2P RDMA
drivers/infiniband/core/umem.c | 3 +++
fs/kernfs/file.c | 26 +++-----------------------
mm/gup.c | 9 ++++-----
3 files changed, 10 insertions(+), 28 deletions(-)
base-commit: 83a7eefedc9b56fe7bfeff13b6c7356688ffa670
--
2.43.0
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct
2024-06-11 18:27 [PATCH v2 0/4] Enable P2PDMA in Userspace RDMA Martin Oliveira
@ 2024-06-11 18:27 ` Martin Oliveira
2024-06-12 8:09 ` Greg Kroah-Hartman
` (4 more replies)
2024-06-11 18:27 ` [PATCH v2 2/4] mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed() Martin Oliveira
` (2 subsequent siblings)
3 siblings, 5 replies; 13+ messages in thread
From: Martin Oliveira @ 2024-06-11 18:27 UTC (permalink / raw)
To: linux-rdma, linux-kernel, linux-mm
Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Andrew Morton, Logan Gunthorpe, Martin Oliveira,
Mike Marciniszyn, Shiraz Saleem, Michael Guralnik,
Artemy Kovalyov
The .page_mkwrite operator of kernfs just calls file_update_time().
This is the same behaviour as the default fault code when .page_mkwrite
is not set.
Furthermore, having the page_mkwrite() operator causes
writable_file_mapping_allowed() to fail due to
vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
enabling P2PDMA over RDMA.
There are no users of .page_mkwrite and no known valid use cases, so
just remove .page_mkwrite from kernfs_vm_ops and return -EINVAL if an
mmap() implementation sets .page_mkwrite.
Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
---
fs/kernfs/file.c | 26 +++-----------------------
1 file changed, 3 insertions(+), 23 deletions(-)
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 8502ef68459b9..a198cb0718772 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -386,28 +386,6 @@ static vm_fault_t kernfs_vma_fault(struct vm_fault *vmf)
return ret;
}
-static vm_fault_t kernfs_vma_page_mkwrite(struct vm_fault *vmf)
-{
- struct file *file = vmf->vma->vm_file;
- struct kernfs_open_file *of = kernfs_of(file);
- vm_fault_t ret;
-
- if (!of->vm_ops)
- return VM_FAULT_SIGBUS;
-
- if (!kernfs_get_active(of->kn))
- return VM_FAULT_SIGBUS;
-
- ret = 0;
- if (of->vm_ops->page_mkwrite)
- ret = of->vm_ops->page_mkwrite(vmf);
- else
- file_update_time(file);
-
- kernfs_put_active(of->kn);
- return ret;
-}
-
static int kernfs_vma_access(struct vm_area_struct *vma, unsigned long addr,
void *buf, int len, int write)
{
@@ -432,7 +410,6 @@ static int kernfs_vma_access(struct vm_area_struct *vma, unsigned long addr,
static const struct vm_operations_struct kernfs_vm_ops = {
.open = kernfs_vma_open,
.fault = kernfs_vma_fault,
- .page_mkwrite = kernfs_vma_page_mkwrite,
.access = kernfs_vma_access,
};
@@ -482,6 +459,9 @@ static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
if (vma->vm_ops && vma->vm_ops->close)
goto out_put;
+ if (vma->vm_ops->page_mkwrite)
+ goto out_put;
+
rc = 0;
if (!of->mmapped) {
of->mmapped = true;
--
2.43.0
* [PATCH v2 2/4] mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed()
2024-06-11 18:27 [PATCH v2 0/4] Enable P2PDMA in Userspace RDMA Martin Oliveira
2024-06-11 18:27 ` [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
@ 2024-06-11 18:27 ` Martin Oliveira
2024-06-15 2:40 ` John Hubbard
2024-06-11 18:27 ` [PATCH v2 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA Martin Oliveira
2024-06-11 18:27 ` [PATCH v2 4/4] RDMA/umem: add support for P2P RDMA Martin Oliveira
3 siblings, 1 reply; 13+ messages in thread
From: Martin Oliveira @ 2024-06-11 18:27 UTC (permalink / raw)
To: linux-rdma, linux-kernel, linux-mm
Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Andrew Morton, Logan Gunthorpe, Martin Oliveira,
Mike Marciniszyn, Shiraz Saleem, Michael Guralnik,
Artemy Kovalyov
folio_fast_pin_allowed() does not support ZONE_DEVICE pages because
currently it is impossible for that type of page to be used with
FOLL_LONGTERM. When this changes in a subsequent patch, this path would
attempt to read the mapping of a ZONE_DEVICE page, which is not valid.
Instead, explicitly allow ZONE_DEVICE pages, since they shouldn't pose
any problem for the fast path.
Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
---
mm/gup.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/gup.c b/mm/gup.c
index ca0f5cedce9b2..00d0a77112f4f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2847,6 +2847,10 @@ static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
if (folio_test_hugetlb(folio))
return true;
+ /* It makes no sense to access the mapping of ZONE_DEVICE pages */
+ if (folio_is_zone_device(folio))
+ return true;
+
/*
* GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
* cannot proceed, which means no actions performed under RCU can
--
2.43.0
* [PATCH v2 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
2024-06-11 18:27 [PATCH v2 0/4] Enable P2PDMA in Userspace RDMA Martin Oliveira
2024-06-11 18:27 ` [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
2024-06-11 18:27 ` [PATCH v2 2/4] mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed() Martin Oliveira
@ 2024-06-11 18:27 ` Martin Oliveira
2024-06-15 2:45 ` John Hubbard
2024-06-11 18:27 ` [PATCH v2 4/4] RDMA/umem: add support for P2P RDMA Martin Oliveira
3 siblings, 1 reply; 13+ messages in thread
From: Martin Oliveira @ 2024-06-11 18:27 UTC (permalink / raw)
To: linux-rdma, linux-kernel, linux-mm
Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Andrew Morton, Logan Gunthorpe, Martin Oliveira,
Mike Marciniszyn, Shiraz Saleem, Michael Guralnik,
Artemy Kovalyov
This check existed originally due to concerns that P2PDMA needed to
mirror fsdax's restrictions until pgmap refcounts were fixed (see [1]).
The P2PDMA infrastructure will only call unmap_mapping_range() when the
underlying device is unbound, and immediately after unmapping it waits
for the references of all ZONE_DEVICE pages to be released before
continuing. This leaves no window for a page to be reused, so no user
access fault is possible; it does not have the same problem as fsdax.
The one minor concern with FOLL_LONGTERM pins is that they will block
device unbind until userspace releases them all.
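The ordering argument above — unbind unmaps, then blocks until every page
reference is dropped — can be illustrated with a small threading model.
This is a toy sketch of the invariant, not kernel code; the name
P2PMemRegion and its methods are invented for illustration:

```python
import threading

class P2PMemRegion:
    """Toy model of the invariant relied on above: unbind() blocks
    until all longterm pins are released, so a page can never be
    reused while userspace still holds a pin."""

    def __init__(self):
        self._pins = 0
        self._cv = threading.Condition()
        self.unmapped = False

    def pin_longterm(self):
        with self._cv:
            assert not self.unmapped    # no new pins once unmapped
            self._pins += 1

    def unpin(self):
        with self._cv:
            self._pins -= 1
            self._cv.notify_all()

    def unbind(self):
        with self._cv:
            self.unmapped = True        # models unmap_mapping_range()
            self._cv.wait_for(lambda: self._pins == 0)
        return "safe to tear down"
```

A driver-unbind thread calling unbind() only returns once every unpin()
has happened, which is exactly why a FOLL_LONGTERM pin can block unbind
indefinitely, the "minor concern" noted above.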
Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
[1]: https://lkml.kernel.org/r/Yy4Ot5MoOhsgYLTQ@ziepe.ca
---
mm/gup.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 00d0a77112f4f..28060e41788d0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2614,11 +2614,6 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
if (WARN_ON_ONCE((gup_flags & (FOLL_GET | FOLL_PIN)) && !pages))
return false;
- /* We want to allow the pgmap to be hot-unplugged at all times */
- if (WARN_ON_ONCE((gup_flags & FOLL_LONGTERM) &&
- (gup_flags & FOLL_PCI_P2PDMA)))
- return false;
-
*gup_flags_p = gup_flags;
return true;
}
--
2.43.0
* [PATCH v2 4/4] RDMA/umem: add support for P2P RDMA
2024-06-11 18:27 [PATCH v2 0/4] Enable P2PDMA in Userspace RDMA Martin Oliveira
` (2 preceding siblings ...)
2024-06-11 18:27 ` [PATCH v2 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA Martin Oliveira
@ 2024-06-11 18:27 ` Martin Oliveira
3 siblings, 0 replies; 13+ messages in thread
From: Martin Oliveira @ 2024-06-11 18:27 UTC (permalink / raw)
To: linux-rdma, linux-kernel, linux-mm
Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Andrew Morton, Logan Gunthorpe, Martin Oliveira,
Mike Marciniszyn, Shiraz Saleem, Michael Guralnik,
Artemy Kovalyov, Jason Gunthorpe
If the device supports P2PDMA, add the FOLL_PCI_P2PDMA flag to the GUP
flags.
This allows ibv_reg_mr() and friends to use P2PDMA memory that has been
mmapped into userspace for MRs in IB and RDMA transactions.
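For context, the userspace flow this enables looks roughly like the
C-like pseudocode below (illustrative only, not a complete program:
error handling is omitted, `pd` and `len` are assumed to exist, and the
p2pmem sysfs path is an assumption):

```c
/* Pseudocode sketch of the flow enabled by this patch: mmap() a P2PDMA
 * allocator file, then register the mapping as an MR. */
int fd = open("/sys/.../p2pmem/allocate", O_RDWR);   /* path illustrative */
void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                               IBV_ACCESS_LOCAL_WRITE |
                               IBV_ACCESS_REMOTE_READ |
                               IBV_ACCESS_REMOTE_WRITE);
/* With FOLL_PCI_P2PDMA now set in ib_umem_get(), pin_user_pages_fast()
 * can longterm-pin these ZONE_DEVICE pages. */
```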
Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/infiniband/core/umem.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 07c571c7b6999..b59bb6e1475e2 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -208,6 +208,9 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
if (umem->writable)
gup_flags |= FOLL_WRITE;
+ if (ib_dma_pci_p2p_dma_supported(device))
+ gup_flags |= FOLL_PCI_P2PDMA;
+
while (npages) {
cond_resched();
pinned = pin_user_pages_fast(cur_base,
--
2.43.0
* Re: [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct
2024-06-11 18:27 ` [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
@ 2024-06-12 8:09 ` Greg Kroah-Hartman
2024-06-12 17:29 ` Tejun Heo
` (3 subsequent siblings)
4 siblings, 0 replies; 13+ messages in thread
From: Greg Kroah-Hartman @ 2024-06-12 8:09 UTC (permalink / raw)
To: Martin Oliveira
Cc: linux-rdma, linux-kernel, linux-mm, Jason Gunthorpe,
Leon Romanovsky, Tejun Heo, Andrew Morton, Logan Gunthorpe,
Mike Marciniszyn, Shiraz Saleem, Michael Guralnik,
Artemy Kovalyov
On Tue, Jun 11, 2024 at 12:27:29PM -0600, Martin Oliveira wrote:
> The .page_mkwrite operator of kernfs just calls file_update_time().
> This is the same behaviour that the fault code does if .page_mkwrite is
> not set.
>
> Furthermore, having the page_mkwrite() operator causes
> writable_file_mapping_allowed() to fail due to
> vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
> enabling P2PDMA over RDMA.
>
> There are no users of .page_mkwrite and no known valid use cases, so
> just remove the .page_mkwrite from kernfs_ops and return -EINVAL if an
> mmap() implementation sets .page_mkwrite.
>
> Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
> ---
> fs/kernfs/file.c | 26 +++-----------------------
> 1 file changed, 3 insertions(+), 23 deletions(-)
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Re: [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct
2024-06-11 18:27 ` [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
2024-06-12 8:09 ` Greg Kroah-Hartman
@ 2024-06-12 17:29 ` Tejun Heo
2024-06-13 5:15 ` Dan Carpenter
` (2 subsequent siblings)
4 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2024-06-12 17:29 UTC (permalink / raw)
To: Martin Oliveira
Cc: linux-rdma, linux-kernel, linux-mm, Jason Gunthorpe,
Leon Romanovsky, Greg Kroah-Hartman, Andrew Morton,
Logan Gunthorpe, Mike Marciniszyn, Shiraz Saleem,
Michael Guralnik, Artemy Kovalyov
On Tue, Jun 11, 2024 at 12:27:29PM -0600, Martin Oliveira wrote:
> The .page_mkwrite operator of kernfs just calls file_update_time().
> This is the same behaviour that the fault code does if .page_mkwrite is
> not set.
>
> Furthermore, having the page_mkwrite() operator causes
> writable_file_mapping_allowed() to fail due to
> vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
> enabling P2PDMA over RDMA.
>
> There are no users of .page_mkwrite and no known valid use cases, so
> just remove the .page_mkwrite from kernfs_ops and return -EINVAL if an
> mmap() implementation sets .page_mkwrite.
>
> Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
Acked-by: Tejun Heo <tj@kernel.org>
Thanks.
--
tejun
* Re: [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct
2024-06-11 18:27 ` [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
2024-06-12 8:09 ` Greg Kroah-Hartman
2024-06-12 17:29 ` Tejun Heo
@ 2024-06-13 5:15 ` Dan Carpenter
2024-06-13 5:44 ` Christoph Hellwig
2024-06-15 2:32 ` John Hubbard
4 siblings, 0 replies; 13+ messages in thread
From: Dan Carpenter @ 2024-06-13 5:15 UTC (permalink / raw)
To: oe-kbuild, Martin Oliveira, linux-rdma, linux-kernel, linux-mm
Cc: lkp, oe-kbuild-all, Jason Gunthorpe, Leon Romanovsky,
Greg Kroah-Hartman, Tejun Heo, Andrew Morton,
Linux Memory Management List, Logan Gunthorpe, Martin Oliveira,
Mike Marciniszyn, Shiraz Saleem, Michael Guralnik,
Artemy Kovalyov
Hi Martin,
kernel test robot noticed the following build warnings:
url: https://github.com/intel-lab-lkp/linux/commits/Martin-Oliveira/kernfs-remove-page_mkwrite-from-vm_operations_struct/20240612-023130
base: 83a7eefedc9b56fe7bfeff13b6c7356688ffa670
patch link: https://lore.kernel.org/r/20240611182732.360317-2-martin.oliveira%40eideticom.com
patch subject: [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct
config: i386-randconfig-141-20240612 (https://download.01.org/0day-ci/archive/20240613/202406130357.6NmgCbMP-lkp@intel.com/config)
compiler: gcc-12 (Ubuntu 12.3.0-9ubuntu2) 12.3.0
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
| Closes: https://lore.kernel.org/r/202406130357.6NmgCbMP-lkp@intel.com/
smatch warnings:
fs/kernfs/file.c:462 kernfs_fop_mmap() error: we previously assumed 'vma->vm_ops' could be null (see line 459)
vim +462 fs/kernfs/file.c
c637b8acbe079e Tejun Heo 2013-12-11 416 static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
414985ae23c031 Tejun Heo 2013-11-28 417 {
c525aaddc366df Tejun Heo 2013-12-11 418 struct kernfs_open_file *of = kernfs_of(file);
414985ae23c031 Tejun Heo 2013-11-28 419 const struct kernfs_ops *ops;
414985ae23c031 Tejun Heo 2013-11-28 420 int rc;
414985ae23c031 Tejun Heo 2013-11-28 421
9b2db6e1894577 Tejun Heo 2013-12-10 422 /*
9b2db6e1894577 Tejun Heo 2013-12-10 423 * mmap path and of->mutex are prone to triggering spurious lockdep
9b2db6e1894577 Tejun Heo 2013-12-10 424 * warnings and we don't want to add spurious locking dependency
9b2db6e1894577 Tejun Heo 2013-12-10 425 * between the two. Check whether mmap is actually implemented
9b2db6e1894577 Tejun Heo 2013-12-10 426 * without grabbing @of->mutex by testing HAS_MMAP flag. See the
c810729fe6471a Ahelenia Ziemiańska 2023-12-21 427 * comment in kernfs_fop_open() for more details.
9b2db6e1894577 Tejun Heo 2013-12-10 428 */
df23fc39bce03b Tejun Heo 2013-12-11 429 if (!(of->kn->flags & KERNFS_HAS_MMAP))
9b2db6e1894577 Tejun Heo 2013-12-10 430 return -ENODEV;
9b2db6e1894577 Tejun Heo 2013-12-10 431
414985ae23c031 Tejun Heo 2013-11-28 432 mutex_lock(&of->mutex);
414985ae23c031 Tejun Heo 2013-11-28 433
414985ae23c031 Tejun Heo 2013-11-28 434 rc = -ENODEV;
c637b8acbe079e Tejun Heo 2013-12-11 435 if (!kernfs_get_active(of->kn))
414985ae23c031 Tejun Heo 2013-11-28 436 goto out_unlock;
414985ae23c031 Tejun Heo 2013-11-28 437
324a56e16e44ba Tejun Heo 2013-12-11 438 ops = kernfs_ops(of->kn);
414985ae23c031 Tejun Heo 2013-11-28 439 rc = ops->mmap(of, vma);
b44b2140265ddf Tejun Heo 2014-04-20 440 if (rc)
b44b2140265ddf Tejun Heo 2014-04-20 441 goto out_put;
414985ae23c031 Tejun Heo 2013-11-28 442
414985ae23c031 Tejun Heo 2013-11-28 443 /*
414985ae23c031 Tejun Heo 2013-11-28 444 * PowerPC's pci_mmap of legacy_mem uses shmem_zero_setup()
414985ae23c031 Tejun Heo 2013-11-28 445 * to satisfy versions of X which crash if the mmap fails: that
414985ae23c031 Tejun Heo 2013-11-28 446 * substitutes a new vm_file, and we don't then want bin_vm_ops.
414985ae23c031 Tejun Heo 2013-11-28 447 */
414985ae23c031 Tejun Heo 2013-11-28 448 if (vma->vm_file != file)
414985ae23c031 Tejun Heo 2013-11-28 449 goto out_put;
414985ae23c031 Tejun Heo 2013-11-28 450
414985ae23c031 Tejun Heo 2013-11-28 451 rc = -EINVAL;
414985ae23c031 Tejun Heo 2013-11-28 452 if (of->mmapped && of->vm_ops != vma->vm_ops)
414985ae23c031 Tejun Heo 2013-11-28 453 goto out_put;
414985ae23c031 Tejun Heo 2013-11-28 454
414985ae23c031 Tejun Heo 2013-11-28 455 /*
414985ae23c031 Tejun Heo 2013-11-28 456 * It is not possible to successfully wrap close.
414985ae23c031 Tejun Heo 2013-11-28 457 * So error if someone is trying to use close.
414985ae23c031 Tejun Heo 2013-11-28 458 */
414985ae23c031 Tejun Heo 2013-11-28 @459 if (vma->vm_ops && vma->vm_ops->close)
^^^^^^^^^^^
If ->vm_ops is NULL
414985ae23c031 Tejun Heo 2013-11-28 460 goto out_put;
414985ae23c031 Tejun Heo 2013-11-28 461
927bb8d619fea4 Martin Oliveira 2024-06-11 @462 if (vma->vm_ops->page_mkwrite)
^^^^^^^^^^^^^^^^^^^^^^^^^
then we're in trouble
927bb8d619fea4 Martin Oliveira 2024-06-11 463 goto out_put;
927bb8d619fea4 Martin Oliveira 2024-06-11 464
414985ae23c031 Tejun Heo 2013-11-28 465 rc = 0;
05d8f255867e31 Neel Natu 2024-01-27 466 if (!of->mmapped) {
a1d82aff5df760 Tejun Heo 2016-12-27 467 of->mmapped = true;
bdb2fd7fc56e19 Tejun Heo 2022-08-27 468 of_on(of)->nr_mmapped++;
414985ae23c031 Tejun Heo 2013-11-28 469 of->vm_ops = vma->vm_ops;
05d8f255867e31 Neel Natu 2024-01-27 470 }
414985ae23c031 Tejun Heo 2013-11-28 471 vma->vm_ops = &kernfs_vm_ops;
414985ae23c031 Tejun Heo 2013-11-28 472 out_put:
c637b8acbe079e Tejun Heo 2013-12-11 473 kernfs_put_active(of->kn);
414985ae23c031 Tejun Heo 2013-11-28 474 out_unlock:
414985ae23c031 Tejun Heo 2013-11-28 475 mutex_unlock(&of->mutex);
414985ae23c031 Tejun Heo 2013-11-28 476
414985ae23c031 Tejun Heo 2013-11-28 477 return rc;
414985ae23c031 Tejun Heo 2013-11-28 478 }
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct
2024-06-11 18:27 ` [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
` (2 preceding siblings ...)
2024-06-13 5:15 ` Dan Carpenter
@ 2024-06-13 5:44 ` Christoph Hellwig
2024-06-15 2:32 ` John Hubbard
4 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2024-06-13 5:44 UTC (permalink / raw)
To: Martin Oliveira
Cc: linux-rdma, linux-kernel, linux-mm, Jason Gunthorpe,
Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo, Andrew Morton,
Logan Gunthorpe, Mike Marciniszyn, Shiraz Saleem,
Michael Guralnik, Artemy Kovalyov
On Tue, Jun 11, 2024 at 12:27:29PM -0600, Martin Oliveira wrote:
> + if (vma->vm_ops->page_mkwrite)
> + goto out_put;
> +
I'd probably make this a WARN_ON so that driver authors trying to
add a page_mkwrite in the vm_ops passed to kernfs get a big fat
warning instead of spending a couple hours trying to track down
what is going wrong :)
* Re: [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct
2024-06-11 18:27 ` [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
` (3 preceding siblings ...)
2024-06-13 5:44 ` Christoph Hellwig
@ 2024-06-15 2:32 ` John Hubbard
4 siblings, 0 replies; 13+ messages in thread
From: John Hubbard @ 2024-06-15 2:32 UTC (permalink / raw)
To: Martin Oliveira, linux-rdma, linux-kernel, linux-mm
Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Andrew Morton, Logan Gunthorpe, Mike Marciniszyn, Shiraz Saleem,
Michael Guralnik, Artemy Kovalyov
On 6/11/24 11:27 AM, Martin Oliveira wrote:
> The .page_mkwrite operator of kernfs just calls file_update_time().
> This is the same behaviour that the fault code does if .page_mkwrite is
> not set.
>
> Furthermore, having the page_mkwrite() operator causes
> writable_file_mapping_allowed() to fail due to
> vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
> enabling P2PDMA over RDMA.
>
> There are no users of .page_mkwrite and no known valid use cases, so
> just remove the .page_mkwrite from kernfs_ops and return -EINVAL if an
> mmap() implementation sets .page_mkwrite.
Hi Martin and Logan!
First of all, I admire this approach to solving one of the gup+filesystem
interaction problems, by coming in from the other direction. Neat. :)
>
> Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
> ---
> fs/kernfs/file.c | 26 +++-----------------------
> 1 file changed, 3 insertions(+), 23 deletions(-)
>
> diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> index 8502ef68459b9..a198cb0718772 100644
> --- a/fs/kernfs/file.c
> +++ b/fs/kernfs/file.c
> @@ -386,28 +386,6 @@ static vm_fault_t kernfs_vma_fault(struct vm_fault *vmf)
> return ret;
> }
>
> -static vm_fault_t kernfs_vma_page_mkwrite(struct vm_fault *vmf)
> -{
> - struct file *file = vmf->vma->vm_file;
> - struct kernfs_open_file *of = kernfs_of(file);
> - vm_fault_t ret;
> -
> - if (!of->vm_ops)
> - return VM_FAULT_SIGBUS;
> -
> - if (!kernfs_get_active(of->kn))
> - return VM_FAULT_SIGBUS;
> -
> - ret = 0;
> - if (of->vm_ops->page_mkwrite)
> - ret = of->vm_ops->page_mkwrite(vmf);
> - else
> - file_update_time(file);
> -
> - kernfs_put_active(of->kn);
> - return ret;
> -}
> -
> static int kernfs_vma_access(struct vm_area_struct *vma, unsigned long addr,
> void *buf, int len, int write)
> {
> @@ -432,7 +410,6 @@ static int kernfs_vma_access(struct vm_area_struct *vma, unsigned long addr,
> static const struct vm_operations_struct kernfs_vm_ops = {
> .open = kernfs_vma_open,
> .fault = kernfs_vma_fault,
> - .page_mkwrite = kernfs_vma_page_mkwrite,
> .access = kernfs_vma_access,
> };
>
> @@ -482,6 +459,9 @@ static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
> if (vma->vm_ops && vma->vm_ops->close)
> goto out_put;
>
> + if (vma->vm_ops->page_mkwrite)
As the kernel test bot results imply, you probably want to do it like this:
if (vma->vm_ops && vma->vm_ops->page_mkwrite)
> + goto out_put;
> +
> rc = 0;
> if (!of->mmapped) {
> of->mmapped = true;
thanks,
--
John Hubbard
NVIDIA
* Re: [PATCH v2 2/4] mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed()
2024-06-11 18:27 ` [PATCH v2 2/4] mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed() Martin Oliveira
@ 2024-06-15 2:40 ` John Hubbard
2024-06-26 22:23 ` Martin Oliveira
0 siblings, 1 reply; 13+ messages in thread
From: John Hubbard @ 2024-06-15 2:40 UTC (permalink / raw)
To: Martin Oliveira, linux-rdma, linux-kernel, linux-mm
Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Andrew Morton, Logan Gunthorpe, Mike Marciniszyn, Shiraz Saleem,
Michael Guralnik, Artemy Kovalyov
On 6/11/24 11:27 AM, Martin Oliveira wrote:
> folio_fast_pin_allowed() does not support ZONE_DEVICE pages because
s/folio_fast_pin_allowed/gup_fast_folio_allowed/ ?
> currently it is impossible for that type of page to be used with
> FOLL_LONGTERM. When this changes in a subsequent patch, this path will
> attempt to read the mapping of a ZONE_DEVICE page which is not valid.
>
> Instead, allow ZONE_DEVICE pages explicitly seeing they shouldn't pose
> any problem with the fast path.
>
> Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
> ---
> mm/gup.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index ca0f5cedce9b2..00d0a77112f4f 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2847,6 +2847,10 @@ static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
> if (folio_test_hugetlb(folio))
> return true;
>
> + /* It makes no sense to access the mapping of ZONE_DEVICE pages */
This comment is very difficult, because it states that one cannot
do something, right before explicitly enabling something else. And the
reader is given little help connecting the two.
And there are several subtypes of ZONE_DEVICE. Is it really true that
none of them can be mapped to user space? For p2p BAR1 mappings, those
actually go to user space, yes? Confused, need help. :)
> + if (folio_is_zone_device(folio))
> + return true;
> +
> /*
> * GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
> * cannot proceed, which means no actions performed under RCU can
thanks,
--
John Hubbard
NVIDIA
* Re: [PATCH v2 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
2024-06-11 18:27 ` [PATCH v2 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA Martin Oliveira
@ 2024-06-15 2:45 ` John Hubbard
0 siblings, 0 replies; 13+ messages in thread
From: John Hubbard @ 2024-06-15 2:45 UTC (permalink / raw)
To: Martin Oliveira, linux-rdma, linux-kernel, linux-mm
Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Andrew Morton, Logan Gunthorpe, Mike Marciniszyn, Shiraz Saleem,
Michael Guralnik, Artemy Kovalyov
On 6/11/24 11:27 AM, Martin Oliveira wrote:
> This check existed originally due to concerns that P2PDMA needed to copy
> fsdax until pgmap refcounts were fixed (see [1]).
>
> The P2PDMA infrastructure will only call unmap_mapping_range() when the
> underlying device is unbound, and immediately after unmapping it waits
> for the reference of all ZONE_DEVICE pages to be released before
> continuing. This does not allow for a page to be reused and no user
> access fault is therefore possible. It does not have the same problem as
> fsdax.
This sounds great. I'm adding Dan Williams to Cc, in hopes of getting an
ack from him on this point.
>
> The one minor concern with FOLL_LONGTERM pins is they will block device
> unbind until userspace releases them all.
That seems like a completely reasonable consequence of what you are
doing here, IMHO.
>
> Co-developed-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
>
> [1]: https://lkml.kernel.org/r/Yy4Ot5MoOhsgYLTQ@ziepe.ca
> ---
> mm/gup.c | 5 -----
> 1 file changed, 5 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 00d0a77112f4f..28060e41788d0 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2614,11 +2614,6 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
> if (WARN_ON_ONCE((gup_flags & (FOLL_GET | FOLL_PIN)) && !pages))
> return false;
>
> - /* We want to allow the pgmap to be hot-unplugged at all times */
> - if (WARN_ON_ONCE((gup_flags & FOLL_LONGTERM) &&
> - (gup_flags & FOLL_PCI_P2PDMA)))
> - return false;
> -
I am not immediately seeing anything wrong with this... :)
> *gup_flags_p = gup_flags;
> return true;
> }
thanks,
--
John Hubbard
NVIDIA
* Re: [PATCH v2 2/4] mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed()
2024-06-15 2:40 ` John Hubbard
@ 2024-06-26 22:23 ` Martin Oliveira
0 siblings, 0 replies; 13+ messages in thread
From: Martin Oliveira @ 2024-06-26 22:23 UTC (permalink / raw)
To: John Hubbard, linux-rdma, linux-kernel, linux-mm
Cc: Jason Gunthorpe, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Andrew Morton, Logan Gunthorpe, Mike Marciniszyn, Shiraz Saleem,
Michael Guralnik, Artemy Kovalyov, david.sloan
Hi John,
Thanks for your comments and sorry for the delayed response, I was off the
last few days.
On 2024-06-14 20:40, John Hubbard wrote:
> s/folio_fast_pin_allowed/gup_fast_folio_allowed/ ?
Nice catch! That function got renamed after the original work
we did on this series.
> This comment is very difficult, because it states that one cannot
> do something, right before explicitly enable something else. And the
> reader is given little help on connecting the two.
>
> And there are several subtypes of ZONE_DEVICE. Is it really true that
> none of them can be mapped to user space? For p2p BAR1 mappings, those
> actually go to user space, yes? Confused, need help. :)
This is a fair point, I had only looked at p2p but I can't say anything
about the other subtypes of ZONE_DEVICE.
For p2p, yes, they will go to userspace, however folio->mapping was NULL
for those cases. And hence fast path would reject it.
In any case, I think we could drop this patch for now and revisit it
when/if required; the regular gup path was not measurably slower.
Thanks,
Martin
Thread overview: 13+ messages
2024-06-11 18:27 [PATCH v2 0/4] Enable P2PDMA in Userspace RDMA Martin Oliveira
2024-06-11 18:27 ` [PATCH v2 1/4] kernfs: remove page_mkwrite() from vm_operations_struct Martin Oliveira
2024-06-12 8:09 ` Greg Kroah-Hartman
2024-06-12 17:29 ` Tejun Heo
2024-06-13 5:15 ` Dan Carpenter
2024-06-13 5:44 ` Christoph Hellwig
2024-06-15 2:32 ` John Hubbard
2024-06-11 18:27 ` [PATCH v2 2/4] mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed() Martin Oliveira
2024-06-15 2:40 ` John Hubbard
2024-06-26 22:23 ` Martin Oliveira
2024-06-11 18:27 ` [PATCH v2 3/4] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA Martin Oliveira
2024-06-15 2:45 ` John Hubbard
2024-06-11 18:27 ` [PATCH v2 4/4] RDMA/umem: add support for P2P RDMA Martin Oliveira