* [PATCH 01/13] PCI/P2PDMA: Release the per-cpu ref of pgmap when vm_insert_page() fails
2025-12-20 4:04 [PATCH 00/13] Enable compound page for p2pdma memory Hou Tao
@ 2025-12-20 4:04 ` Hou Tao
2025-12-22 16:49 ` Logan Gunthorpe
2025-12-20 4:04 ` [PATCH 02/13] PCI/P2PDMA: Fix the warning condition in p2pmem_alloc_mmap() Hou Tao
` (12 subsequent siblings)
13 siblings, 1 reply; 25+ messages in thread
From: Hou Tao @ 2025-12-20 4:04 UTC (permalink / raw)
To: linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Logan Gunthorpe,
Alistair Popple, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
From: Hou Tao <houtao1@huawei.com>
When vm_insert_page() fails in p2pmem_alloc_mmap(), p2pmem_alloc_mmap()
doesn't invoke percpu_ref_put() to free the per-cpu ref of pgmap
acquired after gen_pool_alloc_owner(), and memunmap_pages() will hang
forever when trying to remove the PCIe device.
Fix it by adding the missed percpu_ref_put().
Fixes: 7e9c7ef83d78 ("PCI/P2PDMA: Allow userspace VMA allocations through sysfs")
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
drivers/pci/p2pdma.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4a2fc7ab42c3..218c1f5252b6 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -152,6 +152,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
ret = vm_insert_page(vma, vaddr, page);
if (ret) {
gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
+ percpu_ref_put(ref);
return ret;
}
percpu_ref_get(ref);
--
2.29.2
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 01/13] PCI/P2PDMA: Release the per-cpu ref of pgmap when vm_insert_page() fails
2025-12-20 4:04 ` [PATCH 01/13] PCI/P2PDMA: Release the per-cpu ref of pgmap when vm_insert_page() fails Hou Tao
@ 2025-12-22 16:49 ` Logan Gunthorpe
0 siblings, 0 replies; 25+ messages in thread
From: Logan Gunthorpe @ 2025-12-22 16:49 UTC (permalink / raw)
To: Hou Tao, linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Alistair Popple,
Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
On 2025-12-19 21:04, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> When vm_insert_page() fails in p2pmem_alloc_mmap(), p2pmem_alloc_mmap()
> doesn't invoke percpu_ref_put() to free the per-cpu ref of pgmap
> acquired after gen_pool_alloc_owner(), and memunmap_pages() will hang
> forever when trying to remove the PCIe device.
>
> Fix it by adding the missed percpu_ref_put().
>
> Fixes: 7e9c7ef83d78 ("PCI/P2PDMA: Allow userspace VMA allocations through sysfs")
> Signed-off-by: Hou Tao <houtao1@huawei.com>
Nice catch, thanks:
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Logan
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 02/13] PCI/P2PDMA: Fix the warning condition in p2pmem_alloc_mmap()
2025-12-20 4:04 [PATCH 00/13] Enable compound page for p2pdma memory Hou Tao
2025-12-20 4:04 ` [PATCH 01/13] PCI/P2PDMA: Release the per-cpu ref of pgmap when vm_insert_page() fails Hou Tao
@ 2025-12-20 4:04 ` Hou Tao
2025-12-22 16:50 ` Logan Gunthorpe
2025-12-20 4:04 ` [PATCH 03/13] kernfs: add support for get_unmapped_area callback Hou Tao
` (11 subsequent siblings)
13 siblings, 1 reply; 25+ messages in thread
From: Hou Tao @ 2025-12-20 4:04 UTC (permalink / raw)
To: linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Logan Gunthorpe,
Alistair Popple, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
From: Hou Tao <houtao1@huawei.com>
Commit b7e282378773 has already changed the initial page refcount of
p2pdma page from one to zero, however, in p2pmem_alloc_mmap() it uses
"VM_WARN_ON_ONCE_PAGE(!page_ref_count(page))" to assert the initial page
refcount should not be zero and the following will be reported when
CONFIG_DEBUG_VM is enabled:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x380400000
flags: 0x20000000002000(reserved|node=0|zone=4)
raw: 0020000000002000 ff1100015e3ab440 0000000000000000 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: VM_WARN_ON_ONCE_PAGE(!page_ref_count(page))
------------[ cut here ]------------
WARNING: CPU: 5 PID: 449 at drivers/pci/p2pdma.c:240 p2pmem_alloc_mmap+0x83a/0xa60
Fix by using "page_ref_count(page)" as the assertion condition.
Fixes: b7e282378773 ("mm/mm_init: move p2pdma page refcount initialisation to p2pdma")
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
drivers/pci/p2pdma.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 218c1f5252b6..dd64ec830fdd 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -147,7 +147,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
* we have just allocated the page no one else should be
* using it.
*/
- VM_WARN_ON_ONCE_PAGE(!page_ref_count(page), page);
+ VM_WARN_ON_ONCE_PAGE(page_ref_count(page), page);
set_page_count(page, 1);
ret = vm_insert_page(vma, vaddr, page);
if (ret) {
--
2.29.2
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 02/13] PCI/P2PDMA: Fix the warning condition in p2pmem_alloc_mmap()
2025-12-20 4:04 ` [PATCH 02/13] PCI/P2PDMA: Fix the warning condition in p2pmem_alloc_mmap() Hou Tao
@ 2025-12-22 16:50 ` Logan Gunthorpe
0 siblings, 0 replies; 25+ messages in thread
From: Logan Gunthorpe @ 2025-12-22 16:50 UTC (permalink / raw)
To: Hou Tao, linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Alistair Popple,
Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
On 2025-12-19 21:04, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Commit b7e282378773 has already changed the initial page refcount of
> p2pdma page from one to zero, however, in p2pmem_alloc_mmap() it uses
> "VM_WARN_ON_ONCE_PAGE(!page_ref_count(page))" to assert the initial page
> refcount should not be zero and the following will be reported when
> CONFIG_DEBUG_VM is enabled:
>
> page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x380400000
> flags: 0x20000000002000(reserved|node=0|zone=4)
> raw: 0020000000002000 ff1100015e3ab440 0000000000000000 0000000000000000
> raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> page dumped because: VM_WARN_ON_ONCE_PAGE(!page_ref_count(page))
> ------------[ cut here ]------------
> WARNING: CPU: 5 PID: 449 at drivers/pci/p2pdma.c:240 p2pmem_alloc_mmap+0x83a/0xa60
>
> Fix by using "page_ref_count(page)" as the assertion condition.
>
> Fixes: b7e282378773 ("mm/mm_init: move p2pdma page refcount initialisation to p2pdma")
> Signed-off-by: Hou Tao <houtao1@huawei.com>
Thanks for the fix
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Logan
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 03/13] kernfs: add support for get_unmapped_area callback
2025-12-20 4:04 [PATCH 00/13] Enable compound page for p2pdma memory Hou Tao
2025-12-20 4:04 ` [PATCH 01/13] PCI/P2PDMA: Release the per-cpu ref of pgmap when vm_insert_page() fails Hou Tao
2025-12-20 4:04 ` [PATCH 02/13] PCI/P2PDMA: Fix the warning condition in p2pmem_alloc_mmap() Hou Tao
@ 2025-12-20 4:04 ` Hou Tao
2025-12-20 15:43 ` kernel test robot
2025-12-20 15:57 ` kernel test robot
2025-12-20 4:04 ` [PATCH 04/13] kernfs: add support for may_split and pagesize callbacks Hou Tao
` (10 subsequent siblings)
13 siblings, 2 replies; 25+ messages in thread
From: Hou Tao @ 2025-12-20 4:04 UTC (permalink / raw)
To: linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Logan Gunthorpe,
Alistair Popple, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
From: Hou Tao <houtao1@huawei.com>
kernfs has already support ->mmap callback, however it doesn't support
->get_unmapped_area callback to return PMD-aligned or PUD-aligned
virtual address. The following patch will need it to support compound
page for p2pdma device memory, therefore add the necessary support for it.
When the ->get_unmapped_area callback is not defined in kernfs_ops or
the callback returns -EOPNOTSUPP, kernfs_get_unmapped_area() will
fallback to mm_get_unmapped_area().
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
fs/kernfs/file.c | 37 +++++++++++++++++++++++++++++++++++++
include/linux/kernfs.h | 3 +++
2 files changed, 40 insertions(+)
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 9adf36e6364b..9773b5734a2c 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -454,6 +454,39 @@ static const struct vm_operations_struct kernfs_vm_ops = {
.access = kernfs_vma_access,
};
+static unsigned long kernfs_get_unmapped_area(struct file *file, unsigned long uaddr,
+ unsigned long len, unsigned long pgoff,
+ unsigned long flags)
+{
+ struct kernfs_open_file *of = kernfs_of(file);
+ const struct kernfs_ops *ops;
+ long addr;
+
+ if (!(of->kn->flags & KERNFS_HAS_MMAP))
+ return -ENODEV;
+
+ mutex_lock(&of->mutex);
+
+ addr = -ENODEV;
+ if (!kernfs_get_active_of(of))
+ goto out_unlock;
+
+ ops = kernfs_ops(of->kn);
+ if (ops->get_unmapped_area) {
+ addr = ops->get_unmapped_area(of, uaddr, len, pgoff, flags);
+ if (!IS_ERR_VALUE(addr) || addr != -EOPNOTSUPP)
+ goto out_put;
+ }
+ addr = mm_get_unmapped_area(file, uaddr, len, pgoff, flags);
+
+out_put:
+ kernfs_put_active_of(of);
+out_unlock:
+ mutex_unlock(&of->mutex);
+
+ return addr;
+}
+
static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
{
struct kernfs_open_file *of = kernfs_of(file);
@@ -1017,6 +1050,7 @@ const struct file_operations kernfs_file_fops = {
.write_iter = kernfs_fop_write_iter,
.llseek = kernfs_fop_llseek,
.mmap = kernfs_fop_mmap,
+ .get_unmapped_area = kernfs_get_unmapped_area,
.open = kernfs_fop_open,
.release = kernfs_fop_release,
.poll = kernfs_fop_poll,
@@ -1052,6 +1086,9 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
unsigned flags;
int rc;
+ if (ops->get_unmapped_area && !ops->mmap)
+ return ERR_PTR(-EINVAL);
+
flags = KERNFS_FILE;
kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG,
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index b5a5f32fdfd1..9467b0a2b339 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -324,6 +324,9 @@ struct kernfs_ops {
int (*mmap)(struct kernfs_open_file *of, struct vm_area_struct *vma);
loff_t (*llseek)(struct kernfs_open_file *of, loff_t offset, int whence);
+ unsigned long (*get_unmapped_area)(struct kernfs_open_file *of, unsigned long uaddr,
+ unsigned long len, unsigned long pgoff,
+ unsigned long flags);
};
/*
--
2.29.2
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 03/13] kernfs: add support for get_unmapped_area callback
2025-12-20 4:04 ` [PATCH 03/13] kernfs: add support for get_unmapped_area callback Hou Tao
@ 2025-12-20 15:43 ` kernel test robot
2025-12-20 15:57 ` kernel test robot
1 sibling, 0 replies; 25+ messages in thread
From: kernel test robot @ 2025-12-20 15:43 UTC (permalink / raw)
To: Hou Tao, linux-kernel
Cc: oe-kbuild-all, linux-pci, linux-mm, linux-nvme, Bjorn Helgaas,
Logan Gunthorpe, Alistair Popple, Leon Romanovsky,
Greg Kroah-Hartman, Tejun Heo, Rafael J . Wysocki,
Danilo Krummrich, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Keith Busch, Jens Axboe, Christoph Hellwig,
Sagi Grimberg, houtao1
Hi Hou,
kernel test robot noticed the following build errors:
[auto build test ERROR on driver-core/driver-core-testing]
[also build test ERROR on driver-core/driver-core-next driver-core/driver-core-linus akpm-mm/mm-everything linus/master v6.19-rc1 next-20251219]
[cannot apply to pci/next pci/for-linus]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Hou-Tao/PCI-P2PDMA-Release-the-per-cpu-ref-of-pgmap-when-vm_insert_page-fails/20251220-121804
base: driver-core/driver-core-testing
patch link: https://lore.kernel.org/r/20251220040446.274991-4-houtao%40huaweicloud.com
patch subject: [PATCH 03/13] kernfs: add support for get_unmapped_area callback
config: sh-allnoconfig (https://download.01.org/0day-ci/archive/20251220/202512202307.ewUcqBQV-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251220/202512202307.ewUcqBQV-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512202307.ewUcqBQV-lkp@intel.com/
All errors (new ones prefixed by >>):
fs/kernfs/file.c: In function 'kernfs_get_unmapped_area':
>> fs/kernfs/file.c:480:16: error: implicit declaration of function 'mm_get_unmapped_area'; did you mean 'get_unmapped_area'? [-Wimplicit-function-declaration]
480 | addr = mm_get_unmapped_area(file, uaddr, len, pgoff, flags);
| ^~~~~~~~~~~~~~~~~~~~
| get_unmapped_area
vim +480 fs/kernfs/file.c
456
457 static unsigned long kernfs_get_unmapped_area(struct file *file, unsigned long uaddr,
458 unsigned long len, unsigned long pgoff,
459 unsigned long flags)
460 {
461 struct kernfs_open_file *of = kernfs_of(file);
462 const struct kernfs_ops *ops;
463 long addr;
464
465 if (!(of->kn->flags & KERNFS_HAS_MMAP))
466 return -ENODEV;
467
468 mutex_lock(&of->mutex);
469
470 addr = -ENODEV;
471 if (!kernfs_get_active_of(of))
472 goto out_unlock;
473
474 ops = kernfs_ops(of->kn);
475 if (ops->get_unmapped_area) {
476 addr = ops->get_unmapped_area(of, uaddr, len, pgoff, flags);
477 if (!IS_ERR_VALUE(addr) || addr != -EOPNOTSUPP)
478 goto out_put;
479 }
> 480 addr = mm_get_unmapped_area(file, uaddr, len, pgoff, flags);
481
482 out_put:
483 kernfs_put_active_of(of);
484 out_unlock:
485 mutex_unlock(&of->mutex);
486
487 return addr;
488 }
489
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 03/13] kernfs: add support for get_unmapped_area callback
2025-12-20 4:04 ` [PATCH 03/13] kernfs: add support for get_unmapped_area callback Hou Tao
2025-12-20 15:43 ` kernel test robot
@ 2025-12-20 15:57 ` kernel test robot
1 sibling, 0 replies; 25+ messages in thread
From: kernel test robot @ 2025-12-20 15:57 UTC (permalink / raw)
To: Hou Tao, linux-kernel
Cc: llvm, oe-kbuild-all, linux-pci, linux-mm, linux-nvme,
Bjorn Helgaas, Logan Gunthorpe, Alistair Popple, Leon Romanovsky,
Greg Kroah-Hartman, Tejun Heo, Rafael J . Wysocki,
Danilo Krummrich, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Keith Busch, Jens Axboe, Christoph Hellwig,
Sagi Grimberg, houtao1
Hi Hou,
kernel test robot noticed the following build errors:
[auto build test ERROR on driver-core/driver-core-testing]
[also build test ERROR on driver-core/driver-core-next driver-core/driver-core-linus akpm-mm/mm-everything linus/master v6.19-rc1 next-20251219]
[cannot apply to pci/next pci/for-linus]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Hou-Tao/PCI-P2PDMA-Release-the-per-cpu-ref-of-pgmap-when-vm_insert_page-fails/20251220-121804
base: driver-core/driver-core-testing
patch link: https://lore.kernel.org/r/20251220040446.274991-4-houtao%40huaweicloud.com
patch subject: [PATCH 03/13] kernfs: add support for get_unmapped_area callback
config: arm-allnoconfig (https://download.01.org/0day-ci/archive/20251220/202512202338.qFqw6FI8-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project b324c9f4fa112d61a553bf489b5f4f7ceea05ea8)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251220/202512202338.qFqw6FI8-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512202338.qFqw6FI8-lkp@intel.com/
All errors (new ones prefixed by >>):
>> fs/kernfs/file.c:480:9: error: call to undeclared function 'mm_get_unmapped_area'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
480 | addr = mm_get_unmapped_area(file, uaddr, len, pgoff, flags);
| ^
fs/kernfs/file.c:480:9: note: did you mean '__get_unmapped_area'?
include/linux/mm.h:3671:1: note: '__get_unmapped_area' declared here
3671 | __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
| ^
1 error generated.
vim +/mm_get_unmapped_area +480 fs/kernfs/file.c
456
457 static unsigned long kernfs_get_unmapped_area(struct file *file, unsigned long uaddr,
458 unsigned long len, unsigned long pgoff,
459 unsigned long flags)
460 {
461 struct kernfs_open_file *of = kernfs_of(file);
462 const struct kernfs_ops *ops;
463 long addr;
464
465 if (!(of->kn->flags & KERNFS_HAS_MMAP))
466 return -ENODEV;
467
468 mutex_lock(&of->mutex);
469
470 addr = -ENODEV;
471 if (!kernfs_get_active_of(of))
472 goto out_unlock;
473
474 ops = kernfs_ops(of->kn);
475 if (ops->get_unmapped_area) {
476 addr = ops->get_unmapped_area(of, uaddr, len, pgoff, flags);
477 if (!IS_ERR_VALUE(addr) || addr != -EOPNOTSUPP)
478 goto out_put;
479 }
> 480 addr = mm_get_unmapped_area(file, uaddr, len, pgoff, flags);
481
482 out_put:
483 kernfs_put_active_of(of);
484 out_unlock:
485 mutex_unlock(&of->mutex);
486
487 return addr;
488 }
489
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 04/13] kernfs: add support for may_split and pagesize callbacks
2025-12-20 4:04 [PATCH 00/13] Enable compound page for p2pdma memory Hou Tao
` (2 preceding siblings ...)
2025-12-20 4:04 ` [PATCH 03/13] kernfs: add support for get_unmapped_area callback Hou Tao
@ 2025-12-20 4:04 ` Hou Tao
2025-12-20 4:04 ` [PATCH 05/13] sysfs: support get_unmapped_area callback for binary file Hou Tao
` (9 subsequent siblings)
13 siblings, 0 replies; 25+ messages in thread
From: Hou Tao @ 2025-12-20 4:04 UTC (permalink / raw)
To: linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Logan Gunthorpe,
Alistair Popple, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
From: Hou Tao <houtao1@huawei.com>
->may_split() and ->pagesize() callbacks are necessary for the support
of compound page. ->may_split() will check whether the splitting of
compound page is allowed during mprotect or remap, and ->pagesize() will
output the correct page size in /proc/${pid}/smap file. These two
callbacks will be used by the following patch to enable the mapping of
compound page of p2pdma memory into userspace, therefore, add the
support for these two callbacks.
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
fs/kernfs/file.c | 42 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 9773b5734a2c..5df45b1dbb36 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -384,6 +384,46 @@ static void kernfs_vma_open(struct vm_area_struct *vma)
kernfs_put_active_of(of);
}
+static int kernfs_vma_may_split(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct file *file = vma->vm_file;
+ struct kernfs_open_file *of = kernfs_of(file);
+ int ret;
+
+ if (!of->vm_ops)
+ return 0;
+
+ if (!kernfs_get_active_of(of))
+ return -ENODEV;
+
+ ret = 0;
+ if (of->vm_ops->may_split)
+ ret = of->vm_ops->may_split(vma, addr);
+
+ kernfs_put_active_of(of);
+ return ret;
+}
+
+static unsigned long kernfs_vma_pagesize(struct vm_area_struct *vma)
+{
+ struct file *file = vma->vm_file;
+ struct kernfs_open_file *of = kernfs_of(file);
+ unsigned long size;
+
+ if (!of->vm_ops)
+ return PAGE_SIZE;
+
+ if (!kernfs_get_active_of(of))
+ return PAGE_SIZE;
+
+ size = PAGE_SIZE;
+ if (of->vm_ops->pagesize)
+ size = of->vm_ops->pagesize(vma);
+
+ kernfs_put_active_of(of);
+ return size;
+}
+
static vm_fault_t kernfs_vma_fault(struct vm_fault *vmf)
{
struct file *file = vmf->vma->vm_file;
@@ -449,9 +489,11 @@ static int kernfs_vma_access(struct vm_area_struct *vma, unsigned long addr,
static const struct vm_operations_struct kernfs_vm_ops = {
.open = kernfs_vma_open,
+ .may_split = kernfs_vma_may_split,
.fault = kernfs_vma_fault,
.page_mkwrite = kernfs_vma_page_mkwrite,
.access = kernfs_vma_access,
+ .pagesize = kernfs_vma_pagesize,
};
static unsigned long kernfs_get_unmapped_area(struct file *file, unsigned long uaddr,
--
2.29.2
^ permalink raw reply [flat|nested] 25+ messages in thread* [PATCH 05/13] sysfs: support get_unmapped_area callback for binary file
2025-12-20 4:04 [PATCH 00/13] Enable compound page for p2pdma memory Hou Tao
` (3 preceding siblings ...)
2025-12-20 4:04 ` [PATCH 04/13] kernfs: add support for may_split and pagesize callbacks Hou Tao
@ 2025-12-20 4:04 ` Hou Tao
2025-12-20 4:04 ` [PATCH 06/13] PCI/P2PDMA: add align parameter for pci_p2pdma_add_resource() Hou Tao
` (8 subsequent siblings)
13 siblings, 0 replies; 25+ messages in thread
From: Hou Tao @ 2025-12-20 4:04 UTC (permalink / raw)
To: linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Logan Gunthorpe,
Alistair Popple, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
From: Hou Tao <houtao1@huawei.com>
Add support for ->get_unmapped_area callback for binary sysfs file. The
following patch will use it to support compound page for p2pdma device
memory when the device memory is mapped into userspace.
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
fs/sysfs/file.c | 15 +++++++++++++++
include/linux/sysfs.h | 4 ++++
2 files changed, 19 insertions(+)
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 3825e780cc58..e843795ebdc2 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -164,6 +164,20 @@ static ssize_t sysfs_kf_bin_write(struct kernfs_open_file *of, char *buf,
return battr->write(of->file, kobj, battr, buf, pos, count);
}
+static unsigned long sysfs_kf_bin_get_unmapped_area(struct kernfs_open_file *of,
+ unsigned long uaddr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+ const struct bin_attribute *battr = of->kn->priv;
+ struct kobject *kobj;
+
+ if (!battr->get_unmapped_area)
+ return -EOPNOTSUPP;
+
+ kobj = sysfs_file_kobj(of->kn);
+ return battr->get_unmapped_area(of->file, kobj, battr, uaddr, len, pgoff, flags);
+}
+
static int sysfs_kf_bin_mmap(struct kernfs_open_file *of,
struct vm_area_struct *vma)
{
@@ -268,6 +282,7 @@ static const struct kernfs_ops sysfs_bin_kfops_mmap = {
.mmap = sysfs_kf_bin_mmap,
.open = sysfs_kf_bin_open,
.llseek = sysfs_kf_bin_llseek,
+ .get_unmapped_area = sysfs_kf_bin_get_unmapped_area,
};
int sysfs_add_file_mode_ns(struct kernfs_node *parent,
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index c33a96b7391a..f4a50f244f4d 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -321,6 +321,10 @@ struct bin_attribute {
loff_t, int);
int (*mmap)(struct file *, struct kobject *, const struct bin_attribute *attr,
struct vm_area_struct *vma);
+ unsigned long (*get_unmapped_area)(struct file *, struct kobject *,
+ const struct bin_attribute *attr,
+ unsigned long uaddr, unsigned long len,
+ unsigned long pgoff, unsigned long flags);
};
/**
--
2.29.2
^ permalink raw reply [flat|nested] 25+ messages in thread* [PATCH 06/13] PCI/P2PDMA: add align parameter for pci_p2pdma_add_resource()
2025-12-20 4:04 [PATCH 00/13] Enable compound page for p2pdma memory Hou Tao
` (4 preceding siblings ...)
2025-12-20 4:04 ` [PATCH 05/13] sysfs: support get_unmapped_area callback for binary file Hou Tao
@ 2025-12-20 4:04 ` Hou Tao
2025-12-20 4:04 ` [PATCH 07/13] PCI/P2PDMA: create compound page for aligned p2pdma memory Hou Tao
` (7 subsequent siblings)
13 siblings, 0 replies; 25+ messages in thread
From: Hou Tao @ 2025-12-20 4:04 UTC (permalink / raw)
To: linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Logan Gunthorpe,
Alistair Popple, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
From: Hou Tao <houtao1@huawei.com>
The align parameter is used to align both the mapping of p2p dma memory
and to enable the compound page for p2p dma memory in the kernel and in
the userspace.
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
drivers/accel/habanalabs/common/hldio.c | 3 +-
drivers/nvme/host/pci.c | 2 +-
drivers/pci/p2pdma.c | 38 +++++++++++++++++++++----
include/linux/pci-p2pdma.h | 4 +--
4 files changed, 37 insertions(+), 10 deletions(-)
diff --git a/drivers/accel/habanalabs/common/hldio.c b/drivers/accel/habanalabs/common/hldio.c
index 083ae5610875..4d1528dbde9f 100644
--- a/drivers/accel/habanalabs/common/hldio.c
+++ b/drivers/accel/habanalabs/common/hldio.c
@@ -372,7 +372,8 @@ int hl_p2p_region_init(struct hl_device *hdev, struct hl_p2p_region *p2pr)
int rc, i;
/* Start by publishing our p2p memory */
- rc = pci_p2pdma_add_resource(hdev->pdev, p2pr->bar, p2pr->size, p2pr->bar_offset);
+ rc = pci_p2pdma_add_resource(hdev->pdev, p2pr->bar, p2pr->size, PAGE_SIZE,
+ p2pr->bar_offset);
if (rc) {
dev_err(hdev->dev, "error adding p2p resource: %d\n", rc);
goto err;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 0e4caeab739c..b070095bae5e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2309,7 +2309,7 @@ static void nvme_map_cmb(struct nvme_dev *dev)
dev->bar + NVME_REG_CMBMSC);
}
- if (pci_p2pdma_add_resource(pdev, bar, size, offset)) {
+ if (pci_p2pdma_add_resource(pdev, bar, size, PAGE_SIZE, offset)) {
dev_warn(dev->ctrl.device,
"failed to register the CMB\n");
hi_lo_writeq(0, dev->bar + NVME_REG_CMBMSC);
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index dd64ec830fdd..70482e240304 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -23,6 +23,7 @@
struct pci_p2pdma {
struct gen_pool *pool;
+ size_t align;
bool p2pmem_published;
struct xarray map_types;
struct p2pdma_provider mem[PCI_STD_NUM_BARS];
@@ -211,7 +212,7 @@ static void p2pdma_folio_free(struct folio *folio)
struct percpu_ref *ref;
gen_pool_free_owner(p2pdma->pool, (uintptr_t)page_to_virt(page),
- PAGE_SIZE, (void **)&ref);
+ p2pdma->align, (void **)&ref);
percpu_ref_put(ref);
}
@@ -323,17 +324,22 @@ struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar)
}
EXPORT_SYMBOL_GPL(pcim_p2pdma_provider);
-static int pci_p2pdma_setup_pool(struct pci_dev *pdev)
+static int pci_p2pdma_setup_pool(struct pci_dev *pdev, size_t align)
{
struct pci_p2pdma *p2pdma;
int ret;
p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
- if (p2pdma->pool)
+ if (p2pdma->pool) {
+ /* Two p2pdma BARs with different alignment ? */
+ if (p2pdma->align != align)
+ return -EINVAL;
/* We already setup pools, do nothing, */
return 0;
+ }
- p2pdma->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
+ p2pdma->align = align;
+ p2pdma->pool = gen_pool_create(ilog2(p2pdma->align), dev_to_node(&pdev->dev));
if (!p2pdma->pool)
return -ENOMEM;
@@ -363,18 +369,31 @@ static void pci_p2pdma_unmap_mappings(void *data)
p2pmem_group.name);
}
+static inline int pci_p2pdma_check_pagemap_align(struct pci_dev *pdev, int bar,
+ u64 size, size_t align,
+ u64 offset)
+{
+ if (align == PAGE_SIZE)
+ return 0;
+ return -EINVAL;
+}
+
/**
* pci_p2pdma_add_resource - add memory for use as p2p memory
* @pdev: the device to add the memory to
* @bar: PCI BAR to add
* @size: size of the memory to add, may be zero to use the whole BAR
+ * @align: dev memory mapping alignment of the memory to add. It is used
+ * to optimize the mappings both in userspace and kernel space when
+ * transparent huge page is supported. The possible values are
+ * PAGE_SIZE, PMD_SIZE, and PUD_SIZE.
* @offset: offset into the PCI BAR
*
* The memory will be given ZONE_DEVICE struct pages so that it may
* be used with any DMA request.
*/
int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
- u64 offset)
+ size_t align, u64 offset)
{
struct pci_p2pdma_pagemap *p2p_pgmap;
struct p2pdma_provider *mem;
@@ -395,11 +414,18 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
if (size + offset > pci_resource_len(pdev, bar))
return -EINVAL;
+ error = pci_p2pdma_check_pagemap_align(pdev, bar, size, align, offset);
+ if (error) {
+ pci_info_ratelimited(pdev, "invalid align 0x%zx for bar %d\n",
+ align, bar);
+ return error;
+ }
+
error = pcim_p2pdma_init(pdev);
if (error)
return error;
- error = pci_p2pdma_setup_pool(pdev);
+ error = pci_p2pdma_setup_pool(pdev, align);
if (error)
return error;
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 517e121d2598..2fa671274c45 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -69,7 +69,7 @@ enum pci_p2pdma_map_type {
int pcim_p2pdma_init(struct pci_dev *pdev);
struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar);
int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
- u64 offset);
+ size_t align, u64 offset);
int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
int num_clients, bool verbose);
struct pci_dev *pci_p2pmem_find_many(struct device **clients, int num_clients);
@@ -97,7 +97,7 @@ static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev,
return NULL;
}
static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
- size_t size, u64 offset)
+ size_t size, size_t align, u64 offset)
{
return -EOPNOTSUPP;
}
--
2.29.2
^ permalink raw reply [flat|nested] 25+ messages in thread* [PATCH 07/13] PCI/P2PDMA: create compound page for aligned p2pdma memory
2025-12-20 4:04 [PATCH 00/13] Enable compound page for p2pdma memory Hou Tao
` (5 preceding siblings ...)
2025-12-20 4:04 ` [PATCH 06/13] PCI/P2PDMA: add align parameter for pci_p2pdma_add_resource() Hou Tao
@ 2025-12-20 4:04 ` Hou Tao
2025-12-20 4:04 ` [PATCH 08/13] mm/huge_memory: add helpers to insert huge page during mmap Hou Tao
` (6 subsequent siblings)
13 siblings, 0 replies; 25+ messages in thread
From: Hou Tao @ 2025-12-20 4:04 UTC (permalink / raw)
To: linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Logan Gunthorpe,
Alistair Popple, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
From: Hou Tao <houtao1@huawei.com>
Commit c4386bd8ee3a ("mm/memremap: add ZONE_DEVICE support for compound
pages") has already supported compound page for ZONE_DEVICE memory. It
not only decreases the memory overhead of ZONE_DEVICE memory through
the deduplication of vmemmap pages, it also optimize the performance of
get_user_pages when the memory is used for IO.
As for now, the alignment of p2p dma memory is already known, setting
vmemmap_shift accordingly to create compound page for p2pdma memory.
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
drivers/pci/p2pdma.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 70482e240304..7180dea4855c 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -447,6 +447,8 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
pgmap->nr_range = 1;
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
pgmap->ops = &p2pdma_pgmap_ops;
+ if (align > PAGE_SIZE)
+ pgmap->vmemmap_shift = ilog2(align) - PAGE_SHIFT;
p2p_pgmap->mem = mem;
addr = devm_memremap_pages(&pdev->dev, pgmap);
--
2.29.2
^ permalink raw reply [flat|nested] 25+ messages in thread* [PATCH 08/13] mm/huge_memory: add helpers to insert huge page during mmap
2025-12-20 4:04 [PATCH 00/13] Enable compound page for p2pdma memory Hou Tao
` (6 preceding siblings ...)
2025-12-20 4:04 ` [PATCH 07/13] PCI/P2PDMA: create compound page for aligned p2pdma memory Hou Tao
@ 2025-12-20 4:04 ` Hou Tao
2025-12-20 4:04 ` [PATCH 09/13] PCI/P2PDMA: support get_unmapped_area to return aligned vaddr Hou Tao
` (5 subsequent siblings)
13 siblings, 0 replies; 25+ messages in thread
From: Hou Tao @ 2025-12-20 4:04 UTC (permalink / raw)
To: linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Logan Gunthorpe,
Alistair Popple, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
From: Hou Tao <houtao1@huawei.com>
vmf_insert_folio_{pmd,pud}() can be used to insert huge page during page
fault. However, for simplicity, the mapping of p2pdma memory inserts all
necessary pages during mmap. Therefore, add vm_insert_folio_{pmd|pud}
helpers to support inserting pmd-sized and pud-sized page during mmap.
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
include/linux/huge_mm.h | 4 +++
mm/huge_memory.c | 66 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 70 insertions(+)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a4d9f964dfde..8cf8bb85be79 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -45,6 +45,10 @@ vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
bool write);
vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
bool write);
+int vm_insert_folio_pmd(struct vm_area_struct *vma, unsigned long addr,
+ struct folio *folio);
+int vm_insert_folio_pud(struct vm_area_struct *vma, unsigned long addr,
+ struct folio *folio);
enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_UNSUPPORTED,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40cf59301c21..11d19f8986da 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1644,6 +1644,41 @@ vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
}
EXPORT_SYMBOL_GPL(vmf_insert_folio_pmd);
+int vm_insert_folio_pmd(struct vm_area_struct *vma, unsigned long addr,
+ struct folio *folio)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct folio_or_pfn fop = {
+ .folio = folio,
+ .is_folio = true,
+ };
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ vm_fault_t fault_err;
+
+ mmap_assert_write_locked(mm);
+
+ pgd = pgd_offset(mm, addr);
+ p4d = p4d_alloc(mm, pgd, addr);
+ if (!p4d)
+ return -ENOMEM;
+ pud = pud_alloc(mm, p4d, addr);
+ if (!pud)
+ return -ENOMEM;
+ pmd = pmd_alloc(mm, pud, addr);
+ if (!pmd)
+ return -ENOMEM;
+
+ fault_err = insert_pmd(vma, addr, pmd, fop, vma->vm_page_prot,
+ vma->vm_flags & VM_WRITE);
+ if (fault_err != VM_FAULT_NOPAGE)
+ return -EINVAL;
+
+ return 0;
+}
+
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
{
@@ -1759,6 +1794,37 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
return insert_pud(vma, addr, vmf->pud, fop, vma->vm_page_prot, write);
}
EXPORT_SYMBOL_GPL(vmf_insert_folio_pud);
+
+int vm_insert_folio_pud(struct vm_area_struct *vma, unsigned long addr,
+ struct folio *folio)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct folio_or_pfn fop = {
+ .folio = folio,
+ .is_folio = true,
+ };
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ vm_fault_t fault_err;
+
+ mmap_assert_write_locked(mm);
+
+ pgd = pgd_offset(mm, addr);
+ p4d = p4d_alloc(mm, pgd, addr);
+ if (!p4d)
+ return -ENOMEM;
+ pud = pud_alloc(mm, p4d, addr);
+ if (!pud)
+ return -ENOMEM;
+
+ fault_err = insert_pud(vma, addr, pud, fop, vma->vm_page_prot,
+ vma->vm_flags & VM_WRITE);
+ if (fault_err != VM_FAULT_NOPAGE)
+ return -EINVAL;
+
+ return 0;
+}
#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
/**
--
2.29.2
^ permalink raw reply [flat|nested] 25+ messages in thread* [PATCH 09/13] PCI/P2PDMA: support get_unmapped_area to return aligned vaddr
2025-12-20 4:04 [PATCH 00/13] Enable compound page for p2pdma memory Hou Tao
` (7 preceding siblings ...)
2025-12-20 4:04 ` [PATCH 08/13] mm/huge_memory: add helpers to insert huge page during mmap Hou Tao
@ 2025-12-20 4:04 ` Hou Tao
2025-12-20 4:04 ` [PATCH 10/13] PCI/P2PDMA: support compound page in p2pmem_alloc_mmap() Hou Tao
` (4 subsequent siblings)
13 siblings, 0 replies; 25+ messages in thread
From: Hou Tao @ 2025-12-20 4:04 UTC (permalink / raw)
To: linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Logan Gunthorpe,
Alistair Popple, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
From: Hou Tao <houtao1@huawei.com>
P2PDMA memory already supports compound page. When mmapping the P2PDMA
memory in the userspace, the mmap procedure needs to use an aligned
virtual address to match the alignment of P2PDMA memory. Therefore,
implement get_unmapped_area for p2pdma memory to return an aligned
virtual address.
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
drivers/pci/p2pdma.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 7180dea4855c..e97f5da73458 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -90,6 +90,44 @@ static ssize_t published_show(struct device *dev, struct device_attribute *attr,
}
static DEVICE_ATTR_RO(published);
+static unsigned long p2pmem_get_unmapped_area(struct file *filp, struct kobject *kobj,
+ const struct bin_attribute *attr,
+ unsigned long uaddr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+ struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
+ struct pci_p2pdma *p2pdma;
+ unsigned long aligned_len;
+ unsigned long addr;
+ unsigned long align;
+
+ if (pgoff)
+ return -EINVAL;
+
+ rcu_read_lock();
+ p2pdma = rcu_dereference(pdev->p2pdma);
+ if (!p2pdma) {
+ rcu_read_unlock();
+ return -ENODEV;
+ }
+ align = p2pdma->align;
+ rcu_read_unlock();
+
+ /* Fixed address */
+ if (uaddr)
+ goto out;
+
+ aligned_len = len + align;
+ if (aligned_len < len)
+ goto out;
+
+ addr = mm_get_unmapped_area(filp, uaddr, aligned_len, pgoff, flags);
+ if (!IS_ERR_VALUE(addr))
+ return round_up(addr, align);
+out:
+ return mm_get_unmapped_area(filp, uaddr, len, pgoff, flags);
+}
+
static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
const struct bin_attribute *attr, struct vm_area_struct *vma)
{
@@ -175,6 +213,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
static const struct bin_attribute p2pmem_alloc_attr = {
.attr = { .name = "allocate", .mode = 0660 },
.mmap = p2pmem_alloc_mmap,
+ .get_unmapped_area = p2pmem_get_unmapped_area,
/*
* Some places where we want to call mmap (ie. python) will check
* that the file size is greater than the mmap size before allowing
--
2.29.2
^ permalink raw reply [flat|nested] 25+ messages in thread* [PATCH 10/13] PCI/P2PDMA: support compound page in p2pmem_alloc_mmap()
2025-12-20 4:04 [PATCH 00/13] Enable compound page for p2pdma memory Hou Tao
` (8 preceding siblings ...)
2025-12-20 4:04 ` [PATCH 09/13] PCI/P2PDMA: support get_unmapped_area to return aligned vaddr Hou Tao
@ 2025-12-20 4:04 ` Hou Tao
2025-12-22 17:04 ` Logan Gunthorpe
2025-12-20 4:04 ` [PATCH 11/13] PCI/P2PDMA: add helper pci_p2pdma_max_pagemap_align() Hou Tao
` (3 subsequent siblings)
13 siblings, 1 reply; 25+ messages in thread
From: Hou Tao @ 2025-12-20 4:04 UTC (permalink / raw)
To: linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Logan Gunthorpe,
Alistair Popple, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
From: Hou Tao <houtao1@huawei.com>
P2PDMA memory has already supported compound page and the helpers which
support inserting compound page into vma is also ready, therefore, add
support for compound page in p2pmem_alloc_mmap() as well. It will reduce
the overhead of mmap() and get_user_pages() a lot when compound page is
enabled for p2pdma memory.
The use of vm_private_data to save the alignment of p2pdma memory needs
explanation. The normal way to get the alignment is through pci_dev. It
can be achieved by either invoking kernfs_of() and sysfs_file_kobj() or
defining a new struct kernfs_vm_ops to pass the kobject to the
may_split() and ->pagesize() callbacks. The former approach depends too
much on kernfs implementation details, and the latter would lead to
excessive churn. Therefore, choose the simpler way of saving alignment
in vm_private_data instead.
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
drivers/pci/p2pdma.c | 48 ++++++++++++++++++++++++++++++++++++++++----
1 file changed, 44 insertions(+), 4 deletions(-)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index e97f5da73458..4a133219ac43 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -128,6 +128,25 @@ static unsigned long p2pmem_get_unmapped_area(struct file *filp, struct kobject
return mm_get_unmapped_area(filp, uaddr, len, pgoff, flags);
}
+static int p2pmem_may_split(struct vm_area_struct *vma, unsigned long addr)
+{
+ size_t align = (uintptr_t)vma->vm_private_data;
+
+ if (!IS_ALIGNED(addr, align))
+ return -EINVAL;
+ return 0;
+}
+
+static unsigned long p2pmem_pagesize(struct vm_area_struct *vma)
+{
+ return (uintptr_t)vma->vm_private_data;
+}
+
+static const struct vm_operations_struct p2pmem_vm_ops = {
+ .may_split = p2pmem_may_split,
+ .pagesize = p2pmem_pagesize,
+};
+
static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
const struct bin_attribute *attr, struct vm_area_struct *vma)
{
@@ -136,6 +155,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
struct pci_p2pdma *p2pdma;
struct percpu_ref *ref;
unsigned long vaddr;
+ size_t align;
void *kaddr;
int ret;
@@ -161,6 +181,16 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
goto out;
}
+ align = p2pdma->align;
+ if (vma->vm_start & (align - 1) || vma->vm_end & (align - 1)) {
+ pci_info_ratelimited(pdev,
+ "%s: unaligned vma (%#lx~%#lx, %#lx)\n",
+ current->comm, vma->vm_start, vma->vm_end,
+ align);
+ ret = -EINVAL;
+ goto out;
+ }
+
kaddr = (void *)gen_pool_alloc_owner(p2pdma->pool, len, (void **)&ref);
if (!kaddr) {
ret = -ENOMEM;
@@ -178,7 +208,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
}
rcu_read_unlock();
- for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
+ for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += align) {
struct page *page = virt_to_page(kaddr);
/*
@@ -188,7 +218,12 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
*/
VM_WARN_ON_ONCE_PAGE(page_ref_count(page), page);
set_page_count(page, 1);
- ret = vm_insert_page(vma, vaddr, page);
+ if (align == PUD_SIZE)
+ ret = vm_insert_folio_pud(vma, vaddr, page_folio(page));
+ else if (align == PMD_SIZE)
+ ret = vm_insert_folio_pmd(vma, vaddr, page_folio(page));
+ else
+ ret = vm_insert_page(vma, vaddr, page);
if (ret) {
gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
percpu_ref_put(ref);
@@ -196,10 +231,15 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
}
percpu_ref_get(ref);
put_page(page);
- kaddr += PAGE_SIZE;
- len -= PAGE_SIZE;
+ kaddr += align;
+ len -= align;
}
+ /* Disable unaligned splitting due to vma merge */
+ vm_flags_set(vma, VM_DONTEXPAND);
+ vma->vm_ops = &p2pmem_vm_ops;
+ vma->vm_private_data = (void *)(uintptr_t)align;
+
percpu_ref_put(ref);
return 0;
--
2.29.2
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 10/13] PCI/P2PDMA: support compound page in p2pmem_alloc_mmap()
2025-12-20 4:04 ` [PATCH 10/13] PCI/P2PDMA: support compound page in p2pmem_alloc_mmap() Hou Tao
@ 2025-12-22 17:04 ` Logan Gunthorpe
2025-12-24 2:20 ` Hou Tao
0 siblings, 1 reply; 25+ messages in thread
From: Logan Gunthorpe @ 2025-12-22 17:04 UTC (permalink / raw)
To: Hou Tao, linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Alistair Popple,
Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
On 2025-12-19 21:04, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> P2PDMA memory has already supported compound page and the helpers which
> support inserting compound page into vma is also ready, therefore, add
> support for compound page in p2pmem_alloc_mmap() as well. It will reduce
> the overhead of mmap() and get_user_pages() a lot when compound page is
> enabled for p2pdma memory.
>
> The use of vm_private_data to save the alignment of p2pdma memory needs
> explanation. The normal way to get the alignment is through pci_dev. It
> can be achieved by either invoking kernfs_of() and sysfs_file_kobj() or
> defining a new struct kernfs_vm_ops to pass the kobject to the
> may_split() and ->pagesize() callbacks. The former approach depends too
> much on kernfs implementation details, and the latter would lead to
> excessive churn. Therefore, choose the simpler way of saving alignment
> in vm_private_data instead.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
> drivers/pci/p2pdma.c | 48 ++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 44 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index e97f5da73458..4a133219ac43 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -128,6 +128,25 @@ static unsigned long p2pmem_get_unmapped_area(struct file *filp, struct kobject
> return mm_get_unmapped_area(filp, uaddr, len, pgoff, flags);
> }
>
> +static int p2pmem_may_split(struct vm_area_struct *vma, unsigned long addr)
> +{
> + size_t align = (uintptr_t)vma->vm_private_data;
> +
> + if (!IS_ALIGNED(addr, align))
> + return -EINVAL;
> + return 0;
> +}
> +
> +static unsigned long p2pmem_pagesize(struct vm_area_struct *vma)
> +{
> + return (uintptr_t)vma->vm_private_data;
> +}
> +
> +static const struct vm_operations_struct p2pmem_vm_ops = {
> + .may_split = p2pmem_may_split,
> + .pagesize = p2pmem_pagesize,
> +};
> +
> static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
> const struct bin_attribute *attr, struct vm_area_struct *vma)
> {
> @@ -136,6 +155,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
> struct pci_p2pdma *p2pdma;
> struct percpu_ref *ref;
> unsigned long vaddr;
> + size_t align;
> void *kaddr;
> int ret;
>
> @@ -161,6 +181,16 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
> goto out;
> }
>
> + align = p2pdma->align;
> + if (vma->vm_start & (align - 1) || vma->vm_end & (align - 1)) {
> + pci_info_ratelimited(pdev,
> + "%s: unaligned vma (%#lx~%#lx, %#lx)\n",
> + current->comm, vma->vm_start, vma->vm_end,
> + align);
> + ret = -EINVAL;
> + goto out;
> + }
I'm a bit confused by some aspects of these changes. Why does the
alignment become a property of the PCI device? It appears that if the
CPU supports different sized huge pages then the size and alignment
restrictions on P2PDMA memory become greater. So if someone is only
allocating a few KB these changes will break their code and refuse to
allocate single pages.
I would have expected this code to allocate an appropriately aligned
block of the p2p memory based on the requirements of the current
mapping, not based on alignment requirements established when the device
is probed.
Logan
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 10/13] PCI/P2PDMA: support compound page in p2pmem_alloc_mmap()
2025-12-22 17:04 ` Logan Gunthorpe
@ 2025-12-24 2:20 ` Hou Tao
0 siblings, 0 replies; 25+ messages in thread
From: Hou Tao @ 2025-12-24 2:20 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Alistair Popple,
Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
On 12/23/2025 1:04 AM, Logan Gunthorpe wrote:
>
> On 2025-12-19 21:04, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> P2PDMA memory has already supported compound page and the helpers which
>> support inserting compound page into vma is also ready, therefore, add
>> support for compound page in p2pmem_alloc_mmap() as well. It will reduce
>> the overhead of mmap() and get_user_pages() a lot when compound page is
>> enabled for p2pdma memory.
>>
>> The use of vm_private_data to save the alignment of p2pdma memory needs
>> explanation. The normal way to get the alignment is through pci_dev. It
>> can be achieved by either invoking kernfs_of() and sysfs_file_kobj() or
>> defining a new struct kernfs_vm_ops to pass the kobject to the
>> may_split() and ->pagesize() callbacks. The former approach depends too
>> much on kernfs implementation details, and the latter would lead to
>> excessive churn. Therefore, choose the simpler way of saving alignment
>> in vm_private_data instead.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>> drivers/pci/p2pdma.c | 48 ++++++++++++++++++++++++++++++++++++++++----
>> 1 file changed, 44 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index e97f5da73458..4a133219ac43 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -128,6 +128,25 @@ static unsigned long p2pmem_get_unmapped_area(struct file *filp, struct kobject
>> return mm_get_unmapped_area(filp, uaddr, len, pgoff, flags);
>> }
>>
>> +static int p2pmem_may_split(struct vm_area_struct *vma, unsigned long addr)
>> +{
>> + size_t align = (uintptr_t)vma->vm_private_data;
>> +
>> + if (!IS_ALIGNED(addr, align))
>> + return -EINVAL;
>> + return 0;
>> +}
>> +
>> +static unsigned long p2pmem_pagesize(struct vm_area_struct *vma)
>> +{
>> + return (uintptr_t)vma->vm_private_data;
>> +}
>> +
>> +static const struct vm_operations_struct p2pmem_vm_ops = {
>> + .may_split = p2pmem_may_split,
>> + .pagesize = p2pmem_pagesize,
>> +};
>> +
>> static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>> const struct bin_attribute *attr, struct vm_area_struct *vma)
>> {
>> @@ -136,6 +155,7 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>> struct pci_p2pdma *p2pdma;
>> struct percpu_ref *ref;
>> unsigned long vaddr;
>> + size_t align;
>> void *kaddr;
>> int ret;
>>
>> @@ -161,6 +181,16 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>> goto out;
>> }
>>
>> + align = p2pdma->align;
>> + if (vma->vm_start & (align - 1) || vma->vm_end & (align - 1)) {
>> + pci_info_ratelimited(pdev,
>> + "%s: unaligned vma (%#lx~%#lx, %#lx)\n",
>> + current->comm, vma->vm_start, vma->vm_end,
>> + align);
>> + ret = -EINVAL;
>> + goto out;
>> + }
> I'm a bit confused by some aspects of these changes. Why does the
> alignment become a property of the PCI device? It appears that if the
> CPU supports different sized huge pages then the size and alignment
> restrictions on P2PDMA memory become greater. So if someone is only
> allocating a few KB these changes will break their code and refuse to
> allocate single pages.
>
> I would have expected this code to allocate an appropriately aligned
> block of the p2p memory based on the requirements of the current
> mapping, not based on alignment requirements established when the device
> is probed.
The behavior mimics device-dax in which the creation of device-dax
device needs to specify the alignment property. Supporting different
alignments for different userspace mapping could work. However, it is no
way for the userspace to tell whether or not the the alignment is
mandatory. Take the below procedure as an example:
1) the size of CMB bar is 4MB
2) application 1 allocates 4KB. Its mapping is 4KB aligned
3) application 2 allocates 2MB. If the allocation from gen_pool is not
aligned, the mapping only supports 4KB-aligned mapping. If the
allocation support aligned allocation, the mapping could support
2MB-aligned mapping. However, the mmap implementation in the kernel
doesn't know which way is appropriate. If the alignment is specified in
the p2pdma, the implement could know the aligned 2MB mapping is appropriate.
> Logan
>
>
>
> .
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 11/13] PCI/P2PDMA: add helper pci_p2pdma_max_pagemap_align()
2025-12-20 4:04 [PATCH 00/13] Enable compound page for p2pdma memory Hou Tao
` (9 preceding siblings ...)
2025-12-20 4:04 ` [PATCH 10/13] PCI/P2PDMA: support compound page in p2pmem_alloc_mmap() Hou Tao
@ 2025-12-20 4:04 ` Hou Tao
2025-12-20 4:04 ` [PATCH 12/13] nvme-pci: introduce cmb_devmap_align module parameter Hou Tao
` (2 subsequent siblings)
13 siblings, 0 replies; 25+ messages in thread
From: Hou Tao @ 2025-12-20 4:04 UTC (permalink / raw)
To: linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Logan Gunthorpe,
Alistair Popple, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
From: Hou Tao <houtao1@huawei.com>
Add helper pci_p2pdma_max_pagemap_align() to find the max possible
alignment for p2p dma memory mapping in both userspace and kernel space.
When huge transparent page is supported, and the physical address, the
size of the BAR and the offset is {PUD|PMM}_SIZE aligned, it returns
{PUD|PMD_SIZE} accordingly. Otherwise, it returns PAGE_SIZE.
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
include/linux/pci-p2pdma.h | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 2fa671274c45..5d940b9e5338 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -210,4 +210,30 @@ pci_p2pdma_bus_addr_map(struct p2pdma_provider *provider, phys_addr_t paddr)
return paddr + provider->bus_offset;
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline size_t pci_p2pdma_max_pagemap_align(struct pci_dev *pdev, int bar,
+ u64 size, u64 offset)
+{
+ resource_size_t start = pci_resource_start(pdev, bar);
+
+ if (has_transparent_pud_hugepage() &&
+ IS_ALIGNED(start, PUD_SIZE) && IS_ALIGNED(size, PUD_SIZE) &&
+ IS_ALIGNED(offset, PUD_SIZE))
+ return PUD_SIZE;
+
+ if (has_transparent_hugepage() &&
+ IS_ALIGNED(start, PMD_SIZE) && IS_ALIGNED(size, PMD_SIZE) &&
+ IS_ALIGNED(offset, PMD_SIZE))
+ return PMD_SIZE;
+
+ return PAGE_SIZE;
+}
+#else
+static inline size_t pci_p2pdma_max_pagemap_align(resource_size_t start,
+ u64 size, u64 offset)
+{
+ return PAGE_SIZE;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
#endif /* _LINUX_PCI_P2P_H */
--
2.29.2
^ permalink raw reply [flat|nested] 25+ messages in thread* [PATCH 12/13] nvme-pci: introduce cmb_devmap_align module parameter
2025-12-20 4:04 [PATCH 00/13] Enable compound page for p2pdma memory Hou Tao
` (10 preceding siblings ...)
2025-12-20 4:04 ` [PATCH 11/13] PCI/P2PDMA: add helper pci_p2pdma_max_pagemap_align() Hou Tao
@ 2025-12-20 4:04 ` Hou Tao
2025-12-20 22:22 ` kernel test robot
2025-12-20 4:04 ` [PATCH 13/13] PCI/P2PDMA: enable compound page support for p2pdma memory Hou Tao
2025-12-21 12:19 ` [PATCH 00/13] Enable compound page " Leon Romanovsky
13 siblings, 1 reply; 25+ messages in thread
From: Hou Tao @ 2025-12-20 4:04 UTC (permalink / raw)
To: linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Logan Gunthorpe,
Alistair Popple, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
From: Hou Tao <houtao1@huawei.com>
P2PDMA memory has supported compound page. It is best to enable compound
page for p2pdma memory automatically accordingly to the address, the
size and the offset of the CMB, however, for nvme device, the p2pdma
memory may be used in the kernel space (e.g., for SQ entries) and it
will incur a lot of waste when a PUD or PMD-sized page is enabled for
nvme device.
Therefore, introduce a module parameter cmb_devmap_align to control the
alignment of p2pdma memory mapping. Its default value is PAGE_SIZE. When
its value is zero, it will use pci_p2pdma_max_pagemap_align() to find
the maximal possible mapping alignment.
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
drivers/nvme/host/pci.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b070095bae5e..ca0126e36834 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -79,6 +79,10 @@ static bool use_cmb_sqes = true;
module_param(use_cmb_sqes, bool, 0444);
MODULE_PARM_DESC(use_cmb_sqes, "use controller's memory buffer for I/O SQes");
+static unsigned long cmb_devmap_align = PAGE_SIZE;
+module_param(cmb_devmap_align, ulong, 0444);
+MODULE_PARM_DESC(cmb_devmap_align, "the mapping alignment of CMB");
+
static unsigned int max_host_mem_size_mb = 128;
module_param(max_host_mem_size_mb, uint, 0444);
MODULE_PARM_DESC(max_host_mem_size_mb,
@@ -2266,6 +2270,7 @@ static void nvme_map_cmb(struct nvme_dev *dev)
u64 size, offset;
resource_size_t bar_size;
struct pci_dev *pdev = to_pci_dev(dev->dev);
+ size_t align;
int bar;
if (dev->cmb_size)
@@ -2309,7 +2314,10 @@ static void nvme_map_cmb(struct nvme_dev *dev)
dev->bar + NVME_REG_CMBMSC);
}
- if (pci_p2pdma_add_resource(pdev, bar, size, PAGE_SIZE, offset)) {
+ align = cmb_devmap_align;
+ if (!align)
+ align = pci_p2pdma_max_pagemap_align(pdev, bar, size, offset);
+ if (pci_p2pdma_add_resource(pdev, bar, size, align, offset)) {
dev_warn(dev->ctrl.device,
"failed to register the CMB\n");
hi_lo_writeq(0, dev->bar + NVME_REG_CMBMSC);
--
2.29.2
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 12/13] nvme-pci: introduce cmb_devmap_align module parameter
2025-12-20 4:04 ` [PATCH 12/13] nvme-pci: introduce cmb_devmap_align module parameter Hou Tao
@ 2025-12-20 22:22 ` kernel test robot
0 siblings, 0 replies; 25+ messages in thread
From: kernel test robot @ 2025-12-20 22:22 UTC (permalink / raw)
To: Hou Tao, linux-kernel
Cc: oe-kbuild-all, linux-pci, linux-mm, linux-nvme, Bjorn Helgaas,
Logan Gunthorpe, Alistair Popple, Leon Romanovsky,
Greg Kroah-Hartman, Tejun Heo, Rafael J . Wysocki,
Danilo Krummrich, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Keith Busch, Jens Axboe, Christoph Hellwig,
Sagi Grimberg, houtao1
Hi Hou,
kernel test robot noticed the following build errors:
[auto build test ERROR on driver-core/driver-core-testing]
[also build test ERROR on driver-core/driver-core-next driver-core/driver-core-linus akpm-mm/mm-everything linus/master v6.19-rc1 next-20251219]
[cannot apply to pci/next pci/for-linus]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Hou-Tao/PCI-P2PDMA-Release-the-per-cpu-ref-of-pgmap-when-vm_insert_page-fails/20251220-121804
base: driver-core/driver-core-testing
patch link: https://lore.kernel.org/r/20251220040446.274991-13-houtao%40huaweicloud.com
patch subject: [PATCH 12/13] nvme-pci: introduce cmb_devmap_align module parameter
config: alpha-allyesconfig (https://download.01.org/0day-ci/archive/20251221/202512210635.b7EdhXBT-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251221/202512210635.b7EdhXBT-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512210635.b7EdhXBT-lkp@intel.com/
All errors (new ones prefixed by >>):
drivers/nvme/host/pci.c: In function 'nvme_map_cmb':
>> drivers/nvme/host/pci.c:2319:54: error: passing argument 1 of 'pci_p2pdma_max_pagemap_align' makes integer from pointer without a cast [-Wint-conversion]
2319 | align = pci_p2pdma_max_pagemap_align(pdev, bar, size, offset);
| ^~~~
| |
| struct pci_dev *
In file included from include/linux/blk-mq-dma.h:6,
from drivers/nvme/host/pci.c:10:
include/linux/pci-p2pdma.h:232:67: note: expected 'resource_size_t' {aka 'long long unsigned int'} but argument is of type 'struct pci_dev *'
232 | static inline size_t pci_p2pdma_max_pagemap_align(resource_size_t start,
| ~~~~~~~~~~~~~~~~^~~~~
>> drivers/nvme/host/pci.c:2319:25: error: too many arguments to function 'pci_p2pdma_max_pagemap_align'; expected 3, have 4
2319 | align = pci_p2pdma_max_pagemap_align(pdev, bar, size, offset);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~
include/linux/pci-p2pdma.h:232:22: note: declared here
232 | static inline size_t pci_p2pdma_max_pagemap_align(resource_size_t start,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
vim +/pci_p2pdma_max_pagemap_align +2319 drivers/nvme/host/pci.c
2267
2268 static void nvme_map_cmb(struct nvme_dev *dev)
2269 {
2270 u64 size, offset;
2271 resource_size_t bar_size;
2272 struct pci_dev *pdev = to_pci_dev(dev->dev);
2273 size_t align;
2274 int bar;
2275
2276 if (dev->cmb_size)
2277 return;
2278
2279 if (NVME_CAP_CMBS(dev->ctrl.cap))
2280 writel(NVME_CMBMSC_CRE, dev->bar + NVME_REG_CMBMSC);
2281
2282 dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
2283 if (!dev->cmbsz)
2284 return;
2285 dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
2286
2287 size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
2288 offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
2289 bar = NVME_CMB_BIR(dev->cmbloc);
2290 bar_size = pci_resource_len(pdev, bar);
2291
2292 if (offset > bar_size)
2293 return;
2294
2295 /*
2296 * Controllers may support a CMB size larger than their BAR, for
2297 * example, due to being behind a bridge. Reduce the CMB to the
2298 * reported size of the BAR
2299 */
2300 size = min(size, bar_size - offset);
2301
2302 if (!IS_ALIGNED(size, memremap_compat_align()) ||
2303 !IS_ALIGNED(pci_resource_start(pdev, bar),
2304 memremap_compat_align()))
2305 return;
2306
2307 /*
2308 * Tell the controller about the host side address mapping the CMB,
2309 * and enable CMB decoding for the NVMe 1.4+ scheme:
2310 */
2311 if (NVME_CAP_CMBS(dev->ctrl.cap)) {
2312 hi_lo_writeq(NVME_CMBMSC_CRE | NVME_CMBMSC_CMSE |
2313 (pci_bus_address(pdev, bar) + offset),
2314 dev->bar + NVME_REG_CMBMSC);
2315 }
2316
2317 align = cmb_devmap_align;
2318 if (!align)
> 2319 align = pci_p2pdma_max_pagemap_align(pdev, bar, size, offset);
2320 if (pci_p2pdma_add_resource(pdev, bar, size, align, offset)) {
2321 dev_warn(dev->ctrl.device,
2322 "failed to register the CMB\n");
2323 hi_lo_writeq(0, dev->bar + NVME_REG_CMBMSC);
2324 return;
2325 }
2326
2327 dev->cmb_size = size;
2328 dev->cmb_use_sqes = use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS);
2329
2330 if ((dev->cmbsz & (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS)) ==
2331 (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS))
2332 pci_p2pmem_publish(pdev, true);
2333 }
2334
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 13/13] PCI/P2PDMA: enable compound page support for p2pdma memory
2025-12-20 4:04 [PATCH 00/13] Enable compound page for p2pdma memory Hou Tao
` (11 preceding siblings ...)
2025-12-20 4:04 ` [PATCH 12/13] nvme-pci: introduce cmb_devmap_align module parameter Hou Tao
@ 2025-12-20 4:04 ` Hou Tao
2025-12-22 17:10 ` Logan Gunthorpe
2025-12-21 12:19 ` [PATCH 00/13] Enable compound page " Leon Romanovsky
13 siblings, 1 reply; 25+ messages in thread
From: Hou Tao @ 2025-12-20 4:04 UTC (permalink / raw)
To: linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Logan Gunthorpe,
Alistair Popple, Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
From: Hou Tao <houtao1@huawei.com>
Compound page support for P2PDMA memory in both kernel and user space is
now in place. Enable it by allowing PUD_SIZE and PMD_SIZE alignment.
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
drivers/pci/p2pdma.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4a133219ac43..969bdacdcf8b 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -452,9 +452,19 @@ static inline int pci_p2pdma_check_pagemap_align(struct pci_dev *pdev, int bar,
u64 size, size_t align,
u64 offset)
{
+ if (has_transparent_pud_hugepage() && align == PUD_SIZE)
+ goto more_check;
+ if (has_transparent_hugepage() && align == PMD_SIZE)
+ goto more_check;
if (align == PAGE_SIZE)
return 0;
return -EINVAL;
+
+more_check:
+ if (IS_ALIGNED(pci_resource_start(pdev, bar), align) &&
+ IS_ALIGNED(size, align) && IS_ALIGNED(offset, align))
+ return 0;
+ return -EINVAL;
}
/**
--
2.29.2
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 13/13] PCI/P2PDMA: enable compound page support for p2pdma memory
2025-12-20 4:04 ` [PATCH 13/13] PCI/P2PDMA: enable compound page support for p2pdma memory Hou Tao
@ 2025-12-22 17:10 ` Logan Gunthorpe
0 siblings, 0 replies; 25+ messages in thread
From: Logan Gunthorpe @ 2025-12-22 17:10 UTC (permalink / raw)
To: Hou Tao, linux-kernel
Cc: linux-pci, linux-mm, linux-nvme, Bjorn Helgaas, Alistair Popple,
Leon Romanovsky, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
On 2025-12-19 21:04, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Compound page support for P2PDMA memory in both kernel and user space is
> now in place. Enable it by allowing PUD_SIZE and PMD_SIZE alignment.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
> drivers/pci/p2pdma.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 4a133219ac43..969bdacdcf8b 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -452,9 +452,19 @@ static inline int pci_p2pdma_check_pagemap_align(struct pci_dev *pdev, int bar,
> u64 size, size_t align,
> u64 offset)
> {
> + if (has_transparent_pud_hugepage() && align == PUD_SIZE)
> + goto more_check;
> + if (has_transparent_hugepage() && align == PMD_SIZE)
> + goto more_check;
> if (align == PAGE_SIZE)
> return 0;
> return -EINVAL;
> +
> +more_check:
> + if (IS_ALIGNED(pci_resource_start(pdev, bar), align) &&
> + IS_ALIGNED(size, align) && IS_ALIGNED(offset, align))
> + return 0;
> + return -EINVAL;
> }
Again this seems strange. It's a bit unlikely to have a large BAR that
wouldn't be well aligned, but this change is now requiring all P2P
memory to be aligned to 1GB if the CPU supports PUDs. So if a particular
device only has a small (say 256MB) imperfectly aligned bar it may now
fail to be registered.
I don't think the alignment should be a property of the device. When a
mapping is created, if everything is aligned appropriately, and there is
enough free aligned P2PDMA memory, then it should map a full PUD page.
There shouldn't be other restrictions placed on the hardware to make
this work.
Logan
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 00/13] Enable compound page for p2pdma memory
2025-12-20 4:04 [PATCH 00/13] Enable compound page for p2pdma memory Hou Tao
` (12 preceding siblings ...)
2025-12-20 4:04 ` [PATCH 13/13] PCI/P2PDMA: enable compound page support for p2pdma memory Hou Tao
@ 2025-12-21 12:19 ` Leon Romanovsky
[not found] ` <416b2575-f5e7-7faf-9e7c-6e9df170bf1a@huaweicloud.com>
13 siblings, 1 reply; 25+ messages in thread
From: Leon Romanovsky @ 2025-12-21 12:19 UTC (permalink / raw)
To: Hou Tao
Cc: linux-kernel, linux-pci, linux-mm, linux-nvme, Bjorn Helgaas,
Logan Gunthorpe, Alistair Popple, Greg Kroah-Hartman, Tejun Heo,
Rafael J . Wysocki, Danilo Krummrich, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Keith Busch, Jens Axboe,
Christoph Hellwig, Sagi Grimberg, houtao1
On Sat, Dec 20, 2025 at 12:04:33PM +0800, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> Hi,
>
> device-dax has already supported compound page. It not only reduces the
> cost of struct page significantly, it also improve the performance of
> get_user_pages when 2MB or 1GB page size is used. We are experimenting
> to use p2p dma to directly transfer the content of NVMe SSD into NPU.
I’ll admit my understanding here is limited, and lately everything tends
to look like a DMABUF problem to me. Could you explain why DMABUF support
is not being used for this use case?
Thanks
> The size of NPU HBM is 32GB or larger and there are at most 8 NPUs in
> the host. When using the base page, the memory overhead is about 4GB for
> 128GB HBM, and the mapping of 32GB HBM into userspace takes about 0.8
> second. Considering ZONE_DEVICE memory type has already supported the
> compound page, enabling the compound page support for p2pdma memory as
> well. After applying the patch set, when using the 1GB page, the memory
> overhead is about 2MB and the mmap costs about 0.04 ms.
>
> The main difference between the compound page support of device-dax and
> p2pdma is that p2pdma inserts the page into user vma during mmap instead
> of page fault. The main reason is simplicity. The patch set is
> structured as shown below:
>
> Patch #1~#2: tiny bug fixes for p2pdma
> Patch #3~#5: add callbacks support in kernfs and sysfs, include
> pagesize, may_split and get_unmapped_area. These callbacks are necessary
> for the support of compound page when mmaping sysfs binary file.
> Patch #6~#7: create compound page for p2pdma memory in the kernel.
> Patch #8~#10: support the mapping of compound page in userspace.
> Patch #11~#12: support the compound page for NVMe CMB.
> Patch #13: enable the support for compound page for p2pdma memory.
>
> Please see individual patches for more details. Comments and
> suggestions are always welcome.
>
> Hou Tao (13):
> PCI/P2PDMA: Release the per-cpu ref of pgmap when vm_insert_page()
> fails
> PCI/P2PDMA: Fix the warning condition in p2pmem_alloc_mmap()
> kernfs: add support for get_unmapped_area callback
> kernfs: add support for may_split and pagesize callbacks
> sysfs: support get_unmapped_area callback for binary file
> PCI/P2PDMA: add align parameter for pci_p2pdma_add_resource()
> PCI/P2PDMA: create compound page for aligned p2pdma memory
> mm/huge_memory: add helpers to insert huge page during mmap
> PCI/P2PDMA: support get_unmapped_area to return aligned vaddr
> PCI/P2PDMA: support compound page in p2pmem_alloc_mmap()
> PCI/P2PDMA: add helper pci_p2pdma_max_pagemap_align()
> nvme-pci: introduce cmb_devmap_align module parameter
> PCI/P2PDMA: enable compound page support for p2pdma memory
>
> drivers/accel/habanalabs/common/hldio.c | 3 +-
> drivers/nvme/host/pci.c | 10 +-
> drivers/pci/p2pdma.c | 140 ++++++++++++++++++++++--
> fs/kernfs/file.c | 79 +++++++++++++
> fs/sysfs/file.c | 15 +++
> include/linux/huge_mm.h | 4 +
> include/linux/kernfs.h | 3 +
> include/linux/pci-p2pdma.h | 30 ++++-
> include/linux/sysfs.h | 4 +
> mm/huge_memory.c | 66 +++++++++++
> 10 files changed, 339 insertions(+), 15 deletions(-)
>
> --
> 2.29.2
>
>
^ permalink raw reply [flat|nested] 25+ messages in thread