* [PATCH v2 0/3] mm: fixup pfnmap memory failure handling
@ 2025-12-13 4:47 ankita
2025-12-13 4:47 ` [PATCH v2 1/3] mm: fixup pfnmap memory failure handling to use pgoff ankita
From: ankita @ 2025-12-13 4:47 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex, akpm,
linmiaohe, nao.horiguchi
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel, linux-mm
From: Ankit Agrawal <ankita@nvidia.com>
It was noticed during the 6.19 merge window that the patch series [1]
introducing memory failure handling for PFNMAP memory is broken.
The expected behaviour of the series is to allow a driver (such as
nvgrace-gpu) to register its device memory with the mm. The mm would
then handle poison on that registered memory region.
However, the following issues were identified in the patch series.
1. Faulty use of the PFN, instead of the file page offset of the
mapping, to derive the usermode process VA corresponding to the
poisoned PFN.
2. The nvgrace-gpu code performed the registration at mmap time,
exposing it to corruption when multiple mmaps were done on the
same BAR. This issue was also noticed by Linus Torvalds, who reverted
the patch [2].
This patch series addresses those issues.
Patch 1/3 fixes the first issue by translating the PFN to a page offset
and using that information to send the SIGBUS to the mapping process.
Patch 2/3 adds stubs for when CONFIG_MEMORY_FAILURE is disabled.
Patch 3/3 is a resend of the reverted change to register the device
memory at the time of open instead of mmap.
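For readers unfamiliar with the interface, below is a minimal sketch of
how a driver is expected to use it once this series lands. The my_*
names are hypothetical, and the sketch assumes the registered region is
mapped at file offset 0; see 3/3 for the real nvgrace-gpu usage.

#include <linux/memory-failure.h>
#include <linux/mm.h>

/* Hypothetical per-device description of a PFNMAP region. */
struct my_region {
	unsigned long start_pfn;	/* first PFN of device memory */
	unsigned long nr_pages;		/* region size in pages */
	struct pfn_address_space pfn_space;
};

/* Callback invoked by mm on poison: translate a PFN to the file pgoff. */
static int my_pfn_to_vma_pgoff(struct vm_area_struct *vma,
			       unsigned long pfn, pgoff_t *pgoff)
{
	struct my_region *r = vma->vm_private_data;	/* set at mmap time */

	if (pfn < r->start_pfn || pfn >= r->start_pfn + r->nr_pages)
		return -EFAULT;

	/* Region assumed to start at file offset 0 for this sketch. */
	*pgoff = pfn - r->start_pfn;
	return 0;
}

/* Called from the driver's open path (not mmap; see patch 3/3). */
static int my_register(struct my_region *r, struct address_space *mapping)
{
	r->pfn_space.node.start = r->start_pfn;
	r->pfn_space.node.last = r->start_pfn + r->nr_pages - 1;
	r->pfn_space.mapping = mapping;
	r->pfn_space.pfn_to_vma_pgoff = my_pfn_to_vma_pgoff;
	return register_pfn_address_space(&r->pfn_space);
}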
Many thanks to Jason Gunthorpe (jgg@nvidia.com) and Alex Williamson
(alex@shazbot.org) for identifying the issue and suggesting the fix.
Thanks to Andrew Morton (akpm@linux-foundation.org) for picking up
1/3 for mm-unstable. Please consider the entire series for 6.19, as
3/3 is a resend-with-fix of the only user, which was reverted in the
original series [2].
Link: https://lore.kernel.org/all/20251102184434.2406-1-ankita@nvidia.com/ [1]
Link: https://lore.kernel.org/all/20251102184434.2406-4-ankita@nvidia.com/ [2]
Changelog:
v2:
* 1/3 added to the mm-unstable branch (Thanks Andrew Morton!)
* Fixed return types in 3/3 based on Alex Williamson's suggestions.
* s/u64/pgoff_t/ for offsets in 3/3 (Thanks Alex Williamson)
* Removed inline from pfn_memregion_offset in 3/3 (Thanks Alex Williamson)
* No change in 1/3, 2/3.
Link: https://lore.kernel.org/all/20251211070603.338701-1-ankita@nvidia.com/ [v1]
Ankit Agrawal (3):
mm: fixup pfnmap memory failure handling to use pgoff
mm: add stubs for PFNMAP memory failure registration functions
vfio/nvgrace-gpu: register device memory for poison handling
drivers/vfio/pci/nvgrace-gpu/main.c | 116 +++++++++++++++++++++++++++-
include/linux/memory-failure.h | 15 +++-
mm/memory-failure.c | 29 ++++---
3 files changed, 143 insertions(+), 17 deletions(-)
--
2.34.1
* [PATCH v2 1/3] mm: fixup pfnmap memory failure handling to use pgoff
2025-12-13 4:47 [PATCH v2 0/3] mm: fixup pfnmap memory failure handling ankita
@ 2025-12-13 4:47 ` ankita
2025-12-17 3:10 ` Miaohe Lin
2025-12-13 4:47 ` [PATCH v2 2/3] mm: add stubs for PFNMAP memory failure registration functions ankita
2025-12-13 4:47 ` [PATCH v2 3/3] vfio/nvgrace-gpu: register device memory for poison handling ankita
From: ankita @ 2025-12-13 4:47 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex, akpm,
linmiaohe, nao.horiguchi
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel, linux-mm
From: Ankit Agrawal <ankita@nvidia.com>
The memory failure handling implementation for PFNMAP memory with no
struct pages is faulty. The VA of the mapping is determined based on
the PFN. It should instead be based on the file mapping offset.
When poison occurs, memory_failure_pfn() is triggered on the
poisoned PFN. Introduce a callback function that allows mm to translate
the PFN to the corresponding file page offset. The kernel module using
the registration API must implement the callback and provide the
translation. The translated value is then used to determine the VA and
to send the SIGBUS to the usermode process that has the poisoned PFN
mapped.
The callback is also useful for the driver to be notified of the
poisoned PFN, which the driver may then track.
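For context, the VA derivation that consumes the translated value,
vma_address(), boils down to the following (a simplified sketch of the
existing mm helper, not code added by this patch):

	/*
	 * The user VA of a file page is the VMA start plus the page's
	 * offset relative to the VMA's own starting file offset. A raw
	 * PFN fed in here is only correct by coincidence, which is the
	 * bug being fixed.
	 */
	unsigned long addr = vma->vm_start +
			     ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);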
Fixes: 2ec41967189c ("mm: handle poisoning of pfn without struct pages")
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
include/linux/memory-failure.h | 2 ++
mm/memory-failure.c | 29 ++++++++++++++++++-----------
2 files changed, 20 insertions(+), 11 deletions(-)
diff --git a/include/linux/memory-failure.h b/include/linux/memory-failure.h
index bc326503d2d2..7b5e11cf905f 100644
--- a/include/linux/memory-failure.h
+++ b/include/linux/memory-failure.h
@@ -9,6 +9,8 @@ struct pfn_address_space;
struct pfn_address_space {
struct interval_tree_node node;
struct address_space *mapping;
+ int (*pfn_to_vma_pgoff)(struct vm_area_struct *vma,
+ unsigned long pfn, pgoff_t *pgoff);
};
int register_pfn_address_space(struct pfn_address_space *pfn_space);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fbc5a01260c8..c80c2907da33 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2161,6 +2161,9 @@ int register_pfn_address_space(struct pfn_address_space *pfn_space)
{
guard(mutex)(&pfn_space_lock);
+ if (!pfn_space->pfn_to_vma_pgoff)
+ return -EINVAL;
+
if (interval_tree_iter_first(&pfn_space_itree,
pfn_space->node.start,
pfn_space->node.last))
@@ -2183,10 +2186,10 @@ void unregister_pfn_address_space(struct pfn_address_space *pfn_space)
}
EXPORT_SYMBOL_GPL(unregister_pfn_address_space);
-static void add_to_kill_pfn(struct task_struct *tsk,
- struct vm_area_struct *vma,
- struct list_head *to_kill,
- unsigned long pfn)
+static void add_to_kill_pgoff(struct task_struct *tsk,
+ struct vm_area_struct *vma,
+ struct list_head *to_kill,
+ pgoff_t pgoff)
{
struct to_kill *tk;
@@ -2197,12 +2200,12 @@ static void add_to_kill_pfn(struct task_struct *tsk,
}
/* Check for pgoff not backed by struct page */
- tk->addr = vma_address(vma, pfn, 1);
+ tk->addr = vma_address(vma, pgoff, 1);
tk->size_shift = PAGE_SHIFT;
if (tk->addr == -EFAULT)
pr_info("Unable to find address %lx in %s\n",
- pfn, tsk->comm);
+ pgoff, tsk->comm);
get_task_struct(tsk);
tk->tsk = tsk;
@@ -2212,11 +2215,12 @@ static void add_to_kill_pfn(struct task_struct *tsk,
/*
* Collect processes when the error hit a PFN not backed by struct page.
*/
-static void collect_procs_pfn(struct address_space *mapping,
+static void collect_procs_pfn(struct pfn_address_space *pfn_space,
unsigned long pfn, struct list_head *to_kill)
{
struct vm_area_struct *vma;
struct task_struct *tsk;
+ struct address_space *mapping = pfn_space->mapping;
i_mmap_lock_read(mapping);
rcu_read_lock();
@@ -2226,9 +2230,12 @@ static void collect_procs_pfn(struct address_space *mapping,
t = task_early_kill(tsk, true);
if (!t)
continue;
- vma_interval_tree_foreach(vma, &mapping->i_mmap, pfn, pfn) {
- if (vma->vm_mm == t->mm)
- add_to_kill_pfn(t, vma, to_kill, pfn);
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
+ pgoff_t pgoff;
+
+ if (vma->vm_mm == t->mm &&
+ !pfn_space->pfn_to_vma_pgoff(vma, pfn, &pgoff))
+ add_to_kill_pgoff(t, vma, to_kill, pgoff);
}
}
rcu_read_unlock();
@@ -2264,7 +2271,7 @@ static int memory_failure_pfn(unsigned long pfn, int flags)
struct pfn_address_space *pfn_space =
container_of(node, struct pfn_address_space, node);
- collect_procs_pfn(pfn_space->mapping, pfn, &tokill);
+ collect_procs_pfn(pfn_space, pfn, &tokill);
mf_handled = true;
}
--
2.34.1
* [PATCH v2 2/3] mm: add stubs for PFNMAP memory failure registration functions
2025-12-13 4:47 [PATCH v2 0/3] mm: fixup pfnmap memory failure handling ankita
2025-12-13 4:47 ` [PATCH v2 1/3] mm: fixup pfnmap memory failure handling to use pgoff ankita
@ 2025-12-13 4:47 ` ankita
2025-12-13 4:47 ` [PATCH v2 3/3] vfio/nvgrace-gpu: register device memory for poison handling ankita
From: ankita @ 2025-12-13 4:47 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex, akpm,
linmiaohe, nao.horiguchi
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel, linux-mm
From: Ankit Agrawal <ankita@nvidia.com>
Add stubs so that callers build when CONFIG_MEMORY_FAILURE is disabled.
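With CONFIG_MEMORY_FAILURE=n, the register stub returns -EOPNOTSUPP, so
a caller can treat that value as "feature unavailable" rather than as a
hard failure. A short sketch of the intended caller pattern, mirroring
how patch 3/3 uses it:

	ret = register_pfn_address_space(&region->pfn_address_space);
	if (ret && ret != -EOPNOTSUPP)
		goto unwind;	/* real error, e.g. overlapping range */
	/* on -EOPNOTSUPP: poison handling unavailable, continue anyway */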
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
include/linux/memory-failure.h | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/include/linux/memory-failure.h b/include/linux/memory-failure.h
index 7b5e11cf905f..d333dcdbeae7 100644
--- a/include/linux/memory-failure.h
+++ b/include/linux/memory-failure.h
@@ -4,8 +4,6 @@
#include <linux/interval_tree.h>
-struct pfn_address_space;
-
struct pfn_address_space {
struct interval_tree_node node;
struct address_space *mapping;
@@ -13,7 +11,18 @@ struct pfn_address_space {
unsigned long pfn, pgoff_t *pgoff);
};
+#ifdef CONFIG_MEMORY_FAILURE
int register_pfn_address_space(struct pfn_address_space *pfn_space);
void unregister_pfn_address_space(struct pfn_address_space *pfn_space);
+#else
+static inline int register_pfn_address_space(struct pfn_address_space *pfn_space)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline void unregister_pfn_address_space(struct pfn_address_space *pfn_space)
+{
+}
+#endif /* CONFIG_MEMORY_FAILURE */
#endif /* _LINUX_MEMORY_FAILURE_H */
--
2.34.1
* [PATCH v2 3/3] vfio/nvgrace-gpu: register device memory for poison handling
2025-12-13 4:47 [PATCH v2 0/3] mm: fixup pfnmap memory failure handling ankita
2025-12-13 4:47 ` [PATCH v2 1/3] mm: fixup pfnmap memory failure handling to use pgoff ankita
2025-12-13 4:47 ` [PATCH v2 2/3] mm: add stubs for PFNMAP memory failure registration functions ankita
@ 2025-12-13 4:47 ` ankita
2025-12-13 8:00 ` Alex Williamson
From: ankita @ 2025-12-13 4:47 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex, akpm,
linmiaohe, nao.horiguchi
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel, linux-mm
From: Ankit Agrawal <ankita@nvidia.com>
The nvgrace-gpu module [1] maps the device memory to the user VA (QEMU)
without adding the memory to the kernel. The device memory pages are PFNMAP
and not backed by struct page. The module can thus utilize the MM's PFNMAP
memory_failure mechanism, which handles ECC/poison on regions with no
struct pages.
The kernel MM code exposes register/unregister APIs allowing modules to
register the device memory for memory_failure handling. Make nvgrace-gpu
register the GPU memory with the MM on open.
The module registers its memory region and address_space with the
kernel MM for ECC handling, and implements a callback function to convert
the PFN to the file page offset. The callback checks that the PFN
belongs to the device memory region and is contained in the VMA range;
an error is returned otherwise.
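For context, vfio-pci encodes the region index in the high bits of the
mmap file offset (VFIO_PCI_OFFSET_SHIFT is 40), which is what the
callback relies on when decomposing vm_pgoff. An illustrative example,
assuming a 4K page size (PAGE_SHIFT of 12):

	/* vm_pgoff layout for vfio-pci mmaps (illustrative):
	 *   region index       = vm_pgoff >> (40 - 12), i.e. bits 28+
	 *   page offset in BAR = vm_pgoff & ((1UL << 28) - 1)
	 * e.g. an mmap of region 2 at BAR offset 0 has vm_pgoff = 2UL << 28.
	 */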
Link: https://lore.kernel.org/all/20240220115055.23546-1-ankita@nvidia.com/ [1]
Suggested-by: Alex Williamson <alex@shazbot.org>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/main.c | 116 +++++++++++++++++++++++++++-
1 file changed, 112 insertions(+), 4 deletions(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index 84d142a47ec6..91b4a3a135cf 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -9,6 +9,7 @@
#include <linux/jiffies.h>
#include <linux/pci-p2pdma.h>
#include <linux/pm_runtime.h>
+#include <linux/memory-failure.h>
/*
* The device memory usable to the workloads running in the VM is cached
@@ -49,6 +50,7 @@ struct mem_region {
void *memaddr;
void __iomem *ioaddr;
}; /* Base virtual address of the region */
+ struct pfn_address_space pfn_address_space;
};
struct nvgrace_gpu_pci_core_device {
@@ -88,6 +90,83 @@ nvgrace_gpu_memregion(int index,
return NULL;
}
+static int pfn_memregion_offset(struct nvgrace_gpu_pci_core_device *nvdev,
+ unsigned int index,
+ unsigned long pfn,
+ pgoff_t *pfn_offset_in_region)
+{
+ struct mem_region *region;
+ unsigned long start_pfn, num_pages;
+
+ region = nvgrace_gpu_memregion(index, nvdev);
+ if (!region)
+ return -EINVAL;
+
+ start_pfn = PHYS_PFN(region->memphys);
+ num_pages = region->memlength >> PAGE_SHIFT;
+
+ if (pfn < start_pfn || pfn >= start_pfn + num_pages)
+ return -EFAULT;
+
+ *pfn_offset_in_region = pfn - start_pfn;
+
+ return 0;
+}
+
+static inline
+struct nvgrace_gpu_pci_core_device *vma_to_nvdev(struct vm_area_struct *vma);
+
+static int nvgrace_gpu_pfn_to_vma_pgoff(struct vm_area_struct *vma,
+ unsigned long pfn,
+ pgoff_t *pgoff)
+{
+ struct nvgrace_gpu_pci_core_device *nvdev;
+ unsigned int index =
+ vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+ pgoff_t vma_offset_in_region = vma->vm_pgoff &
+ ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+ pgoff_t pfn_offset_in_region;
+ int ret;
+
+ nvdev = vma_to_nvdev(vma);
+ if (!nvdev)
+ return -ENOENT;
+
+ ret = pfn_memregion_offset(nvdev, index, pfn, &pfn_offset_in_region);
+ if (ret)
+ return ret;
+
+ /* Ensure PFN is not before VMA's start within the region */
+ if (pfn_offset_in_region < vma_offset_in_region)
+ return -EFAULT;
+
+ /* Calculate offset from VMA start */
+ *pgoff = vma->vm_pgoff +
+ (pfn_offset_in_region - vma_offset_in_region);
+
+ return 0;
+}
+
+static int
+nvgrace_gpu_vfio_pci_register_pfn_range(struct vfio_device *core_vdev,
+ struct mem_region *region)
+{
+ int ret;
+ unsigned long pfn, nr_pages;
+
+ pfn = PHYS_PFN(region->memphys);
+ nr_pages = region->memlength >> PAGE_SHIFT;
+
+ region->pfn_address_space.node.start = pfn;
+ region->pfn_address_space.node.last = pfn + nr_pages - 1;
+ region->pfn_address_space.mapping = core_vdev->inode->i_mapping;
+ region->pfn_address_space.pfn_to_vma_pgoff = nvgrace_gpu_pfn_to_vma_pgoff;
+
+ ret = register_pfn_address_space(&region->pfn_address_space);
+
+ return ret;
+}
+
static int nvgrace_gpu_open_device(struct vfio_device *core_vdev)
{
struct vfio_pci_core_device *vdev =
@@ -114,14 +193,28 @@ static int nvgrace_gpu_open_device(struct vfio_device *core_vdev)
* memory mapping.
*/
ret = vfio_pci_core_setup_barmap(vdev, 0);
- if (ret) {
- vfio_pci_core_disable(vdev);
- return ret;
+ if (ret)
+ goto error_exit;
+
+ if (nvdev->resmem.memlength) {
+ ret = nvgrace_gpu_vfio_pci_register_pfn_range(core_vdev, &nvdev->resmem);
+ if (ret && ret != -EOPNOTSUPP)
+ goto error_exit;
}
- vfio_pci_core_finish_enable(vdev);
+ ret = nvgrace_gpu_vfio_pci_register_pfn_range(core_vdev, &nvdev->usemem);
+ if (ret && ret != -EOPNOTSUPP)
+ goto register_mem_failed;
+ vfio_pci_core_finish_enable(vdev);
return 0;
+
+register_mem_failed:
+ if (nvdev->resmem.memlength)
+ unregister_pfn_address_space(&nvdev->resmem.pfn_address_space);
+error_exit:
+ vfio_pci_core_disable(vdev);
+ return ret;
}
static void nvgrace_gpu_close_device(struct vfio_device *core_vdev)
@@ -130,6 +223,11 @@ static void nvgrace_gpu_close_device(struct vfio_device *core_vdev)
container_of(core_vdev, struct nvgrace_gpu_pci_core_device,
core_device.vdev);
+ if (nvdev->resmem.memlength)
+ unregister_pfn_address_space(&nvdev->resmem.pfn_address_space);
+
+ unregister_pfn_address_space(&nvdev->usemem.pfn_address_space);
+
/* Unmap the mapping to the device memory cached region */
if (nvdev->usemem.memaddr) {
memunmap(nvdev->usemem.memaddr);
@@ -247,6 +345,16 @@ static const struct vm_operations_struct nvgrace_gpu_vfio_pci_mmap_ops = {
#endif
};
+static inline
+struct nvgrace_gpu_pci_core_device *vma_to_nvdev(struct vm_area_struct *vma)
+{
+ /* Check if this VMA belongs to us */
+ if (vma->vm_ops != &nvgrace_gpu_vfio_pci_mmap_ops)
+ return NULL;
+
+ return vma->vm_private_data;
+}
+
static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
struct vm_area_struct *vma)
{
--
2.34.1
* Re: [PATCH v2 3/3] vfio/nvgrace-gpu: register device memory for poison handling
2025-12-13 4:47 ` [PATCH v2 3/3] vfio/nvgrace-gpu: register device memory for poison handling ankita
@ 2025-12-13 8:00 ` Alex Williamson
From: Alex Williamson @ 2025-12-13 8:00 UTC (permalink / raw)
To: ankita
Cc: vsethi, jgg, mochs, jgg, skolothumtho, akpm, linmiaohe,
nao.horiguchi, cjia, zhiw, kjaju, yishaih, kevin.tian, kvm,
linux-kernel, linux-mm
On Sat, 13 Dec 2025 04:47:08 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The nvgrace-gpu module [1] maps the device memory to the user VA (QEMU)
> without adding the memory to the kernel. The device memory pages are PFNMAP
> and not backed by struct page. The module can thus utilize the MM's PFNMAP
> memory_failure mechanism, which handles ECC/poison on regions with no
> struct pages.
>
> The kernel MM code exposes register/unregister APIs allowing modules to
> register the device memory for memory_failure handling. Make nvgrace-gpu
> register the GPU memory with the MM on open.
>
> The module registers its memory region and address_space with the
> kernel MM for ECC handling, and implements a callback function to convert
> the PFN to the file page offset. The callback checks that the PFN
> belongs to the device memory region and is contained in the VMA range;
> an error is returned otherwise.
>
> Link: https://lore.kernel.org/all/20240220115055.23546-1-ankita@nvidia.com/ [1]
>
> Suggested-by: Alex Williamson <alex@shazbot.org>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/main.c | 116 +++++++++++++++++++++++++++-
> 1 file changed, 112 insertions(+), 4 deletions(-)
I'm not sure where Andrew stands with this series going into v6.19-rc
via mm as an alternate fix to Linus' revert, but in case it's on the
table for that to happen:
Reviewed-by: Alex Williamson <alex@shazbot.org>
Otherwise let's get some mm buy-in for the front of the series and
maybe it should go in through vfio since nvgrace is the only user of
these interfaces currently. Thanks,
Alex
* Re: [PATCH v2 1/3] mm: fixup pfnmap memory failure handling to use pgoff
2025-12-13 4:47 ` [PATCH v2 1/3] mm: fixup pfnmap memory failure handling to use pgoff ankita
@ 2025-12-17 3:10 ` Miaohe Lin
2025-12-17 18:10 ` Ankit Agrawal
From: Miaohe Lin @ 2025-12-17 3:10 UTC (permalink / raw)
To: ankita
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel,
linux-mm, vsethi, jgg, mochs, jgg, skolothumtho, alex, akpm,
nao.horiguchi
On 2025/12/13 12:47, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The memory failure handling implementation for PFNMAP memory with no
> struct pages is faulty. The VA of the mapping is determined based on
> the PFN. It should instead be based on the file mapping offset.
>
> When poison occurs, memory_failure_pfn() is triggered on the
> poisoned PFN. Introduce a callback function that allows mm to translate
> the PFN to the corresponding file page offset. The kernel module using
> the registration API must implement the callback and provide the
> translation. The translated value is then used to determine the VA and
> to send the SIGBUS to the usermode process that has the poisoned PFN
> mapped.
>
> The callback is also useful for the driver to be notified of the
> poisoned PFN, which the driver may then track.
>
> Fixes: 2ec41967189c ("mm: handle poisoning of pfn without struct pages")
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Thanks for your patch.
> ---
> include/linux/memory-failure.h | 2 ++
> mm/memory-failure.c | 29 ++++++++++++++++++-----------
> 2 files changed, 20 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/memory-failure.h b/include/linux/memory-failure.h
> index bc326503d2d2..7b5e11cf905f 100644
> --- a/include/linux/memory-failure.h
> +++ b/include/linux/memory-failure.h
> @@ -9,6 +9,8 @@ struct pfn_address_space;
> struct pfn_address_space {
> struct interval_tree_node node;
> struct address_space *mapping;
> + int (*pfn_to_vma_pgoff)(struct vm_area_struct *vma,
> + unsigned long pfn, pgoff_t *pgoff);
> };
>
> int register_pfn_address_space(struct pfn_address_space *pfn_space);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fbc5a01260c8..c80c2907da33 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2161,6 +2161,9 @@ int register_pfn_address_space(struct pfn_address_space *pfn_space)
> {
> guard(mutex)(&pfn_space_lock);
>
> + if (!pfn_space->pfn_to_vma_pgoff)
> + return -EINVAL;
> +
> if (interval_tree_iter_first(&pfn_space_itree,
> pfn_space->node.start,
> pfn_space->node.last))
> @@ -2183,10 +2186,10 @@ void unregister_pfn_address_space(struct pfn_address_space *pfn_space)
> }
> EXPORT_SYMBOL_GPL(unregister_pfn_address_space);
>
> -static void add_to_kill_pfn(struct task_struct *tsk,
> - struct vm_area_struct *vma,
> - struct list_head *to_kill,
> - unsigned long pfn)
> +static void add_to_kill_pgoff(struct task_struct *tsk,
> + struct vm_area_struct *vma,
> + struct list_head *to_kill,
> + pgoff_t pgoff)
> {
> struct to_kill *tk;
>
> @@ -2197,12 +2200,12 @@ static void add_to_kill_pfn(struct task_struct *tsk,
> }
>
> /* Check for pgoff not backed by struct page */
> - tk->addr = vma_address(vma, pfn, 1);
> + tk->addr = vma_address(vma, pgoff, 1);
> tk->size_shift = PAGE_SHIFT;
>
> if (tk->addr == -EFAULT)
> pr_info("Unable to find address %lx in %s\n",
> - pfn, tsk->comm);
> + pgoff, tsk->comm);
>
> get_task_struct(tsk);
> tk->tsk = tsk;
> @@ -2212,11 +2215,12 @@ static void add_to_kill_pfn(struct task_struct *tsk,
> /*
> * Collect processes when the error hit a PFN not backed by struct page.
> */
> -static void collect_procs_pfn(struct address_space *mapping,
> +static void collect_procs_pfn(struct pfn_address_space *pfn_space,
> unsigned long pfn, struct list_head *to_kill)
> {
> struct vm_area_struct *vma;
> struct task_struct *tsk;
> + struct address_space *mapping = pfn_space->mapping;
>
> i_mmap_lock_read(mapping);
> rcu_read_lock();
> @@ -2226,9 +2230,12 @@ static void collect_procs_pfn(struct address_space *mapping,
> t = task_early_kill(tsk, true);
> if (!t)
> continue;
> - vma_interval_tree_foreach(vma, &mapping->i_mmap, pfn, pfn) {
> - if (vma->vm_mm == t->mm)
> - add_to_kill_pfn(t, vma, to_kill, pfn);
> + vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
> + pgoff_t pgoff;
IIUC, all VMAs will be traversed to find the final pgoff. This might not be a good idea
because the rcu lock is held and this traversal might take a really long time. Or am I
missing something?
Thanks.
* Re: [PATCH v2 1/3] mm: fixup pfnmap memory failure handling to use pgoff
2025-12-17 3:10 ` Miaohe Lin
@ 2025-12-17 18:10 ` Ankit Agrawal
2025-12-18 2:18 ` Miaohe Lin
From: Ankit Agrawal @ 2025-12-17 18:10 UTC (permalink / raw)
To: Miaohe Lin
Cc: Neo Jia, Zhi Wang, Krishnakant Jaju, Yishai Hadas, kevin.tian,
kvm, linux-kernel, linux-mm, Vikram Sethi, Jason Gunthorpe,
Matt Ochs, jgg, Shameer Kolothum, alex, akpm, nao.horiguchi
>> i_mmap_lock_read(mapping);
>> rcu_read_lock();
>> @@ -2226,9 +2230,12 @@ static void collect_procs_pfn(struct address_space *mapping,
>> t = task_early_kill(tsk, true);
>> if (!t)
>> continue;
>> - vma_interval_tree_foreach(vma, &mapping->i_mmap, pfn, pfn) {
>> - if (vma->vm_mm == t->mm)
>> - add_to_kill_pfn(t, vma, to_kill, pfn);
>> + vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
>> + pgoff_t pgoff;
>
> IIUC, all VMAs will be traversed to find the final pgoff. This might not be a good idea
> because the rcu lock is held and this traversal might take a really long time. Or am I
> missing something?
Hi Miaohe, only the VMAs on the registered address space will be checked. For the
nvgrace-gpu user of this API in 3/3, there are only 3 VMAs on the registered address
space (those associated with the vfio device file).
* Re: [PATCH v2 1/3] mm: fixup pfnmap memory failure handling to use pgoff
2025-12-17 18:10 ` Ankit Agrawal
@ 2025-12-18 2:18 ` Miaohe Lin
From: Miaohe Lin @ 2025-12-18 2:18 UTC (permalink / raw)
To: Ankit Agrawal
Cc: Neo Jia, Zhi Wang, Krishnakant Jaju, Yishai Hadas, kevin.tian,
kvm, linux-kernel, linux-mm, Vikram Sethi, Jason Gunthorpe,
Matt Ochs, jgg, Shameer Kolothum, alex, akpm, nao.horiguchi
On 2025/12/18 2:10, Ankit Agrawal wrote:
>>> i_mmap_lock_read(mapping);
>>> rcu_read_lock();
>>> @@ -2226,9 +2230,12 @@ static void collect_procs_pfn(struct address_space *mapping,
>>> t = task_early_kill(tsk, true);
>>> if (!t)
>>> continue;
>>> - vma_interval_tree_foreach(vma, &mapping->i_mmap, pfn, pfn) {
>>> - if (vma->vm_mm == t->mm)
>>> - add_to_kill_pfn(t, vma, to_kill, pfn);
>>> + vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
>>> + pgoff_t pgoff;
>>
>> IIUC, all VMAs will be traversed to find the final pgoff. This might not be a good idea
>> because the rcu lock is held and this traversal might take a really long time. Or am I
>> missing something?
>
> Hi Miaohe, only the VMAs on the registered address space will be checked. For the
> nvgrace-gpu user of this API in 3/3, there are only 3 VMAs on the registered address
> space (those associated with the vfio device file).
Oh, I see. Thanks for your explanation. :)