* [PATCH v3 0/3] mm: Implement ECC handling for pfn with no struct page
From: ankita @ 2025-10-21 10:23 UTC
To: ankita, aniketa, vsethi, jgg, mochs, skolothumtho, linmiaohe,
nao.horiguchi, akpm, david, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, tony.luck, bp, rafael, guohanjun,
mchehab, lenb, kevin.tian, alex
Cc: cjia, kwankhede, targupta, zhiw, dnigam, kjaju, linux-kernel,
linux-mm, linux-edac, Jonathan.Cameron, ira.weiny,
Smita.KoralahalliChannabasappa, u.kleine-koenig, peterz,
linux-acpi, kvm
From: Ankit Agrawal <ankita@nvidia.com>
The kernel MM currently handles ECC errors / poison only on memory pages
backed by struct pages. Such handling is missing for PFNMAP memory that
does not have struct pages. This series adds that support.
Implement new ECC handling for memory without struct pages. The kernel
MM exposes registration APIs that allow modules managing device memory
to register their device memory regions. MM then tracks such regions
using an interval tree.
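
For illustration, a module managing such a region would register it
roughly as follows. This is a minimal sketch against the API added in
patch 1/3; the mydev structure and its fields are placeholders, not
real driver code:

#include <linux/memory-failure.h>

struct mydev {
        struct pfn_address_space pfn_space;     /* new API in patch 1/3 */
        struct address_space *mapping;          /* backs the user mmap */
        unsigned long base_pfn;
        unsigned long nr_pfns;
};

static int mydev_register_poison(struct mydev *dev)
{
        /* Describe the device memory SPA range as an interval tree node. */
        dev->pfn_space.node.start = dev->base_pfn;
        dev->pfn_space.node.last = dev->base_pfn + dev->nr_pfns - 1;
        dev->pfn_space.mapping = dev->mapping;

        /* Returns -EBUSY if the range overlaps an already registered one. */
        return register_pfn_address_space(&dev->pfn_space);
}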
The mechanism is largely similar to that of ECC on pfns with struct
pages. If there is an ECC error on a pfn, all the mappings to it are
identified and a SIGBUS is sent to the user space processes owning
those mappings.
Note that there is one primary difference versus the handling of poison
on struct pages: the unmapping of the poisoned PFN is skipped. This is
done to handle the huge PFNMAP support added recently [1] that enables
VM_PFNMAP vmas to map at PMD level. Otherwise, a poison to a PFN would
need breaking the PMD mapping into PTEs to unmap only the poisoned PFN,
which can have a major performance impact.
The nvgrace-gpu-vfio-pci module maps device memory to user VA (QEMU)
using remap_pfn_range() without the memory being added to the kernel
[2]. These device memory PFNs are not backed by struct page, so make
the nvgrace-gpu-vfio-pci module use this mechanism to get poison
handling support on the device memory, as sketched below.
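
A sketch of the wiring (illustrative only; the nvgrace_gpu_dev type and
its field names below are assumptions for illustration, not the actual
patch 3/3 code):

static int nvgrace_gpu_register_poison(struct nvgrace_gpu_dev *nvdev,
                                       struct address_space *mapping)
{
        /* Device memory range exposed to userspace via remap_pfn_range() */
        nvdev->pfn_space.node.start = PHYS_PFN(nvdev->mem_phys);
        nvdev->pfn_space.node.last =
                PHYS_PFN(nvdev->mem_phys + nvdev->mem_length) - 1;
        nvdev->pfn_space.mapping = mapping;

        return register_pfn_address_space(&nvdev->pfn_space);
}

/* ...and the counterpart on device close: */
static void nvgrace_gpu_unregister_poison(struct nvgrace_gpu_dev *nvdev)
{
        unregister_pfn_address_space(&nvdev->pfn_space);
}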
Patch rebased to v6.17-rc7.
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
Link: https://lore.kernel.org/all/20231123003513.24292-1-ankita@nvidia.com/ [v2]
v2 -> v3
- Rebased to v6.17-rc7.
- Skipped the unmapping of the PFNMAP range on reception of poison.
  Suggested by Jason Gunthorpe, Jiaqi Yan, Vikram Sethi (Thanks!)
- Updated the check to prevent multiple registrations of the same PFN
  range using interval_tree_iter_first. Thanks Shameer Kolothum for the
  suggestion.
- Removed the callback function in nvgrace-gpu that required tracking
  of poisoned PFNs, as it isn't required anymore.
- Introduced a separate collect_procs_pfn function to collect the list
  of processes mapping the poisoned PFN.
v1 -> v2
- Change poisoned page tracking from bitmap to hashtable.
- Addressed miscellaneous comments in v1.
Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]
Link: https://lore.kernel.org/all/20240220115055.23546-1-ankita@nvidia.com/ [2]
Ankit Agrawal (3):
mm: handle poisoning of pfn without struct pages
mm: Change ghes code to allow poison of non-struct pfn
vfio/nvgrace-gpu: register device memory for poison handling
 MAINTAINERS                         |   1 +
 drivers/acpi/apei/ghes.c            |   6 --
 drivers/vfio/pci/nvgrace-gpu/main.c |  45 +++++++++-
 include/linux/memory-failure.h      |  17 ++++
 include/linux/mm.h                  |   1 +
 include/ras/ras_event.h             |   1 +
 mm/Kconfig                          |   1 +
 mm/memory-failure.c                 | 128 +++++++++++++++++++++++++++-
 8 files changed, 192 insertions(+), 8 deletions(-)
create mode 100644 include/linux/memory-failure.h
--
2.34.1

* [PATCH v3 1/3] mm: handle poisoning of pfn without struct pages

From: ankita @ 2025-10-21 10:23 UTC
To: ankita, aniketa, vsethi, jgg, mochs, skolothumtho, linmiaohe,
	nao.horiguchi, akpm, david, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, tony.luck, bp, rafael, guohanjun,
	mchehab, lenb, kevin.tian, alex
Cc: cjia, kwankhede, targupta, zhiw, dnigam, kjaju, linux-kernel,
	linux-mm, linux-edac, Jonathan.Cameron, ira.weiny,
	Smita.KoralahalliChannabasappa, u.kleine-koenig, peterz,
	linux-acpi, kvm

From: Ankit Agrawal <ankita@nvidia.com>

The kernel MM currently does not handle ECC errors / poison on a memory
region that is not backed by struct pages. If a memory region is mapped
using remap_pfn_range(), for example, but not added to the kernel, MM
will not have associated struct pages. Add a new mechanism to handle
memory failure on such memory.

Make kernel MM expose a function to allow modules managing the device
memory to register the device memory SPA and the address space
associated with it. MM maintains this information as an interval tree.
On poison, MM can search for the range that the poisoned PFN belongs to
and use the address_space to determine the mapping VMAs.

In this implementation, kernel MM follows a sequence that is largely
similar to the memory_failure() handler for struct page backed memory:
1. memory_failure() is triggered on reception of a poison error. The
   absence of a struct page is detected and consequently
   memory_failure_pfn() is executed.
2. memory_failure_pfn() collects the processes mapped to the PFN.
3. memory_failure_pfn() sends SIGBUS to all the processes mapping the
   poisoned PFN using kill_procs().

Note that there is one primary difference versus the handling of poison
on struct pages: the unmapping of the faulty PFN is skipped. This is
done to handle the huge PFNMAP support added recently [1] that enables
VM_PFNMAP vmas to map at PMD level. Otherwise, a poison to a PFN would
need breaking the PMD mapping into PTEs to unmap only the poisoned PFN.
This would have a major performance impact.

Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 MAINTAINERS                    |   1 +
 include/linux/memory-failure.h |  17 +++++
 include/linux/mm.h             |   1 +
 include/ras/ras_event.h        |   1 +
 mm/Kconfig                     |   1 +
 mm/memory-failure.c            | 128 ++++++++++++++++++++++++++++++++-
 6 files changed, 148 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/memory-failure.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 520fb4e379a3..463d062d0386 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11359,6 +11359,7 @@ M:	Miaohe Lin <linmiaohe@huawei.com>
 R:	Naoya Horiguchi <nao.horiguchi@gmail.com>
 L:	linux-mm@kvack.org
 S:	Maintained
+F:	include/linux/memory-failure.h
 F:	mm/hwpoison-inject.c
 F:	mm/memory-failure.c
 
diff --git a/include/linux/memory-failure.h b/include/linux/memory-failure.h
new file mode 100644
index 000000000000..bc326503d2d2
--- /dev/null
+++ b/include/linux/memory-failure.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_FAILURE_H
+#define _LINUX_MEMORY_FAILURE_H
+
+#include <linux/interval_tree.h>
+
+struct pfn_address_space;
+
+struct pfn_address_space {
+	struct interval_tree_node node;
+	struct address_space *mapping;
+};
+
+int register_pfn_address_space(struct pfn_address_space *pfn_space);
+void unregister_pfn_address_space(struct pfn_address_space *pfn_space);
+
+#endif /* _LINUX_MEMORY_FAILURE_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ae97a0b8ec7..0ab4ea82ce9e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4006,6 +4006,7 @@ enum mf_action_page_type {
 	MF_MSG_DAX,
 	MF_MSG_UNSPLIT_THP,
 	MF_MSG_ALREADY_POISONED,
+	MF_MSG_PFN_MAP,
 	MF_MSG_UNKNOWN,
 };
 
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index c8cd0f00c845..fecfeb7c8be7 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -375,6 +375,7 @@ TRACE_EVENT(aer_event,
 	EM ( MF_MSG_DAX, "dax page" )				\
 	EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" )		\
 	EM ( MF_MSG_ALREADY_POISONED, "already poisoned" )	\
+	EM ( MF_MSG_PFN_MAP, "non struct page pfn" )		\
 	EMe ( MF_MSG_UNKNOWN, "unknown page" )
 
 /*
diff --git a/mm/Kconfig b/mm/Kconfig
index e443fe8cd6cf..0b07219390b9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -777,6 +777,7 @@ config MEMORY_FAILURE
 	depends on ARCH_SUPPORTS_MEMORY_FAILURE
 	bool "Enable recovery from hardware memory errors"
 	select MEMORY_ISOLATION
+	select INTERVAL_TREE
 	select RAS
 	help
 	  Enables code to recover from some memory failures on systems
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index df6ee59527dd..acfe5a9bde1d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -38,6 +38,7 @@
 
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/memory-failure.h>
 #include <linux/page-flags.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/task.h>
@@ -154,6 +155,10 @@ static const struct ctl_table memory_failure_table[] = {
 	}
 };
 
+static struct rb_root_cached pfn_space_itree = RB_ROOT_CACHED;
+
+static DEFINE_MUTEX(pfn_space_lock);
+
 /*
  * Return values:
  *   1: the page is dissolved (if needed) and taken off from buddy,
@@ -957,6 +962,7 @@ static const char * const action_page_types[] = {
 	[MF_MSG_DAX]			= "dax page",
 	[MF_MSG_UNSPLIT_THP]		= "unsplit thp",
 	[MF_MSG_ALREADY_POISONED]	= "already poisoned page",
+	[MF_MSG_PFN_MAP]		= "non struct page pfn",
 	[MF_MSG_UNKNOWN]		= "unknown page",
 };
 
@@ -1349,7 +1355,7 @@ static int action_result(unsigned long pfn, enum mf_action_page_type type,
 {
 	trace_memory_failure_event(pfn, type, result);
 
-	if (type != MF_MSG_ALREADY_POISONED) {
+	if (type != MF_MSG_ALREADY_POISONED && type != MF_MSG_PFN_MAP) {
 		num_poisoned_pages_inc(pfn);
 		update_per_node_mf_stats(pfn, result);
 	}
@@ -2216,6 +2222,121 @@ static void kill_procs_now(struct page *p, unsigned long pfn, int flags,
 	kill_procs(&tokill, true, pfn, flags);
 }
 
+int register_pfn_address_space(struct pfn_address_space *pfn_space)
+{
+	if (!pfn_space)
+		return -EINVAL;
+
+	mutex_lock(&pfn_space_lock);
+
+	if (interval_tree_iter_first(&pfn_space_itree,
+				     pfn_space->node.start,
+				     pfn_space->node.last)) {
+		mutex_unlock(&pfn_space_lock);
+		return -EBUSY;
+	}
+
+	interval_tree_insert(&pfn_space->node, &pfn_space_itree);
+	mutex_unlock(&pfn_space_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(register_pfn_address_space);
+
+void unregister_pfn_address_space(struct pfn_address_space *pfn_space)
+{
+	if (!pfn_space)
+		return;
+
+	mutex_lock(&pfn_space_lock);
+	interval_tree_remove(&pfn_space->node, &pfn_space_itree);
+	mutex_unlock(&pfn_space_lock);
+}
+EXPORT_SYMBOL_GPL(unregister_pfn_address_space);
+
+static void add_to_kill_pfn(struct task_struct *tsk,
+			    struct vm_area_struct *vma,
+			    struct list_head *to_kill,
+			    unsigned long pfn)
+{
+	struct to_kill *tk;
+
+	tk = kmalloc(sizeof(*tk), GFP_ATOMIC);
+	if (!tk)
+		return;
+
+	/* Check for pgoff not backed by struct page */
+	tk->addr = vma_address(vma, pfn, 1);
+	tk->size_shift = PAGE_SHIFT;
+
+	if (tk->addr == -EFAULT)
+		pr_info("Unable to find address %lx in %s\n",
+			pfn, tsk->comm);
+
+	get_task_struct(tsk);
+	tk->tsk = tsk;
+	list_add_tail(&tk->nd, to_kill);
+}
+
+/*
+ * Collect processes when the error hit a PFN not backed by struct page.
+ */
+static void collect_procs_pfn(struct address_space *mapping,
+			      unsigned long pfn, struct list_head *to_kill)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+
+	i_mmap_lock_read(mapping);
+	rcu_read_lock();
+	for_each_process(tsk) {
+		struct task_struct *t = tsk;
+
+		t = task_early_kill(tsk, true);
+		if (!t)
+			continue;
+		vma_interval_tree_foreach(vma, &mapping->i_mmap, pfn, pfn) {
+			if (vma->vm_mm == t->mm)
+				add_to_kill_pfn(t, vma, to_kill, pfn);
+		}
+	}
+	rcu_read_unlock();
+	i_mmap_unlock_read(mapping);
+}
+
+static int memory_failure_pfn(unsigned long pfn, int flags)
+{
+	struct interval_tree_node *node;
+	LIST_HEAD(tokill);
+
+	mutex_lock(&pfn_space_lock);
+	/*
+	 * Modules registers with MM the address space mapping to the device
+	 * memory they manage. Iterate to identify exactly which address
+	 * space has mapped to this failing PFN.
+	 */
+	for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
+	     node = interval_tree_iter_next(node, pfn, pfn)) {
+		struct pfn_address_space *pfn_space =
+			container_of(node, struct pfn_address_space, node);
+
+		collect_procs_pfn(pfn_space->mapping, pfn, &tokill);
+	}
+	mutex_unlock(&pfn_space_lock);
+
+	/*
+	 * Unlike System-RAM there is no possibility to swap in a different
+	 * physical page at a given virtual address, so all userspace
+	 * consumption of direct PFN memory necessitates SIGBUS (i.e.
+	 * MF_MUST_KILL)
+	 */
+	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+
+	kill_procs(&tokill, true, pfn, flags);
+
+	return action_result(pfn, MF_MSG_PFN_MAP, MF_RECOVERED);
+}
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -2259,6 +2380,11 @@ int memory_failure(unsigned long pfn, int flags)
 	if (!(flags & MF_SW_SIMULATED))
 		hw_memory_failure = true;
 
+	if (!pfn_valid(pfn) && !arch_is_platform_page(PFN_PHYS(pfn))) {
+		res = memory_failure_pfn(pfn, flags);
+		goto unlock_mutex;
+	}
+
 	p = pfn_to_online_page(pfn);
 	if (!p) {
 		res = arch_memory_failure(pfn, flags);
-- 
2.34.1

* Re: [PATCH v3 1/3] mm: handle poisoning of pfn without struct pages

From: Ira Weiny @ 2025-10-21 17:05 UTC
To: ankita, aniketa, vsethi, jgg, mochs, skolothumtho, linmiaohe,
	nao.horiguchi, akpm, david, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, tony.luck, bp, rafael, guohanjun,
	mchehab, lenb, kevin.tian, alex
Cc: cjia, kwankhede, targupta, zhiw, dnigam, kjaju, linux-kernel,
	linux-mm, linux-edac, Jonathan.Cameron, ira.weiny,
	Smita.KoralahalliChannabasappa, u.kleine-koenig, peterz,
	linux-acpi, kvm

ankita@ wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The kernel MM currently does not handle ECC errors / poison on a memory
> region that is not backed by struct pages. [...]

[snip]

> +int register_pfn_address_space(struct pfn_address_space *pfn_space)
> +{
> +	if (!pfn_space)
> +		return -EINVAL;

Is there a reason callers may be passing NULL?

> +
> +	mutex_lock(&pfn_space_lock);
> +
> +	if (interval_tree_iter_first(&pfn_space_itree,
> +				     pfn_space->node.start,
> +				     pfn_space->node.last)) {
> +		mutex_unlock(&pfn_space_lock);

Register and unregister are good places to use guard().

[snip]

> +	if (tk->addr == -EFAULT)
> +		pr_info("Unable to find address %lx in %s\n",
> +			pfn, tsk->comm);

If the pfn is not in the process, why is it added to the kill list?

[snip]

> +static int memory_failure_pfn(unsigned long pfn, int flags)
> +{
> +	struct interval_tree_node *node;
> +	LIST_HEAD(tokill);
> +
> +	mutex_lock(&pfn_space_lock);

scoped_guard()?  Or probably wrap this part in a guarded function.

Ira

* Re: [PATCH v3 1/3] mm: handle poisoning of pfn without struct pages

From: Jiaqi Yan @ 2025-10-22 16:00 UTC
To: ankita
Cc: aniketa, vsethi, jgg, mochs, skolothumtho, linmiaohe,
	nao.horiguchi, akpm, david, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, tony.luck, bp, rafael, guohanjun,
	mchehab, lenb, kevin.tian, alex, cjia, kwankhede, targupta,
	zhiw, dnigam, kjaju, linux-kernel, linux-mm, linux-edac,
	Jonathan.Cameron, ira.weiny, Smita.KoralahalliChannabasappa,
	u.kleine-koenig, peterz, linux-acpi, kvm

On Tue, Oct 21, 2025 at 3:23 AM <ankita@nvidia.com> wrote:
>
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The kernel MM currently does not handle ECC errors / poison on a memory
> region that is not backed by struct pages. [...]

[snip]

> +void unregister_pfn_address_space(struct pfn_address_space *pfn_space)
> +{
> +	if (!pfn_space)
> +		return;
> +
> +	mutex_lock(&pfn_space_lock);
> +	interval_tree_remove(&pfn_space->node, &pfn_space_itree);

IIRC removing something not in the interval tree will panic the kernel.
If I am not mistaken, should this do something like
interval_tree_iter_first before interval_tree_remove, to avoid a
driver's ill behavior crashing the system?

> +	mutex_unlock(&pfn_space_lock);
> +}
> +EXPORT_SYMBOL_GPL(unregister_pfn_address_space);

[snip]

* Re: [PATCH v3 1/3] mm: handle poisoning of pfn without struct pages

From: Miaohe Lin @ 2025-10-24 6:34 UTC
To: ankita, aniketa
Cc: cjia, kwankhede, targupta, zhiw, dnigam, kjaju, linux-kernel,
	linux-mm, linux-edac, Jonathan.Cameron, ira.weiny,
	Smita.KoralahalliChannabasappa, u.kleine-koenig, peterz,
	linux-acpi, kvm, vsethi, jgg, mochs, skolothumtho,
	nao.horiguchi, akpm, david, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, tony.luck, bp, rafael, guohanjun,
	mchehab, lenb, kevin.tian, alex

On 2025/10/21 18:23, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The kernel MM currently does not handle ECC errors / poison on a memory
> region that is not backed by struct pages. [...]

Thanks for your patch. Some comments below.

[snip]

> +#include <linux/interval_tree.h>
> +
> +struct pfn_address_space;

Do we need this declaration?

> +
> +struct pfn_address_space {
> +	struct interval_tree_node node;
> +	struct address_space *mapping;
> +};

<snip>

> +	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> +
> +	kill_procs(&tokill, true, pfn, flags);
> +

If the pfn doesn't belong to any address space mapping, is it still
counted as MF_RECOVERED?

> +	return action_result(pfn, MF_MSG_PFN_MAP, MF_RECOVERED);
> +}

<snip>

> +	if (!pfn_valid(pfn) && !arch_is_platform_page(PFN_PHYS(pfn))) {

It's better to have some comments for this case.

> +		res = memory_failure_pfn(pfn, flags);
> +		goto unlock_mutex;
> +	}
> +
> 	p = pfn_to_online_page(pfn);
> 	if (!p) {
> 		res = arch_memory_failure(pfn, flags);

Can we move the above memory_failure_pfn block here? I'm worried that
too many scenario branches might lead to confusion.

Thanks.

* Re: [PATCH v3 1/3] mm: handle poisoning of pfn without struct pages

From: Shuai Xue @ 2025-10-24 9:45 UTC
To: ankita, aniketa, vsethi, jgg, mochs, skolothumtho, linmiaohe,
	nao.horiguchi, akpm, david, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, tony.luck, bp, rafael, guohanjun,
	mchehab, lenb, kevin.tian, alex
Cc: cjia, kwankhede, targupta, zhiw, dnigam, kjaju, linux-kernel,
	linux-mm, linux-edac, Jonathan.Cameron, ira.weiny,
	Smita.KoralahalliChannabasappa, u.kleine-koenig, peterz,
	linux-acpi, kvm

On 2025/10/21 18:23, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
[snip]
>
> In this implementation, kernel MM follows a sequence that is largely
> similar to the memory_failure() handler for struct page backed memory:
> 1. memory_failure() is triggered on reception of a poison error. The
>    absence of a struct page is detected and consequently
>    memory_failure_pfn() is executed.

This step depends on PATCH 2. I suggest reordering the patches so that
PATCH 2 comes first, which would make the series easier to review and
understand.

> 2. memory_failure_pfn() collects the processes mapped to the PFN.
> 3. memory_failure_pfn() sends SIGBUS to all the processes mapping the
>    poisoned PFN using kill_procs().

[snip]

> +int register_pfn_address_space(struct pfn_address_space *pfn_space)

I have a design consideration here.

Non-struct page PFNs typically represent device memory managed by
device drivers through their own memory allocators. These drivers are
responsible for allocation and deallocation of this memory.

Rather than having MM maintain metadata about these PFNs, have you
considered adding an operation callback similar to
dev_pagemap_ops->memory_failure? This would allow device memory
allocators to:

- Maintain their own metadata tracking poison status (similar to
  TestSetPageHWPoison())
- Handle device-specific requirements for memory failure
- Provide more flexibility for different types of device memory

Thanks.
Shuai

* Re: [PATCH v3 1/3] mm: handle poisoning of pfn without struct pages

From: Jason Gunthorpe @ 2025-10-24 11:52 UTC
To: Shuai Xue
Cc: ankita, aniketa, vsethi, mochs, skolothumtho, linmiaohe,
	nao.horiguchi, akpm, david, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, tony.luck, bp, rafael, guohanjun,
	mchehab, lenb, kevin.tian, alex, cjia, kwankhede, targupta,
	zhiw, dnigam, kjaju, linux-kernel, linux-mm, linux-edac,
	Jonathan.Cameron, ira.weiny, Smita.KoralahalliChannabasappa,
	u.kleine-koenig, peterz, linux-acpi, kvm

On Fri, Oct 24, 2025 at 05:45:45PM +0800, Shuai Xue wrote:
> Rather than having MM maintain metadata about these PFNs, have you
> considered adding an operation callback similar to
> dev_pagemap_ops->memory_failure?

I think someone could come up with such a proposal on top of this; it
would not be hard to add some ops to pfn_address_space and have the
code call them instead of using the address_space.

This version just needs to link into the existing VMA machinery (ie
collect_procs_pfn); it doesn't make a lot of sense to push that work
into drivers.

Jason
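
For the archives, one possible shape of such an extension (purely
illustrative; none of this is in the series, and the op name is made
up) would be:

/* Illustrative only: a possible future extension of patch 1/3. */
struct pfn_address_space_ops {
        /* called by memory_failure_pfn() for a poisoned pfn in range */
        void (*pfn_failure)(struct pfn_address_space *pfn_space,
                            unsigned long pfn);
};

struct pfn_address_space {
        struct interval_tree_node node;
        struct address_space *mapping;
        const struct pfn_address_space_ops *ops;        /* optional */
};

memory_failure_pfn() could then invoke pfn_space->ops->pfn_failure()
where set, letting a driver track poison in its own allocator metadata
in addition to (or instead of) the mapping->i_mmap walk.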

* Re: [PATCH v3 1/3] mm: handle poisoning of pfn without struct pages

From: Ankit Agrawal @ 2025-10-24 11:59 UTC
To: Jason Gunthorpe, Shuai Xue
Cc: Aniket Agashe, Vikram Sethi, Matt Ochs, Shameer Kolothum,
	linmiaohe, nao.horiguchi, akpm, david, lorenzo.stoakes,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, tony.luck, bp,
	rafael, guohanjun, mchehab, lenb, kevin.tian, alex, Neo Jia,
	Kirti Wankhede, Tarun Gupta (SW-GPU), Zhi Wang, Dheeraj Nigam,
	Krishnakant Jaju, linux-kernel, linux-mm, linux-edac,
	Jonathan.Cameron, ira.weiny, Smita.KoralahalliChannabasappa,
	u.kleine-koenig, peterz, linux-acpi, kvm

Thank you so much for the feedback! Comments inline.

>> +int register_pfn_address_space(struct pfn_address_space *pfn_space)
>> +{
>> +	if (!pfn_space)
>> +		return -EINVAL;
>
> Is there a reason callers may be passing NULL?

No, that would be an invalid use case for such.

> Register and unregister are good places to use guard().

Thanks for the suggestion, Ira. Will update it.

> If the pfn is not in the process, why is it added to the kill list?

I kept it as it is still a process with a VMA mapped to the problematic
PFN. This would ultimately result in a SIGKILL being sent to the
process in kill_procs(). This is very similar to the __add_to_kill()
implementation.

>> +static int memory_failure_pfn(unsigned long pfn, int flags)
>> +{
>> +	struct interval_tree_node *node;
>> +	LIST_HEAD(tokill);
>> +
>> +	mutex_lock(&pfn_space_lock);
>
> scoped_guard()?  Or probably wrap this part in a guarded function.

Ack, thanks.

>> +void unregister_pfn_address_space(struct pfn_address_space *pfn_space)
>> +{
>> +	if (!pfn_space)
>> +		return;
>> +
>> +	mutex_lock(&pfn_space_lock);
>> +	interval_tree_remove(&pfn_space->node, &pfn_space_itree);
>
> IIRC removing something not in the interval tree will panic the kernel.
> If I am not mistaken, should this do something like
> interval_tree_iter_first before interval_tree_remove, to avoid a
> driver's ill behavior crashing the system?

Thanks Jiaqi for the suggestion. Yeah, I think we should add it. I'll
fix that in the next version.

> If the pfn doesn't belong to any address space mapping, is it still
> counted as MF_RECOVERED?

Hi Miaohe, I wasn't sure how we should tag this. It seems you are
suggesting MF_FAILED is more appropriate. I'll address that.

>> 	p = pfn_to_online_page(pfn);
>> 	if (!p) {
>> 		res = arch_memory_failure(pfn, flags);
>
> Can we move the above memory_failure_pfn block here? I'm worried that
> too many scenario branches might lead to confusion.

Sure, if there isn't any objection, I'll update it.

> On Fri, Oct 24, 2025 at 05:45:45PM +0800, Shuai Xue wrote:
>> Rather than having MM maintain metadata about these PFNs, have you
>> considered adding an operation callback similar to
>> dev_pagemap_ops->memory_failure?
>
> I think someone could come up with such a proposal on top of this; it
> would not be hard to add some ops to pfn_address_space and have the
> code call them instead of using the address_space.
>
> This version just needs to link into the existing VMA machinery (ie
> collect_procs_pfn); it doesn't make a lot of sense to push that work
> into drivers.

Thanks Shuai for the suggestion. However, I agree on this with Jason.
It is preferable to keep that as separate.
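
Putting Ira's guard() suggestion and Jiaqi's defensive check together,
the v4 register/unregister paths would look roughly like this (a sketch
only, untested; guard() comes from linux/cleanup.h and releases the
mutex on every return path):

int register_pfn_address_space(struct pfn_address_space *pfn_space)
{
        guard(mutex)(&pfn_space_lock);

        /* Reject overlap with an already registered range. */
        if (interval_tree_iter_first(&pfn_space_itree,
                                     pfn_space->node.start,
                                     pfn_space->node.last))
                return -EBUSY;

        interval_tree_insert(&pfn_space->node, &pfn_space_itree);
        return 0;
}

void unregister_pfn_address_space(struct pfn_address_space *pfn_space)
{
        guard(mutex)(&pfn_space_lock);

        /*
         * Only remove a node that is actually in the tree; removing an
         * unregistered node would corrupt the rbtree (Jiaqi's concern).
         */
        if (interval_tree_iter_first(&pfn_space_itree,
                                     pfn_space->node.start,
                                     pfn_space->node.last))
                interval_tree_remove(&pfn_space->node, &pfn_space_itree);
}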

* [PATCH v3 2/3] mm: Change ghes code to allow poison of non-struct pfn

From: ankita @ 2025-10-21 10:23 UTC
To: ankita, aniketa, vsethi, jgg, mochs, skolothumtho, linmiaohe,
	nao.horiguchi, akpm, david, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, tony.luck, bp, rafael, guohanjun,
	mchehab, lenb, kevin.tian, alex
Cc: cjia, kwankhede, targupta, zhiw, dnigam, kjaju, linux-kernel,
	linux-mm, linux-edac, Jonathan.Cameron, ira.weiny,
	Smita.KoralahalliChannabasappa, u.kleine-koenig, peterz,
	linux-acpi, kvm

From: Ankit Agrawal <ankita@nvidia.com>

The GHES code only allows memory_failure() to be called on PFNs that
pass the pfn_valid() check. This contract is broken for remapped PFNs,
which fail the check, so ghes_do_memory_failure() returns without
triggering memory_failure().

Update the code to allow the memory_failure() call on PFNs failing
pfn_valid().

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/acpi/apei/ghes.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a0d54993edb3..bc4d0f2b3e9d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -505,12 +505,6 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
 		return false;
 
 	pfn = PHYS_PFN(physical_addr);
-	if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
-		pr_warn_ratelimited(FW_WARN GHES_PFX
-				    "Invalid address in generic error data: %#llx\n",
-				    physical_addr);
-		return false;
-	}
 
 	if (flags == MF_ACTION_REQUIRED && current->mm) {
 		twcb = (void *)gen_pool_alloc(ghes_estatus_pool, sizeof(*twcb));
-- 
2.34.1

* Re: [PATCH v3 2/3] mm: Change ghes code to allow poison of non-struct pfn

From: Ira Weiny @ 2025-10-21 17:13 UTC
To: ankita, aniketa, vsethi, jgg, mochs, skolothumtho, linmiaohe,
	nao.horiguchi, akpm, david, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, tony.luck, bp, rafael, guohanjun,
	mchehab, lenb, kevin.tian, alex
Cc: cjia, kwankhede, targupta, zhiw, dnigam, kjaju, linux-kernel,
	linux-mm, linux-edac, Jonathan.Cameron, ira.weiny,
	Smita.KoralahalliChannabasappa, u.kleine-koenig, peterz,
	linux-acpi, kvm

ankita@ wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
[snip]
>
> 	pfn = PHYS_PFN(physical_addr);
> -	if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {

Tony,

I'm not an SGX expert, but does this break SGX by removing
arch_is_platform_page()?

See:

40e0e7843e23 ("x86/sgx: Add infrastructure to identify SGX EPC pages")

Cc: Tony Luck <tony.luck@intel.com>

Ira

* RE: [PATCH v3 2/3] mm: Change ghes code to allow poison of non-struct pfn

From: Luck, Tony @ 2025-10-21 17:19 UTC
To: Weiny, Ira, ankita, aniketa, Sethi, Vikram, jgg, mochs,
	skolothumtho, linmiaohe, nao.horiguchi, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	bp, rafael, guohanjun, mchehab, lenb, Tian, Kevin, alex
Cc: cjia, kwankhede, targupta, zhiw, dnigam, kjaju, linux-kernel,
	linux-mm, linux-edac, Jonathan.Cameron,
	Smita.KoralahalliChannabasappa, u.kleine-koenig, peterz,
	linux-acpi, kvm

>> 	pfn = PHYS_PFN(physical_addr);
>> -	if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
>
> Tony,
>
> I'm not an SGX expert, but does this break SGX by removing
> arch_is_platform_page()?
>
> See:
>
> 40e0e7843e23 ("x86/sgx: Add infrastructure to identify SGX EPC pages")

Ira,

I think this deletion makes the GHES code always call memory_failure()
instead of bailing out here on "bad" page frame numbers.

That centralizes the checks for different types of memory into
memory_failure().

-Tony

* Re: [PATCH v3 2/3] mm: Change ghes code to allow poison of non-struct pfn

From: Shuai Xue @ 2025-10-22 6:53 UTC
To: Luck, Tony, Weiny, Ira, ankita, aniketa, Sethi, Vikram, jgg,
	mochs, skolothumtho, linmiaohe, nao.horiguchi, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	bp, rafael, guohanjun, mchehab, lenb, Tian, Kevin, alex
Cc: cjia, kwankhede, targupta, zhiw, dnigam, kjaju, linux-kernel,
	linux-mm, linux-edac, Jonathan.Cameron,
	Smita.KoralahalliChannabasappa, u.kleine-koenig, peterz,
	linux-acpi, kvm

On 2025/10/22 01:19, Luck, Tony wrote:
>>> 	pfn = PHYS_PFN(physical_addr);
>>> -	if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
>>
>> Tony,
>>
>> I'm not an SGX expert, but does this break SGX by removing
>> arch_is_platform_page()?
>>
>> See:
>>
>> 40e0e7843e23 ("x86/sgx: Add infrastructure to identify SGX EPC pages")
>>
> Ira,
>
> I think this deletion makes the GHES code always call memory_failure()
> instead of bailing out here on "bad" page frame numbers.
>
> That centralizes the checks for different types of memory into
> memory_failure().
>
> -Tony

Hi Tony, Ankit and Ira,

Finally, we're seeing other use cases that need to handle errors for
non-struct page PFNs :)

IMHO, non-struct page PFNs are common in production environments.
Besides NVIDIA Grace GPU device memory, we also use reserved DRAM
memory managed by a separate VMEM allocator. This VMEM allocator is
designed for virtual machine memory allocation, significantly reducing
kernel memory management overhead by minimizing page table maintenance.

To enable hardware error isolation for these memory pages, we've
already removed this sanity check internally. This change makes
memory_failure() the central point for handling all memory types,
which is a much cleaner architecture.

Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>

Thanks.
Shuai

* Re: [PATCH v3 2/3] mm: Change ghes code to allow poison of non-struct pfn

From: Ira Weiny @ 2025-10-22 15:03 UTC
To: Shuai Xue, Luck, Tony, ankita, aniketa, Sethi, Vikram, jgg,
	mochs, skolothumtho, linmiaohe, nao.horiguchi, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	bp, rafael, guohanjun, mchehab, lenb, Tian, Kevin, alex
Cc: cjia, kwankhede, targupta, zhiw, dnigam, kjaju, linux-kernel,
	linux-mm, linux-edac, Jonathan.Cameron,
	Smita.KoralahalliChannabasappa, u.kleine-koenig, peterz,
	linux-acpi, kvm

Shuai Xue wrote:
>
[snip]
>
> IMHO, non-struct page PFNs are common in production environments.
> Besides NVIDIA Grace GPU device memory, we also use reserved DRAM
> memory managed by a separate VMEM allocator.

Can you elaborate on this more?

Ira

> This VMEM allocator is designed for virtual machine memory
> allocation, significantly reducing kernel memory management overhead
> by minimizing page table maintenance.
>
> To enable hardware error isolation for these memory pages, we've
> already removed this sanity check internally. This change makes
> memory_failure() the central point for handling all memory types,
> which is a much cleaner architecture.
>
> Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
>
> Thanks.
> Shuai

* Re: [PATCH v3 2/3] mm: Change ghes code to allow poison of non-struct pfn
  2025-10-22 15:03         ` Ira Weiny
@ 2025-10-24 10:03           ` Shuai Xue
  2025-10-24 11:26             ` Ankit Agrawal
  0 siblings, 1 reply; 21+ messages in thread
From: Shuai Xue @ 2025-10-24 10:03 UTC (permalink / raw)
To: Ira Weiny, Luck, Tony, ankita, aniketa, Sethi, Vikram, jgg, mochs,
	skolothumtho, linmiaohe, nao.horiguchi, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, bp,
	rafael, guohanjun, mchehab, lenb, Tian, Kevin, alex
Cc: cjia, kwankhede, targupta, zhiw, dnigam, kjaju, linux-kernel,
	linux-mm, linux-edac, Jonathan.Cameron,
	Smita.KoralahalliChannabasappa, u.kleine-koenig, peterz,
	linux-acpi, kvm

On 2025/10/22 23:03, Ira Weiny wrote:
> Shuai Xue wrote:
>>
>> On 2025/10/22 01:19, Luck, Tony wrote:
>>>>>	pfn = PHYS_PFN(physical_addr);
>>>>> -	if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
>>>>
>>>> Tony,
>>>>
>>>> I'm not an SGX expert but does this break SGX by removing
>>>> arch_is_platform_page()?
>>>>
>>>> See:
>>>>
>>>> 40e0e7843e23 ("x86/sgx: Add infrastructure to identify SGX EPC pages")
>>>> Cc: Tony Luck <tony.luck@intel.com>
>>>>
>>> Ira,
>>>
>>> I think this deletion makes the GHES code always call memory_failure()
>>> instead of bailing out here on "bad" page frame numbers.
>>>
>>> That centralizes the checks for different types of memory into
>>> memory_failure().
>>>
>>> -Tony
>>
>> Hi Tony, Ankit and Ira,
>>
>> Finally, we're seeing other use cases that need to handle errors for
>> non-struct page PFNs :)
>>
>> IMHO, non-struct page PFNs are common in production environments.
>> Besides NVIDIA Grace GPU device memory, we also use reserved DRAM memory
>> managed by a separate VMEM allocator.
>
> Can you elaborate on this more?

We reserve a significant portion of DRAM memory at boot time using
kernel command line parameters. This reserved memory is then managed by
our internal VMEM allocator, which handles memory allocation and
deallocation for virtual machines.

To minimize memory overhead, we intentionally avoid creating struct
pages for this reserved memory region. Instead, we've implemented the
following approach:

- Our VMEM allocator directly manages the physical memory without the
  overhead of struct page metadata.
- Error handling: we register custom RAS operations (ras_ops) with the
  memory failure infrastructure. When poisoned memory is accessed
  within this region, our registered handler:
  - tags the affected memory area as poisoned,
  - isolates the memory to prevent further access,
  - terminates any tasks that were using the poisoned memory.

This approach allows us to handle memory errors effectively while
maintaining minimal memory overhead for large reserved regions. It's
similar in concept to how device memory (like the NVIDIA Grace GPU
memory mentioned earlier) needs error handling without struct page
backing.

Thanks.
Shuai
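As a concrete illustration of the boot-time carve-out step described
above: the base address and size below are made-up example values, and
memmap= is only one common way to reserve a range, so Shuai's actual
setup may well differ.

    # Reserve 32G of DRAM starting at physical address 0x1000000000.
    # The kernel treats the range as reserved and typically creates no
    # struct pages for it; a driver or a VMEM-style allocator can then
    # manage the range itself.
    memmap=32G$0x1000000000

    # Note: when passed via a GRUB config the '$' usually needs
    # escaping, e.g. memmap=32G\$0x1000000000.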
* Re: [PATCH v3 2/3] mm: Change ghes code to allow poison of non-struct pfn
  2025-10-24 10:03           ` Shuai Xue
@ 2025-10-24 11:26             ` Ankit Agrawal
  0 siblings, 0 replies; 21+ messages in thread
From: Ankit Agrawal @ 2025-10-24 11:26 UTC (permalink / raw)
To: Shuai Xue, Ira Weiny, Luck, Tony, Aniket Agashe, Vikram Sethi,
	Jason Gunthorpe, Matt Ochs, Shameer Kolothum, linmiaohe,
	nao.horiguchi, akpm, david, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, bp, rafael, guohanjun, mchehab,
	lenb, Tian, Kevin, alex
Cc: Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Zhi Wang,
	Dheeraj Nigam, Krishnakant Jaju, linux-kernel, linux-mm,
	linux-edac, Jonathan.Cameron, Smita.KoralahalliChannabasappa,
	u.kleine-koenig, peterz, linux-acpi, kvm

>>> IMHO, non-struct page PFNs are common in production environments.
>>> Besides NVIDIA Grace GPU device memory, we also use reserved DRAM memory
>>> managed by a separate VMEM allocator.
>>
>> Can you elaborate on this more?
>
> We reserve a significant portion of DRAM memory at boot time using
> kernel command line parameters. This reserved memory is then managed by
> our internal VMEM allocator, which handles memory allocation and
> deallocation for virtual machines.
>
> To minimize memory overhead, we intentionally avoid creating struct
> pages for this reserved memory region.

Thanks Shuai for the explanation and for providing the "Reviewed-by".
We use PFNMAP for a similar reason, to save on the memory usage of the
struct pages. Also, as mentioned on the 1/3 patch, I'll move this patch
to the top.
* [PATCH v3 3/3] vfio/nvgrace-gpu: register device memory for poison handling
  2025-10-21 10:23 [PATCH v3 0/3] mm: Implement ECC handling for pfn with no struct page ankita
  2025-10-21 10:23 ` [PATCH v3 1/3] mm: handle poisoning of pfn without struct pages ankita
  2025-10-21 10:23 ` [PATCH v3 2/3] mm: Change ghes code to allow poison of non-struct pfn ankita
@ 2025-10-21 10:23 ` ankita
  2025-10-21 16:30 ` [PATCH v3 0/3] mm: Implement ECC handling for pfn with no struct page Liam R. Howlett
  3 siblings, 0 replies; 21+ messages in thread
From: ankita @ 2025-10-21 10:23 UTC (permalink / raw)
To: ankita, aniketa, vsethi, jgg, mochs, skolothumtho, linmiaohe,
	nao.horiguchi, akpm, david, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, tony.luck, bp, rafael, guohanjun,
	mchehab, lenb, kevin.tian, alex
Cc: cjia, kwankhede, targupta, zhiw, dnigam, kjaju, linux-kernel,
	linux-mm, linux-edac, Jonathan.Cameron, ira.weiny,
	Smita.KoralahalliChannabasappa, u.kleine-koenig, peterz,
	linux-acpi, kvm

From: Ankit Agrawal <ankita@nvidia.com>

The nvgrace-gpu-vfio-pci module [1] maps the device memory to the user
VA (Qemu) using remap_pfn_range() without adding the memory to the
kernel. The device memory pages are not backed by struct page.

Patches 1-2 implement the mechanism to handle ECC/poison on memory
pages without struct page. That new mechanism is used here. The module
registers its memory region and the address_space with the kernel MM
for ECC handling, using the register_pfn_address_space() registration
API exposed by the kernel.

Link: https://lore.kernel.org/all/20240220115055.23546-1-ankita@nvidia.com/ [1]

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/vfio/pci/nvgrace-gpu/main.c | 45 ++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index d95761dcdd58..80b3ed63c682 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -8,6 +8,10 @@
 #include <linux/delay.h>
 #include <linux/jiffies.h>
 
+#ifdef CONFIG_MEMORY_FAILURE
+#include <linux/memory-failure.h>
+#endif
+
 /*
  * The device memory usable to the workloads running in the VM is cached
  * and showcased as a 64b device BAR (comprising of BAR4 and BAR5 region)
@@ -47,6 +51,9 @@ struct mem_region {
 		void *memaddr;
 		void __iomem *ioaddr;
 	};			/* Base virtual address of the region */
+#ifdef CONFIG_MEMORY_FAILURE
+	struct pfn_address_space pfn_address_space;
+#endif
 };
 
 struct nvgrace_gpu_pci_core_device {
@@ -60,6 +67,28 @@ struct nvgrace_gpu_pci_core_device {
 	bool has_mig_hw_bug;
 };
 
+#ifdef CONFIG_MEMORY_FAILURE
+
+static int
+nvgrace_gpu_vfio_pci_register_pfn_range(struct mem_region *region,
+					struct vm_area_struct *vma)
+{
+	unsigned long nr_pages;
+	int ret = 0;
+
+	nr_pages = region->memlength >> PAGE_SHIFT;
+
+	region->pfn_address_space.node.start = vma->vm_pgoff;
+	region->pfn_address_space.node.last = vma->vm_pgoff + nr_pages - 1;
+	region->pfn_address_space.mapping = vma->vm_file->f_mapping;
+
+	ret = register_pfn_address_space(&region->pfn_address_space);
+
+	return ret;
+}
+
+#endif
+
 static void nvgrace_gpu_init_fake_bar_emu_regs(struct vfio_device *core_vdev)
 {
 	struct nvgrace_gpu_pci_core_device *nvdev =
@@ -127,6 +156,13 @@ static void nvgrace_gpu_close_device(struct vfio_device *core_vdev)
 
 	mutex_destroy(&nvdev->remap_lock);
 
+#ifdef CONFIG_MEMORY_FAILURE
+	if (nvdev->resmem.memlength)
+		unregister_pfn_address_space(&nvdev->resmem.pfn_address_space);
+
+	unregister_pfn_address_space(&nvdev->usemem.pfn_address_space);
+#endif
+
 	vfio_pci_core_close_device(core_vdev);
 }
 
@@ -202,7 +238,14 @@ static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
 
 	vma->vm_pgoff = start_pfn;
 
-	return 0;
+#ifdef CONFIG_MEMORY_FAILURE
+	if (nvdev->resmem.memlength && index == VFIO_PCI_BAR2_REGION_INDEX)
+		ret = nvgrace_gpu_vfio_pci_register_pfn_range(&nvdev->resmem, vma);
+	else if (index == VFIO_PCI_BAR4_REGION_INDEX)
+		ret = nvgrace_gpu_vfio_pci_register_pfn_range(&nvdev->usemem, vma);
+#endif
+
+	return ret;
 }
 
 static long
-- 
2.34.1
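struct pfn_address_space and the registration API used in this patch
are introduced by patch 1/3, which is not shown in this part of the
thread. Judging only from the fields and calls visible above, the
header plausibly declares something along these lines; this is a
sketch under that assumption, not the actual patch contents:

    /* Assumed shape of the include/linux/memory-failure.h additions;
     * the exact upstream declarations may differ. */
    #include <linux/interval_tree.h>

    struct address_space;

    struct pfn_address_space {
            /* PFN range covered: node.start/node.last are filled in
             * from vma->vm_pgoff, and MM tracks the node in an
             * interval tree to look up a poisoned PFN's owner. */
            struct interval_tree_node node;
            /* address_space used to find, and SIGBUS, the user
             * processes mapping a poisoned PFN */
            struct address_space *mapping;
    };

    int register_pfn_address_space(struct pfn_address_space *pfn_space);
    void unregister_pfn_address_space(struct pfn_address_space *pfn_space);

On poison, MM would consult the interval tree for the reported PFN and
signal the processes found through the registered mapping, per the
cover letter's description.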
* Re: [PATCH v3 0/3] mm: Implement ECC handling for pfn with no struct page
  2025-10-21 10:23 [PATCH v3 0/3] mm: Implement ECC handling for pfn with no struct page ankita
                   ` (2 preceding siblings ...)
  2025-10-21 10:23 ` [PATCH v3 3/3] vfio/nvgrace-gpu: register device memory for poison handling ankita
@ 2025-10-21 16:30 ` Liam R. Howlett
  2025-10-21 16:44   ` Jason Gunthorpe
  3 siblings, 1 reply; 21+ messages in thread
From: Liam R. Howlett @ 2025-10-21 16:30 UTC (permalink / raw)
To: ankita
Cc: aniketa, vsethi, jgg, mochs, skolothumtho, linmiaohe,
	nao.horiguchi, akpm, david, lorenzo.stoakes, vbabka, rppt,
	surenb, mhocko, tony.luck, bp, rafael, guohanjun, mchehab, lenb,
	kevin.tian, alex, cjia, kwankhede, targupta, zhiw, dnigam, kjaju,
	linux-kernel, linux-mm, linux-edac, Jonathan.Cameron, ira.weiny,
	Smita.KoralahalliChannabasappa, u.kleine-koenig, peterz,
	linux-acpi, kvm

* ankita@nvidia.com <ankita@nvidia.com> [251021 06:23]:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The kernel MM currently handles ECC errors / poison only on memory page
> backed by struct page. The handling is currently missing for the PFNMAP
> memory that does not have struct pages. The series adds such support.
>
> Implement a new ECC handling for memory without struct pages. Kernel MM
> expose registration APIs to allow modules that are managing the device
> to register its device memory region. MM then tracks such regions using
> interval tree.
>
> The mechanism is largely similar to that of ECC on pfn with struct pages.
> If there is an ECC error on a pfn, all the mapping to it are identified
> and a SIGBUS is sent to the user space processes owning those mappings.
> Note that there is one primary difference versus the handling of the
> poison on struct pages, which is to skip unmapping to the poisoned PFN.
> This is done to handle the huge PFNMAP support added recently [1] that
> enables VM_PFNMAP vmas to map at PMD level. Otherwise, a poison to a PFN
> would need breaking the PMD mapping into PTEs to unmap only the poisoned
> PFN. This can have a major performance impact.

Is the performance impact really a concern in the event of failed
memory?  Does this happen enough to warrant this special case?

Surely it's not failing hardware that may cause performance impacts, so
is this triggered in some other way that I'm missing or a conversation
pointer?

>
> nvgrace-gpu-vfio-pci module maps the device memory to user VA (Qemu) using
> remap_pfn_range without being added to the kernel [2]. These device memory
> PFNs are not backed by struct page. So make nvgrace-gpu-vfio-pci module
> make use of the mechanism to get poison handling support on the device
> memory.
>
> Patch rebased to v6.17-rc7.
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>
> Link: https://lore.kernel.org/all/20231123003513.24292-1-ankita@nvidia.com/ [v2]
>
> v2 -> v3
> - Rebased to v6.17-rc7.
> - Skipped the unmapping of PFNMAP during reception of poison. Suggested by
>   Jason Gunthorpe, Jiaqi Yan, Vikram Sethi (Thanks!)
> - Updated the check to prevent multiple registration to the same PFN
>   range using interval_tree_iter_first. Thanks Shameer Kolothum for the
>   suggestion.
> - Removed the callback function in the nvgrace-gpu requiring tracking of
>   poisoned PFN as it isn't required anymore.
> - Introduced seperate collect_procs_pfn function to collect the list of
>   processes mapping to the poisoned PFN.
>
> v1 -> v2
> - Change poisoned page tracking from bitmap to hashtable.
> - Addressed miscellaneous comments in v1.
>
> Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]
> Link: https://lore.kernel.org/all/20240220115055.23546-1-ankita@nvidia.com/ [2]
>
> Ankit Agrawal (3):
>   mm: handle poisoning of pfn without struct pages
>   mm: Change ghes code to allow poison of non-struct pfn
>   vfio/nvgrace-gpu: register device memory for poison handling
>
>  MAINTAINERS                         |   1 +
>  drivers/acpi/apei/ghes.c            |   6 --
>  drivers/vfio/pci/nvgrace-gpu/main.c |  45 +++++++++-
>  include/linux/memory-failure.h      |  17 ++++
>  include/linux/mm.h                  |   1 +
>  include/ras/ras_event.h             |   1 +
>  mm/Kconfig                          |   1 +
>  mm/memory-failure.c                 | 128 +++++++++++++++++++++++++++-
>  8 files changed, 192 insertions(+), 8 deletions(-)
>  create mode 100644 include/linux/memory-failure.h
>
> --
> 2.34.1
>
* Re: [PATCH v3 0/3] mm: Implement ECC handling for pfn with no struct page
  2025-10-21 16:30 ` [PATCH v3 0/3] mm: Implement ECC handling for pfn with no struct page Liam R. Howlett
@ 2025-10-21 16:44   ` Jason Gunthorpe
  2025-10-21 18:54     ` Liam R. Howlett
  0 siblings, 1 reply; 21+ messages in thread
From: Jason Gunthorpe @ 2025-10-21 16:44 UTC (permalink / raw)
To: Liam R. Howlett, ankita, aniketa, vsethi, mochs, skolothumtho,
	linmiaohe, nao.horiguchi, akpm, david, lorenzo.stoakes, vbabka,
	rppt, surenb, mhocko, tony.luck, bp, rafael, guohanjun, mchehab,
	lenb, kevin.tian, alex, cjia, kwankhede, targupta, zhiw, dnigam,
	kjaju, linux-kernel, linux-mm, linux-edac, Jonathan.Cameron,
	ira.weiny, Smita.KoralahalliChannabasappa, u.kleine-koenig,
	peterz, linux-acpi, kvm

On Tue, Oct 21, 2025 at 12:30:48PM -0400, Liam R. Howlett wrote:
> > enables VM_PFNMAP vmas to map at PMD level. Otherwise, a poison to a PFN
> > would need breaking the PMD mapping into PTEs to unmap only the poisoned
> > PFN. This can have a major performance impact.
>
> Is the performance impact really a concern in the event of failed
> memory?

Yes, something like the KVM S2 is very sensitive to page size for TLB
performance.

> Does this happen enough to warrant this special case?

If you have a 100k sized cluster it happens constantly :\

> Surely it's not failing hardware that may cause performance impacts, so
> is this triggered in some other way that I'm missing or a conversation
> pointer?

It is the splitting of a pgd/pmd level into PTEs that gets mirrored
into the S2 and then greatly increases the cost of table walks inside
a guest. The HW caches are sized for 1G S2 PTEs, not 4k.

Jason
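To put rough numbers on the splitting cost (illustrative arithmetic,
not measurements from this thread): one 1G stage-2 block mapping covers
as much address space as 1G / 4K = 262,144 individual 4K PTEs, so a
single split multiplies the number of S2 entries competing for TLB and
walk-cache space on that range by over a quarter million. And a cold
table walk under nested paging is multiplicative: with a 4-level guest
page table whose every step must itself be translated through a 4-level
stage-2 table, the textbook worst case is (4+1)*(4+1)-1 = 24 memory
accesses per TLB miss, which is why shrinking the effective S2 page
size shows up so strongly in guest performance.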
* Re: [PATCH v3 0/3] mm: Implement ECC handling for pfn with no struct page
  2025-10-21 16:44   ` Jason Gunthorpe
@ 2025-10-21 18:54     ` Liam R. Howlett
  2025-10-21 22:38       ` Jason Gunthorpe
  0 siblings, 1 reply; 21+ messages in thread
From: Liam R. Howlett @ 2025-10-21 18:54 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: ankita, aniketa, vsethi, mochs, skolothumtho, linmiaohe,
	nao.horiguchi, akpm, david, lorenzo.stoakes, vbabka, rppt,
	surenb, mhocko, tony.luck, bp, rafael, guohanjun, mchehab, lenb,
	kevin.tian, alex, cjia, kwankhede, targupta, zhiw, dnigam, kjaju,
	linux-kernel, linux-mm, linux-edac, Jonathan.Cameron, ira.weiny,
	Smita.KoralahalliChannabasappa, u.kleine-koenig, peterz,
	linux-acpi, kvm

* Jason Gunthorpe <jgg@nvidia.com> [251021 12:44]:
> On Tue, Oct 21, 2025 at 12:30:48PM -0400, Liam R. Howlett wrote:
> > > enables VM_PFNMAP vmas to map at PMD level. Otherwise, a poison to a PFN
> > > would need breaking the PMD mapping into PTEs to unmap only the poisoned
> > > PFN. This can have a major performance impact.
> >
> > Is the performance impact really a concern in the event of failed
> > memory?
>
> Yes, something like the KVM S2 is very sensitive to page size for TLB
> performance.
>
> > Does this happen enough to warrant this special case?
>
> If you have a 100k sized cluster it happens constantly :\
>
> > Surely it's not failing hardware that may cause performance impacts, so
> > is this triggered in some other way that I'm missing or a conversation
> > pointer?
>
> It is the splitting of a pgd/pmd level into PTEs that gets mirrored
> into the S2 and then greatly increases the cost of table walks inside
> a guest. The HW caches are sized for 1G S2 PTEs, not 4k.

Ah, I see.  Seems like a worthy addition to the commit message?  I mean,
this is really a choice of throwing away memory for the benefit of TLB
performance.  Seems like a valid choice in your usecase, but less so for
the average laptop.

Won't leaving the poisoned memory mapped cause migration issues?  Even
if the machine is migrated, my understanding is that the poison follows
through checkpoint/restore.

Thanks,
Liam
* Re: [PATCH v3 0/3] mm: Implement ECC handling for pfn with no struct page
  2025-10-21 18:54     ` Liam R. Howlett
@ 2025-10-21 22:38       ` Jason Gunthorpe
  2025-10-24 11:16         ` Ankit Agrawal
  0 siblings, 1 reply; 21+ messages in thread
From: Jason Gunthorpe @ 2025-10-21 22:38 UTC (permalink / raw)
To: Liam R. Howlett, ankita, aniketa, vsethi, mochs, skolothumtho,
	linmiaohe, nao.horiguchi, akpm, david, lorenzo.stoakes, vbabka,
	rppt, surenb, mhocko, tony.luck, bp, rafael, guohanjun, mchehab,
	lenb, kevin.tian, alex, cjia, kwankhede, targupta, zhiw, dnigam,
	kjaju, linux-kernel, linux-mm, linux-edac, Jonathan.Cameron,
	ira.weiny, Smita.KoralahalliChannabasappa, u.kleine-koenig,
	peterz, linux-acpi, kvm

On Tue, Oct 21, 2025 at 02:54:10PM -0400, Liam R. Howlett wrote:
> > > Surely it's not failing hardware that may cause performance impacts, so
> > > is this triggered in some other way that I'm missing or a conversation
> > > pointer?
> >
> > It is the splitting of a pgd/pmd level into PTEs that gets mirrored
> > into the S2 and then greatly increases the cost of table walks inside
> > a guest. The HW caches are sized for 1G S2 PTEs, not 4k.
>
> Ah, I see.  Seems like a worthy addition to the commit message?  I mean,
> this is really a choice of throwing away memory for the benefit of TLB
> performance.  Seems like a valid choice in your usecase, but less so for
> the average laptop.

No memory is being thrown away; the choice is whether the kernel will
protect itself from userspace issuing repeated loads to bad memory.

Ankit please include some of these details in the commit message

> Won't leaving the poisoned memory mapped cause migration issues?  Even
> if the machine is migrated, my understanding is that the poison follows
> through checkpoint/restore.

The VMM has to keep track of this and not try to read the bad memory
during migration.

Jason
* Re: [PATCH v3 0/3] mm: Implement ECC handling for pfn with no struct page
  2025-10-21 22:38       ` Jason Gunthorpe
@ 2025-10-24 11:16         ` Ankit Agrawal
  0 siblings, 0 replies; 21+ messages in thread
From: Ankit Agrawal @ 2025-10-24 11:16 UTC (permalink / raw)
To: Jason Gunthorpe, Liam R. Howlett, Aniket Agashe, Vikram Sethi,
	Matt Ochs, Shameer Kolothum, linmiaohe, nao.horiguchi, akpm,
	david, lorenzo.stoakes, vbabka, rppt, surenb, mhocko, tony.luck,
	bp, rafael, guohanjun, mchehab, lenb, kevin.tian, alex, Neo Jia,
	Kirti Wankhede, Tarun Gupta (SW-GPU), Zhi Wang, Dheeraj Nigam,
	Krishnakant Jaju, linux-kernel, linux-mm, linux-edac,
	Jonathan.Cameron, ira.weiny, Smita.KoralahalliChannabasappa,
	u.kleine-koenig, peterz, linux-acpi, kvm

>> Ah, I see.  Seems like a worthy addition to the commit message?  I mean,
>> this is really a choice of throwing away memory for the benefit of TLB
>> performance.  Seems like a valid choice in your usecase, but less so for
>> the average laptop.
>
> No memory is being thrown away; the choice is whether the kernel will
> protect itself from userspace issuing repeated loads to bad memory.
>
> Ankit please include some of these details in the commit message

Thanks Jason and Liam for the suggestion. I'll update the commit message
in the next version.
End of thread, other threads: [~2025-10-24 11:59 UTC | newest]

Thread overview: 21+ messages
2025-10-21 10:23 [PATCH v3 0/3] mm: Implement ECC handling for pfn with no struct page ankita
2025-10-21 10:23 ` [PATCH v3 1/3] mm: handle poisoning of pfn without struct pages ankita
2025-10-21 17:05   ` Ira Weiny
2025-10-22 16:00     ` Jiaqi Yan
2025-10-24  6:34       ` Miaohe Lin
2025-10-24  9:45   ` Shuai Xue
2025-10-24 11:52     ` Jason Gunthorpe
2025-10-24 11:59       ` Ankit Agrawal
2025-10-21 10:23 ` [PATCH v3 2/3] mm: Change ghes code to allow poison of non-struct pfn ankita
2025-10-21 17:13   ` Ira Weiny
2025-10-21 17:19     ` Luck, Tony
2025-10-22  6:53       ` Shuai Xue
2025-10-22 15:03         ` Ira Weiny
2025-10-24 10:03           ` Shuai Xue
2025-10-24 11:26             ` Ankit Agrawal
2025-10-21 10:23 ` [PATCH v3 3/3] vfio/nvgrace-gpu: register device memory for poison handling ankita
2025-10-21 16:30 ` [PATCH v3 0/3] mm: Implement ECC handling for pfn with no struct page Liam R. Howlett
2025-10-21 16:44   ` Jason Gunthorpe
2025-10-21 18:54     ` Liam R. Howlett
2025-10-21 22:38       ` Jason Gunthorpe
2025-10-24 11:16         ` Ankit Agrawal