linux-mm.kvack.org archive mirror
* [PATCH v1 0/4] mm: Implement ECC handling for pfn with no struct page
@ 2023-09-20 14:02 ankita
  2023-09-20 14:02 ` [PATCH v1 1/4] mm: handle poisoning of pfn without struct pages ankita
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: ankita @ 2023-09-20 14:02 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, akpm, tony.luck, bp,
	naoya.horiguchi, linmiaohe
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, anuaggarwal,
	linux-kernel, linux-mm, linux-edac, kvm

From: Ankit Agrawal <ankita@nvidia.com>

The kernel MM currently handles ECC errors / poison only on memory pages
backed by struct page. As part of [1], the nvgrace-gpu-vfio-pci module
maps the device memory to user VA (QEMU) using remap_pfn_range() without
adding it to the kernel. These pages are not backed by struct page.

Implement new ECC handling for memory without struct pages. The kernel MM
exposes registration APIs that allow modules managing the device to
register their memory regions and a callback function. MM then tracks
such regions using an interval tree.

The mechanism is largely similar to ECC handling on PFNs backed by struct
pages. If there is an ECC error on a PFN, MM uses the registered memory
failure callback to notify the module of the faulty PFN, so that the module
may take any required action. The PFN is then unmapped in Stage-2. When the
VM later accesses the page, the access traps into KVM, which calls the vma
ops fault function. If the module fault function returns VM_FAULT_HWPOISON,
KVM sends a SIGBUS (BUS_MCEERR_AR) to the userspace process (QEMU) mapped
to the poisoned page. A usage sketch of the registration API follows below.
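
As a rough usage sketch of the proposed registration API (struct
pfn_address_space and register_pfn_address_space() are introduced in patch 1;
the mydev_* names below are purely illustrative and not part of the series):

#include <linux/memory-failure.h>

/* Illustrative callback: MM reports a poisoned PFN within the range. */
static void mydev_pfn_failure(struct pfn_address_space *pfn_space,
			      unsigned long pfn)
{
	/* e.g. record the PFN so a later fault can return VM_FAULT_HWPOISON */
}

static const struct pfn_address_space_ops mydev_pas_ops = {
	.failure = mydev_pfn_failure,
};

static struct pfn_address_space mydev_pas;

/* Called from the driver's mmap path once the PFN range is known. */
static int mydev_register_poison_range(struct vm_area_struct *vma,
				       unsigned long nr_pages)
{
	mydev_pas.node.start = vma->vm_pgoff;
	mydev_pas.node.last = vma->vm_pgoff + nr_pages - 1;
	mydev_pas.ops = &mydev_pas_ops;
	mydev_pas.mapping = vma->vm_file->f_mapping;

	return register_pfn_address_space(&mydev_pas);
}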

Lastly, the nvgrace-gpu-vfio-pci module makes use of the new mechanism to
get poison handling support for the device memory.

The patches are generated over v6.6-rc2 with [1] applied. [1] is currently
under review.

[1] https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/

Ankit Agrawal (4):
  mm: handle poisoning of pfn without struct pages
  mm: Add poison error check in fixup_user_fault() for mapped pfn
  mm: Change ghes code to allow poison of non-struct pfn
  vfio/nvgpu: register device memory for poison handling

 drivers/acpi/apei/ghes.c            |  12 +--
 drivers/vfio/pci/nvgrace-gpu/main.c | 107 +++++++++++++++++++++-
 drivers/vfio/vfio.h                 |  11 ---
 drivers/vfio/vfio_main.c            |   3 +-
 include/linux/memory-failure.h      |  22 +++++
 include/linux/mm.h                  |   1 +
 include/linux/vfio.h                |  15 ++++
 include/ras/ras_event.h             |   1 +
 mm/Kconfig                          |   1 +
 mm/gup.c                            |   2 +-
 mm/memory-failure.c                 | 135 +++++++++++++++++++++++-----
 virt/kvm/kvm_main.c                 |   6 ++
 12 files changed, 270 insertions(+), 46 deletions(-)
 create mode 100644 include/linux/memory-failure.h

-- 
2.17.1




* [PATCH v1 1/4] mm: handle poisoning of pfn without struct pages
  2023-09-20 14:02 [PATCH v1 0/4] mm: Implement ECC handling for pfn with no struct page ankita
@ 2023-09-20 14:02 ` ankita
  2023-09-23  3:20   ` Miaohe Lin
  2023-09-26  7:23   ` Naoya Horiguchi
  2023-09-20 14:02 ` [PATCH v1 2/4] mm: Add poison error check in fixup_user_fault() for mapped pfn ankita
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 13+ messages in thread
From: ankita @ 2023-09-20 14:02 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, akpm, tony.luck, bp,
	naoya.horiguchi, linmiaohe
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, anuaggarwal,
	linux-kernel, linux-mm, linux-edac, kvm

From: Ankit Agrawal <ankita@nvidia.com>

The kernel MM currently does not handle ECC errors / poison on a memory
region that is not backed by struct pages. If a memory region is mapped
using remap_pfn_range(), but not added to the kernel, MM will not have
associated struct pages. Add a new mechanism to handle memory failure
on such memory.

Make the kernel MM expose a function to allow modules managing the device
memory to register a failure function and the physical address space
associated with the device memory. MM maintains this information as an
interval tree. The registered memory failure function is used by MM to
notify the kernel module managing the PFN, so that the module may take
any required action. The module may, for example, use the information
to track the poisoned pages.

In this implementation, the kernel MM follows a sequence (mostly) similar
to the memory_failure() handler for struct page backed memory:
1. memory_failure() is triggered on reception of a poison error. The
absence of a struct page is detected and consequently memory_failure_pfn()
is executed.
2. memory_failure_pfn() calls the newly introduced failure handler exposed
by the module managing the poisoned memory to notify it of the problematic
PFN.
3. memory_failure_pfn() unmaps the stage-2 mapping to the PFN.
4. memory_failure_pfn() collects the processes mapped to the PFN.
5. memory_failure_pfn() sends SIGBUS (BUS_MCEERR_AO) to all the processes
mapping the faulty PFN using kill_procs().
6. A later access to the faulty PFN by an operation in the VM is trapped
and user_mem_abort() is called.
7. The vma ops fault function gets called due to the absence of the Stage-2
mapping. It is expected to return VM_FAULT_HWPOISON for the PFN (see the
sketch below).
8. __gfn_to_pfn_memslot() then returns KVM_PFN_ERR_HWPOISON, which causes
SIGBUS (BUS_MCEERR_AR) to be sent to the QEMU process through
kvm_send_hwpoison_signal().
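
For reference, steps 6-8 seen from the module side boil down to a vma fault
handler along these lines (patch 4 carries the actual implementation; the
mydev_* names and the poison bitmap below are illustrative only):

static unsigned long *mydev_poison_bitmap; /* set from the failure callback */

static vm_fault_t mydev_fault(struct vm_fault *vmf)
{
	unsigned long mem_offset = vmf->pgoff - vmf->vma->vm_pgoff;

	/*
	 * The PFN was earlier reported as poisoned through the registered
	 * failure callback. Returning VM_FAULT_HWPOISON makes KVM turn the
	 * fault into KVM_PFN_ERR_HWPOISON and a BUS_MCEERR_AR SIGBUS.
	 */
	if (test_bit(mem_offset, mydev_poison_bitmap))
		return VM_FAULT_HWPOISON;

	return VM_FAULT_ERROR;
}

static const struct vm_operations_struct mydev_mmap_ops = {
	.fault = mydev_fault,
};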

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 include/linux/memory-failure.h |  22 ++++++
 include/linux/mm.h             |   1 +
 include/ras/ras_event.h        |   1 +
 mm/Kconfig                     |   1 +
 mm/memory-failure.c            | 135 ++++++++++++++++++++++++++++-----
 5 files changed, 139 insertions(+), 21 deletions(-)
 create mode 100644 include/linux/memory-failure.h

diff --git a/include/linux/memory-failure.h b/include/linux/memory-failure.h
new file mode 100644
index 000000000000..9a579960972a
--- /dev/null
+++ b/include/linux/memory-failure.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_FAILURE_H
+#define _LINUX_MEMORY_FAILURE_H
+
+#include <linux/interval_tree.h>
+
+struct pfn_address_space;
+
+struct pfn_address_space_ops {
+	void (*failure)(struct pfn_address_space *pfn_space, unsigned long pfn);
+};
+
+struct pfn_address_space {
+	struct interval_tree_node node;
+	const struct pfn_address_space_ops *ops;
+	struct address_space *mapping;
+};
+
+int register_pfn_address_space(struct pfn_address_space *pfn_space);
+void unregister_pfn_address_space(struct pfn_address_space *pfn_space);
+
+#endif /* _LINUX_MEMORY_FAILURE_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1b16f4..d677688c016c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3934,6 +3934,7 @@ enum mf_action_page_type {
 	MF_MSG_BUDDY,
 	MF_MSG_DAX,
 	MF_MSG_UNSPLIT_THP,
+	MF_MSG_PFN_MAP,
 	MF_MSG_UNKNOWN,
 };
 
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index cbd3ddd7c33d..05c3e6f6bd02 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -373,6 +373,7 @@ TRACE_EVENT(aer_event,
 	EM ( MF_MSG_BUDDY, "free buddy page" )				\
 	EM ( MF_MSG_DAX, "dax page" )					\
 	EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" )			\
+	EM ( MF_MSG_PFN_MAP, "non struct page pfn" )			\
 	EMe ( MF_MSG_UNKNOWN, "unknown page" )
 
 /*
diff --git a/mm/Kconfig b/mm/Kconfig
index 264a2df5ecf5..2ee42ff8b6ca 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -762,6 +762,7 @@ config MEMORY_FAILURE
 	depends on ARCH_SUPPORTS_MEMORY_FAILURE
 	bool "Enable recovery from hardware memory errors"
 	select MEMORY_ISOLATION
+	select INTERVAL_TREE
 	select RAS
 	help
 	  Enables code to recover from some memory failures on systems
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4d6e43c88489..e1e1d96fd6a2 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -38,6 +38,7 @@
 
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/memory-failure.h>
 #include <linux/page-flags.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/task.h>
@@ -60,6 +61,7 @@
 #include <linux/pagewalk.h>
 #include <linux/shmem_fs.h>
 #include <linux/sysctl.h>
+#include <linux/pfn_t.h>
 #include "swap.h"
 #include "internal.h"
 #include "ras/ras_event.h"
@@ -144,6 +146,10 @@ static struct ctl_table memory_failure_table[] = {
 	{ }
 };
 
+static struct rb_root_cached pfn_space_itree = RB_ROOT_CACHED;
+
+static DEFINE_MUTEX(pfn_space_lock);
+
 /*
  * Return values:
  *   1:   the page is dissolved (if needed) and taken off from buddy,
@@ -422,15 +428,15 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
  * Schedule a process for later kill.
  * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
  *
- * Note: @fsdax_pgoff is used only when @p is a fsdax page and a
- * filesystem with a memory failure handler has claimed the
- * memory_failure event. In all other cases, page->index and
- * page->mapping are sufficient for mapping the page back to its
+ * Notice: @pgoff is used either when @p is a fsdax page or a PFN is not
+ * backed by struct page and a filesystem with a memory failure handler
+ * has claimed the memory_failure event. In all other cases, page->index
+ * and page->mapping are sufficient for mapping the page back to its
  * corresponding user virtual address.
  */
 static void __add_to_kill(struct task_struct *tsk, struct page *p,
 			  struct vm_area_struct *vma, struct list_head *to_kill,
-			  unsigned long ksm_addr, pgoff_t fsdax_pgoff)
+			  unsigned long ksm_addr, pgoff_t pgoff)
 {
 	struct to_kill *tk;
 
@@ -440,13 +446,18 @@ static void __add_to_kill(struct task_struct *tsk, struct page *p,
 		return;
 	}
 
-	tk->addr = ksm_addr ? ksm_addr : page_address_in_vma(p, vma);
-	if (is_zone_device_page(p)) {
-		if (fsdax_pgoff != FSDAX_INVALID_PGOFF)
-			tk->addr = vma_pgoff_address(fsdax_pgoff, 1, vma);
-		tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
-	} else
-		tk->size_shift = page_shift(compound_head(p));
+	if (vma->vm_flags | PFN_MAP) {
+		tk->addr = vma_pgoff_address(pgoff, 1, vma);
+		tk->size_shift = PAGE_SHIFT;
+	} else {
+		tk->addr = ksm_addr ? ksm_addr : page_address_in_vma(p, vma);
+		if (is_zone_device_page(p)) {
+			if (pgoff != FSDAX_INVALID_PGOFF)
+				tk->addr = vma_pgoff_address(pgoff, 1, vma);
+			tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
+		} else
+			tk->size_shift = page_shift(compound_head(p));
+	}
 
 	/*
 	 * Send SIGKILL if "tk->addr == -EFAULT". Also, as
@@ -666,8 +677,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 	i_mmap_unlock_read(mapping);
 }
 
-#ifdef CONFIG_FS_DAX
-static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
+static void add_to_kill_pgoff(struct task_struct *tsk, struct page *p,
 			      struct vm_area_struct *vma,
 			      struct list_head *to_kill, pgoff_t pgoff)
 {
@@ -677,9 +687,9 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
 /*
  * Collect processes when the error hit a fsdax page.
  */
-static void collect_procs_fsdax(struct page *page,
-		struct address_space *mapping, pgoff_t pgoff,
-		struct list_head *to_kill)
+static void collect_procs_pgoff(struct page *page,
+				struct address_space *mapping, pgoff_t pgoff,
+				struct list_head *to_kill)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
@@ -693,13 +703,12 @@ static void collect_procs_fsdax(struct page *page,
 			continue;
 		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
 			if (vma->vm_mm == t->mm)
-				add_to_kill_fsdax(t, page, vma, to_kill, pgoff);
+				add_to_kill_pgoff(t, page, vma, to_kill, pgoff);
 		}
 	}
 	rcu_read_unlock();
 	i_mmap_unlock_read(mapping);
 }
-#endif /* CONFIG_FS_DAX */
 
 /*
  * Collect the processes who have the corrupted page mapped to kill.
@@ -893,6 +902,7 @@ static const char * const action_page_types[] = {
 	[MF_MSG_BUDDY]			= "free buddy page",
 	[MF_MSG_DAX]			= "dax page",
 	[MF_MSG_UNSPLIT_THP]		= "unsplit thp",
+	[MF_MSG_PFN_MAP]		= "non struct page pfn",
 	[MF_MSG_UNKNOWN]		= "unknown page",
 };
 
@@ -1324,7 +1334,8 @@ static int action_result(unsigned long pfn, enum mf_action_page_type type,
 
 	num_poisoned_pages_inc(pfn);
 
-	update_per_node_mf_stats(pfn, result);
+	if (type != MF_MSG_PFN_MAP)
+		update_per_node_mf_stats(pfn, result);
 
 	pr_err("%#lx: recovery action for %s: %s\n",
 		pfn, action_page_types[type], action_name[result]);
@@ -1805,7 +1816,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 
 		SetPageHWPoison(page);
 
-		collect_procs_fsdax(page, mapping, index, &to_kill);
+		collect_procs_pgoff(page, mapping, index, &to_kill);
 		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
 				index, mf_flags);
 unlock:
@@ -2144,6 +2155,83 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	return rc;
 }
 
+int register_pfn_address_space(struct pfn_address_space *pfn_space)
+{
+	if (!pfn_space)
+		return -EINVAL;
+
+	if (!request_mem_region(pfn_space->node.start << PAGE_SHIFT,
+		(pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT, ""))
+		return -EBUSY;
+
+	mutex_lock(&pfn_space_lock);
+	interval_tree_insert(&pfn_space->node, &pfn_space_itree);
+	mutex_unlock(&pfn_space_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(register_pfn_address_space);
+
+void unregister_pfn_address_space(struct pfn_address_space *pfn_space)
+{
+	if (!pfn_space)
+		return;
+
+	mutex_lock(&pfn_space_lock);
+	interval_tree_remove(&pfn_space->node, &pfn_space_itree);
+	mutex_unlock(&pfn_space_lock);
+	release_mem_region(pfn_space->node.start << PAGE_SHIFT,
+		(pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT);
+}
+EXPORT_SYMBOL_GPL(unregister_pfn_address_space);
+
+static int memory_failure_pfn(unsigned long pfn, int flags)
+{
+	struct interval_tree_node *node;
+	int res = MF_FAILED;
+	LIST_HEAD(tokill);
+
+	mutex_lock(&pfn_space_lock);
+	/*
+	 * Modules registers with MM the address space mapping to the device memory they
+	 * manage. Iterate to identify exactly which address space has mapped to this
+	 * failing PFN.
+	 */
+	for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
+	     node = interval_tree_iter_next(node, pfn, pfn)) {
+		struct pfn_address_space *pfn_space =
+			container_of(node, struct pfn_address_space, node);
+		/*
+		 * Modules managing the device memory need to be conveyed about the
+		 * memory failure so that the poisoned PFN can be tracked.
+		 */
+		if (pfn_space->ops)
+			pfn_space->ops->failure(pfn_space, pfn);
+
+		collect_procs_pgoff(NULL, pfn_space->mapping, pfn, &tokill);
+
+		unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT,
+				    PAGE_SIZE, 0);
+
+		res = MF_RECOVERED;
+	}
+	mutex_unlock(&pfn_space_lock);
+
+	if (res == MF_FAILED)
+		return action_result(pfn, MF_MSG_PFN_MAP, res);
+
+	/*
+	 * Unlike System-RAM there is no possibility to swap in a different
+	 * physical page at a given virtual address, so all userspace
+	 * consumption of direct PFN memory necessitates SIGBUS (i.e.
+	 * MF_MUST_KILL)
+	 */
+	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+	kill_procs(&tokill, true, false, pfn, flags);
+
+	return action_result(pfn, MF_MSG_PFN_MAP, MF_RECOVERED);
+}
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -2183,6 +2271,11 @@ int memory_failure(unsigned long pfn, int flags)
 	if (!(flags & MF_SW_SIMULATED))
 		hw_memory_failure = true;
 
+	if (!pfn_valid(pfn) && !arch_is_platform_page(PFN_PHYS(pfn))) {
+		res = memory_failure_pfn(pfn, flags);
+		goto unlock_mutex;
+	}
+
 	p = pfn_to_online_page(pfn);
 	if (!p) {
 		res = arch_memory_failure(pfn, flags);
-- 
2.17.1




* [PATCH v1 2/4] mm: Add poison error check in fixup_user_fault() for mapped pfn
  2023-09-20 14:02 [PATCH v1 0/4] mm: Implement ECC handling for pfn with no struct page ankita
  2023-09-20 14:02 ` [PATCH v1 1/4] mm: handle poisoning of pfn without struct pages ankita
@ 2023-09-20 14:02 ` ankita
  2023-09-20 14:02 ` [PATCH v1 3/4] mm: Change ghes code to allow poison of non-struct pfn ankita
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: ankita @ 2023-09-20 14:02 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, akpm, tony.luck, bp,
	naoya.horiguchi, linmiaohe
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, anuaggarwal,
	linux-kernel, linux-mm, linux-edac, kvm

From: Ankit Agrawal <ankita@nvidia.com>

fixup_user_fault() currently does not expect VM_FAULT_HWPOISON and hence
does not check for it when calling vm_fault_to_errno(). Since we now have a
new code path which can trigger such a case, change fixup_user_fault() to
look for VM_FAULT_HWPOISON.

Also make hva_to_pfn() check for the -EHWPOISON returned by
hva_to_pfn_remapped() and communicate the poison fault up to
user_mem_abort().
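
For context, vm_fault_to_errno() in include/linux/mm.h roughly looks like
the following, so passing FOLL_HWPOISON is what turns a VM_FAULT_HWPOISON
result into -EHWPOISON rather than -EFAULT:

static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
{
	if (vm_fault & VM_FAULT_OOM)
		return -ENOMEM;
	if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
		return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
	if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
		return -EFAULT;
	return 0;
}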

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 mm/gup.c            | 2 +-
 virt/kvm/kvm_main.c | 6 ++++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index 2f8a2d89fde1..fe469326dbe6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1414,7 +1414,7 @@ int fixup_user_fault(struct mm_struct *mm,
 	}
 
 	if (ret & VM_FAULT_ERROR) {
-		int err = vm_fault_to_errno(ret, 0);
+		int err = vm_fault_to_errno(ret, FOLL_HWPOISON);
 
 		if (err)
 			return err;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 486800a7024b..2ff067f21a7c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2731,6 +2731,12 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
 		r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
 		if (r == -EAGAIN)
 			goto retry;
+
+		if (r == -EHWPOISON) {
+			pfn = KVM_PFN_ERR_HWPOISON;
+			goto exit;
+		}
+
 		if (r < 0)
 			pfn = KVM_PFN_ERR_FAULT;
 	} else {
-- 
2.17.1




* [PATCH v1 3/4] mm: Change ghes code to allow poison of non-struct pfn
  2023-09-20 14:02 [PATCH v1 0/4] mm: Implement ECC handling for pfn with no struct page ankita
  2023-09-20 14:02 ` [PATCH v1 1/4] mm: handle poisoning of pfn without struct pages ankita
  2023-09-20 14:02 ` [PATCH v1 2/4] mm: Add poison error check in fixup_user_fault() for mapped pfn ankita
@ 2023-09-20 14:02 ` ankita
  2023-09-20 14:02 ` [PATCH v1 4/4] vfio/nvgpu: register device memory for poison handling ankita
  2023-09-20 16:02 ` [PATCH v1 0/4] mm: Implement ECC handling for pfn with no struct page Andrew Morton
  4 siblings, 0 replies; 13+ messages in thread
From: ankita @ 2023-09-20 14:02 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, akpm, tony.luck, bp,
	naoya.horiguchi, linmiaohe
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, anuaggarwal,
	linux-kernel, linux-mm, linux-edac, kvm

From: Ankit Agrawal <ankita@nvidia.com>

The GHES code currently calls memory_failure() only on PFNs that pass the
pfn_valid() check. This contract is broken for remapped PFNs, which fail
the check, so ghes_do_memory_failure() returns without triggering
memory_failure().

Update the code to allow the memory_failure() call on PFNs that fail
pfn_valid().
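
With patch 1 of this series applied, memory_failure() itself routes such
PFNs to the new handler, so the filtering here is no longer needed:

	if (!pfn_valid(pfn) && !arch_is_platform_page(PFN_PHYS(pfn))) {
		res = memory_failure_pfn(pfn, flags);
		goto unlock_mutex;
	}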

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/acpi/apei/ghes.c | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index ef59d6ea16da..6ad1e4cbc968 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -457,20 +457,10 @@ static void ghes_kick_task_work(struct callback_head *head)
 
 static bool ghes_do_memory_failure(u64 physical_addr, int flags)
 {
-	unsigned long pfn;
-
 	if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
 		return false;
 
-	pfn = PHYS_PFN(physical_addr);
-	if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) {
-		pr_warn_ratelimited(FW_WARN GHES_PFX
-		"Invalid address in generic error data: %#llx\n",
-		physical_addr);
-		return false;
-	}
-
-	memory_failure_queue(pfn, flags);
+	memory_failure_queue(PHYS_PFN(physical_addr), flags);
 	return true;
 }
 
-- 
2.17.1




* [PATCH v1 4/4] vfio/nvgpu: register device memory for poison handling
  2023-09-20 14:02 [PATCH v1 0/4] mm: Implement ECC handling for pfn with no struct page ankita
                   ` (2 preceding siblings ...)
  2023-09-20 14:02 ` [PATCH v1 3/4] mm: Change ghes code to allow poison of non-struct pfn ankita
@ 2023-09-20 14:02 ` ankita
  2023-09-26  5:36   ` kernel test robot
                     ` (2 more replies)
  2023-09-20 16:02 ` [PATCH v1 0/4] mm: Implement ECC handling for pfn with no struct page Andrew Morton
  4 siblings, 3 replies; 13+ messages in thread
From: ankita @ 2023-09-20 14:02 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, akpm, tony.luck, bp,
	naoya.horiguchi, linmiaohe
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, anuaggarwal,
	linux-kernel, linux-mm, linux-edac, kvm

From: Ankit Agrawal <ankita@nvidia.com>

The nvgrace-gpu-vfio-pci module [1] maps the device memory to the user VA
(QEMU) using remap_pfn_range() without adding the memory to the kernel.
The device memory pages are not backed by struct page. Patches 1-3
implement the mechanism to handle ECC/poison on memory pages without
struct page and expose a registration function. This new mechanism is
leveraged here.

The module registers its memory region with the kernel MM for ECC handling
using the register_pfn_address_space() API exposed by the kernel. It also
defines a failure callback function pfn_memory_failure() to receive the
poisoned PFN from the MM.

The module tracks poisoned PFNs as a bitmap with a bit per PFN. The PFN is
communicated by the kernel MM to the module through the failure function,
which sets the appropriate bit in the bitmap.

The module also defines VMA fault ops. The fault handler returns
VM_FAULT_HWPOISON if the bit for the PFN is set in the bitmap.

[1] https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/vfio/pci/nvgrace-gpu/main.c | 107 +++++++++++++++++++++++++++-
 drivers/vfio/vfio.h                 |  11 ---
 drivers/vfio/vfio_main.c            |   3 +-
 include/linux/vfio.h                |  15 ++++
 4 files changed, 123 insertions(+), 13 deletions(-)

diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index ba323f2d8ea1..1c89ce0cc1cc 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -6,6 +6,10 @@
 #include <linux/pci.h>
 #include <linux/vfio_pci_core.h>
 #include <linux/vfio.h>
+#ifdef CONFIG_MEMORY_FAILURE
+#include <linux/bitmap.h>
+#include <linux/memory-failure.h>
+#endif
 
 struct nvgrace_gpu_vfio_pci_core_device {
 	struct vfio_pci_core_device core_device;
@@ -13,8 +17,85 @@ struct nvgrace_gpu_vfio_pci_core_device {
 	size_t memlength;
 	void *memmap;
 	struct mutex memmap_lock;
+#ifdef CONFIG_MEMORY_FAILURE
+	struct pfn_address_space pfn_address_space;
+	unsigned long *pfn_bitmap;
+#endif
 };
 
+#ifdef CONFIG_MEMORY_FAILURE
+void nvgrace_gpu_vfio_pci_pfn_memory_failure(struct pfn_address_space *pfn_space,
+		unsigned long pfn)
+{
+	struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
+		pfn_space, struct nvgrace_gpu_vfio_pci_core_device, pfn_address_space);
+	unsigned long mem_offset = pfn - pfn_space->node.start;
+
+	if (mem_offset >= nvdev->memlength)
+		return;
+
+	/*
+	 * MM has called to notify a poisoned page. Track that in the bitmap.
+	 */
+	__set_bit(mem_offset, nvdev->pfn_bitmap);
+}
+
+struct pfn_address_space_ops nvgrace_gpu_vfio_pci_pas_ops = {
+	.failure = nvgrace_gpu_vfio_pci_pfn_memory_failure,
+};
+
+static int
+nvgrace_gpu_vfio_pci_register_pfn_range(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
+					struct vm_area_struct *vma)
+{
+	unsigned long nr_pages;
+	int ret = 0;
+
+	nr_pages = nvdev->memlength >> PAGE_SHIFT;
+
+	nvdev->pfn_address_space.node.start = vma->vm_pgoff;
+	nvdev->pfn_address_space.node.last = vma->vm_pgoff + nr_pages - 1;
+	nvdev->pfn_address_space.ops = &nvgrace_gpu_vfio_pci_pas_ops;
+	nvdev->pfn_address_space.mapping = vma->vm_file->f_mapping;
+
+	ret = register_pfn_address_space(&(nvdev->pfn_address_space));
+
+	return ret;
+}
+
+static vm_fault_t nvgrace_gpu_vfio_pci_fault(struct vm_fault *vmf)
+{
+	unsigned long mem_offset = vmf->pgoff - vmf->vma->vm_pgoff;
+	struct vfio_device *core_vdev;
+	struct nvgrace_gpu_vfio_pci_core_device *nvdev;
+
+	if (!(vmf->vma->vm_file))
+		goto error_exit;
+
+	core_vdev = vfio_device_from_file(vmf->vma->vm_file);
+
+	if (!core_vdev)
+		goto error_exit;
+
+	nvdev = container_of(core_vdev,
+			struct nvgrace_gpu_vfio_pci_core_device, core_device.vdev);
+
+	/*
+	 * Check if the page is poisoned.
+	 */
+	if (mem_offset < (nvdev->memlength >> PAGE_SHIFT) &&
+		test_bit(mem_offset, nvdev->pfn_bitmap))
+		return VM_FAULT_HWPOISON;
+
+error_exit:
+	return VM_FAULT_ERROR;
+}
+
+static const struct vm_operations_struct nvgrace_gpu_vfio_pci_mmap_ops = {
+	.fault = nvgrace_gpu_vfio_pci_fault,
+};
+#endif
+
 static int nvgrace_gpu_vfio_pci_open_device(struct vfio_device *core_vdev)
 {
 	struct vfio_pci_core_device *vdev =
@@ -46,6 +127,9 @@ static void nvgrace_gpu_vfio_pci_close_device(struct vfio_device *core_vdev)
 
 	mutex_destroy(&nvdev->memmap_lock);
 
+#ifdef CONFIG_MEMORY_FAILURE
+	unregister_pfn_address_space(&(nvdev->pfn_address_space));
+#endif
 	vfio_pci_core_close_device(core_vdev);
 }
 
@@ -104,8 +188,12 @@ static int nvgrace_gpu_vfio_pci_mmap(struct vfio_device *core_vdev,
 		return ret;
 
 	vma->vm_pgoff = start_pfn;
+#ifdef CONFIG_MEMORY_FAILURE
+	vma->vm_ops = &nvgrace_gpu_vfio_pci_mmap_ops;
 
-	return 0;
+	ret = nvgrace_gpu_vfio_pci_register_pfn_range(nvdev, vma);
+#endif
+	return ret;
 }
 
 static long
@@ -406,6 +494,19 @@ nvgrace_gpu_vfio_pci_fetch_memory_property(struct pci_dev *pdev,
 
 	nvdev->memlength = memlength;
 
+#ifdef CONFIG_MEMORY_FAILURE
+	/*
+	 * A bitmap is maintained to track the pages that are poisoned. Each
+	 * page is represented by a bit. Allocation size in bytes is
+	 * determined by shifting the device memory size by PAGE_SHIFT to
+	 * determine the number of pages; and further shifted by 3 as each
+	 * byte could track 8 pages.
+	 */
+	nvdev->pfn_bitmap
+		= vzalloc((nvdev->memlength >> PAGE_SHIFT)/BITS_PER_TYPE(char));
+	if (!nvdev->pfn_bitmap)
+		ret = -ENOMEM;
+#endif
 	return ret;
 }
 
@@ -442,6 +543,10 @@ static void nvgrace_gpu_vfio_pci_remove(struct pci_dev *pdev)
 	struct nvgrace_gpu_vfio_pci_core_device *nvdev = nvgrace_gpu_drvdata(pdev);
 	struct vfio_pci_core_device *vdev = &nvdev->core_device;
 
+#ifdef CONFIG_MEMORY_FAILURE
+	vfree(nvdev->pfn_bitmap);
+#endif
+
 	vfio_pci_core_unregister_device(vdev);
 	vfio_put_device(&vdev->vdev);
 }
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 307e3f29b527..747094503909 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -16,17 +16,6 @@ struct iommufd_ctx;
 struct iommu_group;
 struct vfio_container;
 
-struct vfio_device_file {
-	struct vfio_device *device;
-	struct vfio_group *group;
-
-	u8 access_granted;
-	u32 devid; /* only valid when iommufd is valid */
-	spinlock_t kvm_ref_lock; /* protect kvm field */
-	struct kvm *kvm;
-	struct iommufd_ctx *iommufd; /* protected by struct vfio_device_set::lock */
-};
-
 void vfio_device_put_registration(struct vfio_device *device);
 bool vfio_device_try_get_registration(struct vfio_device *device);
 int vfio_df_open(struct vfio_device_file *df);
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 40732e8ed4c6..a7dafd7c64a6 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1309,7 +1309,7 @@ const struct file_operations vfio_device_fops = {
 	.mmap		= vfio_device_fops_mmap,
 };
 
-static struct vfio_device *vfio_device_from_file(struct file *file)
+struct vfio_device *vfio_device_from_file(struct file *file)
 {
 	struct vfio_device_file *df = file->private_data;
 
@@ -1317,6 +1317,7 @@ static struct vfio_device *vfio_device_from_file(struct file *file)
 		return NULL;
 	return df->device;
 }
+EXPORT_SYMBOL_GPL(vfio_device_from_file);
 
 /**
  * vfio_file_is_valid - True if the file is valid vfio file
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 454e9295970c..d88af251e931 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -361,4 +361,19 @@ int vfio_virqfd_enable(void *opaque, int (*handler)(void *, void *),
 		       struct virqfd **pvirqfd, int fd);
 void vfio_virqfd_disable(struct virqfd **pvirqfd);
 
+/*
+ * VFIO device file.
+ */
+struct vfio_device_file {
+	struct vfio_device *device;
+	struct vfio_group *group;
+	u8 access_granted;
+	u32 devid; /* only valid when iommufd is valid */
+	spinlock_t kvm_ref_lock; /* protect kvm field */
+	struct kvm *kvm;
+	struct iommufd_ctx *iommufd; /* protected by struct vfio_device_set::lock */
+};
+
+struct vfio_device *vfio_device_from_file(struct file *file);
+
 #endif /* VFIO_H */
-- 
2.17.1




* Re: [PATCH v1 0/4] mm: Implement ECC handling for pfn with no struct page
  2023-09-20 14:02 [PATCH v1 0/4] mm: Implement ECC handling for pfn with no struct page ankita
                   ` (3 preceding siblings ...)
  2023-09-20 14:02 ` [PATCH v1 4/4] vfio/nvgpu: register device memory for poison handling ankita
@ 2023-09-20 16:02 ` Andrew Morton
  2023-09-20 16:04   ` Jason Gunthorpe
  4 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2023-09-20 16:02 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, tony.luck, bp, naoya.horiguchi, linmiaohe,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, anuaggarwal,
	linux-kernel, linux-mm, linux-edac, kvm

On Wed, 20 Sep 2023 19:32:06 +0530 <ankita@nvidia.com> wrote:

> The kernel MM currently handles ECC errors / poison only on memory page
> backed by struct page. As part of [1], the nvgrace-gpu-vfio-pci module
> maps the device memory to user VA (Qemu) using remap_pfn_range without
> being added to the kernel. These pages are not backed by struct page.

Are you able to identify any other drivers which can (or will) use
this?  Or is it likely that this feature will only ever be for
nvgrace-gpu-vfio-pci?



* Re: [PATCH v1 0/4] mm: Implement ECC handling for pfn with no struct page
  2023-09-20 16:02 ` [PATCH v1 0/4] mm: Implement ECC handling for pfn with no struct page Andrew Morton
@ 2023-09-20 16:04   ` Jason Gunthorpe
  0 siblings, 0 replies; 13+ messages in thread
From: Jason Gunthorpe @ 2023-09-20 16:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: ankita, alex.williamson, tony.luck, bp, naoya.horiguchi,
	linmiaohe, aniketa, cjia, kwankhede, targupta, vsethi, acurrid,
	anuaggarwal, linux-kernel, linux-mm, linux-edac, kvm

On Wed, Sep 20, 2023 at 09:02:22AM -0700, Andrew Morton wrote:
> On Wed, 20 Sep 2023 19:32:06 +0530 <ankita@nvidia.com> wrote:
> 
> > The kernel MM currently handles ECC errors / poison only on memory page
> > backed by struct page. As part of [1], the nvgrace-gpu-vfio-pci module
> > maps the device memory to user VA (Qemu) using remap_pfn_range without
> > being added to the kernel. These pages are not backed by struct page.
> 
> Are you able to identify any other drivers which can (or will) use
> this?  Or is it likely that this feature will only ever be for
> nvgrace-gpu-vfio-pci?

I think a future vfio-cxl will have a similar desire at least.

Jason



* Re: [PATCH v1 1/4] mm: handle poisoning of pfn without struct pages
  2023-09-20 14:02 ` [PATCH v1 1/4] mm: handle poisoning of pfn without struct pages ankita
@ 2023-09-23  3:20   ` Miaohe Lin
  2023-09-25 12:36     ` Jason Gunthorpe
  2023-09-26  7:23   ` Naoya Horiguchi
  1 sibling, 1 reply; 13+ messages in thread
From: Miaohe Lin @ 2023-09-23  3:20 UTC (permalink / raw)
  To: ankita
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, anuaggarwal,
	linux-kernel, linux-mm, linux-edac, kvm, jgg, alex.williamson,
	akpm, tony.luck, bp, naoya.horiguchi

On 2023/9/20 22:02, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> The kernel MM currently does not handle ECC errors / poison on a memory
> region that is not backed by struct pages. If a memory region is mapped
> using remap_pfn_range(), but not added to the kernel, MM will not have
> associated struct pages. Add a new mechanism to handle memory failure
> on such memory.
> 
> Make kernel MM expose a function to allow modules managing the device
> memory to register a failure function and the physical address space
> associated with the device memory. MM maintains this information as
> interval tree. The registered memory failure function is used by MM to
> notify the kernel module managing the PFN, so that the module may take
> any required action. The module for example may use the information
> to track the poisoned pages.
> 
> In this implementation, kernel MM follows the following sequence similar
> (mostly) to the memory_failure() handler for struct page backed memory:
> 1. memory_failure() is triggered on reception of a poison error. An
> absence of struct page is detected and consequently memory_failure_pfn()
> is executed.
> 2. memory_failure_pfn() call the newly introduced failure handler exposed
> by the module managing the poisoned memory to notify it of the problematic
> PFN.
> 3. memory_failure_pfn() unmaps the stage-2 mapping to the PFN.
> 4. memory_failure_pfn() collects the processes mapped to the PFN.
> 5. memory_failure_pfn() sends SIGBUS (BUS_MCEERR_AO) to all the processes
> mapping the faulty PFN using kill_procs().
> 6. An access to the faulty PFN by an operation in VM at a later point
> is trapped and user_mem_abort() is called.
> 7. The vma ops fault function gets called due to the absence of Stage-2
> mapping. It is expected to return VM_FAULT_HWPOISON on the PFN.
> 8. __gfn_to_pfn_memslot() then returns KVM_PFN_ERR_HWPOISON, which cause
> the poison with SIGBUS (BUS_MCEERR_AR) to be sent to the QEMU process
> through kvm_send_hwpoison_signal().
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

Thanks for your patch.

<snip>

>  /*
>   * Return values:
>   *   1:   the page is dissolved (if needed) and taken off from buddy,
> @@ -422,15 +428,15 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
>   * Schedule a process for later kill.
>   * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
>   *
> - * Note: @fsdax_pgoff is used only when @p is a fsdax page and a
> - * filesystem with a memory failure handler has claimed the
> - * memory_failure event. In all other cases, page->index and
> - * page->mapping are sufficient for mapping the page back to its
> + * Notice: @pgoff is used either when @p is a fsdax page or a PFN is not
> + * backed by struct page and a filesystem with a memory failure handler
> + * has claimed the memory_failure event. In all other cases, page->index
> + * and page->mapping are sufficient for mapping the page back to its
>   * corresponding user virtual address.
>   */
>  static void __add_to_kill(struct task_struct *tsk, struct page *p,
>  			  struct vm_area_struct *vma, struct list_head *to_kill,
> -			  unsigned long ksm_addr, pgoff_t fsdax_pgoff)
> +			  unsigned long ksm_addr, pgoff_t pgoff)
>  {
>  	struct to_kill *tk;
>  
> @@ -440,13 +446,18 @@ static void __add_to_kill(struct task_struct *tsk, struct page *p,
>  		return;
>  	}
>  
> -	tk->addr = ksm_addr ? ksm_addr : page_address_in_vma(p, vma);
> -	if (is_zone_device_page(p)) {
> -		if (fsdax_pgoff != FSDAX_INVALID_PGOFF)
> -			tk->addr = vma_pgoff_address(fsdax_pgoff, 1, vma);
> -		tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
> -	} else
> -		tk->size_shift = page_shift(compound_head(p));
> +	if (vma->vm_flags | PFN_MAP) {

if (vma->vm_flags | PFN_MAP)? So this branch is always selected?

> +		tk->addr = vma_pgoff_address(pgoff, 1, vma);
> +		tk->size_shift = PAGE_SHIFT;
> +	} else {
> +		tk->addr = ksm_addr ? ksm_addr : page_address_in_vma(p, vma);
> +		if (is_zone_device_page(p)) {
> +			if (pgoff != FSDAX_INVALID_PGOFF)
> +				tk->addr = vma_pgoff_address(pgoff, 1, vma);
> +			tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
> +		} else
> +			tk->size_shift = page_shift(compound_head(p));
> +	}
>  

IIUC, the page passed to __add_to_kill is NULL in this case. So when tk->addr == -EFAULT, we will have a problem
doing the page_to_pfn(p) in the following pr_info:

	if (tk->addr == -EFAULT) {
		pr_info("Unable to find user space address %lx in %s\n",
			page_to_pfn(p), tsk->comm);

>  	/*
>  	 * Send SIGKILL if "tk->addr == -EFAULT". Also, as
> @@ -666,8 +677,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>  	i_mmap_unlock_read(mapping);
>  }
>  

<snip>

>  /**
>   * memory_failure - Handle memory failure of a page.
>   * @pfn: Page Number of the corrupted page
> @@ -2183,6 +2271,11 @@ int memory_failure(unsigned long pfn, int flags)
>  	if (!(flags & MF_SW_SIMULATED))
>  		hw_memory_failure = true;
>  
> +	if (!pfn_valid(pfn) && !arch_is_platform_page(PFN_PHYS(pfn))) {

Could it be better to add a helper here to detect the pfns without struct page?

> +		res = memory_failure_pfn(pfn, flags);
> +		goto unlock_mutex;
> +	}
> +
>  	p = pfn_to_online_page(pfn);
>  	if (!p) {
>  		res = arch_memory_failure(pfn, flags);
> 

Thanks.



* Re: [PATCH v1 1/4] mm: handle poisoning of pfn without struct pages
  2023-09-23  3:20   ` Miaohe Lin
@ 2023-09-25 12:36     ` Jason Gunthorpe
  0 siblings, 0 replies; 13+ messages in thread
From: Jason Gunthorpe @ 2023-09-25 12:36 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: ankita, aniketa, cjia, kwankhede, targupta, vsethi, acurrid,
	anuaggarwal, linux-kernel, linux-mm, linux-edac, kvm,
	alex.williamson, akpm, tony.luck, bp, naoya.horiguchi

On Sat, Sep 23, 2023 at 11:20:19AM +0800, Miaohe Lin wrote:

> >  /**
> >   * memory_failure - Handle memory failure of a page.
> >   * @pfn: Page Number of the corrupted page
> > @@ -2183,6 +2271,11 @@ int memory_failure(unsigned long pfn, int flags)
> >  	if (!(flags & MF_SW_SIMULATED))
> >  		hw_memory_failure = true;
> >  
> > +	if (!pfn_valid(pfn) && !arch_is_platform_page(PFN_PHYS(pfn))) {
> 
> Could it be better to add a helper here to detect the pfns without
> struct page?

pfn_valid is supposed to do that.

This arch_is_platform_page stuff is actually detecting Intel SGX
memory and routing it to arch_memory_failure().

It would have been more accurately named
arch_is_arch_memory_failure_pfn() or something.

Actually that SGX stuff could probably be changed over to use the
interval tree of this series. Modify sgx_setup_epc_section() to
register tree nodes per-section and remove all this arch stuff
entirely.

Jason



* Re: [PATCH v1 4/4] vfio/nvgpu: register device memory for poison handling
  2023-09-20 14:02 ` [PATCH v1 4/4] vfio/nvgpu: register device memory for poison handling ankita
@ 2023-09-26  5:36   ` kernel test robot
  2023-09-26  7:38   ` Naoya Horiguchi
  2023-09-28 19:45   ` Alex Williamson
  2 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2023-09-26  5:36 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, akpm, tony.luck, bp,
	naoya.horiguchi, linmiaohe
  Cc: llvm, oe-kbuild-all, aniketa, cjia, kwankhede, targupta, vsethi,
	acurrid, anuaggarwal, linux-kernel, linux-mm, linux-edac, kvm

Hi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on awilliam-vfio/for-linus]
[also build test WARNING on kvm/queue rafael-pm/linux-next linus/master]
[cannot apply to akpm-mm/mm-everything awilliam-vfio/next kvm/linux-next next-20230925]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/ankita-nvidia-com/mm-handle-poisoning-of-pfn-without-struct-pages/20230920-220626
base:   https://github.com/awilliam/linux-vfio.git for-linus
patch link:    https://lore.kernel.org/r/20230920140210.12663-5-ankita%40nvidia.com
patch subject: [PATCH v1 4/4] vfio/nvgpu: register device memory for poison handling
config: powerpc64-allmodconfig (https://download.01.org/0day-ci/archive/20230925/202309252319.hQ7rHJTJ-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230925/202309252319.hQ7rHJTJ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <yujie.liu@intel.com>
| Closes: https://lore.kernel.org/r/202309252319.hQ7rHJTJ-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/vfio/pci/nvgrace-gpu/main.c:27:6: warning: no previous prototype for function 'nvgrace_gpu_vfio_pci_pfn_memory_failure' [-Wmissing-prototypes]
      27 | void nvgrace_gpu_vfio_pci_pfn_memory_failure(struct pfn_address_space *pfn_space,
         |      ^
   drivers/vfio/pci/nvgrace-gpu/main.c:27:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
      27 | void nvgrace_gpu_vfio_pci_pfn_memory_failure(struct pfn_address_space *pfn_space,
         | ^
         | static 
   drivers/vfio/pci/nvgrace-gpu/main.c:300:9: warning: no previous prototype for function 'nvgrace_gpu_read_mem' [-Wmissing-prototypes]
     300 | ssize_t nvgrace_gpu_read_mem(void __user *buf, size_t count, loff_t *ppos,
         |         ^
   drivers/vfio/pci/nvgrace-gpu/main.c:300:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
     300 | ssize_t nvgrace_gpu_read_mem(void __user *buf, size_t count, loff_t *ppos,
         | ^
         | static 
   drivers/vfio/pci/nvgrace-gpu/main.c:376:9: warning: no previous prototype for function 'nvgrace_gpu_write_mem' [-Wmissing-prototypes]
     376 | ssize_t nvgrace_gpu_write_mem(size_t count, loff_t *ppos, const void __user *buf,
         |         ^
   drivers/vfio/pci/nvgrace-gpu/main.c:376:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
     376 | ssize_t nvgrace_gpu_write_mem(size_t count, loff_t *ppos, const void __user *buf,
         | ^
         | static 
   3 warnings generated.


vim +/nvgrace_gpu_vfio_pci_pfn_memory_failure +27 drivers/vfio/pci/nvgrace-gpu/main.c

b59e9d949a79e1 Ankit Agrawal 2023-09-14  25  
5f3746d8629350 Ankit Agrawal 2023-09-20  26  #ifdef CONFIG_MEMORY_FAILURE
5f3746d8629350 Ankit Agrawal 2023-09-20 @27  void nvgrace_gpu_vfio_pci_pfn_memory_failure(struct pfn_address_space *pfn_space,
5f3746d8629350 Ankit Agrawal 2023-09-20  28  		unsigned long pfn)
5f3746d8629350 Ankit Agrawal 2023-09-20  29  {
5f3746d8629350 Ankit Agrawal 2023-09-20  30  	struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
5f3746d8629350 Ankit Agrawal 2023-09-20  31  		pfn_space, struct nvgrace_gpu_vfio_pci_core_device, pfn_address_space);
5f3746d8629350 Ankit Agrawal 2023-09-20  32  	unsigned long mem_offset = pfn - pfn_space->node.start;
5f3746d8629350 Ankit Agrawal 2023-09-20  33  
5f3746d8629350 Ankit Agrawal 2023-09-20  34  	if (mem_offset >= nvdev->memlength)
5f3746d8629350 Ankit Agrawal 2023-09-20  35  		return;
5f3746d8629350 Ankit Agrawal 2023-09-20  36  
5f3746d8629350 Ankit Agrawal 2023-09-20  37  	/*
5f3746d8629350 Ankit Agrawal 2023-09-20  38  	 * MM has called to notify a poisoned page. Track that in the bitmap.
5f3746d8629350 Ankit Agrawal 2023-09-20  39  	 */
5f3746d8629350 Ankit Agrawal 2023-09-20  40  	__set_bit(mem_offset, nvdev->pfn_bitmap);
5f3746d8629350 Ankit Agrawal 2023-09-20  41  }
5f3746d8629350 Ankit Agrawal 2023-09-20  42  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki




* Re: [PATCH v1 1/4] mm: handle poisoning of pfn without struct pages
  2023-09-20 14:02 ` [PATCH v1 1/4] mm: handle poisoning of pfn without struct pages ankita
  2023-09-23  3:20   ` Miaohe Lin
@ 2023-09-26  7:23   ` Naoya Horiguchi
  1 sibling, 0 replies; 13+ messages in thread
From: Naoya Horiguchi @ 2023-09-26  7:23 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, akpm, tony.luck, bp, naoya.horiguchi,
	linmiaohe, aniketa, cjia, kwankhede, targupta, vsethi, acurrid,
	anuaggarwal, linux-kernel, linux-mm, linux-edac, kvm

On Wed, Sep 20, 2023 at 07:32:07PM +0530, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> The kernel MM currently does not handle ECC errors / poison on a memory
> region that is not backed by struct pages. If a memory region is mapped
> using remap_pfn_range(), but not added to the kernel, MM will not have
> associated struct pages. Add a new mechanism to handle memory failure
> on such memory.
> 
> Make kernel MM expose a function to allow modules managing the device
> memory to register a failure function and the physical address space
> associated with the device memory. MM maintains this information as
> interval tree. The registered memory failure function is used by MM to
> notify the kernel module managing the PFN, so that the module may take
> any required action. The module for example may use the information
> to track the poisoned pages.
> 
> In this implementation, kernel MM follows the following sequence similar
> (mostly) to the memory_failure() handler for struct page backed memory:
> 1. memory_failure() is triggered on reception of a poison error. An
> absence of struct page is detected and consequently memory_failure_pfn()
> is executed.
> 2. memory_failure_pfn() call the newly introduced failure handler exposed
> by the module managing the poisoned memory to notify it of the problematic
> PFN.
> 3. memory_failure_pfn() unmaps the stage-2 mapping to the PFN.
> 4. memory_failure_pfn() collects the processes mapped to the PFN.
> 5. memory_failure_pfn() sends SIGBUS (BUS_MCEERR_AO) to all the processes
> mapping the faulty PFN using kill_procs().
> 6. An access to the faulty PFN by an operation in VM at a later point
> is trapped and user_mem_abort() is called.
> 7. The vma ops fault function gets called due to the absence of Stage-2
> mapping. It is expected to return VM_FAULT_HWPOISON on the PFN.
> 8. __gfn_to_pfn_memslot() then returns KVM_PFN_ERR_HWPOISON, which cause
> the poison with SIGBUS (BUS_MCEERR_AR) to be sent to the QEMU process
> through kvm_send_hwpoison_signal().
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

Thanks for the patches.

A few comments below ...

...

> @@ -422,15 +428,15 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
>   * Schedule a process for later kill.
>   * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
>   *
> - * Note: @fsdax_pgoff is used only when @p is a fsdax page and a
> - * filesystem with a memory failure handler has claimed the
> - * memory_failure event. In all other cases, page->index and
> - * page->mapping are sufficient for mapping the page back to its
> + * Notice: @pgoff is used either when @p is a fsdax page or a PFN is not
> + * backed by struct page and a filesystem with a memory failure handler
> + * has claimed the memory_failure event.

This sentence is unclear because the latter part ("a filesystem with ...")
is not true for pfns not backed by struct page.  Could you separate this
notice into two (one for the fsdax case and one for the "non struct page" case)?

> In all other cases, page->index
> + * and page->mapping are sufficient for mapping the page back to its
>   * corresponding user virtual address.
>   */
>  static void __add_to_kill(struct task_struct *tsk, struct page *p,
>  			  struct vm_area_struct *vma, struct list_head *to_kill,
> -			  unsigned long ksm_addr, pgoff_t fsdax_pgoff)
> +			  unsigned long ksm_addr, pgoff_t pgoff)
>  {
>  	struct to_kill *tk;
>  

...

> @@ -677,9 +687,9 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>  /*
>   * Collect processes when the error hit a fsdax page.

Maybe you need to update the comment so it is not restricted to fsdax pages?

>   */
> -static void collect_procs_fsdax(struct page *page,
> -		struct address_space *mapping, pgoff_t pgoff,
> -		struct list_head *to_kill)
> +static void collect_procs_pgoff(struct page *page,
> +				struct address_space *mapping, pgoff_t pgoff,
> +				struct list_head *to_kill)
>  {
>  	struct vm_area_struct *vma;
>  	struct task_struct *tsk;

...

> @@ -2144,6 +2155,83 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  	return rc;
>  }
>  
> +int register_pfn_address_space(struct pfn_address_space *pfn_space)
> +{
> +	if (!pfn_space)
> +		return -EINVAL;
> +
> +	if (!request_mem_region(pfn_space->node.start << PAGE_SHIFT,
> +		(pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT, ""))
> +		return -EBUSY;
> +
> +	mutex_lock(&pfn_space_lock);
> +	interval_tree_insert(&pfn_space->node, &pfn_space_itree);
> +	mutex_unlock(&pfn_space_lock);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(register_pfn_address_space);
> +
> +void unregister_pfn_address_space(struct pfn_address_space *pfn_space)
> +{
> +	if (!pfn_space)
> +		return;
> +
> +	mutex_lock(&pfn_space_lock);
> +	interval_tree_remove(&pfn_space->node, &pfn_space_itree);
> +	mutex_unlock(&pfn_space_lock);
> +	release_mem_region(pfn_space->node.start << PAGE_SHIFT,
> +		(pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT);
> +}
> +EXPORT_SYMBOL_GPL(unregister_pfn_address_space);
> +
> +static int memory_failure_pfn(unsigned long pfn, int flags)
> +{
> +	struct interval_tree_node *node;
> +	int res = MF_FAILED;
> +	LIST_HEAD(tokill);
> +
> +	mutex_lock(&pfn_space_lock);
> +	/*
> +	 * Modules registers with MM the address space mapping to the device memory they
> +	 * manage. Iterate to identify exactly which address space has mapped to this
> +	 * failing PFN.
> +	 */
> +	for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
> +	     node = interval_tree_iter_next(node, pfn, pfn)) {
> +		struct pfn_address_space *pfn_space =
> +			container_of(node, struct pfn_address_space, node);
> +		/*
> +		 * Modules managing the device memory need to be conveyed about the
> +		 * memory failure so that the poisoned PFN can be tracked.
> +		 */
> +		if (pfn_space->ops)
> +			pfn_space->ops->failure(pfn_space, pfn);
> +
> +		collect_procs_pgoff(NULL, pfn_space->mapping, pfn, &tokill);
> +
> +		unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT,
> +				    PAGE_SIZE, 0);
> +
> +		res = MF_RECOVERED;
> +	}
> +	mutex_unlock(&pfn_space_lock);
> +
> +	if (res == MF_FAILED)
> +		return action_result(pfn, MF_MSG_PFN_MAP, res);
> +
> +	/*
> +	 * Unlike System-RAM there is no possibility to swap in a different
> +	 * physical page at a given virtual address, so all userspace
> +	 * consumption of direct PFN memory necessitates SIGBUS (i.e.
> +	 * MF_MUST_KILL)
> +	 */
> +	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> +	kill_procs(&tokill, true, false, pfn, flags);
> +
> +	return action_result(pfn, MF_MSG_PFN_MAP, MF_RECOVERED);
> +}
> +

It might not be a major issue, but the new code above seems to be used
only when CONFIG_NVGRACE_GPU_VFIO_PCI is enabled, so putting it in an
#ifdef block might help save binary size when nvgrace-gpu-vfio-pci is not built.

Thanks,
Naoya Horiguchi

>  /**
>   * memory_failure - Handle memory failure of a page.
>   * @pfn: Page Number of the corrupted page
> @@ -2183,6 +2271,11 @@ int memory_failure(unsigned long pfn, int flags)
>  	if (!(flags & MF_SW_SIMULATED))
>  		hw_memory_failure = true;
>  
> +	if (!pfn_valid(pfn) && !arch_is_platform_page(PFN_PHYS(pfn))) {
> +		res = memory_failure_pfn(pfn, flags);
> +		goto unlock_mutex;
> +	}
> +
>  	p = pfn_to_online_page(pfn);
>  	if (!p) {
>  		res = arch_memory_failure(pfn, flags);
> -- 
> 2.17.1
> 
> 
> 



* Re: [PATCH v1 4/4] vfio/nvgpu: register device memory for poison handling
  2023-09-20 14:02 ` [PATCH v1 4/4] vfio/nvgpu: register device memory for poison handling ankita
  2023-09-26  5:36   ` kernel test robot
@ 2023-09-26  7:38   ` Naoya Horiguchi
  2023-09-28 19:45   ` Alex Williamson
  2 siblings, 0 replies; 13+ messages in thread
From: Naoya Horiguchi @ 2023-09-26  7:38 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, akpm, tony.luck, bp, naoya.horiguchi,
	linmiaohe, aniketa, cjia, kwankhede, targupta, vsethi, acurrid,
	anuaggarwal, linux-kernel, linux-mm, linux-edac, kvm

On Wed, Sep 20, 2023 at 07:32:10PM +0530, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> The nvgrace-gpu-vfio-pci module [1] maps the device memory to the user VA
> (Qemu) using remap_pfn_range() without adding the memory to the kernel.
> The device memory pages are not backed by struct page. Patches 1-3
> implements the mechanism to handle ECC/poison on memory page without
> struct page and expose a registration function. This new mechanism is
> leveraged here.
>  
> The module registers its memory region with the kernel MM for ECC handling
> using the register_pfn_address_space() registration API exposed by the
> kernel. It also defines a failure callback function pfn_memory_failure()
> to get the poisoned PFN from the MM.
>  
> The module track poisoned PFN as a bitmap with a bit per PFN. The PFN is
> communicated by the kernel MM to the module through the failure function,
> which sets the appropriate bit in the bitmap.
>  
> The module also defines a VMA fault ops for the module. It returns
> VM_FAULT_HWPOISON in case the bit for the PFN is set in the bitmap.
> 
> [1] https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---

...

> @@ -406,6 +494,19 @@ nvgrace_gpu_vfio_pci_fetch_memory_property(struct pci_dev *pdev,
>  
>  	nvdev->memlength = memlength;
>  
> +#ifdef CONFIG_MEMORY_FAILURE
> +	/*
> +	 * A bitmap is maintained to track the pages that are poisoned. Each
> +	 * page is represented by a bit. Allocation size in bytes is
> +	 * determined by shifting the device memory size by PAGE_SHIFT to
> +	 * determine the number of pages; and further shifted by 3 as each
> +	 * byte could track 8 pages.
> +	 */
> +	nvdev->pfn_bitmap
> +		= vzalloc((nvdev->memlength >> PAGE_SHIFT)/BITS_PER_TYPE(char));
> +	if (!nvdev->pfn_bitmap)
> +		ret = -ENOMEM;
> +#endif
>  	return ret;
>  }
>  

I assume that memory failure is a relatively rare event (otherwise the device
is simply broken and it's better to stop using it), so the bitmap is mostly
full of zeros.
I think the size of device memory is on the order of 100GB, so the bitmap
size is about 3.2MB, which might not be too large on modern systems, but
could another data structure with a smaller memory footprint, such as a
hash table, be more beneficial?
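
(For reference, the arithmetic behind that estimate: 100 GiB of device memory
is ~26.2 million 4 KiB pages, and at one bit per page the bitmap takes
26,214,400 / 8 = 3,276,800 bytes, i.e. roughly 3.2MB.)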

Thanks,
Naoya Horiguchi



* Re: [PATCH v1 4/4] vfio/nvgpu: register device memory for poison handling
  2023-09-20 14:02 ` [PATCH v1 4/4] vfio/nvgpu: register device memory for poison handling ankita
  2023-09-26  5:36   ` kernel test robot
  2023-09-26  7:38   ` Naoya Horiguchi
@ 2023-09-28 19:45   ` Alex Williamson
  2 siblings, 0 replies; 13+ messages in thread
From: Alex Williamson @ 2023-09-28 19:45 UTC (permalink / raw)
  To: ankita
  Cc: jgg, akpm, tony.luck, bp, naoya.horiguchi, linmiaohe, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, anuaggarwal,
	linux-kernel, linux-mm, linux-edac, kvm

On Wed, 20 Sep 2023 19:32:10 +0530
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> The nvgrace-gpu-vfio-pci module [1] maps the device memory to the user VA
> (Qemu) using remap_pfn_range() without adding the memory to the kernel.
> The device memory pages are not backed by struct page. Patches 1-3
> implements the mechanism to handle ECC/poison on memory page without
> struct page and expose a registration function. This new mechanism is
> leveraged here.
>  
> The module registers its memory region with the kernel MM for ECC handling
> using the register_pfn_address_space() registration API exposed by the
> kernel. It also defines a failure callback function pfn_memory_failure()
> to get the poisoned PFN from the MM.
>  
> The module tracks poisoned PFNs in a bitmap with one bit per PFN. The PFN is
> communicated by the kernel MM to the module through the failure callback,
> which sets the corresponding bit in the bitmap.
>  
> The module also defines VMA fault ops. The fault handler returns
> VM_FAULT_HWPOISON if the bit for the faulting PFN is set in the bitmap.
> 
> [1] https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  drivers/vfio/pci/nvgrace-gpu/main.c | 107 +++++++++++++++++++++++++++-
>  drivers/vfio/vfio.h                 |  11 ---
>  drivers/vfio/vfio_main.c            |   3 +-
>  include/linux/vfio.h                |  15 ++++
>  4 files changed, 123 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index ba323f2d8ea1..1c89ce0cc1cc 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -6,6 +6,10 @@
>  #include <linux/pci.h>
>  #include <linux/vfio_pci_core.h>
>  #include <linux/vfio.h>
> +#ifdef CONFIG_MEMORY_FAILURE
> +#include <linux/bitmap.h>
> +#include <linux/memory-failure.h>
> +#endif
>  
>  struct nvgrace_gpu_vfio_pci_core_device {
>  	struct vfio_pci_core_device core_device;
> @@ -13,8 +17,85 @@ struct nvgrace_gpu_vfio_pci_core_device {
>  	size_t memlength;
>  	void *memmap;
>  	struct mutex memmap_lock;
> +#ifdef CONFIG_MEMORY_FAILURE
> +	struct pfn_address_space pfn_address_space;
> +	unsigned long *pfn_bitmap;
> +#endif
>  };
>  
> +#ifdef CONFIG_MEMORY_FAILURE
> +void nvgrace_gpu_vfio_pci_pfn_memory_failure(struct pfn_address_space *pfn_space,
> +		unsigned long pfn)
> +{
> +	struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> +		pfn_space, struct nvgrace_gpu_vfio_pci_core_device, pfn_address_space);
> +	unsigned long mem_offset = pfn - pfn_space->node.start;
> +
> +	if (mem_offset >= nvdev->memlength)
> +		return;
> +
> +	/*
> +	 * MM has called to notify a poisoned page. Track that in the bitmap.
> +	 */
> +	__set_bit(mem_offset, nvdev->pfn_bitmap);
> +}
> +
> +struct pfn_address_space_ops nvgrace_gpu_vfio_pci_pas_ops = {
> +	.failure = nvgrace_gpu_vfio_pci_pfn_memory_failure,
> +};
> +
> +static int
> +nvgrace_gpu_vfio_pci_register_pfn_range(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
> +					struct vm_area_struct *vma)
> +{
> +	unsigned long nr_pages;
> +	int ret = 0;
> +
> +	nr_pages = nvdev->memlength >> PAGE_SHIFT;
> +
> +	nvdev->pfn_address_space.node.start = vma->vm_pgoff;
> +	nvdev->pfn_address_space.node.last = vma->vm_pgoff + nr_pages - 1;
> +	nvdev->pfn_address_space.ops = &nvgrace_gpu_vfio_pci_pas_ops;
> +	nvdev->pfn_address_space.mapping = vma->vm_file->f_mapping;
> +
> +	ret = register_pfn_address_space(&(nvdev->pfn_address_space));
> +
> +	return ret;
> +}
> +
> +static vm_fault_t nvgrace_gpu_vfio_pci_fault(struct vm_fault *vmf)
> +{
> +	unsigned long mem_offset = vmf->pgoff - vmf->vma->vm_pgoff;
> +	struct vfio_device *core_vdev;
> +	struct nvgrace_gpu_vfio_pci_core_device *nvdev;
> +
> +	if (!(vmf->vma->vm_file))
> +		goto error_exit;
> +
> +	core_vdev = vfio_device_from_file(vmf->vma->vm_file);
> +
> +	if (!core_vdev)
> +		goto error_exit;
> +
> +	nvdev = container_of(core_vdev,
> +			struct nvgrace_gpu_vfio_pci_core_device, core_device.vdev);
> +
> +	/*
> +	 * Check if the page is poisoned.
> +	 */
> +	if (mem_offset < (nvdev->memlength >> PAGE_SHIFT) &&
> +		test_bit(mem_offset, nvdev->pfn_bitmap))
> +		return VM_FAULT_HWPOISON;
> +
> +error_exit:
> +	return VM_FAULT_ERROR;
> +}
> +
> +static const struct vm_operations_struct nvgrace_gpu_vfio_pci_mmap_ops = {
> +	.fault = nvgrace_gpu_vfio_pci_fault,
> +};
> +#endif
> +
>  static int nvgrace_gpu_vfio_pci_open_device(struct vfio_device *core_vdev)
>  {
>  	struct vfio_pci_core_device *vdev =
> @@ -46,6 +127,9 @@ static void nvgrace_gpu_vfio_pci_close_device(struct vfio_device *core_vdev)
>  
>  	mutex_destroy(&nvdev->memmap_lock);
>  
> +#ifdef CONFIG_MEMORY_FAILURE
> +	unregister_pfn_address_space(&(nvdev->pfn_address_space));
> +#endif
>  	vfio_pci_core_close_device(core_vdev);
>  }
>  
> @@ -104,8 +188,12 @@ static int nvgrace_gpu_vfio_pci_mmap(struct vfio_device *core_vdev,
>  		return ret;
>  
>  	vma->vm_pgoff = start_pfn;
> +#ifdef CONFIG_MEMORY_FAILURE
> +	vma->vm_ops = &nvgrace_gpu_vfio_pci_mmap_ops;
>  
> -	return 0;
> +	ret = nvgrace_gpu_vfio_pci_register_pfn_range(nvdev, vma);
> +#endif
> +	return ret;
>  }
>  
>  static long
> @@ -406,6 +494,19 @@ nvgrace_gpu_vfio_pci_fetch_memory_property(struct pci_dev *pdev,
>  
>  	nvdev->memlength = memlength;
>  
> +#ifdef CONFIG_MEMORY_FAILURE
> +	/*
> +	 * A bitmap is maintained to track the pages that are poisoned. Each
> +	 * page is represented by a bit. Allocation size in bytes is
> +	 * determined by shifting the device memory size by PAGE_SHIFT to
> +	 * determine the number of pages; and further shifted by 3 as each
> +	 * byte could track 8 pages.
> +	 */
> +	nvdev->pfn_bitmap
> +		= vzalloc((nvdev->memlength >> PAGE_SHIFT)/BITS_PER_TYPE(char));
> +	if (!nvdev->pfn_bitmap)
> +		ret = -ENOMEM;
> +#endif
>  	return ret;
>  }
>  
> @@ -442,6 +543,10 @@ static void nvgrace_gpu_vfio_pci_remove(struct pci_dev *pdev)
>  	struct nvgrace_gpu_vfio_pci_core_device *nvdev = nvgrace_gpu_drvdata(pdev);
>  	struct vfio_pci_core_device *vdev = &nvdev->core_device;
>  
> +#ifdef CONFIG_MEMORY_FAILURE
> +	vfree(nvdev->pfn_bitmap);
> +#endif
> +
>  	vfio_pci_core_unregister_device(vdev);
>  	vfio_put_device(&vdev->vdev);
>  }
> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> index 307e3f29b527..747094503909 100644
> --- a/drivers/vfio/vfio.h
> +++ b/drivers/vfio/vfio.h
> @@ -16,17 +16,6 @@ struct iommufd_ctx;
>  struct iommu_group;
>  struct vfio_container;
>  
> -struct vfio_device_file {
> -	struct vfio_device *device;
> -	struct vfio_group *group;
> -
> -	u8 access_granted;
> -	u32 devid; /* only valid when iommufd is valid */
> -	spinlock_t kvm_ref_lock; /* protect kvm field */
> -	struct kvm *kvm;
> -	struct iommufd_ctx *iommufd; /* protected by struct vfio_device_set::lock */
> -};
> -
>  void vfio_device_put_registration(struct vfio_device *device);
>  bool vfio_device_try_get_registration(struct vfio_device *device);
>  int vfio_df_open(struct vfio_device_file *df);
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 40732e8ed4c6..a7dafd7c64a6 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -1309,7 +1309,7 @@ const struct file_operations vfio_device_fops = {
>  	.mmap		= vfio_device_fops_mmap,
>  };
>  
> -static struct vfio_device *vfio_device_from_file(struct file *file)
> +struct vfio_device *vfio_device_from_file(struct file *file)
>  {
>  	struct vfio_device_file *df = file->private_data;
>  
> @@ -1317,6 +1317,7 @@ static struct vfio_device *vfio_device_from_file(struct file *file)
>  		return NULL;
>  	return df->device;
>  }
> +EXPORT_SYMBOL_GPL(vfio_device_from_file);
>  
>  /**
>   * vfio_file_is_valid - True if the file is valid vfio file
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 454e9295970c..d88af251e931 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -361,4 +361,19 @@ int vfio_virqfd_enable(void *opaque, int (*handler)(void *, void *),
>  		       struct virqfd **pvirqfd, int fd);
>  void vfio_virqfd_disable(struct virqfd **pvirqfd);
>  
> +/*
> + * VFIO device file.
> + */
> +struct vfio_device_file {
> +	struct vfio_device *device;
> +	struct vfio_group *group;
> +	u8 access_granted;
> +	u32 devid; /* only valid when iommufd is valid */
> +	spinlock_t kvm_ref_lock; /* protect kvm field */
> +	struct kvm *kvm;
> +	struct iommufd_ctx *iommufd; /* protected by struct vfio_device_set::lock */
> +};

What here necessitates moving this to the more public header?  Thanks,

Alex

> +
> +struct vfio_device *vfio_device_from_file(struct file *file);
> +
>  #endif /* VFIO_H */



^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-09-28 19:46 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-20 14:02 [PATCH v1 0/4] mm: Implement ECC handling for pfn with no struct page ankita
2023-09-20 14:02 ` [PATCH v1 1/4] mm: handle poisoning of pfn without struct pages ankita
2023-09-23  3:20   ` Miaohe Lin
2023-09-25 12:36     ` Jason Gunthorpe
2023-09-26  7:23   ` Naoya Horiguchi
2023-09-20 14:02 ` [PATCH v1 2/4] mm: Add poison error check in fixup_user_fault() for mapped pfn ankita
2023-09-20 14:02 ` [PATCH v1 3/4] mm: Change ghes code to allow poison of non-struct pfn ankita
2023-09-20 14:02 ` [PATCH v1 4/4] vfio/nvgpu: register device memory for poison handling ankita
2023-09-26  5:36   ` kernel test robot
2023-09-26  7:38   ` Naoya Horiguchi
2023-09-28 19:45   ` Alex Williamson
2023-09-20 16:02 ` [PATCH v1 0/4] mm: Implement ECC handling for pfn with no struct page Andrew Morton
2023-09-20 16:04   ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox