linux-mm.kvack.org archive mirror
* [PATCH v2 0/4] mm: Implement ECC handling for pfn with no struct page
@ 2023-11-23  0:35 ankita
  2023-11-23  0:35 ` [PATCH v2 1/4] mm: handle poisoning of pfn without struct pages ankita
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: ankita @ 2023-11-23  0:35 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, naoya.horiguchi, akpm, tony.luck,
	bp, linmiaohe, rafael, lenb, james.morse, shiju.jose, bhelgaas,
	pabeni, yishaih, shameerali.kolothum.thodi, kevin.tian
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	anuaggarwal, jhubbard, danw, mochs, kvm, linux-kernel,
	linux-arm-kernel, linux-mm, linux-edac, linux-acpi

From: Ankit Agrawal <ankita@nvidia.com>

The kernel MM currently handles ECC errors / poison only on memory pages
backed by struct page. As part of [1], the nvgrace-gpu-vfio-pci module
maps the device memory to user VA (Qemu) using remap_pfn_range() without
adding it to the kernel. These pages are not backed by struct page.

Implement a new ECC handling mechanism for memory without struct pages.
The kernel MM exposes registration APIs that allow modules managing such
device memory to register their memory regions and a callback function.
MM then tracks these regions using an interval tree.
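
As a rough userspace illustration of the registration idea (not the kernel
code in this series): a module registers a pfn range together with a
failure callback, and a lookup by pfn finds the owning region and invokes
that callback. Every name below (pfn_region, register_pfn_region,
unregister_pfn_region, report_pfn_failure, record_failure) is hypothetical,
and a plain linked list stands in for the interval tree to keep the sketch
short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the API added by the series. */
struct pfn_region {
	unsigned long start_pfn;	/* first pfn covered by the region */
	unsigned long nr_pfns;		/* number of pfns covered */
	/* called when a pfn inside the region is reported as faulty */
	void (*failure)(struct pfn_region *region, unsigned long pfn);
	struct pfn_region *next;	/* list link; the series uses an interval tree */
};

static struct pfn_region *regions;	/* all registered regions */
static unsigned long last_failed_pfn;	/* written by the example callback */

/* Example module callback: remember which pfn was reported. */
static void record_failure(struct pfn_region *region, unsigned long pfn)
{
	(void)region;
	last_failed_pfn = pfn;
}

/* Register a region. Returns -1 if it overlaps a registered region. */
static int register_pfn_region(struct pfn_region *new)
{
	struct pfn_region *r;

	assert(new != NULL && new->nr_pfns > 0);
	for (r = regions; r; r = r->next) {
		if (new->start_pfn < r->start_pfn + r->nr_pfns &&
		    r->start_pfn < new->start_pfn + new->nr_pfns)
			return -1;
	}
	new->next = regions;
	regions = new;
	return 0;
}

/* Remove a region from the list; a no-op if it was never registered. */
static void unregister_pfn_region(struct pfn_region *old)
{
	struct pfn_region **link;

	for (link = &regions; *link; link = &(*link)->next) {
		if (*link == old) {
			*link = old->next;
			old->next = NULL;
			return;
		}
	}
}

/* Notify the owner of pfn. Returns 0 if a region owned it, -1 otherwise. */
static int report_pfn_failure(unsigned long pfn)
{
	struct pfn_region *r;

	for (r = regions; r; r = r->next) {
		if (pfn >= r->start_pfn && pfn - r->start_pfn < r->nr_pfns) {
			if (r->failure)
				r->failure(r, pfn);
			return 0;
		}
	}
	return -1;
}
```

A caller would fill in a struct pfn_region with its range and callback,
register it once at setup, and unregister it on teardown. The linear scan
is O(n) per lookup; an interval tree gives the same overlap and
containment queries in logarithmic time.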

The mechanism is largely similar to that of ECC on pfn with struct pages.
If there is an ECC error on a pfn, MM uses the registered memory failure
callback function to notify the module of the faulty PFN, so that the
module may take any required action. The pfn is then unmapped in Stage-2.
When the VM tries to access the page, it gets trapped in KVM, which calls
the vm ops fault function. If the module fault function returns
VM_FAULT_HWPOISON, KVM sends a BUS_MCEERR_AR to the userspace process
(Qemu) that has the poisoned page mapped.
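
The module side of this flow can be sketched in userspace C, under stated
assumptions: the failure callback records the faulty pfn in a small hash
table, and a later fault on that pfn reports a poison result instead of
mapping the page. The names here (poison_table, mark_pfn_poisoned,
pfn_is_poisoned, sketch_fault_handler, SKETCH_FAULT_*) are hypothetical
stand-ins; in the kernel the fault handler returns a vm_fault_t such as
VM_FAULT_HWPOISON.

```c
#include <assert.h>
#include <stdlib.h>

#define POISON_BUCKETS 64

/* Stand-ins for kernel fault codes; hypothetical names for this sketch. */
enum sketch_fault {
	SKETCH_FAULT_OK = 0,
	SKETCH_FAULT_HWPOISON = 1,
};

/* One entry per poisoned pfn, chained per bucket. */
struct poisoned_pfn {
	unsigned long pfn;
	struct poisoned_pfn *next;
};

static struct poisoned_pfn *poison_table[POISON_BUCKETS];

static unsigned int poison_bucket(unsigned long pfn)
{
	return (unsigned int)(pfn % POISON_BUCKETS);
}

/* Returns 1 if pfn was previously recorded as poisoned, 0 otherwise. */
static int pfn_is_poisoned(unsigned long pfn)
{
	struct poisoned_pfn *e;

	for (e = poison_table[poison_bucket(pfn)]; e; e = e->next) {
		if (e->pfn == pfn)
			return 1;
	}
	return 0;
}

/*
 * Role of the memory failure callback: remember the faulty pfn.
 * Returns 0 on success (including an already recorded pfn) and
 * -1 if the entry could not be allocated.
 */
static int mark_pfn_poisoned(unsigned long pfn)
{
	struct poisoned_pfn *e;
	unsigned int b = poison_bucket(pfn);

	assert(b < POISON_BUCKETS);
	if (pfn_is_poisoned(pfn))
		return 0;
	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->pfn = pfn;
	e->next = poison_table[b];
	poison_table[b] = e;
	return 0;
}

/* Role of the vm ops fault function: refuse to map a poisoned pfn. */
static enum sketch_fault sketch_fault_handler(unsigned long pfn)
{
	return pfn_is_poisoned(pfn) ? SKETCH_FAULT_HWPOISON : SKETCH_FAULT_OK;
}
```

In this sketch a pfn that was never reported keeps faulting in normally,
while a reported pfn yields the poison result on every later access,
which is the condition under which the cover letter says KVM signals
userspace.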

Lastly, the nvgrace-gpu-vfio-pci module makes use of the new mechanism to
get poison handling support on the device memory.

Patches generated over v6.7-rc2 with [1] applied. [1] is currently under
review.

[1] https://lore.kernel.org/all/20231114081611.30550-1-ankita@nvidia.com/

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---

Link for v1: https://lore.kernel.org/all/20230920140210.12663-1-ankita@nvidia.com/

v1 -> v2
- Changed poisoned page tracking from a bitmap to a hashtable.
- Addressed miscellaneous comments on v1.

Ankit Agrawal (4):
  mm: handle poisoning of pfn without struct pages
  mm: Add poison error check in fixup_user_fault() for mapped pfn
  mm: Change ghes code to allow poison of non-struct pfn
  vfio/nvgpu: register device memory for poison handling

 drivers/acpi/apei/ghes.c            |  12 +--
 drivers/vfio/pci/nvgrace-gpu/main.c | 123 ++++++++++++++++++++++-
 drivers/vfio/vfio_main.c            |   3 +-
 include/linux/memory-failure.h      |  22 +++++
 include/linux/mm.h                  |   1 +
 include/ras/ras_event.h             |   1 +
 mm/Kconfig                          |   1 +
 mm/gup.c                            |   2 +-
 mm/memory-failure.c                 | 146 +++++++++++++++++++++++-----
 virt/kvm/kvm_main.c                 |   6 ++
 10 files changed, 278 insertions(+), 39 deletions(-)
 create mode 100644 include/linux/memory-failure.h

-- 
2.17.1





Thread overview: 11+ messages
2023-11-23  0:35 [PATCH v2 0/4] mm: Implement ECC handling for pfn with no struct page ankita
2023-11-23  0:35 ` [PATCH v2 1/4] mm: handle poisoning of pfn without struct pages ankita
2023-11-23  0:35 ` [PATCH v2 2/4] mm: Add poison error check in fixup_user_fault() for mapped pfn ankita
2023-12-01 17:04   ` Sean Christopherson
2023-11-23  0:35 ` [PATCH v2 3/4] mm: Change ghes code to allow poison of non-struct pfn ankita
2023-12-02 23:23   ` Borislav Petkov
2023-12-04 14:36     ` Jason Gunthorpe
2023-12-04 15:36       ` Borislav Petkov
2023-12-04 15:54         ` Ankit Agrawal
2023-12-04 15:55           ` Jason Gunthorpe
2023-11-23  0:35 ` [PATCH v2 4/4] vfio/nvgpu: register device memory for poison handling ankita
