* [RFC PATCH 0/3] mm: Fix MF_DELAYED handling on memory failure
@ 2025-10-15 18:35 Lisa Wang
2025-10-15 18:35 ` [RFC PATCH 1/3] mm: memory_failure: Fix MF_DELAYED handling on truncation during failure Lisa Wang
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: Lisa Wang @ 2025-10-15 18:35 UTC (permalink / raw)
To: linmiaohe, nao.horiguchi, akpm, pbonzini, shuah, linux-mm,
linux-kernel, kvm, linux-kselftest
Cc: david, rientjes, seanjc, ackerleytng, vannapurve, michael.roth,
jiaqiyan, tabba, dave.hansen, Lisa Wang
Hello,
This patch series addresses an issue in the memory failure handling path
where MF_DELAYED is incorrectly treated as an error. This issue was
revealed because guest_memfd’s .error_remove_folio() callback returns
MF_DELAYED.
Currently, when the .error_remove_folio() callback for guest_memfd returns
MF_DELAYED, there are a few issues.
1. truncate_error_folio() maps this to MF_FAILED. This causes
memory_failure() to return -EBUSY, which unconditionally triggers a
SIGBUS. The process’ configured memory corruption kill policy is ignored
- even if PR_MCE_KILL_LATE is set, the process will still get a SIGBUS
on deferred memory failures.
2. “Failed to punch page” is printed, even though MF_DELAYED indicates that
it was intentionally not punched.
The first patch corrects this by updating truncate_error_folio() to
propagate MF_DELAYED to its caller. This allows memory_failure() to return
0, indicating success, and lets the delayed handling proceed as designed.
This patch also updates me_pagecache_clean() to account for the folio's
refcount, which remains elevated during delayed handling, aligning its
logic with me_swapcache_dirty().
The subsequent two patches add KVM selftests to validate the fix and the
expected behavior of guest_memfd memory failure:
The first test patch verifies that memory_failure() now returns 0 in the
delayed case and confirms that SIGBUS signaling logic remains correct for
other scenarios (e.g., madvise injection or PR_MCE_KILL_EARLY).
The second test patch confirms that after a memory failure, the poisoned
page is correctly unmapped from the KVM guest's stage 2 page tables and
that a subsequent access by the guest correctly notifies the userspace VMM
with EHWPOISON.
This patch series is built upon kvm/next. In addition, to align with the
INIT_SHARED change and to use the macro wrapper in the guest_memfd
selftests, these patches are applied on top of Sean's patches [1].
For ease of testing, this series is also available, stitched together, at
https://github.com/googleprodkernel/linux-cc/tree/memory-failure-mf-delayed-fix-rfc-v1
[1]: https://lore.kernel.org/all/20251003232606.4070510-1-seanjc@google.com/T/
Thank you,
Lisa Wang (3):
mm: memory_failure: Fix MF_DELAYED handling on truncation during
failure
KVM: selftests: Add memory failure tests in guest_memfd_test
KVM: selftests: Test guest_memfd behavior with respect to stage 2 page
tables
mm/memory-failure.c | 24 +-
.../testing/selftests/kvm/guest_memfd_test.c | 233 ++++++++++++++++++
2 files changed, 248 insertions(+), 9 deletions(-)
--
2.51.0.788.g6d19910ace-goog
^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC PATCH 1/3] mm: memory_failure: Fix MF_DELAYED handling on truncation during failure
2025-10-15 18:35 [RFC PATCH 0/3] mm: Fix MF_DELAYED handling on memory failure Lisa Wang
@ 2025-10-15 18:35 ` Lisa Wang
2025-10-15 18:58 ` [RFC PATCH RESEND " Lisa Wang
` (2 more replies)
2025-10-15 18:35 ` [RFC PATCH 2/3] KVM: selftests: Add memory failure tests in guest_memfd_test Lisa Wang
` (2 subsequent siblings)
3 siblings, 3 replies; 11+ messages in thread
From: Lisa Wang @ 2025-10-15 18:35 UTC (permalink / raw)
To: linmiaohe, nao.horiguchi, akpm, pbonzini, shuah, linux-mm,
linux-kernel, kvm, linux-kselftest
Cc: david, rientjes, seanjc, ackerleytng, vannapurve, michael.roth,
jiaqiyan, tabba, dave.hansen, Lisa Wang
The .error_remove_folio a_ops is used by different filesystems to handle
folio truncation upon discovery of a memory failure in the memory
associated with the given folio.
Currently, MF_DELAYED is treated as an error, causing "Failed to punch
page" to be written to the console. MF_DELAYED is then relayed to the
caller of truncate_error_folio() as MF_FAILED. This further causes
memory_failure() to return -EBUSY, which then always causes a SIGBUS.
This also implies that regardless of whether the thread's memory
corruption kill policy is PR_MCE_KILL_EARLY or PR_MCE_KILL_LATE, a
memory failure within guest_memfd memory will always cause a SIGBUS.
Update truncate_error_folio() to return MF_DELAYED to the caller if the
.error_remove_folio() callback reports MF_DELAYED.
Generalize the comment: MF_DELAYED means memory failure was handled and
some other part of memory failure will be handled later (e.g. a next
access will result in the process being killed). Specifically for
guest_memfd, a next access by the guest will result in an error returned
to the userspace VMM.
With delayed handling, the filemap continues to hold refcounts on the
folio. Hence, take that into account when checking for extra refcounts
in me_pagecache_clean(). This is aligned with the implementation in
me_swapcache_dirty(), where, if a folio is still in the swap cache,
extra_pins is set to true.
Signed-off-by: Lisa Wang <wyihan@google.com>
---
mm/memory-failure.c | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index df6ee59527dd..77f665c16a73 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -922,9 +922,11 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
* by the m-f() handler immediately.
*
* MF_DELAYED - The m-f() handler marks the page as PG_hwpoisoned'ed.
- * The page is unmapped, and is removed from the LRU or file mapping.
- * An attempt to access the page again will trigger page fault and the
- * PF handler will kill the process.
+ * It means memory_failure was handled (e.g. removed from file mapping or the
+ * LRU) and some other part of memory failure will be handled later (e.g. a
+ * next access will result in the process being killed). Specifically for
+ * guest_memfd, a next access by the guest will result in an error returned to
+ * the userspace VMM.
*
* MF_RECOVERED - The m-f() handler marks the page as PG_hwpoisoned'ed.
* The page has been completely isolated, that is, unmapped, taken out of
@@ -999,6 +1001,9 @@ static int truncate_error_folio(struct folio *folio, unsigned long pfn,
if (mapping->a_ops->error_remove_folio) {
int err = mapping->a_ops->error_remove_folio(mapping, folio);
+ if (err == MF_DELAYED)
+ return err;
+
if (err != 0)
pr_info("%#lx: Failed to punch page: %d\n", pfn, err);
else if (!filemap_release_folio(folio, GFP_NOIO))
@@ -1108,18 +1113,19 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
goto out;
}
- /*
- * The shmem page is kept in page cache instead of truncating
- * so is expected to have an extra refcount after error-handling.
- */
- extra_pins = shmem_mapping(mapping);
-
/*
* Truncation is a bit tricky. Enable it per file system for now.
*
* Open: to take i_rwsem or not for this? Right now we don't.
*/
ret = truncate_error_folio(folio, page_to_pfn(p), mapping);
+
+ /*
+ * The shmem page, or any page with MF_DELAYED error handling, is kept in
+ * page cache instead of truncating, so is expected to have an extra
+ * refcount after error-handling.
+ */
+ extra_pins = shmem_mapping(mapping) || ret == MF_DELAYED;
if (has_extra_refcount(ps, p, extra_pins))
ret = MF_FAILED;
--
2.51.0.788.g6d19910ace-goog
* [RFC PATCH 2/3] KVM: selftests: Add memory failure tests in guest_memfd_test
2025-10-15 18:35 [RFC PATCH 0/3] mm: Fix MF_DELAYED handling on memory failure Lisa Wang
2025-10-15 18:35 ` [RFC PATCH 1/3] mm: memory_failure: Fix MF_DELAYED handling on truncation during failure Lisa Wang
@ 2025-10-15 18:35 ` Lisa Wang
2025-10-15 18:58 ` [RFC PATCH RESEND " Lisa Wang
2025-10-15 18:35 ` [RFC PATCH 3/3] KVM: selftests: Test guest_memfd behavior with respect to stage 2 page tables Lisa Wang
2025-10-15 18:58 ` [RFC PATCH RESEND 0/3] mm: Fix MF_DELAYED handling on memory failure Lisa Wang
3 siblings, 1 reply; 11+ messages in thread
From: Lisa Wang @ 2025-10-15 18:35 UTC (permalink / raw)
To: linmiaohe, nao.horiguchi, akpm, pbonzini, shuah, linux-mm,
linux-kernel, kvm, linux-kselftest
Cc: david, rientjes, seanjc, ackerleytng, vannapurve, michael.roth,
jiaqiyan, tabba, dave.hansen, Lisa Wang
After modifying truncate_error_folio(), we expect memory_failure() to
return 0 instead of -EBUSY. Also, we want to make sure memory_failure()'s
SIGBUS signaling behavior remains the same.
Test that memory_failure() returns 0 for guest_memfd, where
.error_remove_folio() is handled by not actually truncating, and returning
MF_DELAYED.
In addition, test that SIGBUS signaling behavior is not changed before
and after this modification.
There are two kinds of guest memory failure injection - madvise and
debugfs. When memory failure is injected using madvise, the
MF_ACTION_REQUIRED flag is set; if the page is mapped and dirty, the
process should get a SIGBUS. When memory failure is injected using
debugfs and the PR_MCE_KILL_EARLY memory corruption kill policy is set,
if the page is mapped and dirty, the process should get a SIGBUS.
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Lisa Wang <wyihan@google.com>
---
.../testing/selftests/kvm/guest_memfd_test.c | 168 ++++++++++++++++++
1 file changed, 168 insertions(+)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index e7d9aeb418d3..7bcf8d2d5d4d 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -10,6 +10,8 @@
#include <errno.h>
#include <stdio.h>
#include <fcntl.h>
+#include <linux/prctl.h>
+#include <sys/prctl.h>
#include <linux/bitmap.h>
#include <linux/falloc.h>
@@ -97,6 +99,171 @@ static void test_fault_overflow(int fd, size_t total_size)
test_fault_sigbus(fd, total_size, total_size * 4);
}
+static unsigned long addr_to_pfn(void *addr)
+{
+ const uint64_t pagemap_pfn_mask = BIT(54) - 1;
+ const uint64_t pagemap_page_present = BIT(63);
+ uint64_t page_info;
+ ssize_t n_bytes;
+ int pagemap_fd;
+
+ pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+ TEST_ASSERT(pagemap_fd > 0, "Opening pagemap should succeed.");
+
+ n_bytes = pread(pagemap_fd, &page_info, 8, (uint64_t)addr / page_size * 8);
+ TEST_ASSERT(n_bytes == 8, "pread of pagemap failed. n_bytes=%ld", n_bytes);
+
+ close(pagemap_fd);
+
+ TEST_ASSERT(page_info & pagemap_page_present, "The page for addr should be present");
+ return page_info & pagemap_pfn_mask;
+}
+
+static void write_memory_failure(unsigned long pfn, bool mark, int return_code)
+{
+ char path[PATH_MAX];
+ char *filename;
+ char buf[20];
+ int ret;
+ int len;
+ int fd;
+
+ filename = mark ? "corrupt-pfn" : "unpoison-pfn";
+ snprintf(path, PATH_MAX, "/sys/kernel/debug/hwpoison/%s", filename);
+
+ fd = open(path, O_WRONLY);
+ TEST_ASSERT(fd > 0, "Failed to open %s.", path);
+
+ len = snprintf(buf, sizeof(buf), "0x%lx\n", pfn);
+ if (len < 0 || (unsigned int)len > sizeof(buf))
+ TEST_ASSERT(0, "snprintf failed or truncated.");
+
+ ret = write(fd, buf, len);
+ if (return_code == 0) {
+ /*
+ * If the memory_failure() returns 0, write() should be successful,
+ * which returns how many bytes it writes.
+ */
+ TEST_ASSERT(ret > 0, "Writing memory failure (path: %s) failed: %s", path,
+ strerror(errno));
+ } else {
+ TEST_ASSERT_EQ(ret, -1);
+ /* errno is memory_failure() return code. */
+ TEST_ASSERT_EQ(errno, return_code);
+ }
+
+ close(fd);
+}
+
+static void mark_memory_failure(unsigned long pfn, int return_code)
+{
+ write_memory_failure(pfn, true, return_code);
+}
+
+static void unmark_memory_failure(unsigned long pfn, int return_code)
+{
+ write_memory_failure(pfn, false, return_code);
+}
+
+enum memory_failure_injection_method {
+ MF_INJECT_DEBUGFS,
+ MF_INJECT_MADVISE,
+};
+
+static void do_test_memory_failure(int fd, size_t total_size,
+ enum memory_failure_injection_method method, int kill_config,
+ bool map_page, bool dirty_page, bool sigbus_expected,
+ int return_code)
+{
+ unsigned long memory_failure_pfn;
+ char *memory_failure_addr;
+ char *mem;
+ int ret;
+
+ mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT(mem != MAP_FAILED, "mmap() for guest_memfd should succeed.");
+ memory_failure_addr = mem + page_size;
+ if (dirty_page)
+ *memory_failure_addr = 'A';
+ else
+ READ_ONCE(*memory_failure_addr);
+
+ /* Fault in page to read pfn, then unmap page for testing if needed. */
+ memory_failure_pfn = addr_to_pfn(memory_failure_addr);
+ if (!map_page)
+ madvise(memory_failure_addr, page_size, MADV_DONTNEED);
+
+ ret = prctl(PR_MCE_KILL, PR_MCE_KILL_SET, kill_config, 0, 0);
+ TEST_ASSERT_EQ(ret, 0);
+
+ ret = 0;
+ switch (method) {
+ case MF_INJECT_DEBUGFS: {
+ /* DEBUGFS injection handles return_code test inside the mark_memory_failure(). */
+ if (sigbus_expected)
+ TEST_EXPECT_SIGBUS(mark_memory_failure(memory_failure_pfn, return_code));
+ else
+ mark_memory_failure(memory_failure_pfn, return_code);
+ break;
+ }
+ case MF_INJECT_MADVISE: {
+ /*
+ * MADV_HWPOISON uses get_user_pages() so the page will always
+ * be faulted in at the point of memory_failure()
+ */
+ if (sigbus_expected)
+ TEST_EXPECT_SIGBUS(ret = madvise(memory_failure_addr,
+ page_size, MADV_HWPOISON));
+ else
+ ret = madvise(memory_failure_addr, page_size, MADV_HWPOISON);
+
+ if (return_code == 0)
+ TEST_ASSERT(ret == return_code, "Memory failure failed. Errno: %s",
+ strerror(errno));
+ else {
+ /* errno is memory_failure() return code. */
+ TEST_ASSERT_EQ(errno, return_code);
+ }
+ break;
+ }
+ default:
+ TEST_FAIL("Unhandled memory failure injection method %d.", method);
+ }
+
+ TEST_EXPECT_SIGBUS(READ_ONCE(*memory_failure_addr));
+ TEST_EXPECT_SIGBUS(*memory_failure_addr = 'A');
+
+ ret = munmap(mem, total_size);
+ TEST_ASSERT(!ret, "munmap() should succeed.");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0,
+ total_size);
+ TEST_ASSERT(!ret, "Truncate the entire file (cleanup) should succeed.");
+
+ ret = prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_DEFAULT, 0, 0);
+ TEST_ASSERT_EQ(ret, 0);
+
+ unmark_memory_failure(memory_failure_pfn, 0);
+}
+
+static void test_memory_failure(int fd, size_t total_size)
+{
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_EARLY, true, true, true, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_EARLY, true, false, false, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_EARLY, false, true, false, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_LATE, true, true, false, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_LATE, true, false, false, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_LATE, false, true, false, 0);
+ /*
+ * If madvise() is used to inject errors, memory_failure() handling is invoked with the
+ * MF_ACTION_REQUIRED flag set, aligned with memory failure handling for a consumed memory
+ * error, where the machine check memory corruption kill policy is ignored. Hence, testing with
+ * PR_MCE_KILL_DEFAULT covers all cases.
+ */
+ do_test_memory_failure(fd, total_size, MF_INJECT_MADVISE, PR_MCE_KILL_DEFAULT, true, true, true, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_MADVISE, PR_MCE_KILL_DEFAULT, true, false, false, 0);
+}
+
static void test_fault_private(int fd, size_t total_size)
{
test_fault_sigbus(fd, 0, total_size);
@@ -273,6 +440,7 @@ static void __test_guest_memfd(struct kvm_vm *vm, uint64_t flags)
if (flags & GUEST_MEMFD_FLAG_INIT_SHARED) {
gmem_test(mmap_supported, vm, flags);
gmem_test(fault_overflow, vm, flags);
+ gmem_test(memory_failure, vm, flags);
} else {
gmem_test(fault_private, vm, flags);
}
--
2.51.0.788.g6d19910ace-goog
* [RFC PATCH 3/3] KVM: selftests: Test guest_memfd behavior with respect to stage 2 page tables
2025-10-15 18:35 [RFC PATCH 0/3] mm: Fix MF_DELAYED handling on memory failure Lisa Wang
2025-10-15 18:35 ` [RFC PATCH 1/3] mm: memory_failure: Fix MF_DELAYED handling on truncation during failure Lisa Wang
2025-10-15 18:35 ` [RFC PATCH 2/3] KVM: selftests: Add memory failure tests in guest_memfd_test Lisa Wang
@ 2025-10-15 18:35 ` Lisa Wang
2025-10-15 18:58 ` [RFC PATCH RESEND " Lisa Wang
2025-10-15 18:58 ` [RFC PATCH RESEND 0/3] mm: Fix MF_DELAYED handling on memory failure Lisa Wang
3 siblings, 1 reply; 11+ messages in thread
From: Lisa Wang @ 2025-10-15 18:35 UTC (permalink / raw)
To: linmiaohe, nao.horiguchi, akpm, pbonzini, shuah, linux-mm,
linux-kernel, kvm, linux-kselftest
Cc: david, rientjes, seanjc, ackerleytng, vannapurve, michael.roth,
jiaqiyan, tabba, dave.hansen, Lisa Wang
Test that
+ memory failure handling results in unmapping of bad memory from stage
2 page tables, hence requiring faulting on next guest access
+ when the guest tries to fault in a poisoned page from guest_memfd, the
userspace VMM is informed with EHWPOISON
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Lisa Wang <wyihan@google.com>
---
.../testing/selftests/kvm/guest_memfd_test.c | 65 +++++++++++++++++++
1 file changed, 65 insertions(+)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 7bcf8d2d5d4d..dc3398e22edd 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -539,6 +539,70 @@ static void test_guest_memfd_guest(void)
kvm_vm_free(vm);
}
+static void __guest_code_read(uint8_t *mem)
+{
+ READ_ONCE(*mem);
+ GUEST_DONE();
+}
+
+static void guest_read(struct kvm_vcpu *vcpu, uint64_t gpa, int expected_errno)
+{
+ vcpu_arch_set_entry_point(vcpu, __guest_code_read);
+ vcpu_args_set(vcpu, 1, gpa);
+
+ if (expected_errno) {
+ TEST_ASSERT_EQ(_vcpu_run(vcpu), -1);
+ TEST_ASSERT_EQ(errno, expected_errno);
+ } else {
+ vcpu_run(vcpu);
+ TEST_ASSERT_EQ(get_ucall(vcpu, NULL), UCALL_DONE);
+ }
+}
+
+static void test_memory_failure_guest(void)
+{
+ const uint64_t gpa = SZ_4G;
+ const int slot = 1;
+
+ unsigned long memory_failure_pfn;
+ struct kvm_vcpu *vcpu;
+ struct kvm_vm *vm;
+ uint8_t *mem;
+ size_t size;
+ int fd;
+
+ if (!kvm_has_cap(KVM_CAP_GUEST_MEMFD_FLAGS))
+ return;
+
+ vm = __vm_create_shape_with_one_vcpu(VM_SHAPE_DEFAULT, &vcpu, 1, __guest_code_read);
+
+ size = vm->page_size;
+ fd = vm_create_guest_memfd(vm, size, GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED);
+ vm_set_user_memory_region2(vm, slot, KVM_MEM_GUEST_MEMFD, gpa, size, NULL, fd, 0);
+
+ mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT(mem != MAP_FAILED, "mmap() for guest_memfd should succeed.");
+ virt_pg_map(vm, gpa, gpa);
+
+ /* Fault in page to read pfn, then unmap page for testing. */
+ READ_ONCE(*mem);
+ memory_failure_pfn = addr_to_pfn(mem);
+ munmap(mem, size);
+
+ /* Fault page into stage2 page tables. */
+ guest_read(vcpu, gpa, 0);
+
+ mark_memory_failure(memory_failure_pfn, 0);
+
+ guest_read(vcpu, gpa, EHWPOISON);
+ munmap(mem, size);
+
+ close(fd);
+ kvm_vm_free(vm);
+
+ unmark_memory_failure(memory_failure_pfn, 0);
+}
+
int main(int argc, char *argv[])
{
unsigned long vm_types, vm_type;
@@ -559,4 +623,5 @@ int main(int argc, char *argv[])
test_guest_memfd(vm_type);
test_guest_memfd_guest();
+ test_memory_failure_guest();
}
--
2.51.0.788.g6d19910ace-goog
* [RFC PATCH RESEND 0/3] mm: Fix MF_DELAYED handling on memory failure
2025-10-15 18:35 [RFC PATCH 0/3] mm: Fix MF_DELAYED handling on memory failure Lisa Wang
` (2 preceding siblings ...)
2025-10-15 18:35 ` [RFC PATCH 3/3] KVM: selftests: Test guest_memfd behavior with respect to stage 2 page tables Lisa Wang
@ 2025-10-15 18:58 ` Lisa Wang
3 siblings, 0 replies; 11+ messages in thread
From: Lisa Wang @ 2025-10-15 18:58 UTC (permalink / raw)
To: linmiaohe, nao.horiguchi, akpm, pbonzini, shuah, linux-mm,
linux-kernel, kvm, linux-kselftest
Cc: david, rientjes, seanjc, ackerleytng, vannapurve, michael.roth,
jiaqiyan, tabba, dave.hansen, Lisa Wang
[resend to correct the mailing list address]
Hello,
This patch series addresses an issue in the memory failure handling path
where MF_DELAYED is incorrectly treated as an error. This issue was
revealed because guest_memfd’s .error_remove_folio() callback returns
MF_DELAYED.
Currently, when the .error_remove_folio() callback for guest_memfd returns
MF_DELAYED, there are a few issues.
1. truncate_error_folio() maps this to MF_FAILED. This causes
memory_failure() to return -EBUSY, which unconditionally triggers a
SIGBUS. The process’ configured memory corruption kill policy is ignored
- even if PR_MCE_KILL_LATE is set, the process will still get a SIGBUS
on deferred memory failures.
2. “Failed to punch page” is printed, even though MF_DELAYED indicates that
it was intentionally not punched.
The first patch corrects this by updating truncate_error_folio() to
propagate MF_DELAYED to its caller. This allows memory_failure() to return
0, indicating success, and lets the delayed handling proceed as designed.
This patch also updates me_pagecache_clean() to account for the folio's
refcount, which remains elevated during delayed handling, aligning its
logic with me_swapcache_dirty().
The subsequent two patches add KVM selftests to validate the fix and the
expected behavior of guest_memfd memory failure:
The first test patch verifies that memory_failure() now returns 0 in the
delayed case and confirms that SIGBUS signaling logic remains correct for
other scenarios (e.g., madvise injection or PR_MCE_KILL_EARLY).
The second test patch confirms that after a memory failure, the poisoned
page is correctly unmapped from the KVM guest's stage 2 page tables and
that a subsequent access by the guest correctly notifies the userspace VMM
with EHWPOISON.
This patch series is built upon kvm/next. In addition, to align with the
INIT_SHARED change and to use the macro wrapper in the guest_memfd
selftests, these patches are applied on top of Sean's patches [1].
For ease of testing, this series is also available, stitched together, at
https://github.com/googleprodkernel/linux-cc/tree/memory-failure-mf-delayed-fix-rfc-v1
[1]: https://lore.kernel.org/all/20251003232606.4070510-1-seanjc@google.com/T/
Thank you,
Lisa Wang (3):
mm: memory_failure: Fix MF_DELAYED handling on truncation during
failure
KVM: selftests: Add memory failure tests in guest_memfd_test
KVM: selftests: Test guest_memfd behavior with respect to stage 2 page
tables
mm/memory-failure.c | 24 +-
.../testing/selftests/kvm/guest_memfd_test.c | 233 ++++++++++++++++++
2 files changed, 248 insertions(+), 9 deletions(-)
--
2.51.0.788.g6d19910ace-goog
* [RFC PATCH RESEND 1/3] mm: memory_failure: Fix MF_DELAYED handling on truncation during failure
2025-10-15 18:35 ` [RFC PATCH 1/3] mm: memory_failure: Fix MF_DELAYED handling on truncation during failure Lisa Wang
@ 2025-10-15 18:58 ` Lisa Wang
2025-10-16 20:18 ` David Hildenbrand
2025-10-20 12:37 ` David Hildenbrand
2 siblings, 0 replies; 11+ messages in thread
From: Lisa Wang @ 2025-10-15 18:58 UTC (permalink / raw)
To: linmiaohe, nao.horiguchi, akpm, pbonzini, shuah, linux-mm,
linux-kernel, kvm, linux-kselftest
Cc: david, rientjes, seanjc, ackerleytng, vannapurve, michael.roth,
jiaqiyan, tabba, dave.hansen, Lisa Wang
The .error_remove_folio a_ops is used by different filesystems to handle
folio truncation upon discovery of a memory failure in the memory
associated with the given folio.
Currently, MF_DELAYED is treated as an error, causing "Failed to punch
page" to be written to the console. MF_DELAYED is then relayed to the
caller of truncate_error_folio() as MF_FAILED. This further causes
memory_failure() to return -EBUSY, which then always causes a SIGBUS.
This also implies that regardless of whether the thread's memory
corruption kill policy is PR_MCE_KILL_EARLY or PR_MCE_KILL_LATE, a
memory failure within guest_memfd memory will always cause a SIGBUS.
Update truncate_error_folio() to return MF_DELAYED to the caller if the
.error_remove_folio() callback reports MF_DELAYED.
Generalize the comment: MF_DELAYED means memory failure was handled and
some other part of memory failure will be handled later (e.g. a next
access will result in the process being killed). Specifically for
guest_memfd, a next access by the guest will result in an error returned
to the userspace VMM.
With delayed handling, the filemap continues to hold refcounts on the
folio. Hence, take that into account when checking for extra refcounts
in me_pagecache_clean(). This is aligned with the implementation in
me_swapcache_dirty(), where, if a folio is still in the swap cache,
extra_pins is set to true.
Signed-off-by: Lisa Wang <wyihan@google.com>
---
mm/memory-failure.c | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index df6ee59527dd..77f665c16a73 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -922,9 +922,11 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
* by the m-f() handler immediately.
*
* MF_DELAYED - The m-f() handler marks the page as PG_hwpoisoned'ed.
- * The page is unmapped, and is removed from the LRU or file mapping.
- * An attempt to access the page again will trigger page fault and the
- * PF handler will kill the process.
+ * It means memory_failure was handled (e.g. removed from file mapping or the
+ * LRU) and some other part of memory failure will be handled later (e.g. a
+ * next access will result in the process being killed). Specifically for
+ * guest_memfd, a next access by the guest will result in an error returned to
+ * the userspace VMM.
*
* MF_RECOVERED - The m-f() handler marks the page as PG_hwpoisoned'ed.
* The page has been completely isolated, that is, unmapped, taken out of
@@ -999,6 +1001,9 @@ static int truncate_error_folio(struct folio *folio, unsigned long pfn,
if (mapping->a_ops->error_remove_folio) {
int err = mapping->a_ops->error_remove_folio(mapping, folio);
+ if (err == MF_DELAYED)
+ return err;
+
if (err != 0)
pr_info("%#lx: Failed to punch page: %d\n", pfn, err);
else if (!filemap_release_folio(folio, GFP_NOIO))
@@ -1108,18 +1113,19 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
goto out;
}
- /*
- * The shmem page is kept in page cache instead of truncating
- * so is expected to have an extra refcount after error-handling.
- */
- extra_pins = shmem_mapping(mapping);
-
/*
* Truncation is a bit tricky. Enable it per file system for now.
*
* Open: to take i_rwsem or not for this? Right now we don't.
*/
ret = truncate_error_folio(folio, page_to_pfn(p), mapping);
+
+ /*
+ * The shmem page, or any page with MF_DELAYED error handling, is kept in
+ * page cache instead of truncating, so is expected to have an extra
+ * refcount after error-handling.
+ */
+ extra_pins = shmem_mapping(mapping) || ret == MF_DELAYED;
if (has_extra_refcount(ps, p, extra_pins))
ret = MF_FAILED;
--
2.51.0.788.g6d19910ace-goog
* [RFC PATCH RESEND 2/3] KVM: selftests: Add memory failure tests in guest_memfd_test
2025-10-15 18:35 ` [RFC PATCH 2/3] KVM: selftests: Add memory failure tests in guest_memfd_test Lisa Wang
@ 2025-10-15 18:58 ` Lisa Wang
0 siblings, 0 replies; 11+ messages in thread
From: Lisa Wang @ 2025-10-15 18:58 UTC (permalink / raw)
To: linmiaohe, nao.horiguchi, akpm, pbonzini, shuah, linux-mm,
linux-kernel, kvm, linux-kselftest
Cc: david, rientjes, seanjc, ackerleytng, vannapurve, michael.roth,
jiaqiyan, tabba, dave.hansen, Lisa Wang
After modifying truncate_error_folio(), we expect memory_failure() to
return 0 instead of -EBUSY. Also, we want to make sure memory_failure()'s
SIGBUS signaling behavior remains the same.
Test that memory_failure() returns 0 for guest_memfd, where
.error_remove_folio() is handled by not actually truncating, and returning
MF_DELAYED.
In addition, test that SIGBUS signaling behavior is not changed before
and after this modification.
There are two kinds of guest memory failure injection - madvise and
debugfs. When memory failure is injected using madvise, the
MF_ACTION_REQUIRED flag is set; if the page is mapped and dirty, the
process should get a SIGBUS. When memory failure is injected using
debugfs and the PR_MCE_KILL_EARLY memory corruption kill policy is set,
if the page is mapped and dirty, the process should get a SIGBUS.
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Lisa Wang <wyihan@google.com>
---
.../testing/selftests/kvm/guest_memfd_test.c | 168 ++++++++++++++++++
1 file changed, 168 insertions(+)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index e7d9aeb418d3..7bcf8d2d5d4d 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -10,6 +10,8 @@
#include <errno.h>
#include <stdio.h>
#include <fcntl.h>
+#include <linux/prctl.h>
+#include <sys/prctl.h>
#include <linux/bitmap.h>
#include <linux/falloc.h>
@@ -97,6 +99,171 @@ static void test_fault_overflow(int fd, size_t total_size)
test_fault_sigbus(fd, total_size, total_size * 4);
}
+static unsigned long addr_to_pfn(void *addr)
+{
+ const uint64_t pagemap_pfn_mask = BIT(54) - 1;
+ const uint64_t pagemap_page_present = BIT(63);
+ uint64_t page_info;
+ ssize_t n_bytes;
+ int pagemap_fd;
+
+ pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+ TEST_ASSERT(pagemap_fd > 0, "Opening pagemap should succeed.");
+
+ n_bytes = pread(pagemap_fd, &page_info, 8, (uint64_t)addr / page_size * 8);
+ TEST_ASSERT(n_bytes == 8, "pread of pagemap failed. n_bytes=%ld", n_bytes);
+
+ close(pagemap_fd);
+
+ TEST_ASSERT(page_info & pagemap_page_present, "The page for addr should be present");
+ return page_info & pagemap_pfn_mask;
+}
+
+static void write_memory_failure(unsigned long pfn, bool mark, int return_code)
+{
+ char path[PATH_MAX];
+ char *filename;
+ char buf[20];
+ int ret;
+ int len;
+ int fd;
+
+ filename = mark ? "corrupt-pfn" : "unpoison-pfn";
+ snprintf(path, PATH_MAX, "/sys/kernel/debug/hwpoison/%s", filename);
+
+ fd = open(path, O_WRONLY);
+ TEST_ASSERT(fd > 0, "Failed to open %s.", path);
+
+ len = snprintf(buf, sizeof(buf), "0x%lx\n", pfn);
+ if (len < 0 || (unsigned int)len > sizeof(buf))
+ TEST_ASSERT(0, "snprintf failed or truncated.");
+
+ ret = write(fd, buf, len);
+ if (return_code == 0) {
+ /*
+ * If the memory_failure() returns 0, write() should be successful,
+ * which returns how many bytes it writes.
+ */
+ TEST_ASSERT(ret > 0, "Writing memory failure (path: %s) failed: %s", path,
+ strerror(errno));
+ } else {
+ TEST_ASSERT_EQ(ret, -1);
+ /* errno is memory_failure() return code. */
+ TEST_ASSERT_EQ(errno, return_code);
+ }
+
+ close(fd);
+}
+
+static void mark_memory_failure(unsigned long pfn, int return_code)
+{
+ write_memory_failure(pfn, true, return_code);
+}
+
+static void unmark_memory_failure(unsigned long pfn, int return_code)
+{
+ write_memory_failure(pfn, false, return_code);
+}
+
+enum memory_failure_injection_method {
+ MF_INJECT_DEBUGFS,
+ MF_INJECT_MADVISE,
+};
+
+static void do_test_memory_failure(int fd, size_t total_size,
+ enum memory_failure_injection_method method, int kill_config,
+ bool map_page, bool dirty_page, bool sigbus_expected,
+ int return_code)
+{
+ unsigned long memory_failure_pfn;
+ char *memory_failure_addr;
+ char *mem;
+ int ret;
+
+ mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT(mem != MAP_FAILED, "mmap() for guest_memfd should succeed.");
+ memory_failure_addr = mem + page_size;
+ if (dirty_page)
+ *memory_failure_addr = 'A';
+ else
+ READ_ONCE(*memory_failure_addr);
+
+ /* Fault in page to read pfn, then unmap page for testing if needed. */
+ memory_failure_pfn = addr_to_pfn(memory_failure_addr);
+ if (!map_page)
+ madvise(memory_failure_addr, page_size, MADV_DONTNEED);
+
+ ret = prctl(PR_MCE_KILL, PR_MCE_KILL_SET, kill_config, 0, 0);
+ TEST_ASSERT_EQ(ret, 0);
+
+ ret = 0;
+ switch (method) {
+ case MF_INJECT_DEBUGFS: {
+ /* Debugfs injection checks return_code inside mark_memory_failure(). */
+ if (sigbus_expected)
+ TEST_EXPECT_SIGBUS(mark_memory_failure(memory_failure_pfn, return_code));
+ else
+ mark_memory_failure(memory_failure_pfn, return_code);
+ break;
+ }
+ case MF_INJECT_MADVISE: {
+ /*
+ * MADV_HWPOISON uses get_user_pages() so the page will always
+ * be faulted in at the point of memory_failure()
+ */
+ if (sigbus_expected)
+ TEST_EXPECT_SIGBUS(ret = madvise(memory_failure_addr,
+ page_size, MADV_HWPOISON));
+ else
+ ret = madvise(memory_failure_addr, page_size, MADV_HWPOISON);
+
+ if (return_code == 0) {
+ TEST_ASSERT(ret == return_code, "Memory failure failed. Errno: %s",
+ strerror(errno));
+ } else {
+ /* errno is memory_failure() return code. */
+ TEST_ASSERT_EQ(errno, return_code);
+ }
+ break;
+ }
+ default:
+ TEST_FAIL("Unhandled memory failure injection method %d.", method);
+ }
+
+ TEST_EXPECT_SIGBUS(READ_ONCE(*memory_failure_addr));
+ TEST_EXPECT_SIGBUS(*memory_failure_addr = 'A');
+
+ ret = munmap(mem, total_size);
+ TEST_ASSERT(!ret, "munmap() should succeed.");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0,
+ total_size);
+ TEST_ASSERT(!ret, "Truncating the entire file (cleanup) should succeed.");
+
+ ret = prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_DEFAULT, 0, 0);
+ TEST_ASSERT_EQ(ret, 0);
+
+ unmark_memory_failure(memory_failure_pfn, 0);
+}
+
+static void test_memory_failure(int fd, size_t total_size)
+{
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_EARLY, true, true, true, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_EARLY, true, false, false, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_EARLY, false, true, false, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_LATE, true, true, false, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_LATE, true, false, false, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_LATE, false, true, false, 0);
+ /*
+ * If madvise() is used to inject errors, memory_failure() handling is invoked with the
+ * MF_ACTION_REQUIRED flag set, aligned with memory failure handling for a consumed memory
+ * error, where the machine check memory corruption kill policy is ignored. Hence, testing with
+ * PR_MCE_KILL_DEFAULT covers all cases.
+ */
+ do_test_memory_failure(fd, total_size, MF_INJECT_MADVISE, PR_MCE_KILL_DEFAULT, true, true, true, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_MADVISE, PR_MCE_KILL_DEFAULT, true, false, false, 0);
+}
+
static void test_fault_private(int fd, size_t total_size)
{
test_fault_sigbus(fd, 0, total_size);
@@ -273,6 +440,7 @@ static void __test_guest_memfd(struct kvm_vm *vm, uint64_t flags)
if (flags & GUEST_MEMFD_FLAG_INIT_SHARED) {
gmem_test(mmap_supported, vm, flags);
gmem_test(fault_overflow, vm, flags);
+ gmem_test(memory_failure, vm, flags);
} else {
gmem_test(fault_private, vm, flags);
}
--
2.51.0.788.g6d19910ace-goog
* [RFC PATCH RESEND 3/3] KVM: selftests: Test guest_memfd behavior with respect to stage 2 page tables
2025-10-15 18:35 ` [RFC PATCH 3/3] KVM: selftests: Test guest_memfd behavior with respect to stage 2 page tables Lisa Wang
@ 2025-10-15 18:58 ` Lisa Wang
0 siblings, 0 replies; 11+ messages in thread
From: Lisa Wang @ 2025-10-15 18:58 UTC (permalink / raw)
To: linmiaohe, nao.horiguchi, akpm, pbonzini, shuah, linux-mm,
linux-kernel, kvm, linux-kselftest
Cc: david, rientjes, seanjc, ackerleytng, vannapurve, michael.roth,
jiaqiyan, tabba, dave.hansen, Lisa Wang
Test that
+ memory failure handling results in unmapping of bad memory from stage
2 page tables, hence requiring faulting on next guest access
+ when the guest tries to fault a poisoned page from guest_memfd, the
userspace VMM is informed with EHWPOISON
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Lisa Wang <wyihan@google.com>
---
.../testing/selftests/kvm/guest_memfd_test.c | 65 +++++++++++++++++++
1 file changed, 65 insertions(+)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 7bcf8d2d5d4d..dc3398e22edd 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -539,6 +539,70 @@ static void test_guest_memfd_guest(void)
kvm_vm_free(vm);
}
+static void __guest_code_read(uint8_t *mem)
+{
+ READ_ONCE(*mem);
+ GUEST_DONE();
+}
+
+static void guest_read(struct kvm_vcpu *vcpu, uint64_t gpa, int expected_errno)
+{
+ vcpu_arch_set_entry_point(vcpu, __guest_code_read);
+ vcpu_args_set(vcpu, 1, gpa);
+
+ if (expected_errno) {
+ TEST_ASSERT_EQ(_vcpu_run(vcpu), -1);
+ TEST_ASSERT_EQ(errno, expected_errno);
+ } else {
+ vcpu_run(vcpu);
+ TEST_ASSERT_EQ(get_ucall(vcpu, NULL), UCALL_DONE);
+ }
+}
+
+static void test_memory_failure_guest(void)
+{
+ const uint64_t gpa = SZ_4G;
+ const int slot = 1;
+
+ unsigned long memory_failure_pfn;
+ struct kvm_vcpu *vcpu;
+ struct kvm_vm *vm;
+ uint8_t *mem;
+ size_t size;
+ int fd;
+
+ if (!kvm_has_cap(KVM_CAP_GUEST_MEMFD_FLAGS))
+ return;
+
+ vm = __vm_create_shape_with_one_vcpu(VM_SHAPE_DEFAULT, &vcpu, 1, __guest_code_read);
+
+ size = vm->page_size;
+ fd = vm_create_guest_memfd(vm, size, GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED);
+ vm_set_user_memory_region2(vm, slot, KVM_MEM_GUEST_MEMFD, gpa, size, NULL, fd, 0);
+
+ mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT(mem != MAP_FAILED, "mmap() for guest_memfd should succeed.");
+ virt_pg_map(vm, gpa, gpa);
+
+ /* Fault in page to read pfn, then unmap page for testing. */
+ READ_ONCE(*mem);
+ memory_failure_pfn = addr_to_pfn(mem);
+ munmap(mem, size);
+
+ /* Fault page into stage2 page tables. */
+ guest_read(vcpu, gpa, 0);
+
+ mark_memory_failure(memory_failure_pfn, 0);
+
+ guest_read(vcpu, gpa, EHWPOISON);
+
+ close(fd);
+ kvm_vm_free(vm);
+
+ unmark_memory_failure(memory_failure_pfn, 0);
+}
+
int main(int argc, char *argv[])
{
unsigned long vm_types, vm_type;
@@ -559,4 +623,5 @@ int main(int argc, char *argv[])
test_guest_memfd(vm_type);
test_guest_memfd_guest();
+ test_memory_failure_guest();
}
--
2.51.0.788.g6d19910ace-goog
* Re: [RFC PATCH RESEND 1/3] mm: memory_failure: Fix MF_DELAYED handling on truncation during failure
2025-10-15 18:35 ` [RFC PATCH 1/3] mm: memory_failure: Fix MF_DELAYED handling on truncation during failure Lisa Wang
2025-10-15 18:58 ` [RFC PATCH RESEND " Lisa Wang
@ 2025-10-16 20:18 ` David Hildenbrand
2025-10-17 17:30 ` Lisa Wang
2025-10-20 12:37 ` David Hildenbrand
2 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2025-10-16 20:18 UTC (permalink / raw)
To: Lisa Wang, linmiaohe, nao.horiguchi, akpm, pbonzini, shuah,
linux-mm, linux-kernel, kvm, linux-kselftest
Cc: rientjes, seanjc, ackerleytng, vannapurve, michael.roth,
jiaqiyan, tabba, dave.hansen
On 15.10.25 20:58, Lisa Wang wrote:
> The .error_remove_folio a_ops is used by different filesystems to handle
> folio truncation upon discovery of a memory failure in the memory
> associated with the given folio.
>
> Currently, MF_DELAYED is treated as an error, causing "Failed to punch
> page" to be written to the console. MF_DELAYED is then relayed to the
> caller of truncate_error_folio() as MF_FAILED. This further causes
> memory_failure() to return -EBUSY, which then always causes a SIGBUS.
>
> This also implies that regardless of whether the thread's memory
> corruption kill policy is PR_MCE_KILL_EARLY or PR_MCE_KILL_LATE, a
> memory failure within guest_memfd memory will always cause a SIGBUS.
>
> Update truncate_error_folio() to return MF_DELAYED to the caller if the
> .error_remove_folio() callback reports MF_DELAYED.
>
> Generalize the comment: MF_DELAYED means memory failure was handled and
> some other part of memory failure will be handled later (e.g. a next
> access will result in the process being killed). Specifically for
> guest_memfd, a next access by the guest will result in an error returned
> to the userspace VMM.
>
> With delayed handling, the filemap continues to hold refcounts on the
> folio. Hence, take that into account when checking for extra refcounts
> in me_pagecache_clean(). This is aligned with the implementation in
> me_swapcache_dirty(), where, if a folio is still in the swap cache,
> extra_pins is set to true.
>
> Signed-off-by: Lisa Wang <wyihan@google.com>
> ---
> mm/memory-failure.c | 24 +++++++++++++++---------
> 1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index df6ee59527dd..77f665c16a73 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -922,9 +922,11 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
> * by the m-f() handler immediately.
> *
> * MF_DELAYED - The m-f() handler marks the page as PG_hwpoisoned'ed.
> - * The page is unmapped, and is removed from the LRU or file mapping.
> - * An attempt to access the page again will trigger page fault and the
> - * PF handler will kill the process.
> + * It means memory_failure was handled (e.g. removed from file mapping or the
> + * LRU) and some other part of memory failure will be handled later (e.g. a
> + * next access will result in the process being killed). Specifically for
> + * guest_memfd, a next access by the guest will result in an error returned to
> + * the userspace VMM.
> *
> * MF_RECOVERED - The m-f() handler marks the page as PG_hwpoisoned'ed.
> * The page has been completely isolated, that is, unmapped, taken out of
> @@ -999,6 +1001,9 @@ static int truncate_error_folio(struct folio *folio, unsigned long pfn,
> if (mapping->a_ops->error_remove_folio) {
> int err = mapping->a_ops->error_remove_folio(mapping, folio);
>
> + if (err == MF_DELAYED)
> + return err;
> +
> if (err != 0)
> pr_info("%#lx: Failed to punch page: %d\n", pfn, err);
> else if (!filemap_release_folio(folio, GFP_NOIO))
> @@ -1108,18 +1113,19 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
> goto out;
> }
>
> - /*
> - * The shmem page is kept in page cache instead of truncating
> - * so is expected to have an extra refcount after error-handling.
> - */
> - extra_pins = shmem_mapping(mapping);
> -
> /*
> * Truncation is a bit tricky. Enable it per file system for now.
> *
> * Open: to take i_rwsem or not for this? Right now we don't.
> */
> ret = truncate_error_folio(folio, page_to_pfn(p), mapping);
> +
> + /*
> + * The shmem page, or any page with MF_DELAYED error handling, is kept in
> + * page cache instead of truncating, so is expected to have an extra
> + * refcount after error-handling.
> + */
> + extra_pins = shmem_mapping(mapping) || ret == MF_DELAYED;
Well, to do it cleanly shouldn't we let shmem_error_remove_folio() also
return MF_DELAYED and remove this shmem special case?
Or is there a good reason shmem_mapping() wants to return 0 -- and maybe
guest_memfd would also want to do that?
Just reading the code here the inconsistency is unclear.
--
Cheers
David / dhildenb
* Re: [RFC PATCH RESEND 1/3] mm: memory_failure: Fix MF_DELAYED handling on truncation during failure
2025-10-16 20:18 ` David Hildenbrand
@ 2025-10-17 17:30 ` Lisa Wang
0 siblings, 0 replies; 11+ messages in thread
From: Lisa Wang @ 2025-10-17 17:30 UTC (permalink / raw)
To: David Hildenbrand
Cc: linmiaohe, nao.horiguchi, akpm, pbonzini, shuah, linux-mm,
linux-kernel, kvm, linux-kselftest, rientjes, seanjc,
ackerleytng, vannapurve, michael.roth, jiaqiyan, tabba,
dave.hansen
On Thu, Oct 16, 2025 at 10:18:17PM +0200, David Hildenbrand wrote:
> On 15.10.25 20:58, Lisa Wang wrote:
> > The .error_remove_folio a_ops is used by different filesystems to handle
> > folio truncation upon discovery of a memory failure in the memory
> > associated with the given folio.
> > [...snip...]
> > +
> > + /*
> > + * The shmem page, or any page with MF_DELAYED error handling, is kept in
> > + * page cache instead of truncating, so is expected to have an extra
> > + * refcount after error-handling.
> > + */
> > + extra_pins = shmem_mapping(mapping) || ret == MF_DELAYED;
Hello David,
Thank you for reviewing these patches!
> Well, to do it cleanly shouldn't we let shmem_error_remove_folio() also
> return MF_DELAYED and remove this shmem special case?
I agree shmem_error_remove_folio() should probably also return MF_DELAYED.
MF_DELAYED sounds right because shmem does not truncate, and hence it
should not call filemap_release_folio() to release fs-specific metadata on
a folio.
There's no bug now in memory failure handling for shmem calling
filemap_release_folio(), because
shmem does not have folio->private
=> filemap_release_folio() is a no-op anyway
=> filemap_release_folio() returns true
=> truncate_error_folio() returns MF_RECOVERED
=> truncate_error_folio()'s caller cleans MF_RECOVERED up to eventually
return 0.
> Or is there a good reason shmem_mapping() wants to return 0 -- and maybe
> guest_memfd would also wan to do that?
The tradeoff is if I change shmem_error_remove_folio()'s return, mf_stats
will be changed. I'd be happy to update shmem_error_remove_folio() to
return MF_DELAYED as well, but is it okay that the userspace-visible
behavior in the form of statistics will change?
> Just reading the code here the inconsistency is unclear.
Another option is to add kvm_gmem_mapping() like shmem_mapping(). I did not
do it because KVM is a module, so we'd need extra steps to check whether KVM is
loaded in memory, and that's a little more complicated. Also,
kvm_gmem_error_folio() already returns MF_DELAYED, which seems to be the
right thing to return.
>
> --
> Cheers
>
> David / dhildenb
Lisa
* Re: [RFC PATCH RESEND 1/3] mm: memory_failure: Fix MF_DELAYED handling on truncation during failure
2025-10-15 18:35 ` [RFC PATCH 1/3] mm: memory_failure: Fix MF_DELAYED handling on truncation during failure Lisa Wang
2025-10-15 18:58 ` [RFC PATCH RESEND " Lisa Wang
2025-10-16 20:18 ` David Hildenbrand
@ 2025-10-20 12:37 ` David Hildenbrand
2 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2025-10-20 12:37 UTC (permalink / raw)
To: Lisa Wang, linmiaohe, nao.horiguchi, akpm, pbonzini, shuah,
linux-mm, linux-kernel, kvm, linux-kselftest
Cc: rientjes, seanjc, ackerleytng, vannapurve, michael.roth,
jiaqiyan, tabba, dave.hansen
Lisa accidentally dropped all Tos/CCs, so here is her mail with my reply forwarded:
-------- Forwarded Message --------
Subject: Re: [RFC PATCH RESEND 1/3] mm: memory_failure: Fix MF_DELAYED handling on truncation during failure
Date: Mon, 20 Oct 2025 14:35:45 +0200
From: David Hildenbrand <david@redhat.com>
To: Lisa Wang <wyihan@google.com>
On 17.10.25 17:31, Lisa Wang wrote:
> On Thu, Oct 16, 2025 at 10:18:17PM +0200, David Hildenbrand wrote:
>> On 15.10.25 20:58, Lisa Wang wrote:
>>> The .error_remove_folio a_ops is used by different filesystems to handle
>>> folio truncation upon discovery of a memory failure in the memory
>>> associated with the given folio.
>>>
>>> Currently, MF_DELAYED is treated as an error, causing "Failed to punch
>>> page" to be written to the console. MF_DELAYED is then relayed to the
>>> caller of truncate_error_folio() as MF_FAILED. This further causes
>>> memory_failure() to return -EBUSY, which then always causes a SIGBUS.
>>>
>>> This also implies that regardless of whether the thread's memory
>>> corruption kill policy is PR_MCE_KILL_EARLY or PR_MCE_KILL_LATE, a
>>> memory failure within guest_memfd memory will always cause a SIGBUS.
>>>
>>> Update truncate_error_folio() to return MF_DELAYED to the caller if the
>>> .error_remove_folio() callback reports MF_DELAYED.
>>>
>>> Generalize the comment: MF_DELAYED means memory failure was handled and
>>> some other part of memory failure will be handled later (e.g. a next
>>> access will result in the process being killed). Specifically for
>>> guest_memfd, a next access by the guest will result in an error returned
>>> to the userspace VMM.
>>>
>>> With delayed handling, the filemap continues to hold refcounts on the
>>> folio. Hence, take that into account when checking for extra refcounts
>>> in me_pagecache_clean(). This is aligned with the implementation in
>>> me_swapcache_dirty(), where, if a folio is still in the swap cache,
>>> extra_pins is set to true.
>>>
>>> Signed-off-by: Lisa Wang <wyihan@google.com>
>>> ---
>>> mm/memory-failure.c | 24 +++++++++++++++---------
>>> 1 file changed, 15 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index df6ee59527dd..77f665c16a73 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -922,9 +922,11 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
>>> * by the m-f() handler immediately.
>>> *
>>> * MF_DELAYED - The m-f() handler marks the page as PG_hwpoisoned'ed.
>>> - * The page is unmapped, and is removed from the LRU or file mapping.
>>> - * An attempt to access the page again will trigger page fault and the
>>> - * PF handler will kill the process.
>>> + * It means memory_failure was handled (e.g. removed from file mapping or the
>>> + * LRU) and some other part of memory failure will be handled later (e.g. a
>>> + * next access will result in the process being killed). Specifically for
>>> + * guest_memfd, a next access by the guest will result in an error returned to
>>> + * the userspace VMM.
>>> *
>>> * MF_RECOVERED - The m-f() handler marks the page as PG_hwpoisoned'ed.
>>> * The page has been completely isolated, that is, unmapped, taken out of
>>> @@ -999,6 +1001,9 @@ static int truncate_error_folio(struct folio *folio, unsigned long pfn,
>>> if (mapping->a_ops->error_remove_folio) {
>>> int err = mapping->a_ops->error_remove_folio(mapping, folio);
>>> + if (err == MF_DELAYED)
>>> + return err;
>>> +
>>> if (err != 0)
>>> pr_info("%#lx: Failed to punch page: %d\n", pfn, err);
>>> else if (!filemap_release_folio(folio, GFP_NOIO))
>>> @@ -1108,18 +1113,19 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
>>> goto out;
>>> }
>>> - /*
>>> - * The shmem page is kept in page cache instead of truncating
>>> - * so is expected to have an extra refcount after error-handling.
>>> - */
>>> - extra_pins = shmem_mapping(mapping);
>>> -
>>> /*
>>> * Truncation is a bit tricky. Enable it per file system for now.
>>> *
>>> * Open: to take i_rwsem or not for this? Right now we don't.
>>> */
>>> ret = truncate_error_folio(folio, page_to_pfn(p), mapping);
>>> +
>>> + /*
>>> + * The shmem page, or any page with MF_DELAYED error handling, is kept in
>>> + * page cache instead of truncating, so is expected to have an extra
>>> + * refcount after error-handling.
>>> + */
>>> + extra_pins = shmem_mapping(mapping) || ret == MF_DELAYED;
>
> Hello David,
>
> Thank you for reviewing these patches!
>
>> Well, to do it cleanly shouldn't we let shmem_error_remove_folio() also
>> return MF_DELAYED and remove this shmem special case?
>>
>
> I agree shmem_error_remove_folio() should probably also return MF_DELAYED.
> MF_DELAYED sounds right because shmem does not truncate, and hence it
> should not call filemap_release_folio() to release fs-specific metadata on
> a folio.
Just to clarify for others, this is the code we are talking about in truncate_error_folio():
if (mapping->a_ops->error_remove_folio) {
int err = mapping->a_ops->error_remove_folio(mapping, folio);
if (err != 0)
pr_info("%#lx: Failed to punch page: %d\n", pfn, err);
else if (!filemap_release_folio(folio, GFP_NOIO))
pr_info("%#lx: failed to release buffers\n", pfn);
else
ret = MF_RECOVERED;
} ...
>
> There's no bug now in memory failure handling for shmem calling
> filemap_release_folio(), because
Right, because shmem error_remove_folio() will currently return 0.
>
> shmem does not have folio->private
> => filemap_release_folio() is a no-op anyway
> => filemap_release_folio() returns true
> => truncate_error_folio() returns MF_RECOVERED
Agreed.
> => truncate_error_folio()'s caller cleans MF_RECOVERED up to eventually
> return 0.
Yes.
>
>> Or is there a good reason shmem_mapping() wants to return 0 -- and maybe
>> guest_memfd would also want to do that?
>>
>
> The tradeoff is if I change shmem_error_remove_folio()'s return, mf_stats
> will be changed.
But it actually sounds like the right thing to do, no? Who cares about
the stats being delayed vs. recovered?
> I'd be happy to update shmem_error_remove_folio() to
> return MF_DELAYED as well, but is it okay that the userspace-visible
> behavior in the form of statistics will change?
They never really were "recovered", but always "delayed", correct?
In that case, it almost sounds like a fix.
>
>> Just reading the code here the inconsistency is unclear.
>
> Another option is to add kvm_gmem_mapping() like shmem_mapping().
Oh no.
As an alternative we could introduce a new MF_* to handle this case.
But it almost sounds like MF_DELAYED does exactly what we want, so I
would suggest trying that first to see if there is any pushback/good
reason to let shmem result in "recovered" when it's really "delayed".
BTW, the behavior from truncate (recovered) -> keep (delayed) was added in
commit a7605426666196c5a460dd3de6f8dac1d3c21f00
Author: Yang Shi <shy828301@gmail.com>
Date: Fri Jan 14 14:05:19 2022 -0800
mm: shmem: don't truncate page if memory failure happens
The current behavior of memory failure is to truncate the page cache
regardless of dirty or clean. If the page is dirty the later access
will get the obsolete data from disk without any notification to the
users. This may cause silent data loss. It is even worse for shmem
since shmem is in-memory filesystem, truncating page cache means
discarding data blocks. The later read would return all zero.
The right approach is to keep the corrupted page in page cache, any
later access would return error for syscalls or SIGBUS for page fault,
until the file is truncated, hole punched or removed. The regular
storage backed filesystems would be more complicated so this patch is
focused on shmem. This also unblock the support for soft offlining
shmem THP.
--
Cheers
David / dhildenb