* [PATCH RFC v3 1/7] mm: memory_failure: Clarify the MF_DELAYED definition
2026-04-08 17:24 [PATCH RFC v3 0/7] mm: Fix MF_DELAYED handling on memory failure Lisa Wang
@ 2026-04-08 17:24 ` Lisa Wang
2026-04-08 17:24 ` [PATCH RFC v3 2/7] mm: memory_failure: Allow truncate_error_folio to return MF_DELAYED Lisa Wang
` (5 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: Lisa Wang @ 2026-04-08 17:24 UTC (permalink / raw)
To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Paolo Bonzini,
Shuah Khan, Hugh Dickins, Baolin Wang, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, kvm,
linux-kselftest
Cc: rientjes, seanjc, ackerleytng, vannapurve, michael.roth,
jiaqiyan, tabba, dave.hansen, Lisa Wang
This patch clarifies the definition of MF_DELAYED to represent cases
where a folio's removal is initiated but not immediately completed
(e.g., due to remaining metadata references).
Signed-off-by: Lisa Wang <wyihan@google.com>
---
mm/memory-failure.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ee42d4361309..2e53b3024391 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -849,24 +849,25 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
}
/*
- * MF_IGNORED - The m-f() handler marks the page as PG_hwpoisoned'ed.
+ * MF_IGNORED - The m-f() handler marks the page as PG_hwpoison'ed.
* But it could not do more to isolate the page from being accessed again,
* nor does it kill the process. This is extremely rare and one of the
* potential causes is that the page state has been changed due to
* underlying race condition. This is the most severe outcomes.
*
- * MF_FAILED - The m-f() handler marks the page as PG_hwpoisoned'ed.
+ * MF_FAILED - The m-f() handler marks the page as PG_hwpoison'ed.
* It should have killed the process, but it can't isolate the page,
* due to conditions such as extra pin, unmap failure, etc. Accessing
* the page again may trigger another MCE and the process will be killed
* by the m-f() handler immediately.
*
- * MF_DELAYED - The m-f() handler marks the page as PG_hwpoisoned'ed.
- * The page is unmapped, and is removed from the LRU or file mapping.
- * An attempt to access the page again will trigger page fault and the
- * PF handler will kill the process.
+ * MF_DELAYED - The m-f() handler marks the page as PG_hwpoison'ed.
+ * It means the page was unmapped and partially isolated (e.g. removed from
+ * file mapping or the LRU) but full cleanup is deferred (e.g. the metadata
+ * for the memory, as in struct page/folio, is still referenced). Any
+ * further access to the page will result in the process being killed.
*
- * MF_RECOVERED - The m-f() handler marks the page as PG_hwpoisoned'ed.
+ * MF_RECOVERED - The m-f() handler marks the page as PG_hwpoison'ed.
* The page has been completely isolated, that is, unmapped, taken out of
* the buddy system, or hole-punched out of the file mapping.
*/
--
2.53.0.1213.gd9a14994de-goog
^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH RFC v3 2/7] mm: memory_failure: Allow truncate_error_folio to return MF_DELAYED
2026-04-08 17:24 [PATCH RFC v3 0/7] mm: Fix MF_DELAYED handling on memory failure Lisa Wang
2026-04-08 17:24 ` [PATCH RFC v3 1/7] mm: memory_failure: Clarify the MF_DELAYED definition Lisa Wang
@ 2026-04-08 17:24 ` Lisa Wang
2026-04-08 17:24 ` [PATCH RFC v3 3/7] mm: shmem: Update shmem handler to the MF_DELAYED definition Lisa Wang
` (4 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: Lisa Wang @ 2026-04-08 17:24 UTC (permalink / raw)
To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Paolo Bonzini,
Shuah Khan, Hugh Dickins, Baolin Wang, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, kvm,
linux-kselftest
Cc: rientjes, seanjc, ackerleytng, vannapurve, michael.roth,
jiaqiyan, tabba, dave.hansen, Lisa Wang
The .error_remove_folio a_ops is used by different filesystems to handle
folio truncation upon discovery of a memory failure in the memory
associated with the given folio.
Currently, MF_DELAYED is treated as an error, causing "Failed to punch
page" to be written to the console. MF_DELAYED is then relayed to the
caller of truncate_error_folio() as MF_FAILED. This further causes
memory_failure() to return -EBUSY, which then always causes a SIGBUS.
This also implies that, regardless of whether the thread's memory
corruption kill policy is PR_MCE_KILL_EARLY or PR_MCE_KILL_LATE, a
memory failure with MF_DELAYED will always cause a SIGBUS.
Update truncate_error_folio() to return MF_DELAYED to the caller if the
.error_remove_folio() callback reports MF_DELAYED.
Signed-off-by: Lisa Wang <wyihan@google.com>
---
mm/memory-failure.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 2e53b3024391..fd9ed2cd761d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -941,7 +941,9 @@ static int truncate_error_folio(struct folio *folio, unsigned long pfn,
if (mapping->a_ops->error_remove_folio) {
int err = mapping->a_ops->error_remove_folio(mapping, folio);
- if (err != 0)
+ if (err == MF_DELAYED)
+ ret = err;
+ else if (err != 0)
pr_info("%#lx: Failed to punch page: %d\n", pfn, err);
else if (!filemap_release_folio(folio, GFP_NOIO))
pr_info("%#lx: failed to release buffers\n", pfn);
--
2.53.0.1213.gd9a14994de-goog
^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH RFC v3 3/7] mm: shmem: Update shmem handler to the MF_DELAYED definition
2026-04-08 17:24 [PATCH RFC v3 0/7] mm: Fix MF_DELAYED handling on memory failure Lisa Wang
2026-04-08 17:24 ` [PATCH RFC v3 1/7] mm: memory_failure: Clarify the MF_DELAYED definition Lisa Wang
2026-04-08 17:24 ` [PATCH RFC v3 2/7] mm: memory_failure: Allow truncate_error_folio to return MF_DELAYED Lisa Wang
@ 2026-04-08 17:24 ` Lisa Wang
2026-04-08 17:24 ` [PATCH RFC v3 4/7] mm: memory_failure: Generalize extra_pins handling to all MF_DELAYED cases Lisa Wang
` (3 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: Lisa Wang @ 2026-04-08 17:24 UTC (permalink / raw)
To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Paolo Bonzini,
Shuah Khan, Hugh Dickins, Baolin Wang, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, kvm,
linux-kselftest
Cc: rientjes, seanjc, ackerleytng, vannapurve, michael.roth,
jiaqiyan, tabba, dave.hansen, Lisa Wang
To align with the definition of MF_DELAYED, update
shmem_error_remove_folio() to return MF_DELAYED.
shmem handles memory failures but defers the actual file truncation. The
function's return value should therefore be MF_DELAYED to accurately
reflect the state.
Currently, this logical error does not cause a bug, because:
- For shmem folios, folio->private is not set.
- As a result, filemap_release_folio() is a no-op and returns true.
- This, in turn, causes truncate_error_folio() to incorrectly return
MF_RECOVERED.
- The caller then treats MF_RECOVERED as a success condition, masking the
issue.
The previous patch relays MF_DELAYED to the caller of
truncate_error_folio() before any logging, so returning MF_DELAYED from
shmem_error_remove_folio() will retain the original behavior of not
adding any logs.
The return value of truncate_error_folio() is consumed in action_result(),
which treats MF_DELAYED the same way as MF_RECOVERED, so action_result()
returns the same value after this change.
Signed-off-by: Lisa Wang <wyihan@google.com>
---
mm/shmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index b40f3cd48961..fd8f90540361 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -5207,7 +5207,7 @@ static void __init shmem_destroy_inodecache(void)
static int shmem_error_remove_folio(struct address_space *mapping,
struct folio *folio)
{
- return 0;
+ return MF_DELAYED;
}
static const struct address_space_operations shmem_aops = {
--
2.53.0.1213.gd9a14994de-goog
^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH RFC v3 4/7] mm: memory_failure: Generalize extra_pins handling to all MF_DELAYED cases
2026-04-08 17:24 [PATCH RFC v3 0/7] mm: Fix MF_DELAYED handling on memory failure Lisa Wang
` (2 preceding siblings ...)
2026-04-08 17:24 ` [PATCH RFC v3 3/7] mm: shmem: Update shmem handler to the MF_DELAYED definition Lisa Wang
@ 2026-04-08 17:24 ` Lisa Wang
2026-04-08 17:24 ` [PATCH RFC v3 5/7] mm: selftests: Add shmem into memory failure test Lisa Wang
` (2 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: Lisa Wang @ 2026-04-08 17:24 UTC (permalink / raw)
To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Paolo Bonzini,
Shuah Khan, Hugh Dickins, Baolin Wang, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, kvm,
linux-kselftest
Cc: rientjes, seanjc, ackerleytng, vannapurve, michael.roth,
jiaqiyan, tabba, dave.hansen, Lisa Wang
Generalize extra_pins handling to all MF_DELAYED cases, not only the
shmem_mapping() case.
If MF_DELAYED is returned, the filemap continues to hold refcounts on the
folio, so take that into account when checking for extra refcounts.

As clarified in an earlier patch, a return value of MF_DELAYED implies that
the page still has elevated refcounts; set extra_pins to true accordingly.
This is aligned with the implementation in me_swapcache_dirty(), where, if
a folio is still in the swap cache, ret is set to MF_DELAYED and extra_pins
is set to true.
Signed-off-by: Lisa Wang <wyihan@google.com>
---
mm/memory-failure.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fd9ed2cd761d..d6628ad114a5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1052,18 +1052,14 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
goto out;
}
- /*
- * The shmem page is kept in page cache instead of truncating
- * so is expected to have an extra refcount after error-handling.
- */
- extra_pins = shmem_mapping(mapping);
-
/*
* Truncation is a bit tricky. Enable it per file system for now.
*
* Open: to take i_rwsem or not for this? Right now we don't.
*/
ret = truncate_error_folio(folio, page_to_pfn(p), mapping);
+
+ extra_pins = ret == MF_DELAYED;
if (has_extra_refcount(ps, p, extra_pins))
ret = MF_FAILED;
--
2.53.0.1213.gd9a14994de-goog
^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH RFC v3 5/7] mm: selftests: Add shmem into memory failure test
2026-04-08 17:24 [PATCH RFC v3 0/7] mm: Fix MF_DELAYED handling on memory failure Lisa Wang
` (3 preceding siblings ...)
2026-04-08 17:24 ` [PATCH RFC v3 4/7] mm: memory_failure: Generalize extra_pins handling to all MF_DELAYED cases Lisa Wang
@ 2026-04-08 17:24 ` Lisa Wang
2026-04-08 17:24 ` [PATCH RFC v3 6/7] KVM: selftests: Add memory failure tests in guest_memfd_test Lisa Wang
2026-04-08 17:24 ` [PATCH RFC v3 7/7] KVM: selftests: Test guest_memfd behavior with respect to stage 2 page tables Lisa Wang
6 siblings, 0 replies; 8+ messages in thread
From: Lisa Wang @ 2026-04-08 17:24 UTC (permalink / raw)
To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Paolo Bonzini,
Shuah Khan, Hugh Dickins, Baolin Wang, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, kvm,
linux-kselftest
Cc: rientjes, seanjc, ackerleytng, vannapurve, michael.roth,
jiaqiyan, tabba, dave.hansen, Lisa Wang
Add a shmem memory failure selftest to verify that shmem memory failure
handling remains correct after modifying the shmem return value.

Test that
+ the madvise() call returns 0 when the poisoned shmem page is clean
+ a SIGBUS is triggered when the poisoned shmem page is dirty
+ another SIGBUS is triggered when the poisoned shmem page is faulted in again.
Signed-off-by: Lisa Wang <wyihan@google.com>
---
tools/testing/selftests/mm/memory-failure.c | 109 +++++++++++++++++++++++++++-
1 file changed, 106 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/mm/memory-failure.c b/tools/testing/selftests/mm/memory-failure.c
index 3d9e0b9ffb41..eb3f8d98f6c9 100644
--- a/tools/testing/selftests/mm/memory-failure.c
+++ b/tools/testing/selftests/mm/memory-failure.c
@@ -30,9 +30,14 @@ enum result_type {
MADV_HARD_ANON,
MADV_HARD_CLEAN_PAGECACHE,
MADV_HARD_DIRTY_PAGECACHE,
+ MADV_HARD_CLEAN_SHMEM,
+ MADV_HARD_DIRTY_SHMEM,
MADV_SOFT_ANON,
MADV_SOFT_CLEAN_PAGECACHE,
MADV_SOFT_DIRTY_PAGECACHE,
+ MADV_SOFT_CLEAN_SHMEM,
+ MADV_SOFT_DIRTY_SHMEM,
+ READ_ERROR,
};
static jmp_buf signal_jmp_buf;
@@ -165,17 +170,21 @@ static void check(struct __test_metadata *_metadata, FIXTURE_DATA(memory_failure
case MADV_HARD_CLEAN_PAGECACHE:
case MADV_SOFT_CLEAN_PAGECACHE:
case MADV_SOFT_DIRTY_PAGECACHE:
- /* It is not expected to receive a SIGBUS signal. */
- ASSERT_EQ(setjmp, 0);
-
+ case MADV_SOFT_DIRTY_SHMEM:
/* The page content should remain unchanged. */
ASSERT_TRUE(check_memory(vaddr, self->page_size));
+ case MADV_HARD_CLEAN_SHMEM:
+ case MADV_SOFT_CLEAN_SHMEM:
+ /* It is not expected to receive a SIGBUS signal. */
+ ASSERT_EQ(setjmp, 0);
/* The backing pfn of addr should have changed. */
ASSERT_NE(pagemap_get_pfn(self->pagemap_fd, vaddr), self->pfn);
break;
case MADV_HARD_ANON:
case MADV_HARD_DIRTY_PAGECACHE:
+ case MADV_HARD_DIRTY_SHMEM:
+ case READ_ERROR:
/* The SIGBUS signal should have been received. */
ASSERT_EQ(setjmp, 1);
@@ -260,6 +269,20 @@ static int prepare_file(const char *fname, unsigned long size)
return fd;
}
+static int prepare_shmem(const char *fname, unsigned long size)
+{
+ int fd;
+
+ fd = memfd_create(fname, 0);
+ if (fd < 0)
+ return -1;
+ if (ftruncate(fd, size) < 0) {
+ close(fd);
+ return -1;
+ }
+ return fd;
+}
+
/* Borrowed from mm/gup_longterm.c. */
static int get_fs_type(int fd)
{
@@ -356,4 +379,84 @@ TEST_F(memory_failure, dirty_pagecache)
ASSERT_EQ(close(fd), 0);
}
+TEST_F(memory_failure, dirty_shmem)
+{
+ int fd;
+ char *addr;
+ int ret;
+
+ fd = prepare_shmem("shmem-file", self->page_size);
+ if (fd < 0)
+ SKIP(return, "failed to open test shmem-file.\n");
+
+ addr = mmap(0, self->page_size, PROT_READ | PROT_WRITE,
+ MAP_SHARED, fd, 0);
+ if (addr == MAP_FAILED) {
+ close(fd);
+ SKIP(return, "mmap failed, not enough memory.\n");
+ }
+ memset(addr, 0xce, self->page_size);
+
+ prepare(_metadata, self, addr);
+
+ ret = sigsetjmp(signal_jmp_buf, 1);
+ if (ret == 0)
+ ASSERT_EQ(variant->inject(self, addr), 0);
+
+ if (variant->type == MADV_HARD) {
+ check(_metadata, self, addr, MADV_HARD_DIRTY_SHMEM, ret);
+ ret = sigsetjmp(signal_jmp_buf, 1);
+ if (ret == 0)
+ FORCE_READ(*addr);
+ check(_metadata, self, addr, READ_ERROR, ret);
+ } else {
+ check(_metadata, self, addr, MADV_SOFT_DIRTY_SHMEM, ret);
+ }
+
+ ASSERT_EQ(munmap(addr, self->page_size), 0);
+
+ ASSERT_EQ(close(fd), 0);
+ cleanup(_metadata, self, addr);
+}
+
+TEST_F(memory_failure, clean_shmem)
+{
+ int fd;
+ char *addr;
+ int ret;
+
+ fd = prepare_shmem("shmem-file", self->page_size);
+ if (fd < 0)
+ SKIP(return, "failed to open test shmem-file.\n");
+
+ addr = mmap(0, self->page_size, PROT_READ | PROT_WRITE,
+ MAP_SHARED, fd, 0);
+ if (addr == MAP_FAILED) {
+ close(fd);
+ SKIP(return, "mmap failed, not enough memory.\n");
+ }
+ FORCE_READ(*addr);
+
+ prepare(_metadata, self, addr);
+
+ ret = sigsetjmp(signal_jmp_buf, 1);
+ if (ret == 0)
+ ASSERT_EQ(variant->inject(self, addr), 0);
+
+ if (variant->type == MADV_HARD) {
+ check(_metadata, self, addr, MADV_HARD_CLEAN_SHMEM, ret);
+ ret = sigsetjmp(signal_jmp_buf, 1);
+ if (ret == 0)
+ FORCE_READ(*addr);
+ check(_metadata, self, addr, READ_ERROR, ret);
+ } else {
+ check(_metadata, self, addr, MADV_SOFT_CLEAN_SHMEM, ret);
+ }
+
+ ASSERT_EQ(munmap(addr, self->page_size), 0);
+
+ ASSERT_EQ(close(fd), 0);
+ cleanup(_metadata, self, addr);
+}
+
TEST_HARNESS_MAIN
--
2.53.0.1213.gd9a14994de-goog
^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH RFC v3 6/7] KVM: selftests: Add memory failure tests in guest_memfd_test
2026-04-08 17:24 [PATCH RFC v3 0/7] mm: Fix MF_DELAYED handling on memory failure Lisa Wang
` (4 preceding siblings ...)
2026-04-08 17:24 ` [PATCH RFC v3 5/7] mm: selftests: Add shmem into memory failure test Lisa Wang
@ 2026-04-08 17:24 ` Lisa Wang
2026-04-08 17:24 ` [PATCH RFC v3 7/7] KVM: selftests: Test guest_memfd behavior with respect to stage 2 page tables Lisa Wang
6 siblings, 0 replies; 8+ messages in thread
From: Lisa Wang @ 2026-04-08 17:24 UTC (permalink / raw)
To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Paolo Bonzini,
Shuah Khan, Hugh Dickins, Baolin Wang, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, kvm,
linux-kselftest
Cc: rientjes, seanjc, ackerleytng, vannapurve, michael.roth,
jiaqiyan, tabba, dave.hansen, Lisa Wang
After modifying truncate_error_folio(), we expect memory_failure() to
return 0 instead of -EBUSY (the MF_FAILED result). We also want to make
sure that memory_failure()'s signaling behavior is unchanged.

Test that memory_failure() returns 0 for guest_memfd, whose
.error_remove_folio() callback does not actually truncate and returns
MF_DELAYED.
In addition, test that SIGBUS signaling behavior is not changed before
and after this modification.
There are two kinds of guest memory failure injection: madvise and
debugfs. When memory failure is injected using madvise, the
MF_ACTION_REQUIRED flag is set; if the page is mapped and dirty, the
process should get a SIGBUS. When memory failure is injected using
debugfs and the PR_MCE_KILL_EARLY machine check memory corruption kill
policy is set, the process should likewise get a SIGBUS if the page is
mapped and dirty.
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Lisa Wang <wyihan@google.com>
---
tools/testing/selftests/kvm/guest_memfd_test.c | 168 +++++++++++++++++++++++++
1 file changed, 168 insertions(+)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 618c937f3c90..0ea4e7d7e6d5 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -10,6 +10,8 @@
#include <errno.h>
#include <stdio.h>
#include <fcntl.h>
+#include <linux/prctl.h>
+#include <sys/prctl.h>
#include <linux/bitmap.h>
#include <linux/falloc.h>
@@ -193,6 +195,171 @@ static void test_fault_overflow(int fd, size_t total_size)
test_fault_sigbus(fd, total_size, total_size * 4);
}
+static unsigned long addr_to_pfn(void *addr)
+{
+ const uint64_t pagemap_pfn_mask = BIT(54) - 1;
+ const uint64_t pagemap_page_present = BIT(63);
+ uint64_t page_info;
+ ssize_t n_bytes;
+ int pagemap_fd;
+
+ pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+ TEST_ASSERT(pagemap_fd >= 0, "Opening pagemap should succeed.");
+
+ n_bytes = pread(pagemap_fd, &page_info, 8, (uint64_t)addr / page_size * 8);
+ TEST_ASSERT(n_bytes == 8, "pread of pagemap failed. n_bytes=%ld", n_bytes);
+
+ close(pagemap_fd);
+
+ TEST_ASSERT(page_info & pagemap_page_present, "The page for addr should be present");
+ return page_info & pagemap_pfn_mask;
+}
+
+static void write_memory_failure(unsigned long pfn, bool mark, int return_code)
+{
+ char path[PATH_MAX];
+ char *filename;
+ char buf[20];
+ int ret;
+ int len;
+ int fd;
+
+ filename = mark ? "corrupt-pfn" : "unpoison-pfn";
+ snprintf(path, PATH_MAX, "/sys/kernel/debug/hwpoison/%s", filename);
+
+ fd = open(path, O_WRONLY);
+ TEST_ASSERT(fd >= 0, "Failed to open %s.", path);
+
+ len = snprintf(buf, sizeof(buf), "0x%lx\n", pfn);
+ if (len < 0 || (unsigned int)len >= sizeof(buf))
+ TEST_ASSERT(0, "snprintf failed or truncated.");
+
+ ret = write(fd, buf, len);
+ if (return_code == 0) {
+ /*
+ * If the memory_failure() returns 0, write() should be successful,
+ * which returns how many bytes it writes.
+ */
+ TEST_ASSERT(ret > 0, "Writing memory failure (path: %s) failed: %s", path,
+ strerror(errno));
+ } else {
+ TEST_ASSERT_EQ(ret, -1);
+ /* errno is memory_failure() return code. */
+ TEST_ASSERT_EQ(errno, return_code);
+ }
+
+ close(fd);
+}
+
+static void mark_memory_failure(unsigned long pfn, int return_code)
+{
+ write_memory_failure(pfn, true, return_code);
+}
+
+static void unmark_memory_failure(unsigned long pfn, int return_code)
+{
+ write_memory_failure(pfn, false, return_code);
+}
+
+enum memory_failure_injection_method {
+ MF_INJECT_DEBUGFS,
+ MF_INJECT_MADVISE,
+};
+
+static void do_test_memory_failure(int fd, size_t total_size,
+ enum memory_failure_injection_method method, int kill_config,
+ bool map_page, bool dirty_page, bool sigbus_expected,
+ int return_code)
+{
+ unsigned long memory_failure_pfn;
+ char *memory_failure_addr;
+ char *mem;
+ int ret;
+
+ mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT(mem != MAP_FAILED, "mmap() for guest_memfd should succeed.");
+ memory_failure_addr = mem + page_size;
+ if (dirty_page)
+ *memory_failure_addr = 'A';
+ else
+ READ_ONCE(*memory_failure_addr);
+
+ /* Fault in page to read pfn, then unmap page for testing if needed. */
+ memory_failure_pfn = addr_to_pfn(memory_failure_addr);
+ if (!map_page)
+ madvise(memory_failure_addr, page_size, MADV_DONTNEED);
+
+ ret = prctl(PR_MCE_KILL, PR_MCE_KILL_SET, kill_config, 0, 0);
+ TEST_ASSERT_EQ(ret, 0);
+
+ ret = 0;
+ switch (method) {
+ case MF_INJECT_DEBUGFS: {
+ /* DEBUGFS injection handles return_code test inside the mark_memory_failure(). */
+ if (sigbus_expected)
+ TEST_EXPECT_SIGBUS(mark_memory_failure(memory_failure_pfn, return_code));
+ else
+ mark_memory_failure(memory_failure_pfn, return_code);
+ break;
+ }
+ case MF_INJECT_MADVISE: {
+ /*
+ * MADV_HWPOISON uses get_user_pages() so the page will always
+ * be faulted in at the point of memory_failure()
+ */
+ if (sigbus_expected)
+ TEST_EXPECT_SIGBUS(ret = madvise(memory_failure_addr,
+ page_size, MADV_HWPOISON));
+ else
+ ret = madvise(memory_failure_addr, page_size, MADV_HWPOISON);
+
+ if (return_code == 0)
+ TEST_ASSERT(ret == return_code, "Memory failure failed. Errno: %s",
+ strerror(errno));
+ else {
+ /* errno is memory_failure() return code. */
+ TEST_ASSERT_EQ(errno, return_code);
+ }
+ break;
+ }
+ default:
+ TEST_FAIL("Unhandled memory failure injection method %d.", method);
+ }
+
+ TEST_EXPECT_SIGBUS(READ_ONCE(*memory_failure_addr));
+ TEST_EXPECT_SIGBUS(*memory_failure_addr = 'A');
+
+ ret = munmap(mem, total_size);
+ TEST_ASSERT(!ret, "munmap() should succeed.");
+
+ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0,
+ total_size);
+ TEST_ASSERT(!ret, "Truncate the entire file (cleanup) should succeed.");
+
+ ret = prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_DEFAULT, 0, 0);
+ TEST_ASSERT_EQ(ret, 0);
+
+ unmark_memory_failure(memory_failure_pfn, 0);
+}
+
+static void test_memory_failure(int fd, size_t total_size)
+{
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_EARLY, true, true, true, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_EARLY, true, false, false, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_EARLY, false, true, false, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_LATE, true, true, false, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_LATE, true, false, false, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_DEBUGFS, PR_MCE_KILL_LATE, false, true, false, 0);
+ /*
+ * If madvise() is used to inject errors, memory_failure() handling is invoked with the
+ * MF_ACTION_REQUIRED flag set, aligned with memory failure handling for a consumed memory
+ * error, where the machine check memory corruption kill policy is ignored. Hence, testing with
+ * PR_MCE_KILL_DEFAULT covers all cases.
+ */
+ do_test_memory_failure(fd, total_size, MF_INJECT_MADVISE, PR_MCE_KILL_DEFAULT, true, true, true, 0);
+ do_test_memory_failure(fd, total_size, MF_INJECT_MADVISE, PR_MCE_KILL_DEFAULT, true, false, false, 0);
+}
+
static void test_fault_private(int fd, size_t total_size)
{
test_fault_sigbus(fd, 0, total_size);
@@ -370,6 +537,7 @@ static void __test_guest_memfd(struct kvm_vm *vm, uint64_t flags)
gmem_test(mmap_supported, vm, flags);
gmem_test(fault_overflow, vm, flags);
gmem_test(numa_allocation, vm, flags);
+ gmem_test(memory_failure, vm, flags);
} else {
gmem_test(fault_private, vm, flags);
}
--
2.53.0.1213.gd9a14994de-goog
^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH RFC v3 7/7] KVM: selftests: Test guest_memfd behavior with respect to stage 2 page tables
2026-04-08 17:24 [PATCH RFC v3 0/7] mm: Fix MF_DELAYED handling on memory failure Lisa Wang
` (5 preceding siblings ...)
2026-04-08 17:24 ` [PATCH RFC v3 6/7] KVM: selftests: Add memory failure tests in guest_memfd_test Lisa Wang
@ 2026-04-08 17:24 ` Lisa Wang
6 siblings, 0 replies; 8+ messages in thread
From: Lisa Wang @ 2026-04-08 17:24 UTC (permalink / raw)
To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Paolo Bonzini,
Shuah Khan, Hugh Dickins, Baolin Wang, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel, kvm,
linux-kselftest
Cc: rientjes, seanjc, ackerleytng, vannapurve, michael.roth,
jiaqiyan, tabba, dave.hansen, Lisa Wang
Test that
+ memory failure handling results in unmapping of bad memory from stage
2 page tables, hence requiring faulting on next guest access
+ when the guest tries to fault in a poisoned page from guest_memfd, the
userspace VMM is informed with EHWPOISON
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Lisa Wang <wyihan@google.com>
---
tools/testing/selftests/kvm/guest_memfd_test.c | 70 +++++++++++++++++++++++++-
1 file changed, 69 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 0ea4e7d7e6d5..6615ee8bfcd8 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -26,6 +26,7 @@
#include "ucall_common.h"
static size_t page_size;
+static uint64_t test_memory_failure_guest_gpa;
static void test_file_read_write(int fd, size_t total_size)
{
@@ -637,6 +638,73 @@ static void test_guest_memfd_guest(void)
kvm_vm_free(vm);
}
+static void __guest_code_read(void)
+{
+ uint8_t *mem = (uint8_t *)test_memory_failure_guest_gpa;
+
+ READ_ONCE(*mem);
+ GUEST_SYNC(0);
+ READ_ONCE(*mem);
+ GUEST_DONE();
+}
+
+static void guest_read(struct kvm_vcpu *vcpu, int expected_errno)
+{
+ if (expected_errno) {
+ TEST_ASSERT_EQ(_vcpu_run(vcpu), -1);
+ TEST_ASSERT_EQ(errno, expected_errno);
+ } else {
+ vcpu_run(vcpu);
+ TEST_ASSERT_EQ(get_ucall(vcpu, NULL), UCALL_SYNC);
+ }
+}
+
+static void test_memory_failure_guest(void)
+{
+ const uint64_t gpa = SZ_4G;
+ const int slot = 1;
+
+ unsigned long memory_failure_pfn;
+ struct kvm_vcpu *vcpu;
+ struct kvm_vm *vm;
+ uint8_t *mem;
+ size_t size;
+ int fd;
+
+ if (!kvm_has_cap(KVM_CAP_GUEST_MEMFD_FLAGS))
+ return;
+
+ vm = __vm_create_shape_with_one_vcpu(VM_SHAPE_DEFAULT, &vcpu, 1, __guest_code_read);
+
+ size = vm->page_size;
+ fd = vm_create_guest_memfd(vm, size, GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED);
+ vm_set_user_memory_region2(vm, slot, KVM_MEM_GUEST_MEMFD, gpa, size, NULL, fd, 0);
+
+ mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT(mem != MAP_FAILED, "mmap() for guest_memfd should succeed.");
+ virt_pg_map(vm, gpa, gpa);
+
+ test_memory_failure_guest_gpa = gpa;
+ sync_global_to_guest(vm, test_memory_failure_guest_gpa);
+
+ /* Fault in page to read pfn, then unmap page for testing. */
+ READ_ONCE(*mem);
+ memory_failure_pfn = addr_to_pfn(mem);
+ munmap(mem, size);
+
+ /* Fault page into stage2 page tables. */
+ guest_read(vcpu, 0);
+
+ mark_memory_failure(memory_failure_pfn, 0);
+
+ guest_read(vcpu, EHWPOISON);
+
+ close(fd);
+ kvm_vm_free(vm);
+
+ unmark_memory_failure(memory_failure_pfn, 0);
+}
+
int main(int argc, char *argv[])
{
unsigned long vm_types, vm_type;
@@ -644,7 +712,6 @@ int main(int argc, char *argv[])
TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
page_size = getpagesize();
-
/*
* Not all architectures support KVM_CAP_VM_TYPES. However, those that
* support guest_memfd have that support for the default VM type.
@@ -657,4 +724,5 @@ int main(int argc, char *argv[])
test_guest_memfd(vm_type);
test_guest_memfd_guest();
+ test_memory_failure_guest();
}
--
2.53.0.1213.gd9a14994de-goog
^ permalink raw reply [flat|nested] 8+ messages in thread