* [PATCH v3 0/3] memfd-based Userspace MFR Policy for HugeTLB
@ 2026-02-03 19:23 Jiaqi Yan
2026-02-03 19:23 ` [PATCH v3 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy Jiaqi Yan
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Jiaqi Yan @ 2026-02-03 19:23 UTC (permalink / raw)
To: linmiaohe, william.roche, harry.yoo, jane.chu
Cc: nao.horiguchi, tony.luck, wangkefeng.wang, willy, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
linux-kernel, linux-fsdevel, Jiaqi Yan
Problem
=======
This patchset is a follow-up to the userspace memory failure
recovery (MFR) policy proposed in [1] and [2], but focuses on
a smaller scope: HugeTLB.
To recap the problem for HugeTLB discussed in [1] and [2]:
Cloud providers like Google and Oracle usually serve capacity-
and performance-critical guest memory with 1G HugeTLB
hugepages, as this significantly reduces the overhead
associated with managing page tables and TLB misses. However,
the kernel's current MFR behavior for HugeTLB is not ideal.
Once a byte of memory in a hugepage is hardware corrupted, the
kernel discards the whole hugepage, including the healthy
portion, from the HugeTLB system. Customer workloads running in
the VM can hardly recover from such a large loss of memory.
[1] and [2] proposed the idea that the decision to keep or
discard a large chunk of contiguous memory exclusively owned
by a userspace process due to a recoverable uncorrected
memory error (UE) should be controlled by userspace. What this
means in the Cloud case is that, since a virtual machine
monitor (VMM) has taken host memory to exclusively back the
guest memory for a VM, the VMM can keep holding the memory
even after memory errors occur.
MFD_MF_KEEP_UE_MAPPED for HugeTLB
=================================
[2] proposed a solution centered around the memfd associated
with the memory exclusively owned by userspace.
A userspace process must opt into the MFD_MF_KEEP_UE_MAPPED
policy when it creates a new HugeTLB-backed memfd:
#define MFD_MF_KEEP_UE_MAPPED 0x0020U
int memfd_create(const char *name, unsigned int flags);
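As an illustration only, a minimal userspace sketch of the opt-in
(the flag value comes from this series and is not in released uapi
headers yet, so it is defined locally here; the name "guest_mem" and
the 1G size are made up, and a libc providing the memfd_create()
wrapper is assumed):

  #define _GNU_SOURCE
  #include <linux/memfd.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #ifndef MFD_MF_KEEP_UE_MAPPED
  #define MFD_MF_KEEP_UE_MAPPED 0x0020U   /* value proposed by this series */
  #endif

  int main(void)
  {
          size_t len = 1UL << 30;         /* one 1G hugepage */
          /* Opt into the userspace MFR policy at memfd creation time. */
          int fd = memfd_create("guest_mem", MFD_HUGETLB | MFD_HUGE_1GB |
                                             MFD_MF_KEEP_UE_MAPPED);
          if (fd < 0)
                  return 1;
          if (ftruncate(fd, len) < 0)
                  return 1;
          /*
           * Back guest memory with the hugepage; with the policy enabled
           * it stays mapped even if a recoverable UE lands in it later.
           */
          void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
          return mem == MAP_FAILED ? 1 : 0;
  }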
For any hugepage associated with an MFD_MF_KEEP_UE_MAPPED
enabled memfd, whenever it runs into a UE, MFR doesn't hard
offline the HWPoison huge folio. In other words, the
HWPoison memory remains accessible via the returned memfd
or the memory mapping created with that memfd. MFR still sends
SIGBUS to the userspace process as required. MFR also still
maintains HWPoison metadata on the hugepage having the UE.
A HWPoison hugepage will be immediately isolated and
prevented from future allocation once userspace truncates it
via the memfd, or the owning memfd is closed.
By default MFD_MF_KEEP_UE_MAPPED is not set, and MFR hard
offlines hugepages having UEs.
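To make the SIGBUS semantics concrete, below is a sketch (not part of
this series; it mirrors what the selftest in patch 2/3 checks) of how
a process that opted in might interpret the signal. With
MFD_MF_KEEP_UE_MAPPED, si_addr_lsb is PAGE_SHIFT, so only the raw page
around si_addr is lost while the rest of the hugepage stays mapped:

  #define _GNU_SOURCE
  #include <signal.h>

  /*
   * Sketch of a SIGBUS handler for an MFD_MF_KEEP_UE_MAPPED user;
   * installed with sigaction(SIGBUS, ...) and SA_SIGINFO.
   */
  static void sigbus_handler(int signo, siginfo_t *info, void *ctx)
  {
          if (info->si_code == BUS_MCEERR_AR ||
              info->si_code == BUS_MCEERR_AO) {
                  /* si_addr_lsb == PAGE_SHIFT: only this raw page is bad. */
                  void *bad_page = (void *)((unsigned long)info->si_addr &
                                            ~((1UL << info->si_addr_lsb) - 1));
                  /*
                   * A real handler would stop using bad_page (e.g. repair it
                   * from a replica) and, for BUS_MCEERR_AR, siglongjmp past
                   * the faulting access instead of simply returning.
                   */
                  (void)bad_page;
          }
  }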
Implementation
==============
Implementation is relatively straightforward with two major parts.
Part 1: When hugepages owned by an MFD_MF_KEEP_UE_MAPPED
enabled memfd run into a UE:
* MFR defers hard offline operations, i.e., unmapping and
dissolving. MFR still sets HWPoison flags and holds a refcount
for every raw HWPoison page. MFR still sends SIGBUS to the
consuming thread, but si_addr_lsb will be reduced to PAGE_SHIFT.
* If the memory was not faulted in yet, the fault handler also
needs to allow faulting in the HWPoison folio.
Part 2: When an MFD_MF_KEEP_UE_MAPPED enabled memfd is being
released, or when a userspace process truncates a range of
hugepages belonging to an MFD_MF_KEEP_UE_MAPPED enabled memfd:
* When the HugeTLB in-memory file system removes a filemap's
folios one by one, it asks MFR to deal with HWPoison folios
on the fly, implemented by filemap_offline_hwpoison_folio().
* MFR drops the refcounts being held for the raw HWPoison
pages within the folio. Now that the HWPoison folio becomes
a free HugeTLB folio, MFR dissolves it into a set of raw pages.
Changelog
=========
v3 -> v2 [3]
- Rebase onto [4] to simplify filemap_offline_hwpoison_folio_hugetlb().
With free_has_hwpoisoned() rejecting HWPoison subpages in a HugeTLB
folio, there is no need to take_page_off_buddy() after
dissolve_free_hugetlb_folio().
- Address comments from William Roche <william.roche@oracle.com> and
Jane Chu <jane.chu@oracle.com>.
- Update size_shift in kill_accessing_process() if MFD_MF_KEEP_UE_MAPPED
is enabled. Thanks William Roche <william.roche@oracle.com> for providing
his patch on this.
- Add a new tunable to hugetlb-mfr to control the number of pages within
the 1st hugepage to MADV_HWPOISON.
v2 -> v1 [2]
- Rebase onto commit 6da43bbeb6918 ("Merge tag 'vfio-v6.18-rc6' of
https://github.com/awilliam/linux-vfio").
- Remove populate_memfd_hwp_folios() and offline_memfd_hwp_folios() so
that no memory allocation is needed during releasing HWPoison memfd.
- Insert filemap_offline_hwpoison_folio() into remove_inode_single_folio().
Now dissolving and offlining HWPoison huge folios is done on the fly.
- Fix the bug pointed out by William Roche <william.roche@oracle.com>:
call take_page_off_buddy() no matter whether the HWPoison page is a buddy page or not.
- Remove update_per_node_mf_stats() when dissolve failed.
- Make hugetlb-mfr allocate 4 1G hugepages to cover new code introduced
in remove_inode_hugepages().
- Make hugetlb-mfr support testing both 1GB and 2MB HugeTLB hugepages.
- Fix some typos in documentation.
[1] https://lwn.net/Articles/991513
[2] https://lore.kernel.org/lkml/20250118231549.1652825-1-jiaqiyan@google.com
[3] https://lore.kernel.org/linux-mm/20251116013223.1557158-3-jiaqiyan@google.com
[4] https://lore.kernel.org/linux-mm/20260202194125.2191216-1-jiaqiyan@google.com
Jiaqi Yan (3):
mm: memfd/hugetlb: introduce memfd-based userspace MFR policy
selftests/mm: test userspace MFR for HugeTLB hugepage
Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED
Documentation/userspace-api/index.rst | 1 +
.../userspace-api/mfd_mfr_policy.rst | 60 +++
fs/hugetlbfs/inode.c | 25 +-
include/linux/hugetlb.h | 7 +
include/linux/pagemap.h | 23 ++
include/uapi/linux/memfd.h | 6 +
mm/hugetlb.c | 8 +-
mm/memfd.c | 15 +-
mm/memory-failure.c | 124 +++++-
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 3 +
tools/testing/selftests/mm/hugetlb-mfr.c | 369 ++++++++++++++++++
12 files changed, 627 insertions(+), 15 deletions(-)
create mode 100644 Documentation/userspace-api/mfd_mfr_policy.rst
create mode 100644 tools/testing/selftests/mm/hugetlb-mfr.c
--
2.53.0.rc2.204.g2597b5adb4-goog
* [PATCH v3 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy
2026-02-03 19:23 [PATCH v3 0/3] memfd-based Userspace MFR Policy for HugeTLB Jiaqi Yan
@ 2026-02-03 19:23 ` Jiaqi Yan
2026-02-04 17:29 ` William Roche
2026-02-09 11:54 ` Miaohe Lin
2026-02-03 19:23 ` [PATCH v3 2/3] selftests/mm: test userspace MFR for HugeTLB hugepage Jiaqi Yan
2026-02-03 19:23 ` [PATCH v3 3/3] Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED Jiaqi Yan
2 siblings, 2 replies; 15+ messages in thread
From: Jiaqi Yan @ 2026-02-03 19:23 UTC (permalink / raw)
To: linmiaohe, william.roche, harry.yoo, jane.chu
Cc: nao.horiguchi, tony.luck, wangkefeng.wang, willy, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
linux-kernel, linux-fsdevel, Jiaqi Yan
Sometimes immediately hard offlining a large chunk of contiguous memory
having uncorrected memory errors (UE) may not be the best option.
Cloud providers usually serve capacity- and performance-critical guest
memory with 1G HugeTLB hugepages, as this significantly reduces the
overhead associated with managing page tables and TLB misses. However,
for today's HugeTLB system, once a byte of memory in a hugepage is
hardware corrupted, the kernel discards the whole hugepage, including
the healthy portion. Customer workloads running in the VM can hardly
recover from such a large loss of memory.
Therefore, whether to keep or discard a large chunk of contiguous memory
owned by userspace (particularly when it serves guest memory) after a
recoverable UE may better be controlled by the userspace process
that owns the memory, e.g. the VMM in a Cloud environment.
Introduce a memfd-based userspace memory failure recovery (MFR) policy,
MFD_MF_KEEP_UE_MAPPED. It is possible to support other memfd types,
but the current implementation only covers HugeTLB.
For a hugepage associated with an MFD_MF_KEEP_UE_MAPPED enabled memfd,
whenever it runs into a new UE:
* MFR defers hard offline operations, i.e., unmapping and
dissolving. MFR still sets the HWPoison flag, holds a refcount
for every raw HWPoison page, records them in a list, and sends SIGBUS
to the consuming thread, but si_addr_lsb is reduced to PAGE_SHIFT.
If userspace is able to handle the SIGBUS, the HWPoison hugepage
remains accessible via the mapping created with that memfd.
* If the memory was not faulted in yet, the fault handler also
allows faulting in the HWPoison folio.
For an MFD_MF_KEEP_UE_MAPPED enabled memfd, when it is closed, or
when the userspace process truncates its hugepages:
* When the HugeTLB in-memory file system removes the filemap's
folios one by one, it asks MFR to deal with HWPoison folios
on the fly, implemented by filemap_offline_hwpoison_folio().
* MFR drops the refcounts being held for the raw HWPoison
pages within the folio. Now that the HWPoison folio becomes
free, MFR dissolves it into a set of raw pages. The healthy pages
are recycled into the buddy allocator, while the HWPoison ones are
prevented from re-allocation.
By default MFD_MF_KEEP_UE_MAPPED is not set, so MFR immediately hard
offlines hugepages having UEs.
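For illustration, a sketch of the userspace side of this release path
(a hypothetical helper, not part of this series; the selftest in the
next patch exercises the same sequence with ftruncate(fd, 0) followed
by close(fd)):

  #include <sys/mman.h>
  #include <unistd.h>

  /*
   * Hypothetical teardown of an MFD_MF_KEEP_UE_MAPPED memfd. Once the
   * hugepages are truncated or the memfd is closed, MFR dissolves any
   * HWPoison hugepage: healthy raw pages go back to the buddy
   * allocator, poisoned ones are kept out of future allocation.
   */
  static void release_guest_memfd(void *mem, size_t len, int fd)
  {
          munmap(mem, len);
          ftruncate(fd, 0);  /* drop all hugepages; poisoned ones get offlined now */
          close(fd);
  }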
Tested with new selftest hugetlb-mfr in the follow-up commit.
Co-developed-by: William Roche <william.roche@oracle.com>
Signed-off-by: William Roche <william.roche@oracle.com>
Tested-by: William Roche <william.roche@oracle.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
fs/hugetlbfs/inode.c | 25 +++++++-
include/linux/hugetlb.h | 7 +++
include/linux/pagemap.h | 23 +++++++
include/uapi/linux/memfd.h | 6 ++
mm/hugetlb.c | 8 ++-
mm/memfd.c | 15 ++++-
mm/memory-failure.c | 124 ++++++++++++++++++++++++++++++++++---
7 files changed, 193 insertions(+), 15 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3b4c152c5c73a..8b0f5aa49711f 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -551,6 +551,18 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
}
folio_unlock(folio);
+
+ /*
+ * There may be pending HWPoison-ed folios when a memfd is being
+ * removed or part of it is being truncated.
+ *
+ * HugeTLBFS' error_remove_folio keeps the HWPoison-ed folios in
+ * page cache until mm wants to drop the folio at the end of the
+ * of the filemap. At this point, if memory failure was delayed
+ * by MFD_MF_KEEP_UE_MAPPED in the past, we can now deal with it.
+ */
+ filemap_offline_hwpoison_folio(mapping, folio);
+
return ret;
}
@@ -582,13 +594,13 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
const pgoff_t end = lend >> PAGE_SHIFT;
struct folio_batch fbatch;
pgoff_t next, index;
- int i, freed = 0;
+ int i, j, freed = 0;
bool truncate_op = (lend == LLONG_MAX);
folio_batch_init(&fbatch);
next = lstart >> PAGE_SHIFT;
while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
- for (i = 0; i < folio_batch_count(&fbatch); ++i) {
+ for (i = 0, j = 0; i < folio_batch_count(&fbatch); ++i) {
struct folio *folio = fbatch.folios[i];
u32 hash = 0;
@@ -603,8 +615,17 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
index, truncate_op))
freed++;
+ /*
+ * Skip HWPoison-ed hugepages, which should no
+ * longer be hugetlb if successfully dissolved.
+ */
+ if (folio_test_hugetlb(folio))
+ fbatch.folios[j++] = folio;
+
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
+ fbatch.nr = j;
+
folio_batch_release(&fbatch);
cond_resched();
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e51b8ef0cebd9..7fadf1772335d 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -879,10 +879,17 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn,
#ifdef CONFIG_MEMORY_FAILURE
extern void folio_clear_hugetlb_hwpoison(struct folio *folio);
+extern bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
+ struct address_space *mapping);
#else
static inline void folio_clear_hugetlb_hwpoison(struct folio *folio)
{
}
+static inline bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio
+ struct address_space *mapping)
+{
+ return false;
+}
#endif
#ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ec442af3f8861..53772c29451eb 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -211,6 +211,7 @@ enum mapping_flags {
AS_KERNEL_FILE = 10, /* mapping for a fake kernel file that shouldn't
account usage to user cgroups */
AS_NO_DATA_INTEGRITY = 11, /* no data integrity guarantees */
+ AS_MF_KEEP_UE_MAPPED = 12, /* For MFD_MF_KEEP_UE_MAPPED. */
/* Bits 16-25 are used for FOLIO_ORDER */
AS_FOLIO_ORDER_BITS = 5,
AS_FOLIO_ORDER_MIN = 16,
@@ -356,6 +357,16 @@ static inline bool mapping_no_data_integrity(const struct address_space *mapping
return test_bit(AS_NO_DATA_INTEGRITY, &mapping->flags);
}
+static inline bool mapping_mf_keep_ue_mapped(const struct address_space *mapping)
+{
+ return test_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
+}
+
+static inline void mapping_set_mf_keep_ue_mapped(struct address_space *mapping)
+{
+ set_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
+}
+
static inline gfp_t mapping_gfp_mask(const struct address_space *mapping)
{
return mapping->gfp_mask;
@@ -1303,6 +1314,18 @@ void replace_page_cache_folio(struct folio *old, struct folio *new);
void delete_from_page_cache_batch(struct address_space *mapping,
struct folio_batch *fbatch);
bool filemap_release_folio(struct folio *folio, gfp_t gfp);
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Provided by memory failure to offline HWPoison-ed folio managed by memfd.
+ */
+void filemap_offline_hwpoison_folio(struct address_space *mapping,
+ struct folio *folio);
+#else
+static inline void filemap_offline_hwpoison_folio(struct address_space *mapping,
+ struct folio *folio)
+{
+}
+#endif
loff_t mapping_seek_hole_data(struct address_space *, loff_t start, loff_t end,
int whence);
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 273a4e15dfcff..d9875da551b7f 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -12,6 +12,12 @@
#define MFD_NOEXEC_SEAL 0x0008U
/* executable */
#define MFD_EXEC 0x0010U
+/*
+ * Keep owned folios mapped when uncorrectable memory errors (UE) causes
+ * memory failure (MF) within the folio. Only at the end of the mapping
+ * will its HWPoison-ed folios be dealt with.
+ */
+#define MFD_MF_KEEP_UE_MAPPED 0x0020U
/*
* Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a1832da0f6236..2a161c281da2a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5836,9 +5836,11 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
* So we need to block hugepage fault by PG_hwpoison bit check.
*/
if (unlikely(folio_test_hwpoison(folio))) {
- ret = VM_FAULT_HWPOISON_LARGE |
- VM_FAULT_SET_HINDEX(hstate_index(h));
- goto backout_unlocked;
+ if (!mapping_mf_keep_ue_mapped(mapping)) {
+ ret = VM_FAULT_HWPOISON_LARGE |
+ VM_FAULT_SET_HINDEX(hstate_index(h));
+ goto backout_unlocked;
+ }
}
/* Check for page in userfault range. */
diff --git a/mm/memfd.c b/mm/memfd.c
index ab5312aff14b9..f9fdf014b67ba 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -340,7 +340,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+ MFD_NOEXEC_SEAL | MFD_EXEC | MFD_MF_KEEP_UE_MAPPED)
static int check_sysctl_memfd_noexec(unsigned int *flags)
{
@@ -414,6 +415,8 @@ static int sanitize_flags(unsigned int *flags_ptr)
if (!(flags & MFD_HUGETLB)) {
if (flags & ~MFD_ALL_FLAGS)
return -EINVAL;
+ if (flags & MFD_MF_KEEP_UE_MAPPED)
+ return -EINVAL;
} else {
/* Allow huge page size encoding in flags. */
if (flags & ~(MFD_ALL_FLAGS |
@@ -486,6 +489,16 @@ static struct file *alloc_file(const char *name, unsigned int flags)
file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
file->f_flags |= O_LARGEFILE;
+ /*
+ * MFD_MF_KEEP_UE_MAPPED can only be specified in memfd_create;
+ * no API to update it once memfd is created. MFD_MF_KEEP_UE_MAPPED
+ * is not seal-able.
+ *
+ * For now MFD_MF_KEEP_UE_MAPPED is only supported by HugeTLBFS.
+ */
+ if (flags & MFD_MF_KEEP_UE_MAPPED)
+ mapping_set_mf_keep_ue_mapped(file->f_mapping);
+
if (flags & MFD_NOEXEC_SEAL) {
inode->i_mode &= ~0111;
file_seals = memfd_file_seals_ptr(file);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 58b34f5d2c05d..b9cecbbe08dae 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -410,6 +410,8 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
unsigned long addr)
{
struct to_kill *tk;
+ const struct folio *folio;
+ struct address_space *mapping;
tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
if (!tk) {
@@ -420,8 +422,19 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
tk->addr = addr;
if (is_zone_device_page(p))
tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
- else
- tk->size_shift = folio_shift(page_folio(p));
+ else {
+ folio = page_folio(p);
+ mapping = folio_mapping(folio);
+ if (mapping && mapping_mf_keep_ue_mapped(mapping))
+ /*
+ * Let userspace know the radius of HWPoison is
+ * the size of raw page; accessing other pages
+ * inside the folio is still ok.
+ */
+ tk->size_shift = PAGE_SHIFT;
+ else
+ tk->size_shift = folio_shift(folio);
+ }
/*
* Send SIGKILL if "tk->addr == -EFAULT". Also, as
@@ -844,6 +857,8 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
int flags)
{
int ret;
+ struct folio *folio;
+ struct address_space *mapping;
struct hwpoison_walk priv = {
.pfn = pfn,
};
@@ -861,8 +876,14 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
* ret = 0 when poison page is a clean page and it's dropped, no
* SIGBUS is needed.
*/
- if (ret == 1 && priv.tk.addr)
+ if (ret == 1 && priv.tk.addr) {
+ folio = pfn_folio(pfn);
+ mapping = folio_mapping(folio);
+ if (mapping && mapping_mf_keep_ue_mapped(mapping))
+ priv.tk.size_shift = PAGE_SHIFT;
+
kill_proc(&priv.tk, pfn, flags);
+ }
mmap_read_unlock(p->mm);
return ret > 0 ? -EHWPOISON : 0;
@@ -1206,6 +1227,13 @@ static int me_huge_page(struct page_state *ps, struct page *p)
}
}
+ /*
+ * MF still needs to holds a refcount for the deferred actions in
+ * filemap_offline_hwpoison_folio.
+ */
+ if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
+ return res;
+
if (has_extra_refcount(ps, p, extra_pins))
res = MF_FAILED;
@@ -1602,6 +1630,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
{
LIST_HEAD(tokill);
bool unmap_success;
+ bool keep_mapped;
int forcekill;
bool mlocked = folio_test_mlocked(folio);
@@ -1629,8 +1658,12 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
*/
collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
- unmap_success = !unmap_poisoned_folio(folio, pfn, flags & MF_MUST_KILL);
- if (!unmap_success)
+ keep_mapped = hugetlb_should_keep_hwpoison_mapped(folio, folio->mapping);
+ if (!keep_mapped)
+ unmap_poisoned_folio(folio, pfn, flags & MF_MUST_KILL);
+
+ unmap_success = !folio_mapped(folio);
+ if (!keep_mapped && !unmap_success)
pr_err("%#lx: failed to unmap page (folio mapcount=%d)\n",
pfn, folio_mapcount(folio));
@@ -1655,7 +1688,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
!unmap_success;
kill_procs(&tokill, forcekill, pfn, flags);
- return unmap_success;
+ return unmap_success || keep_mapped;
}
static int identify_page_state(unsigned long pfn, struct page *p,
@@ -1896,6 +1929,13 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
unsigned long count = 0;
head = llist_del_all(raw_hwp_list_head(folio));
+ /*
+ * If filemap_offline_hwpoison_folio_hugetlb is handling this folio,
+ * it has already taken off the head of the llist.
+ */
+ if (head == NULL)
+ return 0;
+
llist_for_each_entry_safe(p, next, head, node) {
if (move_flag)
SetPageHWPoison(p->page);
@@ -1912,7 +1952,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
struct llist_head *head;
struct raw_hwp_page *raw_hwp;
struct raw_hwp_page *p;
- int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0;
+ struct address_space *mapping = folio->mapping;
+ bool has_hwpoison = folio_test_set_hwpoison(folio);
/*
* Once the hwpoison hugepage has lost reliable raw error info,
@@ -1931,8 +1972,15 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
if (raw_hwp) {
raw_hwp->page = page;
llist_add(&raw_hwp->node, head);
+ if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
+ /*
+ * A new raw HWPoison page. Don't return HWPOISON.
+ * Error event will be counted in action_result().
+ */
+ return 0;
+
/* the first error event will be counted in action_result(). */
- if (ret)
+ if (has_hwpoison)
num_poisoned_pages_inc(page_to_pfn(page));
} else {
/*
@@ -1947,7 +1995,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
*/
__folio_free_raw_hwp(folio, false);
}
- return ret;
+
+ return has_hwpoison ? -EHWPOISON : 0;
}
static unsigned long folio_free_raw_hwp(struct folio *folio, bool move_flag)
@@ -1980,6 +2029,18 @@ void folio_clear_hugetlb_hwpoison(struct folio *folio)
folio_free_raw_hwp(folio, true);
}
+bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
+ struct address_space *mapping)
+{
+ if (WARN_ON_ONCE(!folio_test_hugetlb(folio)))
+ return false;
+
+ if (!mapping)
+ return false;
+
+ return mapping_mf_keep_ue_mapped(mapping);
+}
+
/*
* Called from hugetlb code with hugetlb_lock held.
*
@@ -2037,6 +2098,51 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
return ret;
}
+static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
+{
+ int ret;
+ struct llist_node *head;
+ struct raw_hwp_page *curr, *next;
+
+ /*
+ * Since folio is still in the folio_batch, drop the refcount
+ * elevated by filemap_get_folios.
+ */
+ folio_put_refs(folio, 1);
+ head = llist_del_all(raw_hwp_list_head(folio));
+
+ /*
+ * Release refcounts held by try_memory_failure_hugetlb, one per
+ * HWPoison-ed page in the raw hwp list.
+ *
+ * Set HWPoison flag on each page so that free_has_hwpoisoned()
+ * can exclude them during dissolve_free_hugetlb_folio().
+ */
+ llist_for_each_entry_safe(curr, next, head, node) {
+ folio_put(folio);
+ SetPageHWPoison(curr->page);
+ kfree(curr);
+ }
+
+ /* Refcount now should be zero and ready to dissolve folio. */
+ ret = dissolve_free_hugetlb_folio(folio);
+ if (ret)
+ pr_err("failed to dissolve hugetlb folio: %d\n", ret);
+}
+
+void filemap_offline_hwpoison_folio(struct address_space *mapping,
+ struct folio *folio)
+{
+ WARN_ON_ONCE(!mapping);
+
+ if (!folio_test_hwpoison(folio))
+ return;
+
+ /* Pending MFR currently only exist for hugetlb. */
+ if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
+ filemap_offline_hwpoison_folio_hugetlb(folio);
+}
+
/*
* Taking refcount of hugetlb pages needs extra care about race conditions
* with basic operations like hugepage allocation/free/demotion.
--
2.53.0.rc2.204.g2597b5adb4-goog
* [PATCH v3 2/3] selftests/mm: test userspace MFR for HugeTLB hugepage
2026-02-03 19:23 [PATCH v3 0/3] memfd-based Userspace MFR Policy for HugeTLB Jiaqi Yan
2026-02-03 19:23 ` [PATCH v3 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy Jiaqi Yan
@ 2026-02-03 19:23 ` Jiaqi Yan
2026-02-04 17:53 ` William Roche
2026-02-09 12:01 ` Miaohe Lin
2026-02-03 19:23 ` [PATCH v3 3/3] Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED Jiaqi Yan
2 siblings, 2 replies; 15+ messages in thread
From: Jiaqi Yan @ 2026-02-03 19:23 UTC (permalink / raw)
To: linmiaohe, william.roche, harry.yoo, jane.chu
Cc: nao.horiguchi, tony.luck, wangkefeng.wang, willy, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
linux-kernel, linux-fsdevel, Jiaqi Yan
Test the userspace memory failure recovery (MFR) policy for HugeTLB:
1. Create a memfd backed by HugeTLB with MFD_MF_KEEP_UE_MAPPED set.
2. Allocate and map 4 hugepages to the process.
3. Create sub-threads to MADV_HWPOISON inner addresses of the 1st hugepage.
4. Check if the process gets correct SIGBUS for each poisoned raw page.
5. Check if all memory is still accessible and its content valid.
6. Check if the poisoned hugepage is dealt with after memfd released.
Two configurables in the test:
- hugepage_size: size of the hugepage, 1G or 2M.
- nr_hwp_pages: number of pages within the 1st hugepage to MADV_HWPOISON.
Reviewed-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
tools/testing/selftests/mm/.gitignore | 1 +
tools/testing/selftests/mm/Makefile | 3 +
tools/testing/selftests/mm/hugetlb-mfr.c | 369 +++++++++++++++++++++++
3 files changed, 373 insertions(+)
create mode 100644 tools/testing/selftests/mm/hugetlb-mfr.c
diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index c2a8586e51a1f..11664d20935db 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -5,6 +5,7 @@ hugepage-mremap
hugepage-shm
hugepage-vmemmap
hugetlb-madvise
+hugetlb-mfr
hugetlb-read-hwpoison
hugetlb-soft-offline
khugepaged
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index eaf9312097f7b..7469142a87dcc 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -63,6 +63,7 @@ TEST_GEN_FILES += hmm-tests
TEST_GEN_FILES += hugetlb-madvise
TEST_GEN_FILES += hugetlb-read-hwpoison
TEST_GEN_FILES += hugetlb-soft-offline
+TEST_GEN_FILES += hugetlb-mfr
TEST_GEN_FILES += hugepage-mmap
TEST_GEN_FILES += hugepage-mremap
TEST_GEN_FILES += hugepage-shm
@@ -233,6 +234,8 @@ $(OUTPUT)/migration: LDLIBS += -lnuma
$(OUTPUT)/rmap: LDLIBS += -lnuma
+$(OUTPUT)/hugetlb-mfr: LDLIBS += -lnuma
+
local_config.mk local_config.h: check_config.sh
/bin/sh ./check_config.sh $(CC)
diff --git a/tools/testing/selftests/mm/hugetlb-mfr.c b/tools/testing/selftests/mm/hugetlb-mfr.c
new file mode 100644
index 0000000000000..6de59efdb101f
--- /dev/null
+++ b/tools/testing/selftests/mm/hugetlb-mfr.c
@@ -0,0 +1,369 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Test the userspace memory failure recovery (MFR) policy for HugeTLB
+ * hugepage case:
+ * 1. Create a memfd backed by HugeTLB and MFD_MF_KEEP_UE_MAPPED bit set.
+ * 2. Allocate and map 4 hugepages.
+ * 3. Create sub-threads to MADV_HWPOISON inner addresses of the 1st hugepage.
+ * 4. Check if each sub-thread get correct SIGBUS for the poisoned raw pages.
+ * 5. Check if all memory are still accessible and content still valid.
+ * 6. Check if the poisoned hugepage is dealt with after memfd released.
+ *
+ * Test takes two arguments:
+ * - hugepage_size: size of the hugepage, 1G or 2M.
+ * - nr_hwp_pages: number of pages within the 1st hugepage to MADV_HWPOISON.
+ *
+ * Example ways to run the test:
+ * ./hugetlb-mfr 2M 3
+ * or
+ * ./hugetlb-mfr 1G 1
+ * assuming /sys/kernel/mm/hugepages/hugepages-${xxx}kB/nr_hugepages > 4
+ */
+
+#define _GNU_SOURCE
+#include <assert.h>
+#include <errno.h>
+#include <numaif.h>
+#include <numa.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <linux/magic.h>
+#include <linux/memfd.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <sys/statfs.h>
+#include <sys/types.h>
+
+#include "../kselftest.h"
+#include "vm_util.h"
+
+#define min(a, b) ((a) < (b) ? (a) : (b))
+
+#define EPREFIX " !!! "
+#define BYTE_LENTH_IN_1G 0x40000000UL
+#define BYTE_LENTH_IN_2M 0x200000UL
+#define HUGETLB_1GB_STR "1G"
+#define HUGETLB_2MB_STR "2M"
+#define HUGETLB_FILL 0xab
+
+static const unsigned long offsets_1g[] = {
+ 0x200000, 0x3ff000, 0x801000, 0x2000000,
+ 0x3fff000, 0x4001000, 0x7fff000, 0x8011000
+};
+static const unsigned long offsets_2m[] = {
+ 0x020000, 0x041000, 0x07f000, 0x120000,
+ 0x13f000, 0x141000, 0x17f000, 0x18f000
+};
+static size_t nr_hwp_pages;
+
+static void *sigbus_addr;
+static int sigbus_addr_lsb;
+static bool expecting_sigbus;
+static bool got_sigbus;
+static bool was_mceerr;
+
+static int create_hugetlbfs_file(struct statfs *file_stat,
+ unsigned long hugepage_size)
+{
+ int fd;
+ int flags = MFD_HUGETLB | MFD_MF_KEEP_UE_MAPPED;
+
+ if (hugepage_size == BYTE_LENTH_IN_2M)
+ flags |= MFD_HUGE_2MB;
+ else
+ flags |= MFD_HUGE_1GB;
+
+ fd = memfd_create("hugetlb_tmp", flags);
+ if (fd < 0)
+ ksft_exit_fail_perror("Failed to memfd_create");
+
+ memset(file_stat, 0, sizeof(*file_stat));
+ if (fstatfs(fd, file_stat)) {
+ close(fd);
+ ksft_exit_fail_perror("Failed to fstatfs");
+ }
+ if (file_stat->f_type != HUGETLBFS_MAGIC) {
+ close(fd);
+ ksft_exit_fail_msg("Not hugetlbfs file");
+ }
+
+ ksft_print_msg("Created hugetlb_tmp file\n");
+ ksft_print_msg("hugepagesize=%#lx\n", file_stat->f_bsize);
+ if (file_stat->f_bsize != hugepage_size)
+ ksft_exit_fail_msg("Hugepage size is not %#lx", hugepage_size);
+
+ return fd;
+}
+
+/*
+ * SIGBUS handler for "do_hwpoison" thread that mapped and MADV_HWPOISON
+ */
+static void sigbus_handler(int signo, siginfo_t *info, void *context)
+{
+ if (!expecting_sigbus)
+ ksft_exit_fail_msg("unexpected sigbus with addr=%p",
+ info->si_addr);
+
+ got_sigbus = true;
+ was_mceerr = (info->si_code == BUS_MCEERR_AO ||
+ info->si_code == BUS_MCEERR_AR);
+ sigbus_addr = info->si_addr;
+ sigbus_addr_lsb = info->si_addr_lsb;
+}
+
+static void *do_hwpoison(void *hwpoison_addr)
+{
+ int hwpoison_size = getpagesize();
+
+ ksft_print_msg("MADV_HWPOISON hwpoison_addr=%p, len=%d\n",
+ hwpoison_addr, hwpoison_size);
+ if (madvise(hwpoison_addr, hwpoison_size, MADV_HWPOISON) < 0)
+ ksft_exit_fail_perror("Failed to MADV_HWPOISON");
+
+ pthread_exit(NULL);
+}
+
+static void test_hwpoison_multiple_pages(unsigned char *start_addr,
+ unsigned long hugepage_size)
+{
+ pthread_t pthread;
+ int ret;
+ unsigned char *hwpoison_addr;
+ const unsigned long *offsets;
+ size_t i;
+
+ if (hugepage_size == BYTE_LENTH_IN_2M)
+ offsets = offsets_2m;
+ else
+ offsets = offsets_1g;
+
+ for (i = 0; i < nr_hwp_pages; ++i) {
+ sigbus_addr = (void *)0xBADBADBAD;
+ sigbus_addr_lsb = 0;
+ was_mceerr = false;
+ got_sigbus = false;
+ expecting_sigbus = true;
+ hwpoison_addr = start_addr + offsets[i];
+
+ ret = pthread_create(&pthread, NULL, &do_hwpoison, hwpoison_addr);
+ if (ret)
+ ksft_exit_fail_perror("Failed to create hwpoison thread");
+
+ ksft_print_msg("Created thread to hwpoison and access hwpoison_addr=%p\n",
+ hwpoison_addr);
+
+ pthread_join(pthread, NULL);
+
+ if (!got_sigbus)
+ ksft_test_result_fail("Didn't get a SIGBUS\n");
+ if (!was_mceerr)
+ ksft_test_result_fail("Didn't get a BUS_MCEERR_A(R|O)\n");
+ if (sigbus_addr != hwpoison_addr)
+ ksft_test_result_fail("Incorrect address: got=%p, expected=%p\n",
+ sigbus_addr, hwpoison_addr);
+ if (sigbus_addr_lsb != pshift())
+ ksft_test_result_fail("Incorrect address LSB: got=%d, expected=%d\n",
+ sigbus_addr_lsb, pshift());
+
+ ksft_print_msg("Received expected and correct SIGBUS\n");
+ }
+}
+
+static int read_nr_hugepages(unsigned long hugepage_size,
+ unsigned long *nr_hugepages)
+{
+ char buffer[256] = {0};
+ char cmd[256] = {0};
+
+ sprintf(cmd, "cat /sys/kernel/mm/hugepages/hugepages-%ldkB/nr_hugepages",
+ hugepage_size);
+ FILE *cmdfile = popen(cmd, "r");
+
+ if (cmdfile == NULL) {
+ ksft_perror(EPREFIX "failed to popen nr_hugepages");
+ return -1;
+ }
+
+ if (!fgets(buffer, sizeof(buffer), cmdfile)) {
+ ksft_perror(EPREFIX "failed to read nr_hugepages");
+ pclose(cmdfile);
+ return -1;
+ }
+
+ *nr_hugepages = atoll(buffer);
+ pclose(cmdfile);
+ return 0;
+}
+
+/*
+ * Main thread that drives the test.
+ */
+static void test_main(int fd, unsigned long hugepage_size)
+{
+ unsigned char *map, *iter;
+ struct sigaction new, old;
+ const unsigned long hugepagesize_kb = hugepage_size / 1024;
+ unsigned long nr_hugepages_before = 0;
+ unsigned long nr_hugepages_after = 0;
+ unsigned long nodemask = 1UL << 0;
+ unsigned long len = hugepage_size * 4;
+ int ret;
+
+ if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_before) != 0) {
+ close(fd);
+ ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+ }
+ ksft_print_msg("NR hugepages before MADV_HWPOISON is %ld\n", nr_hugepages_before);
+
+ if (ftruncate(fd, len) < 0)
+ ksft_exit_fail_perror("Failed to ftruncate");
+
+ ksft_print_msg("Allocated %#lx bytes to HugeTLB file\n", len);
+
+ map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (map == MAP_FAILED)
+ ksft_exit_fail_msg("Failed to mmap");
+
+ ksft_print_msg("Created HugeTLB mapping: %p\n", map);
+
+ ret = mbind(map, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
+ MPOL_MF_STRICT | MPOL_MF_MOVE);
+ if (ret < 0) {
+ perror("mbind");
+ ksft_exit_fail_msg("Failed to bind to node\n");
+ }
+
+ memset(map, HUGETLB_FILL, len);
+ ksft_print_msg("Memset every byte to 0xab\n");
+
+ new.sa_sigaction = &sigbus_handler;
+ new.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGBUS, &new, &old) < 0)
+ ksft_exit_fail_msg("Failed to setup SIGBUS handler");
+
+ ksft_print_msg("Setup SIGBUS handler successfully\n");
+
+ test_hwpoison_multiple_pages(map, hugepage_size);
+
+ /*
+ * Since MADV_HWPOISON doesn't corrupt the memory in hardware, and
+ * MFD_MF_KEEP_UE_MAPPED keeps the hugepage mapped, every byte should
+ * remain accessible and hold original data.
+ */
+ expecting_sigbus = false;
+ for (iter = map; iter < map + len; ++iter) {
+ if (*iter != HUGETLB_FILL) {
+ ksft_print_msg("At addr=%p: got=%#x, expected=%#x\n",
+ iter, *iter, HUGETLB_FILL);
+ ksft_test_result_fail("Memory content corrupted\n");
+ break;
+ }
+ }
+ ksft_print_msg("Memory content all valid\n");
+
+ if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_after) != 0) {
+ close(fd);
+ ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+ }
+
+ /*
+ * After MADV_HWPOISON, hugepage should still be in HugeTLB pool.
+ */
+ ksft_print_msg("NR hugepages after MADV_HWPOISON is %ld\n", nr_hugepages_after);
+ if (nr_hugepages_before != nr_hugepages_after)
+ ksft_test_result_fail("NR hugepages reduced by %ld after MADV_HWPOISON\n",
+ nr_hugepages_before - nr_hugepages_after);
+
+ /* End of the lifetime of the created HugeTLB memfd. */
+ if (ftruncate(fd, 0) < 0)
+ ksft_exit_fail_perror("Failed to ftruncate to 0");
+ munmap(map, len);
+ close(fd);
+
+ /*
+ * After freed by userspace, MADV_HWPOISON-ed hugepage should be
+ * dissolved into raw pages and removed from HugeTLB pool.
+ */
+ if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_after) != 0) {
+ close(fd);
+ ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+ }
+ ksft_print_msg("NR hugepages after closure is %ld\n", nr_hugepages_after);
+ if (nr_hugepages_before != nr_hugepages_after + 1)
+ ksft_test_result_fail("NR hugepages is not reduced after memfd closure\n");
+
+ ksft_test_result_pass("All done\n");
+}
+
+static unsigned long parse_hugepage_size(char *argv)
+{
+ if (strncasecmp(argv, HUGETLB_1GB_STR, strlen(HUGETLB_1GB_STR)) == 0)
+ return BYTE_LENTH_IN_1G;
+
+ if (strncasecmp(argv, HUGETLB_2MB_STR, strlen(HUGETLB_2MB_STR)) == 0)
+ return BYTE_LENTH_IN_2M;
+
+ ksft_print_msg("Please provide valid hugepage_size: 1G or 2M\n");
+ assert(false);
+}
+
+static size_t parse_nr_hwp_pages(char *argv)
+{
+ unsigned long val;
+ char *endptr;
+ size_t limit = min(ARRAY_SIZE(offsets_1g), ARRAY_SIZE(offsets_2m));
+
+ if (strlen(argv) < 1) {
+ ksft_print_msg("Please provide valid nr_hwpoison: 1-8\n");
+ assert(false);
+ }
+
+ errno = 0;
+ val = strtoul(argv, &endptr, 10);
+
+ if (*endptr != '\0') {
+ ksft_print_msg("Found invalid chars: '%s", endptr);
+ assert(false);
+ }
+
+ if (errno == ERANGE) {
+ ksft_print_msg("Value '%s' out of range for size_t\n", argv);
+ assert(false);
+ }
+
+ if (val > limit) {
+ ksft_print_msg("Value '%s' must < %lu\n", argv, limit);
+ assert(false);
+ }
+
+ return val;
+}
+
+int main(int argc, char **argv)
+{
+ int fd;
+ struct statfs file_stat;
+ unsigned long hugepage_size;
+
+ if (argc != 3) {
+ ksft_print_msg("Usage: %s <hugepage_size=1G|2M> <nr_hwp_pages>\n", argv[0]);
+ return -EINVAL;
+ }
+
+ ksft_print_header();
+ ksft_set_plan(1);
+
+ hugepage_size = parse_hugepage_size(argv[1]);
+ nr_hwp_pages = parse_nr_hwp_pages(argv[2]);
+ fd = create_hugetlbfs_file(&file_stat, hugepage_size);
+ test_main(fd, hugepage_size);
+
+ ksft_finished();
+}
--
2.53.0.rc2.204.g2597b5adb4-goog
* [PATCH v3 3/3] Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED
2026-02-03 19:23 [PATCH v3 0/3] memfd-based Userspace MFR Policy for HugeTLB Jiaqi Yan
2026-02-03 19:23 ` [PATCH v3 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy Jiaqi Yan
2026-02-03 19:23 ` [PATCH v3 2/3] selftests/mm: test userspace MFR for HugeTLB hugepage Jiaqi Yan
@ 2026-02-03 19:23 ` Jiaqi Yan
2026-02-04 17:56 ` William Roche
2 siblings, 1 reply; 15+ messages in thread
From: Jiaqi Yan @ 2026-02-03 19:23 UTC (permalink / raw)
To: linmiaohe, william.roche, harry.yoo, jane.chu
Cc: nao.horiguchi, tony.luck, wangkefeng.wang, willy, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
linux-kernel, linux-fsdevel, Jiaqi Yan
Document its motivation, userspace API, behaviors, and limitations.
Reviewed-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
Documentation/userspace-api/index.rst | 1 +
.../userspace-api/mfd_mfr_policy.rst | 60 +++++++++++++++++++
2 files changed, 61 insertions(+)
create mode 100644 Documentation/userspace-api/mfd_mfr_policy.rst
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 8a61ac4c1bf19..6d8d94028a6cd 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -68,6 +68,7 @@ Everything else
futex2
perf_ring_buffer
ntsync
+ mfd_mfr_policy
.. only:: subproject and html
diff --git a/Documentation/userspace-api/mfd_mfr_policy.rst b/Documentation/userspace-api/mfd_mfr_policy.rst
new file mode 100644
index 0000000000000..c5a25df39791a
--- /dev/null
+++ b/Documentation/userspace-api/mfd_mfr_policy.rst
@@ -0,0 +1,60 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================================
+Userspace Memory Failure Recovery Policy via memfd
+==================================================
+
+:Author:
+ Jiaqi Yan <jiaqiyan@google.com>
+
+
+Motivation
+==========
+
+When a userspace process is able to recover from memory failures (MF)
+caused by uncorrected memory error (UE) in the DIMM, especially when it is
+able to avoid consuming known UEs, keeping the memory page mapped and
+accessible is beneficial to the owning process for a couple of reasons:
+
+- The memory pages affected by the UE have a large smallest granularity, for
+ example a 1G hugepage, but the actually corrupted part of the page is only
+ several cachelines. Losing the entire hugepage of data is unacceptable to
+ the application.
+
+- In addition to keeping the data accessible, the application still wants
+ to access with a large page size for the fastest virtual-to-physical
+ translations.
+
+Memory failure recovery for 1G or larger HugeTLB pages is a good example. With
+memfd, a userspace process can control whether the kernel hard offlines the
+hugepages that back the in-RAM file created by memfd.
+
+
+User API
+========
+
+``int memfd_create(const char *name, unsigned int flags)``
+
+``MFD_MF_KEEP_UE_MAPPED``
+
+ When ``MFD_MF_KEEP_UE_MAPPED`` bit is set in ``flags``, MF recovery
+ in the kernel does not hard offline memory due to UE until the
+ returned ``memfd`` is released. IOW, the HWPoison-ed memory remains
+ accessible via the returned ``memfd`` or the memory mapping created
+ with the returned ``memfd``. Note the affected memory will be
+ immediately isolated and prevented from future use once the memfd
+ is closed. By default ``MFD_MF_KEEP_UE_MAPPED`` is not set, and
+ kernel hard offlines memory having UEs.
+
+Notes about the behavior and limitations
+
+- Even if the page affected by the UE is kept, a portion of the (huge)page is
+ already lost due to hardware corruption, and the size of that portion
+ is the smallest page size that the kernel uses to manage memory on the
+ architecture, i.e. PAGESIZE. Accessing a virtual address within any of
+ these portions results in a SIGBUS; accessing virtual addresses outside
+ them is fine until they are corrupted by a new memory error.
+
+- ``MFD_MF_KEEP_UE_MAPPED`` currently only works for HugeTLB, so
+ ``MFD_HUGETLB`` must also be set when setting ``MFD_MF_KEEP_UE_MAPPED``.
+ Otherwise ``memfd_create`` returns EINVAL.
--
2.53.0.rc2.204.g2597b5adb4-goog
* Re: [PATCH v3 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy
2026-02-03 19:23 ` [PATCH v3 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy Jiaqi Yan
@ 2026-02-04 17:29 ` William Roche
2026-02-10 4:46 ` Jiaqi Yan
2026-02-09 11:54 ` Miaohe Lin
1 sibling, 1 reply; 15+ messages in thread
From: William Roche @ 2026-02-04 17:29 UTC (permalink / raw)
To: Jiaqi Yan, linmiaohe, harry.yoo, jane.chu
Cc: nao.horiguchi, tony.luck, wangkefeng.wang, willy, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
linux-kernel, linux-fsdevel
On 2/3/26 20:23, Jiaqi Yan wrote:
> [...]
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 3b4c152c5c73a..8b0f5aa49711f 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -551,6 +551,18 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
> }
>
> folio_unlock(folio);
> +
> + /*
> + * There may be pending HWPoison-ed folios when a memfd is being
> + * removed or part of it is being truncated.
> + *
> + * HugeTLBFS' error_remove_folio keeps the HWPoison-ed folios in
> + * page cache until mm wants to drop the folio at the end of the
> + * of the filemap. At this point, if memory failure was delayed
"of the" is repeated
> + * by MFD_MF_KEEP_UE_MAPPED in the past, we can now deal with it.
> + */
> + filemap_offline_hwpoison_folio(mapping, folio);
> +
> return ret;
> }
>
> @@ -582,13 +594,13 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> const pgoff_t end = lend >> PAGE_SHIFT;
> struct folio_batch fbatch;
> pgoff_t next, index;
> - int i, freed = 0;
> + int i, j, freed = 0;
> bool truncate_op = (lend == LLONG_MAX);
>
> folio_batch_init(&fbatch);
> next = lstart >> PAGE_SHIFT;
> while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
> - for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> + for (i = 0, j = 0; i < folio_batch_count(&fbatch); ++i) {
> struct folio *folio = fbatch.folios[i];
> u32 hash = 0;
>
> @@ -603,8 +615,17 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> index, truncate_op))
> freed++;
>
> + /*
> + * Skip HWPoison-ed hugepages, which should no
> + * longer be hugetlb if successfully dissolved.
> + */
> + if (folio_test_hugetlb(folio))
> + fbatch.folios[j++] = folio;
> +
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> }
> + fbatch.nr = j;
> +
> folio_batch_release(&fbatch);
> cond_resched();
> }
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e51b8ef0cebd9..7fadf1772335d 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -879,10 +879,17 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn,
>
> #ifdef CONFIG_MEMORY_FAILURE
> extern void folio_clear_hugetlb_hwpoison(struct folio *folio);
> +extern bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
> + struct address_space *mapping);
> #else
> static inline void folio_clear_hugetlb_hwpoison(struct folio *folio)
> {
> }
> +static inline bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio
comma is missing
> + struct address_space *mapping)
> +{
> + return false;
> +}
> #endif
>
> #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index ec442af3f8861..53772c29451eb 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -211,6 +211,7 @@ enum mapping_flags {
> AS_KERNEL_FILE = 10, /* mapping for a fake kernel file that shouldn't
> account usage to user cgroups */
> AS_NO_DATA_INTEGRITY = 11, /* no data integrity guarantees */
> + AS_MF_KEEP_UE_MAPPED = 12, /* For MFD_MF_KEEP_UE_MAPPED. */
> /* Bits 16-25 are used for FOLIO_ORDER */
> AS_FOLIO_ORDER_BITS = 5,
> AS_FOLIO_ORDER_MIN = 16,
> @@ -356,6 +357,16 @@ static inline bool mapping_no_data_integrity(const struct address_space *mapping
> return test_bit(AS_NO_DATA_INTEGRITY, &mapping->flags);
> }
>
> +static inline bool mapping_mf_keep_ue_mapped(const struct address_space *mapping)
> +{
> + return test_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
> +}
> +
> +static inline void mapping_set_mf_keep_ue_mapped(struct address_space *mapping)
> +{
> + set_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
> +}
> +
> static inline gfp_t mapping_gfp_mask(const struct address_space *mapping)
> {
> return mapping->gfp_mask;
> @@ -1303,6 +1314,18 @@ void replace_page_cache_folio(struct folio *old, struct folio *new);
> void delete_from_page_cache_batch(struct address_space *mapping,
> struct folio_batch *fbatch);
> bool filemap_release_folio(struct folio *folio, gfp_t gfp);
> +#ifdef CONFIG_MEMORY_FAILURE
> +/*
> + * Provided by memory failure to offline HWPoison-ed folio managed by memfd.
> + */
> +void filemap_offline_hwpoison_folio(struct address_space *mapping,
> + struct folio *folio);
> +#else
> +static inline void filemap_offline_hwpoison_folio(struct address_space *mapping,
> + struct folio *folio)
> +{
> +}
> +#endif
> loff_t mapping_seek_hole_data(struct address_space *, loff_t start, loff_t end,
> int whence);
>
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 273a4e15dfcff..d9875da551b7f 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -12,6 +12,12 @@
> #define MFD_NOEXEC_SEAL 0x0008U
> /* executable */
> #define MFD_EXEC 0x0010U
> +/*
> + * Keep owned folios mapped when uncorrectable memory errors (UE) causes
> + * memory failure (MF) within the folio. Only at the end of the mapping
> + * will its HWPoison-ed folios be dealt with.
> + */
> +#define MFD_MF_KEEP_UE_MAPPED 0x0020U
>
> /*
> * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a1832da0f6236..2a161c281da2a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5836,9 +5836,11 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
> * So we need to block hugepage fault by PG_hwpoison bit check.
> */
> if (unlikely(folio_test_hwpoison(folio))) {
> - ret = VM_FAULT_HWPOISON_LARGE |
> - VM_FAULT_SET_HINDEX(hstate_index(h));
> - goto backout_unlocked;
> + if (!mapping_mf_keep_ue_mapped(mapping)) {
> + ret = VM_FAULT_HWPOISON_LARGE |
> + VM_FAULT_SET_HINDEX(hstate_index(h));
> + goto backout_unlocked;
> + }
> }
>
> /* Check for page in userfault range. */
> diff --git a/mm/memfd.c b/mm/memfd.c
> index ab5312aff14b9..f9fdf014b67ba 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -340,7 +340,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
> #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> + MFD_NOEXEC_SEAL | MFD_EXEC | MFD_MF_KEEP_UE_MAPPED)
>
> static int check_sysctl_memfd_noexec(unsigned int *flags)
> {
> @@ -414,6 +415,8 @@ static int sanitize_flags(unsigned int *flags_ptr)
> if (!(flags & MFD_HUGETLB)) {
> if (flags & ~MFD_ALL_FLAGS)
> return -EINVAL;
> + if (flags & MFD_MF_KEEP_UE_MAPPED)
> + return -EINVAL;
> } else {
> /* Allow huge page size encoding in flags. */
> if (flags & ~(MFD_ALL_FLAGS |
> @@ -486,6 +489,16 @@ static struct file *alloc_file(const char *name, unsigned int flags)
> file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> file->f_flags |= O_LARGEFILE;
>
> + /*
> + * MFD_MF_KEEP_UE_MAPPED can only be specified in memfd_create;
> + * no API to update it once memfd is created. MFD_MF_KEEP_UE_MAPPED
> + * is not seal-able.
> + *
> + * For now MFD_MF_KEEP_UE_MAPPED is only supported by HugeTLBFS.
> + */
> + if (flags & MFD_MF_KEEP_UE_MAPPED)
> + mapping_set_mf_keep_ue_mapped(file->f_mapping);
> +
> if (flags & MFD_NOEXEC_SEAL) {
> inode->i_mode &= ~0111;
> file_seals = memfd_file_seals_ptr(file);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 58b34f5d2c05d..b9cecbbe08dae 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -410,6 +410,8 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
> unsigned long addr)
> {
> struct to_kill *tk;
> + const struct folio *folio;
> + struct address_space *mapping;
>
> tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
> if (!tk) {
> @@ -420,8 +422,19 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
> tk->addr = addr;
> if (is_zone_device_page(p))
> tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
> - else
> - tk->size_shift = folio_shift(page_folio(p));
> + else {
> + folio = page_folio(p);
> + mapping = folio_mapping(folio);
> + if (mapping && mapping_mf_keep_ue_mapped(mapping))
> + /*
> + * Let userspace know the radius of HWPoison is
> + * the size of raw page; accessing other pages
> + * inside the folio is still ok.
> + */
> + tk->size_shift = PAGE_SHIFT;
> + else
> + tk->size_shift = folio_shift(folio);
> + }
>
> /*
> * Send SIGKILL if "tk->addr == -EFAULT". Also, as
> @@ -844,6 +857,8 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
> int flags)
> {
> int ret;
> + struct folio *folio;
> + struct address_space *mapping;
> struct hwpoison_walk priv = {
> .pfn = pfn,
> };
> @@ -861,8 +876,14 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
> * ret = 0 when poison page is a clean page and it's dropped, no
> * SIGBUS is needed.
> */
> - if (ret == 1 && priv.tk.addr)
> + if (ret == 1 && priv.tk.addr) {
> + folio = pfn_folio(pfn);
> + mapping = folio_mapping(folio);
> + if (mapping && mapping_mf_keep_ue_mapped(mapping))
> + priv.tk.size_shift = PAGE_SHIFT;
> +
> kill_proc(&priv.tk, pfn, flags);
> + }
> mmap_read_unlock(p->mm);
>
> return ret > 0 ? -EHWPOISON : 0;
> @@ -1206,6 +1227,13 @@ static int me_huge_page(struct page_state *ps, struct page *p)
> }
> }
>
> + /*
> + * MF still needs to holds a refcount for the deferred actions in
to hold (without the s)
> + * filemap_offline_hwpoison_folio.
> + */
> + if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> + return res;
> +
> if (has_extra_refcount(ps, p, extra_pins))
> res = MF_FAILED;
>
> @@ -1602,6 +1630,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
> {
> LIST_HEAD(tokill);
> bool unmap_success;
> + bool keep_mapped;
> int forcekill;
> bool mlocked = folio_test_mlocked(folio);
>
> @@ -1629,8 +1658,12 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
> */
> collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
>
> - unmap_success = !unmap_poisoned_folio(folio, pfn, flags & MF_MUST_KILL);
> - if (!unmap_success)
> + keep_mapped = hugetlb_should_keep_hwpoison_mapped(folio, folio->mapping);
We should use folio_mapping(folio) instead of folio->mapping.
But more importantly, this function can be called on non-hugepage
folios, and hugetlb_should_keep_hwpoison_mapped() warns (ONCE) in
this case. So shouldn't the caller make sure that we are dealing with
hugepages first?
> + if (!keep_mapped)
> + unmap_poisoned_folio(folio, pfn, flags & MF_MUST_KILL);
> +
> + unmap_success = !folio_mapped(folio);
> + if (!keep_mapped && !unmap_success)
> pr_err("%#lx: failed to unmap page (folio mapcount=%d)\n",
> pfn, folio_mapcount(folio));
>
> @@ -1655,7 +1688,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
> !unmap_success;
> kill_procs(&tokill, forcekill, pfn, flags);
>
> - return unmap_success;
> + return unmap_success || keep_mapped;
> }
>
> static int identify_page_state(unsigned long pfn, struct page *p,
> @@ -1896,6 +1929,13 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
> unsigned long count = 0;
>
> head = llist_del_all(raw_hwp_list_head(folio));
> + /*
> + * If filemap_offline_hwpoison_folio_hugetlb is handling this folio,
> + * it has already taken off the head of the llist.
> + */
> + if (head == NULL)
> + return 0;
> +
> llist_for_each_entry_safe(p, next, head, node) {
> if (move_flag)
> SetPageHWPoison(p->page);
> @@ -1912,7 +1952,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
> struct llist_head *head;
> struct raw_hwp_page *raw_hwp;
> struct raw_hwp_page *p;
> - int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0;
> + struct address_space *mapping = folio->mapping;
Same here - we should use folio_mapping(folio) instead of folio->mapping.
> + bool has_hwpoison = folio_test_set_hwpoison(folio);
>
> /*
> * Once the hwpoison hugepage has lost reliable raw error info,
> @@ -1931,8 +1972,15 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
> if (raw_hwp) {
> raw_hwp->page = page;
> llist_add(&raw_hwp->node, head);
> + if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> + /*
> + * A new raw HWPoison page. Don't return HWPOISON.
> + * Error event will be counted in action_result().
> + */
> + return 0;
> +
> /* the first error event will be counted in action_result(). */
> - if (ret)
> + if (has_hwpoison)
> num_poisoned_pages_inc(page_to_pfn(page));
> } else {
> /*
> @@ -1947,7 +1995,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
> */
> __folio_free_raw_hwp(folio, false);
> }
> - return ret;
> +
> + return has_hwpoison ? -EHWPOISON : 0;
> }
>
> static unsigned long folio_free_raw_hwp(struct folio *folio, bool move_flag)
> @@ -1980,6 +2029,18 @@ void folio_clear_hugetlb_hwpoison(struct folio *folio)
> folio_free_raw_hwp(folio, true);
> }
>
> +bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
> + struct address_space *mapping)
> +{
> + if (WARN_ON_ONCE(!folio_test_hugetlb(folio)))
> + return false;
> +
> + if (!mapping)
> + return false;
> +
> + return mapping_mf_keep_ue_mapped(mapping);
> +}
The definition of this above function should be encapsulated with
#ifdef CONFIG_MEMORY_FAILURE
#endif
> +
> /*
> * Called from hugetlb code with hugetlb_lock held.
> *
> @@ -2037,6 +2098,51 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
> return ret;
> }
>
> +static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
> +{
> + int ret;
> + struct llist_node *head;
> + struct raw_hwp_page *curr, *next;
> +
> + /*
> + * Since folio is still in the folio_batch, drop the refcount
> + * elevated by filemap_get_folios.
> + */
> + folio_put_refs(folio, 1);
> + head = llist_del_all(raw_hwp_list_head(folio));
> +
> + /*
> + * Release refcounts held by try_memory_failure_hugetlb, one per
> + * HWPoison-ed page in the raw hwp list.
> + *
> + * Set HWPoison flag on each page so that free_has_hwpoisoned()
> + * can exclude them during dissolve_free_hugetlb_folio().
> + */
> + llist_for_each_entry_safe(curr, next, head, node) {
> + folio_put(folio);
> + SetPageHWPoison(curr->page);
> + kfree(curr);
> + }
> +
> + /* Refcount now should be zero and ready to dissolve folio. */
> + ret = dissolve_free_hugetlb_folio(folio);
> + if (ret)
> + pr_err("failed to dissolve hugetlb folio: %d\n", ret);
> +}
> +
> +void filemap_offline_hwpoison_folio(struct address_space *mapping,
> + struct folio *folio)
> +{
> + WARN_ON_ONCE(!mapping);
> +
> + if (!folio_test_hwpoison(folio))
> + return;
> +
> + /* Pending MFR currently only exist for hugetlb. */
> + if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> + filemap_offline_hwpoison_folio_hugetlb(folio);
Shouldn't we also test here that we are dealing with hugepages first
before testing hugetlb_should_keep_hwpoison_mapped(folio, mapping) ?
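e.g. something like (sketch only):

	if (folio_test_hugetlb(folio) &&
	    hugetlb_should_keep_hwpoison_mapped(folio, mapping))
		filemap_offline_hwpoison_folio_hugetlb(folio);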
> +}
> +
> /*
> * Taking refcount of hugetlb pages needs extra care about race conditions
> * with basic operations like hugepage allocation/free/demotion.
Don't we also need to take into account the repeated errors in
try_memory_failure_hugetlb() ?
Something like that:
@@ -2036,9 +2099,10 @@ static int try_memory_failure_hugetlb(unsigned
long pfn, int flags, int *hugetlb
{
int res, rv;
struct page *p = pfn_to_page(pfn);
- struct folio *folio;
+ struct folio *folio = page_folio(p);
unsigned long page_flags;
bool migratable_cleared = false;
+ struct address_space *mapping = folio_mapping(folio);
*hugetlb = 1;
retry:
@@ -2060,15 +2124,17 @@ static int try_memory_failure_hugetlb(unsigned
long pfn, int flags, int *hugetlb
rv = kill_accessing_process(current, pfn, flags);
if (res == MF_HUGETLB_PAGE_PRE_POISONED)
action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
- else
+ else {
+ if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
+ return action_result(pfn, MF_MSG_UNMAP_FAILED, MF_DELAYED);
action_result(pfn, MF_MSG_HUGE, MF_FAILED);
+ }
return rv;
default:
WARN_ON((res != MF_HUGETLB_FREED) && (res != MF_HUGETLB_IN_USED));
break;
}
- folio = page_folio(p);
folio_lock(folio);
if (hwpoison_filter(p)) {
So that we don't call action_result(pfn, MF_MSG_HUGE, MF_FAILED); for a
repeated error ?
--
2.47.3
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3 2/3] selftests/mm: test userspace MFR for HugeTLB hugepage
2026-02-03 19:23 ` [PATCH v3 2/3] selftests/mm: test userspace MFR for HugeTLB hugepage Jiaqi Yan
@ 2026-02-04 17:53 ` William Roche
2026-02-12 3:11 ` Jiaqi Yan
2026-02-09 12:01 ` Miaohe Lin
1 sibling, 1 reply; 15+ messages in thread
From: William Roche @ 2026-02-04 17:53 UTC (permalink / raw)
To: Jiaqi Yan, linmiaohe, harry.yoo, jane.chu
Cc: nao.horiguchi, tony.luck, wangkefeng.wang, willy, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
linux-kernel, linux-fsdevel
On 2/3/26 20:23, Jiaqi Yan wrote:
> Test the userspace memory failure recovery (MFR) policy for HugeTLB:
>
> 1. Create a memfd backed by HugeTLB with MFD_MF_KEEP_UE_MAPPED set.
>
> 2. Allocate and map 4 hugepages to the process.
>
> 3. Create sub-threads to MADV_HWPOISON inner addresses of the 1st hugepage.
>
> 4. Check if the process gets correct SIGBUS for each poisoned raw page.
>
> 5. Check if all memory is still accessible and content is valid.
>
> 6. Check if the poisoned hugepage is dealt with after memfd released.
>
> Two configurables in the test:
>
> - hugepage_size: size of the hugepage, 1G or 2M.
>
> - nr_hwp_pages: number of pages within the 1st hugepage to MADV_HWPOISON.
In this version, you are introducing this new test argument
"nr_hwp_pages" to indicate how many of the pre-defined offsets we want
to poison inside the hugepage (between 1 and 8).
But is there any advantage to give the choice to the user instead of
testing them all ?
As a suggestion, should we have this test program setting or verifying
the minimal number of hugepages of the right type, instead of relying on
the user to set them manually ?
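For the reservation part, maybe something along these lines in the test
setup (just a sketch; the helper names are made up and the sysfs path
depends on the hugepage size under test):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static long read_ul(const char *path)
{
	char buf[64];
	long val = -1;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	n = read(fd, buf, sizeof(buf) - 1);
	if (n > 0) {
		buf[n] = '\0';
		val = atol(buf);
	}
	close(fd);
	return val;
}

/* Reserve at least "need" hugepages of the given size; 0 on success. */
static int reserve_hugepages(unsigned long size_kb, long need)
{
	char path[256], buf[64];
	int fd;

	snprintf(path, sizeof(path),
		 "/sys/kernel/mm/hugepages/hugepages-%lukB/nr_hugepages",
		 size_kb);
	if (read_ul(path) >= need)
		return 0;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;	/* size not supported here: skip the test */
	snprintf(buf, sizeof(buf), "%ld", need);
	write(fd, buf, strlen(buf));
	close(fd);

	/* verify the kernel could actually grow the pool */
	return read_ul(path) >= need ? 0 : -1;
}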
And at the end, should we try to unpoison the impacted pages ? So that
the lab machine where the tests run can continue to use all its memory ?
Thanks for your feedback,
William.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3 3/3] Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED
2026-02-03 19:23 ` [PATCH v3 3/3] Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED Jiaqi Yan
@ 2026-02-04 17:56 ` William Roche
0 siblings, 0 replies; 15+ messages in thread
From: William Roche @ 2026-02-04 17:56 UTC (permalink / raw)
To: Jiaqi Yan, linmiaohe, harry.yoo, jane.chu
Cc: nao.horiguchi, tony.luck, wangkefeng.wang, willy, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
linux-kernel, linux-fsdevel
On 2/3/26 20:23, Jiaqi Yan wrote:
> Document its motivation, userspace API, behaviors, and limitations.
>
> Reviewed-by: Jane Chu <jane.chu@oracle.com>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
> Documentation/userspace-api/index.rst | 1 +
> .../userspace-api/mfd_mfr_policy.rst | 60 +++++++++++++++++++
> 2 files changed, 61 insertions(+)
> create mode 100644 Documentation/userspace-api/mfd_mfr_policy.rst
>
> diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> index 8a61ac4c1bf19..6d8d94028a6cd 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -68,6 +68,7 @@ Everything else
> futex2
> perf_ring_buffer
> ntsync
> + mfd_mfr_policy
>
> .. only:: subproject and html
>
> diff --git a/Documentation/userspace-api/mfd_mfr_policy.rst b/Documentation/userspace-api/mfd_mfr_policy.rst
> new file mode 100644
> index 0000000000000..c5a25df39791a
> --- /dev/null
> +++ b/Documentation/userspace-api/mfd_mfr_policy.rst
> @@ -0,0 +1,60 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==================================================
> +Userspace Memory Failure Recovery Policy via memfd
> +==================================================
> +
> +:Author:
> + Jiaqi Yan <jiaqiyan@google.com>
> +
> +
> +Motivation
> +==========
> +
> +When a userspace process is able to recover from memory failures (MF)
> +caused by uncorrected memory error (UE) in the DIMM, especially when it is
> +able to avoid consuming known UEs, keeping the memory page mapped and
> +accessible is beneficial to the owning process for a couple of reasons:
> +
> +- The memory pages affected by UE have a large smallest granularity, for
> + example 1G hugepage, but the actual corrupted amount of the page is only
> + several cachelines. Losing the entire hugepage of data is unacceptable to
> + the application.
> +
> +- In addition to keeping the data accessible, the application still wants
> + to access with a large page size for the fastest virtual-to-physical
> + translations.
> +
> +Memory failure recovery for 1G or larger HugeTLB is a good example. With
> +memfd, a userspace process can control whether the kernel hard offlines the
> +hugepages that back the in-RAM file created by memfd.
> +
> +
> +User API
> +========
> +
> +``int memfd_create(const char *name, unsigned int flags)``
> +
> +``MFD_MF_KEEP_UE_MAPPED``
> +
> + When ``MFD_MF_KEEP_UE_MAPPED`` bit is set in ``flags``, MF recovery
> + in the kernel does not hard offline memory due to UE until the
> + returned ``memfd`` is released. IOW, the HWPoison-ed memory remains
> + accessible via the returned ``memfd`` or the memory mapping created
> + with the returned ``memfd``. Note the affected memory will be
> + immediately isolated and prevented from future use once the memfd
> + is closed. By default ``MFD_MF_KEEP_UE_MAPPED`` is not set, and
> + kernel hard offlines memory having UEs.
> +
> +Notes about the behavior and limitations
> +
> +- Even if the page affected by UE is kept, a portion of the (huge)page is
> + already lost due to hardware corruption, and the size of the portion
> + is the smallest page size that the kernel uses to manage memory on the
> + architecture, i.e. PAGESIZE. Accessing a virtual address within any of
> + these parts results in a SIGBUS; accessing virtual addresses outside these
> + parts is fine until they are corrupted by a new memory error.
> +
> +- ``MFD_MF_KEEP_UE_MAPPED`` currently only works for HugeTLB, so
> + ``MFD_HUGETLB`` must also be set when setting ``MFD_MF_KEEP_UE_MAPPED``.
> + Otherwise ``memfd_create`` returns EINVAL.
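Maybe also worth a short usage example in this file; something like
(sketch only, assuming the flag definitions from this series are visible
to userspace):

	#define _GNU_SOURCE
	#include <err.h>
	#include <linux/memfd.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 1UL << 30;	/* one 1G hugepage */
		int fd = memfd_create("guest-mem", MFD_HUGETLB | MFD_HUGE_1GB |
						   MFD_MF_KEEP_UE_MAPPED);
		void *mem;

		if (fd < 0)
			err(1, "memfd_create");
		if (ftruncate(fd, len))
			err(1, "ftruncate");
		mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (mem == MAP_FAILED)
			err(1, "mmap");
		/*
		 * On a UE the process gets SIGBUS with si_addr_lsb reduced to
		 * PAGE_SHIFT, but the rest of the hugepage stays mapped.
		 */
		return 0;
	}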
Reviewed-by: William Roche <william.roche@oracle.com>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy
2026-02-03 19:23 ` [PATCH v3 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy Jiaqi Yan
2026-02-04 17:29 ` William Roche
@ 2026-02-09 11:54 ` Miaohe Lin
2026-02-10 4:47 ` Jiaqi Yan
1 sibling, 1 reply; 15+ messages in thread
From: Miaohe Lin @ 2026-02-09 11:54 UTC (permalink / raw)
To: Jiaqi Yan
Cc: nao.horiguchi, tony.luck, wangkefeng.wang, willy, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
linux-kernel, linux-fsdevel, william.roche, harry.yoo, jane.chu
On 2026/2/4 3:23, Jiaqi Yan wrote:
> Sometimes immediately hard offlining a large chunk of contiguous memory
> having uncorrected memory errors (UE) may not be the best option.
> Cloud providers usually serve capacity- and performance-critical guest
> memory with 1G HugeTLB hugepages, as this significantly reduces the
> overhead associated with managing page tables and TLB misses. However,
> for today's HugeTLB system, once a byte of memory in a hugepage is
> hardware corrupted, the kernel discards the whole hugepage, including
> the healthy portion. Customer workload running in the VM can hardly
> recover from such a great loss of memory.
Thanks for your patch. Some questions below.
>
> Therefore keeping or discarding a large chunk of contiguous memory
> owned by userspace (particularly to serve guest memory) due to
> recoverable UE may better be controlled by userspace process
> that owns the memory, e.g. VMM in the Cloud environment.
>
> Introduce a memfd-based userspace memory failure (MFR) policy,
> MFD_MF_KEEP_UE_MAPPED. It is possible to support other memfds,
> but the current implementation only covers HugeTLB.
>
> For a hugepage associated with MFD_MF_KEEP_UE_MAPPED enabled memfd,
> whenever it runs into a new UE,
>
> * MFR defers hard offline operations, i.e., unmapping and
So the folio can't be unpoisoned until hugetlb folio becomes free?
> dissolving. MFR still sets HWPoison flag, holds a refcount
> for every raw HWPoison page, record them in a list, sends SIGBUS
> to the consuming thread, but si_addr_lsb is reduced to PAGE_SHIFT.
> If userspace is able to handle the SIGBUS, the HWPoison hugepage
> remains accessible via the mapping created with that memfd.
>
> * If the memory was not faulted in yet, the fault handler also
> allows fault in the HWPoison folio.
>
> For a MFD_MF_KEEP_UE_MAPPED enabled memfd, when it is closed, or
> when userspace process truncates its hugepages:
>
> * When the HugeTLB in-memory file system removes the filemap's
> folios one by one, it asks MFR to deal with HWPoison folios
> on the fly, implemented by filemap_offline_hwpoison_folio().
>
> * MFR drops the refcounts being held for the raw HWPoison
> pages within the folio. Now that the HWPoison folio becomes
> free, MFR dissolves it into a set of raw pages. The healthy pages
> are recycled into buddy allocator, while the HWPoison ones are
> prevented from re-allocation.
>
...
>
> +static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
> +{
> + int ret;
> + struct llist_node *head;
> + struct raw_hwp_page *curr, *next;
> +
> + /*
> + * Since folio is still in the folio_batch, drop the refcount
> + * elevated by filemap_get_folios.
> + */
> + folio_put_refs(folio, 1);
> + head = llist_del_all(raw_hwp_list_head(folio));
We might race with get_huge_page_for_hwpoison()? llist_add() might be called
by folio_set_hugetlb_hwpoison() just after llist_del_all()?
> +
> + /*
> + * Release refcounts held by try_memory_failure_hugetlb, one per
> + * HWPoison-ed page in the raw hwp list.
> + *
> + * Set HWPoison flag on each page so that free_has_hwpoisoned()
> + * can exclude them during dissolve_free_hugetlb_folio().
> + */
> + llist_for_each_entry_safe(curr, next, head, node) {
> + folio_put(folio);
The hugetlb folio refcnt will only be increased once even if it contains multiple UE sub-pages.
See __get_huge_page_for_hwpoison() for details. So folio_put() might be called more times than
folio_try_get() in __get_huge_page_for_hwpoison().
> + SetPageHWPoison(curr->page);
If hugetlb folio vmemmap is optimized, I think SetPageHWPoison might trigger BUG.
> + kfree(curr);
> + }
The above logic is almost the same as folio_clear_hugetlb_hwpoison(). Maybe we can reuse that?
> +
> + /* Refcount now should be zero and ready to dissolve folio. */
> + ret = dissolve_free_hugetlb_folio(folio);
> + if (ret)
> + pr_err("failed to dissolve hugetlb folio: %d\n", ret);
> +}
> +
Thanks.
.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3 2/3] selftests/mm: test userspace MFR for HugeTLB hugepage
2026-02-03 19:23 ` [PATCH v3 2/3] selftests/mm: test userspace MFR for HugeTLB hugepage Jiaqi Yan
2026-02-04 17:53 ` William Roche
@ 2026-02-09 12:01 ` Miaohe Lin
2026-02-12 3:17 ` Jiaqi Yan
1 sibling, 1 reply; 15+ messages in thread
From: Miaohe Lin @ 2026-02-09 12:01 UTC (permalink / raw)
To: Jiaqi Yan
Cc: nao.horiguchi, tony.luck, wangkefeng.wang, willy, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
linux-kernel, linux-fsdevel, william.roche, harry.yoo, jane.chu
On 2026/2/4 3:23, Jiaqi Yan wrote:
> Test the userspace memory failure recovery (MFR) policy for HugeTLB:
>
> 1. Create a memfd backed by HugeTLB with MFD_MF_KEEP_UE_MAPPED set.
>
> 2. Allocate and map 4 hugepages to the process.
>
> 3. Create sub-threads to MADV_HWPOISON inner addresses of the 1st hugepage.
>
> 4. Check if the process gets correct SIGBUS for each poisoned raw page.
>
> 5. Check if all memory is still accessible and content is valid.
>
> 6. Check if the poisoned hugepage is dealt with after memfd released.
>
> Two configurables in the test:
>
> - hugepage_size: size of the hugepage, 1G or 2M.
>
> - nr_hwp_pages: number of pages within the 1st hugepage to MADV_HWPOISON.
>
> Reviewed-by: Jane Chu <jane.chu@oracle.com>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
It's not required but could this testcase be written into the tools/testing/selftests/mm/memory-failure.c [1]?
[1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-unstable&id=cf2929c618fec0a22702b3abd0778bbdde6e458e
Thanks.
.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy
2026-02-04 17:29 ` William Roche
@ 2026-02-10 4:46 ` Jiaqi Yan
0 siblings, 0 replies; 15+ messages in thread
From: Jiaqi Yan @ 2026-02-10 4:46 UTC (permalink / raw)
To: William Roche
Cc: linmiaohe, harry.yoo, jane.chu, nao.horiguchi, tony.luck,
wangkefeng.wang, willy, akpm, osalvador, rientjes, duenwen,
jthoughton, jgg, ankita, peterx, sidhartha.kumar, ziy, david,
dave.hansen, muchun.song, linux-mm, linux-kernel, linux-fsdevel
On Wed, Feb 4, 2026 at 9:30 AM William Roche <william.roche@oracle.com> wrote:
>
> On 2/3/26 20:23, Jiaqi Yan wrote:
> > [...]
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index 3b4c152c5c73a..8b0f5aa49711f 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -551,6 +551,18 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
> > }
> >
> > folio_unlock(folio);
> > +
> > + /*
> > + * There may be pending HWPoison-ed folios when a memfd is being
> > + * removed or part of it is being truncated.
> > + *
> > + * HugeTLBFS' error_remove_folio keeps the HWPoison-ed folios in
> > + * page cache until mm wants to drop the folio at the end of the
> > + * of the filemap. At this point, if memory failure was delayed
>
> "of the" is repeated
>
> > + * by MFD_MF_KEEP_UE_MAPPED in the past, we can now deal with it.
> > + */
> > + filemap_offline_hwpoison_folio(mapping, folio);
> > +
> > return ret;
> > }
> >
> > @@ -582,13 +594,13 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> > const pgoff_t end = lend >> PAGE_SHIFT;
> > struct folio_batch fbatch;
> > pgoff_t next, index;
> > - int i, freed = 0;
> > + int i, j, freed = 0;
> > bool truncate_op = (lend == LLONG_MAX);
> >
> > folio_batch_init(&fbatch);
> > next = lstart >> PAGE_SHIFT;
> > while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
> > - for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> > + for (i = 0, j = 0; i < folio_batch_count(&fbatch); ++i) {
> > struct folio *folio = fbatch.folios[i];
> > u32 hash = 0;
> >
> > @@ -603,8 +615,17 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> > index, truncate_op))
> > freed++;
> >
> > + /*
> > + * Skip HWPoison-ed hugepages, which should no
> > + * longer be hugetlb if successfully dissolved.
> > + */
> > + if (folio_test_hugetlb(folio))
> > + fbatch.folios[j++] = folio;
> > +
> > mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> > }
> > + fbatch.nr = j;
> > +
> > folio_batch_release(&fbatch);
> > cond_resched();
> > }
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index e51b8ef0cebd9..7fadf1772335d 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -879,10 +879,17 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn,
> >
> > #ifdef CONFIG_MEMORY_FAILURE
> > extern void folio_clear_hugetlb_hwpoison(struct folio *folio);
> > +extern bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
> > + struct address_space *mapping);
> > #else
> > static inline void folio_clear_hugetlb_hwpoison(struct folio *folio)
> > {
> > }
> > +static inline bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio
>
> comma is missing
>
> > + struct address_space *mapping)
> > +{
> > + return false;
> > +}
> > #endif
> >
> > #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index ec442af3f8861..53772c29451eb 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -211,6 +211,7 @@ enum mapping_flags {
> > AS_KERNEL_FILE = 10, /* mapping for a fake kernel file that shouldn't
> > account usage to user cgroups */
> > AS_NO_DATA_INTEGRITY = 11, /* no data integrity guarantees */
> > + AS_MF_KEEP_UE_MAPPED = 12, /* For MFD_MF_KEEP_UE_MAPPED. */
> > /* Bits 16-25 are used for FOLIO_ORDER */
> > AS_FOLIO_ORDER_BITS = 5,
> > AS_FOLIO_ORDER_MIN = 16,
> > @@ -356,6 +357,16 @@ static inline bool mapping_no_data_integrity(const struct address_space *mapping
> > return test_bit(AS_NO_DATA_INTEGRITY, &mapping->flags);
> > }
> >
> > +static inline bool mapping_mf_keep_ue_mapped(const struct address_space *mapping)
> > +{
> > + return test_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
> > +}
> > +
> > +static inline void mapping_set_mf_keep_ue_mapped(struct address_space *mapping)
> > +{
> > + set_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
> > +}
> > +
> > static inline gfp_t mapping_gfp_mask(const struct address_space *mapping)
> > {
> > return mapping->gfp_mask;
> > @@ -1303,6 +1314,18 @@ void replace_page_cache_folio(struct folio *old, struct folio *new);
> > void delete_from_page_cache_batch(struct address_space *mapping,
> > struct folio_batch *fbatch);
> > bool filemap_release_folio(struct folio *folio, gfp_t gfp);
> > +#ifdef CONFIG_MEMORY_FAILURE
> > +/*
> > + * Provided by memory failure to offline HWPoison-ed folio managed by memfd.
> > + */
> > +void filemap_offline_hwpoison_folio(struct address_space *mapping,
> > + struct folio *folio);
> > +#else
> > +static inline void filemap_offline_hwpoison_folio(struct address_space *mapping,
> > + struct folio *folio)
> > +{
> > +}
> > +#endif
> > loff_t mapping_seek_hole_data(struct address_space *, loff_t start, loff_t end,
> > int whence);
> >
> > diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> > index 273a4e15dfcff..d9875da551b7f 100644
> > --- a/include/uapi/linux/memfd.h
> > +++ b/include/uapi/linux/memfd.h
> > @@ -12,6 +12,12 @@
> > #define MFD_NOEXEC_SEAL 0x0008U
> > /* executable */
> > #define MFD_EXEC 0x0010U
> > +/*
> > + * Keep owned folios mapped when uncorrectable memory errors (UE) causes
> > + * memory failure (MF) within the folio. Only at the end of the mapping
> > + * will its HWPoison-ed folios be dealt with.
> > + */
> > +#define MFD_MF_KEEP_UE_MAPPED 0x0020U
> >
> > /*
> > * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index a1832da0f6236..2a161c281da2a 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5836,9 +5836,11 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
> > * So we need to block hugepage fault by PG_hwpoison bit check.
> > */
> > if (unlikely(folio_test_hwpoison(folio))) {
> > - ret = VM_FAULT_HWPOISON_LARGE |
> > - VM_FAULT_SET_HINDEX(hstate_index(h));
> > - goto backout_unlocked;
> > + if (!mapping_mf_keep_ue_mapped(mapping)) {
> > + ret = VM_FAULT_HWPOISON_LARGE |
> > + VM_FAULT_SET_HINDEX(hstate_index(h));
> > + goto backout_unlocked;
> > + }
> > }
> >
> > /* Check for page in userfault range. */
> > diff --git a/mm/memfd.c b/mm/memfd.c
> > index ab5312aff14b9..f9fdf014b67ba 100644
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -340,7 +340,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
> > #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> > #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
> >
> > -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
> > +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> > + MFD_NOEXEC_SEAL | MFD_EXEC | MFD_MF_KEEP_UE_MAPPED)
> >
> > static int check_sysctl_memfd_noexec(unsigned int *flags)
> > {
> > @@ -414,6 +415,8 @@ static int sanitize_flags(unsigned int *flags_ptr)
> > if (!(flags & MFD_HUGETLB)) {
> > if (flags & ~MFD_ALL_FLAGS)
> > return -EINVAL;
> > + if (flags & MFD_MF_KEEP_UE_MAPPED)
> > + return -EINVAL;
> > } else {
> > /* Allow huge page size encoding in flags. */
> > if (flags & ~(MFD_ALL_FLAGS |
> > @@ -486,6 +489,16 @@ static struct file *alloc_file(const char *name, unsigned int flags)
> > file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> > file->f_flags |= O_LARGEFILE;
> >
> > + /*
> > + * MFD_MF_KEEP_UE_MAPPED can only be specified in memfd_create;
> > + * no API to update it once memfd is created. MFD_MF_KEEP_UE_MAPPED
> > + * is not seal-able.
> > + *
> > + * For now MFD_MF_KEEP_UE_MAPPED is only supported by HugeTLBFS.
> > + */
> > + if (flags & MFD_MF_KEEP_UE_MAPPED)
> > + mapping_set_mf_keep_ue_mapped(file->f_mapping);
> > +
> > if (flags & MFD_NOEXEC_SEAL) {
> > inode->i_mode &= ~0111;
> > file_seals = memfd_file_seals_ptr(file);
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 58b34f5d2c05d..b9cecbbe08dae 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -410,6 +410,8 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
> > unsigned long addr)
> > {
> > struct to_kill *tk;
> > + const struct folio *folio;
> > + struct address_space *mapping;
> >
> > tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
> > if (!tk) {
> > @@ -420,8 +422,19 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
> > tk->addr = addr;
> > if (is_zone_device_page(p))
> > tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
> > - else
> > - tk->size_shift = folio_shift(page_folio(p));
> > + else {
> > + folio = page_folio(p);
> > + mapping = folio_mapping(folio);
> > + if (mapping && mapping_mf_keep_ue_mapped(mapping))
> > + /*
> > + * Let userspace know the radius of HWPoison is
> > + * the size of raw page; accessing other pages
> > + * inside the folio is still ok.
> > + */
> > + tk->size_shift = PAGE_SHIFT;
> > + else
> > + tk->size_shift = folio_shift(folio);
> > + }
> >
> > /*
> > * Send SIGKILL if "tk->addr == -EFAULT". Also, as
> > @@ -844,6 +857,8 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
> > int flags)
> > {
> > int ret;
> > + struct folio *folio;
> > + struct address_space *mapping;
> > struct hwpoison_walk priv = {
> > .pfn = pfn,
> > };
> > @@ -861,8 +876,14 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
> > * ret = 0 when poison page is a clean page and it's dropped, no
> > * SIGBUS is needed.
> > */
> > - if (ret == 1 && priv.tk.addr)
> > + if (ret == 1 && priv.tk.addr) {
> > + folio = pfn_folio(pfn);
> > + mapping = folio_mapping(folio);
> > + if (mapping && mapping_mf_keep_ue_mapped(mapping))
> > + priv.tk.size_shift = PAGE_SHIFT;
> > +
> > kill_proc(&priv.tk, pfn, flags);
> > + }
> > mmap_read_unlock(p->mm);
> >
> > return ret > 0 ? -EHWPOISON : 0;
> > @@ -1206,6 +1227,13 @@ static int me_huge_page(struct page_state *ps, struct page *p)
> > }
> > }
> >
> > + /*
> > + * MF still needs to holds a refcount for the deferred actions in
>
> to hold (without the s)
>
> > + * filemap_offline_hwpoison_folio.
> > + */
> > + if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> > + return res;
> > +
> > if (has_extra_refcount(ps, p, extra_pins))
> > res = MF_FAILED;
> >
> > @@ -1602,6 +1630,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
> > {
> > LIST_HEAD(tokill);
> > bool unmap_success;
> > + bool keep_mapped;
> > int forcekill;
> > bool mlocked = folio_test_mlocked(folio);
> >
> > @@ -1629,8 +1658,12 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
> > */
> > collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
> >
> > - unmap_success = !unmap_poisoned_folio(folio, pfn, flags & MF_MUST_KILL);
> > - if (!unmap_success)
> > + keep_mapped = hugetlb_should_keep_hwpoison_mapped(folio, folio->mapping);
>
> We should use folio_mapping(folio) instead of folio->mapping.
>
> But more importantly this function can be called on non hugepages
> folios, and hugetlb_should_keep_hwpoison_mapped() is warning (ONCE) in
> this case. So shouldn't the caller make sure that we are dealing with
> hugepages first ?
I guess the WARN_ON_ONCE() in hugetlb_should_keep_hwpoison_mapped() is
confusing. I want hugetlb_should_keep_hwpoison_mapped() to test and
return false for a non-hugepage folio. Let me remove the WARN_ON_ONCE().
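i.e. something like this for v4 (sketch):

bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
					 struct address_space *mapping)
{
	/* Pending MFR is only supported for hugetlb folios. */
	if (!folio_test_hugetlb(folio))
		return false;

	if (!mapping)
		return false;

	return mapping_mf_keep_ue_mapped(mapping);
}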
>
>
> > + if (!keep_mapped)
> > + unmap_poisoned_folio(folio, pfn, flags & MF_MUST_KILL);
> > +
> > + unmap_success = !folio_mapped(folio);
> > + if (!keep_mapped && !unmap_success)
> > pr_err("%#lx: failed to unmap page (folio mapcount=%d)\n",
> > pfn, folio_mapcount(folio));
> >
> > @@ -1655,7 +1688,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
> > !unmap_success;
> > kill_procs(&tokill, forcekill, pfn, flags);
> >
> > - return unmap_success;
> > + return unmap_success || keep_mapped;
> > }
> >
> > static int identify_page_state(unsigned long pfn, struct page *p,
> > @@ -1896,6 +1929,13 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
> > unsigned long count = 0;
> >
> > head = llist_del_all(raw_hwp_list_head(folio));
> > + /*
> > + * If filemap_offline_hwpoison_folio_hugetlb is handling this folio,
> > + * it has already taken off the head of the llist.
> > + */
> > + if (head == NULL)
> > + return 0;
> > +
> > llist_for_each_entry_safe(p, next, head, node) {
> > if (move_flag)
> > SetPageHWPoison(p->page);
> > @@ -1912,7 +1952,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
> > struct llist_head *head;
> > struct raw_hwp_page *raw_hwp;
> > struct raw_hwp_page *p;
> > - int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0;
> > + struct address_space *mapping = folio->mapping;
>
> Same here - We should use folio_mapping(folio) instead of folio->mapping.
>
> > + bool has_hwpoison = folio_test_set_hwpoison(folio);
> >
> > /*
> > * Once the hwpoison hugepage has lost reliable raw error info,
> > @@ -1931,8 +1972,15 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
> > if (raw_hwp) {
> > raw_hwp->page = page;
> > llist_add(&raw_hwp->node, head);
> > + if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> > + /*
> > + * A new raw HWPoison page. Don't return HWPOISON.
> > + * Error event will be counted in action_result().
> > + */
> > + return 0;
> > +
> > /* the first error event will be counted in action_result(). */
> > - if (ret)
> > + if (has_hwpoison)
> > num_poisoned_pages_inc(page_to_pfn(page));
> > } else {
> > /*
> > @@ -1947,7 +1995,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
> > */
> > __folio_free_raw_hwp(folio, false);
> > }
> > - return ret;
> > +
> > + return has_hwpoison ? -EHWPOISON : 0;
> > }
> >
> > static unsigned long folio_free_raw_hwp(struct folio *folio, bool move_flag)
> > @@ -1980,6 +2029,18 @@ void folio_clear_hugetlb_hwpoison(struct folio *folio)
> > folio_free_raw_hwp(folio, true);
> > }
> >
> > +bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
> > + struct address_space *mapping)
> > +{
> > + if (WARN_ON_ONCE(!folio_test_hugetlb(folio)))
> > + return false;
> > +
> > + if (!mapping)
> > + return false;
> > +
> > + return mapping_mf_keep_ue_mapped(mapping);
> > +}
>
> The definition of the above function should be wrapped in
> #ifdef CONFIG_MEMORY_FAILURE
> #endif
>
> > +
> > /*
> > * Called from hugetlb code with hugetlb_lock held.
> > *
> > @@ -2037,6 +2098,51 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
> > return ret;
> > }
> >
> > +static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
> > +{
> > + int ret;
> > + struct llist_node *head;
> > + struct raw_hwp_page *curr, *next;
> > +
> > + /*
> > + * Since folio is still in the folio_batch, drop the refcount
> > + * elevated by filemap_get_folios.
> > + */
> > + folio_put_refs(folio, 1);
> > + head = llist_del_all(raw_hwp_list_head(folio));
> > +
> > + /*
> > + * Release refcounts held by try_memory_failure_hugetlb, one per
> > + * HWPoison-ed page in the raw hwp list.
> > + *
> > + * Set HWPoison flag on each page so that free_has_hwpoisoned()
> > + * can exclude them during dissolve_free_hugetlb_folio().
> > + */
> > + llist_for_each_entry_safe(curr, next, head, node) {
> > + folio_put(folio);
> > + SetPageHWPoison(curr->page);
> > + kfree(curr);
> > + }
> > +
> > + /* Refcount now should be zero and ready to dissolve folio. */
> > + ret = dissolve_free_hugetlb_folio(folio);
> > + if (ret)
> > + pr_err("failed to dissolve hugetlb folio: %d\n", ret);
> > +}
> > +
> > +void filemap_offline_hwpoison_folio(struct address_space *mapping,
> > + struct folio *folio)
> > +{
> > + WARN_ON_ONCE(!mapping);
> > +
> > + if (!folio_test_hwpoison(folio))
> > + return;
> > +
> > + /* Pending MFR currently only exist for hugetlb. */
> > + if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> > + filemap_offline_hwpoison_folio_hugetlb(folio);
>
> Shouldn't we also test here that we are dealing with hugepages first
> before testing hugetlb_should_keep_hwpoison_mapped(folio, mapping) ?
>
> > +}
> > +
> > /*
> > * Taking refcount of hugetlb pages needs extra care about race conditions
> > * with basic operations like hugepage allocation/free/demotion.
>
>
> Don't we also need to take into account the repeated errors in
> try_memory_failure_hugetlb() ?
Ah, looks like I haven't pulled the recent commit a148a2040191
("mm/memory-failure: fix missing ->mf_stats count in hugetlb poison").
When dealing with a new error in an already-HWPoison folio,
MFD_MF_KEEP_UE_MAPPED makes folio_set_hugetlb_hwpoison() return 0 (now
MF_HUGETLB_IN_USED for hugetlb_update_hwpoison()) so
__get_huge_page_for_hwpoison() can return 1/MF_HUGETLB_IN_USED. The
idea is to make try_memory_failure_hugetlb() just handle the new error
as a first-time poisoned in-use hugetlb page.
Of course for an old error __get_huge_page_for_hwpoison should return
MF_HUGETLB_PAGE_PRE_POISONED.
>
> Something like that:
>
> @@ -2036,9 +2099,10 @@ static int try_memory_failure_hugetlb(unsigned
> long pfn, int flags, int *hugetlb
> {
> int res, rv;
> struct page *p = pfn_to_page(pfn);
> - struct folio *folio;
> + struct folio *folio = page_folio(p);
> unsigned long page_flags;
> bool migratable_cleared = false;
> + struct address_space *mapping = folio_mapping(folio);
>
> *hugetlb = 1;
> retry:
> @@ -2060,15 +2124,17 @@ static int try_memory_failure_hugetlb(unsigned
> long pfn, int flags, int *hugetlb
> rv = kill_accessing_process(current, pfn, flags);
> if (res == MF_HUGETLB_PAGE_PRE_POISONED)
> action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
> - else
> + else {
> + if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> + return action_result(pfn, MF_MSG_UNMAP_FAILED, MF_DELAYED);
If hugetlb_update_hwpoison() returns MF_HUGETLB_IN_USED for
MFD_MF_KEEP_UE_MAPPED, then try_memory_failure_hugetlb() should
normally run to the end and report MF_MSG_HUGE + MF_RECOVERED.
> action_result(pfn, MF_MSG_HUGE, MF_FAILED);
> + }
> return rv;
> default:
> WARN_ON((res != MF_HUGETLB_FREED) && (res != MF_HUGETLB_IN_USED));
> break;
> }
>
> - folio = page_folio(p);
> folio_lock(folio);
>
> if (hwpoison_filter(p)) {
>
>
> So that we don't call action_result(pfn, MF_MSG_HUGE, MF_FAILED); for a
> repeated error ?
>
>
> --
> 2.47.3
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy
2026-02-09 11:54 ` Miaohe Lin
@ 2026-02-10 4:47 ` Jiaqi Yan
2026-02-10 7:31 ` Miaohe Lin
0 siblings, 1 reply; 15+ messages in thread
From: Jiaqi Yan @ 2026-02-10 4:47 UTC (permalink / raw)
To: Miaohe Lin
Cc: nao.horiguchi, tony.luck, wangkefeng.wang, willy, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
linux-kernel, linux-fsdevel, william.roche, harry.yoo, jane.chu
On Mon, Feb 9, 2026 at 3:54 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2026/2/4 3:23, Jiaqi Yan wrote:
> > Sometimes immediately hard offlining a large chunk of contiguous memory
> > having uncorrected memory errors (UE) may not be the best option.
> > Cloud providers usually serve capacity- and performance-critical guest
> > memory with 1G HugeTLB hugepages, as this significantly reduces the
> > overhead associated with managing page tables and TLB misses. However,
> > for today's HugeTLB system, once a byte of memory in a hugepage is
> > hardware corrupted, the kernel discards the whole hugepage, including
> > the healthy portion. Customer workload running in the VM can hardly
> > recover from such a great loss of memory.
>
> Thanks for your patch. Some questions below.
>
> >
> > Therefore keeping or discarding a large chunk of contiguous memory
> > owned by userspace (particularly to serve guest memory) due to
> > recoverable UE may better be controlled by userspace process
> > that owns the memory, e.g. VMM in the Cloud environment.
> >
> > Introduce a memfd-based userspace memory failure (MFR) policy,
> > MFD_MF_KEEP_UE_MAPPED. It is possible to support other memfds,
> > but the current implementation only covers HugeTLB.
> >
> > For a hugepage associated with MFD_MF_KEEP_UE_MAPPED enabled memfd,
> > whenever it runs into a new UE,
> >
> > * MFR defers hard offline operations, i.e., unmapping and
>
> So the folio can't be unpoisoned until hugetlb folio becomes free?
Are you asking from testing perspective, are we still able to clean up
injected test errors via unpoison_memory() with MFD_MF_KEEP_UE_MAPPED?
If so, unpoison_memory() can't turn the HWPoison hugetlb page to
normal hugetlb page as MFD_MF_KEEP_UE_MAPPED automatically dissolves
it. unpoison_memory(pfn) can probably still turn the HWPoison raw page
back to a normal one, but you already lost the hugetlb page.
>
> > dissolving. MFR still sets HWPoison flag, holds a refcount
> > for every raw HWPoison page, record them in a list, sends SIGBUS
> > to the consuming thread, but si_addr_lsb is reduced to PAGE_SHIFT.
> > If userspace is able to handle the SIGBUS, the HWPoison hugepage
> > remains accessible via the mapping created with that memfd.
> >
> > * If the memory was not faulted in yet, the fault handler also
> > allows fault in the HWPoison folio.
> >
> > For a MFD_MF_KEEP_UE_MAPPED enabled memfd, when it is closed, or
> > when userspace process truncates its hugepages:
> >
> > * When the HugeTLB in-memory file system removes the filemap's
> > folios one by one, it asks MFR to deal with HWPoison folios
> > on the fly, implemented by filemap_offline_hwpoison_folio().
> >
> > * MFR drops the refcounts being held for the raw HWPoison
> > pages within the folio. Now that the HWPoison folio becomes
> > free, MFR dissolves it into a set of raw pages. The healthy pages
> > are recycled into buddy allocator, while the HWPoison ones are
> > prevented from re-allocation.
> >
> ...
>
> >
> > +static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
> > +{
> > + int ret;
> > + struct llist_node *head;
> > + struct raw_hwp_page *curr, *next;
> > +
> > + /*
> > + * Since folio is still in the folio_batch, drop the refcount
> > + * elevated by filemap_get_folios.
> > + */
> > + folio_put_refs(folio, 1);
> > + head = llist_del_all(raw_hwp_list_head(folio));
>
> We might race with get_huge_page_for_hwpoison()? llist_add() might be called
> by folio_set_hugetlb_hwpoison() just after llist_del_all()?
Oh, when there is a new UE while we releasing the folio here, right?
In that case, would mutex_lock(&mf_mutex) eliminate potential race?
>
> > +
> > + /*
> > + * Release refcounts held by try_memory_failure_hugetlb, one per
> > + * HWPoison-ed page in the raw hwp list.
> > + *
> > + * Set HWPoison flag on each page so that free_has_hwpoisoned()
> > + * can exclude them during dissolve_free_hugetlb_folio().
> > + */
> > + llist_for_each_entry_safe(curr, next, head, node) {
> > + folio_put(folio);
>
> The hugetlb folio refcnt will only be increased once even if it contains multiple UE sub-pages.
> See __get_huge_page_for_hwpoison() for details. So folio_put() might be called more times than
> folio_try_get() in __get_huge_page_for_hwpoison().
The changes in folio_set_hugetlb_hwpoison() should make
__get_huge_page_for_hwpoison() not take the "out" path, which
decreases the elevated refcount on the folio. IOW, every time a new UE
happens, we handle the hugetlb page as if it is an in-use hugetlb
page.
>
> > + SetPageHWPoison(curr->page);
>
> If hugetlb folio vmemmap is optimized, I think SetPageHWPoison might trigger BUG.
Ah, I see, vmemmap optimization doesn't allow us to move flags from
raw_hwp_list to tail pages. I guess the best I can do is to bail out
if vmemmap is enabled like folio_clear_hugetlb_hwpoison().
>
> > + kfree(curr);
> > + }
>
> The above logic is almost the same as folio_clear_hugetlb_hwpoison(). Maybe we can reuse that?
Will give it a try.
>
> > +
> > + /* Refcount now should be zero and ready to dissolve folio. */
> > + ret = dissolve_free_hugetlb_folio(folio);
> > + if (ret)
> > + pr_err("failed to dissolve hugetlb folio: %d\n", ret);
> > +}
> > +
>
> Thanks.
> .
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy
2026-02-10 4:47 ` Jiaqi Yan
@ 2026-02-10 7:31 ` Miaohe Lin
2026-02-13 5:01 ` Jiaqi Yan
0 siblings, 1 reply; 15+ messages in thread
From: Miaohe Lin @ 2026-02-10 7:31 UTC (permalink / raw)
To: Jiaqi Yan
Cc: nao.horiguchi, tony.luck, wangkefeng.wang, willy, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
linux-kernel, linux-fsdevel, william.roche, harry.yoo, jane.chu
On 2026/2/10 12:47, Jiaqi Yan wrote:
> On Mon, Feb 9, 2026 at 3:54 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> On 2026/2/4 3:23, Jiaqi Yan wrote:
>>> Sometimes immediately hard offlining a large chunk of contiguous memory
>>> having uncorrected memory errors (UE) may not be the best option.
>>> Cloud providers usually serve capacity- and performance-critical guest
>>> memory with 1G HugeTLB hugepages, as this significantly reduces the
>>> overhead associated with managing page tables and TLB misses. However,
>>> for today's HugeTLB system, once a byte of memory in a hugepage is
>>> hardware corrupted, the kernel discards the whole hugepage, including
>>> the healthy portion. Customer workload running in the VM can hardly
>>> recover from such a great loss of memory.
>>
>> Thanks for your patch. Some questions below.
>>
>>>
>>> Therefore keeping or discarding a large chunk of contiguous memory
>>> owned by userspace (particularly to serve guest memory) due to
>>> recoverable UE may better be controlled by userspace process
>>> that owns the memory, e.g. VMM in the Cloud environment.
>>>
>>> Introduce a memfd-based userspace memory failure (MFR) policy,
>>> MFD_MF_KEEP_UE_MAPPED. It is possible to support other memfds,
>>> but the current implementation only covers HugeTLB.
>>>
>>> For a hugepage associated with MFD_MF_KEEP_UE_MAPPED enabled memfd,
>>> whenever it runs into a new UE,
>>>
>>> * MFR defers hard offline operations, i.e., unmapping and
>>
>> So the folio can't be unpoisoned until hugetlb folio becomes free?
>
> Are you asking from testing perspective, are we still able to clean up
> injected test errors via unpoison_memory() with MFD_MF_KEEP_UE_MAPPED?
>
> If so, unpoison_memory() can't turn the HWPoison hugetlb page to
> normal hugetlb page as MFD_MF_KEEP_UE_MAPPED automatically dissolves
We might lose some testability but that should be an acceptable compromise.
> it. unpoison_memory(pfn) can probably still turn the HWPoison raw page
> back to a normal one, but you already lost the hugetlb page.
>
>>
>>> dissolving. MFR still sets HWPoison flag, holds a refcount
>>> for every raw HWPoison page, record them in a list, sends SIGBUS
>>> to the consuming thread, but si_addr_lsb is reduced to PAGE_SHIFT.
>>> If userspace is able to handle the SIGBUS, the HWPoison hugepage
>>> remains accessible via the mapping created with that memfd.
>>>
>>> * If the memory was not faulted in yet, the fault handler also
>>> allows fault in the HWPoison folio.
>>>
>>> For a MFD_MF_KEEP_UE_MAPPED enabled memfd, when it is closed, or
>>> when userspace process truncates its hugepages:
>>>
>>> * When the HugeTLB in-memory file system removes the filemap's
>>> folios one by one, it asks MFR to deal with HWPoison folios
>>> on the fly, implemented by filemap_offline_hwpoison_folio().
>>>
>>> * MFR drops the refcounts being held for the raw HWPoison
>>> pages within the folio. Now that the HWPoison folio becomes
>>> free, MFR dissolves it into a set of raw pages. The healthy pages
>>> are recycled into buddy allocator, while the HWPoison ones are
>>> prevented from re-allocation.
>>>
>> ...
>>
>>>
>>> +static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
>>> +{
>>> + int ret;
>>> + struct llist_node *head;
>>> + struct raw_hwp_page *curr, *next;
>>> +
>>> + /*
>>> + * Since folio is still in the folio_batch, drop the refcount
>>> + * elevated by filemap_get_folios.
>>> + */
>>> + folio_put_refs(folio, 1);
>>> + head = llist_del_all(raw_hwp_list_head(folio));
>>
>> We might race with get_huge_page_for_hwpoison()? llist_add() might be called
>> by folio_set_hugetlb_hwpoison() just after llist_del_all()?
>
> Oh, when there is a new UE while we releasing the folio here, right?
Right.
> In that case, would mutex_lock(&mf_mutex) eliminate potential race?
IMO spin_lock_irq(&hugetlb_lock) might be better.
>
>>
>>> +
>>> + /*
>>> + * Release refcounts held by try_memory_failure_hugetlb, one per
>>> + * HWPoison-ed page in the raw hwp list.
>>> + *
>>> + * Set HWPoison flag on each page so that free_has_hwpoisoned()
>>> + * can exclude them during dissolve_free_hugetlb_folio().
>>> + */
>>> + llist_for_each_entry_safe(curr, next, head, node) {
>>> + folio_put(folio);
>>
>> The hugetlb folio refcnt will only be increased once even if it contains multiple UE sub-pages.
>> See __get_huge_page_for_hwpoison() for details. So folio_put() might be called more times than
>> folio_try_get() in __get_huge_page_for_hwpoison().
>
> The changes in folio_set_hugetlb_hwpoison() should make
> __get_huge_page_for_hwpoison() not to take the "out" path which
> decrease the increased refcount for folio. IOW, every time a new UE
> happens, we handle the hugetlb page as if it is an in-use hugetlb
> page.
See below code snippet (comment [1] and [2]):
int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
bool *migratable_cleared)
{
struct page *page = pfn_to_page(pfn);
struct folio *folio = page_folio(page);
int ret = 2; /* fallback to normal page handling */
bool count_increased = false;
if (!folio_test_hugetlb(folio))
goto out;
if (flags & MF_COUNT_INCREASED) {
ret = 1;
count_increased = true;
} else if (folio_test_hugetlb_freed(folio)) {
ret = 0;
} else if (folio_test_hugetlb_migratable(folio)) {
^^^^*hugetlb_migratable is checked before trying to get folio refcnt* [1]
ret = folio_try_get(folio);
if (ret)
count_increased = true;
} else {
ret = -EBUSY;
if (!(flags & MF_NO_RETRY))
goto out;
}
if (folio_set_hugetlb_hwpoison(folio, page)) {
ret = -EHWPOISON;
goto out;
}
/*
* Clearing hugetlb_migratable for hwpoisoned hugepages to prevent them
* from being migrated by memory hotremove.
*/
if (count_increased && folio_test_hugetlb_migratable(folio)) {
folio_clear_hugetlb_migratable(folio);
^^^^^*hugetlb_migratable is cleared when first time seeing folio* [2]
*migratable_cleared = true;
}
Or am I missing something?
>
>>
>>> + SetPageHWPoison(curr->page);
>>
>> If hugetlb folio vmemmap is optimized, I think SetPageHWPoison might trigger BUG.
>
> Ah, I see, vmemmap optimization doesn't allow us to move flags from
> raw_hwp_list to tail pages. I guess the best I can do is to bail out
> if vmemmap is enabled like folio_clear_hugetlb_hwpoison().
I think you can do this after hugetlb_vmemmap_restore_folio() is called.
Thanks.
.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3 2/3] selftests/mm: test userspace MFR for HugeTLB hugepage
2026-02-04 17:53 ` William Roche
@ 2026-02-12 3:11 ` Jiaqi Yan
0 siblings, 0 replies; 15+ messages in thread
From: Jiaqi Yan @ 2026-02-12 3:11 UTC (permalink / raw)
To: William Roche
Cc: linmiaohe, harry.yoo, jane.chu, nao.horiguchi, tony.luck,
wangkefeng.wang, willy, akpm, osalvador, rientjes, duenwen,
jthoughton, jgg, ankita, peterx, sidhartha.kumar, ziy, david,
dave.hansen, muchun.song, linux-mm, linux-kernel, linux-fsdevel
On Wed, Feb 4, 2026 at 9:53 AM William Roche <william.roche@oracle.com> wrote:
>
>
> On 2/3/26 20:23, Jiaqi Yan wrote:
> > Test the userspace memory failure recovery (MFR) policy for HugeTLB:
> >
> > 1. Create a memfd backed by HugeTLB with MFD_MF_KEEP_UE_MAPPED set.
> >
> > 2. Allocate and map 4 hugepages to the process.
> >
> > 3. Create sub-threads to MADV_HWPOISON inner addresses of the 1st hugepage.
> >
> > 4. Check if the process gets correct SIGBUS for each poisoned raw page.
> >
> > 5. Check if all memory is still accessible and content is valid.
> >
> > 6. Check if the poisoned hugepage is dealt with after memfd released.
> >
> > Two configurables in the test:
> >
> > - hugepage_size: size of the hugepage, 1G or 2M.
> >
> > - nr_hwp_pages: number of pages within the 1st hugepage to MADV_HWPOISON.
> In this version, you are introducing this new test argument
> "nr_hwp_pages" to indicate how many of the pre-defined offsets we want
> to poison inside the hugepage (between 1 and 8).
> But is there any advantage to give the choice to the user instead of
> testing them all ?
Yeah, nr_hwp_pages doesn't seem very useful. It was useful when I wanted
the test to run in two modes: single page vs multiple pages. Let me
just make the test poison all 8 pages.
>
> As a suggestion, should we have this test program setting or verifying
> the minimal number of hugepages of the right type, instead of relying on
> the user to set them manually ?
Yeah, I agree, we can just run it for the 2 different hugepage sizes.
> And at the end, should we try to unpoison the impacted pages ? So that
> the lab machine where the tests run can continue to use all its memory ?
Good point, we can include this as a test cleanup step.
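Something along these lines for the cleanup (sketch; it assumes root,
CONFIG_HWPOISON_INJECT for the unpoison-pfn debugfs file, and the helper
names are made up):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Look up the pfn backing vaddr while it is still mapped. */
static uint64_t vaddr_to_pfn(void *vaddr)
{
	uint64_t ent = 0;
	long psz = sysconf(_SC_PAGESIZE);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd >= 0) {
		pread(fd, &ent, sizeof(ent),
		      ((uintptr_t)vaddr / psz) * sizeof(ent));
		close(fd);
	}
	return ent & ((1ULL << 55) - 1);	/* bits 0-54 hold the pfn */
}

/* After the memfd is released, hand the raw page back. */
static void unpoison_pfn(uint64_t pfn)
{
	char buf[32];
	int fd = open("/sys/kernel/debug/hwpoison/unpoison-pfn", O_WRONLY);

	if (fd < 0)
		return;	/* debugfs not available: leave it poisoned */
	snprintf(buf, sizeof(buf), "%llu", (unsigned long long)pfn);
	write(fd, buf, strlen(buf));
	close(fd);
}

i.e. record the pfns of the poisoned raw pages before closing the memfd,
then write them to unpoison-pfn once the hugepage has been dissolved.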
>
> Thanks for your feedback,
> William.
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3 2/3] selftests/mm: test userspace MFR for HugeTLB hugepage
2026-02-09 12:01 ` Miaohe Lin
@ 2026-02-12 3:17 ` Jiaqi Yan
0 siblings, 0 replies; 15+ messages in thread
From: Jiaqi Yan @ 2026-02-12 3:17 UTC (permalink / raw)
To: Miaohe Lin
Cc: nao.horiguchi, tony.luck, wangkefeng.wang, willy, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
linux-kernel, linux-fsdevel, william.roche, harry.yoo, jane.chu
On Mon, Feb 9, 2026 at 4:01 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2026/2/4 3:23, Jiaqi Yan wrote:
> > Test the userspace memory failure recovery (MFR) policy for HugeTLB:
> >
> > 1. Create a memfd backed by HugeTLB with MFD_MF_KEEP_UE_MAPPED set.
> >
> > 2. Allocate and map 4 hugepages to the process.
> >
> > 3. Create sub-threads to MADV_HWPOISON inner addresses of the 1st hugepage.
> >
> > 4. Check if the process gets correct SIGBUS for each poisoned raw page.
> >
> > 5. Check if all memory is still accessible and content is valid.
> >
> > 6. Check if the poisoned hugepage is dealt with after memfd released.
> >
> > Two configurables in the test:
> >
> > - hugepage_size: size of the hugepage, 1G or 2M.
> >
> > - nr_hwp_pages: number of pages within the 1st hugepage to MADV_HWPOISON.
> >
> > Reviewed-by: Jane Chu <jane.chu@oracle.com>
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
>
> It's not required but could this testcase be written into the tools/testing/selftests/mm/memory-failure.c [1]?
Good point, let me catch up with your new test fixtures and see what I
can do with this new test.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-unstable&id=cf2929c618fec0a22702b3abd0778bbdde6e458e
>
> Thanks.
> .
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy
2026-02-10 7:31 ` Miaohe Lin
@ 2026-02-13 5:01 ` Jiaqi Yan
0 siblings, 0 replies; 15+ messages in thread
From: Jiaqi Yan @ 2026-02-13 5:01 UTC (permalink / raw)
To: Miaohe Lin
Cc: nao.horiguchi, tony.luck, wangkefeng.wang, willy, akpm,
osalvador, rientjes, duenwen, jthoughton, jgg, ankita, peterx,
sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
linux-kernel, linux-fsdevel, william.roche, harry.yoo, jane.chu
On Mon, Feb 9, 2026 at 11:31 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2026/2/10 12:47, Jiaqi Yan wrote:
> > On Mon, Feb 9, 2026 at 3:54 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
> >>
> >> On 2026/2/4 3:23, Jiaqi Yan wrote:
> >>> Sometimes immediately hard offlining a large chunk of contiguous memory
> >>> having uncorrected memory errors (UE) may not be the best option.
> >>> Cloud providers usually serve capacity- and performance-critical guest
> >>> memory with 1G HugeTLB hugepages, as this significantly reduces the
> >>> overhead associated with managing page tables and TLB misses. However,
> >>> for today's HugeTLB system, once a byte of memory in a hugepage is
> >>> hardware corrupted, the kernel discards the whole hugepage, including
> >>> the healthy portion. Customer workload running in the VM can hardly
> >>> recover from such a great loss of memory.
> >>
> >> Thanks for your patch. Some questions below.
> >>
> >>>
> >>> Therefore keeping or discarding a large chunk of contiguous memory
> >>> owned by userspace (particularly to serve guest memory) due to
> >>> recoverable UE may better be controlled by userspace process
> >>> that owns the memory, e.g. VMM in the Cloud environment.
> >>>
> >>> Introduce a memfd-based userspace memory failure (MFR) policy,
> >>> MFD_MF_KEEP_UE_MAPPED. It is possible to support other memfds,
> >>> but the current implementation only covers HugeTLB.
> >>>
> >>> For a hugepage associated with MFD_MF_KEEP_UE_MAPPED enabled memfd,
> >>> whenever it runs into a new UE,
> >>>
> >>> * MFR defers hard offline operations, i.e., unmapping and
> >>
> >> So the folio can't be unpoisoned until hugetlb folio becomes free?
> >
> > Are you asking from testing perspective, are we still able to clean up
> > injected test errors via unpoison_memory() with MFD_MF_KEEP_UE_MAPPED?
> >
> > If so, unpoison_memory() can't turn the HWPoison hugetlb page to
> > normal hugetlb page as MFD_MF_KEEP_UE_MAPPED automatically dissolves
>
> We might lose some testability but that should be an acceptable compromise.
To clarify, looking at unpoison_memory(), it seems unpoison should
still work if called before truncation or before the memfd is closed.
What I wanted to say is: for my test hugetlb-mfr.c, since I really
want to test the cleanup code (dissolving a free hugepage having
multiple errors) after truncation or memfd close, we can only
unpoison the raw pages rejected by the buddy allocator.
>
> > it. unpoison_memory(pfn) can probably still turn the HWPoison raw page
> > back to a normal one, but you already lost the hugetlb page.
> >
> >>
> >>> dissolving. MFR still sets HWPoison flag, holds a refcount
> >>> for every raw HWPoison page, record them in a list, sends SIGBUS
> >>> to the consuming thread, but si_addr_lsb is reduced to PAGE_SHIFT.
> >>> If userspace is able to handle the SIGBUS, the HWPoison hugepage
> >>> remains accessible via the mapping created with that memfd.
> >>>
> >>> * If the memory was not faulted in yet, the fault handler also
> >>> allows fault in the HWPoison folio.
> >>>
> >>> For a MFD_MF_KEEP_UE_MAPPED enabled memfd, when it is closed, or
> >>> when userspace process truncates its hugepages:
> >>>
> >>> * When the HugeTLB in-memory file system removes the filemap's
> >>> folios one by one, it asks MFR to deal with HWPoison folios
> >>> on the fly, implemented by filemap_offline_hwpoison_folio().
> >>>
> >>> * MFR drops the refcounts being held for the raw HWPoison
> >>> pages within the folio. Now that the HWPoison folio becomes
> >>> free, MFR dissolves it into a set of raw pages. The healthy pages
> >>> are recycled into buddy allocator, while the HWPoison ones are
> >>> prevented from re-allocation.
> >>>
> >> ...
> >>
> >>>
> >>> +static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
> >>> +{
> >>> + int ret;
> >>> + struct llist_node *head;
> >>> + struct raw_hwp_page *curr, *next;
> >>> +
> >>> + /*
> >>> + * Since folio is still in the folio_batch, drop the refcount
> >>> + * elevated by filemap_get_folios.
> >>> + */
> >>> + folio_put_refs(folio, 1);
> >>> + head = llist_del_all(raw_hwp_list_head(folio));
> >>
> >> We might race with get_huge_page_for_hwpoison()? llist_add() might be called
> >> by folio_set_hugetlb_hwpoison() just after llist_del_all()?
> >
> > Oh, when there is a new UE while we releasing the folio here, right?
>
> Right.
>
> > In that case, would mutex_lock(&mf_mutex) eliminate potential race?
>
> IMO spin_lock_irq(&hugetlb_lock) might be better.
Looks like I don't need any lock given the correction below.
>
> >
> >>
> >>> +
> >>> + /*
> >>> + * Release refcounts held by try_memory_failure_hugetlb, one per
> >>> + * HWPoison-ed page in the raw hwp list.
> >>> + *
> >>> + * Set HWPoison flag on each page so that free_has_hwpoisoned()
> >>> + * can exclude them during dissolve_free_hugetlb_folio().
> >>> + */
> >>> + llist_for_each_entry_safe(curr, next, head, node) {
> >>> + folio_put(folio);
> >>
> >> The hugetlb folio refcnt will only be increased once even if it contains multiple UE sub-pages.
> >> See __get_huge_page_for_hwpoison() for details. So folio_put() might be called more times than
> >> folio_try_get() in __get_huge_page_for_hwpoison().
> >
> > The changes in folio_set_hugetlb_hwpoison() should make
> > __get_huge_page_for_hwpoison() not take the "out" path, which
> > decreases the elevated refcount on the folio. IOW, every time a new UE
> > happens, we handle the hugetlb page as if it is an in-use hugetlb
> > page.
>
> See below code snippet (comment [1] and [2]):
>
> int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
> bool *migratable_cleared)
> {
> struct page *page = pfn_to_page(pfn);
> struct folio *folio = page_folio(page);
> int ret = 2; /* fallback to normal page handling */
> bool count_increased = false;
>
> if (!folio_test_hugetlb(folio))
> goto out;
>
> if (flags & MF_COUNT_INCREASED) {
> ret = 1;
> count_increased = true;
> } else if (folio_test_hugetlb_freed(folio)) {
> ret = 0;
> } else if (folio_test_hugetlb_migratable(folio)) {
>
> ^^^^*hugetlb_migratable is checked before trying to get folio refcnt* [1]
>
> ret = folio_try_get(folio);
> if (ret)
> count_increased = true;
> } else {
> ret = -EBUSY;
> if (!(flags & MF_NO_RETRY))
> goto out;
> }
>
> if (folio_set_hugetlb_hwpoison(folio, page)) {
> ret = -EHWPOISON;
> goto out;
> }
>
> /*
> * Clearing hugetlb_migratable for hwpoisoned hugepages to prevent them
> * from being migrated by memory hotremove.
> */
> if (count_increased && folio_test_hugetlb_migratable(folio)) {
> folio_clear_hugetlb_migratable(folio);
>
> ^^^^^*hugetlb_migratable is cleared when first time seeing folio* [2]
>
> *migratable_cleared = true;
> }
>
> Or am I missing something?
Thanks for your explanation! You are absolutely right. It turns out
the extra refcount I saw (while running hugetlb-mfr.c) on the folio
at the moment of filemap_offline_hwpoison_folio_hugetlb() actually
comes from MF_COUNT_INCREASED during MADV_HWPOISON. I used to think it
was the effect of folio_try_get() in __get_huge_page_for_hwpoison(),
which was wrong. Now I see two cases:
- MADV_HWPOISON: instead of __get_huge_page_for_hwpoison(),
madvise_inject_error() is the one that increments the hugepage
refcount for every error injected. Different from the other cases,
MFD_MF_KEEP_UE_MAPPED keeps the hugepage an in-use page after
memory_failure(MF_COUNT_INCREASED), so I think madvise_inject_error()
should decrement it in the MFD_MF_KEEP_UE_MAPPED case (see the sketch
after this list).
- In the real world: as you pointed out, MF only increments the
hugepage refcount once in __get_huge_page_for_hwpoison(), even if the
folio runs into multiple errors. After
filemap_offline_hwpoison_folio_hugetlb() drops the refcount elevated
by filemap_get_folios(), it only needs to decrement once more if the
refcount is still nonzero. I tested something like below:
    /* Drop the refcount elevated by filemap_get_folios(). */
    folio_put(folio);
    /* Drop the single refcount memory_failure() may still hold. */
    if (folio_ref_count(folio))
        folio_put(folio);
    /* The refcount should now be zero, so the folio can be dissolved. */
    ret = dissolve_free_hugetlb_folio(folio);
Besides, the good news is that
filemap_offline_hwpoison_folio_hugetlb() no longer needs to touch
raw_hwp_list.
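The special-cased decrement for the MADV_HWPOISON case might look roughly
like the snippet below inside madvise_inject_error()'s injection loop (a
sketch only; folio_is_mf_keep_ue_mapped() is a hypothetical helper meaning
"backed by an MFD_MF_KEEP_UE_MAPPED memfd", and v4 may end up avoiding this
special case entirely):

	struct folio *folio = page_folio(page);

	ret = memory_failure(pfn, MF_COUNT_INCREASED | MF_SW_SIMULATED);
	/*
	 * With MFD_MF_KEEP_UE_MAPPED the hugepage stays in use after
	 * memory_failure(), so the reference taken for MF_COUNT_INCREASED
	 * is not consumed and has to be dropped here.
	 */
	if (!ret && folio_test_hugetlb(folio) &&
	    folio_is_mf_keep_ue_mapped(folio))
		folio_put(folio);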
>
> >
> >>
> >>> + SetPageHWPoison(curr->page);
> >>
> >> If hugetlb folio vmemmap is optimized, I think SetPageHWPoison might trigger BUG.
> >
> > Ah, I see, vmemmap optimization doesn't allow us to move flags from
> > the raw_hwp_list to the tail pages. I guess the best I can do is to
> > bail out if vmemmap optimization is enabled, like
> > folio_clear_hugetlb_hwpoison() does.
>
> I think you can do this after hugetlb_vmemmap_restore_folio() is called.
Since I can get rid of the wrong per-raw-HWPoison-page folio_put(), I
can just rely on dissolve_free_hugetlb_folio() to call
hugetlb_vmemmap_restore_folio() and reuse the
folio_clear_hugetlb_hwpoison() code to move the HWPoison flags to the
raw pages.
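Putting the pieces together, the revised helper might look roughly like
this (a sketch only, assuming dissolve_free_hugetlb_folio() restores the
vmemmap and moves the HWPoison flags to the raw pages along the way):

static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
{
	int ret;

	/* Drop the refcount elevated by filemap_get_folios(). */
	folio_put(folio);

	/*
	 * Drop the single refcount memory_failure() may still hold for this
	 * folio; it is taken at most once no matter how many raw HWPoison
	 * pages the folio contains.
	 */
	if (folio_ref_count(folio))
		folio_put(folio);

	/*
	 * The folio is now free. Dissolving it restores the vmemmap if it
	 * was optimized, moves the HWPoison flags to the raw pages, recycles
	 * the healthy pages, and keeps the HWPoison ones from re-allocation.
	 */
	ret = dissolve_free_hugetlb_folio(folio);
	if (ret)
		pr_err("%s: failed to dissolve hugetlb folio: %d\n",
		       __func__, ret);
}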
I will do some more testing while preparing v4, and will also see
whether I can avoid adding a special-cased folio_put() in
madvise_inject_error().
>
> Thanks.
> .
Thread overview: 15+ messages
2026-02-03 19:23 [PATCH v3 0/3] memfd-based Userspace MFR Policy for HugeTLB Jiaqi Yan
2026-02-03 19:23 ` [PATCH v3 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy Jiaqi Yan
2026-02-04 17:29 ` William Roche
2026-02-10 4:46 ` Jiaqi Yan
2026-02-09 11:54 ` Miaohe Lin
2026-02-10 4:47 ` Jiaqi Yan
2026-02-10 7:31 ` Miaohe Lin
2026-02-13 5:01 ` Jiaqi Yan
2026-02-03 19:23 ` [PATCH v3 2/3] selftests/mm: test userspace MFR for HugeTLB hugepage Jiaqi Yan
2026-02-04 17:53 ` William Roche
2026-02-12 3:11 ` Jiaqi Yan
2026-02-09 12:01 ` Miaohe Lin
2026-02-12 3:17 ` Jiaqi Yan
2026-02-03 19:23 ` [PATCH v3 3/3] Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED Jiaqi Yan
2026-02-04 17:56 ` William Roche