linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/3] memfd-based Userspace MFR Policy for HugeTLB
@ 2025-11-16  1:32 Jiaqi Yan
  2025-11-16  1:32 ` [PATCH v2 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy Jiaqi Yan
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Jiaqi Yan @ 2025-11-16  1:32 UTC (permalink / raw)
  To: nao.horiguchi, linmiaohe, william.roche, harry.yoo
  Cc: tony.luck, wangkefeng.wang, willy, jane.chu, akpm, osalvador,
	rientjes, duenwen, jthoughton, jgg, ankita, peterx,
	sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
	linux-kernel, linux-fsdevel, Jiaqi Yan

Problem
=======

This patchset is a follow-up to the userspace memory failure
recovery (MFR) policy proposed in [1] and [2], but focused on
a smaller scope: HugeTLB.

To recap the problem for HugeTLB discussed in [1] and [2]:
Cloud providers like Google and Oracle usually serve capacity-
and performance-critical guest memory with 1G HugeTLB
hugepages, as this significantly reduces the overhead
associated with managing page tables and TLB misses. However,
the kernel's current MFR behavior for HugeTLB is not ideal.
Once a byte of memory in a hugepage is hardware corrupted, the
kernel discards the whole hugepage, including the healthy
portion, from the HugeTLB system. Customer workloads running in
the VM can hardly recover from such a large loss of memory.

[1] and [2] proposed the idea that the decision to keep or
discard a large chunk of contiguous memory exclusively owned
by a userspace process, after a recoverable uncorrected
memory error (UE), should be controlled by userspace. What this
means in the Cloud case is that, since a virtual machine
monitor (VMM) has taken host memory to exclusively back the
guest memory for a VM, the VMM can keep holding the memory
even after memory errors occur.

MFD_MF_KEEP_UE_MAPPED for HugeTLB
=================================

[2] proposed a solution centered around the memfd associated
with the memory exclusively owned by userspace.

A userspace process must opt into the MFD_MF_KEEP_UE_MAPPED
policy when it creates a new HugeTLB-backed memfd:

  #define MFD_MF_KEEP_UE_MAPPED	0x0020U
  int memfd_create(const char *name, unsigned int flags);

For any hugepage associated with an MFD_MF_KEEP_UE_MAPPED-enabled
memfd, whenever it runs into a UE, MFR doesn't hard
offline the HWPoison-ed huge folio. In other words, the
HWPoison-ed memory remains accessible via the returned memfd
or the memory mapping created with that memfd. MFR still sends
SIGBUS to the userspace process as required. MFR also still
maintains HWPoison metadata on the hugepage having the UE.

A HWPoison-ed hugepage will be immediately isolated and
prevented from future allocation once userspace truncates it
via the memfd, or the owning memfd is closed.

By default MFD_MF_KEEP_UE_MAPPED is not set, and MFR hard
offlines hugepages having UEs.
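
A minimal sketch of how a VMM might opt in (the helper name, the
1G page size, and the error handling are illustrative assumptions;
the flag value comes from the uapi change in patch 1/3):

  #define _GNU_SOURCE
  #include <linux/memfd.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #ifndef MFD_MF_KEEP_UE_MAPPED
  #define MFD_MF_KEEP_UE_MAPPED 0x0020U /* from this series */
  #endif

  /* Illustrative helper, not part of this series. */
  static int create_guest_memory(size_t len, void **mem_out)
  {
          void *mem;
          int fd;

          /* Opt in at creation time; only valid together with MFD_HUGETLB. */
          fd = memfd_create("guest_mem",
                            MFD_HUGETLB | MFD_HUGE_1GB | MFD_MF_KEEP_UE_MAPPED);
          if (fd < 0)
                  return -1;

          if (ftruncate(fd, len) < 0)
                  goto err;

          /*
           * After a UE this mapping stays intact: the process gets SIGBUS
           * (si_addr_lsb == PAGE_SHIFT) but can keep using the healthy
           * parts. The poisoned hugepage is only isolated once the range
           * is truncated or the memfd is closed.
           */
          mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (mem == MAP_FAILED)
                  goto err;

          *mem_out = mem;
          return fd;
  err:
          close(fd);
          return -1;
  }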

Implementation
==============

Implementation is relatively straightforward with two major parts.

Part 1: When hugepages owned by an MFD_MF_KEEP_UE_MAPPED
enabled memfd run into a UE:

* MFR defers hard offline operations, i.e., unmapping and
  dissolving. MFR still sets HWPoison flags and holds a refcount
  for every raw HWPoison-ed page. MFR still sends SIGBUS to the
  consuming thread, but si_addr_lsb will be reduced to PAGE_SHIFT.
* If the memory was not faulted in yet, the fault handler also
  needs to unblock the fault to the HWPoison-ed folio.
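
For Part 1, here is a minimal sketch of the SIGBUS handling a
process could do (the handler and variable names are illustrative
assumptions; only the BUS_MCEERR_* codes and the si_addr_lsb
semantics come from the behavior described above):

  #define _GNU_SOURCE
  #include <signal.h>
  #include <stddef.h>

  /* Illustrative bookkeeping; not part of this series. */
  static void *poisoned_addr;
  static size_t poisoned_size;
  static volatile sig_atomic_t saw_mf_sigbus;

  static void mf_sigbus_handler(int signo, siginfo_t *info, void *ucontext)
  {
          /* Only memory-failure SIGBUS carries the BUS_MCEERR_* codes. */
          if (info->si_code != BUS_MCEERR_AR && info->si_code != BUS_MCEERR_AO)
                  return;

          /*
           * With MFD_MF_KEEP_UE_MAPPED the blast radius reported to
           * userspace is one raw page: si_addr_lsb is PAGE_SHIFT rather
           * than the hugepage shift, so only the base page around
           * si_addr is lost and the rest of the hugepage stays usable.
           */
          poisoned_addr = info->si_addr;
          poisoned_size = (size_t)1 << info->si_addr_lsb;
          saw_mf_sigbus = 1;
  }

  /* Registered via sigaction() with SA_SIGINFO, as in the selftest. */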

Part 2: When an MFD_MF_KEEP_UE_MAPPED enabled memfd is being
released, or when a userspace process truncates a range of
hugepages belonging to an MFD_MF_KEEP_UE_MAPPED enabled memfd:

* When the HugeTLB in-memory file system removes a filemap's
  folios one by one, it asks MFR to deal with HWPoison-ed folios
  on the fly, implemented by filemap_offline_hwpoison_folio().

* MFR drops the refcounts being held for the raw HWPoison-ed
  pages within the folio. Now that the HWPoison-ed folio becomes
  a free HugeTLB folio, MFR dissolves it into a set of raw pages.
  dissolve_free_hugetlb_folio() frees them all to the buddy
  allocator, including the HWPoison-ed raw pages. So MFR also
  needs to take these HWPoison-ed pages off the buddy allocator.

One thing worth noting, as pointed out by William Roche:
during the time window between freeing pages to the buddy
allocator and taking them off the buddy allocator, a
high-order folio containing HWPoison-ed subpages can be
allocated. This race already exists today, after the buddy
allocator's sanity checks were reduced [3]. With
MFD_MF_KEEP_UE_MAPPED, multiple raw HWPoison-ed pages can be
allocated this way. Since MFD_MF_KEEP_UE_MAPPED could
exacerbate the issue, I have proposed a solution [4] based on
discussions with Harry Yoo and Miaohe Lin, and will send it
out as a separate formal patchset.

Changelog
=========

v1 [2] -> v2
- Rebased onto commit 6da43bbeb6918 ("Merge tag 'vfio-v6.18-rc6' of
  https://github.com/awilliam/linux-vfio").
- Removed populate_memfd_hwp_folios and offline_memfd_hwp_folios so
  that no memory allocation is needed during releasing HWPoison-ed
  memfd.
- Inserted filemap_offline_hwpoison_folio into remove_inode_single_folio.
  Now dissolving and offlining HWPoison-ed huge folios is done on the fly.
- Fixed the bug pointed out by William Roche <william.roche@oracle.com>:
  call take_page_off_buddy whether the HWPoison-ed page is a buddy page or not.
- Removed update_per_node_mf_stats when dissolve failed.
- Made hugetlb-mfr allocate four 1G hugepages to cover the new code
  introduced in remove_inode_hugepages.
- Made hugetlb-mfr support testing both 1GB and 2MB HugeTLB hugepages.
- Fixed some typos in documentation.

[1] https://lwn.net/Articles/991513
[2] https://lore.kernel.org/lkml/20250118231549.1652825-1-jiaqiyan@google.com
[3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
[4] https://lore.kernel.org/lkml/CACw3F51VGxg4q9nM_eQN7OXs7JaZo9K-nvDwxtZgtjFSNyjQaw@mail.gmail.com


Jiaqi Yan (3):
  mm: memfd/hugetlb: introduce memfd-based userspace MFR policy
  selftests/mm: test userspace MFR for HugeTLB hugepage
  Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED

 Documentation/userspace-api/index.rst         |   1 +
 .../userspace-api/mfd_mfr_policy.rst          |  60 ++++
 fs/hugetlbfs/inode.c                          |  25 +-
 include/linux/hugetlb.h                       |   7 +
 include/linux/pagemap.h                       |  24 ++
 include/uapi/linux/memfd.h                    |   6 +
 mm/hugetlb.c                                  |  20 +-
 mm/memfd.c                                    |  15 +-
 mm/memory-failure.c                           | 124 ++++++-
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   1 +
 tools/testing/selftests/mm/hugetlb-mfr.c      | 327 ++++++++++++++++++
 12 files changed, 592 insertions(+), 19 deletions(-)
 create mode 100644 Documentation/userspace-api/mfd_mfr_policy.rst
 create mode 100644 tools/testing/selftests/mm/hugetlb-mfr.c

-- 
2.52.0.rc1.455.g30608eb744-goog



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy
  2025-11-16  1:32 [PATCH v2 0/3] memfd-based Userspace MFR Policy for HugeTLB Jiaqi Yan
@ 2025-11-16  1:32 ` Jiaqi Yan
  2025-11-25 21:47   ` William Roche
                     ` (2 more replies)
  2025-11-16  1:32 ` [PATCH v2 2/3] selftests/mm: test userspace MFR for HugeTLB hugepage Jiaqi Yan
  2025-11-16  1:32 ` [PATCH v2 3/3] Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED Jiaqi Yan
  2 siblings, 3 replies; 10+ messages in thread
From: Jiaqi Yan @ 2025-11-16  1:32 UTC (permalink / raw)
  To: nao.horiguchi, linmiaohe, william.roche, harry.yoo
  Cc: tony.luck, wangkefeng.wang, willy, jane.chu, akpm, osalvador,
	rientjes, duenwen, jthoughton, jgg, ankita, peterx,
	sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
	linux-kernel, linux-fsdevel, Jiaqi Yan

Sometimes immediately hard offlining a large chunk of contiguous memory
having uncorrected memory errors (UE) may not be the best option.
Cloud providers usually serve capacity- and performance-critical guest
memory with 1G HugeTLB hugepages, as this significantly reduces the
overhead associated with managing page tables and TLB misses. However,
for today's HugeTLB system, once a byte of memory in a hugepage is
hardware corrupted, the kernel discards the whole hugepage, including
the healthy portion. Customer workloads running in the VM can hardly
recover from such a large loss of memory.

Therefore, whether to keep or discard a large chunk of contiguous memory
owned by userspace (particularly memory serving guest memory) after a
recoverable UE may be better controlled by the userspace process that
owns the memory, e.g. the VMM in a Cloud environment.

Introduce a memfd-based userspace memory failure recovery (MFR) policy,
MFD_MF_KEEP_UE_MAPPED. It is intended to eventually be supported for
other memfd types, but the current implementation only covers HugeTLB.

For any hugepage associated with the MFD_MF_KEEP_UE_MAPPED enabled memfd,
whenever it runs into a UE, MFR doesn't hard offline the HWPoison-ed
huge folio. IOW the HWPoison-ed memory remains accessible via the memory
mapping created with that memfd. MFR still sends SIGBUS to the process
as required. MFR also still maintains HWPoison metadata for the hugepage
having the UE.

A HWPoison-ed hugepage will be immediately isolated and prevented from
future allocation once userspace truncates it via the memfd, or the
owning memfd is closed.

By default MFD_MF_KEEP_UE_MAPPED is not set, and MFR hard offlines
hugepages having UEs.

Tested with selftest in the follow-up commit.

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Tested-by: William Roche <william.roche@oracle.com>
---
 fs/hugetlbfs/inode.c       |  25 +++++++-
 include/linux/hugetlb.h    |   7 +++
 include/linux/pagemap.h    |  24 +++++++
 include/uapi/linux/memfd.h |   6 ++
 mm/hugetlb.c               |  20 +++++-
 mm/memfd.c                 |  15 ++++-
 mm/memory-failure.c        | 124 +++++++++++++++++++++++++++++++++----
 7 files changed, 202 insertions(+), 19 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index f42548ee9083c..f8a5aa091d51d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -532,6 +532,18 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
 	}
 
 	folio_unlock(folio);
+
+	/*
+	 * There may be pending HWPoison-ed folios when a memfd is being
+	 * removed or part of it is being truncated.
+	 *
+	 * HugeTLBFS' error_remove_folio keeps the HWPoison-ed folios in
+	 * page cache until mm wants to drop the folio at the end of the
+	 * filemap. At this point, if memory failure was delayed
+	 * by MFD_MF_KEEP_UE_MAPPED in the past, we can now deal with it.
+	 */
+	filemap_offline_hwpoison_folio(mapping, folio);
+
 	return ret;
 }
 
@@ -563,13 +575,13 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 	const pgoff_t end = lend >> PAGE_SHIFT;
 	struct folio_batch fbatch;
 	pgoff_t next, index;
-	int i, freed = 0;
+	int i, j, freed = 0;
 	bool truncate_op = (lend == LLONG_MAX);
 
 	folio_batch_init(&fbatch);
 	next = lstart >> PAGE_SHIFT;
 	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
-		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
+		for (i = 0, j = 0; i < folio_batch_count(&fbatch); ++i) {
 			struct folio *folio = fbatch.folios[i];
 			u32 hash = 0;
 
@@ -584,8 +596,17 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 							index, truncate_op))
 				freed++;
 
+			/*
+			 * Skip HWPoison-ed hugepages, which should no
+			 * longer be hugetlb if successfully dissolved.
+			 */
+			if (folio_test_hugetlb(folio))
+				fbatch.folios[j++] = folio;
+
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
+		fbatch.nr = j;
+
 		folio_batch_release(&fbatch);
 		cond_resched();
 	}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8e63e46b8e1f0..b7733ef5ee917 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -871,10 +871,17 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn,
 
 #ifdef CONFIG_MEMORY_FAILURE
 extern void folio_clear_hugetlb_hwpoison(struct folio *folio);
+extern bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
+						struct address_space *mapping);
 #else
 static inline void folio_clear_hugetlb_hwpoison(struct folio *folio)
 {
 }
+static inline bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio
+						       struct address_space *mapping)
+{
+	return false;
+}
 #endif
 
 #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 09b581c1d878d..9ad511aacde7c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -213,6 +213,8 @@ enum mapping_flags {
 	AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM = 9,
 	AS_KERNEL_FILE = 10,	/* mapping for a fake kernel file that shouldn't
 				   account usage to user cgroups */
+	/* For MFD_MF_KEEP_UE_MAPPED. */
+	AS_MF_KEEP_UE_MAPPED = 11,
 	/* Bits 16-25 are used for FOLIO_ORDER */
 	AS_FOLIO_ORDER_BITS = 5,
 	AS_FOLIO_ORDER_MIN = 16,
@@ -348,6 +350,16 @@ static inline bool mapping_writeback_may_deadlock_on_reclaim(const struct addres
 	return test_bit(AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM, &mapping->flags);
 }
 
+static inline bool mapping_mf_keep_ue_mapped(const struct address_space *mapping)
+{
+	return test_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
+}
+
+static inline void mapping_set_mf_keep_ue_mapped(struct address_space *mapping)
+{
+	set_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(const struct address_space *mapping)
 {
 	return mapping->gfp_mask;
@@ -1274,6 +1286,18 @@ void replace_page_cache_folio(struct folio *old, struct folio *new);
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct folio_batch *fbatch);
 bool filemap_release_folio(struct folio *folio, gfp_t gfp);
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Provided by memory failure to offline HWPoison-ed folio managed by memfd.
+ */
+void filemap_offline_hwpoison_folio(struct address_space *mapping,
+				    struct folio *folio);
+#else
+void filemap_offline_hwpoison_folio(struct address_space *mapping,
+				    struct folio *folio)
+{
+}
+#endif
 loff_t mapping_seek_hole_data(struct address_space *, loff_t start, loff_t end,
 		int whence);
 
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 273a4e15dfcff..d9875da551b7f 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -12,6 +12,12 @@
 #define MFD_NOEXEC_SEAL		0x0008U
 /* executable */
 #define MFD_EXEC		0x0010U
+/*
+ * Keep owned folios mapped when an uncorrectable memory error (UE) causes
+ * memory failure (MF) within the folio. Only at the end of the mapping
+ * will its HWPoison-ed folios be dealt with.
+ */
+#define MFD_MF_KEEP_UE_MAPPED	0x0020U
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0455119716ec0..dd3bc0b75e059 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6415,6 +6415,18 @@ static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm, unsigned
 	return same;
 }
 
+bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
+					 struct address_space *mapping)
+{
+	if (WARN_ON_ONCE(!folio_test_hugetlb(folio)))
+		return false;
+
+	if (!mapping)
+		return false;
+
+	return mapping_mf_keep_ue_mapped(mapping);
+}
+
 static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 			struct vm_fault *vmf)
 {
@@ -6537,9 +6549,11 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 		 * So we need to block hugepage fault by PG_hwpoison bit check.
 		 */
 		if (unlikely(folio_test_hwpoison(folio))) {
-			ret = VM_FAULT_HWPOISON_LARGE |
-				VM_FAULT_SET_HINDEX(hstate_index(h));
-			goto backout_unlocked;
+			if (!mapping_mf_keep_ue_mapped(mapping)) {
+				ret = VM_FAULT_HWPOISON_LARGE |
+				      VM_FAULT_SET_HINDEX(hstate_index(h));
+				goto backout_unlocked;
+			}
 		}
 
 		/* Check for page in userfault range. */
diff --git a/mm/memfd.c b/mm/memfd.c
index 1d109c1acf211..bfdde4cf90500 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -313,7 +313,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+		       MFD_NOEXEC_SEAL | MFD_EXEC | MFD_MF_KEEP_UE_MAPPED)
 
 static int check_sysctl_memfd_noexec(unsigned int *flags)
 {
@@ -387,6 +388,8 @@ static int sanitize_flags(unsigned int *flags_ptr)
 	if (!(flags & MFD_HUGETLB)) {
 		if (flags & ~MFD_ALL_FLAGS)
 			return -EINVAL;
+		if (flags & MFD_MF_KEEP_UE_MAPPED)
+			return -EINVAL;
 	} else {
 		/* Allow huge page size encoding in flags. */
 		if (flags & ~(MFD_ALL_FLAGS |
@@ -447,6 +450,16 @@ static struct file *alloc_file(const char *name, unsigned int flags)
 	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
 	file->f_flags |= O_LARGEFILE;
 
+	/*
+	 * MFD_MF_KEEP_UE_MAPPED can only be specified in memfd_create; no API
+	 * to update it once memfd is created. MFD_MF_KEEP_UE_MAPPED is not
+	 * seal-able.
+	 *
+	 * For now MFD_MF_KEEP_UE_MAPPED is only supported by HugeTLBFS.
+	 */
+	if (flags & (MFD_HUGETLB | MFD_MF_KEEP_UE_MAPPED))
+		mapping_set_mf_keep_ue_mapped(file->f_mapping);
+
 	if (flags & MFD_NOEXEC_SEAL) {
 		struct inode *inode = file_inode(file);
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3edebb0cda30b..c5e3e28872797 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -373,11 +373,13 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
  * Schedule a process for later kill.
  * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
  */
-static void __add_to_kill(struct task_struct *tsk, const struct page *p,
+static void __add_to_kill(struct task_struct *tsk, struct page *p,
 			  struct vm_area_struct *vma, struct list_head *to_kill,
 			  unsigned long addr)
 {
 	struct to_kill *tk;
+	struct folio *folio;
+	struct address_space *mapping;
 
 	tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
 	if (!tk) {
@@ -388,8 +390,19 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
 	tk->addr = addr;
 	if (is_zone_device_page(p))
 		tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
-	else
-		tk->size_shift = folio_shift(page_folio(p));
+	else {
+		folio = page_folio(p);
+		mapping = folio_mapping(folio);
+		if (mapping && mapping_mf_keep_ue_mapped(mapping))
+			/*
+			 * Let userspace know the radius of HWPoison is
+			 * the size of raw page; accessing other pages
+			 * inside the folio is still ok.
+			 */
+			tk->size_shift = PAGE_SHIFT;
+		else
+			tk->size_shift = folio_shift(folio);
+	}
 
 	/*
 	 * Send SIGKILL if "tk->addr == -EFAULT". Also, as
@@ -414,7 +427,7 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
 	list_add_tail(&tk->nd, to_kill);
 }
 
-static void add_to_kill_anon_file(struct task_struct *tsk, const struct page *p,
+static void add_to_kill_anon_file(struct task_struct *tsk, struct page *p,
 		struct vm_area_struct *vma, struct list_head *to_kill,
 		unsigned long addr)
 {
@@ -535,7 +548,7 @@ struct task_struct *task_early_kill(struct task_struct *tsk, int force_early)
  * Collect processes when the error hit an anonymous page.
  */
 static void collect_procs_anon(const struct folio *folio,
-		const struct page *page, struct list_head *to_kill,
+		struct page *page, struct list_head *to_kill,
 		int force_early)
 {
 	struct task_struct *tsk;
@@ -573,7 +586,7 @@ static void collect_procs_anon(const struct folio *folio,
  * Collect processes when the error hit a file mapped page.
  */
 static void collect_procs_file(const struct folio *folio,
-		const struct page *page, struct list_head *to_kill,
+		struct page *page, struct list_head *to_kill,
 		int force_early)
 {
 	struct vm_area_struct *vma;
@@ -655,7 +668,7 @@ static void collect_procs_fsdax(const struct page *page,
 /*
  * Collect the processes who have the corrupted page mapped to kill.
  */
-static void collect_procs(const struct folio *folio, const struct page *page,
+static void collect_procs(const struct folio *folio, struct page *page,
 		struct list_head *tokill, int force_early)
 {
 	if (!folio->mapping)
@@ -1173,6 +1186,13 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 		}
 	}
 
+	/*
+	 * MF still needs to hold a refcount for the deferred actions in
+	 * filemap_offline_hwpoison_folio.
+	 */
+	if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
+		return res;
+
 	if (has_extra_refcount(ps, p, extra_pins))
 		res = MF_FAILED;
 
@@ -1569,6 +1589,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
 {
 	LIST_HEAD(tokill);
 	bool unmap_success;
+	bool keep_mapped;
 	int forcekill;
 	bool mlocked = folio_test_mlocked(folio);
 
@@ -1596,8 +1617,12 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
 	 */
 	collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
 
-	unmap_success = !unmap_poisoned_folio(folio, pfn, flags & MF_MUST_KILL);
-	if (!unmap_success)
+	keep_mapped = hugetlb_should_keep_hwpoison_mapped(folio, folio->mapping);
+	if (!keep_mapped)
+		unmap_poisoned_folio(folio, pfn, flags & MF_MUST_KILL);
+
+	unmap_success = !folio_mapped(folio);
+	if (!keep_mapped && !unmap_success)
 		pr_err("%#lx: failed to unmap page (folio mapcount=%d)\n",
 		       pfn, folio_mapcount(folio));
 
@@ -1622,7 +1647,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
 		    !unmap_success;
 	kill_procs(&tokill, forcekill, pfn, flags);
 
-	return unmap_success;
+	return unmap_success || keep_mapped;
 }
 
 static int identify_page_state(unsigned long pfn, struct page *p,
@@ -1862,6 +1887,13 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
 	unsigned long count = 0;
 
 	head = llist_del_all(raw_hwp_list_head(folio));
+	/*
+	 * If filemap_offline_hwpoison_folio_hugetlb is handling this folio,
+	 * it has already taken off the head of the llist.
+	 */
+	if (head == NULL)
+		return 0;
+
 	llist_for_each_entry_safe(p, next, head, node) {
 		if (move_flag)
 			SetPageHWPoison(p->page);
@@ -1878,7 +1910,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
 	struct llist_head *head;
 	struct raw_hwp_page *raw_hwp;
 	struct raw_hwp_page *p;
-	int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0;
+	struct address_space *mapping = folio->mapping;
+	bool has_hwpoison = folio_test_set_hwpoison(folio);
 
 	/*
 	 * Once the hwpoison hugepage has lost reliable raw error info,
@@ -1897,8 +1930,15 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
 	if (raw_hwp) {
 		raw_hwp->page = page;
 		llist_add(&raw_hwp->node, head);
+		if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
+			/*
+			 * A new raw HWPoison page. Don't return HWPOISON.
+			 * Error event will be counted in action_result().
+			 */
+			return 0;
+
 		/* the first error event will be counted in action_result(). */
-		if (ret)
+		if (has_hwpoison)
 			num_poisoned_pages_inc(page_to_pfn(page));
 	} else {
 		/*
@@ -1913,7 +1953,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
 		 */
 		__folio_free_raw_hwp(folio, false);
 	}
-	return ret;
+
+	return has_hwpoison ? -EHWPOISON : 0;
 }
 
 static unsigned long folio_free_raw_hwp(struct folio *folio, bool move_flag)
@@ -2002,6 +2043,63 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
 	return ret;
 }
 
+static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
+{
+	int ret;
+	struct llist_node *head;
+	struct raw_hwp_page *curr, *next;
+	struct page *page;
+	unsigned long pfn;
+
+	/*
+	 * Since folio is still in the folio_batch, drop the refcount
+	 * elevated by filemap_get_folios.
+	 */
+	folio_put_refs(folio, 1);
+	head = llist_del_all(raw_hwp_list_head(folio));
+
+	/*
+	 * Release refcounts held by try_memory_failure_hugetlb, one per
+	 * HWPoison-ed page in the raw hwp list.
+	 */
+	llist_for_each_entry(curr, head, node) {
+		SetPageHWPoison(curr->page);
+		folio_put(folio);
+	}
+
+	/* Refcount now should be zero and ready to dissolve folio. */
+	ret = dissolve_free_hugetlb_folio(folio);
+	if (ret) {
+		pr_err("failed to dissolve hugetlb folio: %d\n", ret);
+		return;
+	}
+
+	llist_for_each_entry_safe(curr, next, head, node) {
+		page = curr->page;
+		pfn = page_to_pfn(page);
+		drain_all_pages(page_zone(page));
+		if (!take_page_off_buddy(page))
+			pr_err("%#lx: unable to take off buddy allocator\n", pfn);
+
+		page_ref_inc(page);
+		kfree(curr);
+		pr_info("%#lx: pending hard offline completed\n", pfn);
+	}
+}
+
+void filemap_offline_hwpoison_folio(struct address_space *mapping,
+				    struct folio *folio)
+{
+	WARN_ON_ONCE(!mapping);
+
+	if (!folio_test_hwpoison(folio))
+		return;
+
+	/* Pending MFR currently only exists for hugetlb. */
+	if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
+		filemap_offline_hwpoison_folio_hugetlb(folio);
+}
+
 /*
  * Taking refcount of hugetlb pages needs extra care about race conditions
  * with basic operations like hugepage allocation/free/demotion.
-- 
2.52.0.rc1.455.g30608eb744-goog



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2 2/3] selftests/mm: test userspace MFR for HugeTLB hugepage
  2025-11-16  1:32 [PATCH v2 0/3] memfd-based Userspace MFR Policy for HugeTLB Jiaqi Yan
  2025-11-16  1:32 ` [PATCH v2 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy Jiaqi Yan
@ 2025-11-16  1:32 ` Jiaqi Yan
  2025-12-03  4:14   ` jane.chu
  2025-11-16  1:32 ` [PATCH v2 3/3] Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED Jiaqi Yan
  2 siblings, 1 reply; 10+ messages in thread
From: Jiaqi Yan @ 2025-11-16  1:32 UTC (permalink / raw)
  To: nao.horiguchi, linmiaohe, william.roche, harry.yoo
  Cc: tony.luck, wangkefeng.wang, willy, jane.chu, akpm, osalvador,
	rientjes, duenwen, jthoughton, jgg, ankita, peterx,
	sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
	linux-kernel, linux-fsdevel, Jiaqi Yan

Test the userspace memory failure recovery (MFR) policy for HugeTLB
1G or 2M hugepage case:
1. Create a memfd backed by HugeTLB and had MFD_MF_KEEP_UE_MAPPED set.
2. Allocate and map 4 hugepages to the process.
3. Create sub-threads to MADV_HWPOISON inner addresses of one hugepage.
4. Check if the process gets the correct SIGBUS for each poisoned raw page.
5. Check if all memory is still accessible and content still valid.
6. Check if the poisoned hugepage is dealt with after the memfd is released.

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 tools/testing/selftests/mm/.gitignore    |   1 +
 tools/testing/selftests/mm/Makefile      |   1 +
 tools/testing/selftests/mm/hugetlb-mfr.c | 327 +++++++++++++++++++++++
 3 files changed, 329 insertions(+)
 create mode 100644 tools/testing/selftests/mm/hugetlb-mfr.c

diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index c2a8586e51a1f..11664d20935db 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -5,6 +5,7 @@ hugepage-mremap
 hugepage-shm
 hugepage-vmemmap
 hugetlb-madvise
+hugetlb-mfr
 hugetlb-read-hwpoison
 hugetlb-soft-offline
 khugepaged
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index eaf9312097f7b..de3bdcf7914cd 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -63,6 +63,7 @@ TEST_GEN_FILES += hmm-tests
 TEST_GEN_FILES += hugetlb-madvise
 TEST_GEN_FILES += hugetlb-read-hwpoison
 TEST_GEN_FILES += hugetlb-soft-offline
+TEST_GEN_FILES += hugetlb-mfr
 TEST_GEN_FILES += hugepage-mmap
 TEST_GEN_FILES += hugepage-mremap
 TEST_GEN_FILES += hugepage-shm
diff --git a/tools/testing/selftests/mm/hugetlb-mfr.c b/tools/testing/selftests/mm/hugetlb-mfr.c
new file mode 100644
index 0000000000000..30939b2194188
--- /dev/null
+++ b/tools/testing/selftests/mm/hugetlb-mfr.c
@@ -0,0 +1,327 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Test the userspace memory failure recovery (MFR) policy for HugeTLB
+ * hugepage case:
+ * 1. Create a memfd backed by HugeTLB and MFD_MF_KEEP_UE_MAPPED bit set.
+ * 2. Allocate and map 4 hugepages.
+ * 3. Create sub-threads to MADV_HWPOISON inner addresses of one hugepage.
+ * 4. Check if each sub-thread gets the correct SIGBUS for the poisoned raw page.
+ * 5. Check if all memory is still accessible and content still valid.
+ * 6. Check if the poisoned hugepage is dealt with after memfd released.
+ *
+ * Two ways to run the test:
+ *   ./hugetlb-mfr 2M
+ * or
+ *   ./hugetlb-mfr 1G
+ * assuming /sys/kernel/mm/hugepages/hugepages-${xxx}kB/nr_hugepages > 4
+ */
+
+#define _GNU_SOURCE
+#include <assert.h>
+#include <errno.h>
+#include <numaif.h>
+#include <numa.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <linux/magic.h>
+#include <linux/memfd.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <sys/statfs.h>
+#include <sys/types.h>
+
+#include "../kselftest.h"
+#include "vm_util.h"
+
+#define EPREFIX			" !!! "
+#define BYTE_LENTH_IN_1G	0x40000000UL
+#define BYTE_LENTH_IN_2M	0x200000UL
+#define HUGETLB_1GB_STR		"1G"
+#define HUGETLB_2MB_STR		"2M"
+#define HUGETLB_FILL		0xab
+
+static const unsigned long offsets_1g[] = {0x200000, 0x400000, 0x800000};
+static const unsigned long offsets_2m[] = {0x020000, 0x040000, 0x080000};
+
+static void *sigbus_addr;
+static int sigbus_addr_lsb;
+static bool expecting_sigbus;
+static bool got_sigbus;
+static bool was_mceerr;
+
+static int create_hugetlbfs_file(struct statfs *file_stat,
+				 unsigned long hugepage_size)
+{
+	int fd;
+	int flags = MFD_HUGETLB | MFD_MF_KEEP_UE_MAPPED;
+
+	if (hugepage_size == BYTE_LENTH_IN_2M)
+		flags |= MFD_HUGE_2MB;
+	else
+		flags |= MFD_HUGE_1GB;
+
+	fd = memfd_create("hugetlb_tmp", flags);
+	if (fd < 0)
+		ksft_exit_fail_perror("Failed to memfd_create");
+
+	memset(file_stat, 0, sizeof(*file_stat));
+	if (fstatfs(fd, file_stat)) {
+		close(fd);
+		ksft_exit_fail_perror("Failed to fstatfs");
+	}
+	if (file_stat->f_type != HUGETLBFS_MAGIC) {
+		close(fd);
+		ksft_exit_fail_msg("Not hugetlbfs file");
+	}
+
+	ksft_print_msg("Created hugetlb_tmp file\n");
+	ksft_print_msg("hugepagesize=%#lx\n", file_stat->f_bsize);
+	if (file_stat->f_bsize != hugepage_size)
+		ksft_exit_fail_msg("Hugepage size is not %#lx", hugepage_size);
+
+	return fd;
+}
+
+/*
+ * SIGBUS handler for the "do_hwpoison" thread that MADV_HWPOISONs mapped memory
+ */
+static void sigbus_handler(int signo, siginfo_t *info, void *context)
+{
+	if (!expecting_sigbus)
+		ksft_exit_fail_msg("unexpected sigbus with addr=%p",
+				   info->si_addr);
+
+	got_sigbus = true;
+	was_mceerr = (info->si_code == BUS_MCEERR_AO ||
+		      info->si_code == BUS_MCEERR_AR);
+	sigbus_addr = info->si_addr;
+	sigbus_addr_lsb = info->si_addr_lsb;
+}
+
+static void *do_hwpoison(void *hwpoison_addr)
+{
+	int hwpoison_size = getpagesize();
+
+	ksft_print_msg("MADV_HWPOISON hwpoison_addr=%p, len=%d\n",
+		       hwpoison_addr, hwpoison_size);
+	if (madvise(hwpoison_addr, hwpoison_size, MADV_HWPOISON) < 0)
+		ksft_exit_fail_perror("Failed to MADV_HWPOISON");
+
+	pthread_exit(NULL);
+}
+
+static void test_hwpoison_multiple_pages(unsigned char *start_addr,
+					 unsigned long hugepage_size)
+{
+	pthread_t pthread;
+	int ret;
+	unsigned char *hwpoison_addr;
+	const unsigned long *offsets;
+	size_t offsets_count;
+	size_t i;
+
+	if (hugepage_size == BYTE_LENTH_IN_2M) {
+		offsets = offsets_2m;
+		offsets_count = ARRAY_SIZE(offsets_2m);
+	} else {
+		offsets = offsets_1g;
+		offsets_count = ARRAY_SIZE(offsets_1g);
+	}
+
+	for (i = 0; i < offsets_count; ++i) {
+		sigbus_addr = (void *)0xBADBADBAD;
+		sigbus_addr_lsb = 0;
+		was_mceerr = false;
+		got_sigbus = false;
+		expecting_sigbus = true;
+		hwpoison_addr = start_addr + offsets[i];
+
+		ret = pthread_create(&pthread, NULL, &do_hwpoison, hwpoison_addr);
+		if (ret)
+			ksft_exit_fail_perror("Failed to create hwpoison thread");
+
+		ksft_print_msg("Created thread to hwpoison and access hwpoison_addr=%p\n",
+			       hwpoison_addr);
+
+		pthread_join(pthread, NULL);
+
+		if (!got_sigbus)
+			ksft_test_result_fail("Didn't get a SIGBUS\n");
+		if (!was_mceerr)
+			ksft_test_result_fail("Didn't get a BUS_MCEERR_A(R|O)\n");
+		if (sigbus_addr != hwpoison_addr)
+			ksft_test_result_fail("Incorrect address: got=%p, expected=%p\n",
+					      sigbus_addr, hwpoison_addr);
+		if (sigbus_addr_lsb != pshift())
+			ksft_test_result_fail("Incorrect address LSB: got=%d, expected=%d\n",
+					      sigbus_addr_lsb, pshift());
+
+		ksft_print_msg("Received expected and correct SIGBUS\n");
+	}
+}
+
+static int read_nr_hugepages(unsigned long hugepage_size,
+			     unsigned long *nr_hugepages)
+{
+	char buffer[256] = {0};
+	char cmd[256] = {0};
+
+	sprintf(cmd, "cat /sys/kernel/mm/hugepages/hugepages-%ldkB/nr_hugepages",
+		hugepage_size);
+	FILE *cmdfile = popen(cmd, "r");
+
+	if (cmdfile == NULL) {
+		ksft_perror(EPREFIX "failed to popen nr_hugepages");
+		return -1;
+	}
+
+	if (!fgets(buffer, sizeof(buffer), cmdfile)) {
+		ksft_perror(EPREFIX "failed to read nr_hugepages");
+		pclose(cmdfile);
+		return -1;
+	}
+
+	*nr_hugepages = atoll(buffer);
+	pclose(cmdfile);
+	return 0;
+}
+
+/*
+ * Main thread that drives the test.
+ */
+static void test_main(int fd, unsigned long hugepage_size)
+{
+	unsigned char *map, *iter;
+	struct sigaction new, old;
+	const unsigned long hugepagesize_kb = hugepage_size / 1024;
+	unsigned long nr_hugepages_before = 0;
+	unsigned long nr_hugepages_after = 0;
+	unsigned long nodemask = 1UL << 0;
+	unsigned long len = hugepage_size * 4;
+	int ret;
+
+	if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_before) != 0) {
+		close(fd);
+		ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+	}
+	ksft_print_msg("NR hugepages before MADV_HWPOISON is %ld\n", nr_hugepages_before);
+
+	if (ftruncate(fd, len) < 0)
+		ksft_exit_fail_perror("Failed to ftruncate");
+
+	ksft_print_msg("Allocated %#lx bytes to HugeTLB file\n", len);
+
+	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (map == MAP_FAILED)
+		ksft_exit_fail_msg("Failed to mmap");
+
+	ksft_print_msg("Created HugeTLB mapping: %p\n", map);
+
+	ret = mbind(map, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
+		    MPOL_MF_STRICT | MPOL_MF_MOVE);
+	if (ret < 0) {
+		perror("mbind");
+		ksft_exit_fail_msg("Failed to bind to node\n");
+	}
+
+	memset(map, HUGETLB_FILL, len);
+	ksft_print_msg("Memset every byte to 0xab\n");
+
+	new.sa_sigaction = &sigbus_handler;
+	new.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGBUS, &new, &old) < 0)
+		ksft_exit_fail_msg("Failed to setup SIGBUS handler");
+
+	ksft_print_msg("Setup SIGBUS handler successfully\n");
+
+	test_hwpoison_multiple_pages(map, hugepage_size);
+
+	/*
+	 * Since MADV_HWPOISON doesn't corrupt the memory in hardware, and
+	 * MFD_MF_KEEP_UE_MAPPED keeps the hugepage mapped, every byte should
+	 * remain accessible and hold original data.
+	 */
+	expecting_sigbus = false;
+	for (iter = map; iter < map + len; ++iter) {
+		if (*iter != HUGETLB_FILL) {
+			ksft_print_msg("At addr=%p: got=%#x, expected=%#x\n",
+				       iter, *iter, HUGETLB_FILL);
+			ksft_test_result_fail("Memory content corrupted\n");
+			break;
+		}
+	}
+	ksft_print_msg("Memory content all valid\n");
+
+	if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_after) != 0) {
+		close(fd);
+		ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+	}
+
+	/*
+	 * After MADV_HWPOISON, hugepage should still be in HugeTLB pool.
+	 */
+	ksft_print_msg("NR hugepages after MADV_HWPOISON is %ld\n", nr_hugepages_after);
+	if (nr_hugepages_before != nr_hugepages_after)
+		ksft_test_result_fail("NR hugepages reduced by %ld after MADV_HWPOISON\n",
+				      nr_hugepages_before - nr_hugepages_after);
+
+	/* End of the lifetime of the created HugeTLB memfd. */
+	if (ftruncate(fd, 0) < 0)
+		ksft_exit_fail_perror("Failed to ftruncate to 0");
+	munmap(map, len);
+	close(fd);
+
+	/*
+	 * After freed by userspace, MADV_HWPOISON-ed hugepage should be
+	 * dissolved into raw pages and removed from HugeTLB pool.
+	 */
+	if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_after) != 0) {
+		close(fd);
+		ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+	}
+	ksft_print_msg("NR hugepages after closure is %ld\n", nr_hugepages_after);
+	if (nr_hugepages_before != nr_hugepages_after + 1)
+		ksft_test_result_fail("NR hugepages is not reduced after memfd closure\n");
+
+	ksft_test_result_pass("All done\n");
+}
+
+static unsigned long parse_hugepage_size(char *argv)
+{
+	if (strncasecmp(argv, HUGETLB_1GB_STR, strlen(HUGETLB_1GB_STR)) == 0)
+		return BYTE_LENTH_IN_1G;
+
+	if (strncasecmp(argv, HUGETLB_2MB_STR, strlen(HUGETLB_2MB_STR)) == 0)
+		return BYTE_LENTH_IN_2M;
+
+	ksft_print_msg("Please provide valid hugepage_size: 1G or 2M\n");
+	assert(false);
+}
+
+int main(int argc, char **argv)
+{
+	int fd;
+	struct statfs file_stat;
+	unsigned long hugepage_size;
+
+	if (argc != 2) {
+		ksft_print_msg("Usage: %s <hugepage_size=1G|2M>\n", argv[0]);
+		return -EINVAL;
+	}
+
+	ksft_print_header();
+	ksft_set_plan(1);
+
+	hugepage_size = parse_hugepage_size(argv[1]);
+	fd = create_hugetlbfs_file(&file_stat, hugepage_size);
+	test_main(fd, hugepage_size);
+
+	ksft_finished();
+}
-- 
2.52.0.rc1.455.g30608eb744-goog



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2 3/3] Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED
  2025-11-16  1:32 [PATCH v2 0/3] memfd-based Userspace MFR Policy for HugeTLB Jiaqi Yan
  2025-11-16  1:32 ` [PATCH v2 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy Jiaqi Yan
  2025-11-16  1:32 ` [PATCH v2 2/3] selftests/mm: test userspace MFR for HugeTLB hugepage Jiaqi Yan
@ 2025-11-16  1:32 ` Jiaqi Yan
  2025-12-03  4:18   ` jane.chu
  2 siblings, 1 reply; 10+ messages in thread
From: Jiaqi Yan @ 2025-11-16  1:32 UTC (permalink / raw)
  To: nao.horiguchi, linmiaohe, william.roche, harry.yoo
  Cc: tony.luck, wangkefeng.wang, willy, jane.chu, akpm, osalvador,
	rientjes, duenwen, jthoughton, jgg, ankita, peterx,
	sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
	linux-kernel, linux-fsdevel, Jiaqi Yan

Document its motivation, userspace API, behaviors, and limitations.

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 Documentation/userspace-api/index.rst         |  1 +
 .../userspace-api/mfd_mfr_policy.rst          | 60 +++++++++++++++++++
 2 files changed, 61 insertions(+)
 create mode 100644 Documentation/userspace-api/mfd_mfr_policy.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index b8c73be4fb112..d8c6977d9e67a 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -67,6 +67,7 @@ Everything else
    futex2
    perf_ring_buffer
    ntsync
+   mfd_mfr_policy
 
 .. only::  subproject and html
 
diff --git a/Documentation/userspace-api/mfd_mfr_policy.rst b/Documentation/userspace-api/mfd_mfr_policy.rst
new file mode 100644
index 0000000000000..c5a25df39791a
--- /dev/null
+++ b/Documentation/userspace-api/mfd_mfr_policy.rst
@@ -0,0 +1,60 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================================
+Userspace Memory Failure Recovery Policy via memfd
+==================================================
+
+:Author:
+    Jiaqi Yan <jiaqiyan@google.com>
+
+
+Motivation
+==========
+
+When a userspace process is able to recover from memory failures (MF)
+caused by uncorrected memory error (UE) in the DIMM, especially when it is
+able to avoid consuming known UEs, keeping the memory page mapped and
+accessible is beneficial to the owning process for a couple of reasons:
+
+- The smallest granularity of a memory page affected by a UE can be large,
+  for example a 1G hugepage, while the actually corrupted amount is only
+  several cachelines. Losing the entire hugepage of data is unacceptable to
+  the application.
+
+- In addition to keeping the data accessible, the application still wants
+  to access with a large page size for the fastest virtual-to-physical
+  translations.
+
+Memory failure recovery for 1G or larger HugeTLB pages is a good example.
+With memfd, a userspace process can control whether the kernel hard
+offlines the hugepages that back the in-RAM file created by memfd.
+
+
+User API
+========
+
+``int memfd_create(const char *name, unsigned int flags)``
+
+``MFD_MF_KEEP_UE_MAPPED``
+
+	When ``MFD_MF_KEEP_UE_MAPPED`` bit is set in ``flags``, MF recovery
+	in the kernel does not hard offline memory due to UE until the
+	returned ``memfd`` is released. IOW, the HWPoison-ed memory remains
+	accessible via the returned ``memfd`` or the memory mapping created
+	with the returned ``memfd``. Note the affected memory will be
+	immediately isolated and prevented from future use once the memfd
+	is closed. By default ``MFD_MF_KEEP_UE_MAPPED`` is not set, and
+	kernel hard offlines memory having UEs.
+
+Notes about the behavior and limitations
+
+- Even if the page affected by the UE is kept, a portion of the (huge)page
+  is already lost due to hardware corruption, and the size of that portion
+  is the smallest page size the kernel uses to manage memory on the
+  architecture, i.e. PAGESIZE. Accessing a virtual address within any of
+  these portions results in a SIGBUS; accessing virtual addresses outside
+  them is fine until they are corrupted by a new memory error.
+
+- ``MFD_MF_KEEP_UE_MAPPED`` currently only works for HugeTLB, so
+  ``MFD_HUGETLB`` must also be set when setting ``MFD_MF_KEEP_UE_MAPPED``.
+  Otherwise ``memfd_create`` returns EINVAL.
-- 
2.52.0.rc1.455.g30608eb744-goog



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy
  2025-11-16  1:32 ` [PATCH v2 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy Jiaqi Yan
@ 2025-11-25 21:47   ` William Roche
  2025-11-25 22:04   ` William Roche
  2025-12-03  4:11   ` jane.chu
  2 siblings, 0 replies; 10+ messages in thread
From: William Roche @ 2025-11-25 21:47 UTC (permalink / raw)
  To: Jiaqi Yan, nao.horiguchi, linmiaohe, harry.yoo
  Cc: tony.luck, wangkefeng.wang, willy, jane.chu, akpm, osalvador,
	rientjes, duenwen, jthoughton, jgg, ankita, peterx,
	sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
	linux-kernel, linux-fsdevel


Hello Jiaqi,

Here is a summary of a few nits in this code:

  - Some functions declarations are problematic according to me
  - The parameter testing to activate the feature looks incorrect
  - The function signature change is probably not necessary
  - Maybe we should wait for an agreement on your other proposal:
[PATCH v1 0/2] Only free healthy pages in high-order HWPoison folio

The last item is not a nit, but since your other proposal may require
keeping all of a hugetlb folio's data in order to recycle it correctly
(especially the list of poisoned sub-pages), and avoiding the race
condition of returning poisoned pages to the freelist right before
removing them, you may need to change some aspects of this current code.


On 11/16/25 02:32, Jiaqi Yan wrote:

> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 8e63e46b8e1f0..b7733ef5ee917 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -871,10 +871,17 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn,
>   
>   #ifdef CONFIG_MEMORY_FAILURE
>   extern void folio_clear_hugetlb_hwpoison(struct folio *folio);
> +extern bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
> +						struct address_space *mapping);
>   #else
>   static inline void folio_clear_hugetlb_hwpoison(struct folio *folio)
>   {
>   }
> +static inline bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio
> +						       struct address_space *mapping)
> +{
> +	return false;
> +}
>   #endif
>   
You are conditionally declaring this hugetlb_should_keep_hwpoison_mapped()
function while implementing it in mm/hugetlb.c, but that file is compiled
in both cases (CONFIG_MEMORY_FAILURE enabled or not). So you either need a
single, consistent declaration matching the implementation, and use
something like this:

bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
					 struct address_space *mapping)
{
+#ifdef CONFIG_MEMORY_FAILURE
	if (WARN_ON_ONCE(!folio_test_hugetlb(folio)))
		return false;
@@ -6087,6 +6088,9 @@ bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
		return false;

	return mapping_mf_keep_ue_mapped(mapping);
+#else
+	return false;
+#endif
}

Or keep your double declaration and only compile the implementation when
CONFIG_MEMORY_FAILURE is enabled:

+#ifdef CONFIG_MEMORY_FAILURE
bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
					 struct address_space *mapping)
{
	if (WARN_ON_ONCE(!folio_test_hugetlb(folio)))
		return false;
@@ -6087,6 +6088,9 @@ bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
		return false;

	return mapping_mf_keep_ue_mapped(mapping);
}
+#endif

>   #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 09b581c1d878d..9ad511aacde7c 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -213,6 +213,8 @@ enum mapping_flags {
>   	AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM = 9,
>   	AS_KERNEL_FILE = 10,	/* mapping for a fake kernel file that shouldn't
>   				   account usage to user cgroups */
> +	/* For MFD_MF_KEEP_UE_MAPPED. */
> +	AS_MF_KEEP_UE_MAPPED = 11,
>   	/* Bits 16-25 are used for FOLIO_ORDER */
>   	AS_FOLIO_ORDER_BITS = 5,
>   	AS_FOLIO_ORDER_MIN = 16,
> @@ -348,6 +350,16 @@ static inline bool mapping_writeback_may_deadlock_on_reclaim(const struct addres
>   	return test_bit(AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM, &mapping->flags);
>   }
>   
> +static inline bool mapping_mf_keep_ue_mapped(const struct address_space *mapping)
> +{
> +	return test_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
> +}
> +
> +static inline void mapping_set_mf_keep_ue_mapped(struct address_space *mapping)
> +{
> +	set_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
> +}
> +
>   static inline gfp_t mapping_gfp_mask(const struct address_space *mapping)
>   {
>   	return mapping->gfp_mask;
> @@ -1274,6 +1286,18 @@ void replace_page_cache_folio(struct folio *old, struct folio *new);
>   void delete_from_page_cache_batch(struct address_space *mapping,
>   				  struct folio_batch *fbatch);
>   bool filemap_release_folio(struct folio *folio, gfp_t gfp);
> +#ifdef CONFIG_MEMORY_FAILURE
> +/*
> + * Provided by memory failure to offline HWPoison-ed folio managed by memfd.
> + */
> +void filemap_offline_hwpoison_folio(struct address_space *mapping,
> +				    struct folio *folio);
> +#else
> +void filemap_offline_hwpoison_folio(struct address_space *mapping,
> +				    struct folio *folio)
> +{
> +}
> +#endif
>   loff_t mapping_seek_hole_data(struct address_space *, loff_t start, loff_t end,
>   		int whence);

This filemap_offline_hwpoison_folio() declaration is also problematic in
the case without CONFIG_MEMORY_FAILURE, as it defines a non-static
function filemap_offline_hwpoison_folio() in every file that includes
this "pagemap.h" header.

This could be solved by using "static inline" in that second case.

> diff --git a/mm/memfd.c b/mm/memfd.c
> index 1d109c1acf211..bfdde4cf90500 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -313,7 +313,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
>   #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
>   #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>   
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> +		       MFD_NOEXEC_SEAL | MFD_EXEC | MFD_MF_KEEP_UE_MAPPED)
>   
>   static int check_sysctl_memfd_noexec(unsigned int *flags)
>   {
> @@ -387,6 +388,8 @@ static int sanitize_flags(unsigned int *flags_ptr)
>   	if (!(flags & MFD_HUGETLB)) {
>   		if (flags & ~MFD_ALL_FLAGS)
>   			return -EINVAL;
> +		if (flags & MFD_MF_KEEP_UE_MAPPED)
> +			return -EINVAL;
>   	} else {
>   		/* Allow huge page size encoding in flags. */
>   		if (flags & ~(MFD_ALL_FLAGS |
> @@ -447,6 +450,16 @@ static struct file *alloc_file(const char *name, unsigned int flags)
>   	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
>   	file->f_flags |= O_LARGEFILE;
>   
> +	/*
> +	 * MFD_MF_KEEP_UE_MAPPED can only be specified in memfd_create; no API
> +	 * to update it once memfd is created. MFD_MF_KEEP_UE_MAPPED is not
> +	 * seal-able.
> +	 *
> +	 * For now MFD_MF_KEEP_UE_MAPPED is only supported by HugeTLBFS.
> +	 */
> +	if (flags & (MFD_HUGETLB | MFD_MF_KEEP_UE_MAPPED))
> +		mapping_set_mf_keep_ue_mapped(file->f_mapping);

The flag we need to test in order to set the "keep" value on the address
space is MFD_MF_KEEP_UE_MAPPED alone, as we have already verified that it
is only given in combination with MFD_HUGETLB. This is a nit identified by
Harry Yoo during our internal conversations. Thanks Harry!

> +
>   	if (flags & MFD_NOEXEC_SEAL) {
>   		struct inode *inode = file_inode(file);
>   
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3edebb0cda30b..c5e3e28872797 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -373,11 +373,13 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
>    * Schedule a process for later kill.
>    * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
>    */
> -static void __add_to_kill(struct task_struct *tsk, const struct page *p,
> +static void __add_to_kill(struct task_struct *tsk, struct page *p,
>   			  struct vm_area_struct *vma, struct list_head *to_kill,
>   			  unsigned long addr)

Is there any reason to remove the "const" on the page structure in the
signature? It looks like you only do that for the new call to
page_folio(p), but we don't modify the page.

>   {
>   	struct to_kill *tk;
> +	struct folio *folio;
You could use a "const" struct folio *folio too.
> +	struct address_space *mapping;
>   
>   	tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
>   	if (!tk) {
> @@ -388,8 +390,19 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
>   	tk->addr = addr;
>   	if (is_zone_device_page(p))
>   		tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
> -	else
> -		tk->size_shift = folio_shift(page_folio(p));
> +	else {
> +		folio = page_folio(p);

Now with both folio and p being "const", the code should work.


> +		mapping = folio_mapping(folio);
> +		if (mapping && mapping_mf_keep_ue_mapped(mapping))
> +			/*
> +			 * Let userspace know the radius of HWPoison is
> +			 * the size of raw page; accessing other pages
> +			 * inside the folio is still ok.
> +			 */
> +			tk->size_shift = PAGE_SHIFT;
> +		else
> +			tk->size_shift = folio_shift(folio);
> +	}
>   
>   	/*
>   	 * Send SIGKILL if "tk->addr == -EFAULT". Also, as
> @@ -414,7 +427,7 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
>   	list_add_tail(&tk->nd, to_kill);
>   }
>   
> -static void add_to_kill_anon_file(struct task_struct *tsk, const struct page *p,
> +static void add_to_kill_anon_file(struct task_struct *tsk, struct page *p,
No need to change the signature here either (otherwise you would have
missed both functions add_to_kill_fsdax() and add_to_kill_ksm()).

>   		struct vm_area_struct *vma, struct list_head *to_kill,
>   		unsigned long addr)
>   {
> @@ -535,7 +548,7 @@ struct task_struct *task_early_kill(struct task_struct *tsk, int force_early)
>    * Collect processes when the error hit an anonymous page.
>    */
>   static void collect_procs_anon(const struct folio *folio,
> -		const struct page *page, struct list_head *to_kill,
> +		struct page *page, struct list_head *to_kill,

No need to change


>   		int force_early)
>   {
>   	struct task_struct *tsk;
> @@ -573,7 +586,7 @@ static void collect_procs_anon(const struct folio *folio,
>    * Collect processes when the error hit a file mapped page.
>    */
>   static void collect_procs_file(const struct folio *folio,
> -		const struct page *page, struct list_head *to_kill,
> +		struct page *page, struct list_head *to_kill,
>   		int force_early)
No need to change

>   {
>   	struct vm_area_struct *vma;
> @@ -655,7 +668,7 @@ static void collect_procs_fsdax(const struct page *page,
>   /*
>    * Collect the processes who have the corrupted page mapped to kill.
>    */
> -static void collect_procs(const struct folio *folio, const struct page *page,
> +static void collect_procs(const struct folio *folio, struct page *page,

No need to change

>   		struct list_head *tokill, int force_early)
>   {
>   	if (!folio->mapping)
> @@ -1173,6 +1186,13 @@ static int me_huge_page(struct page_state *ps, struct page *p)
>   		}
>   	}
>   
> +	/*
> +	 * MF still needs to holds a refcount for the deferred actions in
> +	 * filemap_offline_hwpoison_folio.
> +	 */
> +	if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> +		return res;
> +
>   	if (has_extra_refcount(ps, p, extra_pins))
>   		res = MF_FAILED;
>   
> @@ -1569,6 +1589,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
>   {
>   	LIST_HEAD(tokill);
>   	bool unmap_success;
> +	bool keep_mapped;
>   	int forcekill;
>   	bool mlocked = folio_test_mlocked(folio);
>   
> @@ -1596,8 +1617,12 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
>   	 */
>   	collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
>   
> -	unmap_success = !unmap_poisoned_folio(folio, pfn, flags & MF_MUST_KILL);
> -	if (!unmap_success)
> +	keep_mapped = hugetlb_should_keep_hwpoison_mapped(folio, folio->mapping);
> +	if (!keep_mapped)
> +		unmap_poisoned_folio(folio, pfn, flags & MF_MUST_KILL);
> +
> +	unmap_success = !folio_mapped(folio);
> +	if (!keep_mapped && !unmap_success)
>   		pr_err("%#lx: failed to unmap page (folio mapcount=%d)\n",
>   		       pfn, folio_mapcount(folio));
>   
> @@ -1622,7 +1647,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
>   		    !unmap_success;
>   	kill_procs(&tokill, forcekill, pfn, flags);
>   
> -	return unmap_success;
> +	return unmap_success || keep_mapped;
>   }
>   
>   static int identify_page_state(unsigned long pfn, struct page *p,
> @@ -1862,6 +1887,13 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
>   	unsigned long count = 0;
>   
>   	head = llist_del_all(raw_hwp_list_head(folio));
> +	/*
> +	 * If filemap_offline_hwpoison_folio_hugetlb is handling this folio,
> +	 * it has already taken off the head of the llist.
> +	 */
> +	if (head == NULL)
> +		return 0;
> +

This may not be necessary depending on how we recycle hugetlb pages -- 
see below too.


>   	llist_for_each_entry_safe(p, next, head, node) {
>   		if (move_flag)
>   			SetPageHWPoison(p->page);
> @@ -1878,7 +1910,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
>   	struct llist_head *head;
>   	struct raw_hwp_page *raw_hwp;
>   	struct raw_hwp_page *p;
> -	int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0;
> +	struct address_space *mapping = folio->mapping;
> +	bool has_hwpoison = folio_test_set_hwpoison(folio);
>   
>   	/*
>   	 * Once the hwpoison hugepage has lost reliable raw error info,
> @@ -1897,8 +1930,15 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
>   	if (raw_hwp) {
>   		raw_hwp->page = page;
>   		llist_add(&raw_hwp->node, head);
> +		if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> +			/*
> +			 * A new raw HWPoison page. Don't return HWPOISON.
> +			 * Error event will be counted in action_result().
> +			 */
> +			return 0;
> +
>   		/* the first error event will be counted in action_result(). */
> -		if (ret)
> +		if (has_hwpoison)
>   			num_poisoned_pages_inc(page_to_pfn(page));
>   	} else {
>   		/*
> @@ -1913,7 +1953,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
>   		 */
>   		__folio_free_raw_hwp(folio, false);
>   	}
> -	return ret;
> +
> +	return has_hwpoison ? -EHWPOISON : 0;
>   }
>   
>   static unsigned long folio_free_raw_hwp(struct folio *folio, bool move_flag)
> @@ -2002,6 +2043,63 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
>   	return ret;
>   }
>   
> +static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
> +{
> +	int ret;
> +	struct llist_node *head;
> +	struct raw_hwp_page *curr, *next;
> +	struct page *page;
> +	unsigned long pfn;
> +
> +	/*
> +	 * Since folio is still in the folio_batch, drop the refcount
> +	 * elevated by filemap_get_folios.
> +	 */
> +	folio_put_refs(folio, 1);
> +	head = llist_del_all(raw_hwp_list_head(folio));
In my opinion we should wait until your other patch set is approved to 
decide whether the folio's raw_hwp_list has to be removed from the folio 
or should be left there so that the recycling of this huge page works 
correctly...


> +
> +	/*
> +	 * Release refcounts held by try_memory_failure_hugetlb, one per
> +	 * HWPoison-ed page in the raw hwp list.
> +	 */
> +	llist_for_each_entry(curr, head, node) {
> +		SetPageHWPoison(curr->page);
> +		folio_put(folio);
> +	}
> +
> +	/* Refcount now should be zero and ready to dissolve folio. */
> +	ret = dissolve_free_hugetlb_folio(folio);
> +	if (ret) {
> +		pr_err("failed to dissolve hugetlb folio: %d\n", ret);
> +		return;
> +	}
> +
> +	llist_for_each_entry_safe(curr, next, head, node) {
> +		page = curr->page;
> +		pfn = page_to_pfn(page);
> +		drain_all_pages(page_zone(page));
> +		if (!take_page_off_buddy(page))
> +			pr_err("%#lx: unable to take off buddy allocator\n", pfn);
> +
> +		page_ref_inc(page);
> +		kfree(curr);
> +		pr_info("%#lx: pending hard offline completed\n", pfn);
> +	}
> +}

Let's revisit the above function when an agreement is reached on the 
hugetlb page recycling proposal.


> +
> +void filemap_offline_hwpoison_folio(struct address_space *mapping,
> +				    struct folio *folio)
> +{
> +	WARN_ON_ONCE(!mapping);
> +
> +	if (!folio_test_hwpoison(folio))
> +		return;
> +
> +	/* Pending MFR currently only exist for hugetlb. */
> +	if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> +		filemap_offline_hwpoison_folio_hugetlb(folio);
> +}
> +
>   /*
>    * Taking refcount of hugetlb pages needs extra care about race conditions
>    * with basic operations like hugepage allocation/free/demotion.


HTH

Best regards,
William.

[-- Attachment #2: Type: text/html, Size: 20426 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy
  2025-11-16  1:32 ` [PATCH v2 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy Jiaqi Yan
  2025-11-25 21:47   ` William Roche
@ 2025-11-25 22:04   ` William Roche
  2025-12-03  4:11   ` jane.chu
  2 siblings, 0 replies; 10+ messages in thread
From: William Roche @ 2025-11-25 22:04 UTC (permalink / raw)
  To: Jiaqi Yan, nao.horiguchi, linmiaohe, harry.yoo
  Cc: tony.luck, wangkefeng.wang, willy, jane.chu, akpm, osalvador,
	rientjes, duenwen, jthoughton, jgg, ankita, peterx,
	sidhartha.kumar, ziy, david, dave.hansen, muchun.song, linux-mm,
	linux-kernel, linux-fsdevel

Sorry, resending for the non-HTML version.
  --

Hello Jiaqi,

Here is a summary of a few nits in this code:

  - Some function declarations look problematic to me
  - The flag test that activates the feature looks incorrect
  - The function signature change is probably not necessary
  - Maybe we should wait for an agreement on your other proposal:
[PATCH v1 0/2] Only free healthy pages in high-order HWPoison folio

The last item is not a nit, but since your proposal above may require 
keeping all the data of a hugetlb folio in order to recycle it correctly 
(especially the list of poisoned sub-pages), and avoiding the race 
condition of returning poisoned pages to the freelist right before 
removing them, you may need to change some aspects of the current code.




On 11/16/25 02:32, Jiaqi Yan wrote:
> [...]
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 8e63e46b8e1f0..b7733ef5ee917 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -871,10 +871,17 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn,
>   
>   #ifdef CONFIG_MEMORY_FAILURE
>   extern void folio_clear_hugetlb_hwpoison(struct folio *folio);
> +extern bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
> +						struct address_space *mapping);
>   #else
>   static inline void folio_clear_hugetlb_hwpoison(struct folio *folio)
>   {
>   }
> +static inline bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio
> +						       struct address_space *mapping)
> +{
> +	return false;
> +}
>   #endif

You are conditionally declaring this 
hugetlb_should_keep_hwpoison_mapped() function and implementing it in 
mm/hugetlb.c, but that file is compiled in both cases 
(CONFIG_MEMORY_FAILURE enabled or not).
So you either need a single consistent declaration matching the 
implementation, using something like this:

  bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
                                           struct address_space *mapping)
  {
+#ifdef CONFIG_MEMORY_FAILURE
         if (WARN_ON_ONCE(!folio_test_hugetlb(folio)))
                 return false;

         if (!mapping)
                 return false;

         return mapping_mf_keep_ue_mapped(mapping);
+#else
+        return false;
+#endif
  }

Or keep your double declaration and only build the implementation when 
CONFIG_MEMORY_FAILURE is enabled:

+#ifdef CONFIG_MEMORY_FAILURE
  bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
                                           struct address_space *mapping)
  {
         if (WARN_ON_ONCE(!folio_test_hugetlb(folio)))
                 return false;

         if (!mapping)
                 return false;

         return mapping_mf_keep_ue_mapped(mapping);
  }
+#endif



>   
>   #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 09b581c1d878d..9ad511aacde7c 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -213,6 +213,8 @@ enum mapping_flags {
>   	AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM = 9,
>   	AS_KERNEL_FILE = 10,	/* mapping for a fake kernel file that shouldn't
>   				   account usage to user cgroups */
> +	/* For MFD_MF_KEEP_UE_MAPPED. */
> +	AS_MF_KEEP_UE_MAPPED = 11,
>   	/* Bits 16-25 are used for FOLIO_ORDER */
>   	AS_FOLIO_ORDER_BITS = 5,
>   	AS_FOLIO_ORDER_MIN = 16,
> @@ -348,6 +350,16 @@ static inline bool mapping_writeback_may_deadlock_on_reclaim(const struct addres
>   	return test_bit(AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM, &mapping->flags);
>   }
>   
> +static inline bool mapping_mf_keep_ue_mapped(const struct address_space *mapping)
> +{
> +	return test_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
> +}
> +
> +static inline void mapping_set_mf_keep_ue_mapped(struct address_space *mapping)
> +{
> +	set_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
> +}
> +
>   static inline gfp_t mapping_gfp_mask(const struct address_space *mapping)
>   {
>   	return mapping->gfp_mask;
> @@ -1274,6 +1286,18 @@ void replace_page_cache_folio(struct folio *old, struct folio *new);
>   void delete_from_page_cache_batch(struct address_space *mapping,
>   				  struct folio_batch *fbatch);
>   bool filemap_release_folio(struct folio *folio, gfp_t gfp);
> +#ifdef CONFIG_MEMORY_FAILURE
> +/*
> + * Provided by memory failure to offline HWPoison-ed folio managed by memfd.
> + */
> +void filemap_offline_hwpoison_folio(struct address_space *mapping,
> +				    struct folio *folio);
> +#else
> +void filemap_offline_hwpoison_folio(struct address_space *mapping,
> +				    struct folio *folio)
> +{
> +}
> +#endif
>   loff_t mapping_seek_hole_data(struct address_space *, loff_t start, loff_t end,
>   		int whence);
>   

This filemap_offline_hwpoison_folio() declaration is also problematic in 
the case without CONFIG_MEMORY_FAILURE, as it defines a non-static 
filemap_offline_hwpoison_folio() function in every file that includes 
this "pagemap.h" header.

This could be solved by using "static inline" in this second case.
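
For example, the !CONFIG_MEMORY_FAILURE stub could become something like 
this (just a sketch of the idea):

  static inline void filemap_offline_hwpoison_folio(struct address_space *mapping,
                                                    struct folio *folio)
  {
  }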



> diff --git a/mm/memfd.c b/mm/memfd.c
> index 1d109c1acf211..bfdde4cf90500 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -313,7 +313,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
>   #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
>   #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>   
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> +		       MFD_NOEXEC_SEAL | MFD_EXEC | MFD_MF_KEEP_UE_MAPPED)
>   
>   static int check_sysctl_memfd_noexec(unsigned int *flags)
>   {
> @@ -387,6 +388,8 @@ static int sanitize_flags(unsigned int *flags_ptr)
>   	if (!(flags & MFD_HUGETLB)) {
>   		if (flags & ~MFD_ALL_FLAGS)
>   			return -EINVAL;
> +		if (flags & MFD_MF_KEEP_UE_MAPPED)
> +			return -EINVAL;
>   	} else {
>   		/* Allow huge page size encoding in flags. */
>   		if (flags & ~(MFD_ALL_FLAGS |
> @@ -447,6 +450,16 @@ static struct file *alloc_file(const char *name, unsigned int flags)
>   	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
>   	file->f_flags |= O_LARGEFILE;
>   
> +	/*
> +	 * MFD_MF_KEEP_UE_MAPPED can only be specified in memfd_create; no API
> +	 * to update it once memfd is created. MFD_MF_KEEP_UE_MAPPED is not
> +	 * seal-able.
> +	 *
> +	 * For now MFD_MF_KEEP_UE_MAPPED is only supported by HugeTLBFS.
> +	 */
> +	if (flags & (MFD_HUGETLB | MFD_MF_KEEP_UE_MAPPED))
> +		mapping_set_mf_keep_ue_mapped(file->f_mapping);
> +

The flags value we need to test in order to set the "keep" bit on the 
address space is MFD_MF_KEEP_UE_MAPPED alone: as written, the condition 
is also true for a plain MFD_HUGETLB memfd that did not request 
MFD_MF_KEEP_UE_MAPPED, and we have already verified above that the new 
flag is only accepted in combination with MFD_HUGETLB.
This is a nit identified by Harry Yoo during our internal conversations. 
Thanks Harry!
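
The check could then be reduced to something like this (sketch):

         if (flags & MFD_MF_KEEP_UE_MAPPED)
                 mapping_set_mf_keep_ue_mapped(file->f_mapping);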


>   	if (flags & MFD_NOEXEC_SEAL) {
>   		struct inode *inode = file_inode(file);
>   
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3edebb0cda30b..c5e3e28872797 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -373,11 +373,13 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
>    * Schedule a process for later kill.
>    * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
>    */
> -static void __add_to_kill(struct task_struct *tsk, const struct page *p,
> +static void __add_to_kill(struct task_struct *tsk, struct page *p,
>   			  struct vm_area_struct *vma, struct list_head *to_kill,
>   			  unsigned long addr)

Is there any reason to remove the "const" on the page structure in the 
signature?
It looks like you only do that for the new call to page_folio(p), but we 
never modify the page.


>   {
>   	struct to_kill *tk;
> +	struct folio *folio;

You could use a "const" struct folio *folio too.



> +	struct address_space *mapping;
>   
>   	tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
>   	if (!tk) {
> @@ -388,8 +390,19 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
>   	tk->addr = addr;
>   	if (is_zone_device_page(p))
>   		tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
> -	else
> -		tk->size_shift = folio_shift(page_folio(p));
> +	else {
> +		folio = page_folio(p);

Now with both folio and p being "const", the code should work.



> +		mapping = folio_mapping(folio);
> +		if (mapping && mapping_mf_keep_ue_mapped(mapping))
> +			/*
> +			 * Let userspace know the radius of HWPoison is
> +			 * the size of raw page; accessing other pages
> +			 * inside the folio is still ok.
> +			 */
> +			tk->size_shift = PAGE_SHIFT;
> +		else
> +			tk->size_shift = folio_shift(folio);
> +	}
>   
>   	/*
>   	 * Send SIGKILL if "tk->addr == -EFAULT". Also, as
> @@ -414,7 +427,7 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
>   	list_add_tail(&tk->nd, to_kill);
>   }
>   
> -static void add_to_kill_anon_file(struct task_struct *tsk, const struct page *p,
> +static void add_to_kill_anon_file(struct task_struct *tsk, struct page *p,

No need to change the signature here either (otherwise you would also 
have had to change both add_to_kill_fsdax() and add_to_kill_ksm()).


>   		struct vm_area_struct *vma, struct list_head *to_kill,
>   		unsigned long addr)
>   {
> @@ -535,7 +548,7 @@ struct task_struct *task_early_kill(struct task_struct *tsk, int force_early)
>    * Collect processes when the error hit an anonymous page.
>    */
>   static void collect_procs_anon(const struct folio *folio,
> -		const struct page *page, struct list_head *to_kill,
> +		struct page *page, struct list_head *to_kill,

No need to change


>   		int force_early)
>   {
>   	struct task_struct *tsk;
> @@ -573,7 +586,7 @@ static void collect_procs_anon(const struct folio *folio,
>    * Collect processes when the error hit a file mapped page.
>    */
>   static void collect_procs_file(const struct folio *folio,
> -		const struct page *page, struct list_head *to_kill,
> +		struct page *page, struct list_head *to_kill,
>   		int force_early)

No need to change

>   {
>   	struct vm_area_struct *vma;
> @@ -655,7 +668,7 @@ static void collect_procs_fsdax(const struct page *page,
>   /*
>    * Collect the processes who have the corrupted page mapped to kill.
>    */
> -static void collect_procs(const struct folio *folio, const struct page *page,
> +static void collect_procs(const struct folio *folio, struct page *page,
>   		struct list_head *tokill, int force_early)
>   {
>   	if (!folio->mapping)
> @@ -1173,6 +1186,13 @@ static int me_huge_page(struct page_state *ps, struct page *p)
>   		}
>   	}
>   
> +	/*
> +	 * MF still needs to holds a refcount for the deferred actions in
> +	 * filemap_offline_hwpoison_folio.
> +	 */
> +	if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> +		return res;
> +
>   	if (has_extra_refcount(ps, p, extra_pins))
>   		res = MF_FAILED;
>   
> @@ -1569,6 +1589,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
>   {
>   	LIST_HEAD(tokill);
>   	bool unmap_success;
> +	bool keep_mapped;
>   	int forcekill;
>   	bool mlocked = folio_test_mlocked(folio);
>   
> @@ -1596,8 +1617,12 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
>   	 */
>   	collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
>   
> -	unmap_success = !unmap_poisoned_folio(folio, pfn, flags & MF_MUST_KILL);
> -	if (!unmap_success)
> +	keep_mapped = hugetlb_should_keep_hwpoison_mapped(folio, folio->mapping);
> +	if (!keep_mapped)
> +		unmap_poisoned_folio(folio, pfn, flags & MF_MUST_KILL);
> +
> +	unmap_success = !folio_mapped(folio);
> +	if (!keep_mapped && !unmap_success)
>   		pr_err("%#lx: failed to unmap page (folio mapcount=%d)\n",
>   		       pfn, folio_mapcount(folio));
>   
> @@ -1622,7 +1647,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
>   		    !unmap_success;
>   	kill_procs(&tokill, forcekill, pfn, flags);
>   
> -	return unmap_success;
> +	return unmap_success || keep_mapped;
>   }
>   
>   static int identify_page_state(unsigned long pfn, struct page *p,
> @@ -1862,6 +1887,13 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
>   	unsigned long count = 0;
>   
>   	head = llist_del_all(raw_hwp_list_head(folio));
> +	/*
> +	 * If filemap_offline_hwpoison_folio_hugetlb is handling this folio,
> +	 * it has already taken off the head of the llist.
> +	 */
> +	if (head == NULL)
> +		return 0;
> +

This may not be necessary depending on how we recycle hugetlb pages -- 
see below too.

>   	llist_for_each_entry_safe(p, next, head, node) {
>   		if (move_flag)
>   			SetPageHWPoison(p->page);
> @@ -1878,7 +1910,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
>   	struct llist_head *head;
>   	struct raw_hwp_page *raw_hwp;
>   	struct raw_hwp_page *p;
> -	int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0;
> +	struct address_space *mapping = folio->mapping;
> +	bool has_hwpoison = folio_test_set_hwpoison(folio);
>   
>   	/*
>   	 * Once the hwpoison hugepage has lost reliable raw error info,
> @@ -1897,8 +1930,15 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
>   	if (raw_hwp) {
>   		raw_hwp->page = page;
>   		llist_add(&raw_hwp->node, head);
> +		if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> +			/*
> +			 * A new raw HWPoison page. Don't return HWPOISON.
> +			 * Error event will be counted in action_result().
> +			 */
> +			return 0;
> +
>   		/* the first error event will be counted in action_result(). */
> -		if (ret)
> +		if (has_hwpoison)
>   			num_poisoned_pages_inc(page_to_pfn(page));
>   	} else {
>   		/*
> @@ -1913,7 +1953,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
>   		 */
>   		__folio_free_raw_hwp(folio, false);
>   	}
> -	return ret;
> +
> +	return has_hwpoison ? -EHWPOISON : 0;
>   }
>   
>   static unsigned long folio_free_raw_hwp(struct folio *folio, bool move_flag)
> @@ -2002,6 +2043,63 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
>   	return ret;
>   }
>   
> +static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
> +{
> +	int ret;
> +	struct llist_node *head;
> +	struct raw_hwp_page *curr, *next;
> +	struct page *page;
> +	unsigned long pfn;
> +
> +	/*
> +	 * Since folio is still in the folio_batch, drop the refcount
> +	 * elevated by filemap_get_folios.
> +	 */
> +	folio_put_refs(folio, 1);
> +	head = llist_del_all(raw_hwp_list_head(folio));

In my opinion we should wait until your other patch set is approved to 
decide whether the folio's raw_hwp_list has to be removed from the folio 
or should be left there so that the recycling of this huge page works 
correctly...

> +
> +	/*
> +	 * Release refcounts held by try_memory_failure_hugetlb, one per
> +	 * HWPoison-ed page in the raw hwp list.
> +	 */
> +	llist_for_each_entry(curr, head, node) {
> +		SetPageHWPoison(curr->page);
> +		folio_put(folio);
> +	}
> +
> +	/* Refcount now should be zero and ready to dissolve folio. */
> +	ret = dissolve_free_hugetlb_folio(folio);
> +	if (ret) {
> +		pr_err("failed to dissolve hugetlb folio: %d\n", ret);
> +		return;
> +	}
> +
> +	llist_for_each_entry_safe(curr, next, head, node) {
> +		page = curr->page;
> +		pfn = page_to_pfn(page);
> +		drain_all_pages(page_zone(page));
> +		if (!take_page_off_buddy(page))
> +			pr_err("%#lx: unable to take off buddy allocator\n", pfn);
> +
> +		page_ref_inc(page);
> +		kfree(curr);
> +		pr_info("%#lx: pending hard offline completed\n", pfn);
> +	}
> +}

Let's revisit the above function when an agreement is reached on the 
hugetlb page recycling proposal.





> +
> +void filemap_offline_hwpoison_folio(struct address_space *mapping,
> +				    struct folio *folio)
> +{
> +	WARN_ON_ONCE(!mapping);
> +
> +	if (!folio_test_hwpoison(folio))
> +		return;
> +
> +	/* Pending MFR currently only exist for hugetlb. */
> +	if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> +		filemap_offline_hwpoison_folio_hugetlb(folio);
> +}
> +
>   /*
>    * Taking refcount of hugetlb pages needs extra care about race conditions
>    * with basic operations like hugepage allocation/free/demotion.


HTH

Best regards,
William.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy
  2025-11-16  1:32 ` [PATCH v2 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy Jiaqi Yan
  2025-11-25 21:47   ` William Roche
  2025-11-25 22:04   ` William Roche
@ 2025-12-03  4:11   ` jane.chu
  2025-12-03 19:41     ` Jiaqi Yan
  2 siblings, 1 reply; 10+ messages in thread
From: jane.chu @ 2025-12-03  4:11 UTC (permalink / raw)
  To: Jiaqi Yan, nao.horiguchi, linmiaohe, william.roche, harry.yoo
  Cc: tony.luck, wangkefeng.wang, willy, akpm, osalvador, rientjes,
	duenwen, jthoughton, jgg, ankita, peterx, sidhartha.kumar, ziy,
	david, dave.hansen, muchun.song, linux-mm, linux-kernel,
	linux-fsdevel

Hi, Jiaqi,

Thanks for the work; my comments are inline.

On 11/15/2025 5:32 PM, Jiaqi Yan wrote:
> Sometimes immediately hard offlining a large chunk of contigous memory
> having uncorrected memory errors (UE) may not be the best option.
> Cloud providers usually serve capacity- and performance-critical guest
> memory with 1G HugeTLB hugepages, as this significantly reduces the
> overhead associated with managing page tables and TLB misses. However,
> for today's HugeTLB system, once a byte of memory in a hugepage is
> hardware corrupted, the kernel discards the whole hugepage, including
> the healthy portion. Customer workload running in the VM can hardly
> recover from such a great loss of memory.
> 
> Therefore keeping or discarding a large chunk of contiguous memory
> owned by userspace (particularly to serve guest memory) due to
> recoverable UE may better be controlled by userspace process
> that owns the memory, e.g. VMM in Cloud environment.
> 
> Introduce a memfd-based userspace memory failure (MFR) policy,
> MFD_MF_KEEP_UE_MAPPED. It is intended to be supported for other memfd,
> but the current implementation only covers HugeTLB.
> 
> For any hugepage associated with the MFD_MF_KEEP_UE_MAPPED enabled memfd,
> whenever it runs into a UE, MFR doesn't hard offline the HWPoison-ed
> huge folio. IOW the HWPoison-ed memory remains accessible via the memory
> mapping created with that memfd. MFR still sends SIGBUS to the process
> as required. MFR also still maintains HWPoison metadata for the hugepage
> having the UE.
> 
> A HWPoison-ed hugepage will be immediately isolated and prevented from
> future allocation once userspace truncates it via the memfd, or the
> owning memfd is closed.
> 
> By default MFD_MF_KEEP_UE_MAPPED is not set, and MFR hard offlines
> hugepages having UEs.
> 
> Tested with selftest in the follow-up commit.
> 
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> Tested-by: William Roche <william.roche@oracle.com>
> ---
>   fs/hugetlbfs/inode.c       |  25 +++++++-
>   include/linux/hugetlb.h    |   7 +++
>   include/linux/pagemap.h    |  24 +++++++
>   include/uapi/linux/memfd.h |   6 ++
>   mm/hugetlb.c               |  20 +++++-
>   mm/memfd.c                 |  15 ++++-
>   mm/memory-failure.c        | 124 +++++++++++++++++++++++++++++++++----
>   7 files changed, 202 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index f42548ee9083c..f8a5aa091d51d 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -532,6 +532,18 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
>   	}
>   
>   	folio_unlock(folio);
> +
> +	/*
> +	 * There may be pending HWPoison-ed folios when a memfd is being
> +	 * removed or part of it is being truncated.
> +	 *
> +	 * HugeTLBFS' error_remove_folio keeps the HWPoison-ed folios in
> +	 * page cache until mm wants to drop the folio at the end of the
> +	 * of the filemap. At this point, if memory failure was delayed
> +	 * by MFD_MF_KEEP_UE_MAPPED in the past, we can now deal with it.
> +	 */
> +	filemap_offline_hwpoison_folio(mapping, folio);
> +
>   	return ret;
>   }

Looks okay.

>   
> @@ -563,13 +575,13 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
>   	const pgoff_t end = lend >> PAGE_SHIFT;
>   	struct folio_batch fbatch;
>   	pgoff_t next, index;
> -	int i, freed = 0;
> +	int i, j, freed = 0;
>   	bool truncate_op = (lend == LLONG_MAX);
>   
>   	folio_batch_init(&fbatch);
>   	next = lstart >> PAGE_SHIFT;
>   	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
> -		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> +		for (i = 0, j = 0; i < folio_batch_count(&fbatch); ++i) {
>   			struct folio *folio = fbatch.folios[i];
>   			u32 hash = 0;
>   
> @@ -584,8 +596,17 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
>   							index, truncate_op))
>   				freed++;
>   
> +			/*
> +			 * Skip HWPoison-ed hugepages, which should no
> +			 * longer be hugetlb if successfully dissolved.
> +			 */
> +			if (folio_test_hugetlb(folio))
> +				fbatch.folios[j++] = folio;
> +
>   			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>   		}
> +		fbatch.nr = j;
> +
>   		folio_batch_release(&fbatch);
>   		cond_resched();
>   	}

Looks okay.

But this reminds me that, for now, remove_inode_single_folio() has no 
path that returns 'false' anyway, and if it ever does, 
remove_inode_hugepages() will be broken since it has no logic to account 
for folios that failed to be removed.  Would you mind making 
remove_inode_single_folio() a void function in order to avoid the 
confusion?
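
The call site in remove_inode_hugepages() would then be something like 
this (just a sketch, counting every removed folio since the helper never 
returns false today):

			remove_inode_single_folio(h, inode, mapping, folio,
						  index, truncate_op);
			freed++;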


> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 8e63e46b8e1f0..b7733ef5ee917 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -871,10 +871,17 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn,
>   
>   #ifdef CONFIG_MEMORY_FAILURE
>   extern void folio_clear_hugetlb_hwpoison(struct folio *folio);
> +extern bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
> +						struct address_space *mapping);
>   #else
>   static inline void folio_clear_hugetlb_hwpoison(struct folio *folio)
>   {
>   }
> +static inline bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio
> +						       struct address_space *mapping)
> +{
> +	return false;
> +}
>   #endif

It appears that hugetlb_should_keep_hwpoison_mapped() is only called 
within mm/memory-failure.c.  How about moving it there?

>   
>   #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 09b581c1d878d..9ad511aacde7c 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -213,6 +213,8 @@ enum mapping_flags {
>   	AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM = 9,
>   	AS_KERNEL_FILE = 10,	/* mapping for a fake kernel file that shouldn't
>   				   account usage to user cgroups */
> +	/* For MFD_MF_KEEP_UE_MAPPED. */
> +	AS_MF_KEEP_UE_MAPPED = 11,
>   	/* Bits 16-25 are used for FOLIO_ORDER */
>   	AS_FOLIO_ORDER_BITS = 5,
>   	AS_FOLIO_ORDER_MIN = 16,
> @@ -348,6 +350,16 @@ static inline bool mapping_writeback_may_deadlock_on_reclaim(const struct addres
>   	return test_bit(AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM, &mapping->flags);
>   }
>   
Okay.

> +static inline bool mapping_mf_keep_ue_mapped(const struct address_space *mapping)
> +{
> +	return test_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
> +}
> +
> +static inline void mapping_set_mf_keep_ue_mapped(struct address_space *mapping)
> +{
> +	set_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
> +}
> +
>   static inline gfp_t mapping_gfp_mask(const struct address_space *mapping)
>   {
>   	return mapping->gfp_mask;
> @@ -1274,6 +1286,18 @@ void replace_page_cache_folio(struct folio *old, struct folio *new);
>   void delete_from_page_cache_batch(struct address_space *mapping,
>   				  struct folio_batch *fbatch);
>   bool filemap_release_folio(struct folio *folio, gfp_t gfp);
> +#ifdef CONFIG_MEMORY_FAILURE
> +/*
> + * Provided by memory failure to offline HWPoison-ed folio managed by memfd.
> + */
> +void filemap_offline_hwpoison_folio(struct address_space *mapping,
> +				    struct folio *folio);
> +#else
> +void filemap_offline_hwpoison_folio(struct address_space *mapping,
> +				    struct folio *folio)
> +{
> +}
> +#endif

Okay.

>   loff_t mapping_seek_hole_data(struct address_space *, loff_t start, loff_t end,
>   		int whence);
>   
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 273a4e15dfcff..d9875da551b7f 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -12,6 +12,12 @@
>   #define MFD_NOEXEC_SEAL		0x0008U
>   /* executable */
>   #define MFD_EXEC		0x0010U
> +/*
> + * Keep owned folios mapped when uncorrectable memory errors (UE) causes
> + * memory failure (MF) within the folio. Only at the end of the mapping
> + * will its HWPoison-ed folios be dealt with.
> + */
> +#define MFD_MF_KEEP_UE_MAPPED	0x0020U
>   
>   /*
>    * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0455119716ec0..dd3bc0b75e059 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6415,6 +6415,18 @@ static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm, unsigned
>   	return same;
>   }
>   
> +bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
> +					 struct address_space *mapping)
> +{
> +	if (WARN_ON_ONCE(!folio_test_hugetlb(folio)))
> +		return false;
> +
> +	if (!mapping)
> +		return false;
> +
> +	return mapping_mf_keep_ue_mapped(mapping);
> +}
> +

Okay.

>   static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>   			struct vm_fault *vmf)
>   {
> @@ -6537,9 +6549,11 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>   		 * So we need to block hugepage fault by PG_hwpoison bit check.
>   		 */
>   		if (unlikely(folio_test_hwpoison(folio))) {
> -			ret = VM_FAULT_HWPOISON_LARGE |
> -				VM_FAULT_SET_HINDEX(hstate_index(h));
> -			goto backout_unlocked;
> +			if (!mapping_mf_keep_ue_mapped(mapping)) {
> +				ret = VM_FAULT_HWPOISON_LARGE |
> +				      VM_FAULT_SET_HINDEX(hstate_index(h));
> +				goto backout_unlocked;
> +			}
>   		}
>   

Looks okay, but I am curious about Miaohe's and the others' take.

Allowing a known poisoned hugetlb page to be faulted in is done for the 
sake of capacity, as opposed to a SIGBUS from the MF handler, which 
means a disruption plus the loss of both data and capacity.
No strong opinion here, just wondering if there is any merit in limiting 
the scope to the MF handler only.

>   		/* Check for page in userfault range. */
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 1d109c1acf211..bfdde4cf90500 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -313,7 +313,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
>   #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
>   #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>   
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> +		       MFD_NOEXEC_SEAL | MFD_EXEC | MFD_MF_KEEP_UE_MAPPED)
>   
>   static int check_sysctl_memfd_noexec(unsigned int *flags)
>   {
> @@ -387,6 +388,8 @@ static int sanitize_flags(unsigned int *flags_ptr)
>   	if (!(flags & MFD_HUGETLB)) {
>   		if (flags & ~MFD_ALL_FLAGS)
>   			return -EINVAL;
> +		if (flags & MFD_MF_KEEP_UE_MAPPED)
> +			return -EINVAL;
>   	} else {
>   		/* Allow huge page size encoding in flags. */
>   		if (flags & ~(MFD_ALL_FLAGS |
> @@ -447,6 +450,16 @@ static struct file *alloc_file(const char *name, unsigned int flags)
>   	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
>   	file->f_flags |= O_LARGEFILE;
>   
> +	/*
> +	 * MFD_MF_KEEP_UE_MAPPED can only be specified in memfd_create; no API
> +	 * to update it once memfd is created. MFD_MF_KEEP_UE_MAPPED is not
> +	 * seal-able.
> +	 *
> +	 * For now MFD_MF_KEEP_UE_MAPPED is only supported by HugeTLBFS.
> +	 */
> +	if (flags & (MFD_HUGETLB | MFD_MF_KEEP_UE_MAPPED))
> +		mapping_set_mf_keep_ue_mapped(file->f_mapping);
> +
>   	if (flags & MFD_NOEXEC_SEAL) {
>   		struct inode *inode = file_inode(file);
>   

Okay.

> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3edebb0cda30b..c5e3e28872797 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -373,11 +373,13 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
>    * Schedule a process for later kill.
>    * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
>    */
> -static void __add_to_kill(struct task_struct *tsk, const struct page *p,
> +static void __add_to_kill(struct task_struct *tsk, struct page *p,
>   			  struct vm_area_struct *vma, struct list_head *to_kill,
>   			  unsigned long addr)
>   {
>   	struct to_kill *tk;
> +	struct folio *folio;
> +	struct address_space *mapping;
>   
>   	tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
>   	if (!tk) {
> @@ -388,8 +390,19 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
>   	tk->addr = addr;
>   	if (is_zone_device_page(p))
>   		tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
> -	else
> -		tk->size_shift = folio_shift(page_folio(p));
> +	else {
> +		folio = page_folio(p);
> +		mapping = folio_mapping(folio);
> +		if (mapping && mapping_mf_keep_ue_mapped(mapping))
> +			/*
> +			 * Let userspace know the radius of HWPoison is
> +			 * the size of raw page; accessing other pages
> +			 * inside the folio is still ok.
> +			 */
> +			tk->size_shift = PAGE_SHIFT;
> +		else
> +			tk->size_shift = folio_shift(folio);
> +	}
>   
>   	/*
>   	 * Send SIGKILL if "tk->addr == -EFAULT". Also, as
> @@ -414,7 +427,7 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
>   	list_add_tail(&tk->nd, to_kill);
>   }
>   
> -static void add_to_kill_anon_file(struct task_struct *tsk, const struct page *p,
> +static void add_to_kill_anon_file(struct task_struct *tsk, struct page *p,
>   		struct vm_area_struct *vma, struct list_head *to_kill,
>   		unsigned long addr)
>   {
> @@ -535,7 +548,7 @@ struct task_struct *task_early_kill(struct task_struct *tsk, int force_early)
>    * Collect processes when the error hit an anonymous page.
>    */
>   static void collect_procs_anon(const struct folio *folio,
> -		const struct page *page, struct list_head *to_kill,
> +		struct page *page, struct list_head *to_kill,
>   		int force_early)
>   {
>   	struct task_struct *tsk;
> @@ -573,7 +586,7 @@ static void collect_procs_anon(const struct folio *folio,
>    * Collect processes when the error hit a file mapped page.
>    */
>   static void collect_procs_file(const struct folio *folio,
> -		const struct page *page, struct list_head *to_kill,
> +		struct page *page, struct list_head *to_kill,
>   		int force_early)
>   {
>   	struct vm_area_struct *vma;
> @@ -655,7 +668,7 @@ static void collect_procs_fsdax(const struct page *page,
>   /*
>    * Collect the processes who have the corrupted page mapped to kill.
>    */
> -static void collect_procs(const struct folio *folio, const struct page *page,
> +static void collect_procs(const struct folio *folio, struct page *page,
>   		struct list_head *tokill, int force_early)
>   {
>   	if (!folio->mapping)
> @@ -1173,6 +1186,13 @@ static int me_huge_page(struct page_state *ps, struct page *p)
>   		}
>   	}
>   
> +	/*
> +	 * MF still needs to holds a refcount for the deferred actions in
> +	 * filemap_offline_hwpoison_folio.
> +	 */
> +	if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> +		return res;
> +

Okay.

>   	if (has_extra_refcount(ps, p, extra_pins))
>   		res = MF_FAILED;
>   
> @@ -1569,6 +1589,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
>   {
>   	LIST_HEAD(tokill);
>   	bool unmap_success;
> +	bool keep_mapped;
>   	int forcekill;
>   	bool mlocked = folio_test_mlocked(folio);
>   
> @@ -1596,8 +1617,12 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
>   	 */
>   	collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
>   
> -	unmap_success = !unmap_poisoned_folio(folio, pfn, flags & MF_MUST_KILL);
> -	if (!unmap_success)
> +	keep_mapped = hugetlb_should_keep_hwpoison_mapped(folio, folio->mapping);
> +	if (!keep_mapped)
> +		unmap_poisoned_folio(folio, pfn, flags & MF_MUST_KILL);
> +
> +	unmap_success = !folio_mapped(folio);
> +	if (!keep_mapped && !unmap_success)
>   		pr_err("%#lx: failed to unmap page (folio mapcount=%d)\n",
>   		       pfn, folio_mapcount(folio));
>   
> @@ -1622,7 +1647,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
>   		    !unmap_success;
>   	kill_procs(&tokill, forcekill, pfn, flags);
>   
> -	return unmap_success;
> +	return unmap_success || keep_mapped;
>   }

Okay.

>   
>   static int identify_page_state(unsigned long pfn, struct page *p,
> @@ -1862,6 +1887,13 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
>   	unsigned long count = 0;
>   
>   	head = llist_del_all(raw_hwp_list_head(folio));
> +	/*
> +	 * If filemap_offline_hwpoison_folio_hugetlb is handling this folio,
> +	 * it has already taken off the head of the llist.
> +	 */
> +	if (head == NULL)
> +		return 0;
> +
>   	llist_for_each_entry_safe(p, next, head, node) {
>   		if (move_flag)
>   			SetPageHWPoison(p->page);
> @@ -1878,7 +1910,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
>   	struct llist_head *head;
>   	struct raw_hwp_page *raw_hwp;
>   	struct raw_hwp_page *p;
> -	int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0;
> +	struct address_space *mapping = folio->mapping;
> +	bool has_hwpoison = folio_test_set_hwpoison(folio);
>   
>   	/*
>   	 * Once the hwpoison hugepage has lost reliable raw error info,
> @@ -1897,8 +1930,15 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
>   	if (raw_hwp) {
>   		raw_hwp->page = page;
>   		llist_add(&raw_hwp->node, head);
> +		if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> +			/*
> +			 * A new raw HWPoison page. Don't return HWPOISON.
> +			 * Error event will be counted in action_result().
> +			 */
> +			return 0;
> +
>   		/* the first error event will be counted in action_result(). */
> -		if (ret)
> +		if (has_hwpoison)
>   			num_poisoned_pages_inc(page_to_pfn(page));
>   	} else {
>   		/*
> @@ -1913,7 +1953,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
>   		 */
>   		__folio_free_raw_hwp(folio, false);
>   	}
> -	return ret;
> +
> +	return has_hwpoison ? -EHWPOISON : 0;
>   }

Okay.

>   
>   static unsigned long folio_free_raw_hwp(struct folio *folio, bool move_flag)
> @@ -2002,6 +2043,63 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
>   	return ret;
>   }
>   
> +static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
> +{
> +	int ret;
> +	struct llist_node *head;
> +	struct raw_hwp_page *curr, *next;
> +	struct page *page;
> +	unsigned long pfn;
> +
> +	/*
> +	 * Since folio is still in the folio_batch, drop the refcount
> +	 * elevated by filemap_get_folios.
> +	 */
> +	folio_put_refs(folio, 1);
> +	head = llist_del_all(raw_hwp_list_head(folio));
> +
> +	/*
> +	 * Release refcounts held by try_memory_failure_hugetlb, one per
> +	 * HWPoison-ed page in the raw hwp list.
> +	 */
> +	llist_for_each_entry(curr, head, node) {
> +		SetPageHWPoison(curr->page);
> +		folio_put(folio);
> +	}
> +
> +	/* Refcount now should be zero and ready to dissolve folio. */
> +	ret = dissolve_free_hugetlb_folio(folio);
> +	if (ret) {
> +		pr_err("failed to dissolve hugetlb folio: %d\n", ret);
> +		return;
> +	}
> +
> +	llist_for_each_entry_safe(curr, next, head, node) {
> +		page = curr->page;
> +		pfn = page_to_pfn(page);
> +		drain_all_pages(page_zone(page));
> +		if (!take_page_off_buddy(page))
> +			pr_err("%#lx: unable to take off buddy allocator\n", pfn);
> +
> +		page_ref_inc(page);
> +		kfree(curr);
> +		pr_info("%#lx: pending hard offline completed\n", pfn);
> +	}
> +}
> +
> +void filemap_offline_hwpoison_folio(struct address_space *mapping,
> +				    struct folio *folio)
> +{
> +	WARN_ON_ONCE(!mapping);
> +
> +	if (!folio_test_hwpoison(folio))
> +		return;
> +
> +	/* Pending MFR currently only exist for hugetlb. */
> +	if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> +		filemap_offline_hwpoison_folio_hugetlb(folio);
> +}
> +
>   /*
>    * Taking refcount of hugetlb pages needs extra care about race conditions
>    * with basic operations like hugepage allocation/free/demotion.


Looks good.

thanks,
-jane


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 2/3] selftests/mm: test userspace MFR for HugeTLB hugepage
  2025-11-16  1:32 ` [PATCH v2 2/3] selftests/mm: test userspace MFR for HugeTLB hugepage Jiaqi Yan
@ 2025-12-03  4:14   ` jane.chu
  0 siblings, 0 replies; 10+ messages in thread
From: jane.chu @ 2025-12-03  4:14 UTC (permalink / raw)
  To: Jiaqi Yan, nao.horiguchi, linmiaohe, william.roche, harry.yoo
  Cc: tony.luck, wangkefeng.wang, willy, akpm, osalvador, rientjes,
	duenwen, jthoughton, jgg, ankita, peterx, sidhartha.kumar, ziy,
	david, dave.hansen, muchun.song, linux-mm, linux-kernel,
	linux-fsdevel


On 11/15/2025 5:32 PM, Jiaqi Yan wrote:
> Test the userspace memory failure recovery (MFR) policy for HugeTLB
> 1G or 2M hugepage case:
> 1. Create a memfd backed by HugeTLB and had MFD_MF_KEEP_UE_MAPPED set.
> 2. Allocate and map 4 hugepages to the process.
> 3. Create sub-threads to MADV_HWPOISON inner addresses of one hugepage.
> 4. Check if the process gets correct SIGBUS for each poisoned raw page.
> 5. Check if all memory are still accessible and content valid.
> 6. Check if the poisoned hugepage is dealt with after memfd released.
> 
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
>   tools/testing/selftests/mm/.gitignore    |   1 +
>   tools/testing/selftests/mm/Makefile      |   1 +
>   tools/testing/selftests/mm/hugetlb-mfr.c | 327 +++++++++++++++++++++++
>   3 files changed, 329 insertions(+)
>   create mode 100644 tools/testing/selftests/mm/hugetlb-mfr.c
> 

Test looks fine.
Reviewed-by: Jane Chu <jane.chu@oracle.com>


> diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
> index c2a8586e51a1f..11664d20935db 100644
> --- a/tools/testing/selftests/mm/.gitignore
> +++ b/tools/testing/selftests/mm/.gitignore
> @@ -5,6 +5,7 @@ hugepage-mremap
>   hugepage-shm
>   hugepage-vmemmap
>   hugetlb-madvise
> +hugetlb-mfr
>   hugetlb-read-hwpoison
>   hugetlb-soft-offline
>   khugepaged
> diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> index eaf9312097f7b..de3bdcf7914cd 100644
> --- a/tools/testing/selftests/mm/Makefile
> +++ b/tools/testing/selftests/mm/Makefile
> @@ -63,6 +63,7 @@ TEST_GEN_FILES += hmm-tests
>   TEST_GEN_FILES += hugetlb-madvise
>   TEST_GEN_FILES += hugetlb-read-hwpoison
>   TEST_GEN_FILES += hugetlb-soft-offline
> +TEST_GEN_FILES += hugetlb-mfr
>   TEST_GEN_FILES += hugepage-mmap
>   TEST_GEN_FILES += hugepage-mremap
>   TEST_GEN_FILES += hugepage-shm
> diff --git a/tools/testing/selftests/mm/hugetlb-mfr.c b/tools/testing/selftests/mm/hugetlb-mfr.c
> new file mode 100644
> index 0000000000000..30939b2194188
> --- /dev/null
> +++ b/tools/testing/selftests/mm/hugetlb-mfr.c
> @@ -0,0 +1,327 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Test the userspace memory failure recovery (MFR) policy for HugeTLB
> + * hugepage case:
> + * 1. Create a memfd backed by HugeTLB and MFD_MF_KEEP_UE_MAPPED bit set.
> + * 2. Allocate and map 4 hugepages.
> + * 3. Create sub-threads to MADV_HWPOISON inner addresses of one hugepage.
> + * 4. Check if each sub-thread get correct SIGBUS for the poisoned raw page.
> + * 5. Check if all memory are still accessible and content still valid.
> + * 6. Check if the poisoned hugepage is dealt with after memfd released.
> + *
> + * Two ways to run the test:
> + *   ./hugetlb-mfr 2M
> + * or
> + *   ./hugetlb-mfr 1G
> + * assuming /sys/kernel/mm/hugepages/hugepages-${xxx}kB/nr_hugepages > 4
> + */
> +
> +#define _GNU_SOURCE
> +#include <assert.h>
> +#include <errno.h>
> +#include <numaif.h>
> +#include <numa.h>
> +#include <pthread.h>
> +#include <signal.h>
> +#include <stdbool.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <string.h>
> +#include <unistd.h>
> +
> +#include <linux/magic.h>
> +#include <linux/memfd.h>
> +#include <sys/mman.h>
> +#include <sys/prctl.h>
> +#include <sys/statfs.h>
> +#include <sys/types.h>
> +
> +#include "../kselftest.h"
> +#include "vm_util.h"
> +
> +#define EPREFIX			" !!! "
> +#define BYTE_LENTH_IN_1G	0x40000000UL
> +#define BYTE_LENTH_IN_2M	0x200000UL
> +#define HUGETLB_1GB_STR		"1G"
> +#define HUGETLB_2MB_STR		"2M"
> +#define HUGETLB_FILL		0xab
> +
> +static const unsigned long offsets_1g[] = {0x200000, 0x400000, 0x800000};
> +static const unsigned long offsets_2m[] = {0x020000, 0x040000, 0x080000};
> +
> +static void *sigbus_addr;
> +static int sigbus_addr_lsb;
> +static bool expecting_sigbus;
> +static bool got_sigbus;
> +static bool was_mceerr;
> +
> +static int create_hugetlbfs_file(struct statfs *file_stat,
> +				 unsigned long hugepage_size)
> +{
> +	int fd;
> +	int flags = MFD_HUGETLB | MFD_MF_KEEP_UE_MAPPED;
> +
> +	if (hugepage_size == BYTE_LENTH_IN_2M)
> +		flags |= MFD_HUGE_2MB;
> +	else
> +		flags |= MFD_HUGE_1GB;
> +
> +	fd = memfd_create("hugetlb_tmp", flags);
> +	if (fd < 0)
> +		ksft_exit_fail_perror("Failed to memfd_create");
> +
> +	memset(file_stat, 0, sizeof(*file_stat));
> +	if (fstatfs(fd, file_stat)) {
> +		close(fd);
> +		ksft_exit_fail_perror("Failed to fstatfs");
> +	}
> +	if (file_stat->f_type != HUGETLBFS_MAGIC) {
> +		close(fd);
> +		ksft_exit_fail_msg("Not hugetlbfs file");
> +	}
> +
> +	ksft_print_msg("Created hugetlb_tmp file\n");
> +	ksft_print_msg("hugepagesize=%#lx\n", file_stat->f_bsize);
> +	if (file_stat->f_bsize != hugepage_size)
> +		ksft_exit_fail_msg("Hugepage size is not %#lx", hugepage_size);
> +
> +	return fd;
> +}
> +
> +/*
> + * SIGBUS handler for "do_hwpoison" thread that mapped and MADV_HWPOISON
> + */
> +static void sigbus_handler(int signo, siginfo_t *info, void *context)
> +{
> +	if (!expecting_sigbus)
> +		ksft_exit_fail_msg("unexpected sigbus with addr=%p",
> +				   info->si_addr);
> +
> +	got_sigbus = true;
> +	was_mceerr = (info->si_code == BUS_MCEERR_AO ||
> +		      info->si_code == BUS_MCEERR_AR);
> +	sigbus_addr = info->si_addr;
> +	sigbus_addr_lsb = info->si_addr_lsb;
> +}
> +
> +static void *do_hwpoison(void *hwpoison_addr)
> +{
> +	int hwpoison_size = getpagesize();
> +
> +	ksft_print_msg("MADV_HWPOISON hwpoison_addr=%p, len=%d\n",
> +		       hwpoison_addr, hwpoison_size);
> +	if (madvise(hwpoison_addr, hwpoison_size, MADV_HWPOISON) < 0)
> +		ksft_exit_fail_perror("Failed to MADV_HWPOISON");
> +
> +	pthread_exit(NULL);
> +}
> +
> +static void test_hwpoison_multiple_pages(unsigned char *start_addr,
> +					 unsigned long hugepage_size)
> +{
> +	pthread_t pthread;
> +	int ret;
> +	unsigned char *hwpoison_addr;
> +	const unsigned long *offsets;
> +	size_t offsets_count;
> +	size_t i;
> +
> +	if (hugepage_size == BYTE_LENTH_IN_2M) {
> +		offsets = offsets_2m;
> +		offsets_count = ARRAY_SIZE(offsets_2m);
> +	} else {
> +		offsets = offsets_1g;
> +		offsets_count = ARRAY_SIZE(offsets_1g);
> +	}
> +
> +	for (i = 0; i < offsets_count; ++i) {
> +		sigbus_addr = (void *)0xBADBADBAD;
> +		sigbus_addr_lsb = 0;
> +		was_mceerr = false;
> +		got_sigbus = false;
> +		expecting_sigbus = true;
> +		hwpoison_addr = start_addr + offsets[i];
> +
> +		ret = pthread_create(&pthread, NULL, &do_hwpoison, hwpoison_addr);
> +		if (ret)
> +			ksft_exit_fail_perror("Failed to create hwpoison thread");
> +
> +		ksft_print_msg("Created thread to hwpoison and access hwpoison_addr=%p\n",
> +			       hwpoison_addr);
> +
> +		pthread_join(pthread, NULL);
> +
> +		if (!got_sigbus)
> +			ksft_test_result_fail("Didn't get a SIGBUS\n");
> +		if (!was_mceerr)
> +			ksft_test_result_fail("Didn't get a BUS_MCEERR_A(R|O)\n");
> +		if (sigbus_addr != hwpoison_addr)
> +			ksft_test_result_fail("Incorrect address: got=%p, expected=%p\n",
> +					      sigbus_addr, hwpoison_addr);
> +		if (sigbus_addr_lsb != pshift())
> +			ksft_test_result_fail("Incorrect address LSB: got=%d, expected=%d\n",
> +					      sigbus_addr_lsb, pshift());
> +
> +		ksft_print_msg("Received expected and correct SIGBUS\n");
> +	}
> +}
> +
> +static int read_nr_hugepages(unsigned long hugepage_size,
> +			     unsigned long *nr_hugepages)
> +{
> +	char buffer[256] = {0};
> +	char cmd[256] = {0};
> +
> +	sprintf(cmd, "cat /sys/kernel/mm/hugepages/hugepages-%ldkB/nr_hugepages",
> +		hugepage_size);
> +	FILE *cmdfile = popen(cmd, "r");
> +
> +	if (cmdfile == NULL) {
> +		ksft_perror(EPREFIX "failed to popen nr_hugepages");
> +		return -1;
> +	}
> +
> +	if (!fgets(buffer, sizeof(buffer), cmdfile)) {
> +		ksft_perror(EPREFIX "failed to read nr_hugepages");
> +		pclose(cmdfile);
> +		return -1;
> +	}
> +
> +	*nr_hugepages = atoll(buffer);
> +	pclose(cmdfile);
> +	return 0;
> +}
> +
> +/*
> + * Main thread that drives the test.
> + */
> +static void test_main(int fd, unsigned long hugepage_size)
> +{
> +	unsigned char *map, *iter;
> +	struct sigaction new, old;
> +	const unsigned long hugepagesize_kb = hugepage_size / 1024;
> +	unsigned long nr_hugepages_before = 0;
> +	unsigned long nr_hugepages_after = 0;
> +	unsigned long nodemask = 1UL << 0;
> +	unsigned long len = hugepage_size * 4;
> +	int ret;
> +
> +	if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_before) != 0) {
> +		close(fd);
> +		ksft_exit_fail_msg("Failed to read nr_hugepages\n");
> +	}
> +	ksft_print_msg("NR hugepages before MADV_HWPOISON is %ld\n", nr_hugepages_before);
> +
> +	if (ftruncate(fd, len) < 0)
> +		ksft_exit_fail_perror("Failed to ftruncate");
> +
> +	ksft_print_msg("Allocated %#lx bytes to HugeTLB file\n", len);
> +
> +	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> +	if (map == MAP_FAILED)
> +		ksft_exit_fail_msg("Failed to mmap");
> +
> +	ksft_print_msg("Created HugeTLB mapping: %p\n", map);
> +
> +	ret = mbind(map, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
> +		    MPOL_MF_STRICT | MPOL_MF_MOVE);
> +	if (ret < 0) {
> +		perror("mbind");
> +		ksft_exit_fail_msg("Failed to bind to node\n");
> +	}
> +
> +	memset(map, HUGETLB_FILL, len);
> +	ksft_print_msg("Memset every byte to 0xab\n");
> +
> +	new.sa_sigaction = &sigbus_handler;
> +	new.sa_flags = SA_SIGINFO;
> +	if (sigaction(SIGBUS, &new, &old) < 0)
> +		ksft_exit_fail_msg("Failed to setup SIGBUS handler");
> +
> +	ksft_print_msg("Setup SIGBUS handler successfully\n");
> +
> +	test_hwpoison_multiple_pages(map, hugepage_size);
> +
> +	/*
> +	 * Since MADV_HWPOISON doesn't corrupt the memory in hardware, and
> +	 * MFD_MF_KEEP_UE_MAPPED keeps the hugepage mapped, every byte should
> +	 * remain accessible and hold original data.
> +	 */
> +	expecting_sigbus = false;
> +	for (iter = map; iter < map + len; ++iter) {
> +		if (*iter != HUGETLB_FILL) {
> +			ksft_print_msg("At addr=%p: got=%#x, expected=%#x\n",
> +				       iter, *iter, HUGETLB_FILL);
> +			ksft_test_result_fail("Memory content corrupted\n");
> +			break;
> +		}
> +	}
> +	ksft_print_msg("Memory content all valid\n");
> +
> +	if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_after) != 0) {
> +		close(fd);
> +		ksft_exit_fail_msg("Failed to read nr_hugepages\n");
> +	}
> +
> +	/*
> +	 * After MADV_HWPOISON, hugepage should still be in HugeTLB pool.
> +	 */
> +	ksft_print_msg("NR hugepages after MADV_HWPOISON is %ld\n", nr_hugepages_after);
> +	if (nr_hugepages_before != nr_hugepages_after)
> +		ksft_test_result_fail("NR hugepages reduced by %ld after MADV_HWPOISON\n",
> +				      nr_hugepages_before - nr_hugepages_after);
> +
> +	/* End of the lifetime of the created HugeTLB memfd. */
> +	if (ftruncate(fd, 0) < 0)
> +		ksft_exit_fail_perror("Failed to ftruncate to 0");
> +	munmap(map, len);
> +	close(fd);
> +
> +	/*
> +	 * After freed by userspace, MADV_HWPOISON-ed hugepage should be
> +	 * dissolved into raw pages and removed from HugeTLB pool.
> +	 */
> +	if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_after) != 0) {
> +		close(fd);
> +		ksft_exit_fail_msg("Failed to read nr_hugepages\n");
> +	}
> +	ksft_print_msg("NR hugepages after closure is %ld\n", nr_hugepages_after);
> +	if (nr_hugepages_before != nr_hugepages_after + 1)
> +		ksft_test_result_fail("NR hugepages is not reduced after memfd closure\n");
> +
> +	ksft_test_result_pass("All done\n");
> +}
> +
> +static unsigned long parse_hugepage_size(char *argv)
> +{
> +	if (strncasecmp(argv, HUGETLB_1GB_STR, strlen(HUGETLB_1GB_STR)) == 0)
> +		return BYTE_LENTH_IN_1G;
> +
> +	if (strncasecmp(argv, HUGETLB_2MB_STR, strlen(HUGETLB_2MB_STR)) == 0)
> +		return BYTE_LENTH_IN_2M;
> +
> +	ksft_print_msg("Please provide valid hugepage_size: 1G or 2M\n");
> +	assert(false);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	int fd;
> +	struct statfs file_stat;
> +	unsigned long hugepage_size;
> +
> +	if (argc != 2) {
> +		ksft_print_msg("Usage: %s <hugepage_size=1G|2M>\n", argv[0]);
> +		return -EINVAL;
> +	}
> +
> +	ksft_print_header();
> +	ksft_set_plan(1);
> +
> +	hugepage_size = parse_hugepage_size(argv[1]);
> +	fd = create_hugetlbfs_file(&file_stat, hugepage_size);
> +	test_main(fd, hugepage_size);
> +
> +	ksft_finished();
> +}



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 3/3] Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED
  2025-11-16  1:32 ` [PATCH v2 3/3] Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED Jiaqi Yan
@ 2025-12-03  4:18   ` jane.chu
  0 siblings, 0 replies; 10+ messages in thread
From: jane.chu @ 2025-12-03  4:18 UTC (permalink / raw)
  To: Jiaqi Yan, nao.horiguchi, linmiaohe, william.roche, harry.yoo
  Cc: tony.luck, wangkefeng.wang, willy, akpm, osalvador, rientjes,
	duenwen, jthoughton, jgg, ankita, peterx, sidhartha.kumar, ziy,
	david, dave.hansen, muchun.song, linux-mm, linux-kernel,
	linux-fsdevel


On 11/15/2025 5:32 PM, Jiaqi Yan wrote:
> Document its motivation, userspace API, behaviors, and limitations.
> 
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
>   Documentation/userspace-api/index.rst         |  1 +
>   .../userspace-api/mfd_mfr_policy.rst          | 60 +++++++++++++++++++
>   2 files changed, 61 insertions(+)
>   create mode 100644 Documentation/userspace-api/mfd_mfr_policy.rst
> 
> diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> index b8c73be4fb112..d8c6977d9e67a 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -67,6 +67,7 @@ Everything else
>      futex2
>      perf_ring_buffer
>      ntsync
> +   mfd_mfr_policy
>   
>   .. only::  subproject and html
>   
> diff --git a/Documentation/userspace-api/mfd_mfr_policy.rst b/Documentation/userspace-api/mfd_mfr_policy.rst
> new file mode 100644
> index 0000000000000..c5a25df39791a
> --- /dev/null
> +++ b/Documentation/userspace-api/mfd_mfr_policy.rst
> @@ -0,0 +1,60 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==================================================
> +Userspace Memory Failure Recovery Policy via memfd
> +==================================================
> +
> +:Author:
> +    Jiaqi Yan <jiaqiyan@google.com>
> +
> +
> +Motivation
> +==========
> +
> +When a userspace process is able to recover from memory failures (MF)
> +caused by uncorrected memory error (UE) in the DIMM, especially when it is
> +able to avoid consuming known UEs, keeping the memory page mapped and
> +accessible is benifical to the owning process for a couple of reasons:
> +
> +- The memory pages affected by UE have a large smallest granularity, for
> +  example 1G hugepage, but the actual corrupted amount of the page is only
> +  several cachlines. Losing the entire hugepage of data is unacceptable to
> +  the application.
> +
> +- In addition to keeping the data accessible, the application still wants
> +  to access with a large page size for the fastest virtual-to-physical
> +  translations.
> +
> +Memory failure recovery for 1G or larger HugeTLB is a good example. With
> +memfd userspace process can control whether the kernel hard offlines its
> +hugepages that backs the in-RAM file created by memfd.
> +
> +
> +User API
> +========
> +
> +``int memfd_create(const char *name, unsigned int flags)``
> +
> +``MFD_MF_KEEP_UE_MAPPED``
> +
> +	When ``MFD_MF_KEEP_UE_MAPPED`` bit is set in ``flags``, MF recovery
> +	in the kernel does not hard offline memory due to UE until the
> +	returned ``memfd`` is released. IOW, the HWPoison-ed memory remains
> +	accessible via the returned ``memfd`` or the memory mapping created
> +	with the returned ``memfd``. Note the affected memory will be
> +	immediately isolated and prevented from future use once the memfd
> +	is closed. By default ``MFD_MF_KEEP_UE_MAPPED`` is not set, and
> +	kernel hard offlines memory having UEs.
> +
> +Notes about the behavior and limitations
> +
> +- Even if the page affected by the UE is kept, a portion of the (huge)page
> +  is already lost due to hardware corruption, and the size of that portion
> +  is the smallest page size that the kernel uses to manage memory on the
> +  architecture, i.e. PAGE_SIZE. Accessing a virtual address within any of
> +  these parts results in a SIGBUS; accessing virtual addresses outside
> +  these parts is fine until they are corrupted by a new memory error.
> +
> +- ``MFD_MF_KEEP_UE_MAPPED`` currently only works for HugeTLB, so
> +  ``MFD_HUGETLB`` must also be set when setting ``MFD_MF_KEEP_UE_MAPPED``.
> +  Otherwise ``memfd_create`` returns EINVAL.
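
As a rough, illustrative sketch (not part of this series), the opt-in from
userspace could look like the program below. It assumes the
MFD_MF_KEEP_UE_MAPPED value proposed in this patchset (0x0020U), which may
not yet be present in installed uapi headers, and an arbitrary 1G hugepage
size (the system must have 1G hugepages reserved):

#define _GNU_SOURCE
#include <linux/memfd.h>	/* MFD_HUGETLB, MFD_HUGE_1GB */
#include <sys/mman.h>		/* memfd_create(), mmap() */
#include <stdio.h>
#include <unistd.h>

#ifndef MFD_MF_KEEP_UE_MAPPED
#define MFD_MF_KEEP_UE_MAPPED 0x0020U	/* value proposed in this series */
#endif

int main(void)
{
	const size_t len = 1UL << 30;	/* one 1G hugepage */
	int fd;
	void *mem;

	/* MFD_HUGETLB is mandatory with MFD_MF_KEEP_UE_MAPPED (else EINVAL). */
	fd = memfd_create("guest-ram",
			  MFD_HUGETLB | MFD_HUGE_1GB | MFD_MF_KEEP_UE_MAPPED);
	if (fd < 0) {
		perror("memfd_create");
		return 1;
	}
	if (ftruncate(fd, len) < 0) {
		perror("ftruncate");
		return 1;
	}

	mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (mem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * With the policy enabled, a UE in this range no longer hard
	 * offlines the whole 1G folio; the deferred offline happens only
	 * when the range is truncated or the memfd is closed.
	 */
	munmap(mem, len);
	close(fd);
	return 0;
}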

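The SIGBUS side could be consumed as in the rough sketch below (also not
part of this series): with MFD_MF_KEEP_UE_MAPPED the reported si_addr_lsb
is PAGE_SHIFT rather than the hugepage order, so a handler can confine the
loss to a single base page. The siginfo fields and BUS_MCEERR_* codes are
the existing kernel ABI; the recovery action here is only a placeholder.

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <unistd.h>

static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;

	if (info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO) {
		/* Poisoned range is [start, start + len); len is PAGE_SIZE here. */
		unsigned long len = 1UL << info->si_addr_lsb;
		uintptr_t start = (uintptr_t)info->si_addr & ~((uintptr_t)len - 1);

		/*
		 * Placeholder recovery: a VMM would record this range, inject
		 * a guest MCE, and avoid touching it again; the rest of the
		 * hugepage remains mapped and usable.
		 */
		(void)start;
	}

	/*
	 * Returning from a BUS_MCEERR_AR handler would re-execute the
	 * faulting access; a real handler must fix up or bail out.
	 */
	_exit(1);
}

int main(void)
{
	struct sigaction act = { 0 };

	act.sa_sigaction = sigbus_handler;
	act.sa_flags = SA_SIGINFO;
	sigaction(SIGBUS, &act, NULL);

	/* ... create the MFD_MF_KEEP_UE_MAPPED memfd and run the workload ... */
	return 0;
}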
Looks okay.

Reviewed-by: Jane Chu <jane.chu@oracle.com>

thanks,
-jane


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy
  2025-12-03  4:11   ` jane.chu
@ 2025-12-03 19:41     ` Jiaqi Yan
  0 siblings, 0 replies; 10+ messages in thread
From: Jiaqi Yan @ 2025-12-03 19:41 UTC (permalink / raw)
  To: jane.chu, william.roche
  Cc: nao.horiguchi, linmiaohe, harry.yoo, tony.luck, wangkefeng.wang,
	willy, akpm, osalvador, rientjes, duenwen, jthoughton, jgg,
	ankita, peterx, sidhartha.kumar, ziy, david, dave.hansen,
	muchun.song, linux-mm, linux-kernel, linux-fsdevel

On Tue, Dec 2, 2025 at 8:11 PM <jane.chu@oracle.com> wrote:
>
> Hi, Jiaqi,
>
> Thanks for the work, my comments inline.

Thank you both for the thorough and helpful reviews, Jane and William!

I plan to first rework "[PATCH v1 0/2] Only free healthy pages in
high-order HWPoison folio", given it is key to the concerns you raised
about this patch. Then I will address your comments on code
quality/readability for this patch.

>
> On 11/15/2025 5:32 PM, Jiaqi Yan wrote:
> > Sometimes immediately hard offlining a large chunk of contiguous memory
> > having uncorrected memory errors (UE) may not be the best option.
> > Cloud providers usually serve capacity- and performance-critical guest
> > memory with 1G HugeTLB hugepages, as this significantly reduces the
> > overhead associated with managing page tables and TLB misses. However,
> > for today's HugeTLB system, once a byte of memory in a hugepage is
> > hardware corrupted, the kernel discards the whole hugepage, including
> > the healthy portion. Customer workload running in the VM can hardly
> > recover from such a great loss of memory.
> >
> > Therefore keeping or discarding a large chunk of contiguous memory
> > owned by userspace (particularly to serve guest memory) due to
> > recoverable UE may better be controlled by userspace process
> > that owns the memory, e.g. VMM in Cloud environment.
> >
> > Introduce a memfd-based userspace memory failure recovery (MFR) policy,
> > MFD_MF_KEEP_UE_MAPPED. It is intended to eventually support other memfd
> > types, but the current implementation only covers HugeTLB.
> >
> > For any hugepage associated with the MFD_MF_KEEP_UE_MAPPED enabled memfd,
> > whenever it runs into a UE, MFR doesn't hard offline the HWPoison-ed
> > huge folio. IOW the HWPoison-ed memory remains accessible via the memory
> > mapping created with that memfd. MFR still sends SIGBUS to the process
> > as required. MFR also still maintains HWPoison metadata for the hugepage
> > having the UE.
> >
> > A HWPoison-ed hugepage will be immediately isolated and prevented from
> > future allocation once userspace truncates it via the memfd, or the
> > owning memfd is closed.
> >
> > By default MFD_MF_KEEP_UE_MAPPED is not set, and MFR hard offlines
> > hugepages having UEs.
> >
> > Tested with selftest in the follow-up commit.
> >
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > Tested-by: William Roche <william.roche@oracle.com>
> > ---
> >   fs/hugetlbfs/inode.c       |  25 +++++++-
> >   include/linux/hugetlb.h    |   7 +++
> >   include/linux/pagemap.h    |  24 +++++++
> >   include/uapi/linux/memfd.h |   6 ++
> >   mm/hugetlb.c               |  20 +++++-
> >   mm/memfd.c                 |  15 ++++-
> >   mm/memory-failure.c        | 124 +++++++++++++++++++++++++++++++++----
> >   7 files changed, 202 insertions(+), 19 deletions(-)
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index f42548ee9083c..f8a5aa091d51d 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -532,6 +532,18 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
> >       }
> >
> >       folio_unlock(folio);
> > +
> > +     /*
> > +      * There may be pending HWPoison-ed folios when a memfd is being
> > +      * removed or part of it is being truncated.
> > +      *
> > +      * HugeTLBFS' error_remove_folio keeps the HWPoison-ed folios in
> > +      * the page cache until mm wants to drop the folio at the end of
> > +      * the filemap's lifetime. At this point, if memory failure was
> > +      * delayed by MFD_MF_KEEP_UE_MAPPED in the past, deal with it now.
> > +      */
> > +     filemap_offline_hwpoison_folio(mapping, folio);
> > +
> >       return ret;
> >   }
>
> Looks okay.
>
> >
> > @@ -563,13 +575,13 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> >       const pgoff_t end = lend >> PAGE_SHIFT;
> >       struct folio_batch fbatch;
> >       pgoff_t next, index;
> > -     int i, freed = 0;
> > +     int i, j, freed = 0;
> >       bool truncate_op = (lend == LLONG_MAX);
> >
> >       folio_batch_init(&fbatch);
> >       next = lstart >> PAGE_SHIFT;
> >       while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
> > -             for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> > +             for (i = 0, j = 0; i < folio_batch_count(&fbatch); ++i) {
> >                       struct folio *folio = fbatch.folios[i];
> >                       u32 hash = 0;
> >
> > @@ -584,8 +596,17 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> >                                                       index, truncate_op))
> >                               freed++;
> >
> > +                     /*
> > +                      * Skip HWPoison-ed hugepages, which should no
> > +                      * longer be hugetlb if successfully dissolved.
> > +                      */
> > +                     if (folio_test_hugetlb(folio))
> > +                             fbatch.folios[j++] = folio;
> > +
> >                       mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> >               }
> > +             fbatch.nr = j;
> > +
> >               folio_batch_release(&fbatch);
> >               cond_resched();
> >       }
>
> Looks okay.
>
> But this reminds me that for now remove_inode_single_folio() has no path
> to return 'false' anyway, and if it ever does, remove_inode_hugepages()
> will be broken since it has no logic to account for folios that failed
> to be removed.  Would you mind making remove_inode_single_folio() a void
> function in order to avoid the confusion?
>
>
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 8e63e46b8e1f0..b7733ef5ee917 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -871,10 +871,17 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn,
> >
> >   #ifdef CONFIG_MEMORY_FAILURE
> >   extern void folio_clear_hugetlb_hwpoison(struct folio *folio);
> > +extern bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
> > +                                             struct address_space *mapping);
> >   #else
> >   static inline void folio_clear_hugetlb_hwpoison(struct folio *folio)
> >   {
> >   }
> > +static inline bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio
> > +                                                    struct address_space *mapping)
> > +{
> > +     return false;
> > +}
> >   #endif
>
> It appears that hugetlb_should_keep_hwpoison_mapped() is only called
> within mm/memory-failure.c.  How about moving it there?
>
> >
> >   #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index 09b581c1d878d..9ad511aacde7c 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -213,6 +213,8 @@ enum mapping_flags {
> >       AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM = 9,
> >       AS_KERNEL_FILE = 10,    /* mapping for a fake kernel file that shouldn't
> >                                  account usage to user cgroups */
> > +     /* For MFD_MF_KEEP_UE_MAPPED. */
> > +     AS_MF_KEEP_UE_MAPPED = 11,
> >       /* Bits 16-25 are used for FOLIO_ORDER */
> >       AS_FOLIO_ORDER_BITS = 5,
> >       AS_FOLIO_ORDER_MIN = 16,
> > @@ -348,6 +350,16 @@ static inline bool mapping_writeback_may_deadlock_on_reclaim(const struct addres
> >       return test_bit(AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM, &mapping->flags);
> >   }
> >
> Okay.
>
> > +static inline bool mapping_mf_keep_ue_mapped(const struct address_space *mapping)
> > +{
> > +     return test_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
> > +}
> > +
> > +static inline void mapping_set_mf_keep_ue_mapped(struct address_space *mapping)
> > +{
> > +     set_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
> > +}
> > +
> >   static inline gfp_t mapping_gfp_mask(const struct address_space *mapping)
> >   {
> >       return mapping->gfp_mask;
> > @@ -1274,6 +1286,18 @@ void replace_page_cache_folio(struct folio *old, struct folio *new);
> >   void delete_from_page_cache_batch(struct address_space *mapping,
> >                                 struct folio_batch *fbatch);
> >   bool filemap_release_folio(struct folio *folio, gfp_t gfp);
> > +#ifdef CONFIG_MEMORY_FAILURE
> > +/*
> > + * Provided by memory failure to offline HWPoison-ed folio managed by memfd.
> > + */
> > +void filemap_offline_hwpoison_folio(struct address_space *mapping,
> > +                                 struct folio *folio);
> > +#else
> > +void filemap_offline_hwpoison_folio(struct address_space *mapping,
> > +                                 struct folio *folio)
> > +{
> > +}
> > +#endif
>
> Okay.
>
> >   loff_t mapping_seek_hole_data(struct address_space *, loff_t start, loff_t end,
> >               int whence);
> >
> > diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> > index 273a4e15dfcff..d9875da551b7f 100644
> > --- a/include/uapi/linux/memfd.h
> > +++ b/include/uapi/linux/memfd.h
> > @@ -12,6 +12,12 @@
> >   #define MFD_NOEXEC_SEAL             0x0008U
> >   /* executable */
> >   #define MFD_EXEC            0x0010U
> > +/*
> > + * Keep owned folios mapped when an uncorrectable memory error (UE) causes
> > + * memory failure (MF) within the folio. Only at the end of the mapping
> > + * will its HWPoison-ed folios be dealt with.
> > + */
> > +#define MFD_MF_KEEP_UE_MAPPED        0x0020U
> >
> >   /*
> >    * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 0455119716ec0..dd3bc0b75e059 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6415,6 +6415,18 @@ static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm, unsigned
> >       return same;
> >   }
> >
> > +bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
> > +                                      struct address_space *mapping)
> > +{
> > +     if (WARN_ON_ONCE(!folio_test_hugetlb(folio)))
> > +             return false;
> > +
> > +     if (!mapping)
> > +             return false;
> > +
> > +     return mapping_mf_keep_ue_mapped(mapping);
> > +}
> > +
>
> Okay.
>
> >   static vm_fault_t hugetlb_no_page(struct address_space *mapping,
> >                       struct vm_fault *vmf)
> >   {
> > @@ -6537,9 +6549,11 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
> >                * So we need to block hugepage fault by PG_hwpoison bit check.
> >                */
> >               if (unlikely(folio_test_hwpoison(folio))) {
> > -                     ret = VM_FAULT_HWPOISON_LARGE |
> > -                             VM_FAULT_SET_HINDEX(hstate_index(h));
> > -                     goto backout_unlocked;
> > +                     if (!mapping_mf_keep_ue_mapped(mapping)) {
> > +                             ret = VM_FAULT_HWPOISON_LARGE |
> > +                                   VM_FAULT_SET_HINDEX(hstate_index(h));
> > +                             goto backout_unlocked;
> > +                     }
> >               }
> >
>
> Looks okay, but I am curious about Miaohe's and others' take.
>
> Allowing a known poisoned hugetlb page to be faulted in is for the sake
> of capacity, versus a SIGBUS from the MF handler indicating a disruption
> and loss of both data and capacity.
> No strong opinion here, just wondering if there is any merit in limiting
> the scope to the MF handler only.
>
> >               /* Check for page in userfault range. */
> > diff --git a/mm/memfd.c b/mm/memfd.c
> > index 1d109c1acf211..bfdde4cf90500 100644
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -313,7 +313,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
> >   #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> >   #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
> >
> > -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
> > +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> > +                    MFD_NOEXEC_SEAL | MFD_EXEC | MFD_MF_KEEP_UE_MAPPED)
> >
> >   static int check_sysctl_memfd_noexec(unsigned int *flags)
> >   {
> > @@ -387,6 +388,8 @@ static int sanitize_flags(unsigned int *flags_ptr)
> >       if (!(flags & MFD_HUGETLB)) {
> >               if (flags & ~MFD_ALL_FLAGS)
> >                       return -EINVAL;
> > +             if (flags & MFD_MF_KEEP_UE_MAPPED)
> > +                     return -EINVAL;
> >       } else {
> >               /* Allow huge page size encoding in flags. */
> >               if (flags & ~(MFD_ALL_FLAGS |
> > @@ -447,6 +450,16 @@ static struct file *alloc_file(const char *name, unsigned int flags)
> >       file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> >       file->f_flags |= O_LARGEFILE;
> >
> > +     /*
> > +      * MFD_MF_KEEP_UE_MAPPED can only be specified in memfd_create; no API
> > +      * to update it once memfd is created. MFD_MF_KEEP_UE_MAPPED is not
> > +      * seal-able.
> > +      *
> > +      * For now MFD_MF_KEEP_UE_MAPPED is only supported by HugeTLBFS.
> > +      */
> > +     if (flags & (MFD_HUGETLB | MFD_MF_KEEP_UE_MAPPED))
> > +             mapping_set_mf_keep_ue_mapped(file->f_mapping);
> > +
> >       if (flags & MFD_NOEXEC_SEAL) {
> >               struct inode *inode = file_inode(file);
> >
>
> Okay.
>
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 3edebb0cda30b..c5e3e28872797 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -373,11 +373,13 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
> >    * Schedule a process for later kill.
> >    * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
> >    */
> > -static void __add_to_kill(struct task_struct *tsk, const struct page *p,
> > +static void __add_to_kill(struct task_struct *tsk, struct page *p,
> >                         struct vm_area_struct *vma, struct list_head *to_kill,
> >                         unsigned long addr)
> >   {
> >       struct to_kill *tk;
> > +     struct folio *folio;
> > +     struct address_space *mapping;
> >
> >       tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
> >       if (!tk) {
> > @@ -388,8 +390,19 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
> >       tk->addr = addr;
> >       if (is_zone_device_page(p))
> >               tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
> > -     else
> > -             tk->size_shift = folio_shift(page_folio(p));
> > +     else {
> > +             folio = page_folio(p);
> > +             mapping = folio_mapping(folio);
> > +             if (mapping && mapping_mf_keep_ue_mapped(mapping))
> > +                     /*
> > +                      * Let userspace know the radius of HWPoison is
> > +                      * the size of raw page; accessing other pages
> > +                      * inside the folio is still ok.
> > +                      */
> > +                     tk->size_shift = PAGE_SHIFT;
> > +             else
> > +                     tk->size_shift = folio_shift(folio);
> > +     }
> >
> >       /*
> >        * Send SIGKILL if "tk->addr == -EFAULT". Also, as
> > @@ -414,7 +427,7 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
> >       list_add_tail(&tk->nd, to_kill);
> >   }
> >
> > -static void add_to_kill_anon_file(struct task_struct *tsk, const struct page *p,
> > +static void add_to_kill_anon_file(struct task_struct *tsk, struct page *p,
> >               struct vm_area_struct *vma, struct list_head *to_kill,
> >               unsigned long addr)
> >   {
> > @@ -535,7 +548,7 @@ struct task_struct *task_early_kill(struct task_struct *tsk, int force_early)
> >    * Collect processes when the error hit an anonymous page.
> >    */
> >   static void collect_procs_anon(const struct folio *folio,
> > -             const struct page *page, struct list_head *to_kill,
> > +             struct page *page, struct list_head *to_kill,
> >               int force_early)
> >   {
> >       struct task_struct *tsk;
> > @@ -573,7 +586,7 @@ static void collect_procs_anon(const struct folio *folio,
> >    * Collect processes when the error hit a file mapped page.
> >    */
> >   static void collect_procs_file(const struct folio *folio,
> > -             const struct page *page, struct list_head *to_kill,
> > +             struct page *page, struct list_head *to_kill,
> >               int force_early)
> >   {
> >       struct vm_area_struct *vma;
> > @@ -655,7 +668,7 @@ static void collect_procs_fsdax(const struct page *page,
> >   /*
> >    * Collect the processes who have the corrupted page mapped to kill.
> >    */
> > -static void collect_procs(const struct folio *folio, const struct page *page,
> > +static void collect_procs(const struct folio *folio, struct page *page,
> >               struct list_head *tokill, int force_early)
> >   {
> >       if (!folio->mapping)
> > @@ -1173,6 +1186,13 @@ static int me_huge_page(struct page_state *ps, struct page *p)
> >               }
> >       }
> >
> > +     /*
> > +      * MF still needs to hold a refcount for the deferred actions in
> > +      * filemap_offline_hwpoison_folio.
> > +      */
> > +     if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> > +             return res;
> > +
>
> Okay.
>
> >       if (has_extra_refcount(ps, p, extra_pins))
> >               res = MF_FAILED;
> >
> > @@ -1569,6 +1589,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
> >   {
> >       LIST_HEAD(tokill);
> >       bool unmap_success;
> > +     bool keep_mapped;
> >       int forcekill;
> >       bool mlocked = folio_test_mlocked(folio);
> >
> > @@ -1596,8 +1617,12 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
> >        */
> >       collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
> >
> > -     unmap_success = !unmap_poisoned_folio(folio, pfn, flags & MF_MUST_KILL);
> > -     if (!unmap_success)
> > +     keep_mapped = hugetlb_should_keep_hwpoison_mapped(folio, folio->mapping);
> > +     if (!keep_mapped)
> > +             unmap_poisoned_folio(folio, pfn, flags & MF_MUST_KILL);
> > +
> > +     unmap_success = !folio_mapped(folio);
> > +     if (!keep_mapped && !unmap_success)
> >               pr_err("%#lx: failed to unmap page (folio mapcount=%d)\n",
> >                      pfn, folio_mapcount(folio));
> >
> > @@ -1622,7 +1647,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
> >                   !unmap_success;
> >       kill_procs(&tokill, forcekill, pfn, flags);
> >
> > -     return unmap_success;
> > +     return unmap_success || keep_mapped;
> >   }
>
> Okay.
>
> >
> >   static int identify_page_state(unsigned long pfn, struct page *p,
> > @@ -1862,6 +1887,13 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
> >       unsigned long count = 0;
> >
> >       head = llist_del_all(raw_hwp_list_head(folio));
> > +     /*
> > +      * If filemap_offline_hwpoison_folio_hugetlb is handling this folio,
> > +      * it has already taken the head off the llist.
> > +      */
> > +     if (head == NULL)
> > +             return 0;
> > +
> >       llist_for_each_entry_safe(p, next, head, node) {
> >               if (move_flag)
> >                       SetPageHWPoison(p->page);
> > @@ -1878,7 +1910,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
> >       struct llist_head *head;
> >       struct raw_hwp_page *raw_hwp;
> >       struct raw_hwp_page *p;
> > -     int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0;
> > +     struct address_space *mapping = folio->mapping;
> > +     bool has_hwpoison = folio_test_set_hwpoison(folio);
> >
> >       /*
> >        * Once the hwpoison hugepage has lost reliable raw error info,
> > @@ -1897,8 +1930,15 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
> >       if (raw_hwp) {
> >               raw_hwp->page = page;
> >               llist_add(&raw_hwp->node, head);
> > +             if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> > +                     /*
> > +                      * A new raw HWPoison page. Don't return HWPOISON.
> > +                      * Error event will be counted in action_result().
> > +                      */
> > +                     return 0;
> > +
> >               /* the first error event will be counted in action_result(). */
> > -             if (ret)
> > +             if (has_hwpoison)
> >                       num_poisoned_pages_inc(page_to_pfn(page));
> >       } else {
> >               /*
> > @@ -1913,7 +1953,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
> >                */
> >               __folio_free_raw_hwp(folio, false);
> >       }
> > -     return ret;
> > +
> > +     return has_hwpoison ? -EHWPOISON : 0;
> >   }
>
> Okay.
>
> >
> >   static unsigned long folio_free_raw_hwp(struct folio *folio, bool move_flag)
> > @@ -2002,6 +2043,63 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
> >       return ret;
> >   }
> >
> > +static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
> > +{
> > +     int ret;
> > +     struct llist_node *head;
> > +     struct raw_hwp_page *curr, *next;
> > +     struct page *page;
> > +     unsigned long pfn;
> > +
> > +     /*
> > +      * Since folio is still in the folio_batch, drop the refcount
> > +      * elevated by filemap_get_folios.
> > +      */
> > +     folio_put_refs(folio, 1);
> > +     head = llist_del_all(raw_hwp_list_head(folio));
> > +
> > +     /*
> > +      * Release refcounts held by try_memory_failure_hugetlb, one per
> > +      * HWPoison-ed page in the raw hwp list.
> > +      */
> > +     llist_for_each_entry(curr, head, node) {
> > +             SetPageHWPoison(curr->page);
> > +             folio_put(folio);
> > +     }
> > +
> > +     /* Refcount now should be zero and ready to dissolve folio. */
> > +     ret = dissolve_free_hugetlb_folio(folio);
> > +     if (ret) {
> > +             pr_err("failed to dissolve hugetlb folio: %d\n", ret);
> > +             return;
> > +     }
> > +
> > +     llist_for_each_entry_safe(curr, next, head, node) {
> > +             page = curr->page;
> > +             pfn = page_to_pfn(page);
> > +             drain_all_pages(page_zone(page));
> > +             if (!take_page_off_buddy(page))
> > +                     pr_err("%#lx: unable to take off buddy allocator\n", pfn);
> > +
> > +             page_ref_inc(page);
> > +             kfree(curr);
> > +             pr_info("%#lx: pending hard offline completed\n", pfn);
> > +     }
> > +}
> > +
> > +void filemap_offline_hwpoison_folio(struct address_space *mapping,
> > +                                 struct folio *folio)
> > +{
> > +     WARN_ON_ONCE(!mapping);
> > +
> > +     if (!folio_test_hwpoison(folio))
> > +             return;
> > +
> > +     /* Pending MFR currently only exists for hugetlb. */
> > +     if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
> > +             filemap_offline_hwpoison_folio_hugetlb(folio);
> > +}
> > +
> >   /*
> >    * Taking refcount of hugetlb pages needs extra care about race conditions
> >    * with basic operations like hugepage allocation/free/demotion.
>
>
> Looks good.
>
> thanks,
> -jane


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-12-03 19:41 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-16  1:32 [PATCH v2 0/3] memfd-based Userspace MFR Policy for HugeTLB Jiaqi Yan
2025-11-16  1:32 ` [PATCH v2 1/3] mm: memfd/hugetlb: introduce memfd-based userspace MFR policy Jiaqi Yan
2025-11-25 21:47   ` William Roche
2025-11-25 22:04   ` William Roche
2025-12-03  4:11   ` jane.chu
2025-12-03 19:41     ` Jiaqi Yan
2025-11-16  1:32 ` [PATCH v2 2/3] selftests/mm: test userspace MFR for HugeTLB hugepage Jiaqi Yan
2025-12-03  4:14   ` jane.chu
2025-11-16  1:32 ` [PATCH v2 3/3] Documentation: add documentation for MFD_MF_KEEP_UE_MAPPED Jiaqi Yan
2025-12-03  4:18   ` jane.chu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox