* [PATCH RFC 0/1] mm/ksm: Add recovery mechanism for memory failures
@ 2025-10-09  7:00 Longlong Xia
  2025-10-09  7:00 ` [PATCH RFC 1/1] " Longlong Xia
  2025-10-09 18:57 ` [PATCH RFC 0/1] " David Hildenbrand
  0 siblings, 2 replies; 18+ messages in thread

From: Longlong Xia @ 2025-10-09 7:00 UTC (permalink / raw)
To: linmiaohe, nao.horiguchi
Cc: akpm, david, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia

From: Longlong Xia <xialonglong@kylinos.cn>

When a hardware memory error occurs on a KSM page, the current
behavior is to kill all processes mapping that page. This can
be overly aggressive when KSM has multiple duplicate pages in
a chain where other duplicates are still healthy.

This patch introduces a recovery mechanism that attempts to migrate
mappings from the failing page to another healthy duplicate within
the same chain before resorting to killing processes.

The recovery process works as follows:
1. When a memory failure is detected on a KSM page, identify whether the
   failing node is part of a chain (has duplicates). (Perhaps add a
   dup_head field to struct ksm_stable_node to record the chain head,
   saving a search of the whole stable tree, or find the head some other
   way; see the sketch after this message.)
2. Search for another healthy duplicate page within the same chain.
3. For each process mapping the failing page:
   - Update the PTE to point to the healthy duplicate page (perhaps
     reuse replace_page, or split replace_page into smaller functions
     and share the common part).
   - Migrate the rmap_item to the new stable node.
4. If all migrations succeed, remove the failing node from the chain.
5. Only kill processes if recovery is impossible or fails.

The original idea came from Naoya Horiguchi.
https://lore.kernel.org/all/20230331054243.GB1435482@hori.linux.bs1.fc.nec.co.jp/

I tested it with /sys/kernel/debug/hwpoison/corrupt-pfn in qemu-x86_64.
Here are my test steps and results:

1. Allocate 1024 pages with the same content and enable KSM to merge
   them. After merging (each shared phy_addr is printed only once):
   a. virtual addr = 0x7e4c68a00000 phy_addr = 0x10e802000
   b. virtual addr = 0x7e4c68b2c000 phy_addr = 0x10e902000
   c. virtual addr = 0x7e4c68c26000 phy_addr = 0x10ea02000
   d. virtual addr = 0x7e4c68d20000 phy_addr = 0x10eb02000
2. echo 0x10e802 > /sys/kernel/debug/hwpoison/corrupt-pfn
   a. virtual addr = 0x7e4c68a00000 phy_addr = 0x10eb02000
   b. virtual addr = 0x7e4c68b2c000 phy_addr = 0x10e902000
   c. virtual addr = 0x7e4c68c26000 phy_addr = 0x10ea02000
   d. virtual addr = 0x7e4c68d20000 phy_addr = 0x10eb02000 (shared with a)
3. echo 0x10eb02 > /sys/kernel/debug/hwpoison/corrupt-pfn
   a. virtual addr = 0x7e4c68a00000 phy_addr = 0x10ea02000
   b. virtual addr = 0x7e4c68b2c000 phy_addr = 0x10e902000
   c. virtual addr = 0x7e4c68c26000 phy_addr = 0x10ea02000 (shared with a)
   d. virtual addr = 0x7e4c68c58000 phy_addr = 0x10ea02000 (shared with a)
4. echo 0x10ea02 > /sys/kernel/debug/hwpoison/corrupt-pfn
   a. virtual addr = 0x7e4c68a00000 phy_addr = 0x10e902000
   b. virtual addr = 0x7e4c68a32000 phy_addr = 0x10e902000 (shared with a)
   c. virtual addr = 0x7e4c68a64000 phy_addr = 0x10e902000 (shared with a)
   d. virtual addr = 0x7e4c68a96000 phy_addr = 0x10e902000 (shared with a)
5. echo 0x10e902 > /sys/kernel/debug/hwpoison/corrupt-pfn
   MCE: Killing ksm_test:531 due to hardware memory corruption fault at 7e4c68a00000

kernel log:
Injecting memory failure at pfn 0x10e802
Memory failure: 0x10e802: recovery action for dirty LRU page: Recovered
Injecting memory failure at pfn 0x10eb02
Memory failure: 0x10eb02: recovery action for dirty LRU page: Recovered
Injecting memory failure at pfn 0x10ea02
Memory failure: 0x10ea02: recovery action for dirty LRU page: Recovered
Injecting memory failure at pfn 0x10e902
Memory failure: 0x10e902: recovery action for dirty LRU page: Recovered
MCE: Killing ksm_test:531 due to hardware memory corruption fault at 7e4c68a00000

Thanks for review and comments!

Longlong Xia (1):
  mm/ksm: Add recovery mechanism for memory failures

 mm/ksm.c | 183 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 183 insertions(+)

--
2.43.0

^ permalink raw reply	[flat|nested] 18+ messages in thread
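A minimal sketch of the dup_head idea from step 1 above. The field name,
its placement in struct ksm_stable_node, and the maintenance hook are
assumptions for illustration only, not part of the posted patch:

```c
/*
 * Sketch only: cache a back-pointer from each dup node to its chain
 * head so the memory-failure path can find the head in O(1) instead
 * of walking every stable tree. "dup_head" is a hypothetical field.
 */
struct ksm_stable_node {
	/* ... existing fields ... */
	struct ksm_stable_node *dup_head;	/* valid only for dup nodes */
};

/* Would be set wherever a dup is linked into a chain. */
static void stable_node_dup_set_head(struct ksm_stable_node *dup,
				     struct ksm_stable_node *chain)
{
	dup->dup_head = chain;
}

static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
{
	if (!is_stable_node_dup(dup_node))
		return NULL;
	return dup_node->dup_head;	/* no stable-tree walk needed */
}
```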
* [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures
  2025-10-09  7:00 [PATCH RFC 0/1] mm/ksm: Add recovery mechanism for memory failures Longlong Xia
@ 2025-10-09  7:00 ` Longlong Xia
  2025-10-09 12:13   ` Lance Yang
                       ` (2 more replies)
  2025-10-09 18:57 ` [PATCH RFC 0/1] " David Hildenbrand
  1 sibling, 3 replies; 18+ messages in thread

From: Longlong Xia @ 2025-10-09 7:00 UTC (permalink / raw)
To: linmiaohe, nao.horiguchi
Cc: akpm, david, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia

From: Longlong Xia <xialonglong@kylinos.cn>

When a hardware memory error occurs on a KSM page, the current
behavior is to kill all processes mapping that page. This can
be overly aggressive when KSM has multiple duplicate pages in
a chain where other duplicates are still healthy.

This patch introduces a recovery mechanism that attempts to migrate
mappings from the failing KSM page to another healthy KSM page within
the same chain before resorting to killing processes.

The recovery process works as follows:
1. When a memory failure is detected on a KSM page, identify if the
   failing node is part of a chain (has duplicates)
2. Search for another healthy KSM page within the same chain
3. For each process mapping the failing page:
   - Update the PTE to point to the healthy KSM page
   - Migrate the rmap_item to the new stable node
4. If all migrations succeed, remove the failing node from the chain
5. Only kill processes if recovery is impossible or fails

Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
---
 mm/ksm.c | 183 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 183 insertions(+)

diff --git a/mm/ksm.c b/mm/ksm.c
index 160787bb121c..590d30cfe800 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -3084,6 +3084,183 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 }
 
 #ifdef CONFIG_MEMORY_FAILURE
+static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
+{
+	struct ksm_stable_node *stable_node, *dup;
+	struct rb_node *node;
+	int nid;
+
+	if (!is_stable_node_dup(dup_node))
+		return NULL;
+
+	for (nid = 0; nid < ksm_nr_node_ids; nid++) {
+		node = rb_first(root_stable_tree + nid);
+		for (; node; node = rb_next(node)) {
+			stable_node = rb_entry(node,
+					       struct ksm_stable_node,
+					       node);
+
+			if (!is_stable_node_chain(stable_node))
+				continue;
+
+			hlist_for_each_entry(dup, &stable_node->hlist,
+					     hlist_dup) {
+				if (dup == dup_node)
+					return stable_node;
+			}
+		}
+	}
+
+	return NULL;
+}
+
+static struct folio *
+find_target_folio(struct ksm_stable_node *failing_node, struct ksm_stable_node **target_dup)
+{
+	struct ksm_stable_node *chain_head, *dup;
+	struct hlist_node *hlist_safe;
+	struct folio *target_folio;
+
+	if (!is_stable_node_dup(failing_node))
+		return NULL;
+
+	chain_head = find_chain_head(failing_node);
+	if (!chain_head)
+		return NULL;
+
+	hlist_for_each_entry_safe(dup, hlist_safe, &chain_head->hlist, hlist_dup) {
+		if (dup == failing_node)
+			continue;
+
+		target_folio = ksm_get_folio(dup, KSM_GET_FOLIO_TRYLOCK);
+		if (target_folio) {
+			*target_dup = dup;
+			return target_folio;
+		}
+	}
+
+	return NULL;
+}
+
+static int replace_failing_page(struct vm_area_struct *vma, struct page *page,
+				struct page *kpage, unsigned long addr)
+{
+	struct folio *kfolio = page_folio(kpage);
+	struct mm_struct *mm = vma->vm_mm;
+	struct folio *folio = page_folio(page);
+	pmd_t *pmd;
+	pte_t *ptep;
+	pte_t newpte;
+	spinlock_t *ptl;
+	int err = -EFAULT;
+	struct mmu_notifier_range range;
+
+	pmd = mm_find_pmd(mm, addr);
+	if (!pmd)
+		goto out;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
+				addr + PAGE_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+
+	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	if (!ptep)
+		goto out_mn;
+
+	if (!is_zero_pfn(page_to_pfn(kpage))) {
+		folio_get(kfolio);
+		folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
+		newpte = mk_pte(kpage, vma->vm_page_prot);
+	} else {
+		newpte = pte_mkdirty(pte_mkspecial(pfn_pte(page_to_pfn(kpage), vma->vm_page_prot)));
+		ksm_map_zero_page(mm);
+		dec_mm_counter(mm, MM_ANONPAGES);
+	}
+
+	flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep)));
+	ptep_clear_flush(vma, addr, ptep);
+	set_pte_at(mm, addr, ptep, newpte);
+
+	folio_remove_rmap_pte(folio, page, vma);
+	if (!folio_mapped(folio))
+		folio_free_swap(folio);
+	folio_put(folio);
+
+	pte_unmap_unlock(ptep, ptl);
+	err = 0;
+out_mn:
+	mmu_notifier_invalidate_range_end(&range);
+out:
+	return err;
+}
+
+static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node)
+{
+	struct ksm_rmap_item *rmap_item;
+	struct hlist_node *hlist_safe;
+	struct folio *failing_folio = NULL;
+	struct folio *target_folio = NULL;
+	struct ksm_stable_node *target_dup = NULL;
+	int err;
+
+	if (!is_stable_node_dup(failing_node))
+		return false;
+
+	failing_folio = ksm_get_folio(failing_node, KSM_GET_FOLIO_NOLOCK);
+	if (!failing_folio)
+		return false;
+
+	target_folio = find_target_folio(failing_node, &target_dup);
+	if (!target_folio) {
+		folio_put(failing_folio);
+		return false;
+	}
+
+	hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
+		struct mm_struct *mm = rmap_item->mm;
+		unsigned long addr = rmap_item->address & PAGE_MASK;
+		struct vm_area_struct *vma;
+
+		mmap_read_lock(mm);
+		if (ksm_test_exit(mm)) {
+			mmap_read_unlock(mm);
+			continue;
+		}
+
+		vma = vma_lookup(mm, addr);
+		if (!vma) {
+			mmap_read_unlock(mm);
+			continue;
+		}
+
+		/* Update PTE to point to target_folio's page */
+		err = replace_failing_page(vma, &failing_folio->page,
+					   &target_folio->page, addr);
+		if (!err) {
+			hlist_del(&rmap_item->hlist);
+			rmap_item->head = target_dup;
+			hlist_add_head(&rmap_item->hlist, &target_dup->hlist);
+			target_dup->rmap_hlist_len++;
+			failing_node->rmap_hlist_len--;
+
+		}
+
+		mmap_read_unlock(mm);
+	}
+
+	folio_unlock(target_folio);
+	folio_put(target_folio);
+	folio_put(failing_folio);
+
+	if (failing_node->rmap_hlist_len == 0) {
+		__stable_node_dup_del(failing_node);
+		free_stable_node(failing_node);
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Collect processes when the error hit an ksm page.
  */
@@ -3098,6 +3275,12 @@ void collect_procs_ksm(const struct folio *folio, const struct page *page,
 	stable_node = folio_stable_node(folio);
 	if (!stable_node)
 		return;
+
+	if (ksm_recover_within_chain(stable_node)) {
+		pr_debug("recovery within chain successful, no need to kill processes\n");
+		return;
+	}
+
 	hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
 		struct anon_vma *av = rmap_item->anon_vma;

--
2.43.0

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-09 7:00 ` [PATCH RFC 1/1] " Longlong Xia @ 2025-10-09 12:13 ` Lance Yang 2025-10-11 7:52 ` Lance Yang 2025-10-11 3:25 ` Miaohe Lin 2025-10-13 20:10 ` [PATCH RFC] " Markus Elfring 2 siblings, 1 reply; 18+ messages in thread From: Lance Yang @ 2025-10-09 12:13 UTC (permalink / raw) To: Longlong Xia Cc: linmiaohe, nao.horiguchi, akpm, david, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia On Thu, Oct 9, 2025 at 3:56 PM Longlong Xia <xialonglong2025@163.com> wrote: > > From: Longlong Xia <xialonglong@kylinos.cn> > > When a hardware memory error occurs on a KSM page, the current > behavior is to kill all processes mapping that page. This can > be overly aggressive when KSM has multiple duplicate pages in > a chain where other duplicates are still healthy. > > This patch introduces a recovery mechanism that attempts to migrate > mappings from the failing KSM page to another healthy KSM page within > the same chain before resorting to killing processes. Interesting, thanks for the patch! One question below. > > The recovery process works as follows: > 1. When a memory failure is detected on a KSM page, identify if the > failing node is part of a chain (has duplicates) > 2. Search for another healthy KSM page within the same chain > 3. For each process mapping the failing page: > - Update the PTE to point to the healthy KSM page > - Migrate the rmap_item to the new stable node > 4. If all migrations succeed, remove the failing node from the chain > 5. Only kill processes if recovery is impossible or fails > > Signed-off-by: Longlong Xia <xialonglong@kylinos.cn> > --- > mm/ksm.c | 183 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 183 insertions(+) > > diff --git a/mm/ksm.c b/mm/ksm.c > index 160787bb121c..590d30cfe800 100644 > --- a/mm/ksm.c > +++ b/mm/ksm.c > @@ -3084,6 +3084,183 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc) > } > > #ifdef CONFIG_MEMORY_FAILURE > +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node) > +{ > + struct ksm_stable_node *stable_node, *dup; > + struct rb_node *node; > + int nid; > + > + if (!is_stable_node_dup(dup_node)) > + return NULL; > + > + for (nid = 0; nid < ksm_nr_node_ids; nid++) { > + node = rb_first(root_stable_tree + nid); > + for (; node; node = rb_next(node)) { > + stable_node = rb_entry(node, > + struct ksm_stable_node, > + node); > + > + if (!is_stable_node_chain(stable_node)) > + continue; > + > + hlist_for_each_entry(dup, &stable_node->hlist, > + hlist_dup) { > + if (dup == dup_node) > + return stable_node; > + } > + } > + } > + > + return NULL; > +} > + > +static struct folio * > +find_target_folio(struct ksm_stable_node *failing_node, struct ksm_stable_node **target_dup) > +{ > + struct ksm_stable_node *chain_head, *dup; > + struct hlist_node *hlist_safe; > + struct folio *target_folio; > + > + if (!is_stable_node_dup(failing_node)) > + return NULL; > + > + chain_head = find_chain_head(failing_node); > + if (!chain_head) > + return NULL; > + > + hlist_for_each_entry_safe(dup, hlist_safe, &chain_head->hlist, hlist_dup) { > + if (dup == failing_node) > + continue; > + > + target_folio = ksm_get_folio(dup, KSM_GET_FOLIO_TRYLOCK); > + if (target_folio) { > + *target_dup = dup; > + return target_folio; > + } > + } > + > + return NULL; > +} > + > +static int replace_failing_page(struct vm_area_struct *vma, struct page *page, > + struct page *kpage, unsigned long addr) > +{ > + 
struct folio *kfolio = page_folio(kpage); > + struct mm_struct *mm = vma->vm_mm; > + struct folio *folio = page_folio(page); > + pmd_t *pmd; > + pte_t *ptep; > + pte_t newpte; > + spinlock_t *ptl; > + int err = -EFAULT; > + struct mmu_notifier_range range; > + > + pmd = mm_find_pmd(mm, addr); > + if (!pmd) > + goto out; > + > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr, > + addr + PAGE_SIZE); > + mmu_notifier_invalidate_range_start(&range); > + > + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); > + if (!ptep) > + goto out_mn; > + > + if (!is_zero_pfn(page_to_pfn(kpage))) { > + folio_get(kfolio); > + folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE); > + newpte = mk_pte(kpage, vma->vm_page_prot); > + } else { > + newpte = pte_mkdirty(pte_mkspecial(pfn_pte(page_to_pfn(kpage), vma->vm_page_prot))); > + ksm_map_zero_page(mm); > + dec_mm_counter(mm, MM_ANONPAGES); > + } Can find_target_folio() return the shared zeropage? If not, the else block looks like dead code and can be removed. And, a real hardware failure on the shared zeropage would be non-recoverable, I guess. Cheers, Lance > + > + flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep))); > + ptep_clear_flush(vma, addr, ptep); > + set_pte_at(mm, addr, ptep, newpte); > + > + folio_remove_rmap_pte(folio, page, vma); > + if (!folio_mapped(folio)) > + folio_free_swap(folio); > + folio_put(folio); > + > + pte_unmap_unlock(ptep, ptl); > + err = 0; > +out_mn: > + mmu_notifier_invalidate_range_end(&range); > +out: > + return err; > +} > + > +static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node) > +{ > + struct ksm_rmap_item *rmap_item; > + struct hlist_node *hlist_safe; > + struct folio *failing_folio = NULL; > + struct folio *target_folio = NULL; > + struct ksm_stable_node *target_dup = NULL; > + int err; > + > + if (!is_stable_node_dup(failing_node)) > + return false; > + > + failing_folio = ksm_get_folio(failing_node, KSM_GET_FOLIO_NOLOCK); > + if (!failing_folio) > + return false; > + > + target_folio = find_target_folio(failing_node, &target_dup); > + if (!target_folio) { > + folio_put(failing_folio); > + return false; > + } > + > + hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) { > + struct mm_struct *mm = rmap_item->mm; > + unsigned long addr = rmap_item->address & PAGE_MASK; > + struct vm_area_struct *vma; > + > + mmap_read_lock(mm); > + if (ksm_test_exit(mm)) { > + mmap_read_unlock(mm); > + continue; > + } > + > + vma = vma_lookup(mm, addr); > + if (!vma) { > + mmap_read_unlock(mm); > + continue; > + } > + > + /* Update PTE to point to target_folio's page */ > + err = replace_failing_page(vma, &failing_folio->page, > + &target_folio->page, addr); > + if (!err) { > + hlist_del(&rmap_item->hlist); > + rmap_item->head = target_dup; > + hlist_add_head(&rmap_item->hlist, &target_dup->hlist); > + target_dup->rmap_hlist_len++; > + failing_node->rmap_hlist_len--; > + > + } > + > + mmap_read_unlock(mm); > + } > + > + folio_unlock(target_folio); > + folio_put(target_folio); > + folio_put(failing_folio); > + > + if (failing_node->rmap_hlist_len == 0) { > + __stable_node_dup_del(failing_node); > + free_stable_node(failing_node); > + return true; > + } > + > + return false; > +} > + > /* > * Collect processes when the error hit an ksm page. 
> */ > @@ -3098,6 +3275,12 @@ void collect_procs_ksm(const struct folio *folio, const struct page *page, > stable_node = folio_stable_node(folio); > if (!stable_node) > return; > + > + if (ksm_recover_within_chain(stable_node)) { > + pr_debug("recovery within chain successful, no need to kill processes\n"); > + return; > + } > + > hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) { > struct anon_vma *av = rmap_item->anon_vma; > > -- > 2.43.0 > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-09 12:13 ` Lance Yang @ 2025-10-11 7:52 ` Lance Yang 2025-10-11 9:23 ` Miaohe Lin 0 siblings, 1 reply; 18+ messages in thread From: Lance Yang @ 2025-10-11 7:52 UTC (permalink / raw) To: linmiaohe Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia, Lance Yang, david @Miaohe I'd like to raise a concern about a potential hardware failure :) My tests show that if the shared zeropage (or huge zeropage) gets marked with HWpoison, the kernel continues to install it for new mappings. Surprisingly, it does not kill the accessing process ... The concern is, once the page is no longer zero-filled due to the hardware failure, what will happen? Would this lead to silent data corruption for applications that expect to read zeros? Thanks, Lance ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-11 7:52 ` Lance Yang @ 2025-10-11 9:23 ` Miaohe Lin 2025-10-11 9:38 ` Lance Yang 0 siblings, 1 reply; 18+ messages in thread From: Miaohe Lin @ 2025-10-11 9:23 UTC (permalink / raw) To: Lance Yang Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia, david On 2025/10/11 15:52, Lance Yang wrote: > @Miaohe > > I'd like to raise a concern about a potential hardware failure :) Thanks for your thought. > > My tests show that if the shared zeropage (or huge zeropage) gets marked > with HWpoison, the kernel continues to install it for new mappings. > Surprisingly, it does not kill the accessing process ... Have you investigated the cause? If user space writes to shared zeropage, it will trigger COW and a new page will be installed. After that, reading the newly allocated page won't trigger memory error. In this scene, it does not kill the accessing process. > > The concern is, once the page is no longer zero-filled due to the hardware > failure, what will happen? Would this lead to silent data corruption for > applications that expect to read zeros? IMHO, once the page is no longer zero-filled due to the hardware failure, later any read will trigger memory error and memory_failure should handle that. Thanks. . ^ permalink raw reply [flat|nested] 18+ messages in thread
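A tiny userspace sketch of the read-vs-write distinction described above
(mapping size and output are arbitrary): the read fault maps the shared
zeropage, while the first write triggers COW and installs a private page.

```c
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	char c = p[0];	/* read fault: PTE points at the shared zeropage */
	p[0] = 1;	/* write fault: COW installs a private page */
	printf("was %d, now %d\n", c, p[0]);
	return 0;
}
```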
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-11 9:23 ` Miaohe Lin @ 2025-10-11 9:38 ` Lance Yang 2025-10-11 12:57 ` Lance Yang 2025-10-13 3:39 ` Miaohe Lin 0 siblings, 2 replies; 18+ messages in thread From: Lance Yang @ 2025-10-11 9:38 UTC (permalink / raw) To: Miaohe Lin Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia, david On 2025/10/11 17:23, Miaohe Lin wrote: > On 2025/10/11 15:52, Lance Yang wrote: >> @Miaohe >> >> I'd like to raise a concern about a potential hardware failure :) > > Thanks for your thought. > >> >> My tests show that if the shared zeropage (or huge zeropage) gets marked >> with HWpoison, the kernel continues to install it for new mappings. >> Surprisingly, it does not kill the accessing process ... > > Have you investigated the cause? If user space writes to shared zeropage, > it will trigger COW and a new page will be installed. After that, reading > the newly allocated page won't trigger memory error. In this scene, it does > not kill the accessing process. Not write just read :) > >> >> The concern is, once the page is no longer zero-filled due to the hardware >> failure, what will happen? Would this lead to silent data corruption for >> applications that expect to read zeros? > > IMHO, once the page is no longer zero-filled due to the hardware failure, later > any read will trigger memory error and memory_failure should handle that. I've only tested injecting an error on the shared zeropage using corrupt-pfn: echo $PFN > /sys/kernel/debug/hwpoison/corrupt-pfn But no memory error was triggered on a subsequent read ... Anyway, I'm trying to explore other ways to simulate hardware failure :) Thanks, Lance ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-11 9:38 ` Lance Yang @ 2025-10-11 12:57 ` Lance Yang 2025-10-13 3:39 ` Miaohe Lin 1 sibling, 0 replies; 18+ messages in thread From: Lance Yang @ 2025-10-11 12:57 UTC (permalink / raw) To: Miaohe Lin, qiuxu.zhuo Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia, david Cc Qiuxu On 2025/10/11 17:38, Lance Yang wrote: > > > On 2025/10/11 17:23, Miaohe Lin wrote: >> On 2025/10/11 15:52, Lance Yang wrote: >>> @Miaohe >>> >>> I'd like to raise a concern about a potential hardware failure :) >> >> Thanks for your thought. >> >>> >>> My tests show that if the shared zeropage (or huge zeropage) gets marked >>> with HWpoison, the kernel continues to install it for new mappings. >>> Surprisingly, it does not kill the accessing process ... >> >> Have you investigated the cause? If user space writes to shared zeropage, >> it will trigger COW and a new page will be installed. After that, reading >> the newly allocated page won't trigger memory error. In this scene, it >> does >> not kill the accessing process. > > Not write just read :) > >> >>> >>> The concern is, once the page is no longer zero-filled due to the >>> hardware >>> failure, what will happen? Would this lead to silent data corruption for >>> applications that expect to read zeros? >> >> IMHO, once the page is no longer zero-filled due to the hardware >> failure, later >> any read will trigger memory error and memory_failure should handle that. > > I've only tested injecting an error on the shared zeropage using > corrupt-pfn: > > echo $PFN > /sys/kernel/debug/hwpoison/corrupt-pfn > > But no memory error was triggered on a subsequent read ... > > Anyway, I'm trying to explore other ways to simulate hardware failure :) > > Thanks, > Lance > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures
  2025-10-11  9:38 ` Lance Yang
  2025-10-11 12:57   ` Lance Yang
@ 2025-10-13  3:39 ` Miaohe Lin
  2025-10-13  4:42   ` Lance Yang
  1 sibling, 1 reply; 18+ messages in thread

From: Miaohe Lin @ 2025-10-13 3:39 UTC (permalink / raw)
To: Lance Yang
Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16,
    linux-kernel, linux-mm, Longlong Xia, david

On 2025/10/11 17:38, Lance Yang wrote:
> On 2025/10/11 17:23, Miaohe Lin wrote:
>> On 2025/10/11 15:52, Lance Yang wrote:
>>> @Miaohe
>>>
>>> I'd like to raise a concern about a potential hardware failure :)
>>
>> Thanks for your thought.
>>
>>> My tests show that if the shared zeropage (or huge zeropage) gets marked
>>> with HWpoison, the kernel continues to install it for new mappings.
>>> Surprisingly, it does not kill the accessing process ...
>>
>> Have you investigated the cause? If user space writes to shared zeropage,
>> it will trigger COW and a new page will be installed. After that, reading
>> the newly allocated page won't trigger memory error. In this scene, it does
>> not kill the accessing process.
>
> Not write just read :)
>
>>> The concern is, once the page is no longer zero-filled due to the hardware
>>> failure, what will happen? Would this lead to silent data corruption for
>>> applications that expect to read zeros?
>>
>> IMHO, once the page is no longer zero-filled due to the hardware failure, later
>> any read will trigger memory error and memory_failure should handle that.
>
> I've only tested injecting an error on the shared zeropage using corrupt-pfn:
>
> echo $PFN > /sys/kernel/debug/hwpoison/corrupt-pfn
>
> But no memory error was triggered on a subsequent read ...

It's because corrupt-pfn only provides a software error injection
mechanism. If you want to trigger a memory error on read, you need to
use a hardware error injection mechanism, e.g. APEI Error INJection [1].

[1] https://www.kernel.org/doc/html/v5.8/firmware-guide/acpi/apei/einj.html

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread
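For reference, the EINJ debugfs interface linked above is typically driven
like this; a sketch based on that documentation, where the error_type
value, target address, and mask are illustrative and the supported types
vary by platform:

```
cd /sys/kernel/debug/apei/einj
cat available_error_type            # list what the platform supports
echo 0x10 > error_type              # e.g. memory uncorrectable non-fatal
echo 0x29b8cf5000 > param1          # physical address to inject at
echo 0xfffffffffffff000 > param2    # address mask (page granularity)
echo 1 > error_inject               # trigger the injection
```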
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-13 3:39 ` Miaohe Lin @ 2025-10-13 4:42 ` Lance Yang 2025-10-13 9:15 ` Lance Yang 0 siblings, 1 reply; 18+ messages in thread From: Lance Yang @ 2025-10-13 4:42 UTC (permalink / raw) To: Miaohe Lin, qiuxu.zhuo Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia, david On 2025/10/13 11:39, Miaohe Lin wrote: > On 2025/10/11 17:38, Lance Yang wrote: >> >> >> On 2025/10/11 17:23, Miaohe Lin wrote: >>> On 2025/10/11 15:52, Lance Yang wrote: >>>> @Miaohe >>>> >>>> I'd like to raise a concern about a potential hardware failure :) >>> >>> Thanks for your thought. >>> >>>> >>>> My tests show that if the shared zeropage (or huge zeropage) gets marked >>>> with HWpoison, the kernel continues to install it for new mappings. >>>> Surprisingly, it does not kill the accessing process ... >>> >>> Have you investigated the cause? If user space writes to shared zeropage, >>> it will trigger COW and a new page will be installed. After that, reading >>> the newly allocated page won't trigger memory error. In this scene, it does >>> not kill the accessing process. >> >> Not write just read :) >> >>> >>>> >>>> The concern is, once the page is no longer zero-filled due to the hardware >>>> failure, what will happen? Would this lead to silent data corruption for >>>> applications that expect to read zeros? >>> >>> IMHO, once the page is no longer zero-filled due to the hardware failure, later >>> any read will trigger memory error and memory_failure should handle that. >> >> I've only tested injecting an error on the shared zeropage using corrupt-pfn: >> >> echo $PFN > /sys/kernel/debug/hwpoison/corrupt-pfn >> >> But no memory error was triggered on a subsequent read ... > > It's because corrupt-pfn only provides a software error injection mechanism. > If you want to trigger memory error on read, you need use hardware error injection > mechanism e.g.APEI Error INJection [1]. > > [1] https://www.kernel.org/doc/html/v5.8/firmware-guide/acpi/apei/einj.html Nice! You're right, thanks for pointing that out! I'm not very familiar with hardware error injection. Fortunately, Qiuxu is looking into that and running some tests on the shared zeropage. Well, I think he will follow up with his findings ;p Cheers, Lance ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-13 4:42 ` Lance Yang @ 2025-10-13 9:15 ` Lance Yang 2025-10-13 9:25 ` David Hildenbrand 0 siblings, 1 reply; 18+ messages in thread From: Lance Yang @ 2025-10-13 9:15 UTC (permalink / raw) To: david Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, Miaohe Lin, qiuxu.zhuo @David Cc: MM CORE folks On 2025/10/13 12:42, Lance Yang wrote: [...] Cool. Hardware error injection with EINJ was the way to go! I just ran some tests on the shared zero page (both regular and huge), and found a tricky behavior: 1) When a hardware error is injected into the zeropage, the process that attempts to read from a mapping backed by it is correctly killed with a SIGBUS. 2) However, even after the error is detected, the kernel continues to install the known-poisoned zeropage for new anonymous mappings ... For the shared zeropage: ``` [Mon Oct 13 16:29:02 2025] mce: Uncorrected hardware memory error in user-access at 29b8cf5000 [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: Sending SIGBUS to read_zeropage:13767 due to hardware memory corruption [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: recovery action for already poisoned page: Failed ``` And for the shared huge zeropage: ``` [Mon Oct 13 16:35:34 2025] mce: Uncorrected hardware memory error in user-access at 1e1e00000 [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: Sending SIGBUS to read_huge_zerop:13891 due to hardware memory corruption [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: recovery action for already poisoned page: Failed ``` Since we've identified an uncorrectable hardware error on such a critical, singleton page, should we be doing something more? Thanks, Lance ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures
  2025-10-13  9:15 ` Lance Yang
@ 2025-10-13  9:25 ` David Hildenbrand
  2025-10-13  9:46   ` Balbir Singh
  2025-10-13 11:00   ` Lance Yang
  2 siblings, 2 replies; 18+ messages in thread

From: David Hildenbrand @ 2025-10-13 9:25 UTC (permalink / raw)
To: Lance Yang
Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16,
    linux-kernel, linux-mm, Longlong Xia, lorenzo.stoakes, Liam.Howlett,
    vbabka, rppt, surenb, mhocko, Miaohe Lin, qiuxu.zhuo

On 13.10.25 11:15, Lance Yang wrote:
> @David
>
> Cc: MM CORE folks
>
> On 2025/10/13 12:42, Lance Yang wrote:
> [...]
>
> Cool. Hardware error injection with EINJ was the way to go!
>
> I just ran some tests on the shared zero page (both regular and huge),
> and found a tricky behavior:
>
> 1) When a hardware error is injected into the zeropage, the process that
> attempts to read from a mapping backed by it is correctly killed with a
> SIGBUS.
>
> 2) However, even after the error is detected, the kernel continues to
> install the known-poisoned zeropage for new anonymous mappings ...
>
> For the shared zeropage:
> ```
> [Mon Oct 13 16:29:02 2025] mce: Uncorrected hardware memory error in
> user-access at 29b8cf5000
> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: Sending SIGBUS to
> read_zeropage:13767 due to hardware memory corruption
> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: recovery action
> for already poisoned page: Failed
> ```
> And for the shared huge zeropage:
> ```
> [Mon Oct 13 16:35:34 2025] mce: Uncorrected hardware memory error in
> user-access at 1e1e00000
> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: Sending SIGBUS to
> read_huge_zerop:13891 due to hardware memory corruption
> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: recovery action for
> already poisoned page: Failed
> ```
>
> Since we've identified an uncorrectable hardware error on such a critical,
> singleton page, should we be doing something more?

I mean, regarding the shared zeropage, we could try walking all page
tables of all processes and replace it by a fresh shared zeropage.

But then, the page might also be used for other things (I/O etc), the
shared zeropage is allocated by the architecture, we'd have to make
is_zero_pfn() succeed on the old+new page etc ...

So a lot of work for little benefit I guess? The question is how often
we would see that in practice. I'd assume we'd see it happen on random
kernel memory more frequently, where we can really just bring down the
whole machine.

--
Cheers

David / dhildenb

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures
  2025-10-13  9:25 ` David Hildenbrand
@ 2025-10-13  9:46 ` Balbir Singh
  0 siblings, 0 replies; 18+ messages in thread

From: Balbir Singh @ 2025-10-13 9:46 UTC (permalink / raw)
To: David Hildenbrand, Lance Yang
Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16,
    linux-kernel, linux-mm, Longlong Xia, lorenzo.stoakes, Liam.Howlett,
    vbabka, rppt, surenb, mhocko, Miaohe Lin, qiuxu.zhuo

On 10/13/25 20:25, David Hildenbrand wrote:
> On 13.10.25 11:15, Lance Yang wrote:
>> @David
>>
>> Cc: MM CORE folks
>>
>> On 2025/10/13 12:42, Lance Yang wrote:
>> [...]
>>
>> Cool. Hardware error injection with EINJ was the way to go!
>>
>> I just ran some tests on the shared zero page (both regular and huge),
>> and found a tricky behavior:
>>
>> 1) When a hardware error is injected into the zeropage, the process that
>> attempts to read from a mapping backed by it is correctly killed with a
>> SIGBUS.
>>
>> 2) However, even after the error is detected, the kernel continues to
>> install the known-poisoned zeropage for new anonymous mappings ...
>>
>> For the shared zeropage:
>> ```
>> [Mon Oct 13 16:29:02 2025] mce: Uncorrected hardware memory error in
>> user-access at 29b8cf5000
>> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: Sending SIGBUS to
>> read_zeropage:13767 due to hardware memory corruption
>> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: recovery action
>> for already poisoned page: Failed
>> ```
>> And for the shared huge zeropage:
>> ```
>> [Mon Oct 13 16:35:34 2025] mce: Uncorrected hardware memory error in
>> user-access at 1e1e00000
>> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: Sending SIGBUS to
>> read_huge_zerop:13891 due to hardware memory corruption
>> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: recovery action for
>> already poisoned page: Failed
>> ```
>>
>> Since we've identified an uncorrectable hardware error on such a critical,
>> singleton page, should we be doing something more?
>
> I mean, regarding the shared zeropage, we could try walking all page
> tables of all processes and replace it by a fresh shared zeropage.
>
> But then, the page might also be used for other things (I/O etc), the
> shared zeropage is allocated by the architecture, we'd have to make
> is_zero_pfn() succeed on the old+new page etc ...
>
> So a lot of work for little benefit I guess? The question is how often
> we would see that in practice. I'd assume we'd see it happen on random
> kernel memory more frequently, where we can really just bring down the
> whole machine.

empty_zero_page belongs to the .bss and zero_pfn is quite deeply buried
in its usage. The same concerns apply to zero_folio as well. I agree,
it's a lot of work to recover from errors on the zero_page.

Balbir

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures
  2025-10-13  9:25 ` David Hildenbrand
  2025-10-13  9:46   ` Balbir Singh
@ 2025-10-13 11:00 ` Lance Yang
  2025-10-13 11:13   ` David Hildenbrand
  1 sibling, 1 reply; 18+ messages in thread

From: Lance Yang @ 2025-10-13 11:00 UTC (permalink / raw)
To: David Hildenbrand
Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16,
    linux-kernel, linux-mm, Longlong Xia, lorenzo.stoakes, Liam.Howlett,
    vbabka, rppt, surenb, mhocko, Miaohe Lin, qiuxu.zhuo

On 2025/10/13 17:25, David Hildenbrand wrote:
> On 13.10.25 11:15, Lance Yang wrote:
>> @David
>>
>> Cc: MM CORE folks
>>
>> On 2025/10/13 12:42, Lance Yang wrote:
>> [...]
>>
>> Cool. Hardware error injection with EINJ was the way to go!
>>
>> I just ran some tests on the shared zero page (both regular and huge),
>> and found a tricky behavior:
>>
>> 1) When a hardware error is injected into the zeropage, the process that
>> attempts to read from a mapping backed by it is correctly killed with a
>> SIGBUS.
>>
>> 2) However, even after the error is detected, the kernel continues to
>> install the known-poisoned zeropage for new anonymous mappings ...
>>
>> For the shared zeropage:
>> ```
>> [Mon Oct 13 16:29:02 2025] mce: Uncorrected hardware memory error in
>> user-access at 29b8cf5000
>> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: Sending SIGBUS to
>> read_zeropage:13767 due to hardware memory corruption
>> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: recovery action
>> for already poisoned page: Failed
>> ```
>> And for the shared huge zeropage:
>> ```
>> [Mon Oct 13 16:35:34 2025] mce: Uncorrected hardware memory error in
>> user-access at 1e1e00000
>> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: Sending SIGBUS to
>> read_huge_zerop:13891 due to hardware memory corruption
>> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: recovery action for
>> already poisoned page: Failed
>> ```
>>
>> Since we've identified an uncorrectable hardware error on such a critical,
>> singleton page, should we be doing something more?
>
> I mean, regarding the shared zeropage, we could try walking all page
> tables of all processes and replace it by a fresh shared zeropage.
>
> But then, the page might also be used for other things (I/O etc), the
> shared zeropage is allocated by the architecture, we'd have to make
> is_zero_pfn() succeed on the old+new page etc ...
>
> So a lot of work for little benefit I guess? The question is how often
> we would see that in practice. I'd assume we'd see it happen on random
> kernel memory more frequently, where we can really just bring down the
> whole machine.

Thanks for your thoughts!

I agree, fixing the regular zeropage is a real mess ...

But for the huge zeropage, what if we just stop installing it once it's
poisoned? We could just disable it globally. Something like this:

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f698df156bf8..8543f4385ffe 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2193,6 +2193,10 @@ int memory_failure(unsigned long pfn, int flags)
 	if (!(flags & MF_SW_SIMULATED))
 		hw_memory_failure = true;
 
+	if (is_huge_zero_pfn(pfn))
+		clear_bit(TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
+			  &transparent_hugepage_flags);
+
 	p = pfn_to_online_page(pfn);
 	if (!p) {
 		res = arch_memory_failure(pfn, flags);

Seems easy enough ...

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-13 11:00 ` Lance Yang @ 2025-10-13 11:13 ` David Hildenbrand 2025-10-13 11:18 ` Lance Yang 0 siblings, 1 reply; 18+ messages in thread From: David Hildenbrand @ 2025-10-13 11:13 UTC (permalink / raw) To: Lance Yang Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, Miaohe Lin, qiuxu.zhuo On 13.10.25 13:00, Lance Yang wrote: > > > On 2025/10/13 17:25, David Hildenbrand wrote: >> On 13.10.25 11:15, Lance Yang wrote: >>> @David >>> >>> Cc: MM CORE folks >>> >>> On 2025/10/13 12:42, Lance Yang wrote: >>> [...] >>> >>> Cool. Hardware error injection with EINJ was the way to go! >>> >>> I just ran some tests on the shared zero page (both regular and huge), >>> and >>> found a tricky behavior: >>> >>> 1) When a hardware error is injected into the zeropage, the process that >>> attempts to read from a mapping backed by it is correctly killed with a >>> SIGBUS. >>> >>> 2) However, even after the error is detected, the kernel continues to >>> install >>> the known-poisoned zeropage for new anonymous mappings ... >>> >>> >>> For the shared zeropage: >>> ``` >>> [Mon Oct 13 16:29:02 2025] mce: Uncorrected hardware memory error in >>> user-access at 29b8cf5000 >>> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: Sending SIGBUS to >>> read_zeropage:13767 due to hardware memory corruption >>> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: recovery action >>> for already poisoned page: Failed >>> ``` >>> And for the shared huge zeropage: >>> ``` >>> [Mon Oct 13 16:35:34 2025] mce: Uncorrected hardware memory error in >>> user-access at 1e1e00000 >>> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: Sending SIGBUS to >>> read_huge_zerop:13891 due to hardware memory corruption >>> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: recovery action for >>> already poisoned page: Failed >>> ``` >>> >>> Since we've identified an uncorrectable hardware error on such a >>> critical, >>> singleton page, should we be doing something more? >> >> I mean, regarding the shared zeropage, we could try walking all page >> tables of all processes and replace it be a fresh shared zeropage. >> >> But then, the page might also be used for other things (I/O etc), the >> shared zeropage is allocated by the architecture, we'd have to make >> is_zero_pfn() succeed on the old+new page etc ... >> >> So a lot of work for little benefit I guess? The question is how often >> we would see that in practice. I'd assume we'd see it happen on random >> kernel memory more frequently where we can really just bring down the >> whole machine. > > Thanks for your thoughts! > > I agree, fixing the regular zeropage is a really mess ... > > But for the huge zeropage, what if we just stop installing it once it's > poisoned? We could just disable it globally. Something like this: We now have the static huge zero folio that could silently be used for I/O without a reference etc. So I'm afraid this is all just making corner cases slightly better. -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-13 11:13 ` David Hildenbrand @ 2025-10-13 11:18 ` Lance Yang 0 siblings, 0 replies; 18+ messages in thread From: Lance Yang @ 2025-10-13 11:18 UTC (permalink / raw) To: David Hildenbrand Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, Miaohe Lin, qiuxu.zhuo On 2025/10/13 19:13, David Hildenbrand wrote: > On 13.10.25 13:00, Lance Yang wrote: >> >> >> On 2025/10/13 17:25, David Hildenbrand wrote: >>> On 13.10.25 11:15, Lance Yang wrote: >>>> @David >>>> >>>> Cc: MM CORE folks >>>> >>>> On 2025/10/13 12:42, Lance Yang wrote: >>>> [...] >>>> >>>> Cool. Hardware error injection with EINJ was the way to go! >>>> >>>> I just ran some tests on the shared zero page (both regular and huge), >>>> and >>>> found a tricky behavior: >>>> >>>> 1) When a hardware error is injected into the zeropage, the process >>>> that >>>> attempts to read from a mapping backed by it is correctly killed with a >>>> SIGBUS. >>>> >>>> 2) However, even after the error is detected, the kernel continues to >>>> install >>>> the known-poisoned zeropage for new anonymous mappings ... >>>> >>>> >>>> For the shared zeropage: >>>> ``` >>>> [Mon Oct 13 16:29:02 2025] mce: Uncorrected hardware memory error in >>>> user-access at 29b8cf5000 >>>> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: Sending SIGBUS to >>>> read_zeropage:13767 due to hardware memory corruption >>>> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: recovery action >>>> for already poisoned page: Failed >>>> ``` >>>> And for the shared huge zeropage: >>>> ``` >>>> [Mon Oct 13 16:35:34 2025] mce: Uncorrected hardware memory error in >>>> user-access at 1e1e00000 >>>> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: Sending SIGBUS to >>>> read_huge_zerop:13891 due to hardware memory corruption >>>> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: recovery action >>>> for >>>> already poisoned page: Failed >>>> ``` >>>> >>>> Since we've identified an uncorrectable hardware error on such a >>>> critical, >>>> singleton page, should we be doing something more? >>> >>> I mean, regarding the shared zeropage, we could try walking all page >>> tables of all processes and replace it be a fresh shared zeropage. >>> >>> But then, the page might also be used for other things (I/O etc), the >>> shared zeropage is allocated by the architecture, we'd have to make >>> is_zero_pfn() succeed on the old+new page etc ... >>> >>> So a lot of work for little benefit I guess? The question is how often >>> we would see that in practice. I'd assume we'd see it happen on random >>> kernel memory more frequently where we can really just bring down the >>> whole machine. >> >> Thanks for your thoughts! >> >> I agree, fixing the regular zeropage is a really mess ... >> >> But for the huge zeropage, what if we just stop installing it once it's >> poisoned? We could just disable it globally. Something like this: > > We now have the static huge zero folio that could silently be used for > I/O without a reference etc. > > So I'm afraid this is all just making corner cases slightly better. Ah, I see. Appreciate you taking the time to explain that! ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures
  2025-10-09  7:00 ` [PATCH RFC 1/1] " Longlong Xia
  2025-10-09 12:13   ` Lance Yang
@ 2025-10-11  3:25 ` Miaohe Lin
  2025-10-13 20:10   ` [PATCH RFC] " Markus Elfring
  2 siblings, 0 replies; 18+ messages in thread

From: Miaohe Lin @ 2025-10-11 3:25 UTC (permalink / raw)
To: Longlong Xia
Cc: akpm, david, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm,
    Longlong Xia, Naoya Horiguchi

On 2025/10/9 15:00, Longlong Xia wrote:
> From: Longlong Xia <xialonglong@kylinos.cn>
>
> When a hardware memory error occurs on a KSM page, the current
> behavior is to kill all processes mapping that page. This can
> be overly aggressive when KSM has multiple duplicate pages in
> a chain where other duplicates are still healthy.
>
> This patch introduces a recovery mechanism that attempts to migrate
> mappings from the failing KSM page to another healthy KSM page within
> the same chain before resorting to killing processes.
>
> The recovery process works as follows:
> 1. When a memory failure is detected on a KSM page, identify if the
>    failing node is part of a chain (has duplicates)
> 2. Search for another healthy KSM page within the same chain
> 3. For each process mapping the failing page:
>    - Update the PTE to point to the healthy KSM page
>    - Migrate the rmap_item to the new stable node
> 4. If all migrations succeed, remove the failing node from the chain
> 5. Only kill processes if recovery is impossible or fails

Thanks for your patch. This looks really interesting to me and I think
this might be a significant improvement. :)

>
> Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
> ---
>  mm/ksm.c | 183 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 183 insertions(+)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 160787bb121c..590d30cfe800 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -3084,6 +3084,183 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>  }
>  
>  #ifdef CONFIG_MEMORY_FAILURE
> +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
> +{
> +	struct ksm_stable_node *stable_node, *dup;
> +	struct rb_node *node;
> +	int nid;
> +
> +	if (!is_stable_node_dup(dup_node))
> +		return NULL;
> +
> +	for (nid = 0; nid < ksm_nr_node_ids; nid++) {
> +		node = rb_first(root_stable_tree + nid);
> +		for (; node; node = rb_next(node)) {
> +			stable_node = rb_entry(node,
> +					       struct ksm_stable_node,
> +					       node);
> +
> +			if (!is_stable_node_chain(stable_node))
> +				continue;
> +
> +			hlist_for_each_entry(dup, &stable_node->hlist,
> +					     hlist_dup) {
> +				if (dup == dup_node)
> +					return stable_node;
> +			}
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +static struct folio *
> +find_target_folio(struct ksm_stable_node *failing_node, struct ksm_stable_node **target_dup)
> +{
> +	struct ksm_stable_node *chain_head, *dup;
> +	struct hlist_node *hlist_safe;
> +	struct folio *target_folio;
> +
> +	if (!is_stable_node_dup(failing_node))
> +		return NULL;
> +
> +	chain_head = find_chain_head(failing_node);
> +	if (!chain_head)
> +		return NULL;
> +
> +	hlist_for_each_entry_safe(dup, hlist_safe, &chain_head->hlist, hlist_dup) {
> +		if (dup == failing_node)
> +			continue;
> +
> +		target_folio = ksm_get_folio(dup, KSM_GET_FOLIO_TRYLOCK);
> +		if (target_folio) {
> +			*target_dup = dup;
> +			return target_folio;
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +static int replace_failing_page(struct vm_area_struct *vma, struct page *page,
> +				struct page *kpage, unsigned long addr)
> +{
> +	struct folio *kfolio = page_folio(kpage);
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct folio *folio = page_folio(page);
> +	pmd_t *pmd;
> +	pte_t *ptep;
> +	pte_t newpte;
> +	spinlock_t *ptl;
> +	int err = -EFAULT;
> +	struct mmu_notifier_range range;
> +
> +	pmd = mm_find_pmd(mm, addr);
> +	if (!pmd)
> +		goto out;
> +
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
> +				addr + PAGE_SIZE);
> +	mmu_notifier_invalidate_range_start(&range);
> +
> +	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> +	if (!ptep)
> +		goto out_mn;
> +
> +	if (!is_zero_pfn(page_to_pfn(kpage))) {
> +		folio_get(kfolio);
> +		folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
> +		newpte = mk_pte(kpage, vma->vm_page_prot);
> +	} else {
> +		newpte = pte_mkdirty(pte_mkspecial(pfn_pte(page_to_pfn(kpage), vma->vm_page_prot)));
> +		ksm_map_zero_page(mm);
> +		dec_mm_counter(mm, MM_ANONPAGES);
> +	}
> +
> +	flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep)));
> +	ptep_clear_flush(vma, addr, ptep);
> +	set_pte_at(mm, addr, ptep, newpte);
> +
> +	folio_remove_rmap_pte(folio, page, vma);
> +	if (!folio_mapped(folio))
> +		folio_free_swap(folio);
> +	folio_put(folio);
> +
> +	pte_unmap_unlock(ptep, ptl);
> +	err = 0;
> +out_mn:
> +	mmu_notifier_invalidate_range_end(&range);
> +out:
> +	return err;
> +}
> +
> +static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node)
> +{
> +	struct ksm_rmap_item *rmap_item;
> +	struct hlist_node *hlist_safe;
> +	struct folio *failing_folio = NULL;
> +	struct folio *target_folio = NULL;
> +	struct ksm_stable_node *target_dup = NULL;
> +	int err;
> +
> +	if (!is_stable_node_dup(failing_node))
> +		return false;
> +
> +	failing_folio = ksm_get_folio(failing_node, KSM_GET_FOLIO_NOLOCK);
> +	if (!failing_folio)
> +		return false;
> +
> +	target_folio = find_target_folio(failing_node, &target_dup);
> +	if (!target_folio) {
> +		folio_put(failing_folio);
> +		return false;
> +	}
> +
> +	hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
> +		struct mm_struct *mm = rmap_item->mm;
> +		unsigned long addr = rmap_item->address & PAGE_MASK;
> +		struct vm_area_struct *vma;
> +
> +		mmap_read_lock(mm);

I briefly looked at the code and found that the lock order might be
broken here. mm/filemap.c documents the lock order as below:

 * ->mmap_lock
 *   ->invalidate_lock		(filemap_fault)
 *     ->lock_page		(filemap_fault, access_process_vm)

But mmap_lock is held after the folio is locked here. Is this a problem?

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread
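One conventional way to avoid the inversion Miaohe points out, sketched
here as an assumption rather than anything proposed in the thread, is to
only trylock mmap_lock while the target folio is held locked:

```c
/*
 * Hypothetical tweak to the per-mm loop in ksm_recover_within_chain():
 * target_folio is already locked, so never *block* on mmap_lock here.
 * If it cannot be taken immediately, skip this mm (it falls back to
 * the kill path) instead of risking an ABBA deadlock against a task
 * that holds mmap_lock and waits on the folio lock.
 */
if (!mmap_read_trylock(mm))
	continue;
```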
* Re: [PATCH RFC] mm/ksm: Add recovery mechanism for memory failures 2025-10-09 7:00 ` [PATCH RFC 1/1] " Longlong Xia 2025-10-09 12:13 ` Lance Yang 2025-10-11 3:25 ` Miaohe Lin @ 2025-10-13 20:10 ` Markus Elfring 2 siblings, 0 replies; 18+ messages in thread From: Markus Elfring @ 2025-10-13 20:10 UTC (permalink / raw) To: Longlong Xia, linux-mm, Miaohe Lin, Naoya Horiguchi Cc: Longlong Xia, LKML, Andrew Morton, David Hildenbrand, Kefeng Wang, xu xin … > This patch introduces … See also: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.17#n94 … > +++ b/mm/ksm.c … > +static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node) > +{ … > + struct vm_area_struct *vma; > + > + mmap_read_lock(mm); … > + } > + > + mmap_read_unlock(mm); … Under which circumstances would you become interested to apply a statement like “guard(mmap_read_lock)(mm);”? https://elixir.bootlin.com/linux/v6.17.1/source/include/linux/mmap_lock.h#L483-L484 Regards, Markus ^ permalink raw reply [flat|nested] 18+ messages in thread
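Roughly what the guard() suggestion could look like applied to the loop
from the patch; a sketch only, where the scope-based guard drops the lock
automatically at every scope exit (including each continue), so the
explicit mmap_read_unlock() calls disappear:

```c
hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
	struct mm_struct *mm = rmap_item->mm;
	unsigned long addr = rmap_item->address & PAGE_MASK;
	struct vm_area_struct *vma;

	guard(mmap_read_lock)(mm);	/* released at end of iteration */

	if (ksm_test_exit(mm))
		continue;

	vma = vma_lookup(mm, addr);
	if (!vma)
		continue;

	if (!replace_failing_page(vma, &failing_folio->page,
				  &target_folio->page, addr)) {
		/* ... move the rmap_item to target_dup as in the patch ... */
	}
}
```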
* Re: [PATCH RFC 0/1] mm/ksm: Add recovery mechanism for memory failures
  2025-10-09  7:00 [PATCH RFC 0/1] mm/ksm: Add recovery mechanism for memory failures Longlong Xia
  2025-10-09  7:00 ` [PATCH RFC 1/1] " Longlong Xia
@ 2025-10-09 18:57 ` David Hildenbrand
  1 sibling, 0 replies; 18+ messages in thread

From: David Hildenbrand @ 2025-10-09 18:57 UTC (permalink / raw)
To: Longlong Xia, linmiaohe, nao.horiguchi
Cc: akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia

On 09.10.25 09:00, Longlong Xia wrote:
> From: Longlong Xia <xialonglong@kylinos.cn>
>
> When a hardware memory error occurs on a KSM page, the current
> behavior is to kill all processes mapping that page. This can
> be overly aggressive when KSM has multiple duplicate pages in
> a chain where other duplicates are still healthy.
>
> This patch introduces a recovery mechanism that attempts to migrate
> mappings from the failing page to another healthy duplicate within
> the same chain before resorting to killing processes.

An alternative could be to allocate a new page and effectively migrate
from the old (degraded) page to the new page by copying page content
from one of the healthy duplicates. That would keep the #mappings per
page in the chain balanced.

> The recovery process works as follows:
> 1. When a memory failure is detected on a KSM page, identify whether the
>    failing node is part of a chain (has duplicates). (Perhaps add a
>    dup_head field to struct ksm_stable_node to record the chain head,
>    saving a search of the whole stable tree, or find the head some
>    other way.)
> 2. Search for another healthy duplicate page within the same chain.
> 3. For each process mapping the failing page:
>    - Update the PTE to point to the healthy duplicate page (perhaps
>      reuse replace_page, or split replace_page into smaller functions
>      and share the common part).
>    - Migrate the rmap_item to the new stable node.
> 4. If all migrations succeed, remove the failing node from the chain.
> 5. Only kill processes if recovery is impossible or fails.

Does not sound too crazy. But how realistic do we consider that in
practice?

We need quite a bunch of processes to dedup the same page to end up
getting duplicates in the chain IIRC. So isn't this rather an
improvement only for less likely scenarios in practice?

--
Cheers

David / dhildenb

^ permalink raw reply	[flat|nested] 18+ messages in thread