* [PATCH RFC 0/1] mm/ksm: Add recovery mechanism for memory failures
@ 2025-10-09  7:00 Longlong Xia
  2025-10-09  7:00 ` [PATCH RFC 1/1] " Longlong Xia
  2025-10-09 18:57 ` [PATCH RFC 0/1] " David Hildenbrand
  0 siblings, 2 replies; 18+ messages in thread

From: Longlong Xia @ 2025-10-09 7:00 UTC (permalink / raw)
To: linmiaohe, nao.horiguchi
Cc: akpm, david, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia

From: Longlong Xia <xialonglong@kylinos.cn>

When a hardware memory error occurs on a KSM page, the current
behavior is to kill all processes mapping that page. This can
be overly aggressive when KSM has multiple duplicate pages in
a chain where other duplicates are still healthy.

This patch introduces a recovery mechanism that attempts to migrate
mappings from the failing page to another healthy duplicate within
the same chain before resorting to killing processes.

The recovery process works as follows:
1. When a memory failure is detected on a KSM page, identify whether the
   failing node is part of a chain (has duplicates). (Perhaps add a
   dup_head field to struct ksm_stable_node to record the chain head,
   saving a search of the whole stable tree, or find the head some other
   way; see the sketch after this message.)
2. Search for another healthy duplicate page within the same chain.
3. For each process mapping the failing page:
   - Update the PTE to point to the healthy duplicate page (perhaps
     reuse replace_page, or split replace_page into smaller functions
     and share the common part).
   - Migrate the rmap_item to the new stable node.
4. If all migrations succeed, remove the failing node from the chain.
5. Only kill processes if recovery is impossible or fails.

The original idea came from Naoya Horiguchi.
https://lore.kernel.org/all/20230331054243.GB1435482@hori.linux.bs1.fc.nec.co.jp/

I tested it with /sys/kernel/debug/hwpoison/corrupt-pfn in qemu-x86_64.
Here are my test steps and results:

1. Allocate 1024 pages with the same content and enable KSM to merge
   them. After merging (each shared phy_addr is printed only once):
   a. virtual addr = 0x7e4c68a00000 phy_addr = 0x10e802000
   b. virtual addr = 0x7e4c68b2c000 phy_addr = 0x10e902000
   c. virtual addr = 0x7e4c68c26000 phy_addr = 0x10ea02000
   d. virtual addr = 0x7e4c68d20000 phy_addr = 0x10eb02000
2. echo 0x10e802 > /sys/kernel/debug/hwpoison/corrupt-pfn
   a. virtual addr = 0x7e4c68a00000 phy_addr = 0x10eb02000
   b. virtual addr = 0x7e4c68b2c000 phy_addr = 0x10e902000
   c. virtual addr = 0x7e4c68c26000 phy_addr = 0x10ea02000
   d. virtual addr = 0x7e4c68d20000 phy_addr = 0x10eb02000 (shared with a)
3. echo 0x10eb02 > /sys/kernel/debug/hwpoison/corrupt-pfn
   a. virtual addr = 0x7e4c68a00000 phy_addr = 0x10ea02000
   b. virtual addr = 0x7e4c68b2c000 phy_addr = 0x10e902000
   c. virtual addr = 0x7e4c68c26000 phy_addr = 0x10ea02000 (shared with a)
   d. virtual addr = 0x7e4c68c58000 phy_addr = 0x10ea02000 (shared with a)
4. echo 0x10ea02 > /sys/kernel/debug/hwpoison/corrupt-pfn
   a. virtual addr = 0x7e4c68a00000 phy_addr = 0x10e902000
   b. virtual addr = 0x7e4c68a32000 phy_addr = 0x10e902000 (shared with a)
   c. virtual addr = 0x7e4c68a64000 phy_addr = 0x10e902000 (shared with a)
   d. virtual addr = 0x7e4c68a96000 phy_addr = 0x10e902000 (shared with a)
5. echo 0x10e902 > /sys/kernel/debug/hwpoison/corrupt-pfn
   MCE: Killing ksm_test:531 due to hardware memory corruption fault at 7e4c68a00000

kernel log:
Injecting memory failure at pfn 0x10e802
Memory failure: 0x10e802: recovery action for dirty LRU page: Recovered
Injecting memory failure at pfn 0x10eb02
Memory failure: 0x10eb02: recovery action for dirty LRU page: Recovered
Injecting memory failure at pfn 0x10ea02
Memory failure: 0x10ea02: recovery action for dirty LRU page: Recovered
Injecting memory failure at pfn 0x10e902
Memory failure: 0x10e902: recovery action for dirty LRU page: Recovered
MCE: Killing ksm_test:531 due to hardware memory corruption fault at 7e4c68a00000

Thanks for review and comments!

Longlong Xia (1):
  mm/ksm: Add recovery mechanism for memory failures

 mm/ksm.c | 183 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 183 insertions(+)

--
2.43.0

^ permalink raw reply	[flat|nested] 18+ messages in thread
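A minimal sketch of the dup_head idea from step 1 above. The field name,
its placement in struct ksm_stable_node, and the maintenance hook are
assumptions for illustration only, not part of the posted patch:

```c
/*
 * Sketch only: cache a back-pointer from each dup node to its chain
 * head so the memory-failure path can find the head in O(1) instead
 * of walking every stable tree. "dup_head" is a hypothetical field.
 */
struct ksm_stable_node {
	/* ... existing fields ... */
	struct ksm_stable_node *dup_head;	/* valid only for dup nodes */
};

/* Would be set wherever a dup is linked into a chain. */
static void stable_node_dup_set_head(struct ksm_stable_node *dup,
				     struct ksm_stable_node *chain)
{
	dup->dup_head = chain;
}

static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
{
	if (!is_stable_node_dup(dup_node))
		return NULL;
	return dup_node->dup_head;	/* no stable-tree walk needed */
}
```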
* [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures
  2025-10-09  7:00 [PATCH RFC 0/1] mm/ksm: Add recovery mechanism for memory failures Longlong Xia
@ 2025-10-09  7:00 ` Longlong Xia
  2025-10-09 12:13   ` Lance Yang
                       ` (2 more replies)
  2025-10-09 18:57 ` [PATCH RFC 0/1] " David Hildenbrand
  1 sibling, 3 replies; 18+ messages in thread

From: Longlong Xia @ 2025-10-09 7:00 UTC (permalink / raw)
To: linmiaohe, nao.horiguchi
Cc: akpm, david, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia

From: Longlong Xia <xialonglong@kylinos.cn>

When a hardware memory error occurs on a KSM page, the current
behavior is to kill all processes mapping that page. This can
be overly aggressive when KSM has multiple duplicate pages in
a chain where other duplicates are still healthy.

This patch introduces a recovery mechanism that attempts to migrate
mappings from the failing KSM page to another healthy KSM page within
the same chain before resorting to killing processes.

The recovery process works as follows:
1. When a memory failure is detected on a KSM page, identify if the
   failing node is part of a chain (has duplicates)
2. Search for another healthy KSM page within the same chain
3. For each process mapping the failing page:
   - Update the PTE to point to the healthy KSM page
   - Migrate the rmap_item to the new stable node
4. If all migrations succeed, remove the failing node from the chain
5. Only kill processes if recovery is impossible or fails

Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
---
 mm/ksm.c | 183 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 183 insertions(+)

diff --git a/mm/ksm.c b/mm/ksm.c
index 160787bb121c..590d30cfe800 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -3084,6 +3084,183 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 }
 
 #ifdef CONFIG_MEMORY_FAILURE
+static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
+{
+	struct ksm_stable_node *stable_node, *dup;
+	struct rb_node *node;
+	int nid;
+
+	if (!is_stable_node_dup(dup_node))
+		return NULL;
+
+	for (nid = 0; nid < ksm_nr_node_ids; nid++) {
+		node = rb_first(root_stable_tree + nid);
+		for (; node; node = rb_next(node)) {
+			stable_node = rb_entry(node,
+					       struct ksm_stable_node,
+					       node);
+
+			if (!is_stable_node_chain(stable_node))
+				continue;
+
+			hlist_for_each_entry(dup, &stable_node->hlist,
+					     hlist_dup) {
+				if (dup == dup_node)
+					return stable_node;
+			}
+		}
+	}
+
+	return NULL;
+}
+
+static struct folio *
+find_target_folio(struct ksm_stable_node *failing_node, struct ksm_stable_node **target_dup)
+{
+	struct ksm_stable_node *chain_head, *dup;
+	struct hlist_node *hlist_safe;
+	struct folio *target_folio;
+
+	if (!is_stable_node_dup(failing_node))
+		return NULL;
+
+	chain_head = find_chain_head(failing_node);
+	if (!chain_head)
+		return NULL;
+
+	hlist_for_each_entry_safe(dup, hlist_safe, &chain_head->hlist, hlist_dup) {
+		if (dup == failing_node)
+			continue;
+
+		target_folio = ksm_get_folio(dup, KSM_GET_FOLIO_TRYLOCK);
+		if (target_folio) {
+			*target_dup = dup;
+			return target_folio;
+		}
+	}
+
+	return NULL;
+}
+
+static int replace_failing_page(struct vm_area_struct *vma, struct page *page,
+				struct page *kpage, unsigned long addr)
+{
+	struct folio *kfolio = page_folio(kpage);
+	struct mm_struct *mm = vma->vm_mm;
+	struct folio *folio = page_folio(page);
+	pmd_t *pmd;
+	pte_t *ptep;
+	pte_t newpte;
+	spinlock_t *ptl;
+	int err = -EFAULT;
+	struct mmu_notifier_range range;
+
+	pmd = mm_find_pmd(mm, addr);
+	if (!pmd)
+		goto out;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
+				addr + PAGE_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+
+	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	if (!ptep)
+		goto out_mn;
+
+	if (!is_zero_pfn(page_to_pfn(kpage))) {
+		folio_get(kfolio);
+		folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
+		newpte = mk_pte(kpage, vma->vm_page_prot);
+	} else {
+		newpte = pte_mkdirty(pte_mkspecial(pfn_pte(page_to_pfn(kpage), vma->vm_page_prot)));
+		ksm_map_zero_page(mm);
+		dec_mm_counter(mm, MM_ANONPAGES);
+	}
+
+	flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep)));
+	ptep_clear_flush(vma, addr, ptep);
+	set_pte_at(mm, addr, ptep, newpte);
+
+	folio_remove_rmap_pte(folio, page, vma);
+	if (!folio_mapped(folio))
+		folio_free_swap(folio);
+	folio_put(folio);
+
+	pte_unmap_unlock(ptep, ptl);
+	err = 0;
+out_mn:
+	mmu_notifier_invalidate_range_end(&range);
+out:
+	return err;
+}
+
+static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node)
+{
+	struct ksm_rmap_item *rmap_item;
+	struct hlist_node *hlist_safe;
+	struct folio *failing_folio = NULL;
+	struct folio *target_folio = NULL;
+	struct ksm_stable_node *target_dup = NULL;
+	int err;
+
+	if (!is_stable_node_dup(failing_node))
+		return false;
+
+	failing_folio = ksm_get_folio(failing_node, KSM_GET_FOLIO_NOLOCK);
+	if (!failing_folio)
+		return false;
+
+	target_folio = find_target_folio(failing_node, &target_dup);
+	if (!target_folio) {
+		folio_put(failing_folio);
+		return false;
+	}
+
+	hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
+		struct mm_struct *mm = rmap_item->mm;
+		unsigned long addr = rmap_item->address & PAGE_MASK;
+		struct vm_area_struct *vma;
+
+		mmap_read_lock(mm);
+		if (ksm_test_exit(mm)) {
+			mmap_read_unlock(mm);
+			continue;
+		}
+
+		vma = vma_lookup(mm, addr);
+		if (!vma) {
+			mmap_read_unlock(mm);
+			continue;
+		}
+
+		/* Update PTE to point to target_folio's page */
+		err = replace_failing_page(vma, &failing_folio->page,
+					   &target_folio->page, addr);
+		if (!err) {
+			hlist_del(&rmap_item->hlist);
+			rmap_item->head = target_dup;
+			hlist_add_head(&rmap_item->hlist, &target_dup->hlist);
+			target_dup->rmap_hlist_len++;
+			failing_node->rmap_hlist_len--;
+
+		}
+
+		mmap_read_unlock(mm);
+	}
+
+	folio_unlock(target_folio);
+	folio_put(target_folio);
+	folio_put(failing_folio);
+
+	if (failing_node->rmap_hlist_len == 0) {
+		__stable_node_dup_del(failing_node);
+		free_stable_node(failing_node);
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Collect processes when the error hit an ksm page.
  */
@@ -3098,6 +3275,12 @@ void collect_procs_ksm(const struct folio *folio, const struct page *page,
 	stable_node = folio_stable_node(folio);
 	if (!stable_node)
 		return;
+
+	if (ksm_recover_within_chain(stable_node)) {
+		pr_debug("recovery within chain successful, no need to kill processes\n");
+		return;
+	}
+
 	hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
 		struct anon_vma *av = rmap_item->anon_vma;

--
2.43.0

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-09 7:00 ` [PATCH RFC 1/1] " Longlong Xia @ 2025-10-09 12:13 ` Lance Yang 2025-10-11 7:52 ` Lance Yang 2025-10-11 3:25 ` Miaohe Lin 2025-10-13 20:10 ` [PATCH RFC] " Markus Elfring 2 siblings, 1 reply; 18+ messages in thread From: Lance Yang @ 2025-10-09 12:13 UTC (permalink / raw) To: Longlong Xia Cc: linmiaohe, nao.horiguchi, akpm, david, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia On Thu, Oct 9, 2025 at 3:56 PM Longlong Xia <xialonglong2025@163.com> wrote: > > From: Longlong Xia <xialonglong@kylinos.cn> > > When a hardware memory error occurs on a KSM page, the current > behavior is to kill all processes mapping that page. This can > be overly aggressive when KSM has multiple duplicate pages in > a chain where other duplicates are still healthy. > > This patch introduces a recovery mechanism that attempts to migrate > mappings from the failing KSM page to another healthy KSM page within > the same chain before resorting to killing processes. Interesting, thanks for the patch! One question below. > > The recovery process works as follows: > 1. When a memory failure is detected on a KSM page, identify if the > failing node is part of a chain (has duplicates) > 2. Search for another healthy KSM page within the same chain > 3. For each process mapping the failing page: > - Update the PTE to point to the healthy KSM page > - Migrate the rmap_item to the new stable node > 4. If all migrations succeed, remove the failing node from the chain > 5. Only kill processes if recovery is impossible or fails > > Signed-off-by: Longlong Xia <xialonglong@kylinos.cn> > --- > mm/ksm.c | 183 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 183 insertions(+) > > diff --git a/mm/ksm.c b/mm/ksm.c > index 160787bb121c..590d30cfe800 100644 > --- a/mm/ksm.c > +++ b/mm/ksm.c > @@ -3084,6 +3084,183 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc) > } > > #ifdef CONFIG_MEMORY_FAILURE > +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node) > +{ > + struct ksm_stable_node *stable_node, *dup; > + struct rb_node *node; > + int nid; > + > + if (!is_stable_node_dup(dup_node)) > + return NULL; > + > + for (nid = 0; nid < ksm_nr_node_ids; nid++) { > + node = rb_first(root_stable_tree + nid); > + for (; node; node = rb_next(node)) { > + stable_node = rb_entry(node, > + struct ksm_stable_node, > + node); > + > + if (!is_stable_node_chain(stable_node)) > + continue; > + > + hlist_for_each_entry(dup, &stable_node->hlist, > + hlist_dup) { > + if (dup == dup_node) > + return stable_node; > + } > + } > + } > + > + return NULL; > +} > + > +static struct folio * > +find_target_folio(struct ksm_stable_node *failing_node, struct ksm_stable_node **target_dup) > +{ > + struct ksm_stable_node *chain_head, *dup; > + struct hlist_node *hlist_safe; > + struct folio *target_folio; > + > + if (!is_stable_node_dup(failing_node)) > + return NULL; > + > + chain_head = find_chain_head(failing_node); > + if (!chain_head) > + return NULL; > + > + hlist_for_each_entry_safe(dup, hlist_safe, &chain_head->hlist, hlist_dup) { > + if (dup == failing_node) > + continue; > + > + target_folio = ksm_get_folio(dup, KSM_GET_FOLIO_TRYLOCK); > + if (target_folio) { > + *target_dup = dup; > + return target_folio; > + } > + } > + > + return NULL; > +} > + > +static int replace_failing_page(struct vm_area_struct *vma, struct page *page, > + struct page *kpage, unsigned long addr) > +{ > + 
struct folio *kfolio = page_folio(kpage); > + struct mm_struct *mm = vma->vm_mm; > + struct folio *folio = page_folio(page); > + pmd_t *pmd; > + pte_t *ptep; > + pte_t newpte; > + spinlock_t *ptl; > + int err = -EFAULT; > + struct mmu_notifier_range range; > + > + pmd = mm_find_pmd(mm, addr); > + if (!pmd) > + goto out; > + > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr, > + addr + PAGE_SIZE); > + mmu_notifier_invalidate_range_start(&range); > + > + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); > + if (!ptep) > + goto out_mn; > + > + if (!is_zero_pfn(page_to_pfn(kpage))) { > + folio_get(kfolio); > + folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE); > + newpte = mk_pte(kpage, vma->vm_page_prot); > + } else { > + newpte = pte_mkdirty(pte_mkspecial(pfn_pte(page_to_pfn(kpage), vma->vm_page_prot))); > + ksm_map_zero_page(mm); > + dec_mm_counter(mm, MM_ANONPAGES); > + } Can find_target_folio() return the shared zeropage? If not, the else block looks like dead code and can be removed. And, a real hardware failure on the shared zeropage would be non-recoverable, I guess. Cheers, Lance > + > + flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep))); > + ptep_clear_flush(vma, addr, ptep); > + set_pte_at(mm, addr, ptep, newpte); > + > + folio_remove_rmap_pte(folio, page, vma); > + if (!folio_mapped(folio)) > + folio_free_swap(folio); > + folio_put(folio); > + > + pte_unmap_unlock(ptep, ptl); > + err = 0; > +out_mn: > + mmu_notifier_invalidate_range_end(&range); > +out: > + return err; > +} > + > +static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node) > +{ > + struct ksm_rmap_item *rmap_item; > + struct hlist_node *hlist_safe; > + struct folio *failing_folio = NULL; > + struct folio *target_folio = NULL; > + struct ksm_stable_node *target_dup = NULL; > + int err; > + > + if (!is_stable_node_dup(failing_node)) > + return false; > + > + failing_folio = ksm_get_folio(failing_node, KSM_GET_FOLIO_NOLOCK); > + if (!failing_folio) > + return false; > + > + target_folio = find_target_folio(failing_node, &target_dup); > + if (!target_folio) { > + folio_put(failing_folio); > + return false; > + } > + > + hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) { > + struct mm_struct *mm = rmap_item->mm; > + unsigned long addr = rmap_item->address & PAGE_MASK; > + struct vm_area_struct *vma; > + > + mmap_read_lock(mm); > + if (ksm_test_exit(mm)) { > + mmap_read_unlock(mm); > + continue; > + } > + > + vma = vma_lookup(mm, addr); > + if (!vma) { > + mmap_read_unlock(mm); > + continue; > + } > + > + /* Update PTE to point to target_folio's page */ > + err = replace_failing_page(vma, &failing_folio->page, > + &target_folio->page, addr); > + if (!err) { > + hlist_del(&rmap_item->hlist); > + rmap_item->head = target_dup; > + hlist_add_head(&rmap_item->hlist, &target_dup->hlist); > + target_dup->rmap_hlist_len++; > + failing_node->rmap_hlist_len--; > + > + } > + > + mmap_read_unlock(mm); > + } > + > + folio_unlock(target_folio); > + folio_put(target_folio); > + folio_put(failing_folio); > + > + if (failing_node->rmap_hlist_len == 0) { > + __stable_node_dup_del(failing_node); > + free_stable_node(failing_node); > + return true; > + } > + > + return false; > +} > + > /* > * Collect processes when the error hit an ksm page. 
> */ > @@ -3098,6 +3275,12 @@ void collect_procs_ksm(const struct folio *folio, const struct page *page, > stable_node = folio_stable_node(folio); > if (!stable_node) > return; > + > + if (ksm_recover_within_chain(stable_node)) { > + pr_debug("recovery within chain successful, no need to kill processes\n"); > + return; > + } > + > hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) { > struct anon_vma *av = rmap_item->anon_vma; > > -- > 2.43.0 > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-09 12:13 ` Lance Yang @ 2025-10-11 7:52 ` Lance Yang 2025-10-11 9:23 ` Miaohe Lin 0 siblings, 1 reply; 18+ messages in thread From: Lance Yang @ 2025-10-11 7:52 UTC (permalink / raw) To: linmiaohe Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia, Lance Yang, david @Miaohe I'd like to raise a concern about a potential hardware failure :) My tests show that if the shared zeropage (or huge zeropage) gets marked with HWpoison, the kernel continues to install it for new mappings. Surprisingly, it does not kill the accessing process ... The concern is, once the page is no longer zero-filled due to the hardware failure, what will happen? Would this lead to silent data corruption for applications that expect to read zeros? Thanks, Lance ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-11 7:52 ` Lance Yang @ 2025-10-11 9:23 ` Miaohe Lin 2025-10-11 9:38 ` Lance Yang 0 siblings, 1 reply; 18+ messages in thread From: Miaohe Lin @ 2025-10-11 9:23 UTC (permalink / raw) To: Lance Yang Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia, david On 2025/10/11 15:52, Lance Yang wrote: > @Miaohe > > I'd like to raise a concern about a potential hardware failure :) Thanks for your thought. > > My tests show that if the shared zeropage (or huge zeropage) gets marked > with HWpoison, the kernel continues to install it for new mappings. > Surprisingly, it does not kill the accessing process ... Have you investigated the cause? If user space writes to shared zeropage, it will trigger COW and a new page will be installed. After that, reading the newly allocated page won't trigger memory error. In this scene, it does not kill the accessing process. > > The concern is, once the page is no longer zero-filled due to the hardware > failure, what will happen? Would this lead to silent data corruption for > applications that expect to read zeros? IMHO, once the page is no longer zero-filled due to the hardware failure, later any read will trigger memory error and memory_failure should handle that. Thanks. . ^ permalink raw reply [flat|nested] 18+ messages in thread
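A tiny userspace sketch of the read-vs-write distinction described above
(mapping size and output are arbitrary): the read fault maps the shared
zeropage, while the first write triggers COW and installs a private page.

```c
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	char c = p[0];	/* read fault: PTE points at the shared zeropage */
	p[0] = 1;	/* write fault: COW installs a private page */
	printf("was %d, now %d\n", c, p[0]);
	return 0;
}
```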
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-11 9:23 ` Miaohe Lin @ 2025-10-11 9:38 ` Lance Yang 2025-10-11 12:57 ` Lance Yang 2025-10-13 3:39 ` Miaohe Lin 0 siblings, 2 replies; 18+ messages in thread From: Lance Yang @ 2025-10-11 9:38 UTC (permalink / raw) To: Miaohe Lin Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia, david On 2025/10/11 17:23, Miaohe Lin wrote: > On 2025/10/11 15:52, Lance Yang wrote: >> @Miaohe >> >> I'd like to raise a concern about a potential hardware failure :) > > Thanks for your thought. > >> >> My tests show that if the shared zeropage (or huge zeropage) gets marked >> with HWpoison, the kernel continues to install it for new mappings. >> Surprisingly, it does not kill the accessing process ... > > Have you investigated the cause? If user space writes to shared zeropage, > it will trigger COW and a new page will be installed. After that, reading > the newly allocated page won't trigger memory error. In this scene, it does > not kill the accessing process. Not write just read :) > >> >> The concern is, once the page is no longer zero-filled due to the hardware >> failure, what will happen? Would this lead to silent data corruption for >> applications that expect to read zeros? > > IMHO, once the page is no longer zero-filled due to the hardware failure, later > any read will trigger memory error and memory_failure should handle that. I've only tested injecting an error on the shared zeropage using corrupt-pfn: echo $PFN > /sys/kernel/debug/hwpoison/corrupt-pfn But no memory error was triggered on a subsequent read ... Anyway, I'm trying to explore other ways to simulate hardware failure :) Thanks, Lance ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-11 9:38 ` Lance Yang @ 2025-10-11 12:57 ` Lance Yang 2025-10-13 3:39 ` Miaohe Lin 1 sibling, 0 replies; 18+ messages in thread From: Lance Yang @ 2025-10-11 12:57 UTC (permalink / raw) To: Miaohe Lin, qiuxu.zhuo Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia, david Cc Qiuxu On 2025/10/11 17:38, Lance Yang wrote: > > > On 2025/10/11 17:23, Miaohe Lin wrote: >> On 2025/10/11 15:52, Lance Yang wrote: >>> @Miaohe >>> >>> I'd like to raise a concern about a potential hardware failure :) >> >> Thanks for your thought. >> >>> >>> My tests show that if the shared zeropage (or huge zeropage) gets marked >>> with HWpoison, the kernel continues to install it for new mappings. >>> Surprisingly, it does not kill the accessing process ... >> >> Have you investigated the cause? If user space writes to shared zeropage, >> it will trigger COW and a new page will be installed. After that, reading >> the newly allocated page won't trigger memory error. In this scene, it >> does >> not kill the accessing process. > > Not write just read :) > >> >>> >>> The concern is, once the page is no longer zero-filled due to the >>> hardware >>> failure, what will happen? Would this lead to silent data corruption for >>> applications that expect to read zeros? >> >> IMHO, once the page is no longer zero-filled due to the hardware >> failure, later >> any read will trigger memory error and memory_failure should handle that. > > I've only tested injecting an error on the shared zeropage using > corrupt-pfn: > > echo $PFN > /sys/kernel/debug/hwpoison/corrupt-pfn > > But no memory error was triggered on a subsequent read ... > > Anyway, I'm trying to explore other ways to simulate hardware failure :) > > Thanks, > Lance > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures
  2025-10-11  9:38 ` Lance Yang
  2025-10-11 12:57   ` Lance Yang
@ 2025-10-13  3:39 ` Miaohe Lin
  2025-10-13  4:42   ` Lance Yang
  1 sibling, 1 reply; 18+ messages in thread

From: Miaohe Lin @ 2025-10-13 3:39 UTC (permalink / raw)
To: Lance Yang
Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16,
    linux-kernel, linux-mm, Longlong Xia, david

On 2025/10/11 17:38, Lance Yang wrote:
> On 2025/10/11 17:23, Miaohe Lin wrote:
>> On 2025/10/11 15:52, Lance Yang wrote:
>>> @Miaohe
>>>
>>> I'd like to raise a concern about a potential hardware failure :)
>>
>> Thanks for your thought.
>>
>>> My tests show that if the shared zeropage (or huge zeropage) gets marked
>>> with HWpoison, the kernel continues to install it for new mappings.
>>> Surprisingly, it does not kill the accessing process ...
>>
>> Have you investigated the cause? If user space writes to shared zeropage,
>> it will trigger COW and a new page will be installed. After that, reading
>> the newly allocated page won't trigger memory error. In this scene, it does
>> not kill the accessing process.
>
> Not write just read :)
>
>>> The concern is, once the page is no longer zero-filled due to the hardware
>>> failure, what will happen? Would this lead to silent data corruption for
>>> applications that expect to read zeros?
>>
>> IMHO, once the page is no longer zero-filled due to the hardware failure, later
>> any read will trigger memory error and memory_failure should handle that.
>
> I've only tested injecting an error on the shared zeropage using corrupt-pfn:
>
> echo $PFN > /sys/kernel/debug/hwpoison/corrupt-pfn
>
> But no memory error was triggered on a subsequent read ...

It's because corrupt-pfn only provides a software error injection
mechanism. If you want to trigger a memory error on read, you need to
use a hardware error injection mechanism, e.g. APEI Error INJection [1].

[1] https://www.kernel.org/doc/html/v5.8/firmware-guide/acpi/apei/einj.html

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread
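For reference, the EINJ debugfs interface linked above is typically driven
like this; a sketch based on that documentation, where the error_type
value, target address, and mask are illustrative and the supported types
vary by platform:

```
cd /sys/kernel/debug/apei/einj
cat available_error_type            # list what the platform supports
echo 0x10 > error_type              # e.g. memory uncorrectable non-fatal
echo 0x29b8cf5000 > param1          # physical address to inject at
echo 0xfffffffffffff000 > param2    # address mask (page granularity)
echo 1 > error_inject               # trigger the injection
```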
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-13 3:39 ` Miaohe Lin @ 2025-10-13 4:42 ` Lance Yang 2025-10-13 9:15 ` Lance Yang 0 siblings, 1 reply; 18+ messages in thread From: Lance Yang @ 2025-10-13 4:42 UTC (permalink / raw) To: Miaohe Lin, qiuxu.zhuo Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia, david On 2025/10/13 11:39, Miaohe Lin wrote: > On 2025/10/11 17:38, Lance Yang wrote: >> >> >> On 2025/10/11 17:23, Miaohe Lin wrote: >>> On 2025/10/11 15:52, Lance Yang wrote: >>>> @Miaohe >>>> >>>> I'd like to raise a concern about a potential hardware failure :) >>> >>> Thanks for your thought. >>> >>>> >>>> My tests show that if the shared zeropage (or huge zeropage) gets marked >>>> with HWpoison, the kernel continues to install it for new mappings. >>>> Surprisingly, it does not kill the accessing process ... >>> >>> Have you investigated the cause? If user space writes to shared zeropage, >>> it will trigger COW and a new page will be installed. After that, reading >>> the newly allocated page won't trigger memory error. In this scene, it does >>> not kill the accessing process. >> >> Not write just read :) >> >>> >>>> >>>> The concern is, once the page is no longer zero-filled due to the hardware >>>> failure, what will happen? Would this lead to silent data corruption for >>>> applications that expect to read zeros? >>> >>> IMHO, once the page is no longer zero-filled due to the hardware failure, later >>> any read will trigger memory error and memory_failure should handle that. >> >> I've only tested injecting an error on the shared zeropage using corrupt-pfn: >> >> echo $PFN > /sys/kernel/debug/hwpoison/corrupt-pfn >> >> But no memory error was triggered on a subsequent read ... > > It's because corrupt-pfn only provides a software error injection mechanism. > If you want to trigger memory error on read, you need use hardware error injection > mechanism e.g.APEI Error INJection [1]. > > [1] https://www.kernel.org/doc/html/v5.8/firmware-guide/acpi/apei/einj.html Nice! You're right, thanks for pointing that out! I'm not very familiar with hardware error injection. Fortunately, Qiuxu is looking into that and running some tests on the shared zeropage. Well, I think he will follow up with his findings ;p Cheers, Lance ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-13 4:42 ` Lance Yang @ 2025-10-13 9:15 ` Lance Yang 2025-10-13 9:25 ` David Hildenbrand 0 siblings, 1 reply; 18+ messages in thread From: Lance Yang @ 2025-10-13 9:15 UTC (permalink / raw) To: david Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, Miaohe Lin, qiuxu.zhuo @David Cc: MM CORE folks On 2025/10/13 12:42, Lance Yang wrote: [...] Cool. Hardware error injection with EINJ was the way to go! I just ran some tests on the shared zero page (both regular and huge), and found a tricky behavior: 1) When a hardware error is injected into the zeropage, the process that attempts to read from a mapping backed by it is correctly killed with a SIGBUS. 2) However, even after the error is detected, the kernel continues to install the known-poisoned zeropage for new anonymous mappings ... For the shared zeropage: ``` [Mon Oct 13 16:29:02 2025] mce: Uncorrected hardware memory error in user-access at 29b8cf5000 [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: Sending SIGBUS to read_zeropage:13767 due to hardware memory corruption [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: recovery action for already poisoned page: Failed ``` And for the shared huge zeropage: ``` [Mon Oct 13 16:35:34 2025] mce: Uncorrected hardware memory error in user-access at 1e1e00000 [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: Sending SIGBUS to read_huge_zerop:13891 due to hardware memory corruption [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: recovery action for already poisoned page: Failed ``` Since we've identified an uncorrectable hardware error on such a critical, singleton page, should we be doing something more? Thanks, Lance ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures
  2025-10-13  9:15 ` Lance Yang
@ 2025-10-13  9:25 ` David Hildenbrand
  2025-10-13  9:46   ` Balbir Singh
  2025-10-13 11:00   ` Lance Yang
  2 siblings, 2 replies; 18+ messages in thread

From: David Hildenbrand @ 2025-10-13 9:25 UTC (permalink / raw)
To: Lance Yang
Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16,
    linux-kernel, linux-mm, Longlong Xia, lorenzo.stoakes, Liam.Howlett,
    vbabka, rppt, surenb, mhocko, Miaohe Lin, qiuxu.zhuo

On 13.10.25 11:15, Lance Yang wrote:
> @David
>
> Cc: MM CORE folks
>
> On 2025/10/13 12:42, Lance Yang wrote:
> [...]
>
> Cool. Hardware error injection with EINJ was the way to go!
>
> I just ran some tests on the shared zero page (both regular and huge),
> and found a tricky behavior:
>
> 1) When a hardware error is injected into the zeropage, the process that
> attempts to read from a mapping backed by it is correctly killed with a
> SIGBUS.
>
> 2) However, even after the error is detected, the kernel continues to
> install the known-poisoned zeropage for new anonymous mappings ...
>
> For the shared zeropage:
> ```
> [Mon Oct 13 16:29:02 2025] mce: Uncorrected hardware memory error in
> user-access at 29b8cf5000
> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: Sending SIGBUS to
> read_zeropage:13767 due to hardware memory corruption
> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: recovery action
> for already poisoned page: Failed
> ```
> And for the shared huge zeropage:
> ```
> [Mon Oct 13 16:35:34 2025] mce: Uncorrected hardware memory error in
> user-access at 1e1e00000
> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: Sending SIGBUS to
> read_huge_zerop:13891 due to hardware memory corruption
> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: recovery action for
> already poisoned page: Failed
> ```
>
> Since we've identified an uncorrectable hardware error on such a critical,
> singleton page, should we be doing something more?

I mean, regarding the shared zeropage, we could try walking all page
tables of all processes and replace it by a fresh shared zeropage.

But then, the page might also be used for other things (I/O etc), the
shared zeropage is allocated by the architecture, we'd have to make
is_zero_pfn() succeed on the old+new page etc ...

So a lot of work for little benefit I guess? The question is how often
we would see that in practice. I'd assume we'd see it happen on random
kernel memory more frequently, where we can really just bring down the
whole machine.

--
Cheers

David / dhildenb

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures
  2025-10-13  9:25 ` David Hildenbrand
@ 2025-10-13  9:46 ` Balbir Singh
  0 siblings, 0 replies; 18+ messages in thread

From: Balbir Singh @ 2025-10-13 9:46 UTC (permalink / raw)
To: David Hildenbrand, Lance Yang
Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16,
    linux-kernel, linux-mm, Longlong Xia, lorenzo.stoakes, Liam.Howlett,
    vbabka, rppt, surenb, mhocko, Miaohe Lin, qiuxu.zhuo

On 10/13/25 20:25, David Hildenbrand wrote:
> On 13.10.25 11:15, Lance Yang wrote:
>> @David
>>
>> Cc: MM CORE folks
>>
>> On 2025/10/13 12:42, Lance Yang wrote:
>> [...]
>>
>> Cool. Hardware error injection with EINJ was the way to go!
>>
>> I just ran some tests on the shared zero page (both regular and huge),
>> and found a tricky behavior:
>>
>> 1) When a hardware error is injected into the zeropage, the process that
>> attempts to read from a mapping backed by it is correctly killed with a
>> SIGBUS.
>>
>> 2) However, even after the error is detected, the kernel continues to
>> install the known-poisoned zeropage for new anonymous mappings ...
>>
>> For the shared zeropage:
>> ```
>> [Mon Oct 13 16:29:02 2025] mce: Uncorrected hardware memory error in
>> user-access at 29b8cf5000
>> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: Sending SIGBUS to
>> read_zeropage:13767 due to hardware memory corruption
>> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: recovery action
>> for already poisoned page: Failed
>> ```
>> And for the shared huge zeropage:
>> ```
>> [Mon Oct 13 16:35:34 2025] mce: Uncorrected hardware memory error in
>> user-access at 1e1e00000
>> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: Sending SIGBUS to
>> read_huge_zerop:13891 due to hardware memory corruption
>> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: recovery action for
>> already poisoned page: Failed
>> ```
>>
>> Since we've identified an uncorrectable hardware error on such a critical,
>> singleton page, should we be doing something more?
>
> I mean, regarding the shared zeropage, we could try walking all page
> tables of all processes and replace it by a fresh shared zeropage.
>
> But then, the page might also be used for other things (I/O etc), the
> shared zeropage is allocated by the architecture, we'd have to make
> is_zero_pfn() succeed on the old+new page etc ...
>
> So a lot of work for little benefit I guess? The question is how often
> we would see that in practice. I'd assume we'd see it happen on random
> kernel memory more frequently, where we can really just bring down the
> whole machine.

empty_zero_page belongs to the .bss and zero_pfn is quite deeply buried
in its usage. The same concerns apply to zero_folio as well. I agree,
it's a lot of work to recover from errors on the zero_page.

Balbir

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures
  2025-10-13  9:25 ` David Hildenbrand
  2025-10-13  9:46   ` Balbir Singh
@ 2025-10-13 11:00 ` Lance Yang
  2025-10-13 11:13   ` David Hildenbrand
  1 sibling, 1 reply; 18+ messages in thread

From: Lance Yang @ 2025-10-13 11:00 UTC (permalink / raw)
To: David Hildenbrand
Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16,
    linux-kernel, linux-mm, Longlong Xia, lorenzo.stoakes, Liam.Howlett,
    vbabka, rppt, surenb, mhocko, Miaohe Lin, qiuxu.zhuo

On 2025/10/13 17:25, David Hildenbrand wrote:
> On 13.10.25 11:15, Lance Yang wrote:
>> @David
>>
>> Cc: MM CORE folks
>>
>> On 2025/10/13 12:42, Lance Yang wrote:
>> [...]
>>
>> Cool. Hardware error injection with EINJ was the way to go!
>>
>> I just ran some tests on the shared zero page (both regular and huge),
>> and found a tricky behavior:
>>
>> 1) When a hardware error is injected into the zeropage, the process that
>> attempts to read from a mapping backed by it is correctly killed with a
>> SIGBUS.
>>
>> 2) However, even after the error is detected, the kernel continues to
>> install the known-poisoned zeropage for new anonymous mappings ...
>>
>> For the shared zeropage:
>> ```
>> [Mon Oct 13 16:29:02 2025] mce: Uncorrected hardware memory error in
>> user-access at 29b8cf5000
>> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: Sending SIGBUS to
>> read_zeropage:13767 due to hardware memory corruption
>> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: recovery action
>> for already poisoned page: Failed
>> ```
>> And for the shared huge zeropage:
>> ```
>> [Mon Oct 13 16:35:34 2025] mce: Uncorrected hardware memory error in
>> user-access at 1e1e00000
>> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: Sending SIGBUS to
>> read_huge_zerop:13891 due to hardware memory corruption
>> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: recovery action for
>> already poisoned page: Failed
>> ```
>>
>> Since we've identified an uncorrectable hardware error on such a critical,
>> singleton page, should we be doing something more?
>
> I mean, regarding the shared zeropage, we could try walking all page
> tables of all processes and replace it by a fresh shared zeropage.
>
> But then, the page might also be used for other things (I/O etc), the
> shared zeropage is allocated by the architecture, we'd have to make
> is_zero_pfn() succeed on the old+new page etc ...
>
> So a lot of work for little benefit I guess? The question is how often
> we would see that in practice. I'd assume we'd see it happen on random
> kernel memory more frequently, where we can really just bring down the
> whole machine.

Thanks for your thoughts!

I agree, fixing the regular zeropage is a real mess ...

But for the huge zeropage, what if we just stop installing it once it's
poisoned? We could just disable it globally. Something like this:

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f698df156bf8..8543f4385ffe 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2193,6 +2193,10 @@ int memory_failure(unsigned long pfn, int flags)
 	if (!(flags & MF_SW_SIMULATED))
 		hw_memory_failure = true;
 
+	if (is_huge_zero_pfn(pfn))
+		clear_bit(TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
+			  &transparent_hugepage_flags);
+
 	p = pfn_to_online_page(pfn);
 	if (!p) {
 		res = arch_memory_failure(pfn, flags);

Seems easy enough ...

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-13 11:00 ` Lance Yang @ 2025-10-13 11:13 ` David Hildenbrand 2025-10-13 11:18 ` Lance Yang 0 siblings, 1 reply; 18+ messages in thread From: David Hildenbrand @ 2025-10-13 11:13 UTC (permalink / raw) To: Lance Yang Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, Miaohe Lin, qiuxu.zhuo On 13.10.25 13:00, Lance Yang wrote: > > > On 2025/10/13 17:25, David Hildenbrand wrote: >> On 13.10.25 11:15, Lance Yang wrote: >>> @David >>> >>> Cc: MM CORE folks >>> >>> On 2025/10/13 12:42, Lance Yang wrote: >>> [...] >>> >>> Cool. Hardware error injection with EINJ was the way to go! >>> >>> I just ran some tests on the shared zero page (both regular and huge), >>> and >>> found a tricky behavior: >>> >>> 1) When a hardware error is injected into the zeropage, the process that >>> attempts to read from a mapping backed by it is correctly killed with a >>> SIGBUS. >>> >>> 2) However, even after the error is detected, the kernel continues to >>> install >>> the known-poisoned zeropage for new anonymous mappings ... >>> >>> >>> For the shared zeropage: >>> ``` >>> [Mon Oct 13 16:29:02 2025] mce: Uncorrected hardware memory error in >>> user-access at 29b8cf5000 >>> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: Sending SIGBUS to >>> read_zeropage:13767 due to hardware memory corruption >>> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: recovery action >>> for already poisoned page: Failed >>> ``` >>> And for the shared huge zeropage: >>> ``` >>> [Mon Oct 13 16:35:34 2025] mce: Uncorrected hardware memory error in >>> user-access at 1e1e00000 >>> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: Sending SIGBUS to >>> read_huge_zerop:13891 due to hardware memory corruption >>> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: recovery action for >>> already poisoned page: Failed >>> ``` >>> >>> Since we've identified an uncorrectable hardware error on such a >>> critical, >>> singleton page, should we be doing something more? >> >> I mean, regarding the shared zeropage, we could try walking all page >> tables of all processes and replace it be a fresh shared zeropage. >> >> But then, the page might also be used for other things (I/O etc), the >> shared zeropage is allocated by the architecture, we'd have to make >> is_zero_pfn() succeed on the old+new page etc ... >> >> So a lot of work for little benefit I guess? The question is how often >> we would see that in practice. I'd assume we'd see it happen on random >> kernel memory more frequently where we can really just bring down the >> whole machine. > > Thanks for your thoughts! > > I agree, fixing the regular zeropage is a really mess ... > > But for the huge zeropage, what if we just stop installing it once it's > poisoned? We could just disable it globally. Something like this: We now have the static huge zero folio that could silently be used for I/O without a reference etc. So I'm afraid this is all just making corner cases slightly better. -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures 2025-10-13 11:13 ` David Hildenbrand @ 2025-10-13 11:18 ` Lance Yang 0 siblings, 0 replies; 18+ messages in thread From: Lance Yang @ 2025-10-13 11:18 UTC (permalink / raw) To: David Hildenbrand Cc: Longlong Xia, nao.horiguchi, akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, Miaohe Lin, qiuxu.zhuo On 2025/10/13 19:13, David Hildenbrand wrote: > On 13.10.25 13:00, Lance Yang wrote: >> >> >> On 2025/10/13 17:25, David Hildenbrand wrote: >>> On 13.10.25 11:15, Lance Yang wrote: >>>> @David >>>> >>>> Cc: MM CORE folks >>>> >>>> On 2025/10/13 12:42, Lance Yang wrote: >>>> [...] >>>> >>>> Cool. Hardware error injection with EINJ was the way to go! >>>> >>>> I just ran some tests on the shared zero page (both regular and huge), >>>> and >>>> found a tricky behavior: >>>> >>>> 1) When a hardware error is injected into the zeropage, the process >>>> that >>>> attempts to read from a mapping backed by it is correctly killed with a >>>> SIGBUS. >>>> >>>> 2) However, even after the error is detected, the kernel continues to >>>> install >>>> the known-poisoned zeropage for new anonymous mappings ... >>>> >>>> >>>> For the shared zeropage: >>>> ``` >>>> [Mon Oct 13 16:29:02 2025] mce: Uncorrected hardware memory error in >>>> user-access at 29b8cf5000 >>>> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: Sending SIGBUS to >>>> read_zeropage:13767 due to hardware memory corruption >>>> [Mon Oct 13 16:29:02 2025] Memory failure: 0x29b8cf5: recovery action >>>> for already poisoned page: Failed >>>> ``` >>>> And for the shared huge zeropage: >>>> ``` >>>> [Mon Oct 13 16:35:34 2025] mce: Uncorrected hardware memory error in >>>> user-access at 1e1e00000 >>>> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: Sending SIGBUS to >>>> read_huge_zerop:13891 due to hardware memory corruption >>>> [Mon Oct 13 16:35:34 2025] Memory failure: 0x1e1e00: recovery action >>>> for >>>> already poisoned page: Failed >>>> ``` >>>> >>>> Since we've identified an uncorrectable hardware error on such a >>>> critical, >>>> singleton page, should we be doing something more? >>> >>> I mean, regarding the shared zeropage, we could try walking all page >>> tables of all processes and replace it be a fresh shared zeropage. >>> >>> But then, the page might also be used for other things (I/O etc), the >>> shared zeropage is allocated by the architecture, we'd have to make >>> is_zero_pfn() succeed on the old+new page etc ... >>> >>> So a lot of work for little benefit I guess? The question is how often >>> we would see that in practice. I'd assume we'd see it happen on random >>> kernel memory more frequently where we can really just bring down the >>> whole machine. >> >> Thanks for your thoughts! >> >> I agree, fixing the regular zeropage is a really mess ... >> >> But for the huge zeropage, what if we just stop installing it once it's >> poisoned? We could just disable it globally. Something like this: > > We now have the static huge zero folio that could silently be used for > I/O without a reference etc. > > So I'm afraid this is all just making corner cases slightly better. Ah, I see. Appreciate you taking the time to explain that! ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH RFC 1/1] mm/ksm: Add recovery mechanism for memory failures
  2025-10-09  7:00 ` [PATCH RFC 1/1] " Longlong Xia
  2025-10-09 12:13   ` Lance Yang
@ 2025-10-11  3:25 ` Miaohe Lin
  2025-10-13 20:10   ` [PATCH RFC] " Markus Elfring
  2 siblings, 0 replies; 18+ messages in thread

From: Miaohe Lin @ 2025-10-11 3:25 UTC (permalink / raw)
To: Longlong Xia
Cc: akpm, david, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm,
    Longlong Xia, Naoya Horiguchi

On 2025/10/9 15:00, Longlong Xia wrote:
> From: Longlong Xia <xialonglong@kylinos.cn>
>
> When a hardware memory error occurs on a KSM page, the current
> behavior is to kill all processes mapping that page. This can
> be overly aggressive when KSM has multiple duplicate pages in
> a chain where other duplicates are still healthy.
>
> This patch introduces a recovery mechanism that attempts to migrate
> mappings from the failing KSM page to another healthy KSM page within
> the same chain before resorting to killing processes.
>
> The recovery process works as follows:
> 1. When a memory failure is detected on a KSM page, identify if the
>    failing node is part of a chain (has duplicates)
> 2. Search for another healthy KSM page within the same chain
> 3. For each process mapping the failing page:
>    - Update the PTE to point to the healthy KSM page
>    - Migrate the rmap_item to the new stable node
> 4. If all migrations succeed, remove the failing node from the chain
> 5. Only kill processes if recovery is impossible or fails

Thanks for your patch. This looks really interesting to me and I think
this might be a significant improvement. :)

>
> Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
> ---
>  mm/ksm.c | 183 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 183 insertions(+)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 160787bb121c..590d30cfe800 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -3084,6 +3084,183 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>  }
>  
>  #ifdef CONFIG_MEMORY_FAILURE
> +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
> +{
> +	struct ksm_stable_node *stable_node, *dup;
> +	struct rb_node *node;
> +	int nid;
> +
> +	if (!is_stable_node_dup(dup_node))
> +		return NULL;
> +
> +	for (nid = 0; nid < ksm_nr_node_ids; nid++) {
> +		node = rb_first(root_stable_tree + nid);
> +		for (; node; node = rb_next(node)) {
> +			stable_node = rb_entry(node,
> +					       struct ksm_stable_node,
> +					       node);
> +
> +			if (!is_stable_node_chain(stable_node))
> +				continue;
> +
> +			hlist_for_each_entry(dup, &stable_node->hlist,
> +					     hlist_dup) {
> +				if (dup == dup_node)
> +					return stable_node;
> +			}
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +static struct folio *
> +find_target_folio(struct ksm_stable_node *failing_node, struct ksm_stable_node **target_dup)
> +{
> +	struct ksm_stable_node *chain_head, *dup;
> +	struct hlist_node *hlist_safe;
> +	struct folio *target_folio;
> +
> +	if (!is_stable_node_dup(failing_node))
> +		return NULL;
> +
> +	chain_head = find_chain_head(failing_node);
> +	if (!chain_head)
> +		return NULL;
> +
> +	hlist_for_each_entry_safe(dup, hlist_safe, &chain_head->hlist, hlist_dup) {
> +		if (dup == failing_node)
> +			continue;
> +
> +		target_folio = ksm_get_folio(dup, KSM_GET_FOLIO_TRYLOCK);
> +		if (target_folio) {
> +			*target_dup = dup;
> +			return target_folio;
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +static int replace_failing_page(struct vm_area_struct *vma, struct page *page,
> +				struct page *kpage, unsigned long addr)
> +{
> +	struct folio *kfolio = page_folio(kpage);
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct folio *folio = page_folio(page);
> +	pmd_t *pmd;
> +	pte_t *ptep;
> +	pte_t newpte;
> +	spinlock_t *ptl;
> +	int err = -EFAULT;
> +	struct mmu_notifier_range range;
> +
> +	pmd = mm_find_pmd(mm, addr);
> +	if (!pmd)
> +		goto out;
> +
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
> +				addr + PAGE_SIZE);
> +	mmu_notifier_invalidate_range_start(&range);
> +
> +	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> +	if (!ptep)
> +		goto out_mn;
> +
> +	if (!is_zero_pfn(page_to_pfn(kpage))) {
> +		folio_get(kfolio);
> +		folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
> +		newpte = mk_pte(kpage, vma->vm_page_prot);
> +	} else {
> +		newpte = pte_mkdirty(pte_mkspecial(pfn_pte(page_to_pfn(kpage), vma->vm_page_prot)));
> +		ksm_map_zero_page(mm);
> +		dec_mm_counter(mm, MM_ANONPAGES);
> +	}
> +
> +	flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep)));
> +	ptep_clear_flush(vma, addr, ptep);
> +	set_pte_at(mm, addr, ptep, newpte);
> +
> +	folio_remove_rmap_pte(folio, page, vma);
> +	if (!folio_mapped(folio))
> +		folio_free_swap(folio);
> +	folio_put(folio);
> +
> +	pte_unmap_unlock(ptep, ptl);
> +	err = 0;
> +out_mn:
> +	mmu_notifier_invalidate_range_end(&range);
> +out:
> +	return err;
> +}
> +
> +static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node)
> +{
> +	struct ksm_rmap_item *rmap_item;
> +	struct hlist_node *hlist_safe;
> +	struct folio *failing_folio = NULL;
> +	struct folio *target_folio = NULL;
> +	struct ksm_stable_node *target_dup = NULL;
> +	int err;
> +
> +	if (!is_stable_node_dup(failing_node))
> +		return false;
> +
> +	failing_folio = ksm_get_folio(failing_node, KSM_GET_FOLIO_NOLOCK);
> +	if (!failing_folio)
> +		return false;
> +
> +	target_folio = find_target_folio(failing_node, &target_dup);
> +	if (!target_folio) {
> +		folio_put(failing_folio);
> +		return false;
> +	}
> +
> +	hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
> +		struct mm_struct *mm = rmap_item->mm;
> +		unsigned long addr = rmap_item->address & PAGE_MASK;
> +		struct vm_area_struct *vma;
> +
> +		mmap_read_lock(mm);

I briefly looked at the code and found that the lock order might be
broken here. mm/filemap.c documents the lock order as below:

 * ->mmap_lock
 *   ->invalidate_lock		(filemap_fault)
 *     ->lock_page		(filemap_fault, access_process_vm)

But mmap_lock is held after the folio is locked here. Is this a problem?

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread
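One conventional way to avoid the inversion Miaohe points out, sketched
here as an assumption rather than anything proposed in the thread, is to
only trylock mmap_lock while the target folio is held locked:

```c
/*
 * Hypothetical tweak to the per-mm loop in ksm_recover_within_chain():
 * target_folio is already locked, so never *block* on mmap_lock here.
 * If it cannot be taken immediately, skip this mm (it falls back to
 * the kill path) instead of risking an ABBA deadlock against a task
 * that holds mmap_lock and waits on the folio lock.
 */
if (!mmap_read_trylock(mm))
	continue;
```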
* Re: [PATCH RFC] mm/ksm: Add recovery mechanism for memory failures 2025-10-09 7:00 ` [PATCH RFC 1/1] " Longlong Xia 2025-10-09 12:13 ` Lance Yang 2025-10-11 3:25 ` Miaohe Lin @ 2025-10-13 20:10 ` Markus Elfring 2 siblings, 0 replies; 18+ messages in thread From: Markus Elfring @ 2025-10-13 20:10 UTC (permalink / raw) To: Longlong Xia, linux-mm, Miaohe Lin, Naoya Horiguchi Cc: Longlong Xia, LKML, Andrew Morton, David Hildenbrand, Kefeng Wang, xu xin … > This patch introduces … See also: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.17#n94 … > +++ b/mm/ksm.c … > +static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node) > +{ … > + struct vm_area_struct *vma; > + > + mmap_read_lock(mm); … > + } > + > + mmap_read_unlock(mm); … Under which circumstances would you become interested to apply a statement like “guard(mmap_read_lock)(mm);”? https://elixir.bootlin.com/linux/v6.17.1/source/include/linux/mmap_lock.h#L483-L484 Regards, Markus ^ permalink raw reply [flat|nested] 18+ messages in thread
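Roughly what the guard() suggestion could look like applied to the loop
from the patch; a sketch only, where the scope-based guard drops the lock
automatically at every scope exit (including each continue), so the
explicit mmap_read_unlock() calls disappear:

```c
hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
	struct mm_struct *mm = rmap_item->mm;
	unsigned long addr = rmap_item->address & PAGE_MASK;
	struct vm_area_struct *vma;

	guard(mmap_read_lock)(mm);	/* released at end of iteration */

	if (ksm_test_exit(mm))
		continue;

	vma = vma_lookup(mm, addr);
	if (!vma)
		continue;

	if (!replace_failing_page(vma, &failing_folio->page,
				  &target_folio->page, addr)) {
		/* ... move the rmap_item to target_dup as in the patch ... */
	}
}
```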
* Re: [PATCH RFC 0/1] mm/ksm: Add recovery mechanism for memory failures
  2025-10-09  7:00 [PATCH RFC 0/1] mm/ksm: Add recovery mechanism for memory failures Longlong Xia
  2025-10-09  7:00 ` [PATCH RFC 1/1] " Longlong Xia
@ 2025-10-09 18:57 ` David Hildenbrand
  1 sibling, 0 replies; 18+ messages in thread

From: David Hildenbrand @ 2025-10-09 18:57 UTC (permalink / raw)
To: Longlong Xia, linmiaohe, nao.horiguchi
Cc: akpm, wangkefeng.wang, xu.xin16, linux-kernel, linux-mm, Longlong Xia

On 09.10.25 09:00, Longlong Xia wrote:
> From: Longlong Xia <xialonglong@kylinos.cn>
>
> When a hardware memory error occurs on a KSM page, the current
> behavior is to kill all processes mapping that page. This can
> be overly aggressive when KSM has multiple duplicate pages in
> a chain where other duplicates are still healthy.
>
> This patch introduces a recovery mechanism that attempts to migrate
> mappings from the failing page to another healthy duplicate within
> the same chain before resorting to killing processes.

An alternative could be to allocate a new page and effectively migrate
from the old (degraded) page to the new page by copying page content
from one of the healthy duplicates. That would keep the #mappings per
page in the chain balanced.

> The recovery process works as follows:
> 1. When a memory failure is detected on a KSM page, identify whether the
>    failing node is part of a chain (has duplicates). (Perhaps add a
>    dup_head field to struct ksm_stable_node to record the chain head,
>    saving a search of the whole stable tree, or find the head some
>    other way.)
> 2. Search for another healthy duplicate page within the same chain.
> 3. For each process mapping the failing page:
>    - Update the PTE to point to the healthy duplicate page (perhaps
>      reuse replace_page, or split replace_page into smaller functions
>      and share the common part).
>    - Migrate the rmap_item to the new stable node.
> 4. If all migrations succeed, remove the failing node from the chain.
> 5. Only kill processes if recovery is impossible or fails.

Does not sound too crazy. But how realistic do we consider that in
practice?

We need quite a bunch of processes to dedup the same page to end up
getting duplicates in the chain IIRC. So isn't this rather an
improvement only for less likely scenarios in practice?

--
Cheers

David / dhildenb

^ permalink raw reply	[flat|nested] 18+ messages in thread