linux-mm.kvack.org archive mirror
* [PATCH v2 0/1]  mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
@ 2025-10-16 10:18 Longlong Xia
  2025-10-16 10:18 ` [PATCH v2 1/1] " Longlong Xia
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Longlong Xia @ 2025-10-16 10:18 UTC (permalink / raw)
  To: linmiaohe, david, lance.yang
  Cc: markus.elfring, nao.horiguchi, akpm, wangkefeng.wang, qiuxu.zhuo,
	xu.xin16, linux-kernel, linux-mm, Longlong Xia

When a hardware memory error occurs on a KSM page, the current
behavior is to kill all processes mapping that page. This can
be overly aggressive when KSM has multiple duplicate pages in
a chain where other duplicates are still healthy.

This patch introduces a recovery mechanism that attempts to
migrate mappings from the failing KSM page to a newly
allocated KSM page or another healthy duplicate already
present in the same chain, before falling back to the
process-killing procedure.

The recovery process works as follows:
1. Identify if the failing KSM page belongs to a stable node chain.
2. Locate a healthy duplicate KSM page within the same chain.
3. For each process mapping the failing page:
   a. Attempt to allocate a new KSM page copy from healthy duplicate
      KSM page. If successful, migrate the mapping to this new KSM page.
   b. If allocation fails, migrate the mapping to the existing healthy
      duplicate KSM page.
4. If all migrations succeed, remove the failing KSM page from the chain.
5. Only if recovery fails (e.g., no healthy duplicate found or migration
   error) does the kernel fall back to killing the affected processes.

The original idea came from Naoya Horiguchi.
https://lore.kernel.org/all/20230331054243.GB1435482@hori.linux.bs1.fc.nec.co.jp/

I tested it with EINJ on a physical x86_64 machine with an Intel(R) Xeon(R) Gold 6430 CPU.

Test shell script:
modprobe einj 2>/dev/null
# 0x10 = memory uncorrectable (non-fatal) error
echo 0x10 > /sys/kernel/debug/apei/einj/error_type
# physical address to inject the error at
echo $ADDRESS > /sys/kernel/debug/apei/einj/param1
# address mask: inject at page granularity
echo 0xfffffffffffff000 > /sys/kernel/debug/apei/einj/param2
echo 1 > /sys/kernel/debug/apei/einj/error_inject
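
The user-space test program is not included in this posting; the sketch
below shows roughly how such a program can look (map 1024 identical pages,
opt them in to KSM with madvise(MADV_MERGEABLE), then translate virtual to
physical addresses via /proc/self/pagemap). Helper names, the fill pattern
and the sleep are illustrative only, and reading PFNs from pagemap needs root.

/*
 * Rough sketch of the test program (not the exact program that produced
 * the output below). Run as root with /sys/kernel/mm/ksm/run set to 1 so
 * ksmd can merge the pages.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 1024

/* Translate a virtual address via /proc/self/pagemap (PFN needs root). */
static uint64_t virt_to_phys(void *addr)
{
	long psize = sysconf(_SC_PAGESIZE);
	uint64_t entry = 0;
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return 0;
	pread(fd, &entry, sizeof(entry),
	      (uint64_t)((uintptr_t)addr / psize) * sizeof(entry));
	close(fd);
	if (!(entry & (1ULL << 63)))		/* page not present */
		return 0;
	return (entry & ((1ULL << 55) - 1)) * psize + (uintptr_t)addr % psize;
}

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, NPAGES * psize, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int i;

	for (i = 0; i < NPAGES; i++)
		memset(buf + i * psize, 0x5a, psize);	/* identical content */
	madvise(buf, NPAGES * psize, MADV_MERGEABLE);	/* opt in to KSM */
	sleep(60);					/* let ksmd merge */
	for (i = 0; i < NPAGES; i++)
		printf("virtual addr = %p  phy_addr =0x%llx\n",
		       buf + i * psize,
		       (unsigned long long)virt_to_phys(buf + i * psize));
	return 0;
}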

FIRST WAY: allocate a new KSM page copy from a healthy duplicate
1. Allocate 1024 pages with the same content and enable KSM to merge them.
After merging (each distinct phy_addr is printed only once):
virtual addr = 0x71582be00000  phy_addr =0x124802000
virtual addr = 0x71582bf2c000  phy_addr =0x124902000
virtual addr = 0x71582c026000  phy_addr =0x125402000
virtual addr = 0x71582c120000  phy_addr =0x125502000


2. echo 0x124802000 > /sys/kernel/debug/apei/einj/param1
virtual addr = 0x71582be00000  phy_addr =0x1363b1000 (new allocated)
virtual addr = 0x71582bf2c000  phy_addr =0x124902000
virtual addr = 0x71582c026000  phy_addr =0x125402000
virtual addr = 0x71582c120000  phy_addr =0x125502000


3. echo 0x124902000 > /sys/kernel/debug/apei/einj/param1
virtual addr = 0x71582be00000  phy_addr =0x1363b1000
virtual addr = 0x71582bf2c000  phy_addr =0x13099a000 (new allocated)
virtual addr = 0x71582c026000  phy_addr =0x125402000
virtual addr = 0x71582c120000  phy_addr =0x125502000

kernel-log:
mce: [Hardware Error]: Machine check events logged
ksm: recovery successful, no need to kill processes
Memory failure: 0x124802: recovery action for dirty LRU page: Recovered
Memory failure: 0x124802: recovery action for already poisoned page: Failed
ksm: recovery successful, no need to kill processes
Memory failure: 0x124902: recovery action for dirty LRU page: Recovered
Memory failure: 0x124902: recovery action for already poisoned page: Failed


SECOND WAY: migrate the mapping to an existing healthy duplicate KSM page
1. Allocate 1024 pages with the same content and enable KSM to merge them.
After merging (each distinct phy_addr is printed only once):
virtual addr = 0x79a172000000  phy_addr =0x141802000
virtual addr = 0x79a17212c000  phy_addr =0x141902000
virtual addr = 0x79a172226000  phy_addr =0x13cc02000
virtual addr = 0x79a172320000  phy_addr =0x13cd02000

2. echo 0x141802000 > /sys/kernel/debug/apei/einj/param1
a.virtual addr = 0x79a172000000  phy_addr =0x13cd02000
b.virtual addr = 0x79a17212c000  phy_addr =0x141902000
c.virtual addr = 0x79a172226000  phy_addr =0x13cc02000
d.virtual addr = 0x79a172320000  phy_addr =0x13cd02000 (share with a) 

3. echo 0x141902000 > /sys/kernel/debug/apei/einj/param1
a.virtual addr = 0x79a172000000  phy_addr =0x13cd02000
b.virtual addr = 0x79a172032000  phy_addr =0x13cd02000 (share with a) 
c.virtual addr = 0x79a172226000  phy_addr =0x13cc02000
d.virtual addr = 0x79a172320000  phy_addr =0x13cd02000 (share with a) 

4. echo 0x13cd02000 > /sys/kernel/debug/apei/einj/param1
a.virtual addr = 0x79a172000000  phy_addr =0x13cc02000
b.virtual addr = 0x79a172032000  phy_addr =0x13cc02000 (share with a)
c.virtual addr = 0x79a172226000  phy_addr =0x13cc02000 (share with a)
d.virtual addr = 0x79a172320000  phy_addr =0x13cc02000 (share with a)

5. echo 0x13cc02000 > /sys/kernel/debug/apei/einj/param1
Bus error (core dumped)

kernel-log:
mce: [Hardware Error]: Machine check events logged
ksm: recovery successful, no need to kill processes
Memory failure: 0x141802: recovery action for dirty LRU page: Recovered
Memory failure: 0x141802: recovery action for already poisoned page: Failed
ksm: recovery successful, no need to kill processes
Memory failure: 0x141902: recovery action for dirty LRU page: Recovered
Memory failure: 0x141902: recovery action for already poisoned page: Failed
ksm: recovery successful, no need to kill processes
Memory failure: 0x13cd02: recovery action for dirty LRU page: Recovered
Memory failure: 0x13cd02: recovery action for already poisoned page: Failed
Memory failure: 0x13cc02: recovery action for dirty LRU page: Recovered
Memory failure: 0x13cc02: recovery action for already poisoned page: Failed
MCE: Killing ksm_addr:5221 due to hardware memory corruption fault at 79a172000000

ZERO PAGE TEST:
When testing on the physical x86_64 machine (Intel(R) Xeon(R) Gold 6430):
[shell]# ./einj.sh 0x193f908000
./einj.sh: line 25: echo: write error: Address already in use

When testing in qemu-x86_64:
Injecting memory failure at pfn 0x3a9d0c
Memory failure: 0x3a9d0c: unhandlable page.
Memory failure: 0x3a9d0c: recovery action for get hwpoison page: Ignored

It seems the kernel returns early, before entering this patch's functions.

Thanks for review and comments!

Changes in v2:

- Implemented a two-tier recovery strategy, preferring newly allocated
  pages over existing duplicates to avoid concentrating mappings on a
  single page, as suggested by David Hildenbrand
- Removed handling of the zero page in replace_failing_page(), as it is
  non-recoverable, suggested by Lance Yang
- Corrected the locking order by acquiring the mmap_lock before the page
  lock during page replacement, suggested by Miaohe Lin
- Added protection with ksm_thread_mutex around the entire recovery
  operation to prevent races with concurrent KSM scanning
- Separated the logic into smaller, more focused functions for better
  maintainability
- Updated the patch title

Longlong Xia (1):
  mm/ksm: recover from memory failure on KSM page by migrating to
    healthy duplicate

 mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 246 insertions(+)

-- 
2.43.0



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
  2025-10-16 10:18 [PATCH v2 0/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate Longlong Xia
@ 2025-10-16 10:18 ` Longlong Xia
  2025-10-16 14:37   ` [PATCH v2] " Markus Elfring
                     ` (3 more replies)
  2025-10-16 10:46 ` [PATCH v2 0/1] mm/ksm: " David Hildenbrand
  2025-10-16 11:01 ` Markus Elfring
  2 siblings, 4 replies; 17+ messages in thread
From: Longlong Xia @ 2025-10-16 10:18 UTC (permalink / raw)
  To: linmiaohe, david, lance.yang
  Cc: markus.elfring, nao.horiguchi, akpm, wangkefeng.wang, qiuxu.zhuo,
	xu.xin16, linux-kernel, linux-mm, Longlong Xia

From: Longlong Xia <xialonglong@kylinos.cn>

When a hardware memory error occurs on a KSM page, the current
behavior is to kill all processes mapping that page. This can
be overly aggressive when KSM has multiple duplicate pages in
a chain where other duplicates are still healthy.

This patch introduces a recovery mechanism that attempts to
migrate mappings from the failing KSM page to a newly
allocated KSM page or another healthy duplicate already
present in the same chain, before falling back to the
process-killing procedure.

The recovery process works as follows:
1. Identify if the failing KSM page belongs to a stable node chain.
2. Locate a healthy duplicate KSM page within the same chain.
3. For each process mapping the failing page:
   a. Attempt to allocate a new KSM page copy from healthy duplicate
      KSM page. If successful, migrate the mapping to this new KSM page.
   b. If allocation fails, migrate the mapping to the existing healthy
      duplicate KSM page.
4. If all migrations succeed, remove the failing KSM page from the chain.
5. Only if recovery fails (e.g., no healthy duplicate found or migration
   error) does the kernel fall back to killing the affected processes.

Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
---
 mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 246 insertions(+)

diff --git a/mm/ksm.c b/mm/ksm.c
index 160787bb121c..9099bad1ab35 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 }
 
 #ifdef CONFIG_MEMORY_FAILURE
+static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
+{
+	struct ksm_stable_node *stable_node, *dup;
+	struct rb_node *node;
+	int nid;
+
+	if (!is_stable_node_dup(dup_node))
+		return NULL;
+
+	for (nid = 0; nid < ksm_nr_node_ids; nid++) {
+		node = rb_first(root_stable_tree + nid);
+		for (; node; node = rb_next(node)) {
+			stable_node = rb_entry(node,
+					struct ksm_stable_node,
+					node);
+
+			if (!is_stable_node_chain(stable_node))
+				continue;
+
+			hlist_for_each_entry(dup, &stable_node->hlist,
+					hlist_dup) {
+				if (dup == dup_node)
+					return stable_node;
+			}
+		}
+	}
+
+	return NULL;
+}
+
+static struct folio *find_healthy_folio(struct ksm_stable_node *chain_head,
+		struct ksm_stable_node *failing_node,
+		struct ksm_stable_node **healthy_dupdup)
+{
+	struct ksm_stable_node *dup;
+	struct hlist_node *hlist_safe;
+	struct folio *healthy_folio;
+
+	if (!is_stable_node_chain(chain_head) || !is_stable_node_dup(failing_node))
+		return NULL;
+
+	hlist_for_each_entry_safe(dup, hlist_safe, &chain_head->hlist, hlist_dup) {
+		if (dup == failing_node)
+			continue;
+
+		healthy_folio = ksm_get_folio(dup, KSM_GET_FOLIO_TRYLOCK);
+		if (healthy_folio) {
+			*healthy_dupdup = dup;
+			return healthy_folio;
+		}
+	}
+
+	return NULL;
+}
+
+static struct page *create_new_stable_node_dup(struct ksm_stable_node *chain_head,
+		struct folio *healthy_folio,
+		struct ksm_stable_node **new_stable_node)
+{
+	int nid;
+	unsigned long kpfn;
+	struct page *new_page = NULL;
+
+	if (!is_stable_node_chain(chain_head))
+		return NULL;
+
+	new_page = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_ZERO);
+	if (!new_page)
+		return NULL;
+
+	copy_highpage(new_page, folio_page(healthy_folio, 0));
+
+	*new_stable_node = alloc_stable_node();
+	if (!*new_stable_node) {
+		__free_page(new_page);
+		return NULL;
+	}
+
+	INIT_HLIST_HEAD(&(*new_stable_node)->hlist);
+	kpfn = page_to_pfn(new_page);
+	(*new_stable_node)->kpfn = kpfn;
+	nid = get_kpfn_nid(kpfn);
+	DO_NUMA((*new_stable_node)->nid = nid);
+	(*new_stable_node)->rmap_hlist_len = 0;
+
+	(*new_stable_node)->head = STABLE_NODE_DUP_HEAD;
+	hlist_add_head(&(*new_stable_node)->hlist_dup, &chain_head->hlist);
+	ksm_stable_node_dups++;
+	folio_set_stable_node(page_folio(new_page), *new_stable_node);
+	folio_add_lru(page_folio(new_page));
+
+	return new_page;
+}
+
+static int replace_failing_page(struct vm_area_struct *vma, struct page *page,
+		struct page *kpage, unsigned long addr)
+{
+	struct folio *kfolio = page_folio(kpage);
+	struct mm_struct *mm = vma->vm_mm;
+	struct folio *folio = page_folio(page);
+	pmd_t *pmd;
+	pte_t *ptep;
+	pte_t newpte;
+	spinlock_t *ptl;
+	int err = -EFAULT;
+	struct mmu_notifier_range range;
+
+	pmd = mm_find_pmd(mm, addr);
+	if (!pmd)
+		goto out;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
+			addr + PAGE_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+
+	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	if (!ptep)
+		goto out_mn;
+
+	folio_get(kfolio);
+	folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
+	newpte = mk_pte(kpage, vma->vm_page_prot);
+
+	flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep)));
+	ptep_clear_flush(vma, addr, ptep);
+	set_pte_at(mm, addr, ptep, newpte);
+
+	folio_remove_rmap_pte(folio, page, vma);
+	if (!folio_mapped(folio))
+		folio_free_swap(folio);
+	folio_put(folio);
+
+	pte_unmap_unlock(ptep, ptl);
+	err = 0;
+out_mn:
+	mmu_notifier_invalidate_range_end(&range);
+out:
+	return err;
+}
+
+static void migrate_to_target_dup(struct ksm_stable_node *failing_node,
+		struct folio *failing_folio,
+		struct folio *target_folio,
+		struct ksm_stable_node *target_dup)
+{
+	struct ksm_rmap_item *rmap_item;
+	struct hlist_node *hlist_safe;
+	int err;
+
+	hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
+		struct mm_struct *mm = rmap_item->mm;
+		unsigned long addr = rmap_item->address & PAGE_MASK;
+		struct vm_area_struct *vma;
+
+		if (!mmap_read_trylock(mm))
+			continue;
+
+		if (ksm_test_exit(mm)) {
+			mmap_read_unlock(mm);
+			continue;
+		}
+
+		vma = vma_lookup(mm, addr);
+		if (!vma) {
+			mmap_read_unlock(mm);
+			continue;
+		}
+
+		if (!folio_trylock(target_folio)) {
+			mmap_read_unlock(mm);
+			continue;
+		}
+
+		err = replace_failing_page(vma, &failing_folio->page,
+				folio_page(target_folio, 0), addr);
+		if (!err) {
+			hlist_del(&rmap_item->hlist);
+			rmap_item->head = target_dup;
+			hlist_add_head(&rmap_item->hlist, &target_dup->hlist);
+			target_dup->rmap_hlist_len++;
+			failing_node->rmap_hlist_len--;
+		}
+
+		folio_unlock(target_folio);
+		mmap_read_unlock(mm);
+	}
+
+}
+
+static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node)
+{
+	struct folio *failing_folio = NULL;
+	struct ksm_stable_node *healthy_dupdup = NULL;
+	struct folio *healthy_folio = NULL;
+	struct ksm_stable_node *chain_head = NULL;
+	struct page *new_page = NULL;
+	struct ksm_stable_node *new_stable_node = NULL;
+
+	if (!is_stable_node_dup(failing_node))
+		return false;
+
+	guard(mutex)(&ksm_thread_mutex);
+	failing_folio = ksm_get_folio(failing_node, KSM_GET_FOLIO_NOLOCK);
+	if (!failing_folio)
+		return false;
+
+	chain_head = find_chain_head(failing_node);
+	if (!chain_head)
+		return NULL;
+
+	healthy_folio = find_healthy_folio(chain_head, failing_node, &healthy_dupdup);
+	if (!healthy_folio) {
+		folio_put(failing_folio);
+		return false;
+	}
+
+	new_page = create_new_stable_node_dup(chain_head, healthy_folio, &new_stable_node);
+
+	folio_unlock(healthy_folio);
+	folio_put(healthy_folio);
+
+	if (new_page && new_stable_node) {
+		migrate_to_target_dup(failing_node, failing_folio,
+				page_folio(new_page), new_stable_node);
+	} else {
+		migrate_to_target_dup(failing_node, failing_folio,
+				healthy_folio, healthy_dupdup);
+	}
+
+	folio_put(failing_folio);
+
+	if (failing_node->rmap_hlist_len == 0) {
+		__stable_node_dup_del(failing_node);
+		free_stable_node(failing_node);
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Collect processes when the error hit an ksm page.
  */
@@ -3098,6 +3338,12 @@ void collect_procs_ksm(const struct folio *folio, const struct page *page,
 	stable_node = folio_stable_node(folio);
 	if (!stable_node)
 		return;
+
+	if (ksm_recover_within_chain(stable_node)) {
+		pr_info("ksm: recovery successful, no need to kill processes\n");
+		return;
+	}
+
 	hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
 		struct anon_vma *av = rmap_item->anon_vma;
 
-- 
2.43.0



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
  2025-10-16 10:18 [PATCH v2 0/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate Longlong Xia
  2025-10-16 10:18 ` [PATCH v2 1/1] " Longlong Xia
@ 2025-10-16 10:46 ` David Hildenbrand
  2025-10-21 14:00   ` Long long Xia
  2025-10-16 11:01 ` Markus Elfring
  2 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2025-10-16 10:46 UTC (permalink / raw)
  To: Longlong Xia, linmiaohe, lance.yang
  Cc: markus.elfring, nao.horiguchi, akpm, wangkefeng.wang, qiuxu.zhuo,
	xu.xin16, linux-kernel, linux-mm

On 16.10.25 12:18, Longlong Xia wrote:
> When a hardware memory error occurs on a KSM page, the current
> behavior is to kill all processes mapping that page. This can
> be overly aggressive when KSM has multiple duplicate pages in
> a chain where other duplicates are still healthy.
> 
> This patch introduces a recovery mechanism that attempts to
> migrate mappings from the failing KSM page to a newly
> allocated KSM page or another healthy duplicate already
> present in the same chain, before falling back to the
> process-killing procedure.
> 
> The recovery process works as follows:
> 1. Identify if the failing KSM page belongs to a stable node chain.
> 2. Locate a healthy duplicate KSM page within the same chain.
> 3. For each process mapping the failing page:
>     a. Attempt to allocate a new KSM page copy from healthy duplicate
>        KSM page. If successful, migrate the mapping to this new KSM page.
>     b. If allocation fails, migrate the mapping to the existing healthy
>        duplicate KSM page.
> 4. If all migrations succeed, remove the failing KSM page from the chain.
> 5. Only if recovery fails (e.g., no healthy duplicate found or migration
>     error) does the kernel fall back to killing the affected processes.
> 
> The original idea came from Naoya Horiguchi.
> https://lore.kernel.org/all/20230331054243.GB1435482@hori.linux.bs1.fc.nec.co.jp/
> 
> I test it with einj in physical machine x86_64 CPU Intel(R) Xeon(R) Gold 6430.
> 
> test shell script
> modprobe einj 2>/dev/null
> echo 0x10 > /sys/kernel/debug/apei/einj/error_type
> echo $ADDRESS > /sys/kernel/debug/apei/einj/param1
> echo 0xfffffffffffff000 > /sys/kernel/debug/apei/einj/param2
> echo 1 > /sys/kernel/debug/apei/einj/error_inject
> 
> FIRST WAY: allocate a new KSM page copy from healthy duplicate
> 1. alloc 1024 page with same content and enable KSM to merge
> after merge (same phy_addr only print once)
> virtual addr = 0x71582be00000  phy_addr =0x124802000
> virtual addr = 0x71582bf2c000  phy_addr =0x124902000
> virtual addr = 0x71582c026000  phy_addr =0x125402000
> virtual addr = 0x71582c120000  phy_addr =0x125502000
> 
> 
> 2. echo 0x124802000 > /sys/kernel/debug/apei/einj/param1
> virtual addr = 0x71582be00000  phy_addr =0x1363b1000 (new allocated)
> virtual addr = 0x71582bf2c000  phy_addr =0x124902000
> virtual addr = 0x71582c026000  phy_addr =0x125402000
> virtual addr = 0x71582c120000  phy_addr =0x125502000
> 
> 
> 3. echo 0x124902000 > /sys/kernel/debug/apei/einj/param1
> virtual addr = 0x71582be00000  phy_addr =0x1363b1000
> virtual addr = 0x71582bf2c000  phy_addr =0x13099a000 (new allocated)
> virtual addr = 0x71582c026000  phy_addr =0x125402000
> virtual addr = 0x71582c120000  phy_addr =0x125502000
> 
> kernel-log:
> mce: [Hardware Error]: Machine check events logged
> ksm: recovery successful, no need to kill processes
> Memory failure: 0x124802: recovery action for dirty LRU page: Recovered
> Memory failure: 0x124802: recovery action for already poisoned page: Failed
> ksm: recovery successful, no need to kill processes
> Memory failure: 0x124902: recovery action for dirty LRU page: Recovered
> Memory failure: 0x124902: recovery action for already poisoned page: Failed
> 
> 
> SECOND WAY: Migrate the mapping to the existing healthy duplicate KSM page
> 1. alloc 1024 page with same content and enable KSM to merge
> after merge (same phy_addr only print once)
> virtual addr = 0x79a172000000  phy_addr =0x141802000
> virtual addr = 0x79a17212c000  phy_addr =0x141902000
> virtual addr = 0x79a172226000  phy_addr =0x13cc02000
> virtual addr = 0x79a172320000  phy_addr =0x13cd02000
> 
> 2 echo 0x141802000 > /sys/kernel/debug/apei/einj/param1
> a.virtual addr = 0x79a172000000  phy_addr =0x13cd02000
> b.virtual addr = 0x79a17212c000  phy_addr =0x141902000
> c.virtual addr = 0x79a172226000  phy_addr =0x13cc02000
> d.virtual addr = 0x79a172320000  phy_addr =0x13cd02000 (share with a)
> 
> 3.echo 0x141902000 > /sys/kernel/debug/apei/einj/param1
> a.virtual addr = 0x79a172000000  phy_addr =0x13cd02000
> b.virtual addr = 0x79a172032000  phy_addr =0x13cd02000 (share with a)
> c.virtual addr = 0x79a172226000  phy_addr =0x13cc02000
> d.virtual addr = 0x79a172320000  phy_addr =0x13cd02000 (share with a)
> 
> 4. echo 0x13cd02000 > /sys/kernel/debug/apei/einj/param1
> a.virtual addr = 0x79a172000000  phy_addr =0x13cc02000
> b.virtual addr = 0x79a172032000  phy_addr =0x13cc02000 (share with a)
> c.virtual addr = 0x79a172226000  phy_addr =0x13cc02000 (share with a)
> d.virtual addr = 0x79a172320000  phy_addr =0x13cc02000 (share with a)
> 
> 5. echo 0x13cc02000 > /sys/kernel/debug/apei/einj/param1
> Bus error (core dumped)
> 
> kernel-log:
> mce: [Hardware Error]: Machine check events logged
> ksm: recovery successful, no need to kill processes
> Memory failure: 0x141802: recovery action for dirty LRU page: Recovered
> Memory failure: 0x141802: recovery action for already poisoned page: Failed
> ksm: recovery successful, no need to kill processes
> Memory failure: 0x141902: recovery action for dirty LRU page: Recovered
> Memory failure: 0x141902: recovery action for already poisoned page: Failed
> ksm: recovery successful, no need to kill processes
> Memory failure: 0x13cd02: recovery action for dirty LRU page: Recovered
> Memory failure: 0x13cd02: recovery action for already poisoned page: Failed
> Memory failure: 0x13cc02: recovery action for dirty LRU page: Recovered
> Memory failure: 0x13cc02: recovery action for already poisoned page: Failed
> MCE: Killing ksm_addr:5221 due to hardware memory corruption fault at 79a172000000
> 
> ZERO PAGE TEST:
> when I test in physical machine x86_64 CPU Intel(R) Xeon(R) Gold 6430
> [shell]# ./einj.sh 0x193f908000
> ./einj.sh: line 25: echo: write error: Address already in use
> 
> when I test in qemu-x86_64.
> Injecting memory failure at pfn 0x3a9d0c
> Memory failure: 0x3a9d0c: unhandlable page.
> Memory failure: 0x3a9d0c: recovery action for get hwpoison page: Ignored
> 
> It seems return early before enter this patch's functions.
> 
> Thanks for review and comments!
> 
> Changes in v2:
> 
> - Implemented a two-tier recovery strategy: preferring newly allocated
>    pages over existing duplicates to avoid concentrating mappings on a
>    single page suggested by David Hildenbrand

I also asked how relevant this is in practice [1]

"
But how realistic do we consider that in practice? We need quite a bunch
of processes to dedup the same page to end up getting duplicates in the
chain IIRC.

So isn't this rather an improvement only for less likely scenarios in
practice?
"

In particular for your test "alloc 1024 page with same content".

It certainly adds complexity, so we should clarify if this is really 
worth it.

[1] 
https://lore.kernel.org/all/8c4d8ebe-885e-40f0-a10e-7290067c7b96@redhat.com/

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
  2025-10-16 10:18 [PATCH v2 0/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate Longlong Xia
  2025-10-16 10:18 ` [PATCH v2 1/1] " Longlong Xia
  2025-10-16 10:46 ` [PATCH v2 0/1] mm/ksm: " David Hildenbrand
@ 2025-10-16 11:01 ` Markus Elfring
  2 siblings, 0 replies; 17+ messages in thread
From: Markus Elfring @ 2025-10-16 11:01 UTC (permalink / raw)
  To: Longlong Xia, linux-mm, David Hildenbrand, Lance Yang,
	Miaohe Lin, Naoya Horiguchi
  Cc: Longlong Xia, LKML, Andrew Morton, Kefeng Wang, xu xin

> When a hardware memory error occurs on a KSM page, the current
> behavior is to kill all processes mapping that page. This can
…

* The word wrapping can be improved another bit, can't it?
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.17#n658

* How relevant is such a cover letter for a single patch?

* Is there a need to extend change descriptions at other places?


Regards,
Markus


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
  2025-10-16 10:18 ` [PATCH v2 1/1] " Longlong Xia
@ 2025-10-16 14:37   ` Markus Elfring
  2025-10-17  3:09   ` [PATCH v2 1/1] " kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Markus Elfring @ 2025-10-16 14:37 UTC (permalink / raw)
  To: Longlong Xia, linux-mm, David Hildenbrand, Lance Yang,
	Miaohe Lin, Naoya Horiguchi
  Cc: Longlong Xia, LKML, Andrew Morton, Kefeng Wang, Qiuxu Zhuo, xu xin

…> This patch introduces …

See also once more:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.17#n94


…> +++ b/mm/ksm.c
…
> +static int replace_failing_page(struct vm_area_struct *vma, struct page *page,
> +		struct page *kpage, unsigned long addr)
> +{
…> +	int err = -EFAULT;
…> +	pmd = mm_find_pmd(mm, addr);
> +	if (!pmd)
> +		goto out;

Please return directly here.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?h=v6.17#n532


…> +out_mn:
> +	mmu_notifier_invalidate_range_end(&range);
> +out:
> +	return err;
> +}
> +
> +static void migrate_to_target_dup(struct ksm_stable_node *failing_node,
> +		struct folio *failing_folio,
> +		struct folio *target_folio,
> +		struct ksm_stable_node *target_dup)
> +{
…> +	hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
> +		struct mm_struct *mm = rmap_item->mm;
> +		unsigned long addr = rmap_item->address & PAGE_MASK;
> +		struct vm_area_struct *vma;
> +
> +		if (!mmap_read_trylock(mm))
> +			continue;
> +
> +		if (ksm_test_exit(mm)) {
> +			mmap_read_unlock(mm);
> +			continue;
> +		}

I suggest avoiding duplicate source code here by using another label.
Would such an implementation detail become relevant for the application of scope-based resource management?
https://elixir.bootlin.com/linux/v6.17.1/source/include/linux/mmap_lock.h#L483-L484


…> +		folio_unlock(target_folio);
+unlock:
> +		mmap_read_unlock(mm);
> +	}
> +
> +}
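
A rough, untested sketch of what I mean for this loop (only the unlock
path is restructured; everything else is as in your patch):

	hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
		struct mm_struct *mm = rmap_item->mm;
		unsigned long addr = rmap_item->address & PAGE_MASK;
		struct vm_area_struct *vma;

		if (!mmap_read_trylock(mm))
			continue;

		if (ksm_test_exit(mm))
			goto unlock;

		vma = vma_lookup(mm, addr);
		if (!vma)
			goto unlock;

		if (!folio_trylock(target_folio))
			goto unlock;

		err = replace_failing_page(vma, &failing_folio->page,
				folio_page(target_folio, 0), addr);
		if (!err) {
			hlist_del(&rmap_item->hlist);
			rmap_item->head = target_dup;
			hlist_add_head(&rmap_item->hlist, &target_dup->hlist);
			target_dup->rmap_hlist_len++;
			failing_node->rmap_hlist_len--;
		}

		folio_unlock(target_folio);
unlock:
		mmap_read_unlock(mm);
	}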
> +
> +static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node)
> +{
…> +	if (new_page && new_stable_node) {
> +		migrate_to_target_dup(failing_node, failing_folio,
> +				page_folio(new_page), new_stable_node);
> +	} else {
> +		migrate_to_target_dup(failing_node, failing_folio,
> +				healthy_folio, healthy_dupdup);
> +	}
…

What do you think about omitting the curly brackets?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?h=v6.17#n197

Regards,
Markus


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
  2025-10-16 10:18 ` [PATCH v2 1/1] " Longlong Xia
  2025-10-16 14:37   ` [PATCH v2] " Markus Elfring
@ 2025-10-17  3:09   ` kernel test robot
  2025-10-23 11:54   ` Miaohe Lin
  2025-10-28  9:44   ` David Hildenbrand
  3 siblings, 0 replies; 17+ messages in thread
From: kernel test robot @ 2025-10-17  3:09 UTC (permalink / raw)
  To: Longlong Xia, linmiaohe, david, lance.yang
  Cc: llvm, oe-kbuild-all, markus.elfring, nao.horiguchi, akpm,
	wangkefeng.wang, qiuxu.zhuo, xu.xin16, linux-kernel, linux-mm,
	Longlong Xia

Hi Longlong,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.18-rc1 next-20251016]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Longlong-Xia/mm-ksm-recover-from-memory-failure-on-KSM-page-by-migrating-to-healthy-duplicate/20251016-182115
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20251016101813.484565-2-xialonglong2025%40163.com
patch subject: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
config: x86_64-buildonly-randconfig-003-20251017 (https://download.01.org/0day-ci/archive/20251017/202510171017.wBXHozQb-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251017/202510171017.wBXHozQb-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510171017.wBXHozQb-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/ksm.c:3160:6: warning: variable 'nid' set but not used [-Wunused-but-set-variable]
    3160 |         int nid;
         |             ^
   1 warning generated.


vim +/nid +3160 mm/ksm.c

  3155	
  3156	static struct page *create_new_stable_node_dup(struct ksm_stable_node *chain_head,
  3157			struct folio *healthy_folio,
  3158			struct ksm_stable_node **new_stable_node)
  3159	{
> 3160		int nid;
  3161		unsigned long kpfn;
  3162		struct page *new_page = NULL;
  3163	
  3164		if (!is_stable_node_chain(chain_head))
  3165			return NULL;
  3166	
  3167		new_page = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_ZERO);
  3168		if (!new_page)
  3169			return NULL;
  3170	
  3171		copy_highpage(new_page, folio_page(healthy_folio, 0));
  3172	
  3173		*new_stable_node = alloc_stable_node();
  3174		if (!*new_stable_node) {
  3175			__free_page(new_page);
  3176			return NULL;
  3177		}
  3178	
  3179		INIT_HLIST_HEAD(&(*new_stable_node)->hlist);
  3180		kpfn = page_to_pfn(new_page);
  3181		(*new_stable_node)->kpfn = kpfn;
  3182		nid = get_kpfn_nid(kpfn);
  3183		DO_NUMA((*new_stable_node)->nid = nid);
  3184		(*new_stable_node)->rmap_hlist_len = 0;
  3185	
  3186		(*new_stable_node)->head = STABLE_NODE_DUP_HEAD;
  3187		hlist_add_head(&(*new_stable_node)->hlist_dup, &chain_head->hlist);
  3188		ksm_stable_node_dups++;
  3189		folio_set_stable_node(page_folio(new_page), *new_stable_node);
  3190		folio_add_lru(page_folio(new_page));
  3191	
  3192		return new_page;
  3193	}
  3194	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
  2025-10-16 10:46 ` [PATCH v2 0/1] mm/ksm: " David Hildenbrand
@ 2025-10-21 14:00   ` Long long Xia
  2025-10-23 16:16     ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Long long Xia @ 2025-10-21 14:00 UTC (permalink / raw)
  To: David Hildenbrand, linmiaohe, lance.yang
  Cc: markus.elfring, nao.horiguchi, akpm, wangkefeng.wang, qiuxu.zhuo,
	xu.xin16, linux-kernel, linux-mm

Thanks for the reply.

I did some simple tests.
I hope these findings are helpful for the community's review.

1.Test VM
Configuration
Hardware: x86_64 QEMU VM, 1 vCPU, 256MB RAM per guest
Kernel: 6.6.89

Testcase 1: single VM with KSM enabled

- VM Memory Usage:
   * RSS Total  = 275028 KB (268 MB)
   * RSS Anon   = 253656 KB (247 MB)
   * RSS File   = 21372 KB (20 MB)
   * RSS Shmem  = 0 KB (0 MB)

a. Traverse the stable tree
b. Pages on chains:
2 chains detected
Chain #1: 51 duplicates, 12,956 pages (~51 MB)
Chain #2: 15 duplicates, 3,822 pages (~15 MB)
Average: 8,389 pages per chain
Sum: 16,778 pages (64.6% of ksm_pages_sharing + ksm_pages_shared)
c. Pages not on chains:
Non-chain pages: 9,209 pages
d. chain_count = 2, not_chain_count = 4200
e.
/sys/kernel/mm/ksm/ksm_pages_sharing = 21721
/sys/kernel/mm/ksm/ksm_pages_shared = 4266
/sys/kernel/mm/ksm/ksm_pages_unshared = 38098


Testcase 2: 10 VMs with KSM enabled
a. Traverse the stable tree
b. Pages on chains:
8 chains detected
Chain #1: 458 duplicates, 117,012 pages (~457 MB)
Chain #2: 150 duplicates, 38,231 pages (~149 MB)
Chain #3: 10 duplicates, 2,320 pages (~9 MB)
Chain #4: 8 duplicates, 1,814 pages (~7 MB)
Chain #5-8: 4, 3, 3, 2 duplicates (920, 720, 600, 260 pages)
Average: 20,235 pages per chain
Sum: 161,877 pages (44.5% of ksm_pages_sharing + ksm_pages_shared)
c. Pages not on chains:
Non-chain pages: 201,486
d. chain_count = 8, not_chain_count = 15936
e.
/sys/kernel/mm/ksm/ksm_pages_sharing = 346789
/sys/kernel/mm/ksm/ksm_pages_shared = 16574
/sys/kernel/mm/ksm/ksm_pages_unshared = 264918


2. Test with the Firefox browser
I opened 10 Firefox browser windows, performed random searches, and then
enabled KSM.

a.page_not_chain = 4043
b.chain_pages = 936 (18.8% of ksm_pages_sharing + ksm_pages_shared)
c.chain_count = 2, not_chain_count = 424
d.
/sys/kernel/mm/ksm/ksm_pages_sharing = 4554
/sys/kernel/mm/ksm/ksm_pages_shared = 425
/sys/kernel/mm/ksm/ksm_pages_unshared = 18461


Surprisingly, although chains are few in number, they contribute
significantly to the overall savings. In the 10-VM scenario, only 8 chains
produce 161,877 pages (44.5% of total), while thousands of non-chain groups
contribute the remaining 55.5%.
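
The chain statistics above were gathered by traversing the stable tree with
a small debug helper. A rough sketch of that traversal, modeled on the walk
in find_chain_head() from the patch (the exact accounting in my script may
differ slightly):

/* hypothetical debug helper; run with KSM scanning quiesced */
static void ksm_dump_chain_stats(void)
{
	struct ksm_stable_node *stable_node, *dup;
	struct rb_node *node;
	unsigned long chain_mappings = 0;
	int nid, chains = 0, non_chains = 0;

	for (nid = 0; nid < ksm_nr_node_ids; nid++) {
		for (node = rb_first(root_stable_tree + nid); node;
		     node = rb_next(node)) {
			stable_node = rb_entry(node, struct ksm_stable_node, node);
			if (!is_stable_node_chain(stable_node)) {
				non_chains++;
				continue;
			}
			chains++;
			/* sum the mappings backed by each duplicate on this chain */
			hlist_for_each_entry(dup, &stable_node->hlist, hlist_dup)
				chain_mappings += dup->rmap_hlist_len;
		}
	}
	pr_info("ksm: %d chains, %d non-chain nodes, %lu mappings on chains\n",
		chains, non_chains, chain_mappings);
}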

I would appreciate any feedback or suggestions.

Best regards,
Longlong Xia

On 2025/10/16 18:46, David Hildenbrand wrote:
> On 16.10.25 12:18, Longlong Xia wrote:
>> When a hardware memory error occurs on a KSM page, the current
>> behavior is to kill all processes mapping that page. This can
>> be overly aggressive when KSM has multiple duplicate pages in
>> a chain where other duplicates are still healthy.
>>
>> This patch introduces a recovery mechanism that attempts to
>> migrate mappings from the failing KSM page to a newly
>> allocated KSM page or another healthy duplicate already
>> present in the same chain, before falling back to the
>> process-killing procedure.
>>
>> The recovery process works as follows:
>> 1. Identify if the failing KSM page belongs to a stable node chain.
>> 2. Locate a healthy duplicate KSM page within the same chain.
>> 3. For each process mapping the failing page:
>>     a. Attempt to allocate a new KSM page copy from healthy duplicate
>>        KSM page. If successful, migrate the mapping to this new KSM 
>> page.
>>     b. If allocation fails, migrate the mapping to the existing healthy
>>        duplicate KSM page.
>> 4. If all migrations succeed, remove the failing KSM page from the 
>> chain.
>> 5. Only if recovery fails (e.g., no healthy duplicate found or migration
>>     error) does the kernel fall back to killing the affected processes.
>>
>> The original idea came from Naoya Horiguchi.
>> https://lore.kernel.org/all/20230331054243.GB1435482@hori.linux.bs1.fc.nec.co.jp/ 
>>
>>
>> I test it with einj in physical machine x86_64 CPU Intel(R) Xeon(R) 
>> Gold 6430.
>>
>> test shell script
>> modprobe einj 2>/dev/null
>> echo 0x10 > /sys/kernel/debug/apei/einj/error_type
>> echo $ADDRESS > /sys/kernel/debug/apei/einj/param1
>> echo 0xfffffffffffff000 > /sys/kernel/debug/apei/einj/param2
>> echo 1 > /sys/kernel/debug/apei/einj/error_inject
>>
>> FIRST WAY: allocate a new KSM page copy from healthy duplicate
>> 1. alloc 1024 page with same content and enable KSM to merge
>> after merge (same phy_addr only print once)
>> virtual addr = 0x71582be00000  phy_addr =0x124802000
>> virtual addr = 0x71582bf2c000  phy_addr =0x124902000
>> virtual addr = 0x71582c026000  phy_addr =0x125402000
>> virtual addr = 0x71582c120000  phy_addr =0x125502000
>>
>>
>> 2. echo 0x124802000 > /sys/kernel/debug/apei/einj/param1
>> virtual addr = 0x71582be00000  phy_addr =0x1363b1000 (new allocated)
>> virtual addr = 0x71582bf2c000  phy_addr =0x124902000
>> virtual addr = 0x71582c026000  phy_addr =0x125402000
>> virtual addr = 0x71582c120000  phy_addr =0x125502000
>>
>>
>> 3. echo 0x124902000 > /sys/kernel/debug/apei/einj/param1
>> virtual addr = 0x71582be00000  phy_addr =0x1363b1000
>> virtual addr = 0x71582bf2c000  phy_addr =0x13099a000 (new allocated)
>> virtual addr = 0x71582c026000  phy_addr =0x125402000
>> virtual addr = 0x71582c120000  phy_addr =0x125502000
>>
>> kernel-log:
>> mce: [Hardware Error]: Machine check events logged
>> ksm: recovery successful, no need to kill processes
>> Memory failure: 0x124802: recovery action for dirty LRU page: Recovered
>> Memory failure: 0x124802: recovery action for already poisoned page: 
>> Failed
>> ksm: recovery successful, no need to kill processes
>> Memory failure: 0x124902: recovery action for dirty LRU page: Recovered
>> Memory failure: 0x124902: recovery action for already poisoned page: 
>> Failed
>>
>>
>> SECOND WAY: Migrate the mapping to the existing healthy duplicate KSM 
>> page
>> 1. alloc 1024 page with same content and enable KSM to merge
>> after merge (same phy_addr only print once)
>> virtual addr = 0x79a172000000  phy_addr =0x141802000
>> virtual addr = 0x79a17212c000  phy_addr =0x141902000
>> virtual addr = 0x79a172226000  phy_addr =0x13cc02000
>> virtual addr = 0x79a172320000  phy_addr =0x13cd02000
>>
>> 2 echo 0x141802000 > /sys/kernel/debug/apei/einj/param1
>> a.virtual addr = 0x79a172000000  phy_addr =0x13cd02000
>> b.virtual addr = 0x79a17212c000  phy_addr =0x141902000
>> c.virtual addr = 0x79a172226000  phy_addr =0x13cc02000
>> d.virtual addr = 0x79a172320000  phy_addr =0x13cd02000 (share with a)
>>
>> 3.echo 0x141902000 > /sys/kernel/debug/apei/einj/param1
>> a.virtual addr = 0x79a172000000  phy_addr =0x13cd02000
>> b.virtual addr = 0x79a172032000  phy_addr =0x13cd02000 (share with a)
>> c.virtual addr = 0x79a172226000  phy_addr =0x13cc02000
>> d.virtual addr = 0x79a172320000  phy_addr =0x13cd02000 (share with a)
>>
>> 4. echo 0x13cd02000 > /sys/kernel/debug/apei/einj/param1
>> a.virtual addr = 0x79a172000000  phy_addr =0x13cc02000
>> b.virtual addr = 0x79a172032000  phy_addr =0x13cc02000 (share with a)
>> c.virtual addr = 0x79a172226000  phy_addr =0x13cc02000 (share with a)
>> d.virtual addr = 0x79a172320000  phy_addr =0x13cc02000 (share with a)
>>
>> 5. echo 0x13cc02000 > /sys/kernel/debug/apei/einj/param1
>> Bus error (core dumped)
>>
>> kernel-log:
>> mce: [Hardware Error]: Machine check events logged
>> ksm: recovery successful, no need to kill processes
>> Memory failure: 0x141802: recovery action for dirty LRU page: Recovered
>> Memory failure: 0x141802: recovery action for already poisoned page: 
>> Failed
>> ksm: recovery successful, no need to kill processes
>> Memory failure: 0x141902: recovery action for dirty LRU page: Recovered
>> Memory failure: 0x141902: recovery action for already poisoned page: 
>> Failed
>> ksm: recovery successful, no need to kill processes
>> Memory failure: 0x13cd02: recovery action for dirty LRU page: Recovered
>> Memory failure: 0x13cd02: recovery action for already poisoned page: 
>> Failed
>> Memory failure: 0x13cc02: recovery action for dirty LRU page: Recovered
>> Memory failure: 0x13cc02: recovery action for already poisoned page: 
>> Failed
>> MCE: Killing ksm_addr:5221 due to hardware memory corruption fault at 
>> 79a172000000
>>
>> ZERO PAGE TEST:
>> when I test in physical machine x86_64 CPU Intel(R) Xeon(R) Gold 6430
>> [shell]# ./einj.sh 0x193f908000
>> ./einj.sh: line 25: echo: write error: Address already in use
>>
>> when I test in qemu-x86_64.
>> Injecting memory failure at pfn 0x3a9d0c
>> Memory failure: 0x3a9d0c: unhandlable page.
>> Memory failure: 0x3a9d0c: recovery action for get hwpoison page: Ignored
>>
>> It seems return early before enter this patch's functions.
>>
>> Thanks for review and comments!
>>
>> Changes in v2:
>>
>> - Implemented a two-tier recovery strategy: preferring newly allocated
>>    pages over existing duplicates to avoid concentrating mappings on a
>>    single page suggested by David Hildenbrand
>
> I also asked how relevant this is in practice [1]
>
> "
> But how realistic do we consider that in practice? We need quite a bunch
> of processes to dedup the same page to end up getting duplicates in the
> chain IIRC.
>
> So isn't this rather an improvement only for less likely scenarios in
> practice?
> "
>
> In particular for your test "alloc 1024 page with same content".
>
> It certainly adds complexity, so we should clarify if this is really 
> worth it.
>
> [1] 
> https://lore.kernel.org/all/8c4d8ebe-885e-40f0-a10e-7290067c7b96@redhat.com/
>



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
  2025-10-16 10:18 ` [PATCH v2 1/1] " Longlong Xia
  2025-10-16 14:37   ` [PATCH v2] " Markus Elfring
  2025-10-17  3:09   ` [PATCH v2 1/1] " kernel test robot
@ 2025-10-23 11:54   ` Miaohe Lin
  2025-10-28  7:54     ` Long long Xia
  2025-10-28  9:44   ` David Hildenbrand
  3 siblings, 1 reply; 17+ messages in thread
From: Miaohe Lin @ 2025-10-23 11:54 UTC (permalink / raw)
  To: Longlong Xia
  Cc: markus.elfring, nao.horiguchi, akpm, wangkefeng.wang, qiuxu.zhuo,
	xu.xin16, linux-kernel, linux-mm, Longlong Xia, david,
	lance.yang

On 2025/10/16 18:18, Longlong Xia wrote:
> From: Longlong Xia <xialonglong@kylinos.cn>
> 
> When a hardware memory error occurs on a KSM page, the current
> behavior is to kill all processes mapping that page. This can
> be overly aggressive when KSM has multiple duplicate pages in
> a chain where other duplicates are still healthy.
> 
> This patch introduces a recovery mechanism that attempts to
> migrate mappings from the failing KSM page to a newly
> allocated KSM page or another healthy duplicate already
> present in the same chain, before falling back to the
> process-killing procedure.
> 
> The recovery process works as follows:
> 1. Identify if the failing KSM page belongs to a stable node chain.
> 2. Locate a healthy duplicate KSM page within the same chain.
> 3. For each process mapping the failing page:
>    a. Attempt to allocate a new KSM page copy from healthy duplicate
>       KSM page. If successful, migrate the mapping to this new KSM page.
>    b. If allocation fails, migrate the mapping to the existing healthy
>       duplicate KSM page.
> 4. If all migrations succeed, remove the failing KSM page from the chain.
> 5. Only if recovery fails (e.g., no healthy duplicate found or migration
>    error) does the kernel fall back to killing the affected processes.
> 
> Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>

Thanks for your patch. Some comments below.

> ---
>  mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 246 insertions(+)
> 
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 160787bb121c..9099bad1ab35 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>  }
>  
>  #ifdef CONFIG_MEMORY_FAILURE
> +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
> +{
> +	struct ksm_stable_node *stable_node, *dup;
> +	struct rb_node *node;
> +	int nid;
> +
> +	if (!is_stable_node_dup(dup_node))
> +		return NULL;
> +
> +	for (nid = 0; nid < ksm_nr_node_ids; nid++) {
> +		node = rb_first(root_stable_tree + nid);
> +		for (; node; node = rb_next(node)) {
> +			stable_node = rb_entry(node,
> +					struct ksm_stable_node,
> +					node);
> +
> +			if (!is_stable_node_chain(stable_node))
> +				continue;
> +
> +			hlist_for_each_entry(dup, &stable_node->hlist,
> +					hlist_dup) {
> +				if (dup == dup_node)
> +					return stable_node;
> +			}
> +		}
> +	}

Would the multiple loops above take a long time in some corner cases?

> +
> +	return NULL;
> +}
> +
> +static struct folio *find_healthy_folio(struct ksm_stable_node *chain_head,
> +		struct ksm_stable_node *failing_node,
> +		struct ksm_stable_node **healthy_dupdup)
> +{
> +	struct ksm_stable_node *dup;
> +	struct hlist_node *hlist_safe;
> +	struct folio *healthy_folio;
> +
> +	if (!is_stable_node_chain(chain_head) || !is_stable_node_dup(failing_node))
> +		return NULL;
> +
> +	hlist_for_each_entry_safe(dup, hlist_safe, &chain_head->hlist, hlist_dup) {
> +		if (dup == failing_node)
> +			continue;
> +
> +		healthy_folio = ksm_get_folio(dup, KSM_GET_FOLIO_TRYLOCK);
> +		if (healthy_folio) {
> +			*healthy_dupdup = dup;
> +			return healthy_folio;
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +static struct page *create_new_stable_node_dup(struct ksm_stable_node *chain_head,
> +		struct folio *healthy_folio,
> +		struct ksm_stable_node **new_stable_node)
> +{
> +	int nid;
> +	unsigned long kpfn;
> +	struct page *new_page = NULL;
> +
> +	if (!is_stable_node_chain(chain_head))
> +		return NULL;
> +
> +	new_page = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_ZERO);

Why is __GFP_ZERO needed?

> +	if (!new_page)
> +		return NULL;
> +
> +	copy_highpage(new_page, folio_page(healthy_folio, 0));
> +
> +	*new_stable_node = alloc_stable_node();
> +	if (!*new_stable_node) {
> +		__free_page(new_page);
> +		return NULL;
> +	}
> +
> +	INIT_HLIST_HEAD(&(*new_stable_node)->hlist);
> +	kpfn = page_to_pfn(new_page);
> +	(*new_stable_node)->kpfn = kpfn;
> +	nid = get_kpfn_nid(kpfn);
> +	DO_NUMA((*new_stable_node)->nid = nid);
> +	(*new_stable_node)->rmap_hlist_len = 0;
> +
> +	(*new_stable_node)->head = STABLE_NODE_DUP_HEAD;
> +	hlist_add_head(&(*new_stable_node)->hlist_dup, &chain_head->hlist);
> +	ksm_stable_node_dups++;
> +	folio_set_stable_node(page_folio(new_page), *new_stable_node);
> +	folio_add_lru(page_folio(new_page));
> +
> +	return new_page;
> +}
> +

...

> +
> +static void migrate_to_target_dup(struct ksm_stable_node *failing_node,
> +		struct folio *failing_folio,
> +		struct folio *target_folio,
> +		struct ksm_stable_node *target_dup)
> +{
> +	struct ksm_rmap_item *rmap_item;
> +	struct hlist_node *hlist_safe;
> +	int err;
> +
> +	hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
> +		struct mm_struct *mm = rmap_item->mm;
> +		unsigned long addr = rmap_item->address & PAGE_MASK;
> +		struct vm_area_struct *vma;
> +
> +		if (!mmap_read_trylock(mm))
> +			continue;
> +
> +		if (ksm_test_exit(mm)) {
> +			mmap_read_unlock(mm);
> +			continue;
> +		}
> +
> +		vma = vma_lookup(mm, addr);
> +		if (!vma) {
> +			mmap_read_unlock(mm);
> +			continue;
> +		}
> +
> +		if (!folio_trylock(target_folio)) {

Should we try to get the folio refcnt first?

> +			mmap_read_unlock(mm);
> +			continue;
> +		}
> +
> +		err = replace_failing_page(vma, &failing_folio->page,
> +				folio_page(target_folio, 0), addr);
> +		if (!err) {
> +			hlist_del(&rmap_item->hlist);
> +			rmap_item->head = target_dup;
> +			hlist_add_head(&rmap_item->hlist, &target_dup->hlist);
> +			target_dup->rmap_hlist_len++;
> +			failing_node->rmap_hlist_len--;
> +		}
> +
> +		folio_unlock(target_folio);
> +		mmap_read_unlock(mm);
> +	}
> +
> +}
> +
> +static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node)
> +{
> +	struct folio *failing_folio = NULL;
> +	struct ksm_stable_node *healthy_dupdup = NULL;
> +	struct folio *healthy_folio = NULL;
> +	struct ksm_stable_node *chain_head = NULL;
> +	struct page *new_page = NULL;
> +	struct ksm_stable_node *new_stable_node = NULL;
> +
> +	if (!is_stable_node_dup(failing_node))
> +		return false;
> +
> +	guard(mutex)(&ksm_thread_mutex);
> +	failing_folio = ksm_get_folio(failing_node, KSM_GET_FOLIO_NOLOCK);
> +	if (!failing_folio)
> +		return false;
> +
> +	chain_head = find_chain_head(failing_node);
> +	if (!chain_head)
> +		return NULL;

Should we folio_put(failing_folio) before return?

Thanks.
.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
  2025-10-21 14:00   ` Long long Xia
@ 2025-10-23 16:16     ` David Hildenbrand
  0 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-10-23 16:16 UTC (permalink / raw)
  To: Long long Xia, linmiaohe, lance.yang
  Cc: markus.elfring, nao.horiguchi, akpm, wangkefeng.wang, qiuxu.zhuo,
	xu.xin16, linux-kernel, linux-mm

On 21.10.25 16:00, Long long Xia wrote:
> Thanks for the reply.
> 
> I do some simple tests.
> I hope these findings are helpful for the community's review.
> 
> 1.Test VM
> Configuration
> Hardware: x86_64 QEMU VM, 1 vCPU, 256MB RAM per guest
> Kernel: 6.6.89
> 
> Testcase1: Single VM and enable KSM
> 
> - VM Memory Usage:
>     * RSS Total  = 275028 KB (268 MB)
>     * RSS Anon   = 253656 KB (247 MB)
>     * RSS File   = 21372 KB (20 MB)
>     * RSS Shmem  = 0 KB (0 MB)
> 
> a.Traverse the stable tree
> b. pages on the chain
> 2 chains detected
> Chain #1: 51 duplicates, 12,956 pages (~51 MB)
> Chain #2: 15 duplicates, 3,822 pages (~15 MB)
> Average: 8,389 pages per chain
> Sum: 16778 pages (64.6% of ksm_pages_sharing + ksm_pages_shared)
> c. pages on the chain
> Non-chain pages: 9,209 pages
> d.chain_count = 2, not_chain_count = 4200
> e.
> /sys/kernel/mm/ksm/ksm_pages_sharing = 21721
> /sys/kernel/mm/ksm/ksm_pages_shared = 4266
> /sys/kernel/mm/ksm/ksm_pages_unshared = 38098
> 
> 
> Testcase2: 10 VMs and enable KSM
> a.Traverse the stable tree
> b.Pages on the chain
> 8 chains detected
> Chain #1: 458 duplicates, 117,012 pages (~457 MB)
> Chain #2: 150 duplicates, 38,231 pages (~149 MB)
> Chain #3: 10 duplicates, 2,320 pages (~9 MB)
> Chain #4: 8 duplicates, 1,814 pages (~7 MB)
> Chain #5-8: 4, 3, 3, 2 duplicates (920, 720, 600, 260 pages)

Thanks, so I assume the top candidates are mostly zero pages and stuff
like that.

Makes sense to me then; it would be great to add that as motivation in
the cover letter!

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
  2025-10-23 11:54   ` Miaohe Lin
@ 2025-10-28  7:54     ` Long long Xia
  2025-10-29  6:40       ` Miaohe Lin
  0 siblings, 1 reply; 17+ messages in thread
From: Long long Xia @ 2025-10-28  7:54 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: markus.elfring, nao.horiguchi, akpm, wangkefeng.wang, qiuxu.zhuo,
	xu.xin16, linux-kernel, linux-mm, Longlong Xia, david,
	lance.yang

Thanks for the reply.

On 2025/10/23 19:54, Miaohe Lin wrote:
> On 2025/10/16 18:18, Longlong Xia wrote:
>> From: Longlong Xia <xialonglong@kylinos.cn>
>>
>> When a hardware memory error occurs on a KSM page, the current
>> behavior is to kill all processes mapping that page. This can
>> be overly aggressive when KSM has multiple duplicate pages in
>> a chain where other duplicates are still healthy.
>>
>> This patch introduces a recovery mechanism that attempts to
>> migrate mappings from the failing KSM page to a newly
>> allocated KSM page or another healthy duplicate already
>> present in the same chain, before falling back to the
>> process-killing procedure.
>>
>> The recovery process works as follows:
>> 1. Identify if the failing KSM page belongs to a stable node chain.
>> 2. Locate a healthy duplicate KSM page within the same chain.
>> 3. For each process mapping the failing page:
>>     a. Attempt to allocate a new KSM page copy from healthy duplicate
>>        KSM page. If successful, migrate the mapping to this new KSM page.
>>     b. If allocation fails, migrate the mapping to the existing healthy
>>        duplicate KSM page.
>> 4. If all migrations succeed, remove the failing KSM page from the chain.
>> 5. Only if recovery fails (e.g., no healthy duplicate found or migration
>>     error) does the kernel fall back to killing the affected processes.
>>
>> Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
> Thanks for your patch. Some comments below.
>
>> ---
>>   mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 246 insertions(+)
>>
>> diff --git a/mm/ksm.c b/mm/ksm.c
>> index 160787bb121c..9099bad1ab35 100644
>> --- a/mm/ksm.c
>> +++ b/mm/ksm.c
>> @@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>>   }
>>   
>>   #ifdef CONFIG_MEMORY_FAILURE
>> +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
>> +{
>> +	struct ksm_stable_node *stable_node, *dup;
>> +	struct rb_node *node;
>> +	int nid;
>> +
>> +	if (!is_stable_node_dup(dup_node))
>> +		return NULL;
>> +
>> +	for (nid = 0; nid < ksm_nr_node_ids; nid++) {
>> +		node = rb_first(root_stable_tree + nid);
>> +		for (; node; node = rb_next(node)) {
>> +			stable_node = rb_entry(node,
>> +					struct ksm_stable_node,
>> +					node);
>> +
>> +			if (!is_stable_node_chain(stable_node))
>> +				continue;
>> +
>> +			hlist_for_each_entry(dup, &stable_node->hlist,
>> +					hlist_dup) {
>> +				if (dup == dup_node)
>> +					return stable_node;
>> +			}
>> +		}
>> +	}
> Would the multiple loops above take a long time in some corner cases?

Thanks for the concern.

I did some simple tests.

Test 1: 10 Virtual Machines (Real-world Scenario)
Environment: 10 VMs (256MB each) with KSM enabled

KSM State:
pages_sharing: 262,802 (≈1GB)
pages_shared: 17,374 (≈68MB)
pages_unshared = 124,057 (≈485MB)
total ≈1.5GB
chain_count = 9, not_chain_count = 17152
Red-black tree nodes to traverse:
17,161 (9 chains + 17,152 non-chains)

Performance:
find_chain: 898 μs (0.9 ms)
collect_procs_ksm: 4,409 μs (4.4 ms)
Total memory failure handling: 6,135 μs (6.1 ms)


Test 2: 10GB Single Process (Extreme Case)
Environment: Single process with 10GB memory,
1,310,720 page pairs (each pair identical, different from others)

KSM State:
pages_sharing: 1,311,740 (≈5GB)
pages_shared: 1,310,724 (≈5GB)
pages_unshared = 0
total ≈10GB
Red-black tree nodes to traverse:
1,310,721 (1 chain + 1,310,720 non-chains)

Performance:
find_chain: 28,822 μs (28.8 ms)
collect_procs_ksm: 45,944 μs (45.9 ms)
Total memory failure handling: 46,594 μs (46.6 ms)

Summary:
The find_chain function shows approximately linear scaling with the
number of red-black tree nodes. With a 76x increase in nodes
(17,161 → 1,310,721), latency increased by 32x (898 μs → 28,822 μs),
representing 62% of the total memory-failure handling time (46.6 ms).
However, since memory failures are rare events, this latency may be
acceptable: it does not impact normal system performance and only
affects the error recovery path.
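
(For reference, timings like these can be collected with simple ktime-based
instrumentation around the calls; the snippet below is only illustrative and
is not part of the patch:)

	ktime_t t0 = ktime_get();

	chain_head = find_chain_head(failing_node);
	pr_info("ksm: find_chain took %lld us\n",
		ktime_us_delta(ktime_get(), t0));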


>> +
>> +	return NULL;
>> +}
>> +
>> +static struct folio *find_healthy_folio(struct ksm_stable_node *chain_head,
>> +		struct ksm_stable_node *failing_node,
>> +		struct ksm_stable_node **healthy_dupdup)
>> +{
>> +	struct ksm_stable_node *dup;
>> +	struct hlist_node *hlist_safe;
>> +	struct folio *healthy_folio;
>> +
>> +	if (!is_stable_node_chain(chain_head) || !is_stable_node_dup(failing_node))
>> +		return NULL;
>> +
>> +	hlist_for_each_entry_safe(dup, hlist_safe, &chain_head->hlist, hlist_dup) {
>> +		if (dup == failing_node)
>> +			continue;
>> +
>> +		healthy_folio = ksm_get_folio(dup, KSM_GET_FOLIO_TRYLOCK);
>> +		if (healthy_folio) {
>> +			*healthy_dupdup = dup;
>> +			return healthy_folio;
>> +		}
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>> +static struct page *create_new_stable_node_dup(struct ksm_stable_node *chain_head,
>> +		struct folio *healthy_folio,
>> +		struct ksm_stable_node **new_stable_node)
>> +{
>> +	int nid;
>> +	unsigned long kpfn;
>> +	struct page *new_page = NULL;
>> +
>> +	if (!is_stable_node_chain(chain_head))
>> +		return NULL;
>> +
>> +	new_page = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_ZERO);
> Why is __GFP_ZERO needed?
Thanks for pointing this out. I'll remove it.

>> +	if (!new_page)
>> +		return NULL;
>> +
>> +	copy_highpage(new_page, folio_page(healthy_folio, 0));
>> +
>> +	*new_stable_node = alloc_stable_node();
>> +	if (!*new_stable_node) {
>> +		__free_page(new_page);
>> +		return NULL;
>> +	}
>> +
>> +	INIT_HLIST_HEAD(&(*new_stable_node)->hlist);
>> +	kpfn = page_to_pfn(new_page);
>> +	(*new_stable_node)->kpfn = kpfn;
>> +	nid = get_kpfn_nid(kpfn);
>> +	DO_NUMA((*new_stable_node)->nid = nid);
>> +	(*new_stable_node)->rmap_hlist_len = 0;
>> +
>> +	(*new_stable_node)->head = STABLE_NODE_DUP_HEAD;
>> +	hlist_add_head(&(*new_stable_node)->hlist_dup, &chain_head->hlist);
>> +	ksm_stable_node_dups++;
>> +	folio_set_stable_node(page_folio(new_page), *new_stable_node);
>> +	folio_add_lru(page_folio(new_page));
>> +
>> +	return new_page;
>> +}
>> +
> ...
>
>> +
>> +static void migrate_to_target_dup(struct ksm_stable_node *failing_node,
>> +		struct folio *failing_folio,
>> +		struct folio *target_folio,
>> +		struct ksm_stable_node *target_dup)
>> +{
>> +	struct ksm_rmap_item *rmap_item;
>> +	struct hlist_node *hlist_safe;
>> +	int err;
>> +
>> +	hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
>> +		struct mm_struct *mm = rmap_item->mm;
>> +		unsigned long addr = rmap_item->address & PAGE_MASK;
>> +		struct vm_area_struct *vma;
>> +
>> +		if (!mmap_read_trylock(mm))
>> +			continue;
>> +
>> +		if (ksm_test_exit(mm)) {
>> +			mmap_read_unlock(mm);
>> +			continue;
>> +		}
>> +
>> +		vma = vma_lookup(mm, addr);
>> +		if (!vma) {
>> +			mmap_read_unlock(mm);
>> +			continue;
>> +		}
>> +
>> +		if (!folio_trylock(target_folio)) {
> Should we try to get the folio refcnt first?

Thanks for pointing this out. I'll fix it.
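
A possible shape of the fix (only a sketch; it assumes pinning the folio
with folio_try_get() before the trylock so it cannot be freed under us):

	/* pin the target folio before trying to lock it */
	if (!folio_try_get(target_folio)) {
		mmap_read_unlock(mm);
		continue;
	}

	if (!folio_trylock(target_folio)) {
		folio_put(target_folio);
		mmap_read_unlock(mm);
		continue;
	}

	/* ... replace the mapping ... */

	folio_unlock(target_folio);
	folio_put(target_folio);
	mmap_read_unlock(mm);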

>> +			mmap_read_unlock(mm);
>> +			continue;
>> +		}
>> +
>> +		err = replace_failing_page(vma, &failing_folio->page,
>> +				folio_page(target_folio, 0), addr);
>> +		if (!err) {
>> +			hlist_del(&rmap_item->hlist);
>> +			rmap_item->head = target_dup;
>> +			hlist_add_head(&rmap_item->hlist, &target_dup->hlist);
>> +			target_dup->rmap_hlist_len++;
>> +			failing_node->rmap_hlist_len--;
>> +		}
>> +
>> +		folio_unlock(target_folio);
>> +		mmap_read_unlock(mm);
>> +	}
>> +
>> +}
>> +
>> +static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node)
>> +{
>> +	struct folio *failing_folio = NULL;
>> +	struct ksm_stable_node *healthy_dupdup = NULL;
>> +	struct folio *healthy_folio = NULL;
>> +	struct ksm_stable_node *chain_head = NULL;
>> +	struct page *new_page = NULL;
>> +	struct ksm_stable_node *new_stable_node = NULL;
>> +
>> +	if (!is_stable_node_dup(failing_node))
>> +		return false;
>> +
>> +	guard(mutex)(&ksm_thread_mutex);
>> +	failing_folio = ksm_get_folio(failing_node, KSM_GET_FOLIO_NOLOCK);
>> +	if (!failing_folio)
>> +		return false;
>> +
>> +	chain_head = find_chain_head(failing_node);
>> +	if (!chain_head)
>> +		return NULL;
> Should we folio_put(failing_folio) before return?

Thanks for pointing this out. I'll fix it.
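
Concretely, something like this (a sketch of the corrected error path;
it also returns false rather than NULL, since the function returns bool):

	chain_head = find_chain_head(failing_node);
	if (!chain_head) {
		/* drop the reference taken by ksm_get_folio() */
		folio_put(failing_folio);
		return false;
	}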

> Thanks.
> .
Best regards,
Longlong Xia





^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
  2025-10-16 10:18 ` [PATCH v2 1/1] " Longlong Xia
                     ` (2 preceding siblings ...)
  2025-10-23 11:54   ` Miaohe Lin
@ 2025-10-28  9:44   ` David Hildenbrand
  2025-11-03 15:15     ` [PATCH v3 0/2] mm/ksm: try " Longlong Xia
  3 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2025-10-28  9:44 UTC (permalink / raw)
  To: Longlong Xia, linmiaohe, lance.yang
  Cc: markus.elfring, nao.horiguchi, akpm, wangkefeng.wang, qiuxu.zhuo,
	xu.xin16, linux-kernel, linux-mm, Longlong Xia

On 16.10.25 12:18, Longlong Xia wrote:
> From: Longlong Xia <xialonglong@kylinos.cn>
> 
> When a hardware memory error occurs on a KSM page, the current
> behavior is to kill all processes mapping that page. This can
> be overly aggressive when KSM has multiple duplicate pages in
> a chain where other duplicates are still healthy.
> 
> This patch introduces a recovery mechanism that attempts to
> migrate mappings from the failing KSM page to a newly
> allocated KSM page or another healthy duplicate already
> present in the same chain, before falling back to the
> process-killing procedure.
> 
> The recovery process works as follows:
> 1. Identify if the failing KSM page belongs to a stable node chain.
> 2. Locate a healthy duplicate KSM page within the same chain.
> 3. For each process mapping the failing page:
>     a. Attempt to allocate a new KSM page copy from healthy duplicate
>        KSM page. If successful, migrate the mapping to this new KSM page.
>     b. If allocation fails, migrate the mapping to the existing healthy
>        duplicate KSM page.
> 4. If all migrations succeed, remove the failing KSM page from the chain.
> 5. Only if recovery fails (e.g., no healthy duplicate found or migration
>     error) does the kernel fall back to killing the affected processes.
> 
> Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
> ---
>   mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 246 insertions(+)
> 
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 160787bb121c..9099bad1ab35 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>   }
>   
>   #ifdef CONFIG_MEMORY_FAILURE
> +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
> +{
> +	struct ksm_stable_node *stable_node, *dup;
> +	struct rb_node *node;
> +	int nid;
> +
> +	if (!is_stable_node_dup(dup_node))
> +		return NULL;
> +
> +	for (nid = 0; nid < ksm_nr_node_ids; nid++) {
> +		node = rb_first(root_stable_tree + nid);
> +		for (; node; node = rb_next(node)) {
> +			stable_node = rb_entry(node,
> +					struct ksm_stable_node,
> +					node);

Put that into a single line for readability, please.

You can also consider factoring out this inner loop into a helper function.

> +
> +			if (!is_stable_node_chain(stable_node))
> +				continue;
> +
> +			hlist_for_each_entry(dup, &stable_node->hlist,
> +					hlist_dup) {

Single line, or properly indent.

> +				if (dup == dup_node)
> +					return stable_node;
> +			}
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +static struct folio *find_healthy_folio(struct ksm_stable_node *chain_head,
> +		struct ksm_stable_node *failing_node,
> +		struct ksm_stable_node **healthy_dupdup)
> +{
> +	struct ksm_stable_node *dup;
> +	struct hlist_node *hlist_safe;
> +	struct folio *healthy_folio;
> +
> +	if (!is_stable_node_chain(chain_head) || !is_stable_node_dup(failing_node))
> +		return NULL;
> +
> +	hlist_for_each_entry_safe(dup, hlist_safe, &chain_head->hlist, hlist_dup) {
> +		if (dup == failing_node)
> +			continue;
> +
> +		healthy_folio = ksm_get_folio(dup, KSM_GET_FOLIO_TRYLOCK);
> +		if (healthy_folio) {
> +			*healthy_dupdup = dup;
> +			return healthy_folio;
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +static struct page *create_new_stable_node_dup(struct ksm_stable_node *chain_head,
> +		struct folio *healthy_folio,
> +		struct ksm_stable_node **new_stable_node)
> +{
> +	int nid;
> +	unsigned long kpfn;
> +	struct page *new_page = NULL;
> +
> +	if (!is_stable_node_chain(chain_head))
> +		return NULL;
> +
> +	new_page = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_ZERO);
> +	if (!new_page)
> +		return NULL;
> +
> +	copy_highpage(new_page, folio_page(healthy_folio, 0));
> +
> +	*new_stable_node = alloc_stable_node();
> +	if (!*new_stable_node) {
> +		__free_page(new_page);
> +		return NULL;
> +	}
> +
> +	INIT_HLIST_HEAD(&(*new_stable_node)->hlist);
> +	kpfn = page_to_pfn(new_page);
> +	(*new_stable_node)->kpfn = kpfn;
> +	nid = get_kpfn_nid(kpfn);
> +	DO_NUMA((*new_stable_node)->nid = nid);
> +	(*new_stable_node)->rmap_hlist_len = 0;
> +
> +	(*new_stable_node)->head = STABLE_NODE_DUP_HEAD;
> +	hlist_add_head(&(*new_stable_node)->hlist_dup, &chain_head->hlist);
> +	ksm_stable_node_dups++;
> +	folio_set_stable_node(page_folio(new_page), *new_stable_node);
> +	folio_add_lru(page_folio(new_page));

There seems to be a lot of copy-paste. For example, why not reuse 
stable_node_chain_add_dup()?

Or why not try to reuse stable_tree_insert() in the first place?

Try to reuse or factor out instead of copy-pasting, please.

> +
> +	return new_page;
> +}
> +
> +static int replace_failing_page(struct vm_area_struct *vma, struct page *page,
> +		struct page *kpage, unsigned long addr)
> +{
> +	struct folio *kfolio = page_folio(kpage);
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct folio *folio = page_folio(page);
> +	pmd_t *pmd;
> +	pte_t *ptep;
> +	pte_t newpte;
> +	spinlock_t *ptl;
> +	int err = -EFAULT;
> +	struct mmu_notifier_range range;
> +
> +	pmd = mm_find_pmd(mm, addr);
> +	if (!pmd)
> +		goto out;
> +
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
> +			addr + PAGE_SIZE);
> +	mmu_notifier_invalidate_range_start(&range);
> +
> +	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> +	if (!ptep)
> +		goto out_mn;
> +
> +	folio_get(kfolio);
> +	folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
> +	newpte = mk_pte(kpage, vma->vm_page_prot);
> +
> +	flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep)));
> +	ptep_clear_flush(vma, addr, ptep);
> +	set_pte_at(mm, addr, ptep, newpte);
> +
> +	folio_remove_rmap_pte(folio, page, vma);
> +	if (!folio_mapped(folio))
> +		folio_free_swap(folio);
> +	folio_put(folio);
> +
> +	pte_unmap_unlock(ptep, ptl);
> +	err = 0;
> +out_mn:
> +	mmu_notifier_invalidate_range_end(&range);
> +out:
> +	return err;
> +}

This is a lot of copy-paste from replace_page(). Isn't there a way to 
avoid this duplication by unifying both functions in some way?

> +
> +static void migrate_to_target_dup(struct ksm_stable_node *failing_node,
> +		struct folio *failing_folio,
> +		struct folio *target_folio,
> +		struct ksm_stable_node *target_dup)
> +{
> +	struct ksm_rmap_item *rmap_item;
> +	struct hlist_node *hlist_safe;
> +	int err;
> +
> +	hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
> +		struct mm_struct *mm = rmap_item->mm;
> +		unsigned long addr = rmap_item->address & PAGE_MASK;

Can be const.

> +		struct vm_area_struct *vma;
> +
> +		if (!mmap_read_trylock(mm))
> +			continue;
> +
> +		if (ksm_test_exit(mm)) {
> +			mmap_read_unlock(mm);
> +			continue;
> +		}
> +
> +		vma = vma_lookup(mm, addr);
> +		if (!vma) {
> +			mmap_read_unlock(mm);
> +			continue;
> +		}
> +
> +		if (!folio_trylock(target_folio)) {

Can't we leave the target folio locked the whole time? The caller 
already locked it, why not keep it locked until we're done?

> +			mmap_read_unlock(mm);
> +			continue;
> +		}
> +
> +		err = replace_failing_page(vma, &failing_folio->page,
> +				folio_page(target_folio, 0), addr);
> +		if (!err) {
> +			hlist_del(&rmap_item->hlist);
> +			rmap_item->head = target_dup;
> +			hlist_add_head(&rmap_item->hlist, &target_dup->hlist);
> +			target_dup->rmap_hlist_len++;
> +			failing_node->rmap_hlist_len--;
> +		}
> +
> +		folio_unlock(target_folio);
> +		mmap_read_unlock(mm);
> +	}
> +
> +}
> +
> +static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node)
> +{
> +	struct folio *failing_folio = NULL;
> +	struct ksm_stable_node *healthy_dupdup = NULL;
> +	struct folio *healthy_folio = NULL;
> +	struct ksm_stable_node *chain_head = NULL;
> +	struct page *new_page = NULL;
> +	struct ksm_stable_node *new_stable_node = NULL;

Only initialize what needs initialization (nothing in here?) and combine 
where possible.

Like

	struct folio *failing_folio, *healthy_folio;


> +
> +	if (!is_stable_node_dup(failing_node))
> +		return false;
> +
> +	guard(mutex)(&ksm_thread_mutex);
> +	failing_folio = ksm_get_folio(failing_node, KSM_GET_FOLIO_NOLOCK);
> +	if (!failing_folio)
> +		return false;
> +
> +	chain_head = find_chain_head(failing_node);
> +	if (!chain_head)
> +		return NULL;
> +
> +	healthy_folio = find_healthy_folio(chain_head, failing_node, &healthy_dupdup);
> +	if (!healthy_folio) {
> +		folio_put(failing_folio);
> +		return false;
> +	}
> +
> +	new_page = create_new_stable_node_dup(chain_head, healthy_folio, &new_stable_node);

Why are you returning a page here and not a folio?


-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
  2025-10-28  7:54     ` Long long Xia
@ 2025-10-29  6:40       ` Miaohe Lin
  2025-10-29  7:12         ` Long long Xia
  0 siblings, 1 reply; 17+ messages in thread
From: Miaohe Lin @ 2025-10-29  6:40 UTC (permalink / raw)
  To: Long long Xia
  Cc: markus.elfring, nao.horiguchi, akpm, wangkefeng.wang, qiuxu.zhuo,
	xu.xin16, linux-kernel, linux-mm, Longlong Xia, david,
	lance.yang

On 2025/10/28 15:54, Long long Xia wrote:
> Thanks for the reply.
> 
> On 2025/10/23 19:54, Miaohe Lin wrote:
>> On 2025/10/16 18:18, Longlong Xia wrote:
>>> From: Longlong Xia <xialonglong@kylinos.cn>
>>>
>>> When a hardware memory error occurs on a KSM page, the current
>>> behavior is to kill all processes mapping that page. This can
>>> be overly aggressive when KSM has multiple duplicate pages in
>>> a chain where other duplicates are still healthy.
>>>
>>> This patch introduces a recovery mechanism that attempts to
>>> migrate mappings from the failing KSM page to a newly
>>> allocated KSM page or another healthy duplicate already
>>> present in the same chain, before falling back to the
>>> process-killing procedure.
>>>
>>> The recovery process works as follows:
>>> 1. Identify if the failing KSM page belongs to a stable node chain.
>>> 2. Locate a healthy duplicate KSM page within the same chain.
>>> 3. For each process mapping the failing page:
>>>     a. Attempt to allocate a new KSM page copy from healthy duplicate
>>>        KSM page. If successful, migrate the mapping to this new KSM page.
>>>     b. If allocation fails, migrate the mapping to the existing healthy
>>>        duplicate KSM page.
>>> 4. If all migrations succeed, remove the failing KSM page from the chain.
>>> 5. Only if recovery fails (e.g., no healthy duplicate found or migration
>>>     error) does the kernel fall back to killing the affected processes.
>>>
>>> Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
>> Thanks for your patch. Some comments below.
>>
>>> ---
>>>   mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 246 insertions(+)
>>>
>>> diff --git a/mm/ksm.c b/mm/ksm.c
>>> index 160787bb121c..9099bad1ab35 100644
>>> --- a/mm/ksm.c
>>> +++ b/mm/ksm.c
>>> @@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>>>   }
>>>     #ifdef CONFIG_MEMORY_FAILURE
>>> +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
>>> +{
>>> +    struct ksm_stable_node *stable_node, *dup;
>>> +    struct rb_node *node;
>>> +    int nid;
>>> +
>>> +    if (!is_stable_node_dup(dup_node))
>>> +        return NULL;
>>> +
>>> +    for (nid = 0; nid < ksm_nr_node_ids; nid++) {
>>> +        node = rb_first(root_stable_tree + nid);
>>> +        for (; node; node = rb_next(node)) {
>>> +            stable_node = rb_entry(node,
>>> +                    struct ksm_stable_node,
>>> +                    node);
>>> +
>>> +            if (!is_stable_node_chain(stable_node))
>>> +                continue;
>>> +
>>> +            hlist_for_each_entry(dup, &stable_node->hlist,
>>> +                    hlist_dup) {
>>> +                if (dup == dup_node)
>>> +                    return stable_node;
>>> +            }
>>> +        }
>>> +    }
>> Would the above multiple loops take a long time in some corner cases?
> 
> Thanks for the concern.
> 
> I did some simple tests.
> 
> Test 1: 10 Virtual Machines (Real-world Scenario)
> Environment: 10 VMs (256MB each) with KSM enabled
> 
> KSM State:
> pages_sharing: 262,802 (≈1GB)
> pages_shared: 17,374 (≈68MB)
> pages_unshared = 124,057 (≈485MB)
> total ≈1.5GB
> chain_count = 9, not_chain_count = 17152
> Red-black tree nodes to traverse:
> 17,161 (9 chains + 17,152 non-chains)
> 
> Performance:
> find_chain: 898 μs (0.9 ms)
> collect_procs_ksm: 4,409 μs (4.4 ms)
> Total memory failure handling: 6,135 μs (6.1 ms)
> 
> 
> Test 2: 10GB Single Process (Extreme Case)
> Environment: Single process with 10GB memory,
> 1,310,720 page pairs (each pair identical, different from others)
> 
> KSM State:
> pages_sharing: 1,311,740 (≈5GB)
> pages_shared: 1,310,724 (≈5GB)
> pages_unshared = 0
> total ≈10GB
> Red-black tree nodes to traverse:
> 1,310,721 (1 chain + 1,310,720 non-chains)
> 
> Performance:
> find_chain: 28,822 μs (28.8 ms)
> collect_procs_ksm: 45,944 μs (45.9 ms)
> Total memory failure handling: 46,594 μs (46.6 ms)

Thanks for your test.

> 
> Summary:
> The find_chain function shows approximately linear scaling with the number of red-black tree nodes.
> With a 76x increase in nodes (17,161 → 1,310,721), latency increased by 32x (898 μs → 28,822 μs),
> with find_chain accounting for 62% of the total memory failure handling time (46.6 ms).
> However, since memory failures are rare events, this latency may be acceptable
> as it does not impact normal system performance and only affects error recovery paths.
> 

IMHO, the execution time of a kernel function must not be too long without any scheduling points.
Otherwise it may affect the normal scheduling of the system and lead to something like performance
fluctuation. Or am I missing something?

Thanks.
.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
  2025-10-29  6:40       ` Miaohe Lin
@ 2025-10-29  7:12         ` Long long Xia
  2025-10-30  2:56           ` Miaohe Lin
  0 siblings, 1 reply; 17+ messages in thread
From: Long long Xia @ 2025-10-29  7:12 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: markus.elfring, nao.horiguchi, akpm, wangkefeng.wang, qiuxu.zhuo,
	xu.xin16, linux-kernel, linux-mm, Longlong Xia, david,
	lance.yang

Thanks for the reply.


On 2025/10/29 14:40, Miaohe Lin wrote:
> On 2025/10/28 15:54, Long long Xia wrote:
>> Thanks for the reply.
>>
> On 2025/10/23 19:54, Miaohe Lin wrote:
>>> On 2025/10/16 18:18, Longlong Xia wrote:
>>>> From: Longlong Xia <xialonglong@kylinos.cn>
>>>>
>>>> When a hardware memory error occurs on a KSM page, the current
>>>> behavior is to kill all processes mapping that page. This can
>>>> be overly aggressive when KSM has multiple duplicate pages in
>>>> a chain where other duplicates are still healthy.
>>>>
>>>> This patch introduces a recovery mechanism that attempts to
>>>> migrate mappings from the failing KSM page to a newly
>>>> allocated KSM page or another healthy duplicate already
>>>> present in the same chain, before falling back to the
>>>> process-killing procedure.
>>>>
>>>> The recovery process works as follows:
>>>> 1. Identify if the failing KSM page belongs to a stable node chain.
>>>> 2. Locate a healthy duplicate KSM page within the same chain.
>>>> 3. For each process mapping the failing page:
>>>>      a. Attempt to allocate a new KSM page copy from healthy duplicate
>>>>         KSM page. If successful, migrate the mapping to this new KSM page.
>>>>      b. If allocation fails, migrate the mapping to the existing healthy
>>>>         duplicate KSM page.
>>>> 4. If all migrations succeed, remove the failing KSM page from the chain.
>>>> 5. Only if recovery fails (e.g., no healthy duplicate found or migration
>>>>      error) does the kernel fall back to killing the affected processes.
>>>>
>>>> Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
>>> Thanks for your patch. Some comments below.
>>>
>>>> ---
>>>>    mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 246 insertions(+)
>>>>
>>>> diff --git a/mm/ksm.c b/mm/ksm.c
>>>> index 160787bb121c..9099bad1ab35 100644
>>>> --- a/mm/ksm.c
>>>> +++ b/mm/ksm.c
>>>> @@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>>>>    }
>>>>      #ifdef CONFIG_MEMORY_FAILURE
>>>> +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
>>>> +{
>>>> +    struct ksm_stable_node *stable_node, *dup;
>>>> +    struct rb_node *node;
>>>> +    int nid;
>>>> +
>>>> +    if (!is_stable_node_dup(dup_node))
>>>> +        return NULL;
>>>> +
>>>> +    for (nid = 0; nid < ksm_nr_node_ids; nid++) {
>>>> +        node = rb_first(root_stable_tree + nid);
>>>> +        for (; node; node = rb_next(node)) {
>>>> +            stable_node = rb_entry(node,
>>>> +                    struct ksm_stable_node,
>>>> +                    node);
>>>> +
>>>> +            if (!is_stable_node_chain(stable_node))
>>>> +                continue;
>>>> +
>>>> +            hlist_for_each_entry(dup, &stable_node->hlist,
>>>> +                    hlist_dup) {
>>>> +                if (dup == dup_node)
>>>> +                    return stable_node;
>>>> +            }
May I add cond_resched(); here?
>>>> +        }
>>>> +    }
>>> Would the above multiple loops take a long time in some corner cases?
>> Thanks for the concern.
>>
>> I did some simple tests.
>>
>> Test 1: 10 Virtual Machines (Real-world Scenario)
>> Environment: 10 VMs (256MB each) with KSM enabled
>>
>> KSM State:
>> pages_sharing: 262,802 (≈1GB)
>> pages_shared: 17,374 (≈68MB)
>> pages_unshared = 124,057 (≈485MB)
>> total ≈1.5GB
>> chain_count = 9, not_chain_count = 17152
>> Red-black tree nodes to traverse:
>> 17,161 (9 chains + 17,152 non-chains)
>>
>> Performance:
>> find_chain: 898 μs (0.9 ms)
>> collect_procs_ksm: 4,409 μs (4.4 ms)
>> Total memory failure handling: 6,135 μs (6.1 ms)
>>
>>
>> Test 2: 10GB Single Process (Extreme Case)
>> Environment: Single process with 10GB memory,
>> 1,310,720 page pairs (each pair identical, different from others)
>>
>> KSM State:
>> pages_sharing: 1,311,740 (≈5GB)
>> pages_shared: 1,310,724 (≈5GB)
>> pages_unshared = 0
>> total ≈10GB
>> Red-black tree nodes to traverse:
>> 1,310,721 (1 chain + 1,310,720 non-chains)
>>
>> Performance:
>> find_chain: 28,822 μs (28.8 ms)
>> collect_procs_ksm: 45,944 μs (45.9 ms)
>> Total memory failure handling: 46,594 μs (46.6 ms)
> Thanks for your test.
>
>> Summary:
>> The find_chain function shows approximately linear scaling with the number of red-black tree nodes.
>> With a 76x increase in nodes (17,161 → 1,310,721), latency increased by 32x (898 μs → 28,822 μs),
>> with find_chain accounting for 62% of the total memory failure handling time (46.6 ms).
>> However, since memory failures are rare events, this latency may be acceptable
>> as it does not impact normal system performance and only affects error recovery paths.
>>
> IMHO, the execution time of a kernel function must not be too long without any scheduling points.
> Otherwise it may affect the normal scheduling of the system and lead to something like performance
> fluctuation. Or am I missing something?
>
> Thanks.
> .

I will add cond_resched() in the loop over the red-black tree to allow 
scheduling in find_chain(); maybe that is enough?
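
Roughly like this (just a sketch of where the scheduling point would go):

	for (node = rb_first(root_stable_tree + nid); node; node = rb_next(node)) {
		stable_node = rb_entry(node, struct ksm_stable_node, node);

		if (!is_stable_node_chain(stable_node))
			continue;

		hlist_for_each_entry(dup, &stable_node->hlist, hlist_dup) {
			if (dup == dup_node)
				return stable_node;
		}

		/* allow rescheduling between stable tree nodes */
		cond_resched();
	}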

Best regards,
Longlong Xia



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
  2025-10-29  7:12         ` Long long Xia
@ 2025-10-30  2:56           ` Miaohe Lin
  0 siblings, 0 replies; 17+ messages in thread
From: Miaohe Lin @ 2025-10-30  2:56 UTC (permalink / raw)
  To: Long long Xia
  Cc: markus.elfring, nao.horiguchi, akpm, wangkefeng.wang, qiuxu.zhuo,
	xu.xin16, linux-kernel, linux-mm, Longlong Xia, david,
	lance.yang

On 2025/10/29 15:12, Long long Xia wrote:
> Thanks for the reply.
> 
> 
On 2025/10/29 14:40, Miaohe Lin wrote:
>> On 2025/10/28 15:54, Long long Xia wrote:
>>> Thanks for the reply.
>>>
>> On 2025/10/23 19:54, Miaohe Lin wrote:
>>>> On 2025/10/16 18:18, Longlong Xia wrote:
>>>>> From: Longlong Xia <xialonglong@kylinos.cn>
>>>>>
>>>>> When a hardware memory error occurs on a KSM page, the current
>>>>> behavior is to kill all processes mapping that page. This can
>>>>> be overly aggressive when KSM has multiple duplicate pages in
>>>>> a chain where other duplicates are still healthy.
>>>>>
>>>>> This patch introduces a recovery mechanism that attempts to
>>>>> migrate mappings from the failing KSM page to a newly
>>>>> allocated KSM page or another healthy duplicate already
>>>>> present in the same chain, before falling back to the
>>>>> process-killing procedure.
>>>>>
>>>>> The recovery process works as follows:
>>>>> 1. Identify if the failing KSM page belongs to a stable node chain.
>>>>> 2. Locate a healthy duplicate KSM page within the same chain.
>>>>> 3. For each process mapping the failing page:
>>>>>      a. Attempt to allocate a new KSM page copy from healthy duplicate
>>>>>         KSM page. If successful, migrate the mapping to this new KSM page.
>>>>>      b. If allocation fails, migrate the mapping to the existing healthy
>>>>>         duplicate KSM page.
>>>>> 4. If all migrations succeed, remove the failing KSM page from the chain.
>>>>> 5. Only if recovery fails (e.g., no healthy duplicate found or migration
>>>>>      error) does the kernel fall back to killing the affected processes.
>>>>>
>>>>> Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
>>>> Thanks for your patch. Some comments below.
>>>>
>>>>> ---
>>>>>    mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>    1 file changed, 246 insertions(+)
>>>>>
>>>>> diff --git a/mm/ksm.c b/mm/ksm.c
>>>>> index 160787bb121c..9099bad1ab35 100644
>>>>> --- a/mm/ksm.c
>>>>> +++ b/mm/ksm.c
>>>>> @@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>>>>>    }
>>>>>      #ifdef CONFIG_MEMORY_FAILURE
>>>>> +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
>>>>> +{
>>>>> +    struct ksm_stable_node *stable_node, *dup;
>>>>> +    struct rb_node *node;
>>>>> +    int nid;
>>>>> +
>>>>> +    if (!is_stable_node_dup(dup_node))
>>>>> +        return NULL;
>>>>> +
>>>>> +    for (nid = 0; nid < ksm_nr_node_ids; nid++) {
>>>>> +        node = rb_first(root_stable_tree + nid);
>>>>> +        for (; node; node = rb_next(node)) {
>>>>> +            stable_node = rb_entry(node,
>>>>> +                    struct ksm_stable_node,
>>>>> +                    node);
>>>>> +
>>>>> +            if (!is_stable_node_chain(stable_node))
>>>>> +                continue;
>>>>> +
>>>>> +            hlist_for_each_entry(dup, &stable_node->hlist,
>>>>> +                    hlist_dup) {
>>>>> +                if (dup == dup_node)
>>>>> +                    return stable_node;
>>>>> +            }
> May I add cond_resched(); here?
>>>>> +        }
>>>>> +    }
>>>> Would the above multiple loops take a long time in some corner cases?
>>> Thanks for the concern.
>>>
>>> I did some simple tests.
>>>
>>> Test 1: 10 Virtual Machines (Real-world Scenario)
>>> Environment: 10 VMs (256MB each) with KSM enabled
>>>
>>> KSM State:
>>> pages_sharing: 262,802 (≈1GB)
>>> pages_shared: 17,374 (≈68MB)
>>> pages_unshared = 124,057 (≈485MB)
>>> total ≈1.5GB
>>> chain_count = 9, not_chain_count = 17152
>>> Red-black tree nodes to traverse:
>>> 17,161 (9 chains + 17,152 non-chains)
>>>
>>> Performance:
>>> find_chain: 898 μs (0.9 ms)
>>> collect_procs_ksm: 4,409 μs (4.4 ms)
>>> Total memory failure handling: 6,135 μs (6.1 ms)
>>>
>>>
>>> Test 2: 10GB Single Process (Extreme Case)
>>> Environment: Single process with 10GB memory,
>>> 1,310,720 page pairs (each pair identical, different from others)
>>>
>>> KSM State:
>>> pages_sharing: 1,311,740 (≈5GB)
>>> pages_shared: 1,310,724 (≈5GB)
>>> pages_unshared = 0
>>> total ≈10GB
>>> Red-black tree nodes to traverse:
>>> 1,310,721 (1 chain + 1,310,720 non-chains)
>>>
>>> Performance:
>>> find_chain: 28,822 μs (28.8 ms)
>>> collect_procs_ksm: 45,944 μs (45.9 ms)
>>> Total memory failure handling: 46,594 μs (46.6 ms)
>> Thanks for your test.
>>
>>> Summary:
>>> The find_chain function shows approximately linear scaling with the number of red-black tree nodes.
>>> With a 76x increase in nodes (17,161 → 1,310,721), latency increased by 32x (898 μs → 28,822 μs),
>>> with find_chain accounting for 62% of the total memory failure handling time (46.6 ms).
>>> However, since memory failures are rare events, this latency may be acceptable
>>> as it does not impact normal system performance and only affects error recovery paths.
>>>
>> IMHO, the execution time of a kernel function must not be too long without any scheduling points.
>> Otherwise it may affect the normal scheduling of the system and lead to something like performance
>> fluctuation. Or am I missing something?
>>
>> Thanks.
>> .
> 
> I will add cond_resched() in the loop over the red-black tree to allow scheduling in find_chain(); maybe that is enough?

That looks workable to me.

Thanks.
.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3 0/2] mm/ksm: try recover from memory failure on KSM page by migrating to healthy duplicate
  2025-10-28  9:44   ` David Hildenbrand
@ 2025-11-03 15:15     ` Longlong Xia
  2025-11-03 15:16       ` [PATCH v3 1/2] mm/ksm: add helper to allocate and initialize stable node duplicates Longlong Xia
  2025-11-03 15:16       ` [PATCH v3 2/2] mm/ksm: try recover from memory failure on KSM page by migrating to healthy duplicate Longlong Xia
  0 siblings, 2 replies; 17+ messages in thread
From: Longlong Xia @ 2025-11-03 15:15 UTC (permalink / raw)
  To: david, linmiaohe
  Cc: lance.yang, markus.elfring, nao.horiguchi, akpm, wangkefeng.wang,
	qiuxu.zhuo, xu.xin16, linux-kernel, linux-mm, Longlong Xia

When a hardware memory error occurs on a KSM page, the current
behavior is to kill all processes mapping that page. This can
be overly aggressive when KSM has multiple duplicate pages in
a chain where other duplicates are still healthy.

This patch introduces a recovery mechanism that attempts to
migrate mappings from the failing KSM page to a newly
allocated KSM page or another healthy duplicate already
present in the same chain, before falling back to the
process-killing procedure.

The recovery process works as follows:
1. Identify if the failing KSM page belongs to a stable node chain.
2. Locate a healthy duplicate KSM page within the same chain.
3. For each process mapping the failing page:
   a. Attempt to allocate a new KSM page copy from healthy duplicate
      KSM page. If successful, migrate the mapping to this new KSM page.
   b. If allocation fails, migrate the mapping to the existing healthy
      duplicate KSM page.
4. If all migrations succeed, remove the failing KSM page from the chain.
5. Only if recovery fails (e.g., no healthy duplicate found or migration
   error) does the kernel fall back to killing the affected processes.

The original idea came from Naoya Horiguchi.
https://lore.kernel.org/all/20230331054243.GB1435482@hori.linux.bs1.fc.nec.co.jp/


Real-world Application Testing:
-------------------------------
Workload: 10 QEMU VMs (1 vCPU, 256MB RAM each) with KSM enabled
Platform: x86_64, Kernel 6.6.89

Testcase 1: Single VM with KSM enabled

- VM Memory Usage:
    * RSS Total  = 275028 KB (268 MB)
    * RSS Anon   = 253656 KB (247 MB)
    * RSS File   = 21372 KB (20 MB)
    * RSS Shmem  = 0 KB (0 MB)

a. Traverse the stable tree
b. Pages on chains:
2 chains detected
Chain #1: 51 duplicates, 12,956 pages (~51 MB)
Chain #2: 15 duplicates, 3,822 pages (~15 MB)
Average: 8,389 pages per chain
Sum: 16,778 pages (64.6% of ksm_pages_sharing + ksm_pages_shared)
c. Pages not on chains:
Non-chain pages: 9,209 pages
d. chain_count = 2, not_chain_count = 4200
e. KSM sysfs counters:
/sys/kernel/mm/ksm/ksm_pages_sharing = 21721
/sys/kernel/mm/ksm/ksm_pages_shared = 4266
/sys/kernel/mm/ksm/ksm_pages_unshared = 38098


Testcase 2: 10 VMs with KSM enabled
a. Traverse the stable tree
b. Pages on chains:
8 chains detected
Chain #1: 458 duplicates, 117,012 pages (~457 MB)
Chain #2: 150 duplicates, 38,231 pages (~149 MB)
Chain #3: 10 duplicates, 2,320 pages (~9 MB)
Chain #4: 8 duplicates, 1,814 pages (~7 MB)
Chain #5-8: 4, 3, 3, 2 duplicates (920, 720, 600, 260 pages) 

Surprisingly, although chains are few in number, they contribute
significantly to the overall savings. In the 10-VM scenario, only 8 chains
produce 161,877 pages (44.5% of total), while thousands of non-chain groups
contribute the remaining 55.5%. 

Functional Testing (Hardware Error Injection):
----------------------------------------------
test shell script
modprobe einj 2>/dev/null
echo 0x10 > /sys/kernel/debug/apei/einj/error_type
echo $ADDRESS > /sys/kernel/debug/apei/einj/param1
echo 0xfffffffffffff000 > /sys/kernel/debug/apei/einj/param2
echo 1 > /sys/kernel/debug/apei/einj/error_inject

FIRST WAY: allocate a new KSM page copy from healthy duplicate
1. alloc 1024 page with same content and enable KSM to merge
after merge (same phy_addr only print once)
virtual addr = 0x71582be00000  phy_addr =0x124802000
virtual addr = 0x71582bf2c000  phy_addr =0x124902000
virtual addr = 0x71582c026000  phy_addr =0x125402000
virtual addr = 0x71582c120000  phy_addr =0x125502000


2. echo 0x124802000 > /sys/kernel/debug/apei/einj/param1
virtual addr = 0x71582be00000  phy_addr =0x1363b1000 (new allocated)
virtual addr = 0x71582bf2c000  phy_addr =0x124902000
virtual addr = 0x71582c026000  phy_addr =0x125402000
virtual addr = 0x71582c120000  phy_addr =0x125502000

kernel-log:
mce: [Hardware Error]: Machine check events logged
ksm: recovery successful, no need to kill processes
Memory failure: 0x124802: recovery action for dirty LRU page: Recovered
Memory failure: 0x124802: recovery action for already poisoned page: Failed

SECOND WAY: Migrate the mapping to the existing healthy duplicate KSM page
1. alloc 1024 page with same content and enable KSM to merge
after merge (same phy_addr only print once)
virtual addr = 0x79a172000000  phy_addr =0x141802000
virtual addr = 0x79a17212c000  phy_addr =0x141902000
virtual addr = 0x79a172226000  phy_addr =0x13cc02000
virtual addr = 0x79a172320000  phy_addr =0x13cd02000

2 echo 0x141802000 > /sys/kernel/debug/apei/einj/param1
a.virtual addr = 0x79a172000000  phy_addr =0x13cd02000
b.virtual addr = 0x79a17212c000  phy_addr =0x141902000
c.virtual addr = 0x79a172226000  phy_addr =0x13cc02000
d.virtual addr = 0x79a172320000  phy_addr =0x13cd02000 (shared with a)

kernel-log:
mce: [Hardware Error]: Machine check events logged
ksm: recovery successful, no need to kill processes
Memory failure: 0x141802: recovery action for dirty LRU page: Recovered
Memory failure: 0x141802: recovery action for already poisoned page: Failed
ksm: recovery successful, no need to kill processes

Thanks for the review and comments!

Changes in v3:

Patch 1/2 [New]: Preparatory refactoring
- Extract alloc_init_stable_node_dup() helper
- Refactor write_protect_page() and replace_page() to expose _addr variants
- No functional changes

Patch 2/2:
- Refactored to use alloc_init_stable_node_dup() helper from patch 1/2
  and stable_node_chain_add_dup()
- Fixed locking: unlock failing_folio before mmap_read_lock to avoid deadlock
- Extracted find_stable_node_in_tree() as separate helper
- Removed redundant replace_failing_page(), using write_protect_page_addr()
  and replace_page_addr() instead
- Changed return type to 'struct folio *' for consistency
- Fixed code style issues

Changes in v2:

- Implemented a two-tier recovery strategy: preferring newly allocated
  pages over existing duplicates to avoid concentrating mappings on a
  single page, as suggested by David Hildenbrand
- Removed handling of the zeropage in replace_failing_page(), as it is
  non-recoverable, as suggested by Lance Yang
- Corrected the locking order by acquiring the mmap_lock before the page
  lock during page replacement, as suggested by Miaohe Lin
- Added protection using the ksm_thread_mutex around the entire recovery
  operation to prevent race conditions with concurrent KSM scanning
- Separated the logic into smaller, more focused functions for better
  maintainability
- Updated the patch title

Longlong Xia (2):
  mm/ksm: add helper to allocate and initialize stable node duplicates
  mm/ksm: try recover from memory failure on KSM page by migrating to
    healthy duplicate

 mm/ksm.c | 304 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 278 insertions(+), 26 deletions(-)

-- 
2.43.0



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3 1/2] mm/ksm: add helper to allocate and initialize stable node duplicates
  2025-11-03 15:15     ` [PATCH v3 0/2] mm/ksm: try " Longlong Xia
@ 2025-11-03 15:16       ` Longlong Xia
  2025-11-03 15:16       ` [PATCH v3 2/2] mm/ksm: try recover from memory failure on KSM page by migrating to healthy duplicate Longlong Xia
  1 sibling, 0 replies; 17+ messages in thread
From: Longlong Xia @ 2025-11-03 15:16 UTC (permalink / raw)
  To: david, linmiaohe
  Cc: lance.yang, markus.elfring, nao.horiguchi, akpm, wangkefeng.wang,
	qiuxu.zhuo, xu.xin16, linux-kernel, linux-mm, Longlong Xia

Consolidate the duplicated stable_node allocation and initialization
code in stable_tree_insert() into a new helper function
alloc_init_stable_node_dup().

Also refactor write_protect_page() and replace_page() to expose
address-based variants (_addr suffix). The wrappers maintain existing
behavior by calculating the address first.

This refactoring prepares for the upcoming memory error recovery
feature, which will need to:
1) Allocate and initialize stable_node duplicates
2) Operate on specific addresses without re-calculation

No functional changes.

Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
---
 mm/ksm.c | 89 +++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 63 insertions(+), 26 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 160787bb121c..13ec057667af 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1247,11 +1247,11 @@ static u32 calc_checksum(struct page *page)
 	return checksum;
 }
 
-static int write_protect_page(struct vm_area_struct *vma, struct folio *folio,
-			      pte_t *orig_pte)
+static int write_protect_page_addr(struct vm_area_struct *vma, struct folio *folio,
+				   unsigned long address, pte_t *orig_pte)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, 0, 0);
+	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	int swapped;
 	int err = -EFAULT;
 	struct mmu_notifier_range range;
@@ -1261,10 +1261,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct folio *folio,
 	if (WARN_ON_ONCE(folio_test_large(folio)))
 		return err;
 
-	pvmw.address = page_address_in_vma(folio, folio_page(folio, 0), vma);
-	if (pvmw.address == -EFAULT)
-		goto out;
+	if (address < vma->vm_start || address >= vma->vm_end)
+		return err;
 
+	pvmw.address = address;
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, pvmw.address,
 				pvmw.address + PAGE_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
@@ -1334,21 +1334,26 @@ static int write_protect_page(struct vm_area_struct *vma, struct folio *folio,
 	page_vma_mapped_walk_done(&pvmw);
 out_mn:
 	mmu_notifier_invalidate_range_end(&range);
-out:
 	return err;
 }
 
-/**
- * replace_page - replace page in vma by new ksm page
- * @vma:      vma that holds the pte pointing to page
- * @page:     the page we are replacing by kpage
- * @kpage:    the ksm page we replace page by
- * @orig_pte: the original value of the pte
- *
- * Returns 0 on success, -EFAULT on failure.
- */
-static int replace_page(struct vm_area_struct *vma, struct page *page,
-			struct page *kpage, pte_t orig_pte)
+static int write_protect_page(struct vm_area_struct *vma, struct folio *folio,
+			      pte_t *orig_pte)
+{
+	unsigned long address;
+
+	if (WARN_ON_ONCE(folio_test_large(folio)))
+		return -EFAULT;
+
+	address = page_address_in_vma(folio, folio_page(folio, 0), vma);
+	if (address == -EFAULT)
+		return -EFAULT;
+
+	return write_protect_page_addr(vma, folio, address, orig_pte);
+}
+
+static int replace_page_addr(struct vm_area_struct *vma, struct page *page,
+			     struct page *kpage, unsigned long addr, pte_t orig_pte)
 {
 	struct folio *kfolio = page_folio(kpage);
 	struct mm_struct *mm = vma->vm_mm;
@@ -1358,17 +1363,16 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	pte_t *ptep;
 	pte_t newpte;
 	spinlock_t *ptl;
-	unsigned long addr;
 	int err = -EFAULT;
 	struct mmu_notifier_range range;
 
-	addr = page_address_in_vma(folio, page, vma);
-	if (addr == -EFAULT)
+	if (addr < vma->vm_start || addr >= vma->vm_end)
 		goto out;
 
 	pmd = mm_find_pmd(mm, addr);
 	if (!pmd)
 		goto out;
+
 	/*
 	 * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
 	 * without holding anon_vma lock for write.  So when looking for a
@@ -1441,6 +1445,29 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	return err;
 }
 
+
+/**
+ * replace_page - replace page in vma by new ksm page
+ * @vma:      vma that holds the pte pointing to page
+ * @page:     the page we are replacing by kpage
+ * @kpage:    the ksm page we replace page by
+ * @orig_pte: the original value of the pte
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ */
+static int replace_page(struct vm_area_struct *vma, struct page *page,
+			struct page *kpage, pte_t orig_pte)
+{
+	unsigned long addr;
+	struct folio *folio = page_folio(page);
+
+	addr = page_address_in_vma(folio, page, vma);
+	if (addr == -EFAULT)
+		return -EFAULT;
+
+	return replace_page_addr(vma, page, kpage, addr, orig_pte);
+}
+
 /*
  * try_to_merge_one_page - take two pages and merge them into one
  * @vma: the vma that holds the pte pointing to page
@@ -2007,6 +2034,20 @@ static struct folio *stable_tree_search(struct page *page)
 	goto out;
 }
 
+static struct ksm_stable_node *alloc_init_stable_node_dup(unsigned long kpfn,
+							  int nid __maybe_unused)
+{
+	struct ksm_stable_node *stable_node = alloc_stable_node();
+
+	if (stable_node) {
+		INIT_HLIST_HEAD(&stable_node->hlist);
+		stable_node->kpfn = kpfn;
+		stable_node->rmap_hlist_len = 0;
+		DO_NUMA(stable_node->nid = nid);
+	}
+	return stable_node;
+}
+
 /*
  * stable_tree_insert - insert stable tree node pointing to new ksm page
  * into the stable tree.
@@ -2065,14 +2106,10 @@ static struct ksm_stable_node *stable_tree_insert(struct folio *kfolio)
 		}
 	}
 
-	stable_node_dup = alloc_stable_node();
+	stable_node_dup = alloc_init_stable_node_dup(kpfn, nid);
 	if (!stable_node_dup)
 		return NULL;
 
-	INIT_HLIST_HEAD(&stable_node_dup->hlist);
-	stable_node_dup->kpfn = kpfn;
-	stable_node_dup->rmap_hlist_len = 0;
-	DO_NUMA(stable_node_dup->nid = nid);
 	if (!need_chain) {
 		rb_link_node(&stable_node_dup->node, parent, new);
 		rb_insert_color(&stable_node_dup->node, root);
-- 
2.43.0



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3 2/2] mm/ksm: try recover from memory failure on KSM page by migrating to healthy duplicate
  2025-11-03 15:15     ` [PATCH v3 0/2] mm/ksm: try " Longlong Xia
  2025-11-03 15:16       ` [PATCH v3 1/2] mm/ksm: add helper to allocate and initialize stable node duplicates Longlong Xia
@ 2025-11-03 15:16       ` Longlong Xia
  1 sibling, 0 replies; 17+ messages in thread
From: Longlong Xia @ 2025-11-03 15:16 UTC (permalink / raw)
  To: david, linmiaohe
  Cc: lance.yang, markus.elfring, nao.horiguchi, akpm, wangkefeng.wang,
	qiuxu.zhuo, xu.xin16, linux-kernel, linux-mm, Longlong Xia

When a hardware memory error occurs on a KSM page, the current
behavior is to kill all processes mapping that page. This can
be overly aggressive when KSM has multiple duplicate pages in
a chain where other duplicates are still healthy.

This patch introduces a recovery mechanism that attempts to
migrate mappings from the failing KSM page to a newly
allocated KSM page or another healthy duplicate already
present in the same chain, before falling back to the
process-killing procedure.

The recovery process works as follows:
1. Identify if the failing KSM page belongs to a stable node chain.
2. Locate a healthy duplicate KSM page within the same chain.
3. For each process mapping the failing page:
   a. Attempt to allocate a new KSM page copy from healthy duplicate
      KSM page. If successful, migrate the mapping to this new KSM page.
   b. If allocation fails, migrate the mapping to the existing healthy
      duplicate KSM page.
4. If all migrations succeed, remove the failing KSM page from the chain.
5. Only if recovery fails (e.g., no healthy duplicate found or migration
   error) does the kernel fall back to killing the affected processes.

Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
---
 mm/ksm.c | 215 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 215 insertions(+)

diff --git a/mm/ksm.c b/mm/ksm.c
index 13ec057667af..159b486b11f1 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -3121,6 +3121,215 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
 }
 
 #ifdef CONFIG_MEMORY_FAILURE
+
+static struct rb_node *find_stable_node_in_tree(struct ksm_stable_node *dup_node,
+						const struct rb_root *root)
+{
+	struct rb_node *node;
+	struct ksm_stable_node *stable_node, *dup;
+
+	for (node = rb_first(root); node; node = rb_next(node)) {
+		stable_node = rb_entry(node, struct ksm_stable_node, node);
+		if (!is_stable_node_chain(stable_node))
+			continue;
+		hlist_for_each_entry(dup, &stable_node->hlist, hlist_dup) {
+			if (dup == dup_node)
+				return node;
+		}
+		cond_resched();
+	}
+	return NULL;
+}
+
+static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
+{
+	struct rb_node *node;
+	int nid;
+
+	if (!is_stable_node_dup(dup_node))
+		return NULL;
+
+	for (nid = 0; nid < ksm_nr_node_ids; nid++) {
+		node = find_stable_node_in_tree(dup_node, root_stable_tree + nid);
+		if (node)
+			return rb_entry(node, struct ksm_stable_node, node);
+	}
+
+	return NULL;
+}
+
+static struct folio *find_healthy_folio(struct ksm_stable_node *chain_head,
+					struct ksm_stable_node *failing_node,
+					struct ksm_stable_node **healthy_stable_node)
+{
+	struct ksm_stable_node *dup;
+	struct hlist_node *hlist_safe;
+	struct folio *healthy_folio;
+
+	if (!is_stable_node_chain(chain_head) ||
+	    !is_stable_node_dup(failing_node))
+		return NULL;
+
+	hlist_for_each_entry_safe(dup, hlist_safe, &chain_head->hlist,
+				  hlist_dup) {
+		if (dup == failing_node)
+			continue;
+
+		healthy_folio = ksm_get_folio(dup, KSM_GET_FOLIO_TRYLOCK);
+		if (healthy_folio) {
+			*healthy_stable_node = dup;
+			return healthy_folio;
+		}
+	}
+
+	return NULL;
+}
+
+static struct folio *create_new_stable_node_dup(struct ksm_stable_node *chain_head,
+						struct folio *healthy_folio,
+						struct ksm_stable_node **new_stable_node)
+{
+	struct folio *new_folio;
+	struct page *new_page;
+	unsigned long kpfn;
+	int nid;
+
+	if (!is_stable_node_chain(chain_head))
+		return NULL;
+
+	new_page = alloc_page(GFP_HIGHUSER_MOVABLE);
+	if (!new_page)
+		return NULL;
+
+	new_folio = page_folio(new_page);
+	copy_highpage(new_page, folio_page(healthy_folio, 0));
+
+	kpfn = folio_pfn(new_folio);
+	nid = get_kpfn_nid(kpfn);
+	*new_stable_node = alloc_init_stable_node_dup(kpfn, nid);
+	if (!*new_stable_node) {
+		folio_put(new_folio);
+		return NULL;
+	}
+
+	stable_node_chain_add_dup(*new_stable_node, chain_head);
+	folio_set_stable_node(new_folio, *new_stable_node);
+
+	/* Lock the folio before adding to LRU, consistent with ksm_get_folio */
+	folio_lock(new_folio);
+	folio_add_lru(new_folio);
+
+	return new_folio;
+}
+
+static void migrate_to_target_dup(struct ksm_stable_node *failing_node,
+				  struct folio *failing_folio,
+				  struct folio *target_folio,
+				  struct ksm_stable_node *target_dup)
+{
+	struct ksm_rmap_item *rmap_item;
+	struct hlist_node *hlist_safe;
+	struct page *target_page = folio_page(target_folio, 0);
+	int err;
+
+	hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
+		struct mm_struct *mm = rmap_item->mm;
+		const unsigned long addr = rmap_item->address & PAGE_MASK;
+		struct vm_area_struct *vma;
+		pte_t orig_pte = __pte(0);
+
+		guard(mmap_read_lock)(mm);
+
+		vma = find_mergeable_vma(mm, addr);
+		if (!vma)
+			continue;
+
+		folio_lock(failing_folio);
+
+		err = write_protect_page_addr(vma, failing_folio, addr, &orig_pte);
+		if (err) {
+			folio_unlock(failing_folio);
+			continue;
+		}
+
+		err = replace_page_addr(vma, &failing_folio->page, target_page, addr, orig_pte);
+		if (!err) {
+			hlist_del(&rmap_item->hlist);
+			rmap_item->head = target_dup;
+			DO_NUMA(rmap_item->nid = target_dup->nid);
+			hlist_add_head(&rmap_item->hlist, &target_dup->hlist);
+			target_dup->rmap_hlist_len++;
+			failing_node->rmap_hlist_len--;
+		}
+		folio_unlock(failing_folio);
+	}
+}
+
+static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node)
+{
+	struct folio *failing_folio, *healthy_folio, *target_folio;
+	struct ksm_stable_node *healthy_stable_node, *chain_head, *target_dup;
+	struct folio *new_folio = NULL;
+	struct ksm_stable_node *new_stable_node = NULL;
+
+	if (!is_stable_node_dup(failing_node))
+		return false;
+
+	guard(mutex)(&ksm_thread_mutex);
+
+	failing_folio = ksm_get_folio(failing_node, KSM_GET_FOLIO_NOLOCK);
+	if (!failing_folio)
+		return false;
+
+	chain_head = find_chain_head(failing_node);
+	if (!chain_head) {
+		folio_put(failing_folio);
+		return false;
+	}
+
+	healthy_folio = find_healthy_folio(chain_head, failing_node, &healthy_stable_node);
+	if (!healthy_folio) {
+		folio_put(failing_folio);
+		return false;
+	}
+
+	new_folio = create_new_stable_node_dup(chain_head, healthy_folio, &new_stable_node);
+
+	if (new_folio && new_stable_node) {
+		target_folio = new_folio;
+		target_dup = new_stable_node;
+
+		/* Release healthy_folio since we're using new_folio */
+		folio_unlock(healthy_folio);
+		folio_put(healthy_folio);
+	} else {
+		target_folio = healthy_folio;
+		target_dup = healthy_stable_node;
+	}
+
+	/*
+	 * failing_folio was locked in memory_failure(). Unlock it before
+	 * acquiring mmap_read_lock to avoid lock inversion deadlock.
+	 */
+	folio_unlock(failing_folio);
+	migrate_to_target_dup(failing_node, failing_folio, target_folio, target_dup);
+	folio_lock(failing_folio);
+
+	folio_unlock(target_folio);
+	folio_put(target_folio);
+
+	if (failing_node->rmap_hlist_len == 0) {
+		folio_set_stable_node(failing_folio, NULL);
+		__stable_node_dup_del(failing_node);
+		free_stable_node(failing_node);
+		folio_put(failing_folio);
+		return true;
+	}
+
+	folio_put(failing_folio);
+	return false;
+}
+
 /*
  * Collect processes when the error hit an ksm page.
  */
@@ -3135,6 +3344,12 @@ void collect_procs_ksm(const struct folio *folio, const struct page *page,
 	stable_node = folio_stable_node(folio);
 	if (!stable_node)
 		return;
+
+	if (ksm_recover_within_chain(stable_node)) {
+		pr_info("ksm: recovery successful, no need to kill processes\n");
+		return;
+	}
+
 	hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
 		struct anon_vma *av = rmap_item->anon_vma;
 
-- 
2.43.0



^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-11-03 15:16 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-16 10:18 [PATCH v2 0/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate Longlong Xia
2025-10-16 10:18 ` [PATCH v2 1/1] " Longlong Xia
2025-10-16 14:37   ` [PATCH v2] " Markus Elfring
2025-10-17  3:09   ` [PATCH v2 1/1] " kernel test robot
2025-10-23 11:54   ` Miaohe Lin
2025-10-28  7:54     ` Long long Xia
2025-10-29  6:40       ` Miaohe Lin
2025-10-29  7:12         ` Long long Xia
2025-10-30  2:56           ` Miaohe Lin
2025-10-28  9:44   ` David Hildenbrand
2025-11-03 15:15     ` [PATCH v3 0/2] mm/ksm: try " Longlong Xia
2025-11-03 15:16       ` [PATCH v3 1/2] mm/ksm: add helper to allocate and initialize stable node duplicates Longlong Xia
2025-11-03 15:16       ` [PATCH v3 2/2] mm/ksm: try recover from memory failure on KSM page by migrating to healthy duplicate Longlong Xia
2025-10-16 10:46 ` [PATCH v2 0/1] mm/ksm: " David Hildenbrand
2025-10-21 14:00   ` Long long Xia
2025-10-23 16:16     ` David Hildenbrand
2025-10-16 11:01 ` Markus Elfring
