* Re: [PATCH v2] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
2025-10-16 10:18 ` [PATCH v2 1/1] " Longlong Xia
@ 2025-10-16 14:37 ` Markus Elfring
2025-10-17 3:09 ` [PATCH v2 1/1] " kernel test robot
` (2 subsequent siblings)
3 siblings, 0 replies; 17+ messages in thread
From: Markus Elfring @ 2025-10-16 14:37 UTC (permalink / raw)
To: Longlong Xia, linux-mm, David Hildenbrand, Lance Yang,
Miaohe Lin, Naoya Horiguchi
Cc: Longlong Xia, LKML, Andrew Morton, Kefeng Wang, Qiuxu Zhuo, xu xin
…
> This patch introduces …
See also once more:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.17#n94
…
> +++ b/mm/ksm.c
…
> +static int replace_failing_page(struct vm_area_struct *vma, struct page *page,
> + struct page *kpage, unsigned long addr)
> +{
…
> + int err = -EFAULT;
…
> + pmd = mm_find_pmd(mm, addr);
> + if (!pmd)
> + goto out;
Please return directly here.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?h=v6.17#n532
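I.e., something like this (untested sketch):

	pmd = mm_find_pmd(mm, addr);
	if (!pmd)
		return -EFAULT;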
…
> +out_mn:
> + mmu_notifier_invalidate_range_end(&range);
> +out:
> + return err;
> +}
> +
> +static void migrate_to_target_dup(struct ksm_stable_node *failing_node,
> + struct folio *failing_folio,
> + struct folio *target_folio,
> + struct ksm_stable_node *target_dup)
> +{
…
> + hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
> + struct mm_struct *mm = rmap_item->mm;
> + unsigned long addr = rmap_item->address & PAGE_MASK;
> + struct vm_area_struct *vma;
> +
> + if (!mmap_read_trylock(mm))
> + continue;
> +
> + if (ksm_test_exit(mm)) {
> + mmap_read_unlock(mm);
> + continue;
> + }
I suggest avoiding duplicated source code here by using another label.
Would such an implementation detail become relevant for the application of scope-based resource management?
https://elixir.bootlin.com/linux/v6.17.1/source/include/linux/mmap_lock.h#L483-L484
…
> + folio_unlock(target_folio);
+unlock:
> + mmap_read_unlock(mm);
> + }
> +
> +}
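A possible shape for the loop body, with a single unlock label (untested sketch):

	hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
		…
		if (!mmap_read_trylock(mm))
			continue;

		if (ksm_test_exit(mm))
			goto unlock;

		vma = vma_lookup(mm, addr);
		if (!vma)
			goto unlock;

		if (!folio_trylock(target_folio))
			goto unlock;

		…
		folio_unlock(target_folio);
unlock:
		mmap_read_unlock(mm);
	}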
> +
> +static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node)
> +{
…
> + if (new_page && new_stable_node) {
> + migrate_to_target_dup(failing_node, failing_folio,
> + page_folio(new_page), new_stable_node);
> + } else {
> + migrate_to_target_dup(failing_node, failing_folio,
> + healthy_folio, healthy_dupdup);
> + }
…
What do you think about omitting the curly brackets here?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?h=v6.17#n197
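I.e. (untested sketch):

	if (new_page && new_stable_node)
		migrate_to_target_dup(failing_node, failing_folio,
				      page_folio(new_page), new_stable_node);
	else
		migrate_to_target_dup(failing_node, failing_folio,
				      healthy_folio, healthy_dupdup);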
Regards,
Markus
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
2025-10-16 10:18 ` [PATCH v2 1/1] " Longlong Xia
2025-10-16 14:37 ` [PATCH v2] " Markus Elfring
@ 2025-10-17 3:09 ` kernel test robot
2025-10-23 11:54 ` Miaohe Lin
2025-10-28 9:44 ` David Hildenbrand
3 siblings, 0 replies; 17+ messages in thread
From: kernel test robot @ 2025-10-17 3:09 UTC (permalink / raw)
To: Longlong Xia, linmiaohe, david, lance.yang
Cc: llvm, oe-kbuild-all, markus.elfring, nao.horiguchi, akpm,
wangkefeng.wang, qiuxu.zhuo, xu.xin16, linux-kernel, linux-mm,
Longlong Xia
Hi Longlong,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.18-rc1 next-20251016]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Longlong-Xia/mm-ksm-recover-from-memory-failure-on-KSM-page-by-migrating-to-healthy-duplicate/20251016-182115
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251016101813.484565-2-xialonglong2025%40163.com
patch subject: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
config: x86_64-buildonly-randconfig-003-20251017 (https://download.01.org/0day-ci/archive/20251017/202510171017.wBXHozQb-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251017/202510171017.wBXHozQb-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510171017.wBXHozQb-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> mm/ksm.c:3160:6: warning: variable 'nid' set but not used [-Wunused-but-set-variable]
3160 | int nid;
| ^
1 warning generated.
vim +/nid +3160 mm/ksm.c
3155
3156 static struct page *create_new_stable_node_dup(struct ksm_stable_node *chain_head,
3157 struct folio *healthy_folio,
3158 struct ksm_stable_node **new_stable_node)
3159 {
> 3160 int nid;
3161 unsigned long kpfn;
3162 struct page *new_page = NULL;
3163
3164 if (!is_stable_node_chain(chain_head))
3165 return NULL;
3166
3167 new_page = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_ZERO);
3168 if (!new_page)
3169 return NULL;
3170
3171 copy_highpage(new_page, folio_page(healthy_folio, 0));
3172
3173 *new_stable_node = alloc_stable_node();
3174 if (!*new_stable_node) {
3175 __free_page(new_page);
3176 return NULL;
3177 }
3178
3179 INIT_HLIST_HEAD(&(*new_stable_node)->hlist);
3180 kpfn = page_to_pfn(new_page);
3181 (*new_stable_node)->kpfn = kpfn;
3182 nid = get_kpfn_nid(kpfn);
3183 DO_NUMA((*new_stable_node)->nid = nid);
3184 (*new_stable_node)->rmap_hlist_len = 0;
3185
3186 (*new_stable_node)->head = STABLE_NODE_DUP_HEAD;
3187 hlist_add_head(&(*new_stable_node)->hlist_dup, &chain_head->hlist);
3188 ksm_stable_node_dups++;
3189 folio_set_stable_node(page_folio(new_page), *new_stable_node);
3190 folio_add_lru(page_folio(new_page));
3191
3192 return new_page;
3193 }
3194
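The warning presumably shows up with CONFIG_NUMA disabled, where DO_NUMA()
expands to nothing and 'nid' is never read. One possible fix (only a sketch)
is to drop the local variable:

	kpfn = page_to_pfn(new_page);
	(*new_stable_node)->kpfn = kpfn;
	DO_NUMA((*new_stable_node)->nid = get_kpfn_nid(kpfn));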
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
2025-10-16 10:18 ` [PATCH v2 1/1] " Longlong Xia
2025-10-16 14:37 ` [PATCH v2] " Markus Elfring
2025-10-17 3:09 ` [PATCH v2 1/1] " kernel test robot
@ 2025-10-23 11:54 ` Miaohe Lin
2025-10-28 7:54 ` Long long Xia
2025-10-28 9:44 ` David Hildenbrand
3 siblings, 1 reply; 17+ messages in thread
From: Miaohe Lin @ 2025-10-23 11:54 UTC (permalink / raw)
To: Longlong Xia
Cc: markus.elfring, nao.horiguchi, akpm, wangkefeng.wang, qiuxu.zhuo,
xu.xin16, linux-kernel, linux-mm, Longlong Xia, david,
lance.yang
On 2025/10/16 18:18, Longlong Xia wrote:
> From: Longlong Xia <xialonglong@kylinos.cn>
>
> When a hardware memory error occurs on a KSM page, the current
> behavior is to kill all processes mapping that page. This can
> be overly aggressive when KSM has multiple duplicate pages in
> a chain where other duplicates are still healthy.
>
> This patch introduces a recovery mechanism that attempts to
> migrate mappings from the failing KSM page to a newly
> allocated KSM page or another healthy duplicate already
> present in the same chain, before falling back to the
> process-killing procedure.
>
> The recovery process works as follows:
> 1. Identify if the failing KSM page belongs to a stable node chain.
> 2. Locate a healthy duplicate KSM page within the same chain.
> 3. For each process mapping the failing page:
> a. Attempt to allocate a new KSM page copy from healthy duplicate
> KSM page. If successful, migrate the mapping to this new KSM page.
> b. If allocation fails, migrate the mapping to the existing healthy
> duplicate KSM page.
> 4. If all migrations succeed, remove the failing KSM page from the chain.
> 5. Only if recovery fails (e.g., no healthy duplicate found or migration
> error) does the kernel fall back to killing the affected processes.
>
> Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
Thanks for your patch. Some comments below.
> ---
> mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 246 insertions(+)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 160787bb121c..9099bad1ab35 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
> }
>
> #ifdef CONFIG_MEMORY_FAILURE
> +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
> +{
> + struct ksm_stable_node *stable_node, *dup;
> + struct rb_node *node;
> + int nid;
> +
> + if (!is_stable_node_dup(dup_node))
> + return NULL;
> +
> + for (nid = 0; nid < ksm_nr_node_ids; nid++) {
> + node = rb_first(root_stable_tree + nid);
> + for (; node; node = rb_next(node)) {
> + stable_node = rb_entry(node,
> + struct ksm_stable_node,
> + node);
> +
> + if (!is_stable_node_chain(stable_node))
> + continue;
> +
> + hlist_for_each_entry(dup, &stable_node->hlist,
> + hlist_dup) {
> + if (dup == dup_node)
> + return stable_node;
> + }
> + }
> + }
Would the above nested loops take a long time in some corner cases?
> +
> + return NULL;
> +}
> +
> +static struct folio *find_healthy_folio(struct ksm_stable_node *chain_head,
> + struct ksm_stable_node *failing_node,
> + struct ksm_stable_node **healthy_dupdup)
> +{
> + struct ksm_stable_node *dup;
> + struct hlist_node *hlist_safe;
> + struct folio *healthy_folio;
> +
> + if (!is_stable_node_chain(chain_head) || !is_stable_node_dup(failing_node))
> + return NULL;
> +
> + hlist_for_each_entry_safe(dup, hlist_safe, &chain_head->hlist, hlist_dup) {
> + if (dup == failing_node)
> + continue;
> +
> + healthy_folio = ksm_get_folio(dup, KSM_GET_FOLIO_TRYLOCK);
> + if (healthy_folio) {
> + *healthy_dupdup = dup;
> + return healthy_folio;
> + }
> + }
> +
> + return NULL;
> +}
> +
> +static struct page *create_new_stable_node_dup(struct ksm_stable_node *chain_head,
> + struct folio *healthy_folio,
> + struct ksm_stable_node **new_stable_node)
> +{
> + int nid;
> + unsigned long kpfn;
> + struct page *new_page = NULL;
> +
> + if (!is_stable_node_chain(chain_head))
> + return NULL;
> +
> + new_page = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_ZERO);
Why is __GFP_ZERO needed?
> + if (!new_page)
> + return NULL;
> +
> + copy_highpage(new_page, folio_page(healthy_folio, 0));
> +
> + *new_stable_node = alloc_stable_node();
> + if (!*new_stable_node) {
> + __free_page(new_page);
> + return NULL;
> + }
> +
> + INIT_HLIST_HEAD(&(*new_stable_node)->hlist);
> + kpfn = page_to_pfn(new_page);
> + (*new_stable_node)->kpfn = kpfn;
> + nid = get_kpfn_nid(kpfn);
> + DO_NUMA((*new_stable_node)->nid = nid);
> + (*new_stable_node)->rmap_hlist_len = 0;
> +
> + (*new_stable_node)->head = STABLE_NODE_DUP_HEAD;
> + hlist_add_head(&(*new_stable_node)->hlist_dup, &chain_head->hlist);
> + ksm_stable_node_dups++;
> + folio_set_stable_node(page_folio(new_page), *new_stable_node);
> + folio_add_lru(page_folio(new_page));
> +
> + return new_page;
> +}
> +
...
> +
> +static void migrate_to_target_dup(struct ksm_stable_node *failing_node,
> + struct folio *failing_folio,
> + struct folio *target_folio,
> + struct ksm_stable_node *target_dup)
> +{
> + struct ksm_rmap_item *rmap_item;
> + struct hlist_node *hlist_safe;
> + int err;
> +
> + hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
> + struct mm_struct *mm = rmap_item->mm;
> + unsigned long addr = rmap_item->address & PAGE_MASK;
> + struct vm_area_struct *vma;
> +
> + if (!mmap_read_trylock(mm))
> + continue;
> +
> + if (ksm_test_exit(mm)) {
> + mmap_read_unlock(mm);
> + continue;
> + }
> +
> + vma = vma_lookup(mm, addr);
> + if (!vma) {
> + mmap_read_unlock(mm);
> + continue;
> + }
> +
> + if (!folio_trylock(target_folio)) {
Should we try to get the folio refcnt first?
> + mmap_read_unlock(mm);
> + continue;
> + }
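E.g., something along these lines (sketch; grabbing a reference before trying the lock):

		if (!folio_try_get(target_folio)) {
			mmap_read_unlock(mm);
			continue;
		}

		if (!folio_trylock(target_folio)) {
			folio_put(target_folio);
			mmap_read_unlock(mm);
			continue;
		}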
> +
> + err = replace_failing_page(vma, &failing_folio->page,
> + folio_page(target_folio, 0), addr);
> + if (!err) {
> + hlist_del(&rmap_item->hlist);
> + rmap_item->head = target_dup;
> + hlist_add_head(&rmap_item->hlist, &target_dup->hlist);
> + target_dup->rmap_hlist_len++;
> + failing_node->rmap_hlist_len--;
> + }
> +
> + folio_unlock(target_folio);
> + mmap_read_unlock(mm);
> + }
> +
> +}
> +
> +static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node)
> +{
> + struct folio *failing_folio = NULL;
> + struct ksm_stable_node *healthy_dupdup = NULL;
> + struct folio *healthy_folio = NULL;
> + struct ksm_stable_node *chain_head = NULL;
> + struct page *new_page = NULL;
> + struct ksm_stable_node *new_stable_node = NULL;
> +
> + if (!is_stable_node_dup(failing_node))
> + return false;
> +
> + guard(mutex)(&ksm_thread_mutex);
> + failing_folio = ksm_get_folio(failing_node, KSM_GET_FOLIO_NOLOCK);
> + if (!failing_folio)
> + return false;
> +
> + chain_head = find_chain_head(failing_node);
> + if (!chain_head)
> + return NULL;
Should we folio_put(failing_folio) before return?
Thanks.
.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
2025-10-23 11:54 ` Miaohe Lin
@ 2025-10-28 7:54 ` Long long Xia
2025-10-29 6:40 ` Miaohe Lin
0 siblings, 1 reply; 17+ messages in thread
From: Long long Xia @ 2025-10-28 7:54 UTC (permalink / raw)
To: Miaohe Lin
Cc: markus.elfring, nao.horiguchi, akpm, wangkefeng.wang, qiuxu.zhuo,
xu.xin16, linux-kernel, linux-mm, Longlong Xia, david,
lance.yang
Thanks for the reply.
On 2025/10/23 19:54, Miaohe Lin wrote:
> On 2025/10/16 18:18, Longlong Xia wrote:
>> From: Longlong Xia <xialonglong@kylinos.cn>
>>
>> When a hardware memory error occurs on a KSM page, the current
>> behavior is to kill all processes mapping that page. This can
>> be overly aggressive when KSM has multiple duplicate pages in
>> a chain where other duplicates are still healthy.
>>
>> This patch introduces a recovery mechanism that attempts to
>> migrate mappings from the failing KSM page to a newly
>> allocated KSM page or another healthy duplicate already
>> present in the same chain, before falling back to the
>> process-killing procedure.
>>
>> The recovery process works as follows:
>> 1. Identify if the failing KSM page belongs to a stable node chain.
>> 2. Locate a healthy duplicate KSM page within the same chain.
>> 3. For each process mapping the failing page:
>> a. Attempt to allocate a new KSM page copy from healthy duplicate
>> KSM page. If successful, migrate the mapping to this new KSM page.
>> b. If allocation fails, migrate the mapping to the existing healthy
>> duplicate KSM page.
>> 4. If all migrations succeed, remove the failing KSM page from the chain.
>> 5. Only if recovery fails (e.g., no healthy duplicate found or migration
>> error) does the kernel fall back to killing the affected processes.
>>
>> Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
> Thanks for your patch. Some comments below.
>
>> ---
>> mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 246 insertions(+)
>>
>> diff --git a/mm/ksm.c b/mm/ksm.c
>> index 160787bb121c..9099bad1ab35 100644
>> --- a/mm/ksm.c
>> +++ b/mm/ksm.c
>> @@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>> }
>>
>> #ifdef CONFIG_MEMORY_FAILURE
>> +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
>> +{
>> + struct ksm_stable_node *stable_node, *dup;
>> + struct rb_node *node;
>> + int nid;
>> +
>> + if (!is_stable_node_dup(dup_node))
>> + return NULL;
>> +
>> + for (nid = 0; nid < ksm_nr_node_ids; nid++) {
>> + node = rb_first(root_stable_tree + nid);
>> + for (; node; node = rb_next(node)) {
>> + stable_node = rb_entry(node,
>> + struct ksm_stable_node,
>> + node);
>> +
>> + if (!is_stable_node_chain(stable_node))
>> + continue;
>> +
>> + hlist_for_each_entry(dup, &stable_node->hlist,
>> + hlist_dup) {
>> + if (dup == dup_node)
>> + return stable_node;
>> + }
>> + }
>> + }
> Would above multiple loops take a long time in some corner cases?
Thanks for the concern.
I did some simple tests.
Test 1: 10 Virtual Machines (Real-world Scenario)
Environment: 10 VMs (256MB each) with KSM enabled
KSM State:
pages_sharing: 262,802 (≈1GB)
pages_shared: 17,374 (≈68MB)
pages_unshared = 124,057 (≈485MB)
total ≈1.5GB
chain_count = 9, not_chain_count = 17152
Red-black tree nodes to traverse:
17,161 (9 chains + 17,152 non-chains)
Performance:
find_chain: 898 μs (0.9 ms)
collect_procs_ksm: 4,409 μs (4.4 ms)
Total memory failure handling: 6,135 μs (6.1 ms)
Test 2: 10GB Single Process (Extreme Case)
Environment: Single process with 10GB memory,
1,310,720 page pairs (each pair identical, different from others)
KSM State:
pages_sharing: 1,311,740 (≈5GB)
pages_shared: 1,310,724 (≈5GB)
pages_unshared = 0
total ≈10GB
Red-black tree nodes to traverse:
1,310,721 (1 chain + 1,310,720 non-chains)
Performance:
find_chain: 28,822 μs (28.8 ms)
collect_procs_ksm: 45,944 μs (45.9 ms)
Total memory failure handling: 46,594 μs (46.6 ms)
Summary:
The find_chain function shows approximately linear scaling with the
number of red-black tree nodes. With a 76x increase in nodes
(17,161 → 1,310,721), latency increased by 32x (898 μs → 28,822 μs),
with find_chain accounting for 62% of the total memory failure
handling time (46.6 ms).
However, since memory failures are rare events, this latency may be
acceptable, as it does not impact normal system performance and only
affects the error recovery path.
>> +
>> + return NULL;
>> +}
>> +
>> +static struct folio *find_healthy_folio(struct ksm_stable_node *chain_head,
>> + struct ksm_stable_node *failing_node,
>> + struct ksm_stable_node **healthy_dupdup)
>> +{
>> + struct ksm_stable_node *dup;
>> + struct hlist_node *hlist_safe;
>> + struct folio *healthy_folio;
>> +
>> + if (!is_stable_node_chain(chain_head) || !is_stable_node_dup(failing_node))
>> + return NULL;
>> +
>> + hlist_for_each_entry_safe(dup, hlist_safe, &chain_head->hlist, hlist_dup) {
>> + if (dup == failing_node)
>> + continue;
>> +
>> + healthy_folio = ksm_get_folio(dup, KSM_GET_FOLIO_TRYLOCK);
>> + if (healthy_folio) {
>> + *healthy_dupdup = dup;
>> + return healthy_folio;
>> + }
>> + }
>> +
>> + return NULL;
>> +}
>> +
>> +static struct page *create_new_stable_node_dup(struct ksm_stable_node *chain_head,
>> + struct folio *healthy_folio,
>> + struct ksm_stable_node **new_stable_node)
>> +{
>> + int nid;
>> + unsigned long kpfn;
>> + struct page *new_page = NULL;
>> +
>> + if (!is_stable_node_chain(chain_head))
>> + return NULL;
>> +
>> + new_page = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_ZERO);
> Why __GFP_ZERO is needed?
Thanks for pointing this out. I'll remove it.
>> + if (!new_page)
>> + return NULL;
>> +
>> + copy_highpage(new_page, folio_page(healthy_folio, 0));
>> +
>> + *new_stable_node = alloc_stable_node();
>> + if (!*new_stable_node) {
>> + __free_page(new_page);
>> + return NULL;
>> + }
>> +
>> + INIT_HLIST_HEAD(&(*new_stable_node)->hlist);
>> + kpfn = page_to_pfn(new_page);
>> + (*new_stable_node)->kpfn = kpfn;
>> + nid = get_kpfn_nid(kpfn);
>> + DO_NUMA((*new_stable_node)->nid = nid);
>> + (*new_stable_node)->rmap_hlist_len = 0;
>> +
>> + (*new_stable_node)->head = STABLE_NODE_DUP_HEAD;
>> + hlist_add_head(&(*new_stable_node)->hlist_dup, &chain_head->hlist);
>> + ksm_stable_node_dups++;
>> + folio_set_stable_node(page_folio(new_page), *new_stable_node);
>> + folio_add_lru(page_folio(new_page));
>> +
>> + return new_page;
>> +}
>> +
> ...
>
>> +
>> +static void migrate_to_target_dup(struct ksm_stable_node *failing_node,
>> + struct folio *failing_folio,
>> + struct folio *target_folio,
>> + struct ksm_stable_node *target_dup)
>> +{
>> + struct ksm_rmap_item *rmap_item;
>> + struct hlist_node *hlist_safe;
>> + int err;
>> +
>> + hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
>> + struct mm_struct *mm = rmap_item->mm;
>> + unsigned long addr = rmap_item->address & PAGE_MASK;
>> + struct vm_area_struct *vma;
>> +
>> + if (!mmap_read_trylock(mm))
>> + continue;
>> +
>> + if (ksm_test_exit(mm)) {
>> + mmap_read_unlock(mm);
>> + continue;
>> + }
>> +
>> + vma = vma_lookup(mm, addr);
>> + if (!vma) {
>> + mmap_read_unlock(mm);
>> + continue;
>> + }
>> +
>> + if (!folio_trylock(target_folio)) {
> Should we try to get the folio refcnt first?
Thanks for pointing this out. I'll fix it.
>> + mmap_read_unlock(mm);
>> + continue;
>> + }
>> +
>> + err = replace_failing_page(vma, &failing_folio->page,
>> + folio_page(target_folio, 0), addr);
>> + if (!err) {
>> + hlist_del(&rmap_item->hlist);
>> + rmap_item->head = target_dup;
>> + hlist_add_head(&rmap_item->hlist, &target_dup->hlist);
>> + target_dup->rmap_hlist_len++;
>> + failing_node->rmap_hlist_len--;
>> + }
>> +
>> + folio_unlock(target_folio);
>> + mmap_read_unlock(mm);
>> + }
>> +
>> +}
>> +
>> +static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node)
>> +{
>> + struct folio *failing_folio = NULL;
>> + struct ksm_stable_node *healthy_dupdup = NULL;
>> + struct folio *healthy_folio = NULL;
>> + struct ksm_stable_node *chain_head = NULL;
>> + struct page *new_page = NULL;
>> + struct ksm_stable_node *new_stable_node = NULL;
>> +
>> + if (!is_stable_node_dup(failing_node))
>> + return false;
>> +
>> + guard(mutex)(&ksm_thread_mutex);
>> + failing_folio = ksm_get_folio(failing_node, KSM_GET_FOLIO_NOLOCK);
>> + if (!failing_folio)
>> + return false;
>> +
>> + chain_head = find_chain_head(failing_node);
>> + if (!chain_head)
>> + return NULL;
> Should we folio_put(failing_folio) before return?
Thanks for pointing this out. I'll fix it.
> Thanks.
> .
Best regards,
Longlong Xia
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
2025-10-28 7:54 ` Long long Xia
@ 2025-10-29 6:40 ` Miaohe Lin
2025-10-29 7:12 ` Long long Xia
0 siblings, 1 reply; 17+ messages in thread
From: Miaohe Lin @ 2025-10-29 6:40 UTC (permalink / raw)
To: Long long Xia
Cc: markus.elfring, nao.horiguchi, akpm, wangkefeng.wang, qiuxu.zhuo,
xu.xin16, linux-kernel, linux-mm, Longlong Xia, david,
lance.yang
On 2025/10/28 15:54, Long long Xia wrote:
> Thanks for the reply.
>
> On 2025/10/23 19:54, Miaohe Lin wrote:
>> On 2025/10/16 18:18, Longlong Xia wrote:
>>> From: Longlong Xia <xialonglong@kylinos.cn>
>>>
>>> When a hardware memory error occurs on a KSM page, the current
>>> behavior is to kill all processes mapping that page. This can
>>> be overly aggressive when KSM has multiple duplicate pages in
>>> a chain where other duplicates are still healthy.
>>>
>>> This patch introduces a recovery mechanism that attempts to
>>> migrate mappings from the failing KSM page to a newly
>>> allocated KSM page or another healthy duplicate already
>>> present in the same chain, before falling back to the
>>> process-killing procedure.
>>>
>>> The recovery process works as follows:
>>> 1. Identify if the failing KSM page belongs to a stable node chain.
>>> 2. Locate a healthy duplicate KSM page within the same chain.
>>> 3. For each process mapping the failing page:
>>> a. Attempt to allocate a new KSM page copy from healthy duplicate
>>> KSM page. If successful, migrate the mapping to this new KSM page.
>>> b. If allocation fails, migrate the mapping to the existing healthy
>>> duplicate KSM page.
>>> 4. If all migrations succeed, remove the failing KSM page from the chain.
>>> 5. Only if recovery fails (e.g., no healthy duplicate found or migration
>>> error) does the kernel fall back to killing the affected processes.
>>>
>>> Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
>> Thanks for your patch. Some comments below.
>>
>>> ---
>>> mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 246 insertions(+)
>>>
>>> diff --git a/mm/ksm.c b/mm/ksm.c
>>> index 160787bb121c..9099bad1ab35 100644
>>> --- a/mm/ksm.c
>>> +++ b/mm/ksm.c
>>> @@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>>> }
>>> #ifdef CONFIG_MEMORY_FAILURE
>>> +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
>>> +{
>>> + struct ksm_stable_node *stable_node, *dup;
>>> + struct rb_node *node;
>>> + int nid;
>>> +
>>> + if (!is_stable_node_dup(dup_node))
>>> + return NULL;
>>> +
>>> + for (nid = 0; nid < ksm_nr_node_ids; nid++) {
>>> + node = rb_first(root_stable_tree + nid);
>>> + for (; node; node = rb_next(node)) {
>>> + stable_node = rb_entry(node,
>>> + struct ksm_stable_node,
>>> + node);
>>> +
>>> + if (!is_stable_node_chain(stable_node))
>>> + continue;
>>> +
>>> + hlist_for_each_entry(dup, &stable_node->hlist,
>>> + hlist_dup) {
>>> + if (dup == dup_node)
>>> + return stable_node;
>>> + }
>>> + }
>>> + }
>> Would above multiple loops take a long time in some corner cases?
>
> Thanks for the concern.
>
> I do some simple test。
>
> Test 1: 10 Virtual Machines (Real-world Scenario)
> Environment: 10 VMs (256MB each) with KSM enabled
>
> KSM State:
> pages_sharing: 262,802 (≈1GB)
> pages_shared: 17,374 (≈68MB)
> pages_unshared = 124,057 (≈485MB)
> total ≈1.5GB
> chain_count = 9, not_chain_count = 17152
> Red-black tree nodes to traverse:
> 17,161 (9 chains + 17,152 non-chains)
>
> Performance:
> find_chain: 898 μs (0.9 ms)
> collect_procs_ksm: 4,409 μs (4.4 ms)
> Total memory failure handling: 6,135 μs (6.1 ms)
>
>
> Test 2: 10GB Single Process (Extreme Case)
> Environment: Single process with 10GB memory,
> 1,310,720 page pairs (each pair identical, different from others)
>
> KSM State:
> pages_sharing: 1,311,740 (≈5GB)
> pages_shared: 1,310,724 (≈5GB)
> pages_unshared = 0
> total ≈10GB
> Red-black tree nodes to traverse:
> 1,310,721 (1 chain + 1,310,720 non-chains)
>
> Performance:
> find_chain: 28,822 μs (28.8 ms)
> collect_procs_ksm: 45,944 μs (45.9 ms)
> Total memory failure handling: 46,594 μs (46.6 ms)
Thanks for your test.
>
> Summary:
> The find_chain function shows approximately linear scaling with the number of red-black tree nodes.
> With a 76x increase in nodes (17,161 → 1,310,721), latency increased by 32x (898 μs → 28,822 μs).
> representing 62% of total memory failure handling time (46.6ms).
> However, since memory failures are rare events, this latency may be acceptable
> as it does not impact normal system performance and only affects error recovery paths.
>
IMHO, the execution time of a kernel function must not be too long without any scheduling points.
Otherwise it may affect the normal scheduling of the system and lead to something like performance
fluctuation. Or am I missing something?
Thanks.
.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
2025-10-29 6:40 ` Miaohe Lin
@ 2025-10-29 7:12 ` Long long Xia
2025-10-30 2:56 ` Miaohe Lin
0 siblings, 1 reply; 17+ messages in thread
From: Long long Xia @ 2025-10-29 7:12 UTC (permalink / raw)
To: Miaohe Lin
Cc: markus.elfring, nao.horiguchi, akpm, wangkefeng.wang, qiuxu.zhuo,
xu.xin16, linux-kernel, linux-mm, Longlong Xia, david,
lance.yang
Thanks for the reply.
On 2025/10/29 14:40, Miaohe Lin wrote:
> On 2025/10/28 15:54, Long long Xia wrote:
>> Thanks for the reply.
>>
>> On 2025/10/23 19:54, Miaohe Lin wrote:
>>> On 2025/10/16 18:18, Longlong Xia wrote:
>>>> From: Longlong Xia <xialonglong@kylinos.cn>
>>>>
>>>> When a hardware memory error occurs on a KSM page, the current
>>>> behavior is to kill all processes mapping that page. This can
>>>> be overly aggressive when KSM has multiple duplicate pages in
>>>> a chain where other duplicates are still healthy.
>>>>
>>>> This patch introduces a recovery mechanism that attempts to
>>>> migrate mappings from the failing KSM page to a newly
>>>> allocated KSM page or another healthy duplicate already
>>>> present in the same chain, before falling back to the
>>>> process-killing procedure.
>>>>
>>>> The recovery process works as follows:
>>>> 1. Identify if the failing KSM page belongs to a stable node chain.
>>>> 2. Locate a healthy duplicate KSM page within the same chain.
>>>> 3. For each process mapping the failing page:
>>>> a. Attempt to allocate a new KSM page copy from healthy duplicate
>>>> KSM page. If successful, migrate the mapping to this new KSM page.
>>>> b. If allocation fails, migrate the mapping to the existing healthy
>>>> duplicate KSM page.
>>>> 4. If all migrations succeed, remove the failing KSM page from the chain.
>>>> 5. Only if recovery fails (e.g., no healthy duplicate found or migration
>>>> error) does the kernel fall back to killing the affected processes.
>>>>
>>>> Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
>>> Thanks for your patch. Some comments below.
>>>
>>>> ---
>>>> mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 246 insertions(+)
>>>>
>>>> diff --git a/mm/ksm.c b/mm/ksm.c
>>>> index 160787bb121c..9099bad1ab35 100644
>>>> --- a/mm/ksm.c
>>>> +++ b/mm/ksm.c
>>>> @@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>>>> }
>>>> #ifdef CONFIG_MEMORY_FAILURE
>>>> +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
>>>> +{
>>>> + struct ksm_stable_node *stable_node, *dup;
>>>> + struct rb_node *node;
>>>> + int nid;
>>>> +
>>>> + if (!is_stable_node_dup(dup_node))
>>>> + return NULL;
>>>> +
>>>> + for (nid = 0; nid < ksm_nr_node_ids; nid++) {
>>>> + node = rb_first(root_stable_tree + nid);
>>>> + for (; node; node = rb_next(node)) {
>>>> + stable_node = rb_entry(node,
>>>> + struct ksm_stable_node,
>>>> + node);
>>>> +
>>>> + if (!is_stable_node_chain(stable_node))
>>>> + continue;
>>>> +
>>>> + hlist_for_each_entry(dup, &stable_node->hlist,
>>>> + hlist_dup) {
>>>> + if (dup == dup_node)
>>>> + return stable_node;
>>>> + }
May I add a cond_resched() here?
>>>> + }
>>>> + }
>>> Would above multiple loops take a long time in some corner cases?
>> Thanks for the concern.
>>
>> I do some simple test。
>>
>> Test 1: 10 Virtual Machines (Real-world Scenario)
>> Environment: 10 VMs (256MB each) with KSM enabled
>>
>> KSM State:
>> pages_sharing: 262,802 (≈1GB)
>> pages_shared: 17,374 (≈68MB)
>> pages_unshared = 124,057 (≈485MB)
>> total ≈1.5GB
>> chain_count = 9, not_chain_count = 17152
>> Red-black tree nodes to traverse:
>> 17,161 (9 chains + 17,152 non-chains)
>>
>> Performance:
>> find_chain: 898 μs (0.9 ms)
>> collect_procs_ksm: 4,409 μs (4.4 ms)
>> Total memory failure handling: 6,135 μs (6.1 ms)
>>
>>
>> Test 2: 10GB Single Process (Extreme Case)
>> Environment: Single process with 10GB memory,
>> 1,310,720 page pairs (each pair identical, different from others)
>>
>> KSM State:
>> pages_sharing: 1,311,740 (≈5GB)
>> pages_shared: 1,310,724 (≈5GB)
>> pages_unshared = 0
>> total ≈10GB
>> Red-black tree nodes to traverse:
>> 1,310,721 (1 chain + 1,310,720 non-chains)
>>
>> Performance:
>> find_chain: 28,822 μs (28.8 ms)
>> collect_procs_ksm: 45,944 μs (45.9 ms)
>> Total memory failure handling: 46,594 μs (46.6 ms)
> Thanks for your test.
>
>> Summary:
>> The find_chain function shows approximately linear scaling with the number of red-black tree nodes.
>> With a 76x increase in nodes (17,161 → 1,310,721), latency increased by 32x (898 μs → 28,822 μs).
>> representing 62% of total memory failure handling time (46.6ms).
>> However, since memory failures are rare events, this latency may be acceptable
>> as it does not impact normal system performance and only affects error recovery paths.
>>
> IMHO, the execution time of a kernel function must not be too long without any scheduling points.
> Otherwise it may affect the normal scheduling of the system and leads to something like performance
> fluctuation. Or am I miss something?
>
> Thanks.
> .
I will add cond_resched() in the red-black tree loop to allow
scheduling in find_chain(); maybe that will be enough?
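Something like this (sketch):

	for (nid = 0; nid < ksm_nr_node_ids; nid++) {
		node = rb_first(root_stable_tree + nid);
		for (; node; node = rb_next(node)) {
			cond_resched();
			…
		}
	}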
Best regards,
Longlong Xia
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
2025-10-29 7:12 ` Long long Xia
@ 2025-10-30 2:56 ` Miaohe Lin
0 siblings, 0 replies; 17+ messages in thread
From: Miaohe Lin @ 2025-10-30 2:56 UTC (permalink / raw)
To: Long long Xia
Cc: markus.elfring, nao.horiguchi, akpm, wangkefeng.wang, qiuxu.zhuo,
xu.xin16, linux-kernel, linux-mm, Longlong Xia, david,
lance.yang
On 2025/10/29 15:12, Long long Xia wrote:
> Thanks for the reply.
>
>
> On 2025/10/29 14:40, Miaohe Lin wrote:
>> On 2025/10/28 15:54, Long long Xia wrote:
>>> Thanks for the reply.
>>>
>> On 2025/10/23 19:54, Miaohe Lin wrote:
>>>> On 2025/10/16 18:18, Longlong Xia wrote:
>>>>> From: Longlong Xia <xialonglong@kylinos.cn>
>>>>>
>>>>> When a hardware memory error occurs on a KSM page, the current
>>>>> behavior is to kill all processes mapping that page. This can
>>>>> be overly aggressive when KSM has multiple duplicate pages in
>>>>> a chain where other duplicates are still healthy.
>>>>>
>>>>> This patch introduces a recovery mechanism that attempts to
>>>>> migrate mappings from the failing KSM page to a newly
>>>>> allocated KSM page or another healthy duplicate already
>>>>> present in the same chain, before falling back to the
>>>>> process-killing procedure.
>>>>>
>>>>> The recovery process works as follows:
>>>>> 1. Identify if the failing KSM page belongs to a stable node chain.
>>>>> 2. Locate a healthy duplicate KSM page within the same chain.
>>>>> 3. For each process mapping the failing page:
>>>>> a. Attempt to allocate a new KSM page copy from healthy duplicate
>>>>> KSM page. If successful, migrate the mapping to this new KSM page.
>>>>> b. If allocation fails, migrate the mapping to the existing healthy
>>>>> duplicate KSM page.
>>>>> 4. If all migrations succeed, remove the failing KSM page from the chain.
>>>>> 5. Only if recovery fails (e.g., no healthy duplicate found or migration
>>>>> error) does the kernel fall back to killing the affected processes.
>>>>>
>>>>> Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
>>>> Thanks for your patch. Some comments below.
>>>>
>>>>> ---
>>>>> mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> 1 file changed, 246 insertions(+)
>>>>>
>>>>> diff --git a/mm/ksm.c b/mm/ksm.c
>>>>> index 160787bb121c..9099bad1ab35 100644
>>>>> --- a/mm/ksm.c
>>>>> +++ b/mm/ksm.c
>>>>> @@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
>>>>> }
>>>>> #ifdef CONFIG_MEMORY_FAILURE
>>>>> +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
>>>>> +{
>>>>> + struct ksm_stable_node *stable_node, *dup;
>>>>> + struct rb_node *node;
>>>>> + int nid;
>>>>> +
>>>>> + if (!is_stable_node_dup(dup_node))
>>>>> + return NULL;
>>>>> +
>>>>> + for (nid = 0; nid < ksm_nr_node_ids; nid++) {
>>>>> + node = rb_first(root_stable_tree + nid);
>>>>> + for (; node; node = rb_next(node)) {
>>>>> + stable_node = rb_entry(node,
>>>>> + struct ksm_stable_node,
>>>>> + node);
>>>>> +
>>>>> + if (!is_stable_node_chain(stable_node))
>>>>> + continue;
>>>>> +
>>>>> + hlist_for_each_entry(dup, &stable_node->hlist,
>>>>> + hlist_dup) {
>>>>> + if (dup == dup_node)
>>>>> + return stable_node;
>>>>> + }
> may I add cond_resched(); here ?
>>>>> + }
>>>>> + }
>>>> Would above multiple loops take a long time in some corner cases?
>>> Thanks for the concern.
>>>
>>> I do some simple test。
>>>
>>> Test 1: 10 Virtual Machines (Real-world Scenario)
>>> Environment: 10 VMs (256MB each) with KSM enabled
>>>
>>> KSM State:
>>> pages_sharing: 262,802 (≈1GB)
>>> pages_shared: 17,374 (≈68MB)
>>> pages_unshared = 124,057 (≈485MB)
>>> total ≈1.5GB
>>> chain_count = 9, not_chain_count = 17152
>>> Red-black tree nodes to traverse:
>>> 17,161 (9 chains + 17,152 non-chains)
>>>
>>> Performance:
>>> find_chain: 898 μs (0.9 ms)
>>> collect_procs_ksm: 4,409 μs (4.4 ms)
>>> Total memory failure handling: 6,135 μs (6.1 ms)
>>>
>>>
>>> Test 2: 10GB Single Process (Extreme Case)
>>> Environment: Single process with 10GB memory,
>>> 1,310,720 page pairs (each pair identical, different from others)
>>>
>>> KSM State:
>>> pages_sharing: 1,311,740 (≈5GB)
>>> pages_shared: 1,310,724 (≈5GB)
>>> pages_unshared = 0
>>> total ≈10GB
>>> Red-black tree nodes to traverse:
>>> 1,310,721 (1 chain + 1,310,720 non-chains)
>>>
>>> Performance:
>>> find_chain: 28,822 μs (28.8 ms)
>>> collect_procs_ksm: 45,944 μs (45.9 ms)
>>> Total memory failure handling: 46,594 μs (46.6 ms)
>> Thanks for your test.
>>
>>> Summary:
>>> The find_chain function shows approximately linear scaling with the number of red-black tree nodes.
>>> With a 76x increase in nodes (17,161 → 1,310,721), latency increased by 32x (898 μs → 28,822 μs).
>>> representing 62% of total memory failure handling time (46.6ms).
>>> However, since memory failures are rare events, this latency may be acceptable
>>> as it does not impact normal system performance and only affects error recovery paths.
>>>
>> IMHO, the execution time of a kernel function must not be too long without any scheduling points.
>> Otherwise it may affect the normal scheduling of the system and leads to something like performance
>> fluctuation. Or am I miss something?
>>
>> Thanks.
>> .
>
> I will add cond_resched()in the loop of red-black tree to allow scheduling in find_chain(), may be it is enough?
That looks workable to me.
Thanks.
.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/1] mm/ksm: recover from memory failure on KSM page by migrating to healthy duplicate
2025-10-16 10:18 ` [PATCH v2 1/1] " Longlong Xia
` (2 preceding siblings ...)
2025-10-23 11:54 ` Miaohe Lin
@ 2025-10-28 9:44 ` David Hildenbrand
2025-11-03 15:15 ` [PATCH v3 0/2] mm/ksm: try " Longlong Xia
3 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2025-10-28 9:44 UTC (permalink / raw)
To: Longlong Xia, linmiaohe, lance.yang
Cc: markus.elfring, nao.horiguchi, akpm, wangkefeng.wang, qiuxu.zhuo,
xu.xin16, linux-kernel, linux-mm, Longlong Xia
On 16.10.25 12:18, Longlong Xia wrote:
> From: Longlong Xia <xialonglong@kylinos.cn>
>
> When a hardware memory error occurs on a KSM page, the current
> behavior is to kill all processes mapping that page. This can
> be overly aggressive when KSM has multiple duplicate pages in
> a chain where other duplicates are still healthy.
>
> This patch introduces a recovery mechanism that attempts to
> migrate mappings from the failing KSM page to a newly
> allocated KSM page or another healthy duplicate already
> present in the same chain, before falling back to the
> process-killing procedure.
>
> The recovery process works as follows:
> 1. Identify if the failing KSM page belongs to a stable node chain.
> 2. Locate a healthy duplicate KSM page within the same chain.
> 3. For each process mapping the failing page:
> a. Attempt to allocate a new KSM page copy from healthy duplicate
> KSM page. If successful, migrate the mapping to this new KSM page.
> b. If allocation fails, migrate the mapping to the existing healthy
> duplicate KSM page.
> 4. If all migrations succeed, remove the failing KSM page from the chain.
> 5. Only if recovery fails (e.g., no healthy duplicate found or migration
> error) does the kernel fall back to killing the affected processes.
>
> Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
> ---
> mm/ksm.c | 246 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 246 insertions(+)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 160787bb121c..9099bad1ab35 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -3084,6 +3084,246 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
> }
>
> #ifdef CONFIG_MEMORY_FAILURE
> +static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
> +{
> + struct ksm_stable_node *stable_node, *dup;
> + struct rb_node *node;
> + int nid;
> +
> + if (!is_stable_node_dup(dup_node))
> + return NULL;
> +
> + for (nid = 0; nid < ksm_nr_node_ids; nid++) {
> + node = rb_first(root_stable_tree + nid);
> + for (; node; node = rb_next(node)) {
> + stable_node = rb_entry(node,
> + struct ksm_stable_node,
> + node);
Put that into a single line for readability, please.
You can also consider factoring out this inner loop in a helper function.
> +
> + if (!is_stable_node_chain(stable_node))
> + continue;
> +
> + hlist_for_each_entry(dup, &stable_node->hlist,
> + hlist_dup) {
Single line, or properly indent.
> + if (dup == dup_node)
> + return stable_node;
> + }
> + }
> + }
> +
> + return NULL;
> +}
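Maybe something along these lines (untested sketch, helper name made up),
which also keeps the rb_entry() on a single line:

static struct ksm_stable_node *find_chain_head_in_tree(struct rb_root *root,
					struct ksm_stable_node *dup_node)
{
	struct ksm_stable_node *stable_node, *dup;
	struct rb_node *node;

	for (node = rb_first(root); node; node = rb_next(node)) {
		stable_node = rb_entry(node, struct ksm_stable_node, node);

		if (!is_stable_node_chain(stable_node))
			continue;

		hlist_for_each_entry(dup, &stable_node->hlist, hlist_dup) {
			if (dup == dup_node)
				return stable_node;
		}
	}

	return NULL;
}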
> +
> +static struct folio *find_healthy_folio(struct ksm_stable_node *chain_head,
> + struct ksm_stable_node *failing_node,
> + struct ksm_stable_node **healthy_dupdup)
> +{
> + struct ksm_stable_node *dup;
> + struct hlist_node *hlist_safe;
> + struct folio *healthy_folio;
> +
> + if (!is_stable_node_chain(chain_head) || !is_stable_node_dup(failing_node))
> + return NULL;
> +
> + hlist_for_each_entry_safe(dup, hlist_safe, &chain_head->hlist, hlist_dup) {
> + if (dup == failing_node)
> + continue;
> +
> + healthy_folio = ksm_get_folio(dup, KSM_GET_FOLIO_TRYLOCK);
> + if (healthy_folio) {
> + *healthy_dupdup = dup;
> + return healthy_folio;
> + }
> + }
> +
> + return NULL;
> +}
> +
> +static struct page *create_new_stable_node_dup(struct ksm_stable_node *chain_head,
> + struct folio *healthy_folio,
> + struct ksm_stable_node **new_stable_node)
> +{
> + int nid;
> + unsigned long kpfn;
> + struct page *new_page = NULL;
> +
> + if (!is_stable_node_chain(chain_head))
> + return NULL;
> +
> + new_page = alloc_page(GFP_HIGHUSER_MOVABLE | __GFP_ZERO);
> + if (!new_page)
> + return NULL;
> +
> + copy_highpage(new_page, folio_page(healthy_folio, 0));
> +
> + *new_stable_node = alloc_stable_node();
> + if (!*new_stable_node) {
> + __free_page(new_page);
> + return NULL;
> + }
> +
> + INIT_HLIST_HEAD(&(*new_stable_node)->hlist);
> + kpfn = page_to_pfn(new_page);
> + (*new_stable_node)->kpfn = kpfn;
> + nid = get_kpfn_nid(kpfn);
> + DO_NUMA((*new_stable_node)->nid = nid);
> + (*new_stable_node)->rmap_hlist_len = 0;
> +
> + (*new_stable_node)->head = STABLE_NODE_DUP_HEAD;
> + hlist_add_head(&(*new_stable_node)->hlist_dup, &chain_head->hlist);
> + ksm_stable_node_dups++;
> + folio_set_stable_node(page_folio(new_page), *new_stable_node);
> + folio_add_lru(page_folio(new_page));
There seems to be a lot of copy-paste. For example, why not reuse
stable_node_chain_add_dup()?
Or why not try to reuse stable_tree_insert() in the first place?
Try to reuse or factor out instead of copy-pasting, please.
> +
> + return new_page;
> +}
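E.g., roughly (sketch only; I did not double-check the exact
stable_node_chain_add_dup() argument order):

	INIT_HLIST_HEAD(&(*new_stable_node)->hlist);
	(*new_stable_node)->kpfn = page_to_pfn(new_page);
	(*new_stable_node)->rmap_hlist_len = 0;
	DO_NUMA((*new_stable_node)->nid = get_kpfn_nid((*new_stable_node)->kpfn));
	stable_node_chain_add_dup(*new_stable_node, chain_head);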
> +
> +static int replace_failing_page(struct vm_area_struct *vma, struct page *page,
> + struct page *kpage, unsigned long addr)
> +{
> + struct folio *kfolio = page_folio(kpage);
> + struct mm_struct *mm = vma->vm_mm;
> + struct folio *folio = page_folio(page);
> + pmd_t *pmd;
> + pte_t *ptep;
> + pte_t newpte;
> + spinlock_t *ptl;
> + int err = -EFAULT;
> + struct mmu_notifier_range range;
> +
> + pmd = mm_find_pmd(mm, addr);
> + if (!pmd)
> + goto out;
> +
> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
> + addr + PAGE_SIZE);
> + mmu_notifier_invalidate_range_start(&range);
> +
> + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> + if (!ptep)
> + goto out_mn;
> +
> + folio_get(kfolio);
> + folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
> + newpte = mk_pte(kpage, vma->vm_page_prot);
> +
> + flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep)));
> + ptep_clear_flush(vma, addr, ptep);
> + set_pte_at(mm, addr, ptep, newpte);
> +
> + folio_remove_rmap_pte(folio, page, vma);
> + if (!folio_mapped(folio))
> + folio_free_swap(folio);
> + folio_put(folio);
> +
> + pte_unmap_unlock(ptep, ptl);
> + err = 0;
> +out_mn:
> + mmu_notifier_invalidate_range_end(&range);
> +out:
> + return err;
> +}
This is a lot of copy-paste from replace_page(). Isn't there a way to
avoid this duplication by unifying both functions in some way?
> +
> +static void migrate_to_target_dup(struct ksm_stable_node *failing_node,
> + struct folio *failing_folio,
> + struct folio *target_folio,
> + struct ksm_stable_node *target_dup)
> +{
> + struct ksm_rmap_item *rmap_item;
> + struct hlist_node *hlist_safe;
> + int err;
> +
> + hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
> + struct mm_struct *mm = rmap_item->mm;
> + unsigned long addr = rmap_item->address & PAGE_MASK;
Can be const.
> + struct vm_area_struct *vma;
> +
> + if (!mmap_read_trylock(mm))
> + continue;
> +
> + if (ksm_test_exit(mm)) {
> + mmap_read_unlock(mm);
> + continue;
> + }
> +
> + vma = vma_lookup(mm, addr);
> + if (!vma) {
> + mmap_read_unlock(mm);
> + continue;
> + }
> +
> + if (!folio_trylock(target_folio)) {
Can't we leave the target folio locked the whole time? The caller
already locked it, why not keep it locked until we're done?
> + mmap_read_unlock(mm);
> + continue;
> + }
> +
> + err = replace_failing_page(vma, &failing_folio->page,
> + folio_page(target_folio, 0), addr);
> + if (!err) {
> + hlist_del(&rmap_item->hlist);
> + rmap_item->head = target_dup;
> + hlist_add_head(&rmap_item->hlist, &target_dup->hlist);
> + target_dup->rmap_hlist_len++;
> + failing_node->rmap_hlist_len--;
> + }
> +
> + folio_unlock(target_folio);
> + mmap_read_unlock(mm);
> + }
> +
> +}
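I.e., roughly (sketch):

	/* caller holds the target_folio lock across the whole loop */
	hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
		…
		err = replace_failing_page(vma, &failing_folio->page,
					   folio_page(target_folio, 0), addr);
		…
		mmap_read_unlock(mm);
	}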
> +
> +static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node)
> +{
> + struct folio *failing_folio = NULL;
> + struct ksm_stable_node *healthy_dupdup = NULL;
> + struct folio *healthy_folio = NULL;
> + struct ksm_stable_node *chain_head = NULL;
> + struct page *new_page = NULL;
> + struct ksm_stable_node *new_stable_node = NULL;
Only initialize what needs initialization (nothing in here?) and combine
where possible.
Like
struct folio *failing_folio, *healthy_folio;
> +
> + if (!is_stable_node_dup(failing_node))
> + return false;
> +
> + guard(mutex)(&ksm_thread_mutex);
> + failing_folio = ksm_get_folio(failing_node, KSM_GET_FOLIO_NOLOCK);
> + if (!failing_folio)
> + return false;
> +
> + chain_head = find_chain_head(failing_node);
> + if (!chain_head)
> + return NULL;
> +
> + healthy_folio = find_healthy_folio(chain_head, failing_node, &healthy_dupdup);
> + if (!healthy_folio) {
> + folio_put(failing_folio);
> + return false;
> + }
> +
> + new_page = create_new_stable_node_dup(chain_head, healthy_folio, &new_stable_node);
Why are you returning a page here and not a folio?
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v3 0/2] mm/ksm: try recover from memory failure on KSM page by migrating to healthy duplicate
2025-10-28 9:44 ` David Hildenbrand
@ 2025-11-03 15:15 ` Longlong Xia
2025-11-03 15:16 ` [PATCH v3 1/2] mm/ksm: add helper to allocate and initialize stable node duplicates Longlong Xia
2025-11-03 15:16 ` [PATCH v3 2/2] mm/ksm: try recover from memory failure on KSM page by migrating to healthy duplicate Longlong Xia
0 siblings, 2 replies; 17+ messages in thread
From: Longlong Xia @ 2025-11-03 15:15 UTC (permalink / raw)
To: david, linmiaohe
Cc: lance.yang, markus.elfring, nao.horiguchi, akpm, wangkefeng.wang,
qiuxu.zhuo, xu.xin16, linux-kernel, linux-mm, Longlong Xia
When a hardware memory error occurs on a KSM page, the current
behavior is to kill all processes mapping that page. This can
be overly aggressive when KSM has multiple duplicate pages in
a chain where other duplicates are still healthy.
This patch introduces a recovery mechanism that attempts to
migrate mappings from the failing KSM page to a newly
allocated KSM page or another healthy duplicate already
present in the same chain, before falling back to the
process-killing procedure.
The recovery process works as follows:
1. Identify if the failing KSM page belongs to a stable node chain.
2. Locate a healthy duplicate KSM page within the same chain.
3. For each process mapping the failing page:
a. Attempt to allocate a new KSM page copy from healthy duplicate
KSM page. If successful, migrate the mapping to this new KSM page.
b. If allocation fails, migrate the mapping to the existing healthy
duplicate KSM page.
4. If all migrations succeed, remove the failing KSM page from the chain.
5. Only if recovery fails (e.g., no healthy duplicate found or migration
error) does the kernel fall back to killing the affected processes.
The original idea came from Naoya Horiguchi.
https://lore.kernel.org/all/20230331054243.GB1435482@hori.linux.bs1.fc.nec.co.jp/
Real-world Application Testing:
-------------------------------
Workload: 10 QEMU VMs (1 vCPU, 256MB RAM each) with KSM enabled
Platform: x86_64, Kernel 6.6.89
Testcase 1: Single VM with KSM enabled
- VM Memory Usage:
* RSS Total = 275028 KB (268 MB)
* RSS Anon = 253656 KB (247 MB)
* RSS File = 21372 KB (20 MB)
* RSS Shmem = 0 KB (0 MB)
a. Traverse the stable tree
b. Pages on chains
2 chains detected
Chain #1: 51 duplicates, 12,956 pages (~51 MB)
Chain #2: 15 duplicates, 3,822 pages (~15 MB)
Average: 8,389 pages per chain
Sum: 16,778 pages (64.6% of ksm_pages_sharing + ksm_pages_shared)
c. Pages not on chains
Non-chain pages: 9,209 pages
d. chain_count = 2, not_chain_count = 4200
e. KSM sysfs counters
/sys/kernel/mm/ksm/ksm_pages_sharing = 21721
/sys/kernel/mm/ksm/ksm_pages_shared = 4266
/sys/kernel/mm/ksm/ksm_pages_unshared = 38098
Testcase 2: 10 VMs with KSM enabled
a. Traverse the stable tree
b. Pages on chains
8 chains detected
Chain #1: 458 duplicates, 117,012 pages (~457 MB)
Chain #2: 150 duplicates, 38,231 pages (~149 MB)
Chain #3: 10 duplicates, 2,320 pages (~9 MB)
Chain #4: 8 duplicates, 1,814 pages (~7 MB)
Chain #5-8: 4, 3, 3, 2 duplicates (920, 720, 600, 260 pages)
Surprisingly, although chains are few in number, they contribute
significantly to the overall savings. In the 10-VM scenario, only 8 chains
produce 161,877 pages (44.5% of total), while thousands of non-chain groups
contribute the remaining 55.5%.
Functional Testing (Hardware Error Injection):
----------------------------------------------
Test shell script:
modprobe einj 2>/dev/null
echo 0x10 > /sys/kernel/debug/apei/einj/error_type
echo $ADDRESS > /sys/kernel/debug/apei/einj/param1
echo 0xfffffffffffff000 > /sys/kernel/debug/apei/einj/param2
echo 1 > /sys/kernel/debug/apei/einj/error_inject
FIRST WAY: allocate a new KSM page copy from healthy duplicate
1. Allocate 1024 pages with the same content and enable KSM to merge them
After merging (each shared phy_addr is printed only once):
virtual addr = 0x71582be00000 phy_addr =0x124802000
virtual addr = 0x71582bf2c000 phy_addr =0x124902000
virtual addr = 0x71582c026000 phy_addr =0x125402000
virtual addr = 0x71582c120000 phy_addr =0x125502000
2. echo 0x124802000 > /sys/kernel/debug/apei/einj/param1
virtual addr = 0x71582be00000 phy_addr =0x1363b1000 (new allocated)
virtual addr = 0x71582bf2c000 phy_addr =0x124902000
virtual addr = 0x71582c026000 phy_addr =0x125402000
virtual addr = 0x71582c120000 phy_addr =0x125502000
kernel-log:
mce: [Hardware Error]: Machine check events logged
ksm: recovery successful, no need to kill processes
Memory failure: 0x124802: recovery action for dirty LRU page: Recovered
Memory failure: 0x124802: recovery action for already poisoned page: Failed
SECOND WAY: Migrate the mapping to the existing healthy duplicate KSM page
1. Allocate 1024 pages with the same content and enable KSM to merge them
After merging (each shared phy_addr is printed only once):
virtual addr = 0x79a172000000 phy_addr =0x141802000
virtual addr = 0x79a17212c000 phy_addr =0x141902000
virtual addr = 0x79a172226000 phy_addr =0x13cc02000
virtual addr = 0x79a172320000 phy_addr =0x13cd02000
2. echo 0x141802000 > /sys/kernel/debug/apei/einj/param1
a.virtual addr = 0x79a172000000 phy_addr =0x13cd02000
b.virtual addr = 0x79a17212c000 phy_addr =0x141902000
c.virtual addr = 0x79a172226000 phy_addr =0x13cc02000
d.virtual addr = 0x79a172320000 phy_addr =0x13cd02000 (share with a)
kernel-log:
mce: [Hardware Error]: Machine check events logged
ksm: recovery successful, no need to kill processes
Memory failure: 0x141802: recovery action for dirty LRU page: Recovered
Memory failure: 0x141802: recovery action for already poisoned page: Failed
ksm: recovery successful, no need to kill processes
Thanks for review and comments!
Changes in v3:
Patch 1/2 [New]: Preparatory refactoring
- Extract alloc_init_stable_node_dup() helper
- Refactor write_protect_page() and replace_page() to expose _addr variants
- No functional changes
Patch 2/2:
- Refactored to use alloc_init_stable_node_dup() helper from patch 1/2
and stable_node_chain_add_dup()
- Fix locking: unlock failing_folio before mmap_read_lock to avoid deadlock
- Extracted find_stable_node_in_tree() as separate helper
- Removed redundant replace_failing_page(), using write_protect_page_addr()
and replace_page_addr() instead
- Changed return type to 'struct folio *' for consistency
- Fixed code style issues
Changes in v2:
- Implemented a two-tier recovery strategy: preferring newly allocated
pages over existing duplicates to avoid concentrating mappings on a
single page suggested by David Hildenbrand
- Remove handling of the zeropage in replace_failing_page(), as it is
non-recoverable suggested by Lance Yang
- Correct the locking order by acquiring the mmap_lock before the page
lock during page replacement, suggested by Miaohe Lin
- Add protection using the ksm_thread_mutex around the entire recovery
operation to prevent race conditions with concurrent KSM scanning
- Separated the logic into smaller, more focused functions for better
maintainability
- Update patch title
Longlong Xia (2):
mm/ksm: add helper to allocate and initialize stable node duplicates
mm/ksm: try recover from memory failure on KSM page by migrating to
healthy duplicate
mm/ksm.c | 304 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 278 insertions(+), 26 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v3 1/2] mm/ksm: add helper to allocate and initialize stable node duplicates
2025-11-03 15:15 ` [PATCH v3 0/2] mm/ksm: try " Longlong Xia
@ 2025-11-03 15:16 ` Longlong Xia
2025-11-03 15:16 ` [PATCH v3 2/2] mm/ksm: try recover from memory failure on KSM page by migrating to healthy duplicate Longlong Xia
1 sibling, 0 replies; 17+ messages in thread
From: Longlong Xia @ 2025-11-03 15:16 UTC (permalink / raw)
To: david, linmiaohe
Cc: lance.yang, markus.elfring, nao.horiguchi, akpm, wangkefeng.wang,
qiuxu.zhuo, xu.xin16, linux-kernel, linux-mm, Longlong Xia
Consolidate the duplicated stable_node allocation and initialization
code in stable_tree_insert() into a new helper function
alloc_init_stable_node_dup().
Also refactor write_protect_page() and replace_page() to expose
address-based variants (_addr suffix). The wrappers maintain existing
behavior by calculating the address first.
This refactoring prepares for the upcoming memory error recovery
feature, which will need to:
1) Allocate and initialize stable_node duplicates
2) Operate on specific addresses without re-calculation
No functional changes.
Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
---
mm/ksm.c | 89 +++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 63 insertions(+), 26 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index 160787bb121c..13ec057667af 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1247,11 +1247,11 @@ static u32 calc_checksum(struct page *page)
return checksum;
}
-static int write_protect_page(struct vm_area_struct *vma, struct folio *folio,
- pte_t *orig_pte)
+static int write_protect_page_addr(struct vm_area_struct *vma, struct folio *folio,
+ unsigned long address, pte_t *orig_pte)
{
struct mm_struct *mm = vma->vm_mm;
- DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, 0, 0);
+ DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
int swapped;
int err = -EFAULT;
struct mmu_notifier_range range;
@@ -1261,10 +1261,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct folio *folio,
if (WARN_ON_ONCE(folio_test_large(folio)))
return err;
- pvmw.address = page_address_in_vma(folio, folio_page(folio, 0), vma);
- if (pvmw.address == -EFAULT)
- goto out;
+ if (address < vma->vm_start || address >= vma->vm_end)
+ return err;
+ pvmw.address = address;
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, pvmw.address,
pvmw.address + PAGE_SIZE);
mmu_notifier_invalidate_range_start(&range);
@@ -1334,21 +1334,26 @@ static int write_protect_page(struct vm_area_struct *vma, struct folio *folio,
page_vma_mapped_walk_done(&pvmw);
out_mn:
mmu_notifier_invalidate_range_end(&range);
-out:
return err;
}
-/**
- * replace_page - replace page in vma by new ksm page
- * @vma: vma that holds the pte pointing to page
- * @page: the page we are replacing by kpage
- * @kpage: the ksm page we replace page by
- * @orig_pte: the original value of the pte
- *
- * Returns 0 on success, -EFAULT on failure.
- */
-static int replace_page(struct vm_area_struct *vma, struct page *page,
- struct page *kpage, pte_t orig_pte)
+static int write_protect_page(struct vm_area_struct *vma, struct folio *folio,
+ pte_t *orig_pte)
+{
+ unsigned long address;
+
+ if (WARN_ON_ONCE(folio_test_large(folio)))
+ return -EFAULT;
+
+ address = page_address_in_vma(folio, folio_page(folio, 0), vma);
+ if (address == -EFAULT)
+ return -EFAULT;
+
+ return write_protect_page_addr(vma, folio, address, orig_pte);
+}
+
+static int replace_page_addr(struct vm_area_struct *vma, struct page *page,
+ struct page *kpage, unsigned long addr, pte_t orig_pte)
{
struct folio *kfolio = page_folio(kpage);
struct mm_struct *mm = vma->vm_mm;
@@ -1358,17 +1363,16 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
pte_t *ptep;
pte_t newpte;
spinlock_t *ptl;
- unsigned long addr;
int err = -EFAULT;
struct mmu_notifier_range range;
- addr = page_address_in_vma(folio, page, vma);
- if (addr == -EFAULT)
+ if (addr < vma->vm_start || addr >= vma->vm_end)
goto out;
pmd = mm_find_pmd(mm, addr);
if (!pmd)
goto out;
+
/*
* Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
* without holding anon_vma lock for write. So when looking for a
@@ -1441,6 +1445,29 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
return err;
}
+
+/**
+ * replace_page - replace page in vma by new ksm page
+ * @vma: vma that holds the pte pointing to page
+ * @page: the page we are replacing by kpage
+ * @kpage: the ksm page we replace page by
+ * @orig_pte: the original value of the pte
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ */
+static int replace_page(struct vm_area_struct *vma, struct page *page,
+ struct page *kpage, pte_t orig_pte)
+{
+ unsigned long addr;
+ struct folio *folio = page_folio(page);
+
+ addr = page_address_in_vma(folio, page, vma);
+ if (addr == -EFAULT)
+ return -EFAULT;
+
+ return replace_page_addr(vma, page, kpage, addr, orig_pte);
+}
+
/*
* try_to_merge_one_page - take two pages and merge them into one
* @vma: the vma that holds the pte pointing to page
@@ -2007,6 +2034,20 @@ static struct folio *stable_tree_search(struct page *page)
goto out;
}
+static struct ksm_stable_node *alloc_init_stable_node_dup(unsigned long kpfn,
+ int nid __maybe_unused)
+{
+ struct ksm_stable_node *stable_node = alloc_stable_node();
+
+ if (stable_node) {
+ INIT_HLIST_HEAD(&stable_node->hlist);
+ stable_node->kpfn = kpfn;
+ stable_node->rmap_hlist_len = 0;
+ DO_NUMA(stable_node->nid = nid);
+ }
+ return stable_node;
+}
+
/*
* stable_tree_insert - insert stable tree node pointing to new ksm page
* into the stable tree.
@@ -2065,14 +2106,10 @@ static struct ksm_stable_node *stable_tree_insert(struct folio *kfolio)
}
}
- stable_node_dup = alloc_stable_node();
+ stable_node_dup = alloc_init_stable_node_dup(kpfn, nid);
if (!stable_node_dup)
return NULL;
- INIT_HLIST_HEAD(&stable_node_dup->hlist);
- stable_node_dup->kpfn = kpfn;
- stable_node_dup->rmap_hlist_len = 0;
- DO_NUMA(stable_node_dup->nid = nid);
if (!need_chain) {
rb_link_node(&stable_node_dup->node, parent, new);
rb_insert_color(&stable_node_dup->node, root);
--
2.43.0
^ permalink raw reply [flat|nested] 17+ messages in thread

* [PATCH v3 2/2] mm/ksm: try recover from memory failure on KSM page by migrating to healthy duplicate
2025-11-03 15:15 ` [PATCH v3 0/2] mm/ksm: try " Longlong Xia
2025-11-03 15:16 ` [PATCH v3 1/2] mm/ksm: add helper to allocate and initialize stable node duplicates Longlong Xia
@ 2025-11-03 15:16 ` Longlong Xia
1 sibling, 0 replies; 17+ messages in thread
From: Longlong Xia @ 2025-11-03 15:16 UTC (permalink / raw)
To: david, linmiaohe
Cc: lance.yang, markus.elfring, nao.horiguchi, akpm, wangkefeng.wang,
qiuxu.zhuo, xu.xin16, linux-kernel, linux-mm, Longlong Xia
When a hardware memory error occurs on a KSM page, the current
behavior is to kill all processes mapping that page. This is overly
aggressive when the KSM page sits in a stable node chain that still
contains healthy duplicates.
Introduce a recovery mechanism that attempts to migrate the mappings
of the failing KSM page to a newly allocated KSM page, or to another
healthy duplicate already present in the same chain, before falling
back to the process-killing procedure.
The recovery process works as follows (a condensed code sketch follows
the list):
1. Identify whether the failing KSM page belongs to a stable node chain.
2. Locate a healthy duplicate KSM page within the same chain.
3. Try to allocate a new KSM page and populate it from the healthy
   duplicate, then for each process mapping the failing page:
   a. migrate the mapping to the new KSM page if the allocation
      succeeded,
   b. otherwise migrate the mapping to the existing healthy duplicate.
4. If all migrations succeed, remove the failing KSM page from the chain.
5. Only if recovery fails (e.g. no healthy duplicate found or a migration
   error) does the kernel fall back to killing the affected processes.
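In code terms, the happy path has roughly the shape below. This is a
condensed sketch of ksm_recover_within_chain() from the diff, with
locking, refcounting and error handling trimmed; ksm_recover_sketch()
itself is not a real function.

/*
 * Condensed sketch of the recovery flow. ksm_recover_sketch() is not a
 * real function; see ksm_recover_within_chain() in the diff below for
 * the actual locking, refcounting and error handling.
 */
static bool ksm_recover_sketch(struct ksm_stable_node *failing_node,
			       struct folio *failing_folio)
{
	struct ksm_stable_node *chain_head, *healthy_dup, *new_dup = NULL;
	struct folio *healthy_folio, *new_folio;

	chain_head = find_chain_head(failing_node);		/* step 1 */
	if (!chain_head)
		return false;

	healthy_folio = find_healthy_folio(chain_head, failing_node,
					   &healthy_dup);	/* step 2 */
	if (!healthy_folio)
		return false;

	new_folio = create_new_stable_node_dup(chain_head, healthy_folio,
					       &new_dup);	/* step 3 */
	if (new_folio)
		migrate_to_target_dup(failing_node, failing_folio,
				      new_folio, new_dup);	/* step 3a */
	else
		migrate_to_target_dup(failing_node, failing_folio,
				      healthy_folio, healthy_dup); /* step 3b */

	/* step 4: fully drained, so the failing duplicate can be dropped */
	/* step 5: returning false lets the caller fall back to killing */
	return failing_node->rmap_hlist_len == 0;
}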
Signed-off-by: Longlong Xia <xialonglong@kylinos.cn>
---
mm/ksm.c | 215 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 215 insertions(+)
diff --git a/mm/ksm.c b/mm/ksm.c
index 13ec057667af..159b486b11f1 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -3121,6 +3121,215 @@ void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc)
}
#ifdef CONFIG_MEMORY_FAILURE
+
+static struct rb_node *find_stable_node_in_tree(struct ksm_stable_node *dup_node,
+ const struct rb_root *root)
+{
+ struct rb_node *node;
+ struct ksm_stable_node *stable_node, *dup;
+
+ for (node = rb_first(root); node; node = rb_next(node)) {
+ stable_node = rb_entry(node, struct ksm_stable_node, node);
+ if (!is_stable_node_chain(stable_node))
+ continue;
+ hlist_for_each_entry(dup, &stable_node->hlist, hlist_dup) {
+ if (dup == dup_node)
+ return node;
+ }
+ cond_resched();
+ }
+ return NULL;
+}
+
+static struct ksm_stable_node *find_chain_head(struct ksm_stable_node *dup_node)
+{
+ struct rb_node *node;
+ int nid;
+
+ if (!is_stable_node_dup(dup_node))
+ return NULL;
+
+ for (nid = 0; nid < ksm_nr_node_ids; nid++) {
+ node = find_stable_node_in_tree(dup_node, root_stable_tree + nid);
+ if (node)
+ return rb_entry(node, struct ksm_stable_node, node);
+ }
+
+ return NULL;
+}
+
+static struct folio *find_healthy_folio(struct ksm_stable_node *chain_head,
+ struct ksm_stable_node *failing_node,
+ struct ksm_stable_node **healthy_stable_node)
+{
+ struct ksm_stable_node *dup;
+ struct hlist_node *hlist_safe;
+ struct folio *healthy_folio;
+
+ if (!is_stable_node_chain(chain_head) ||
+ !is_stable_node_dup(failing_node))
+ return NULL;
+
+ hlist_for_each_entry_safe(dup, hlist_safe, &chain_head->hlist,
+ hlist_dup) {
+ if (dup == failing_node)
+ continue;
+
+ healthy_folio = ksm_get_folio(dup, KSM_GET_FOLIO_TRYLOCK);
+ if (healthy_folio) {
+ *healthy_stable_node = dup;
+ return healthy_folio;
+ }
+ }
+
+ return NULL;
+}
+
+static struct folio *create_new_stable_node_dup(struct ksm_stable_node *chain_head,
+ struct folio *healthy_folio,
+ struct ksm_stable_node **new_stable_node)
+{
+ struct folio *new_folio;
+ struct page *new_page;
+ unsigned long kpfn;
+ int nid;
+
+ if (!is_stable_node_chain(chain_head))
+ return NULL;
+
+ new_page = alloc_page(GFP_HIGHUSER_MOVABLE);
+ if (!new_page)
+ return NULL;
+
+ new_folio = page_folio(new_page);
+ copy_highpage(new_page, folio_page(healthy_folio, 0));
+
+ kpfn = folio_pfn(new_folio);
+ nid = get_kpfn_nid(kpfn);
+ *new_stable_node = alloc_init_stable_node_dup(kpfn, nid);
+ if (!*new_stable_node) {
+ folio_put(new_folio);
+ return NULL;
+ }
+
+ stable_node_chain_add_dup(*new_stable_node, chain_head);
+ folio_set_stable_node(new_folio, *new_stable_node);
+
+ /* Lock the folio before adding to LRU, consistent with ksm_get_folio */
+ folio_lock(new_folio);
+ folio_add_lru(new_folio);
+
+ return new_folio;
+}
+
+static void migrate_to_target_dup(struct ksm_stable_node *failing_node,
+ struct folio *failing_folio,
+ struct folio *target_folio,
+ struct ksm_stable_node *target_dup)
+{
+ struct ksm_rmap_item *rmap_item;
+ struct hlist_node *hlist_safe;
+ struct page *target_page = folio_page(target_folio, 0);
+ int err;
+
+ hlist_for_each_entry_safe(rmap_item, hlist_safe, &failing_node->hlist, hlist) {
+ struct mm_struct *mm = rmap_item->mm;
+ const unsigned long addr = rmap_item->address & PAGE_MASK;
+ struct vm_area_struct *vma;
+ pte_t orig_pte = __pte(0);
+
+ guard(mmap_read_lock)(mm);
+
+ vma = find_mergeable_vma(mm, addr);
+ if (!vma)
+ continue;
+
+ folio_lock(failing_folio);
+
+ err = write_protect_page_addr(vma, failing_folio, addr, &orig_pte);
+ if (err) {
+ folio_unlock(failing_folio);
+ continue;
+ }
+
+ err = replace_page_addr(vma, &failing_folio->page, target_page, addr, orig_pte);
+ if (!err) {
+ hlist_del(&rmap_item->hlist);
+ rmap_item->head = target_dup;
+ DO_NUMA(rmap_item->nid = target_dup->nid);
+ hlist_add_head(&rmap_item->hlist, &target_dup->hlist);
+ target_dup->rmap_hlist_len++;
+ failing_node->rmap_hlist_len--;
+ }
+ folio_unlock(failing_folio);
+ }
+}
+
+static bool ksm_recover_within_chain(struct ksm_stable_node *failing_node)
+{
+ struct folio *failing_folio, *healthy_folio, *target_folio;
+ struct ksm_stable_node *healthy_stable_node, *chain_head, *target_dup;
+ struct folio *new_folio = NULL;
+ struct ksm_stable_node *new_stable_node = NULL;
+
+ if (!is_stable_node_dup(failing_node))
+ return false;
+
+ guard(mutex)(&ksm_thread_mutex);
+
+ failing_folio = ksm_get_folio(failing_node, KSM_GET_FOLIO_NOLOCK);
+ if (!failing_folio)
+ return false;
+
+ chain_head = find_chain_head(failing_node);
+ if (!chain_head) {
+ folio_put(failing_folio);
+ return false;
+ }
+
+ healthy_folio = find_healthy_folio(chain_head, failing_node, &healthy_stable_node);
+ if (!healthy_folio) {
+ folio_put(failing_folio);
+ return false;
+ }
+
+ new_folio = create_new_stable_node_dup(chain_head, healthy_folio, &new_stable_node);
+
+ if (new_folio && new_stable_node) {
+ target_folio = new_folio;
+ target_dup = new_stable_node;
+
+ /* Release healthy_folio since we're using new_folio */
+ folio_unlock(healthy_folio);
+ folio_put(healthy_folio);
+ } else {
+ target_folio = healthy_folio;
+ target_dup = healthy_stable_node;
+ }
+
+ /*
+ * failing_folio was locked in memory_failure(). Unlock it before
+ * acquiring mmap_read_lock to avoid lock inversion deadlock.
+ */
+ folio_unlock(failing_folio);
+ migrate_to_target_dup(failing_node, failing_folio, target_folio, target_dup);
+ folio_lock(failing_folio);
+
+ folio_unlock(target_folio);
+ folio_put(target_folio);
+
+ if (failing_node->rmap_hlist_len == 0) {
+ folio_set_stable_node(failing_folio, NULL);
+ __stable_node_dup_del(failing_node);
+ free_stable_node(failing_node);
+ folio_put(failing_folio);
+ return true;
+ }
+
+ folio_put(failing_folio);
+ return false;
+}
+
/*
* Collect processes when the error hit an ksm page.
*/
@@ -3135,6 +3344,12 @@ void collect_procs_ksm(const struct folio *folio, const struct page *page,
stable_node = folio_stable_node(folio);
if (!stable_node)
return;
+
+ if (ksm_recover_within_chain(stable_node)) {
+ pr_info("ksm: recovery successful, no need to kill processes\n");
+ return;
+ }
+
hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
struct anon_vma *av = rmap_item->anon_vma;
--
2.43.0
^ permalink raw reply [flat|nested] 17+ messages in thread