* [PATCH] mm: Fix the race between collapse and PT_RECLAIM under per-vma lock
@ 2025-08-05 3:54 Barry Song
2025-08-05 5:20 ` Lorenzo Stoakes
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: Barry Song @ 2025-08-05 3:54 UTC (permalink / raw)
To: akpm, linux-mm
Cc: linux-kernel, Barry Song, Lai, Yi, David Hildenbrand,
Lorenzo Stoakes, Qi Zheng, Vlastimil Babka, Jann Horn,
Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng, Lance Yang,
Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain
From: Barry Song <v-songbaohua@oppo.com>
The check_pmd_still_valid() call during collapse is currently only
protected by the mmap_lock in write mode, which was sufficient when
pt_reclaim always ran under mmap_lock in read mode. However, since
madvise_dontneed can now execute under a per-VMA lock, this assumption
is no longer valid. As a result, a race condition can occur between
collapse and PT_RECLAIM, potentially leading to a kernel panic.
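A minimal sketch of the problematic interleaving (editorial illustration based
on the analysis later in this thread; assumes the pre-patch ordering in
collapse_huge_page()):

collapse (mmap write lock)              MADV_DONTNEED / pt_reclaim (per-VMA lock)
--------------------------             ------------------------------------------
check_pmd_still_valid() -> SCAN_SUCCEED
                                        clear the pmd entry, free the
                                        PTE page table via RCU
vma_start_write()   <-- taken too late to exclude the per-VMA-locked zap
_pmd = pmdp_collapse_flush(vma, address, pmd)    <-- reads back a none pmd
pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl)   <-- returns NULL
pmd_populate(mm, pmd, pmd_pgtable(_pmd))   <-- repopulates the pmd from a none value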
[ 38.151897] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] SMP KASI
[ 38.153519] KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
[ 38.154605] CPU: 0 UID: 0 PID: 721 Comm: repro Not tainted 6.16.0-next-20250801-next-2025080 #1 PREEMPT(voluntary)
[ 38.155929] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org4
[ 38.157418] RIP: 0010:kasan_byte_accessible+0x15/0x30
[ 38.158125] Code: 03 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 48 b8 00 00 00 00 00 fc0
[ 38.160461] RSP: 0018:ffff88800feef678 EFLAGS: 00010286
[ 38.161220] RAX: dffffc0000000000 RBX: 0000000000000001 RCX: 1ffffffff0dde60c
[ 38.162232] RDX: 0000000000000000 RSI: ffffffff85da1e18 RDI: dffffc0000000003
[ 38.163176] RBP: ffff88800feef698 R08: 0000000000000001 R09: 0000000000000000
[ 38.164195] R10: 0000000000000000 R11: ffff888016a8ba58 R12: 0000000000000018
[ 38.165189] R13: 0000000000000018 R14: ffffffff85da1e18 R15: 0000000000000000
[ 38.166100] FS: 0000000000000000(0000) GS:ffff8880e3b40000(0000) knlGS:0000000000000000
[ 38.167137] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 38.167891] CR2: 00007f97fadfe504 CR3: 0000000007088005 CR4: 0000000000770ef0
[ 38.168812] PKRU: 55555554
[ 38.169275] Call Trace:
[ 38.169647] <TASK>
[ 38.169975] ? __kasan_check_byte+0x19/0x50
[ 38.170581] lock_acquire+0xea/0x310
[ 38.171083] ? rcu_is_watching+0x19/0xc0
[ 38.171615] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 38.172343] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[ 38.173130] _raw_spin_lock+0x38/0x50
[ 38.173707] ? __pte_offset_map_lock+0x1a2/0x3c0
[ 38.174390] __pte_offset_map_lock+0x1a2/0x3c0
[ 38.174987] ? __pfx___pte_offset_map_lock+0x10/0x10
[ 38.175724] ? __pfx_pud_val+0x10/0x10
[ 38.176308] ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[ 38.177183] unmap_page_range+0xb60/0x43e0
[ 38.177824] ? __pfx_unmap_page_range+0x10/0x10
[ 38.178485] ? mas_next_slot+0x133a/0x1a50
[ 38.179079] unmap_single_vma.constprop.0+0x15b/0x250
[ 38.179830] unmap_vmas+0x1fa/0x460
[ 38.180373] ? __pfx_unmap_vmas+0x10/0x10
[ 38.180994] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 38.181877] exit_mmap+0x1a2/0xb40
[ 38.182396] ? lock_release+0x14f/0x2c0
[ 38.182929] ? __pfx_exit_mmap+0x10/0x10
[ 38.183474] ? __pfx___mutex_unlock_slowpath+0x10/0x10
[ 38.184188] ? mutex_unlock+0x16/0x20
[ 38.184704] mmput+0x132/0x370
[ 38.185208] do_exit+0x7e7/0x28c0
[ 38.185682] ? __this_cpu_preempt_check+0x21/0x30
[ 38.186328] ? do_group_exit+0x1d8/0x2c0
[ 38.186873] ? __pfx_do_exit+0x10/0x10
[ 38.187401] ? __this_cpu_preempt_check+0x21/0x30
[ 38.188036] ? _raw_spin_unlock_irq+0x2c/0x60
[ 38.188634] ? lockdep_hardirqs_on+0x89/0x110
[ 38.189313] do_group_exit+0xe4/0x2c0
[ 38.189831] __x64_sys_exit_group+0x4d/0x60
[ 38.190413] x64_sys_call+0x2174/0x2180
[ 38.190935] do_syscall_64+0x6d/0x2e0
[ 38.191449] entry_SYSCALL_64_after_hwframe+0x76/0x7e
This patch moves the vma_start_write() call to precede
check_pmd_still_valid(), ensuring that the check is also properly
protected by the per-VMA lock.
Fixes: a6fde7add78d ("mm: use per_vma lock for MADV_DONTNEED")
Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/all/aJAFrYfyzGpbm+0m@ly-workstation/
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
mm/khugepaged.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 374a6a5193a7..6b40bdfd224c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
if (result != SCAN_SUCCEED)
goto out_up_write;
/* check if the pmd is still valid */
+ vma_start_write(vma);
result = check_pmd_still_valid(mm, address, pmd);
if (result != SCAN_SUCCEED)
goto out_up_write;
- vma_start_write(vma);
anon_vma_lock_write(vma->anon_vma);
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
--
2.39.3 (Apple Git-146)
* Re: [PATCH] mm: Fix the race between collapse and PT_RECLAIM under per-vma lock
2025-08-05 3:54 [PATCH] mm: Fix the race between collapse and PT_RECLAIM under per-vma lock Barry Song
@ 2025-08-05 5:20 ` Lorenzo Stoakes
2025-08-05 6:41 ` Baolin Wang
` (2 subsequent siblings)
3 siblings, 0 replies; 12+ messages in thread
From: Lorenzo Stoakes @ 2025-08-05 5:20 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, linux-kernel, Barry Song, Lai, Yi,
David Hildenbrand, Qi Zheng, Vlastimil Babka, Jann Horn,
Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng, Lance Yang,
Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain
Andrew - to be clear, we need this as a hotfix for 6.17, as this is a known bug
in rc1 right now.
On Tue, Aug 05, 2025 at 11:54:47AM +0800, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> The check_pmd_still_valid() call during collapse is currently only
> protected by the mmap_lock in write mode, which was sufficient when
> pt_reclaim always ran under mmap_lock in read mode. However, since
> madvise_dontneed can now execute under a per-VMA lock, this assumption
> is no longer valid. As a result, a race condition can occur between
> collapse and PT_RECLAIM, potentially leading to a kernel panic.
>
> [... oops trace snipped; identical to the one in the patch above ...]
>
> This patch moves the vma_start_write() call to precede
> check_pmd_still_valid(), ensuring that the check is also properly
> protected by the per-VMA lock.
>
> Fixes: a6fde7add78d ("mm: use per_vma lock for MADV_DONTNEED")
> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Closes: https://lore.kernel.org/all/aJAFrYfyzGpbm+0m@ly-workstation/
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Qi Zheng <zhengqi.arch@bytedance.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jann Horn <jannh@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Cc: Lance Yang <ioworker0@gmail.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Looks good to me so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/khugepaged.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 374a6a5193a7..6b40bdfd224c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> + vma_start_write(vma);
> result = check_pmd_still_valid(mm, address, pmd);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
>
> - vma_start_write(vma);
> anon_vma_lock_write(vma->anon_vma);
>
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> --
> 2.39.3 (Apple Git-146)
>
* Re: [PATCH] mm: Fix the race between collapse and PT_RECLAIM under per-vma lock
2025-08-05 3:54 [PATCH] mm: Fix the race between collapse and PT_RECLAIM under per-vma lock Barry Song
2025-08-05 5:20 ` Lorenzo Stoakes
@ 2025-08-05 6:41 ` Baolin Wang
2025-08-05 6:42 ` Qi Zheng
2025-08-05 8:02 ` David Hildenbrand
3 siblings, 0 replies; 12+ messages in thread
From: Baolin Wang @ 2025-08-05 6:41 UTC (permalink / raw)
To: Barry Song, akpm, linux-mm
Cc: linux-kernel, Barry Song, Lai, Yi, David Hildenbrand,
Lorenzo Stoakes, Qi Zheng, Vlastimil Babka, Jann Horn,
Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng, Lance Yang,
Zi Yan, Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain
On 2025/8/5 11:54, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> The check_pmd_still_valid() call during collapse is currently only
> protected by the mmap_lock in write mode, which was sufficient when
> pt_reclaim always ran under mmap_lock in read mode. However, since
> madvise_dontneed can now execute under a per-VMA lock, this assumption
> is no longer valid. As a result, a race condition can occur between
> collapse and PT_RECLAIM, potentially leading to a kernel panic.
>
> [... oops trace snipped; identical to the one in the patch above ...]
>
> This patch moves the vma_start_write() call to precede
> check_pmd_still_valid(), ensuring that the check is also properly
> protected by the per-VMA lock.
>
> Fixes: a6fde7add78d ("mm: use per_vma lock for MADV_DONTNEED")
> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Closes: https://lore.kernel.org/all/aJAFrYfyzGpbm+0m@ly-workstation/
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Qi Zheng <zhengqi.arch@bytedance.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jann Horn <jannh@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Cc: Lance Yang <ioworker0@gmail.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
* Re: [PATCH] mm: Fix the race between collapse and PT_RECLAIM under per-vma lock
2025-08-05 3:54 [PATCH] mm: Fix the race between collapse and PT_RECLAIM under per-vma lock Barry Song
2025-08-05 5:20 ` Lorenzo Stoakes
2025-08-05 6:41 ` Baolin Wang
@ 2025-08-05 6:42 ` Qi Zheng
2025-08-05 7:53 ` Baolin Wang
2025-08-05 8:02 ` David Hildenbrand
3 siblings, 1 reply; 12+ messages in thread
From: Qi Zheng @ 2025-08-05 6:42 UTC (permalink / raw)
To: Barry Song, akpm, linux-mm
Cc: linux-kernel, Barry Song, Lai, Yi, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, Lance Yang, Zi Yan, Baolin Wang,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain
Hi Barry,
On 8/5/25 11:54 AM, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> The check_pmd_still_valid() call during collapse is currently only
> protected by the mmap_lock in write mode, which was sufficient when
> pt_reclaim always ran under mmap_lock in read mode. However, since
> madvise_dontneed can now execute under a per-VMA lock, this assumption
> is no longer valid. As a result, a race condition can occur between
> collapse and PT_RECLAIM, potentially leading to a kernel panic.
There is indeed a race condition here, and after applying this patch I
can no longer reproduce the problem locally (I was able to reproduce it
reliably last night).
But I still can't figure out how this race condition causes the
following panic:
exit_mmap
--> mmap_read_lock()
unmap_vmas()
--> pte_offset_map_lock
--> rcu_read_lock()
check if the pmd entry is a PTE page
ptl = pte_lockptr(mm, &pmdval) <-- ptl is NULL
spin_lock(ptl) <-- PANIC!!
If this PTE page is freed by pt_reclaim (via RCU), then the ptl cannot
be NULL.
And collapse holds the mmap write lock, so it cannot run concurrently
with exit_mmap().
Confusing. :(
>
> [... oops trace snipped; identical to the one in the patch above ...]
>
> This patch moves the vma_start_write() call to precede
> check_pmd_still_valid(), ensuring that the check is also properly
> protected by the per-VMA lock.
>
> Fixes: a6fde7add78d ("mm: use per_vma lock for MADV_DONTNEED")
> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Closes: https://lore.kernel.org/all/aJAFrYfyzGpbm+0m@ly-workstation/
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Qi Zheng <zhengqi.arch@bytedance.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jann Horn <jannh@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Cc: Lance Yang <ioworker0@gmail.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> mm/khugepaged.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 374a6a5193a7..6b40bdfd224c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> + vma_start_write(vma);
> result = check_pmd_still_valid(mm, address, pmd);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
>
> - vma_start_write(vma);
> anon_vma_lock_write(vma->anon_vma);
>
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
* Re: [PATCH] mm: Fix the race between collapse and PT_RECLAIM under per-vma lock
2025-08-05 6:42 ` Qi Zheng
@ 2025-08-05 7:53 ` Baolin Wang
2025-08-05 8:17 ` Qi Zheng
0 siblings, 1 reply; 12+ messages in thread
From: Baolin Wang @ 2025-08-05 7:53 UTC (permalink / raw)
To: Qi Zheng, Barry Song, akpm, linux-mm
Cc: linux-kernel, Barry Song, Lai, Yi, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, Lance Yang, Zi Yan,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain
On 2025/8/5 14:42, Qi Zheng wrote:
> Hi Barry,
>
> On 8/5/25 11:54 AM, Barry Song wrote:
>> From: Barry Song <v-songbaohua@oppo.com>
>>
>> The check_pmd_still_valid() call during collapse is currently only
>> protected by the mmap_lock in write mode, which was sufficient when
>> pt_reclaim always ran under mmap_lock in read mode. However, since
>> madvise_dontneed can now execute under a per-VMA lock, this assumption
>> is no longer valid. As a result, a race condition can occur between
>> collapse and PT_RECLAIM, potentially leading to a kernel panic.
>
> There is indeed a race condition here. And after applying this patch, I
> can no longer reproduce the problem locally (I was able to reproduce it
> stably locally last night).
>
> But I still can't figure out how this race condtion causes the
> following panic:
>
> exit_mmap
> --> mmap_read_lock()
> unmap_vmas()
> --> pte_offset_map_lock
> --> rcu_read_lock()
> check if the pmd entry is a PTE page
> ptl = pte_lockptr(mm, &pmdval) <-- ptl is NULL
> spin_lock(ptl) <-- PANIC!!
>
> If this PTE page is freed by pt_reclaim (via RCU), then the ptl can not
> be NULL.
>
> The collapse holds mmap write lock, so it is impossible to be concurrent
> with exit_mmap().
>
> Confusing. :(
IIUC, the issue is not caused by concurrency between exit_mmap and
collapse, but rather by concurrency between pt_reclaim and collapse.
Before this patch, khugepaged might incorrectly restore a PTE pagetable
that had already been freed.
pt_reclaim has cleared the pmd entry and freed the PTE page table.
However, due to the race, check_pmd_still_valid() still passes and the
collapse continues:
_pmd = pmdp_collapse_flush(vma, address, pmd); ---> returns a none pmd
entry (the original pmd entry has been cleared)
pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); ---> returns
pte == NULL
Then khugepaged restores the old PTE pagetable from this invalid (none)
pmd value:
pmd_populate(mm, pmd, pmd_pgtable(_pmd));
So when the process exits and tries to tear down its mappings,
traversing the invalid pmd table leads to a crash.
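As a rough sketch, the failing path looks like this (simplified from
collapse_huge_page() in mm/khugepaged.c; error handling and unrelated
steps trimmed):

	pmd_t _pmd;
	pte_t *pte;

	/* pt_reclaim already cleared *pmd, so a none value is read back */
	_pmd = pmdp_collapse_flush(vma, address, pmd);
	...
	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
	if (!pte)
		result = SCAN_PMD_NULL;		/* the collapse is aborted */
	...
	if (unlikely(result != SCAN_SUCCEED)) {
		spin_lock(pmd_ptl);
		/*
		 * Meant to reinstall the original PTE table, but _pmd is
		 * none here, so pmd_pgtable(_pmd) is not the (already
		 * freed) page table at all.
		 */
		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
		spin_unlock(pmd_ptl);
	}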
Barry, please correct me if I have misunderstood something.
>> [... oops trace snipped; identical to the one in the patch above ...]
>>
>> This patch moves the vma_start_write() call to precede
>> check_pmd_still_valid(), ensuring that the check is also properly
>> protected by the per-VMA lock.
>>
>> Fixes: a6fde7add78d ("mm: use per_vma lock for MADV_DONTNEED")
>> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
>> Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
>> Closes: https://lore.kernel.org/all/aJAFrYfyzGpbm+0m@ly-workstation/
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Qi Zheng <zhengqi.arch@bytedance.com>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Jann Horn <jannh@google.com>
>> Cc: Suren Baghdasaryan <surenb@google.com>
>> Cc: Lokesh Gidra <lokeshgidra@google.com>
>> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
>> Cc: Lance Yang <ioworker0@gmail.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> ---
>> mm/khugepaged.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 374a6a5193a7..6b40bdfd224c 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct
>> *mm, unsigned long address,
>> if (result != SCAN_SUCCEED)
>> goto out_up_write;
>> /* check if the pmd is still valid */
>> + vma_start_write(vma);
>> result = check_pmd_still_valid(mm, address, pmd);
>> if (result != SCAN_SUCCEED)
>> goto out_up_write;
>> - vma_start_write(vma);
>> anon_vma_lock_write(vma->anon_vma);
>> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
* Re: [PATCH] mm: Fix the race between collapse and PT_RECLAIM under per-vma lock
2025-08-05 3:54 [PATCH] mm: Fix the race between collapse and PT_RECLAIM under per-vma lock Barry Song
` (2 preceding siblings ...)
2025-08-05 6:42 ` Qi Zheng
@ 2025-08-05 8:02 ` David Hildenbrand
3 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2025-08-05 8:02 UTC (permalink / raw)
To: Barry Song, akpm, linux-mm
Cc: linux-kernel, Barry Song, Lai, Yi, Lorenzo Stoakes, Qi Zheng,
Vlastimil Babka, Jann Horn, Suren Baghdasaryan, Lokesh Gidra,
Tangquan Zheng, Lance Yang, Zi Yan, Baolin Wang,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain
On 05.08.25 05:54, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> The check_pmd_still_valid() call during collapse is currently only
> protected by the mmap_lock in write mode, which was sufficient when
> pt_reclaim always ran under mmap_lock in read mode. However, since
> madvise_dontneed can now execute under a per-VMA lock, this assumption
> is no longer valid. As a result, a race condition can occur between
> collapse and PT_RECLAIM, potentially leading to a kernel panic.
>
> [... oops trace snipped; identical to the one in the patch above ...]
>
> This patch moves the vma_start_write() call to precede
> check_pmd_still_valid(), ensuring that the check is also properly
> protected by the per-VMA lock.
>
> Fixes: a6fde7add78d ("mm: use per_vma lock for MADV_DONTNEED")
> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
> Closes: https://lore.kernel.org/all/aJAFrYfyzGpbm+0m@ly-workstation/
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Qi Zheng <zhengqi.arch@bytedance.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jann Horn <jannh@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Cc: Lance Yang <ioworker0@gmail.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> mm/khugepaged.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 374a6a5193a7..6b40bdfd224c 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1172,11 +1172,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> + vma_start_write(vma);
> result = check_pmd_still_valid(mm, address, pmd);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
>
> - vma_start_write(vma);
> anon_vma_lock_write(vma->anon_vma);
>
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
LGTM. I was wondering whether we should just place it next to the
mmap_write_lock(), on the assumption that hugepage_vma_revalidate()
will commonly not fail.
So personally, I would move it further up.
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
* Re: [PATCH] mm: Fix the race between collapse and PT_RECLAIM under per-vma lock
2025-08-05 7:53 ` Baolin Wang
@ 2025-08-05 8:17 ` Qi Zheng
2025-08-05 8:56 ` Baolin Wang
0 siblings, 1 reply; 12+ messages in thread
From: Qi Zheng @ 2025-08-05 8:17 UTC (permalink / raw)
To: Baolin Wang, Barry Song, akpm, linux-mm
Cc: linux-kernel, Barry Song, Lai, Yi, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, Lance Yang, Zi Yan,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain
Hi Baolin,
On 8/5/25 3:53 PM, Baolin Wang wrote:
>
>
> On 2025/8/5 14:42, Qi Zheng wrote:
>> Hi Barry,
>>
>> On 8/5/25 11:54 AM, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> The check_pmd_still_valid() call during collapse is currently only
>>> protected by the mmap_lock in write mode, which was sufficient when
>>> pt_reclaim always ran under mmap_lock in read mode. However, since
>>> madvise_dontneed can now execute under a per-VMA lock, this assumption
>>> is no longer valid. As a result, a race condition can occur between
>>> collapse and PT_RECLAIM, potentially leading to a kernel panic.
>>
>> There is indeed a race condition here. And after applying this patch, I
>> can no longer reproduce the problem locally (I was able to reproduce it
>> stably locally last night).
>>
>> But I still can't figure out how this race condtion causes the
>> following panic:
>>
>> exit_mmap
>> --> mmap_read_lock()
>> unmap_vmas()
>> --> pte_offset_map_lock
>> --> rcu_read_lock()
>> check if the pmd entry is a PTE page
>> ptl = pte_lockptr(mm, &pmdval) <-- ptl is NULL
>> spin_lock(ptl) <-- PANIC!!
>>
>> If this PTE page is freed by pt_reclaim (via RCU), then the ptl can
>> not be NULL.
>>
>> The collapse holds mmap write lock, so it is impossible to be concurrent
>> with exit_mmap().
>>
>> Confusing. :(
>
> IIUC, the issue is not caused by the concurrency between exit_mmap and
> collapse, but rather by the concurrency between pt_reclaim and collapse.
>
> Before this patch, khugepaged might incorrectly restore a PTE pagetable
> that had already been freed.
>
> pt_reclaim has cleared the pmd entry and freed the PTE page table.
> However, due to the race condition, check_pmd_still_valid() still passes
> and continues to attempt the collapse:
>
> _pmd = pmdp_collapse_flush(vma, address, pmd); ---> returns a none pmd
> entry (the original pmd entry has been cleared)
>
> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); ---> returns
> pte == NULL
>
> Then khugepaged will restore the old PTE pagetable with an invalid pmd
> entry:
>
> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>
> So when the process exits and tries to free the mapping of the process,
> traversing the invalid pmd table will lead to a crash.
CPU0                                            CPU1
====                                            ====

collapse
--> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
mmap_write_unlock
                                                exit_mmap
                                                --> hold mmap lock
                                                    __pte_offset_map_lock
                                                    --> pte = __pte_offset_map(pmd, addr, &pmdval);
                                                        if (unlikely(!pte))
                                                            return pte; <-- will return
IIUC, in this case, if we get an invalid pmd entry, we will return
directly instead of causing a crash?
>
> Barry, please correct me if I have misunderstood something.
>
* Re: [PATCH] mm: Fix the race between collapse and PT_RECLAIM under per-vma lock
2025-08-05 8:17 ` Qi Zheng
@ 2025-08-05 8:56 ` Baolin Wang
2025-08-05 9:30 ` Qi Zheng
0 siblings, 1 reply; 12+ messages in thread
From: Baolin Wang @ 2025-08-05 8:56 UTC (permalink / raw)
To: Qi Zheng, Barry Song, akpm, linux-mm
Cc: linux-kernel, Barry Song, Lai, Yi, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, Lance Yang, Zi Yan,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain
On 2025/8/5 16:17, Qi Zheng wrote:
> Hi Baolin,
>
> On 8/5/25 3:53 PM, Baolin Wang wrote:
>>
>>
>> On 2025/8/5 14:42, Qi Zheng wrote:
>>> Hi Barry,
>>>
>>> On 8/5/25 11:54 AM, Barry Song wrote:
>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>
>>>> The check_pmd_still_valid() call during collapse is currently only
>>>> protected by the mmap_lock in write mode, which was sufficient when
>>>> pt_reclaim always ran under mmap_lock in read mode. However, since
>>>> madvise_dontneed can now execute under a per-VMA lock, this assumption
>>>> is no longer valid. As a result, a race condition can occur between
>>>> collapse and PT_RECLAIM, potentially leading to a kernel panic.
>>>
>>> There is indeed a race condition here. And after applying this patch, I
>>> can no longer reproduce the problem locally (I was able to reproduce it
>>> stably locally last night).
>>>
>>> But I still can't figure out how this race condition causes the
>>> following panic:
>>>
>>> exit_mmap
>>> --> mmap_read_lock()
>>> unmap_vmas()
>>> --> pte_offset_map_lock
>>> --> rcu_read_lock()
>>> check if the pmd entry is a PTE page
>>> ptl = pte_lockptr(mm, &pmdval) <-- ptl is NULL
>>> spin_lock(ptl) <-- PANIC!!
>>>
>>> If this PTE page is freed by pt_reclaim (via RCU), then the ptl can
>>> not be NULL.
>>>
>>> The collapse holds mmap write lock, so it is impossible to be concurrent
>>> with exit_mmap().
>>>
>>> Confusing. :(
>>
>> IIUC, the issue is not caused by the concurrency between exit_mmap and
>> collapse, but rather by the concurrency between pt_reclaim and collapse.
>>
>> Before this patch, khugepaged might incorrectly restore a PTE
>> pagetable that had already been freed.
>>
>> pt_reclaim has cleared the pmd entry and freed the PTE page table.
>> However, due to the race condition, check_pmd_still_valid() still
>> passes and continues to attempt the collapse:
>>
>> _pmd = pmdp_collapse_flush(vma, address, pmd); ---> returns a none pmd
>> entry (the original pmd entry has been cleared)
>>
>> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); ---> returns
>> pte == NULL
>>
>> Then khugepaged will restore the old PTE pagetable with an invalid pmd
>> entry:
>>
>> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>>
>> So when the process exits and tries to free the mapping of the process,
>> traversing the invalid pmd table will lead to a crash.
>
> CPU0 CPU1
> ==== ====
>
> collapse
> --> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> mmap_write_unlock
> exit_mmap
> --> hold mmap lock
> __pte_offset_map_lock
> --> pte = __pte_offset_map(pmd, addr, &pmdval);
> if (unlikely(!pte))
> return pte; <-- will return
__pte_offset_map() might not return NULL here. The 'pmd_populate(mm,
pmd, pmd_pgtable(_pmd))' can populate the pmd with a page that looks
valid (even though the '_pmd' value is none), but it is not the
original pagetable page.
> IIUC, in this case, if we get an invalid pmd entry, we will return
> directly instead of causing a crash?
>
>>
>> Barry, please correct me if I have misunderstood something.
>>
>
* Re: [PATCH] mm: Fix the race between collapse and PT_RECLAIM under per-vma lock
2025-08-05 8:56 ` Baolin Wang
@ 2025-08-05 9:30 ` Qi Zheng
2025-08-05 9:50 ` David Hildenbrand
0 siblings, 1 reply; 12+ messages in thread
From: Qi Zheng @ 2025-08-05 9:30 UTC (permalink / raw)
To: Baolin Wang, Barry Song, akpm, linux-mm
Cc: linux-kernel, Barry Song, Lai, Yi, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
Lokesh Gidra, Tangquan Zheng, Lance Yang, Zi Yan,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain
On 8/5/25 4:56 PM, Baolin Wang wrote:
>
>
> On 2025/8/5 16:17, Qi Zheng wrote:
>> Hi Baolin,
>>
>> On 8/5/25 3:53 PM, Baolin Wang wrote:
>>>
>>>
>>> On 2025/8/5 14:42, Qi Zheng wrote:
>>>> Hi Barry,
>>>>
>>>> On 8/5/25 11:54 AM, Barry Song wrote:
>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>
>>>>> The check_pmd_still_valid() call during collapse is currently only
>>>>> protected by the mmap_lock in write mode, which was sufficient when
>>>>> pt_reclaim always ran under mmap_lock in read mode. However, since
>>>>> madvise_dontneed can now execute under a per-VMA lock, this assumption
>>>>> is no longer valid. As a result, a race condition can occur between
>>>>> collapse and PT_RECLAIM, potentially leading to a kernel panic.
>>>>
>>>> There is indeed a race condition here. And after applying this patch, I
>>>> can no longer reproduce the problem locally (I was able to reproduce it
>>>> stably locally last night).
>>>>
>>>> But I still can't figure out how this race condition causes the
>>>> following panic:
>>>>
>>>> exit_mmap
>>>> --> mmap_read_lock()
>>>> unmap_vmas()
>>>> --> pte_offset_map_lock
>>>> --> rcu_read_lock()
>>>> check if the pmd entry is a PTE page
>>>> ptl = pte_lockptr(mm, &pmdval) <-- ptl is NULL
>>>> spin_lock(ptl) <-- PANIC!!
>>>>
>>>> If this PTE page is freed by pt_reclaim (via RCU), then the ptl can
>>>> not be NULL.
>>>>
>>>> The collapse holds mmap write lock, so it is impossible to be
>>>> concurrent
>>>> with exit_mmap().
>>>>
>>>> Confusing. :(
>>>
>>> IIUC, the issue is not caused by the concurrency between exit_mmap
>>> and collapse, but rather by the concurrency between pt_reclaim and
>>> collapse.
>>>
>>> Before this patch, khugepaged might incorrectly restore a PTE
>>> pagetable that had already been freed.
>>>
>>> pt_reclaim has cleared the pmd entry and freed the PTE page table.
>>> However, due to the race condition, check_pmd_still_valid() still
>>> passes and continues to attempt the collapse:
>>>
>>> _pmd = pmdp_collapse_flush(vma, address, pmd); ---> returns a none
>>> pmd entry (the original pmd entry has been cleared)
>>>
>>> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); ---> returns
>>> pte == NULL
>>>
>>> Then khugepaged will restore the old PTE pagetable with an invalid
>>> pmd entry:
>>>
>>> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>>>
>>> So when the process exits and tries to free the mapping of the
>>> process, traversing the invalid pmd table will lead to a crash.
>>
>> CPU0 CPU1
>> ==== ====
>>
>> collapse
>> --> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>> mmap_write_unlock
>> exit_mmap
>> --> hold mmap lock
>> __pte_offset_map_lock
>> --> pte = __pte_offset_map(pmd, addr, &pmdval);
>> if (unlikely(!pte))
>> return pte; <-- will return
>
> __pte_offset_map() might not return NULL? Because the 'pmd_populate(mm,
> pmd, pmd_pgtable(_pmd))' could populate a valid page (although the
> '_pmd' entry is NONE), but it is not the original pagetable page.
CPU0                                            CPU1
====                                            ====

collapse
--> check_pmd_still_valid
                                                vma read lock
                                                pt_reclaim clears the pmd entry
                                                and frees the PTE page (via RCU)
                                                vma read unlock
vma write lock
_pmd = pmdp_collapse_flush(vma, address, pmd) <-- pmd_none(_pmd)
pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); <-- pte is NULL
pmd_populate(mm, pmd, pmd_pgtable(_pmd)); <-- populate a valid page?
vma write unlock
The above is the concurrent scenario you mentioned, right?
What kind of page could this 'valid page' be? If __pte_offset_map()
returns non-NULL, then it is a PTE page, and even if it is not the
original one, it should not cause a panic. Did I miss some key
information? :(
>
>> IIUC, in this case, if we get an invalid pmd entry, we will return
>> directly instead of causing a crash?
>>
>>>
>>> Barry, please correct me if I have misunderstood something.
>>>
>>
>
* Re: [PATCH] mm: Fix the race between collapse and PT_RECLAIM under per-vma lock
2025-08-05 9:30 ` Qi Zheng
@ 2025-08-05 9:50 ` David Hildenbrand
2025-08-05 10:07 ` Baolin Wang
0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2025-08-05 9:50 UTC (permalink / raw)
To: Qi Zheng, Baolin Wang, Barry Song, akpm, linux-mm
Cc: linux-kernel, Barry Song, Lai, Yi, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, Suren Baghdasaryan, Lokesh Gidra,
Tangquan Zheng, Lance Yang, Zi Yan, Liam R . Howlett, Nico Pache,
Ryan Roberts, Dev Jain
On 05.08.25 11:30, Qi Zheng wrote:
>
>
> On 8/5/25 4:56 PM, Baolin Wang wrote:
>>
>>
>> On 2025/8/5 16:17, Qi Zheng wrote:
>>> Hi Baolin,
>>>
>>> On 8/5/25 3:53 PM, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/8/5 14:42, Qi Zheng wrote:
>>>>> Hi Barry,
>>>>>
>>>>> On 8/5/25 11:54 AM, Barry Song wrote:
>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>
>>>>>> The check_pmd_still_valid() call during collapse is currently only
>>>>>> protected by the mmap_lock in write mode, which was sufficient when
>>>>>> pt_reclaim always ran under mmap_lock in read mode. However, since
>>>>>> madvise_dontneed can now execute under a per-VMA lock, this assumption
>>>>>> is no longer valid. As a result, a race condition can occur between
>>>>>> collapse and PT_RECLAIM, potentially leading to a kernel panic.
>>>>>
>>>>> There is indeed a race condition here. And after applying this patch, I
>>>>> can no longer reproduce the problem locally (I was able to reproduce it
>>>>> stably locally last night).
>>>>>
>>>>> But I still can't figure out how this race condition causes the
>>>>> following panic:
>>>>>
>>>>> exit_mmap
>>>>> --> mmap_read_lock()
>>>>> unmap_vmas()
>>>>> --> pte_offset_map_lock
>>>>> --> rcu_read_lock()
>>>>> check if the pmd entry is a PTE page
>>>>> ptl = pte_lockptr(mm, &pmdval) <-- ptl is NULL
>>>>> spin_lock(ptl) <-- PANIC!!
>>>>>
>>>>> If this PTE page is freed by pt_reclaim (via RCU), then the ptl can
>>>>> not be NULL.
>>>>>
>>>>> The collapse holds mmap write lock, so it is impossible to be
>>>>> concurrent
>>>>> with exit_mmap().
>>>>>
>>>>> Confusing. :(
>>>>
>>>> IIUC, the issue is not caused by the concurrency between exit_mmap
>>>> and collapse, but rather by the concurrency between pt_reclaim and
>>>> collapse.
>>>>
>>>> Before this patch, khugepaged might incorrectly restore a PTE
>>>> pagetable that had already been freed.
>>>>
>>>> pt_reclaim has cleared the pmd entry and freed the PTE page table.
>>>> However, due to the race condition, check_pmd_still_valid() still
>>>> passes and continues to attempt the collapse:
>>>>
>>>> _pmd = pmdp_collapse_flush(vma, address, pmd); ---> returns a none
>>>> pmd entry (the original pmd entry has been cleared)
>>>>
>>>> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); ---> returns
>>>> pte == NULL
>>>>
>>>> Then khugepaged will restore the old PTE pagetable with an invalid
>>>> pmd entry:
>>>>
>>>> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>>>>
>>>> So when the process exits and tries to free the mapping of the
>>>> process, traversing the invalid pmd table will lead to a crash.
>>>
>>> CPU0 CPU1
>>> ==== ====
>>>
>>> collapse
>>> --> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>>> mmap_write_unlock
>>> exit_mmap
>>> --> hold mmap lock
>>> __pte_offset_map_lock
>>> --> pte = __pte_offset_map(pmd, addr, &pmdval);
>>> if (unlikely(!pte))
>>> return pte; <-- will return
>>
>> __pte_offset_map() might not return NULL? Because the 'pmd_populate(mm,
>> pmd, pmd_pgtable(_pmd))' could populate a valid page (although the
>> '_pmd' entry is NONE), but it is not the original pagetable page.
>
> CPU0 CPU1
> ==== ====
>
> collapse
> --> check_pmd_still_valid
> vma read lock
> pt_reclaim clear the pmd entry and will
> free the PTE page (via RCU)
> vma read unlock
>
> vma write lock
> _pmd = pmdp_collapse_flush(vma, address, pmd) <-- pmd_none(_pmd)
> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); <-- pte is NULL
> pmd_populate(mm, pmd, pmd_pgtable(_pmd)); <-- populate a valid page?
> vma write unlock
>
> The above is the concurrent scenario you mentioned, right?
>
> What types of this 'valid page' could be? If __pte_offset_map() returns
> non-NULL, then it is a PTE page. Even if it is not the original one, it
> should not cause panic. Did I miss some key information? :(
Wasn't the original issue all about a NULL-pointer de-reference while
*locking*?
Note that in that kernel config [1] we have CONFIG_DEBUG_SPINLOCK=y, so
likely we will have ALLOC_SPLIT_PTLOCKS set.
[1]
https://github.com/laifryiee/syzkaller_logs/blob/main/250803_193026___pte_offset_map_lock/.config
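As a simplified illustration of the lookup that trips here (condensed
from include/linux/mm.h; with ALLOC_SPLIT_PTLOCKS the PTE lock lives
behind a per-ptdesc pointer that is only allocated for real page-table
pages):

	static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
	{
		/*
		 * pmd_page(*pmd) is the pfn-0 page in this report; its ->ptl
		 * was never set up by ptlock_alloc()/ptlock_init(), so the
		 * pointer read back is not a valid spinlock (NULL here).
		 */
		return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
	}

	ptl = pte_lockptr(mm, &pmdval);
	spin_lock(ptl);		/* <-- the NULL-pointer dereference in the oops */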
--
Cheers,
David / dhildenb
* Re: [PATCH] mm: Fix the race between collapse and PT_RECLAIM under per-vma lock
2025-08-05 9:50 ` David Hildenbrand
@ 2025-08-05 10:07 ` Baolin Wang
2025-08-05 10:26 ` Qi Zheng
0 siblings, 1 reply; 12+ messages in thread
From: Baolin Wang @ 2025-08-05 10:07 UTC (permalink / raw)
To: David Hildenbrand, Qi Zheng, Barry Song, akpm, linux-mm
Cc: linux-kernel, Barry Song, Lai, Yi, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, Suren Baghdasaryan, Lokesh Gidra,
Tangquan Zheng, Lance Yang, Zi Yan, Liam R . Howlett, Nico Pache,
Ryan Roberts, Dev Jain
On 2025/8/5 17:50, David Hildenbrand wrote:
> On 05.08.25 11:30, Qi Zheng wrote:
>>
>>
>> On 8/5/25 4:56 PM, Baolin Wang wrote:
>>>
>>>
>>> On 2025/8/5 16:17, Qi Zheng wrote:
>>>> Hi Baolin,
>>>>
>>>> On 8/5/25 3:53 PM, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2025/8/5 14:42, Qi Zheng wrote:
>>>>>> Hi Barry,
>>>>>>
>>>>>> On 8/5/25 11:54 AM, Barry Song wrote:
>>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>>
>>>>>>> The check_pmd_still_valid() call during collapse is currently only
>>>>>>> protected by the mmap_lock in write mode, which was sufficient when
>>>>>>> pt_reclaim always ran under mmap_lock in read mode. However, since
>>>>>>> madvise_dontneed can now execute under a per-VMA lock, this
>>>>>>> assumption
>>>>>>> is no longer valid. As a result, a race condition can occur between
>>>>>>> collapse and PT_RECLAIM, potentially leading to a kernel panic.
>>>>>>
>>>>>> There is indeed a race condition here. And after applying this
>>>>>> patch, I
>>>>>> can no longer reproduce the problem locally (I was able to
>>>>>> reproduce it
>>>>>> stably locally last night).
>>>>>>
>>>>>> But I still can't figure out how this race condition causes the
>>>>>> following panic:
>>>>>>
>>>>>> exit_mmap
>>>>>> --> mmap_read_lock()
>>>>>> unmap_vmas()
>>>>>> --> pte_offset_map_lock
>>>>>> --> rcu_read_lock()
>>>>>> check if the pmd entry is a PTE page
>>>>>> ptl = pte_lockptr(mm, &pmdval) <-- ptl is NULL
>>>>>> spin_lock(ptl) <-- PANIC!!
>>>>>>
>>>>>> If this PTE page is freed by pt_reclaim (via RCU), then the ptl can
>>>>>> not be NULL.
>>>>>>
>>>>>> The collapse holds mmap write lock, so it is impossible to be
>>>>>> concurrent
>>>>>> with exit_mmap().
>>>>>>
>>>>>> Confusing. :(
>>>>>
>>>>> IIUC, the issue is not caused by the concurrency between exit_mmap
>>>>> and collapse, but rather by the concurrency between pt_reclaim and
>>>>> collapse.
>>>>>
>>>>> Before this patch, khugepaged might incorrectly restore a PTE
>>>>> pagetable that had already been freed.
>>>>>
>>>>> pt_reclaim has cleared the pmd entry and freed the PTE page table.
>>>>> However, due to the race condition, check_pmd_still_valid() still
>>>>> passes and continues to attempt the collapse:
>>>>>
>>>>> _pmd = pmdp_collapse_flush(vma, address, pmd); ---> returns a none
>>>>> pmd entry (the original pmd entry has been cleared)
>>>>>
>>>>> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); ---> returns
>>>>> pte == NULL
>>>>>
>>>>> Then khugepaged will restore the old PTE pagetable with an invalid
>>>>> pmd entry:
>>>>>
>>>>> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>>>>>
>>>>> So when the process exits and tries to free its mappings,
>>>>> traversing the invalid pmd table will lead to a crash.
>>>>
>>>> CPU0                                   CPU1
>>>> ====                                   ====
>>>>
>>>> collapse
>>>> --> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>>>> mmap_write_unlock
>>>>                                        exit_mmap
>>>>                                        --> hold mmap lock
>>>>                                            __pte_offset_map_lock
>>>>                                            --> pte = __pte_offset_map(pmd, addr, &pmdval);
>>>>                                                if (unlikely(!pte))
>>>>                                                        return pte; <-- will return
>>>
>>> __pte_offset_map() might not return NULL? Because 'pmd_populate(mm,
>>> pmd, pmd_pgtable(_pmd))' could populate a valid page (even though the
>>> '_pmd' entry is none), but it is not the original page table page.
>>
>> CPU0                                     CPU1
>> ====                                     ====
>>
>> collapse
>> --> check_pmd_still_valid
>>                                          vma read lock
>>                                          pt_reclaim clear the pmd entry and will
>>                                          free the PTE page (via RCU)
>>                                          vma read unlock
>>
>> vma write lock
>> _pmd = pmdp_collapse_flush(vma, address, pmd) <-- pmd_none(_pmd)
>> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); <-- pte is NULL
>> pmd_populate(mm, pmd, pmd_pgtable(_pmd)); <-- populate a valid page?
>> vma write unlock
>>
>> The above is the concurrent scenario you mentioned, right?
Yes.
>>
>> What kind of page could this 'valid page' be? If __pte_offset_map()
>> returns non-NULL, then it is a PTE page. Even if it is not the original
>> one, it should not cause a panic. Did I miss some key information? :(
Sorry for not being clear. Let me try again.
In the race condition described above, the '_pmd' value is none, so when
restoring the pmd entry with 'pmd_populate(mm, pmd, pmd_pgtable(_pmd))',
'pmd_pgtable(_pmd)' can return the struct page corresponding to pfn == 0
(because '_pmd' is none) to populate the pmd entry. Clearly, this
pfn == 0 page is not a page table page, so its corresponding ptl lock is
never initialized.
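To make this concrete, here is a minimal sketch of the failing sequence,
assuming the usual khugepaged context (mm, vma, address, pmd) and an
x86-64-style layout where pgtable_t is struct page *; this is not
verbatim kernel code:

/* Sketch of the racy collapse path (simplified). */
static void collapse_race_sketch(struct mm_struct *mm,
				 struct vm_area_struct *vma,
				 unsigned long address, pmd_t *pmd)
{
	spinlock_t *pte_ptl;
	pmd_t _pmd;
	pte_t *pte;

	/* pt_reclaim already cleared *pmd under the per-VMA lock, so: */
	_pmd = pmdp_collapse_flush(vma, address, pmd);	/* _pmd is none */

	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
	if (!pte) {
		/*
		 * Abort path: with _pmd none, pmd_pgtable(_pmd) decodes
		 * pfn 0, so the reserved pfn-0 page, which was never
		 * initialized as a page table, gets planted as a
		 * seemingly valid pmd entry.
		 */
		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
	}
}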
Additionally, from the boot dmesg, I can see that the BIOS reports an
address range with pfn == 0, indicating that there is a struct page
initialized for pfn == 0 (possibly a reserved page):
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007ffdffff] usable
[ 0.000000] BIOS-e820: [mem 0x000000007ffe0000-0x000000007fffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
Of course, this is my theoretical analysis from the code perspective. If
there are other race conditions at play, I would be very surprised :)
> Wasn't the original issue all about a NULL-pointer de-reference while
> *locking*?
Yes.
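The crash site matches the lock acquisition in __pte_offset_map_lock(),
roughly as follows (paraphrased and simplified from
mm/pgtable-generic.c; the real code also re-checks pmd_same() and may
retry):

pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
			     unsigned long addr, spinlock_t **ptlp)
{
	spinlock_t *ptl;
	pmd_t pmdval;
	pte_t *pte;

	pte = __pte_offset_map(pmd, addr, &pmdval);
	if (unlikely(!pte))
		return pte;
	/*
	 * After the bogus pmd_populate(), pmdval refers to the pfn-0
	 * page, so pte is non-NULL here, but pte_lockptr() reads an
	 * uninitialized split-ptlock pointer from that page.
	 */
	ptl = pte_lockptr(mm, &pmdval);
	spin_lock(ptl);		/* <-- the reported NULL-pointer dereference */
	/* (real code: verify pmd_same(pmdval, *pmd), else unlock and retry) */
	*ptlp = ptl;
	return pte;
}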
> Note that in that kernel config [1] we have CONFIG_DEBUG_SPINLOCK=y, so
> likely we will have ALLOC_SPLIT_PTLOCKS set.
>
> [1] https://github.com/laifryiee/syzkaller_logs/blob/main/250803_193026___pte_offset_map_lock/.config
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] mm: Fix the race between collapse and PT_RECLAIM under per-vma lock
2025-08-05 10:07 ` Baolin Wang
@ 2025-08-05 10:26 ` Qi Zheng
0 siblings, 0 replies; 12+ messages in thread
From: Qi Zheng @ 2025-08-05 10:26 UTC (permalink / raw)
To: Baolin Wang, David Hildenbrand, Barry Song, akpm, linux-mm
Cc: linux-kernel, Barry Song, Lai, Yi, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn, Suren Baghdasaryan, Lokesh Gidra,
Tangquan Zheng, Lance Yang, Zi Yan, Liam R . Howlett, Nico Pache,
Ryan Roberts, Dev Jain
On 8/5/25 6:07 PM, Baolin Wang wrote:
>
>
[...]
>
>>>
>>> What kind of page could this 'valid page' be? If __pte_offset_map()
>>> returns non-NULL, then it is a PTE page. Even if it is not the original
>>> one, it should not cause a panic. Did I miss some key information? :(
>
> Sorry for not being clear. Let me try again.
>
> In the race condition described above, the '_pmd' value is none, so when
> restoring the pmd entry with 'pmd_populate(mm, pmd, pmd_pgtable(_pmd))',
> 'pmd_pgtable(_pmd)' can return the struct page corresponding to pfn == 0
> (because '_pmd' is none) to populate the pmd entry. Clearly, this
> pfn == 0 page is not a page table page, so its corresponding ptl lock is
> never initialized.
>
> Additionally, from the boot dmesg, I can see that the BIOS reports an
> address range with pfn == 0, indicating that there is a struct page
> initialized for pfn == 0 (possibly a reserved page):
>
> [ 0.000000] BIOS-provided physical RAM map:
> [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007ffdffff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000007ffe0000-0x000000007fffffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
>
Now I understand, thank you very much for your patient explanation!
And for this patch:
Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
Thanks!
> Of course, this is my theoretical analysis from the code perspective. If
> there are other race conditions at play, I would be very surprised :)
>
>> Wasn't the original issue all about a NULL-pointer de-reference while
>> *locking*?
>
> Yes.
>
>> Note that in that kernel config [1] we have CONFIG_DEBUG_SPINLOCK=y,
>> so likely we will have ALLOC_SPLIT_PTLOCKS set.
>>
>> [1] https://github.com/laifryiee/syzkaller_logs/blob/main/250803_193026___pte_offset_map_lock/.config
>>
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2025-08-05 10:26 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-05 3:54 [PATCH] mm: Fix the race between collapse and PT_RECLAIM under per-vma lock Barry Song
2025-08-05 5:20 ` Lorenzo Stoakes
2025-08-05 6:41 ` Baolin Wang
2025-08-05 6:42 ` Qi Zheng
2025-08-05 7:53 ` Baolin Wang
2025-08-05 8:17 ` Qi Zheng
2025-08-05 8:56 ` Baolin Wang
2025-08-05 9:30 ` Qi Zheng
2025-08-05 9:50 ` David Hildenbrand
2025-08-05 10:07 ` Baolin Wang
2025-08-05 10:26 ` Qi Zheng
2025-08-05 8:02 ` David Hildenbrand