* [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios
@ 2026-02-24 11:09 Dev Jain
From: Dev Jain @ 2026-02-24 11:09 UTC
To: akpm, david, lorenzo.stoakes
Cc: riel, Liam.Howlett, vbabka, harry.yoo, jannh, baohua, linux-mm,
linux-kernel, Dev Jain, stable
We batch the unmapping of anonymous lazyfree folios via folio_unmap_pte_batch().
If the batch contains a mix of writable and non-writable ptes, the failure path
may restore the entire batch as writable. Fix this by write-protecting the ptes
during pte restoration in the failure path.
I was able to write the reproducer below and crash the kernel.
Explanation of the reproducer (set 64K mTHP to always):
Fault in a 64K large folio. Split the VMA at the mid-point with MADV_DONTFORK.
fork() - the parent now maps the folio with 8 writable ptes and 8 non-writable
ptes. Merge the VMAs with MADV_DOFORK so that folio_unmap_pte_batch() can
treat all 16 ptes as a single batch. Do MADV_FREE on the range to mark
the folio as lazyfree. Write to the memory to dirty the pte; rmap will
eventually dirty the folio. Then trigger reclaim: we hit the pte
restoration path, and the kernel crashes with the following trace:
[ 21.134473] kernel BUG at mm/page_table_check.c:118!
[ 21.134497] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[ 21.135917] Modules linked in:
[ 21.136085] CPU: 1 UID: 0 PID: 1735 Comm: dup-lazyfree Not tainted 7.0.0-rc1-00116-g018018a17770 #1028 PREEMPT
[ 21.136858] Hardware name: linux,dummy-virt (DT)
[ 21.137019] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 21.137308] pc : page_table_check_set+0x28c/0x2a8
[ 21.137607] lr : page_table_check_set+0x134/0x2a8
[ 21.137885] sp : ffff80008a3b3340
[ 21.138124] x29: ffff80008a3b3340 x28: fffffdffc3d14400 x27: ffffd1a55e03d000
[ 21.138623] x26: 0040000000000040 x25: ffffd1a55f7dd000 x24: 0000000000000001
[ 21.139045] x23: 0000000000000001 x22: 0000000000000001 x21: ffffd1a55f217f30
[ 21.139629] x20: 0000000000134521 x19: 0000000000134519 x18: 005c43e000040000
[ 21.140027] x17: 0001400000000000 x16: 0001700000000000 x15: 000000000000ffff
[ 21.140578] x14: 000000000000000c x13: 005c006000000000 x12: 0000000000000020
[ 21.140828] x11: 0000000000000000 x10: 005c000000000000 x9 : ffffd1a55c079ee0
[ 21.141077] x8 : 0000000000000001 x7 : 005c03e000040000 x6 : 000000004000ffff
[ 21.141490] x5 : ffff00017fffce00 x4 : 0000000000000001 x3 : 0000000000000002
[ 21.141741] x2 : 0000000000134510 x1 : 0000000000000000 x0 : ffff0000c08228c0
[ 21.141991] Call trace:
[ 21.142093] page_table_check_set+0x28c/0x2a8 (P)
[ 21.142265] __page_table_check_ptes_set+0x144/0x1e8
[ 21.142441] __set_ptes_anysz.constprop.0+0x160/0x1a8
[ 21.142766] contpte_set_ptes+0xe8/0x140
[ 21.142907] try_to_unmap_one+0x10c4/0x10d0
[ 21.143177] rmap_walk_anon+0x100/0x250
[ 21.143315] try_to_unmap+0xa0/0xc8
[ 21.143441] shrink_folio_list+0x59c/0x18a8
[ 21.143759] shrink_lruvec+0x664/0xbf0
[ 21.144043] shrink_node+0x218/0x878
[ 21.144285] __node_reclaim.constprop.0+0x98/0x338
[ 21.144763] user_proactive_reclaim+0x2a4/0x340
[ 21.145056] reclaim_store+0x3c/0x60
[ 21.145216] dev_attr_store+0x20/0x40
[ 21.145585] sysfs_kf_write+0x84/0xa8
[ 21.145835] kernfs_fop_write_iter+0x130/0x1c8
[ 21.145994] vfs_write+0x2b8/0x368
[ 21.146119] ksys_write+0x70/0x110
[ 21.146240] __arm64_sys_write+0x24/0x38
[ 21.146380] invoke_syscall+0x50/0x120
[ 21.146513] el0_svc_common.constprop.0+0x48/0xf8
[ 21.146679] do_el0_svc+0x28/0x40
[ 21.146798] el0_svc+0x34/0x110
[ 21.146926] el0t_64_sync_handler+0xa0/0xe8
[ 21.147074] el0t_64_sync+0x198/0x1a0
[ 21.147225] Code: f9400441 b4fff241 17ffff94 d4210000 (d4210000)
[ 21.147440] ---[ end trace 0000000000000000 ]---
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/wait.h>
#include <sched.h>
#include <fcntl.h>
void write_to_reclaim() {
	const char *path = "/sys/devices/system/node/node0/reclaim";
	const char *value = "409600000000";
	int fd = open(path, O_WRONLY);
	if (fd == -1) {
		perror("open");
		exit(EXIT_FAILURE);
	}

	if (write(fd, value, sizeof("409600000000") - 1) == -1) {
		perror("write");
		close(fd);
		exit(EXIT_FAILURE);
	}

	printf("Successfully wrote %s to %s\n", value, path);
	close(fd);
}

int main()
{
	char *ptr = mmap((void *)(1UL << 30), 1UL << 16, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if ((unsigned long)ptr != (1UL << 30)) {
		perror("mmap");
		return 1;
	}

	/* a 64K folio gets faulted in */
	memset(ptr, 0, 1UL << 16);

	/* 32K half will not be shared into child */
	if (madvise(ptr, 1UL << 15, MADV_DONTFORK)) {
		perror("madvise madv dontfork");
		return 1;
	}

	pid_t pid = fork();

	if (pid < 0) {
		perror("fork");
		return 1;
	} else if (pid == 0) {
		sleep(15);
	} else {
		/* merge VMAs. now first half of the 16 ptes are writable, the other half not. */
		if (madvise(ptr, 1UL << 15, MADV_DOFORK)) {
			perror("madvise madv fork");
			return 1;
		}
		if (madvise(ptr, (1UL << 16), MADV_FREE)) {
			perror("madvise madv free");
			return 1;
		}

		/* dirty the large folio */
		(*ptr) += 10;

		write_to_reclaim();
		// sleep(10);
		waitpid(pid, NULL, 0);
	}
}
Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
Cc: stable <stable@kernel.org>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
Applies on mm-new (commit 018018a17770).
mm/rmap.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/mm/rmap.c b/mm/rmap.c
index bff8f222004e4..501519844f290 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2235,6 +2235,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
smp_rmb();
if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
+ /*
+ * The pte batch may have a mix of writable and non-writable
+ * ptes. If the first pte of the batch was writable, we may
+ * end up restoring the ptes incorrectly by setting the
+ * entire batch writable. Avoid this by setting the batch
+ * non-writable; this is not optimal, but improbable to
+ * reach by virtue of being a failure path.
+ */
+ pteval = pte_wrprotect(pteval);
+
/*
* redirtied either using the page table or a previously
* obtained GUP reference.
@@ -2243,6 +2253,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
folio_set_swapbacked(folio);
goto walk_abort;
} else if (ref_count != 1 + map_count) {
+ /* See comment above */
+ pteval = pte_wrprotect(pteval);
+
/*
* Additional reference. Could be a GUP reference or any
* speculative reference. GUP users must mark the folio
--
2.34.1
* Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios
From: Lorenzo Stoakes @ 2026-02-24 11:31 UTC
To: Dev Jain
Cc: akpm, david, riel, Liam.Howlett, vbabka, harry.yoo, jannh,
baohua, linux-mm, linux-kernel, stable
Thanks Dev.
Andrew - why was commit 354dffd29575 ("mm: support batched unmap for lazyfree
large folios during reclamation") merged?
It had enormous amounts of review commentary at
https://lore.kernel.org/all/146b4cb1-aa1e-4519-9e03-f98cfb1135d2@redhat.com/ and
no tags, this should be a signal to wait for a respin _at least_, and really if
late in cycle suggests it should wait a cycle.
I've said going forward I'm going to check THP series for tags and if not
present NAK if they hit mm-stable, I guess I'll extend that to rmap also.
It'd be easier for all concerned if we could yank stuff earlier
though. Waiting for the next cycle isn't a bad thing and avoids this kind
of bug.
Dev - I wonder if we shouldn't just revert 354dffd29575. I don't like how
the original patch piles more mess into an already HUGE function and it's
clearly adding risk here.
On Tue, Feb 24, 2026 at 04:39:34PM +0530, Dev Jain wrote:
> We batch unmapping of anonymous lazyfree folios by folio_unmap_pte_batch.
> If the batch has a mix of writable and non-writable bits, we may end up
> setting the entire batch writable. Fix this by write-protecting the ptes
> during pte restoration in the failure path.
>
> I was able to write the below reproducer and crash the kernel.
> Explanation of reproducer (set 64K mTHP to always):
>
> Fault in a 64K large folio. Split the VMA at mid-point with MADV_DONTFORK.
> fork() - parent points to the folio with 8 writable ptes and 8 non-writable
> ptes. Merge the VMAs with MADV_DOFORK so that folio_unmap_pte_batch() can
> determine all the 16 ptes as a batch. Do MADV_FREE on the range to mark
> the folio as lazyfree. Write to the memory to dirty the pte, eventually
> rmap will dirty the folio. Then trigger reclaim, we will hit the pte
> restoration path, and the kernel will crash with the following trace:
>
> [ 21.134473] kernel BUG at mm/page_table_check.c:118!
> [ 21.134497] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> [ 21.135917] Modules linked in:
> [ 21.136085] CPU: 1 UID: 0 PID: 1735 Comm: dup-lazyfree Not tainted 7.0.0-rc1-00116-g018018a17770 #1028 PREEMPT
> [ 21.136858] Hardware name: linux,dummy-virt (DT)
> [ 21.137019] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> [ 21.137308] pc : page_table_check_set+0x28c/0x2a8
> [ 21.137607] lr : page_table_check_set+0x134/0x2a8
> [ 21.137885] sp : ffff80008a3b3340
> [ 21.138124] x29: ffff80008a3b3340 x28: fffffdffc3d14400 x27: ffffd1a55e03d000
> [ 21.138623] x26: 0040000000000040 x25: ffffd1a55f7dd000 x24: 0000000000000001
> [ 21.139045] x23: 0000000000000001 x22: 0000000000000001 x21: ffffd1a55f217f30
> [ 21.139629] x20: 0000000000134521 x19: 0000000000134519 x18: 005c43e000040000
> [ 21.140027] x17: 0001400000000000 x16: 0001700000000000 x15: 000000000000ffff
> [ 21.140578] x14: 000000000000000c x13: 005c006000000000 x12: 0000000000000020
> [ 21.140828] x11: 0000000000000000 x10: 005c000000000000 x9 : ffffd1a55c079ee0
> [ 21.141077] x8 : 0000000000000001 x7 : 005c03e000040000 x6 : 000000004000ffff
> [ 21.141490] x5 : ffff00017fffce00 x4 : 0000000000000001 x3 : 0000000000000002
> [ 21.141741] x2 : 0000000000134510 x1 : 0000000000000000 x0 : ffff0000c08228c0
> [ 21.141991] Call trace:
> [ 21.142093] page_table_check_set+0x28c/0x2a8 (P)
> [ 21.142265] __page_table_check_ptes_set+0x144/0x1e8
> [ 21.142441] __set_ptes_anysz.constprop.0+0x160/0x1a8
> [ 21.142766] contpte_set_ptes+0xe8/0x140
> [ 21.142907] try_to_unmap_one+0x10c4/0x10d0
> [ 21.143177] rmap_walk_anon+0x100/0x250
> [ 21.143315] try_to_unmap+0xa0/0xc8
> [ 21.143441] shrink_folio_list+0x59c/0x18a8
> [ 21.143759] shrink_lruvec+0x664/0xbf0
> [ 21.144043] shrink_node+0x218/0x878
> [ 21.144285] __node_reclaim.constprop.0+0x98/0x338
> [ 21.144763] user_proactive_reclaim+0x2a4/0x340
> [ 21.145056] reclaim_store+0x3c/0x60
> [ 21.145216] dev_attr_store+0x20/0x40
> [ 21.145585] sysfs_kf_write+0x84/0xa8
> [ 21.145835] kernfs_fop_write_iter+0x130/0x1c8
> [ 21.145994] vfs_write+0x2b8/0x368
> [ 21.146119] ksys_write+0x70/0x110
> [ 21.146240] __arm64_sys_write+0x24/0x38
> [ 21.146380] invoke_syscall+0x50/0x120
> [ 21.146513] el0_svc_common.constprop.0+0x48/0xf8
> [ 21.146679] do_el0_svc+0x28/0x40
> [ 21.146798] el0_svc+0x34/0x110
> [ 21.146926] el0t_64_sync_handler+0xa0/0xe8
> [ 21.147074] el0t_64_sync+0x198/0x1a0
> [ 21.147225] Code: f9400441 b4fff241 17ffff94 d4210000 (d4210000)
> [ 21.147440] ---[ end trace 0000000000000000 ]---
>
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <string.h>
> #include <sys/wait.h>
> #include <sched.h>
> #include <fcntl.h>
>
> void write_to_reclaim() {
> 	const char *path = "/sys/devices/system/node/node0/reclaim";
> 	const char *value = "409600000000";
> 	int fd = open(path, O_WRONLY);
> 	if (fd == -1) {
> 		perror("open");
> 		exit(EXIT_FAILURE);
> 	}
>
> 	if (write(fd, value, sizeof("409600000000") - 1) == -1) {
> 		perror("write");
> 		close(fd);
> 		exit(EXIT_FAILURE);
> 	}
>
> 	printf("Successfully wrote %s to %s\n", value, path);
> 	close(fd);
> }
>
> int main()
> {
> 	char *ptr = mmap((void *)(1UL << 30), 1UL << 16, PROT_READ | PROT_WRITE,
> 			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 	if ((unsigned long)ptr != (1UL << 30)) {
> 		perror("mmap");
> 		return 1;
> 	}
>
> 	/* a 64K folio gets faulted in */
> 	memset(ptr, 0, 1UL << 16);
>
> 	/* 32K half will not be shared into child */
> 	if (madvise(ptr, 1UL << 15, MADV_DONTFORK)) {
> 		perror("madvise madv dontfork");
> 		return 1;
> 	}
>
> 	pid_t pid = fork();
>
> 	if (pid < 0) {
> 		perror("fork");
> 		return 1;
> 	} else if (pid == 0) {
> 		sleep(15);
> 	} else {
> 		/* merge VMAs. now first half of the 16 ptes are writable, the other half not. */
> 		if (madvise(ptr, 1UL << 15, MADV_DOFORK)) {
> 			perror("madvise madv fork");
> 			return 1;
> 		}
> 		if (madvise(ptr, (1UL << 16), MADV_FREE)) {
> 			perror("madvise madv free");
> 			return 1;
> 		}
>
> 		/* dirty the large folio */
> 		(*ptr) += 10;
>
> 		write_to_reclaim();
> 		// sleep(10);
> 		waitpid(pid, NULL, 0);
>
> 	}
> }
>
> Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> Applies on mm-new (commit 018018a17770).
Thanks, but please base on mm-unstable, as mm-new is for now considered a
testing base only (yes, we will endure merge conflict pain, but I think it is
worthwhile).
>
> mm/rmap.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index bff8f222004e4..501519844f290 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2235,6 +2235,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> smp_rmb();
>
> if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> + /*
> + * The pte batch may have a mix of writable and non-writable
> + * ptes. If the first pte of the batch was writable, we may
> + * end up restoring the ptes incorrectly by setting the
> + * entire batch writable. Avoid this by setting the batch
> + * non-writable; this is not optimal, but improbable to
> + * reach by virtue of being a failure path.
> + */
> + pteval = pte_wrprotect(pteval);
Is this really a good long-term solution?
This feels like a hack.
> +
> /*
> * redirtied either using the page table or a previously
> * obtained GUP reference.
> @@ -2243,6 +2253,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> folio_set_swapbacked(folio);
> goto walk_abort;
> } else if (ref_count != 1 + map_count) {
> + /* See comment above */
> + pteval = pte_wrprotect(pteval);
> +
Again, feels like a hack.
> /*
> * Additional reference. Could be a GUP reference or any
> * speculative reference. GUP users must mark the folio
> --
> 2.34.1
>
So maybe a revert + a rethink?
David - what do you think?
Thanks, Lorenzo
* Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios
From: Lorenzo Stoakes @ 2026-02-24 11:43 UTC
To: Dev Jain
Cc: akpm, david, riel, Liam.Howlett, vbabka, harry.yoo, jannh,
baohua, linux-mm, linux-kernel, stable
On Tue, Feb 24, 2026 at 11:31:24AM +0000, Lorenzo Stoakes wrote:
> Thanks Dev.
>
> Andrew - why was commit 354dffd29575 ("mm: support batched unmap for lazyfree
> large folios during reclamation") merged?
>
> It had enormous amounts of review commentary at
> https://lore.kernel.org/all/146b4cb1-aa1e-4519-9e03-f98cfb1135d2@redhat.com/ and
> no tags, this should be a signal to wait for a respin _at least_, and really if
> late in cycle suggests it should wait a cycle.
>
> I've said going forward I'm going to check THP series for tags and if not
> present NAK if they hit mm-stable, I guess I'll extend that to rmap also.
Sorry, I misread the original mail while rushing through it: this commit is
old... so this is less pressing than I thought (for some reason I thought it
was merged last cycle...!) but it's a good example of how stuff can go
unnoticed for a while.
In that case maybe a revert is a bit much and we just want the simplest possible
fix for backporting.
But is the proposed 'just assume wrprotect' sensible? David?
Thanks, Lorenzo