[PATCH v3] mm/rmap: fix incorrect pte restoration for lazyfree folios

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3] mm/rmap: fix incorrect pte restoration for lazyfree folios
@ 2026-03-03  6:15 Dev Jain
  2026-03-03  8:50 ` David Hildenbrand (Arm)
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Dev Jain @ 2026-03-03  6:15 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes
  Cc: riel, Liam.Howlett, vbabka, harry.yoo, jannh, baohua, linux-mm,
	linux-kernel, ryan.roberts, anshuman.khandual, Dev Jain, stable

We batch unmap anonymous lazyfree folios by folio_unmap_pte_batch.
If the batch has a mix of writable and non-writable bits, we may end up
setting the entire batch writable. Fix this by respecting writable bit
during batching.
Although on a successful unmap of a lazyfree folio, the soft-dirty bit is
lost, preserve it on pte restoration by respecting the bit during batching,
to make the fix consistent w.r.t both writable bit and soft-dirty bit.

I was able to write the below reproducer and crash the kernel.
Explanation of reproducer (set 64K mTHP to always):

Fault in a 64K large folio. Split the VMA at mid-point with MADV_DONTFORK.
fork() - parent points to the folio with 8 writable ptes and 8 non-writable
ptes. Merge the VMAs with MADV_DOFORK so that folio_unmap_pte_batch() can
determine all the 16 ptes as a batch. Do MADV_FREE on the range to mark
the folio as lazyfree. Write to the memory to dirty the pte, eventually
rmap will dirty the folio. Then trigger reclaim, we will hit the pte
restoration path, and the kernel will crash with the following trace:

[   21.134473] kernel BUG at mm/page_table_check.c:118!
[   21.134497] Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
[   21.135917] Modules linked in:
[   21.136085] CPU: 1 UID: 0 PID: 1735 Comm: dup-lazyfree Not tainted 7.0.0-rc1-00116-g018018a17770 #1028 PREEMPT
[   21.136858] Hardware name: linux,dummy-virt (DT)
[   21.137019] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[   21.137308] pc : page_table_check_set+0x28c/0x2a8
[   21.137607] lr : page_table_check_set+0x134/0x2a8
[   21.137885] sp : ffff80008a3b3340
[   21.138124] x29: ffff80008a3b3340 x28: fffffdffc3d14400 x27: ffffd1a55e03d000
[   21.138623] x26: 0040000000000040 x25: ffffd1a55f7dd000 x24: 0000000000000001
[   21.139045] x23: 0000000000000001 x22: 0000000000000001 x21: ffffd1a55f217f30
[   21.139629] x20: 0000000000134521 x19: 0000000000134519 x18: 005c43e000040000
[   21.140027] x17: 0001400000000000 x16: 0001700000000000 x15: 000000000000ffff
[   21.140578] x14: 000000000000000c x13: 005c006000000000 x12: 0000000000000020
[   21.140828] x11: 0000000000000000 x10: 005c000000000000 x9 : ffffd1a55c079ee0
[   21.141077] x8 : 0000000000000001 x7 : 005c03e000040000 x6 : 000000004000ffff
[   21.141490] x5 : ffff00017fffce00 x4 : 0000000000000001 x3 : 0000000000000002
[   21.141741] x2 : 0000000000134510 x1 : 0000000000000000 x0 : ffff0000c08228c0
[   21.141991] Call trace:
[   21.142093]  page_table_check_set+0x28c/0x2a8 (P)
[   21.142265]  __page_table_check_ptes_set+0x144/0x1e8
[   21.142441]  __set_ptes_anysz.constprop.0+0x160/0x1a8
[   21.142766]  contpte_set_ptes+0xe8/0x140
[   21.142907]  try_to_unmap_one+0x10c4/0x10d0
[   21.143177]  rmap_walk_anon+0x100/0x250
[   21.143315]  try_to_unmap+0xa0/0xc8
[   21.143441]  shrink_folio_list+0x59c/0x18a8
[   21.143759]  shrink_lruvec+0x664/0xbf0
[   21.144043]  shrink_node+0x218/0x878
[   21.144285]  __node_reclaim.constprop.0+0x98/0x338
[   21.144763]  user_proactive_reclaim+0x2a4/0x340
[   21.145056]  reclaim_store+0x3c/0x60
[   21.145216]  dev_attr_store+0x20/0x40
[   21.145585]  sysfs_kf_write+0x84/0xa8
[   21.145835]  kernfs_fop_write_iter+0x130/0x1c8
[   21.145994]  vfs_write+0x2b8/0x368
[   21.146119]  ksys_write+0x70/0x110
[   21.146240]  __arm64_sys_write+0x24/0x38
[   21.146380]  invoke_syscall+0x50/0x120
[   21.146513]  el0_svc_common.constprop.0+0x48/0xf8
[   21.146679]  do_el0_svc+0x28/0x40
[   21.146798]  el0_svc+0x34/0x110
[   21.146926]  el0t_64_sync_handler+0xa0/0xe8
[   21.147074]  el0t_64_sync+0x198/0x1a0
[   21.147225] Code: f9400441 b4fff241 17ffff94 d4210000 (d4210000)
[   21.147440] ---[ end trace 0000000000000000 ]---


#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/wait.h>
#include <sched.h>
#include <fcntl.h>

void write_to_reclaim() {
    const char *path = "/sys/devices/system/node/node0/reclaim";
    const char *value = "409600000000";
    int fd = open(path, O_WRONLY);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    if (write(fd, value, sizeof("409600000000") - 1) == -1) {
        perror("write");
        close(fd);
        exit(EXIT_FAILURE);
    }

    printf("Successfully wrote %s to %s\n", value, path);
    close(fd);
}

int main()
{
	char *ptr = mmap((void *)(1UL << 30), 1UL << 16, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if ((unsigned long)ptr != (1UL << 30)) {
		perror("mmap");
		return 1;
	}
	
	/* a 64K folio gets faulted in */
	memset(ptr, 0, 1UL << 16);

	/* 32K half will not be shared into child */
	if (madvise(ptr, 1UL << 15, MADV_DONTFORK)) {
		perror("madvise madv dontfork");
		return 1;
	}

	pid_t pid = fork();

	if (pid < 0) {
		perror("fork");
		return 1;
	} else if (pid == 0) {
		sleep(15);
	} else {
		/* merge VMAs. now first half of the 16 ptes are writable, the other half not. */
		if (madvise(ptr, 1UL << 15, MADV_DOFORK)) {
			perror("madvise madv fork");
			return 1;
		}
		if (madvise(ptr, (1UL << 16), MADV_FREE)) {
			perror("madvise madv free");
			return 1;
		}

		/* dirty the large folio */
		(*ptr) += 10;

		write_to_reclaim();
		// sleep(10);
		waitpid(pid, NULL, 0);

	}
}

Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
Cc: stable <stable@kernel.org>
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
Patch applies on mm-unstable (9af4957ef127).

v2->v3:
 - Don't special case for anon folios

v1->v2:
 - Just respect the writable bit instead of hacking in a pte_wrprotect() in
   failure path
 - Also handle soft-dirty bit

 mm/rmap.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index bff8f222004e4..5a3e408e3f179 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1955,7 +1955,14 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 	if (userfaultfd_wp(vma))
 		return 1;
 
-	return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
+	/*
+	 * If unmap fails, we need to restore the ptes. To avoid accidentally
+	 * upgrading write permissions for ptes that were not originally
+	 * writable, and to avoid losing the soft-dirty bit, use the
+	 * appropriate FPB flags.
+	 */
+	return folio_pte_batch_flags(folio, vma, pvmw->pte, &pte, max_nr,
+				     FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
 }
 
 /*
-- 
2.34.1



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3] mm/rmap: fix incorrect pte restoration for lazyfree folios
  2026-03-03  6:15 [PATCH v3] mm/rmap: fix incorrect pte restoration for lazyfree folios Dev Jain
@ 2026-03-03  8:50 ` David Hildenbrand (Arm)
  2026-03-03  9:54 ` Lorenzo Stoakes
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-03  8:50 UTC (permalink / raw)
  To: Dev Jain, akpm, lorenzo.stoakes
  Cc: riel, Liam.Howlett, vbabka, harry.yoo, jannh, baohua, linux-mm,
	linux-kernel, ryan.roberts, anshuman.khandual, stable

On 3/3/26 07:15, Dev Jain wrote:
> We batch unmap anonymous lazyfree folios by folio_unmap_pte_batch.
> If the batch has a mix of writable and non-writable bits, we may end up
> setting the entire batch writable. Fix this by respecting writable bit
> during batching.
> Although on a successful unmap of a lazyfree folio, the soft-dirty bit is
> lost, preserve it on pte restoration by respecting the bit during batching,
> to make the fix consistent w.r.t both writable bit and soft-dirty bit.
> 
> I was able to write the below reproducer and crash the kernel.
> Explanation of reproducer (set 64K mTHP to always):
> 
> Fault in a 64K large folio. Split the VMA at mid-point with MADV_DONTFORK.
> fork() - parent points to the folio with 8 writable ptes and 8 non-writable
> ptes. Merge the VMAs with MADV_DOFORK so that folio_unmap_pte_batch() can
> determine all the 16 ptes as a batch. Do MADV_FREE on the range to mark
> the folio as lazyfree. Write to the memory to dirty the pte, eventually
> rmap will dirty the folio. Then trigger reclaim, we will hit the pte
> restoration path, and the kernel will crash with the following trace:
> 
> [   21.134473] kernel BUG at mm/page_table_check.c:118!
> [   21.134497] Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
> [   21.135917] Modules linked in:
> [   21.136085] CPU: 1 UID: 0 PID: 1735 Comm: dup-lazyfree Not tainted 7.0.0-rc1-00116-g018018a17770 #1028 PREEMPT
> [   21.136858] Hardware name: linux,dummy-virt (DT)
> [   21.137019] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> [   21.137308] pc : page_table_check_set+0x28c/0x2a8
> [   21.137607] lr : page_table_check_set+0x134/0x2a8
> [   21.137885] sp : ffff80008a3b3340
> [   21.138124] x29: ffff80008a3b3340 x28: fffffdffc3d14400 x27: ffffd1a55e03d000
> [   21.138623] x26: 0040000000000040 x25: ffffd1a55f7dd000 x24: 0000000000000001
> [   21.139045] x23: 0000000000000001 x22: 0000000000000001 x21: ffffd1a55f217f30
> [   21.139629] x20: 0000000000134521 x19: 0000000000134519 x18: 005c43e000040000
> [   21.140027] x17: 0001400000000000 x16: 0001700000000000 x15: 000000000000ffff
> [   21.140578] x14: 000000000000000c x13: 005c006000000000 x12: 0000000000000020
> [   21.140828] x11: 0000000000000000 x10: 005c000000000000 x9 : ffffd1a55c079ee0
> [   21.141077] x8 : 0000000000000001 x7 : 005c03e000040000 x6 : 000000004000ffff
> [   21.141490] x5 : ffff00017fffce00 x4 : 0000000000000001 x3 : 0000000000000002
> [   21.141741] x2 : 0000000000134510 x1 : 0000000000000000 x0 : ffff0000c08228c0
> [   21.141991] Call trace:
> [   21.142093]  page_table_check_set+0x28c/0x2a8 (P)
> [   21.142265]  __page_table_check_ptes_set+0x144/0x1e8
> [   21.142441]  __set_ptes_anysz.constprop.0+0x160/0x1a8
> [   21.142766]  contpte_set_ptes+0xe8/0x140
> [   21.142907]  try_to_unmap_one+0x10c4/0x10d0
> [   21.143177]  rmap_walk_anon+0x100/0x250
> [   21.143315]  try_to_unmap+0xa0/0xc8
> [   21.143441]  shrink_folio_list+0x59c/0x18a8
> [   21.143759]  shrink_lruvec+0x664/0xbf0
> [   21.144043]  shrink_node+0x218/0x878
> [   21.144285]  __node_reclaim.constprop.0+0x98/0x338
> [   21.144763]  user_proactive_reclaim+0x2a4/0x340
> [   21.145056]  reclaim_store+0x3c/0x60
> [   21.145216]  dev_attr_store+0x20/0x40
> [   21.145585]  sysfs_kf_write+0x84/0xa8
> [   21.145835]  kernfs_fop_write_iter+0x130/0x1c8
> [   21.145994]  vfs_write+0x2b8/0x368
> [   21.146119]  ksys_write+0x70/0x110
> [   21.146240]  __arm64_sys_write+0x24/0x38
> [   21.146380]  invoke_syscall+0x50/0x120
> [   21.146513]  el0_svc_common.constprop.0+0x48/0xf8
> [   21.146679]  do_el0_svc+0x28/0x40
> [   21.146798]  el0_svc+0x34/0x110
> [   21.146926]  el0t_64_sync_handler+0xa0/0xe8
> [   21.147074]  el0t_64_sync+0x198/0x1a0
> [   21.147225] Code: f9400441 b4fff241 17ffff94 d4210000 (d4210000)
> [   21.147440] ---[ end trace 0000000000000000 ]---
> 
> 
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <string.h>
> #include <sys/wait.h>
> #include <sched.h>
> #include <fcntl.h>
> 
> void write_to_reclaim() {
>     const char *path = "/sys/devices/system/node/node0/reclaim";
>     const char *value = "409600000000";
>     int fd = open(path, O_WRONLY);
>     if (fd == -1) {
>         perror("open");
>         exit(EXIT_FAILURE);
>     }
> 
>     if (write(fd, value, sizeof("409600000000") - 1) == -1) {
>         perror("write");
>         close(fd);
>         exit(EXIT_FAILURE);
>     }
> 
>     printf("Successfully wrote %s to %s\n", value, path);
>     close(fd);
> }
> 
> int main()
> {
> 	char *ptr = mmap((void *)(1UL << 30), 1UL << 16, PROT_READ | PROT_WRITE,
> 			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 	if ((unsigned long)ptr != (1UL << 30)) {
> 		perror("mmap");
> 		return 1;
> 	}
> 	
> 	/* a 64K folio gets faulted in */
> 	memset(ptr, 0, 1UL << 16);
> 
> 	/* 32K half will not be shared into child */
> 	if (madvise(ptr, 1UL << 15, MADV_DONTFORK)) {
> 		perror("madvise madv dontfork");
> 		return 1;
> 	}
> 
> 	pid_t pid = fork();
> 
> 	if (pid < 0) {
> 		perror("fork");
> 		return 1;
> 	} else if (pid == 0) {
> 		sleep(15);
> 	} else {
> 		/* merge VMAs. now first half of the 16 ptes are writable, the other half not. */
> 		if (madvise(ptr, 1UL << 15, MADV_DOFORK)) {
> 			perror("madvise madv fork");
> 			return 1;
> 		}
> 		if (madvise(ptr, (1UL << 16), MADV_FREE)) {
> 			perror("madvise madv free");
> 			return 1;
> 		}
> 
> 		/* dirty the large folio */
> 		(*ptr) += 10;
> 
> 		write_to_reclaim();
> 		// sleep(10);
> 		waitpid(pid, NULL, 0);
> 
> 	}
> }
> 
> Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Dev Jain <dev.jain@arm.com>


Thanks Dev!

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3] mm/rmap: fix incorrect pte restoration for lazyfree folios
  2026-03-03  6:15 [PATCH v3] mm/rmap: fix incorrect pte restoration for lazyfree folios Dev Jain
  2026-03-03  8:50 ` David Hildenbrand (Arm)
@ 2026-03-03  9:54 ` Lorenzo Stoakes
  2026-03-03 10:22   ` Dev Jain
  2026-03-03  9:57 ` Barry Song
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 8+ messages in thread
From: Lorenzo Stoakes @ 2026-03-03  9:54 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, riel, Liam.Howlett, vbabka, harry.yoo, jannh,
	baohua, linux-mm, linux-kernel, ryan.roberts, anshuman.khandual,
	stable

On Tue, Mar 03, 2026 at 11:45:28AM +0530, Dev Jain wrote:
> We batch unmap anonymous lazyfree folios by folio_unmap_pte_batch.
> If the batch has a mix of writable and non-writable bits, we may end up
> setting the entire batch writable. Fix this by respecting writable bit
> during batching.
> Although on a successful unmap of a lazyfree folio, the soft-dirty bit is
> lost, preserve it on pte restoration by respecting the bit during batching,
> to make the fix consistent w.r.t both writable bit and soft-dirty bit.
>
> I was able to write the below reproducer and crash the kernel.
> Explanation of reproducer (set 64K mTHP to always):
>
> Fault in a 64K large folio. Split the VMA at mid-point with MADV_DONTFORK.
> fork() - parent points to the folio with 8 writable ptes and 8 non-writable
> ptes. Merge the VMAs with MADV_DOFORK so that folio_unmap_pte_batch() can
> determine all the 16 ptes as a batch. Do MADV_FREE on the range to mark
> the folio as lazyfree. Write to the memory to dirty the pte, eventually
> rmap will dirty the folio. Then trigger reclaim, we will hit the pte
> restoration path, and the kernel will crash with the following trace:
>
> [   21.134473] kernel BUG at mm/page_table_check.c:118!

Presumably:

			BUG_ON(atomic_inc_return(&ptc->anon_map_count) > 1 && rw);

It'd be useful to be explicit about this in the commit msg.

It's also probably worth saying explicitly that this is about a
non-AnonExclusive mapping being mapping read/write as a result of this.

I'm not sure the stack trace is really that useful beyond that, as by the time
somebody comes to read this later it's probably going to be fairly inaccurate :)

Maybe worth just spelling out e.g. try_to_unmap_one() -> ... rest of code path ...

I mean the stack already misses stuff due to inlining so that'd be more useful,
otherwise I'd drop it.


> [   21.134497] Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
> [   21.135917] Modules linked in:
> [   21.136085] CPU: 1 UID: 0 PID: 1735 Comm: dup-lazyfree Not tainted 7.0.0-rc1-00116-g018018a17770 #1028 PREEMPT
> [   21.136858] Hardware name: linux,dummy-virt (DT)
> [   21.137019] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> [   21.137308] pc : page_table_check_set+0x28c/0x2a8
> [   21.137607] lr : page_table_check_set+0x134/0x2a8
> [   21.137885] sp : ffff80008a3b3340
> [   21.138124] x29: ffff80008a3b3340 x28: fffffdffc3d14400 x27: ffffd1a55e03d000
> [   21.138623] x26: 0040000000000040 x25: ffffd1a55f7dd000 x24: 0000000000000001
> [   21.139045] x23: 0000000000000001 x22: 0000000000000001 x21: ffffd1a55f217f30
> [   21.139629] x20: 0000000000134521 x19: 0000000000134519 x18: 005c43e000040000
> [   21.140027] x17: 0001400000000000 x16: 0001700000000000 x15: 000000000000ffff
> [   21.140578] x14: 000000000000000c x13: 005c006000000000 x12: 0000000000000020
> [   21.140828] x11: 0000000000000000 x10: 005c000000000000 x9 : ffffd1a55c079ee0
> [   21.141077] x8 : 0000000000000001 x7 : 005c03e000040000 x6 : 000000004000ffff
> [   21.141490] x5 : ffff00017fffce00 x4 : 0000000000000001 x3 : 0000000000000002
> [   21.141741] x2 : 0000000000134510 x1 : 0000000000000000 x0 : ffff0000c08228c0
> [   21.141991] Call trace:
> [   21.142093]  page_table_check_set+0x28c/0x2a8 (P)
> [   21.142265]  __page_table_check_ptes_set+0x144/0x1e8
> [   21.142441]  __set_ptes_anysz.constprop.0+0x160/0x1a8
> [   21.142766]  contpte_set_ptes+0xe8/0x140
> [   21.142907]  try_to_unmap_one+0x10c4/0x10d0
> [   21.143177]  rmap_walk_anon+0x100/0x250
> [   21.143315]  try_to_unmap+0xa0/0xc8
> [   21.143441]  shrink_folio_list+0x59c/0x18a8
> [   21.143759]  shrink_lruvec+0x664/0xbf0
> [   21.144043]  shrink_node+0x218/0x878
> [   21.144285]  __node_reclaim.constprop.0+0x98/0x338
> [   21.144763]  user_proactive_reclaim+0x2a4/0x340
> [   21.145056]  reclaim_store+0x3c/0x60
> [   21.145216]  dev_attr_store+0x20/0x40
> [   21.145585]  sysfs_kf_write+0x84/0xa8
> [   21.145835]  kernfs_fop_write_iter+0x130/0x1c8
> [   21.145994]  vfs_write+0x2b8/0x368
> [   21.146119]  ksys_write+0x70/0x110
> [   21.146240]  __arm64_sys_write+0x24/0x38
> [   21.146380]  invoke_syscall+0x50/0x120
> [   21.146513]  el0_svc_common.constprop.0+0x48/0xf8
> [   21.146679]  do_el0_svc+0x28/0x40
> [   21.146798]  el0_svc+0x34/0x110
> [   21.146926]  el0t_64_sync_handler+0xa0/0xe8
> [   21.147074]  el0t_64_sync+0x198/0x1a0
> [   21.147225] Code: f9400441 b4fff241 17ffff94 d4210000 (d4210000)
> [   21.147440] ---[ end trace 0000000000000000 ]---
>
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <string.h>
> #include <sys/wait.h>
> #include <sched.h>
> #include <fcntl.h>
>
> void write_to_reclaim() {
>     const char *path = "/sys/devices/system/node/node0/reclaim";
>     const char *value = "409600000000";
>     int fd = open(path, O_WRONLY);
>     if (fd == -1) {
>         perror("open");
>         exit(EXIT_FAILURE);
>     }
>
>     if (write(fd, value, sizeof("409600000000") - 1) == -1) {
>         perror("write");
>         close(fd);
>         exit(EXIT_FAILURE);
>     }
>
>     printf("Successfully wrote %s to %s\n", value, path);
>     close(fd);
> }
>
> int main()
> {
> 	char *ptr = mmap((void *)(1UL << 30), 1UL << 16, PROT_READ | PROT_WRITE,
> 			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 	if ((unsigned long)ptr != (1UL << 30)) {
> 		perror("mmap");
> 		return 1;
> 	}
>
> 	/* a 64K folio gets faulted in */
> 	memset(ptr, 0, 1UL << 16);
>
> 	/* 32K half will not be shared into child */
> 	if (madvise(ptr, 1UL << 15, MADV_DONTFORK)) {
> 		perror("madvise madv dontfork");
> 		return 1;
> 	}
>
> 	pid_t pid = fork();
>
> 	if (pid < 0) {
> 		perror("fork");
> 		return 1;
> 	} else if (pid == 0) {
> 		sleep(15);
> 	} else {
> 		/* merge VMAs. now first half of the 16 ptes are writable, the other half not. */
> 		if (madvise(ptr, 1UL << 15, MADV_DOFORK)) {
> 			perror("madvise madv fork");
> 			return 1;
> 		}
> 		if (madvise(ptr, (1UL << 16), MADV_FREE)) {
> 			perror("madvise madv free");
> 			return 1;
> 		}
>
> 		/* dirty the large folio */
> 		(*ptr) += 10;
>
> 		write_to_reclaim();
> 		// sleep(10);
> 		waitpid(pid, NULL, 0);
>
> 	}
> }

Thanks, having the repro code here is really useful!

>
> Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Dev Jain <dev.jain@arm.com>

This looks correct to me, so other than nits around commit msg above:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
> Patch applies on mm-unstable (9af4957ef127).
>
> v2->v3:
>  - Don't special case for anon folios
>
> v1->v2:
>  - Just respect the writable bit instead of hacking in a pte_wrprotect() in
>    failure path
>  - Also handle soft-dirty bit
>
>  mm/rmap.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index bff8f222004e4..5a3e408e3f179 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1955,7 +1955,14 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>  	if (userfaultfd_wp(vma))
>  		return 1;
>
> -	return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
> +	/*
> +	 * If unmap fails, we need to restore the ptes. To avoid accidentally
> +	 * upgrading write permissions for ptes that were not originally
> +	 * writable, and to avoid losing the soft-dirty bit, use the
> +	 * appropriate FPB flags.
> +	 */
> +	return folio_pte_batch_flags(folio, vma, pvmw->pte, &pte, max_nr,
> +				     FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
>  }
>
>  /*
> --
> 2.34.1
>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3] mm/rmap: fix incorrect pte restoration for lazyfree folios
  2026-03-03  6:15 [PATCH v3] mm/rmap: fix incorrect pte restoration for lazyfree folios Dev Jain
  2026-03-03  8:50 ` David Hildenbrand (Arm)
  2026-03-03  9:54 ` Lorenzo Stoakes
@ 2026-03-03  9:57 ` Barry Song
  2026-03-03 10:32 ` Dev Jain
  2026-03-03 12:17 ` Wei Yang
  4 siblings, 0 replies; 8+ messages in thread
From: Barry Song @ 2026-03-03  9:57 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, lorenzo.stoakes, riel, Liam.Howlett, vbabka,
	harry.yoo, jannh, linux-mm, linux-kernel, ryan.roberts,
	anshuman.khandual, stable

On Tue, Mar 3, 2026 at 2:15 PM Dev Jain <dev.jain@arm.com> wrote:
>
> We batch unmap anonymous lazyfree folios by folio_unmap_pte_batch.
> If the batch has a mix of writable and non-writable bits, we may end up
> setting the entire batch writable. Fix this by respecting writable bit
> during batching.
> Although on a successful unmap of a lazyfree folio, the soft-dirty bit is
> lost, preserve it on pte restoration by respecting the bit during batching,
> to make the fix consistent w.r.t both writable bit and soft-dirty bit.
>
> I was able to write the below reproducer and crash the kernel.
> Explanation of reproducer (set 64K mTHP to always):
>
> Fault in a 64K large folio. Split the VMA at mid-point with MADV_DONTFORK.
> fork() - parent points to the folio with 8 writable ptes and 8 non-writable
> ptes. Merge the VMAs with MADV_DOFORK so that folio_unmap_pte_batch() can
> determine all the 16 ptes as a batch. Do MADV_FREE on the range to mark
> the folio as lazyfree. Write to the memory to dirty the pte, eventually
> rmap will dirty the folio. Then trigger reclaim, we will hit the pte
> restoration path, and the kernel will crash with the following trace:
>
> [   21.134473] kernel BUG at mm/page_table_check.c:118!
> [   21.134497] Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
> [   21.135917] Modules linked in:
> [   21.136085] CPU: 1 UID: 0 PID: 1735 Comm: dup-lazyfree Not tainted 7.0.0-rc1-00116-g018018a17770 #1028 PREEMPT
> [   21.136858] Hardware name: linux,dummy-virt (DT)
> [   21.137019] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> [   21.137308] pc : page_table_check_set+0x28c/0x2a8
> [   21.137607] lr : page_table_check_set+0x134/0x2a8
> [   21.137885] sp : ffff80008a3b3340
> [   21.138124] x29: ffff80008a3b3340 x28: fffffdffc3d14400 x27: ffffd1a55e03d000
> [   21.138623] x26: 0040000000000040 x25: ffffd1a55f7dd000 x24: 0000000000000001
> [   21.139045] x23: 0000000000000001 x22: 0000000000000001 x21: ffffd1a55f217f30
> [   21.139629] x20: 0000000000134521 x19: 0000000000134519 x18: 005c43e000040000
> [   21.140027] x17: 0001400000000000 x16: 0001700000000000 x15: 000000000000ffff
> [   21.140578] x14: 000000000000000c x13: 005c006000000000 x12: 0000000000000020
> [   21.140828] x11: 0000000000000000 x10: 005c000000000000 x9 : ffffd1a55c079ee0
> [   21.141077] x8 : 0000000000000001 x7 : 005c03e000040000 x6 : 000000004000ffff
> [   21.141490] x5 : ffff00017fffce00 x4 : 0000000000000001 x3 : 0000000000000002
> [   21.141741] x2 : 0000000000134510 x1 : 0000000000000000 x0 : ffff0000c08228c0
> [   21.141991] Call trace:
> [   21.142093]  page_table_check_set+0x28c/0x2a8 (P)
> [   21.142265]  __page_table_check_ptes_set+0x144/0x1e8
> [   21.142441]  __set_ptes_anysz.constprop.0+0x160/0x1a8
> [   21.142766]  contpte_set_ptes+0xe8/0x140
> [   21.142907]  try_to_unmap_one+0x10c4/0x10d0
> [   21.143177]  rmap_walk_anon+0x100/0x250
> [   21.143315]  try_to_unmap+0xa0/0xc8
> [   21.143441]  shrink_folio_list+0x59c/0x18a8
> [   21.143759]  shrink_lruvec+0x664/0xbf0
> [   21.144043]  shrink_node+0x218/0x878
> [   21.144285]  __node_reclaim.constprop.0+0x98/0x338
> [   21.144763]  user_proactive_reclaim+0x2a4/0x340
> [   21.145056]  reclaim_store+0x3c/0x60
> [   21.145216]  dev_attr_store+0x20/0x40
> [   21.145585]  sysfs_kf_write+0x84/0xa8
> [   21.145835]  kernfs_fop_write_iter+0x130/0x1c8
> [   21.145994]  vfs_write+0x2b8/0x368
> [   21.146119]  ksys_write+0x70/0x110
> [   21.146240]  __arm64_sys_write+0x24/0x38
> [   21.146380]  invoke_syscall+0x50/0x120
> [   21.146513]  el0_svc_common.constprop.0+0x48/0xf8
> [   21.146679]  do_el0_svc+0x28/0x40
> [   21.146798]  el0_svc+0x34/0x110
> [   21.146926]  el0t_64_sync_handler+0xa0/0xe8
> [   21.147074]  el0t_64_sync+0x198/0x1a0
> [   21.147225] Code: f9400441 b4fff241 17ffff94 d4210000 (d4210000)
> [   21.147440] ---[ end trace 0000000000000000 ]---
>
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <string.h>
> #include <sys/wait.h>
> #include <sched.h>
> #include <fcntl.h>
>
> void write_to_reclaim() {
>     const char *path = "/sys/devices/system/node/node0/reclaim";
>     const char *value = "409600000000";
>     int fd = open(path, O_WRONLY);
>     if (fd == -1) {
>         perror("open");
>         exit(EXIT_FAILURE);
>     }
>
>     if (write(fd, value, sizeof("409600000000") - 1) == -1) {
>         perror("write");
>         close(fd);
>         exit(EXIT_FAILURE);
>     }
>
>     printf("Successfully wrote %s to %s\n", value, path);
>     close(fd);
> }
>
> int main()
> {
>         char *ptr = mmap((void *)(1UL << 30), 1UL << 16, PROT_READ | PROT_WRITE,
>                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         if ((unsigned long)ptr != (1UL << 30)) {
>                 perror("mmap");
>                 return 1;
>         }
>
>         /* a 64K folio gets faulted in */
>         memset(ptr, 0, 1UL << 16);
>
>         /* 32K half will not be shared into child */
>         if (madvise(ptr, 1UL << 15, MADV_DONTFORK)) {
>                 perror("madvise madv dontfork");
>                 return 1;
>         }
>
>         pid_t pid = fork();
>
>         if (pid < 0) {
>                 perror("fork");
>                 return 1;
>         } else if (pid == 0) {
>                 sleep(15);
>         } else {
>                 /* merge VMAs. now first half of the 16 ptes are writable, the other half not. */
>                 if (madvise(ptr, 1UL << 15, MADV_DOFORK)) {
>                         perror("madvise madv fork");
>                         return 1;
>                 }
>                 if (madvise(ptr, (1UL << 16), MADV_FREE)) {
>                         perror("madvise madv free");
>                         return 1;
>                 }
>
>                 /* dirty the large folio */
>                 (*ptr) += 10;
>
>                 write_to_reclaim();
>                 // sleep(10);
>                 waitpid(pid, NULL, 0);
>
>         }
> }
>
> Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---

LGTM, thanks!

Reviewed-by: Barry Song <baohua@kernel.org>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3] mm/rmap: fix incorrect pte restoration for lazyfree folios
  2026-03-03  9:54 ` Lorenzo Stoakes
@ 2026-03-03 10:22   ` Dev Jain
  0 siblings, 0 replies; 8+ messages in thread
From: Dev Jain @ 2026-03-03 10:22 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, riel, Liam.Howlett, vbabka, harry.yoo, jannh,
	baohua, linux-mm, linux-kernel, ryan.roberts, anshuman.khandual,
	stable



On 03/03/26 3:24 pm, Lorenzo Stoakes wrote:
> On Tue, Mar 03, 2026 at 11:45:28AM +0530, Dev Jain wrote:
>> We batch unmap anonymous lazyfree folios by folio_unmap_pte_batch.
>> If the batch has a mix of writable and non-writable bits, we may end up
>> setting the entire batch writable. Fix this by respecting writable bit
>> during batching.
>> Although on a successful unmap of a lazyfree folio, the soft-dirty bit is
>> lost, preserve it on pte restoration by respecting the bit during batching,
>> to make the fix consistent w.r.t both writable bit and soft-dirty bit.
>>
>> I was able to write the below reproducer and crash the kernel.
>> Explanation of reproducer (set 64K mTHP to always):
>>
>> Fault in a 64K large folio. Split the VMA at mid-point with MADV_DONTFORK.
>> fork() - parent points to the folio with 8 writable ptes and 8 non-writable
>> ptes. Merge the VMAs with MADV_DOFORK so that folio_unmap_pte_batch() can
>> determine all the 16 ptes as a batch. Do MADV_FREE on the range to mark
>> the folio as lazyfree. Write to the memory to dirty the pte, eventually
>> rmap will dirty the folio. Then trigger reclaim, we will hit the pte
>> restoration path, and the kernel will crash with the following trace:
>>
>> [   21.134473] kernel BUG at mm/page_table_check.c:118!
> 
> Presumably:
> 
> 			BUG_ON(atomic_inc_return(&ptc->anon_map_count) > 1 && rw);
> 
> It'd be useful to be explicit about this in the commit msg.
> 
> It's also probably worth saying explicitly that this is about a
> non-AnonExclusive mapping being mapping read/write as a result of this.
> 
> I'm not sure the stack trace is really that useful beyond that, as by the time
> somebody comes to read this later it's probably going to be fairly inaccurate :)
> 
> Maybe worth just spelling out e.g. try_to_unmap_one() -> ... rest of code path ...
> 
> I mean the stack already misses stuff due to inlining so that'd be more useful,
> otherwise I'd drop it.

Good point, thanks! Lemme reply to the patch.

> 
> 
>> [   21.134497] Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
>> [   21.135917] Modules linked in:
>> [   21.136085] CPU: 1 UID: 0 PID: 1735 Comm: dup-lazyfree Not tainted 7.0.0-rc1-00116-g018018a17770 #1028 PREEMPT
>> [   21.136858] Hardware name: linux,dummy-virt (DT)
>> [   21.137019] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
>> [   21.137308] pc : page_table_check_set+0x28c/0x2a8
>> [   21.137607] lr : page_table_check_set+0x134/0x2a8
>> [   21.137885] sp : ffff80008a3b3340
>> [   21.138124] x29: ffff80008a3b3340 x28: fffffdffc3d14400 x27: ffffd1a55e03d000
>> [   21.138623] x26: 0040000000000040 x25: ffffd1a55f7dd000 x24: 0000000000000001
>> [   21.139045] x23: 0000000000000001 x22: 0000000000000001 x21: ffffd1a55f217f30
>> [   21.139629] x20: 0000000000134521 x19: 0000000000134519 x18: 005c43e000040000
>> [   21.140027] x17: 0001400000000000 x16: 0001700000000000 x15: 000000000000ffff
>> [   21.140578] x14: 000000000000000c x13: 005c006000000000 x12: 0000000000000020
>> [   21.140828] x11: 0000000000000000 x10: 005c000000000000 x9 : ffffd1a55c079ee0
>> [   21.141077] x8 : 0000000000000001 x7 : 005c03e000040000 x6 : 000000004000ffff
>> [   21.141490] x5 : ffff00017fffce00 x4 : 0000000000000001 x3 : 0000000000000002
>> [   21.141741] x2 : 0000000000134510 x1 : 0000000000000000 x0 : ffff0000c08228c0
>> [   21.141991] Call trace:
>> [   21.142093]  page_table_check_set+0x28c/0x2a8 (P)
>> [   21.142265]  __page_table_check_ptes_set+0x144/0x1e8
>> [   21.142441]  __set_ptes_anysz.constprop.0+0x160/0x1a8
>> [   21.142766]  contpte_set_ptes+0xe8/0x140
>> [   21.142907]  try_to_unmap_one+0x10c4/0x10d0
>> [   21.143177]  rmap_walk_anon+0x100/0x250
>> [   21.143315]  try_to_unmap+0xa0/0xc8
>> [   21.143441]  shrink_folio_list+0x59c/0x18a8
>> [   21.143759]  shrink_lruvec+0x664/0xbf0
>> [   21.144043]  shrink_node+0x218/0x878
>> [   21.144285]  __node_reclaim.constprop.0+0x98/0x338
>> [   21.144763]  user_proactive_reclaim+0x2a4/0x340
>> [   21.145056]  reclaim_store+0x3c/0x60
>> [   21.145216]  dev_attr_store+0x20/0x40
>> [   21.145585]  sysfs_kf_write+0x84/0xa8
>> [   21.145835]  kernfs_fop_write_iter+0x130/0x1c8
>> [   21.145994]  vfs_write+0x2b8/0x368
>> [   21.146119]  ksys_write+0x70/0x110
>> [   21.146240]  __arm64_sys_write+0x24/0x38
>> [   21.146380]  invoke_syscall+0x50/0x120
>> [   21.146513]  el0_svc_common.constprop.0+0x48/0xf8
>> [   21.146679]  do_el0_svc+0x28/0x40
>> [   21.146798]  el0_svc+0x34/0x110
>> [   21.146926]  el0t_64_sync_handler+0xa0/0xe8
>> [   21.147074]  el0t_64_sync+0x198/0x1a0
>> [   21.147225] Code: f9400441 b4fff241 17ffff94 d4210000 (d4210000)
>> [   21.147440] ---[ end trace 0000000000000000 ]---
>>
>>
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <unistd.h>
>> #include <stdlib.h>
>> #include <sys/mman.h>
>> #include <string.h>
>> #include <sys/wait.h>
>> #include <sched.h>
>> #include <fcntl.h>
>>
>> void write_to_reclaim() {
>>     const char *path = "/sys/devices/system/node/node0/reclaim";
>>     const char *value = "409600000000";
>>     int fd = open(path, O_WRONLY);
>>     if (fd == -1) {
>>         perror("open");
>>         exit(EXIT_FAILURE);
>>     }
>>
>>     if (write(fd, value, sizeof("409600000000") - 1) == -1) {
>>         perror("write");
>>         close(fd);
>>         exit(EXIT_FAILURE);
>>     }
>>
>>     printf("Successfully wrote %s to %s\n", value, path);
>>     close(fd);
>> }
>>
>> int main()
>> {
>> 	char *ptr = mmap((void *)(1UL << 30), 1UL << 16, PROT_READ | PROT_WRITE,
>> 			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> 	if ((unsigned long)ptr != (1UL << 30)) {
>> 		perror("mmap");
>> 		return 1;
>> 	}
>>
>> 	/* a 64K folio gets faulted in */
>> 	memset(ptr, 0, 1UL << 16);
>>
>> 	/* 32K half will not be shared into child */
>> 	if (madvise(ptr, 1UL << 15, MADV_DONTFORK)) {
>> 		perror("madvise madv dontfork");
>> 		return 1;
>> 	}
>>
>> 	pid_t pid = fork();
>>
>> 	if (pid < 0) {
>> 		perror("fork");
>> 		return 1;
>> 	} else if (pid == 0) {
>> 		sleep(15);
>> 	} else {
>> 		/* merge VMAs. now first half of the 16 ptes are writable, the other half not. */
>> 		if (madvise(ptr, 1UL << 15, MADV_DOFORK)) {
>> 			perror("madvise madv fork");
>> 			return 1;
>> 		}
>> 		if (madvise(ptr, (1UL << 16), MADV_FREE)) {
>> 			perror("madvise madv free");
>> 			return 1;
>> 		}
>>
>> 		/* dirty the large folio */
>> 		(*ptr) += 10;
>>
>> 		write_to_reclaim();
>> 		// sleep(10);
>> 		waitpid(pid, NULL, 0);
>>
>> 	}
>> }
> 
> Thanks, having the repro code here is really useful!
> 
>>
>> Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
>> Cc: stable <stable@kernel.org>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> 
> This looks correct to me, so other than nits around commit msg above:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> 
>> ---
>> Patch applies on mm-unstable (9af4957ef127).
>>
>> v2->v3:
>>  - Don't special case for anon folios
>>
>> v1->v2:
>>  - Just respect the writable bit instead of hacking in a pte_wrprotect() in
>>    failure path
>>  - Also handle soft-dirty bit
>>
>>  mm/rmap.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index bff8f222004e4..5a3e408e3f179 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1955,7 +1955,14 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>  	if (userfaultfd_wp(vma))
>>  		return 1;
>>
>> -	return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
>> +	/*
>> +	 * If unmap fails, we need to restore the ptes. To avoid accidentally
>> +	 * upgrading write permissions for ptes that were not originally
>> +	 * writable, and to avoid losing the soft-dirty bit, use the
>> +	 * appropriate FPB flags.
>> +	 */
>> +	return folio_pte_batch_flags(folio, vma, pvmw->pte, &pte, max_nr,
>> +				     FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
>>  }
>>
>>  /*
>> --
>> 2.34.1
>>



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3] mm/rmap: fix incorrect pte restoration for lazyfree folios
  2026-03-03  6:15 [PATCH v3] mm/rmap: fix incorrect pte restoration for lazyfree folios Dev Jain
                   ` (2 preceding siblings ...)
  2026-03-03  9:57 ` Barry Song
@ 2026-03-03 10:32 ` Dev Jain
  2026-03-03 12:17 ` Wei Yang
  4 siblings, 0 replies; 8+ messages in thread
From: Dev Jain @ 2026-03-03 10:32 UTC (permalink / raw)
  To: akpm, david, lorenzo.stoakes
  Cc: riel, Liam.Howlett, vbabka, harry.yoo, jannh, baohua, linux-mm,
	linux-kernel, ryan.roberts, anshuman.khandual, stable



On 03/03/26 11:45 am, Dev Jain wrote:
> We batch unmap anonymous lazyfree folios by folio_unmap_pte_batch.
> If the batch has a mix of writable and non-writable bits, we may end up
> setting the entire batch writable. Fix this by respecting writable bit
> during batching.
> Although on a successful unmap of a lazyfree folio, the soft-dirty bit is
> lost, preserve it on pte restoration by respecting the bit during batching,
> to make the fix consistent w.r.t both writable bit and soft-dirty bit.
> 
> I was able to write the below reproducer and crash the kernel.
> Explanation of reproducer (set 64K mTHP to always):
> 
> Fault in a 64K large folio. Split the VMA at mid-point with MADV_DONTFORK.
> fork() - parent points to the folio with 8 writable ptes and 8 non-writable
> ptes. Merge the VMAs with MADV_DOFORK so that folio_unmap_pte_batch() can
> determine all the 16 ptes as a batch. Do MADV_FREE on the range to mark
> the folio as lazyfree. Write to the memory to dirty the pte, eventually
> rmap will dirty the folio. Then trigger reclaim, we will hit the pte
> restoration path, and the kernel will crash with the following trace:

Andrew, could you convert the last line to the following:


..restoration path, and the kernel will crash with the trace given
below. The BUG happens at:

BUG_ON(atomic_inc_return(&ptc->anon_map_count) > 1 && rw);

The code path is asking for anonymous page to be mapped writable into
the pagetable. The BUG_ON() firing implies that such a writable page has
been mapped into the pagetables of more than one process, which breaks
anonymous memory/CoW semantics.


> 
> [   21.134473] kernel BUG at mm/page_table_check.c:118!
> [   21.134497] Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
> [   21.135917] Modules linked in:
> [   21.136085] CPU: 1 UID: 0 PID: 1735 Comm: dup-lazyfree Not tainted 7.0.0-rc1-00116-g018018a17770 #1028 PREEMPT
> [   21.136858] Hardware name: linux,dummy-virt (DT)
> [   21.137019] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> [   21.137308] pc : page_table_check_set+0x28c/0x2a8
> [   21.137607] lr : page_table_check_set+0x134/0x2a8
> [   21.137885] sp : ffff80008a3b3340
> [   21.138124] x29: ffff80008a3b3340 x28: fffffdffc3d14400 x27: ffffd1a55e03d000
> [   21.138623] x26: 0040000000000040 x25: ffffd1a55f7dd000 x24: 0000000000000001
> [   21.139045] x23: 0000000000000001 x22: 0000000000000001 x21: ffffd1a55f217f30
> [   21.139629] x20: 0000000000134521 x19: 0000000000134519 x18: 005c43e000040000
> [   21.140027] x17: 0001400000000000 x16: 0001700000000000 x15: 000000000000ffff
> [   21.140578] x14: 000000000000000c x13: 005c006000000000 x12: 0000000000000020
> [   21.140828] x11: 0000000000000000 x10: 005c000000000000 x9 : ffffd1a55c079ee0
> [   21.141077] x8 : 0000000000000001 x7 : 005c03e000040000 x6 : 000000004000ffff
> [   21.141490] x5 : ffff00017fffce00 x4 : 0000000000000001 x3 : 0000000000000002
> [   21.141741] x2 : 0000000000134510 x1 : 0000000000000000 x0 : ffff0000c08228c0
> [   21.141991] Call trace:
> [   21.142093]  page_table_check_set+0x28c/0x2a8 (P)
> [   21.142265]  __page_table_check_ptes_set+0x144/0x1e8
> [   21.142441]  __set_ptes_anysz.constprop.0+0x160/0x1a8
> [   21.142766]  contpte_set_ptes+0xe8/0x140
> [   21.142907]  try_to_unmap_one+0x10c4/0x10d0
> [   21.143177]  rmap_walk_anon+0x100/0x250
> [   21.143315]  try_to_unmap+0xa0/0xc8
> [   21.143441]  shrink_folio_list+0x59c/0x18a8
> [   21.143759]  shrink_lruvec+0x664/0xbf0
> [   21.144043]  shrink_node+0x218/0x878
> [   21.144285]  __node_reclaim.constprop.0+0x98/0x338
> [   21.144763]  user_proactive_reclaim+0x2a4/0x340
> [   21.145056]  reclaim_store+0x3c/0x60
> [   21.145216]  dev_attr_store+0x20/0x40
> [   21.145585]  sysfs_kf_write+0x84/0xa8
> [   21.145835]  kernfs_fop_write_iter+0x130/0x1c8
> [   21.145994]  vfs_write+0x2b8/0x368
> [   21.146119]  ksys_write+0x70/0x110
> [   21.146240]  __arm64_sys_write+0x24/0x38
> [   21.146380]  invoke_syscall+0x50/0x120
> [   21.146513]  el0_svc_common.constprop.0+0x48/0xf8
> [   21.146679]  do_el0_svc+0x28/0x40
> [   21.146798]  el0_svc+0x34/0x110
> [   21.146926]  el0t_64_sync_handler+0xa0/0xe8
> [   21.147074]  el0t_64_sync+0x198/0x1a0
> [   21.147225] Code: f9400441 b4fff241 17ffff94 d4210000 (d4210000)
> [   21.147440] ---[ end trace 0000000000000000 ]---
> 
> 
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <string.h>
> #include <sys/wait.h>
> #include <sched.h>
> #include <fcntl.h>
> 
> void write_to_reclaim() {
>     const char *path = "/sys/devices/system/node/node0/reclaim";
>     const char *value = "409600000000";
>     int fd = open(path, O_WRONLY);
>     if (fd == -1) {
>         perror("open");
>         exit(EXIT_FAILURE);
>     }
> 
>     if (write(fd, value, sizeof("409600000000") - 1) == -1) {
>         perror("write");
>         close(fd);
>         exit(EXIT_FAILURE);
>     }
> 
>     printf("Successfully wrote %s to %s\n", value, path);
>     close(fd);
> }
> 
> int main()
> {
> 	char *ptr = mmap((void *)(1UL << 30), 1UL << 16, PROT_READ | PROT_WRITE,
> 			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 	if ((unsigned long)ptr != (1UL << 30)) {
> 		perror("mmap");
> 		return 1;
> 	}
> 	
> 	/* a 64K folio gets faulted in */
> 	memset(ptr, 0, 1UL << 16);
> 
> 	/* 32K half will not be shared into child */
> 	if (madvise(ptr, 1UL << 15, MADV_DONTFORK)) {
> 		perror("madvise madv dontfork");
> 		return 1;
> 	}
> 
> 	pid_t pid = fork();
> 
> 	if (pid < 0) {
> 		perror("fork");
> 		return 1;
> 	} else if (pid == 0) {
> 		sleep(15);
> 	} else {
> 		/* merge VMAs. now first half of the 16 ptes are writable, the other half not. */
> 		if (madvise(ptr, 1UL << 15, MADV_DOFORK)) {
> 			perror("madvise madv fork");
> 			return 1;
> 		}
> 		if (madvise(ptr, (1UL << 16), MADV_FREE)) {
> 			perror("madvise madv free");
> 			return 1;
> 		}
> 
> 		/* dirty the large folio */
> 		(*ptr) += 10;
> 
> 		write_to_reclaim();
> 		// sleep(10);
> 		waitpid(pid, NULL, 0);
> 
> 	}
> }
> 
> Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> Patch applies on mm-unstable (9af4957ef127).
> 
> v2->v3:
>  - Don't special case for anon folios
> 
> v1->v2:
>  - Just respect the writable bit instead of hacking in a pte_wrprotect() in
>    failure path
>  - Also handle soft-dirty bit
> 
>  mm/rmap.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index bff8f222004e4..5a3e408e3f179 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1955,7 +1955,14 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>  	if (userfaultfd_wp(vma))
>  		return 1;
>  
> -	return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
> +	/*
> +	 * If unmap fails, we need to restore the ptes. To avoid accidentally
> +	 * upgrading write permissions for ptes that were not originally
> +	 * writable, and to avoid losing the soft-dirty bit, use the
> +	 * appropriate FPB flags.
> +	 */
> +	return folio_pte_batch_flags(folio, vma, pvmw->pte, &pte, max_nr,
> +				     FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
>  }
>  
>  /*



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3] mm/rmap: fix incorrect pte restoration for lazyfree folios
  2026-03-03  6:15 [PATCH v3] mm/rmap: fix incorrect pte restoration for lazyfree folios Dev Jain
                   ` (3 preceding siblings ...)
  2026-03-03 10:32 ` Dev Jain
@ 2026-03-03 12:17 ` Wei Yang
  2026-03-03 12:25   ` Dev Jain
  4 siblings, 1 reply; 8+ messages in thread
From: Wei Yang @ 2026-03-03 12:17 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, lorenzo.stoakes, riel, Liam.Howlett, vbabka,
	harry.yoo, jannh, baohua, linux-mm, linux-kernel, ryan.roberts,
	anshuman.khandual, stable

On Tue, Mar 03, 2026 at 11:45:28AM +0530, Dev Jain wrote:
>We batch unmap anonymous lazyfree folios by folio_unmap_pte_batch.
>If the batch has a mix of writable and non-writable bits, we may end up
>setting the entire batch writable. Fix this by respecting writable bit
>during batching.
>Although on a successful unmap of a lazyfree folio, the soft-dirty bit is
>lost, preserve it on pte restoration by respecting the bit during batching,
>to make the fix consistent w.r.t both writable bit and soft-dirty bit.
>
>I was able to write the below reproducer and crash the kernel.
>Explanation of reproducer (set 64K mTHP to always):
>
>Fault in a 64K large folio. Split the VMA at mid-point with MADV_DONTFORK.
>fork() - parent points to the folio with 8 writable ptes and 8 non-writable
>ptes. Merge the VMAs with MADV_DOFORK so that folio_unmap_pte_batch() can
>determine all the 16 ptes as a batch. Do MADV_FREE on the range to mark
>the folio as lazyfree. Write to the memory to dirty the pte, eventually
>rmap will dirty the folio. Then trigger reclaim, we will hit the pte
>restoration path, and the kernel will crash with the following trace:
>
>[   21.134473] kernel BUG at mm/page_table_check.c:118!
>[   21.134497] Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
>[   21.135917] Modules linked in:
>[   21.136085] CPU: 1 UID: 0 PID: 1735 Comm: dup-lazyfree Not tainted 7.0.0-rc1-00116-g018018a17770 #1028 PREEMPT
>[   21.136858] Hardware name: linux,dummy-virt (DT)
>[   21.137019] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
>[   21.137308] pc : page_table_check_set+0x28c/0x2a8
>[   21.137607] lr : page_table_check_set+0x134/0x2a8
>[   21.137885] sp : ffff80008a3b3340
>[   21.138124] x29: ffff80008a3b3340 x28: fffffdffc3d14400 x27: ffffd1a55e03d000
>[   21.138623] x26: 0040000000000040 x25: ffffd1a55f7dd000 x24: 0000000000000001
>[   21.139045] x23: 0000000000000001 x22: 0000000000000001 x21: ffffd1a55f217f30
>[   21.139629] x20: 0000000000134521 x19: 0000000000134519 x18: 005c43e000040000
>[   21.140027] x17: 0001400000000000 x16: 0001700000000000 x15: 000000000000ffff
>[   21.140578] x14: 000000000000000c x13: 005c006000000000 x12: 0000000000000020
>[   21.140828] x11: 0000000000000000 x10: 005c000000000000 x9 : ffffd1a55c079ee0
>[   21.141077] x8 : 0000000000000001 x7 : 005c03e000040000 x6 : 000000004000ffff
>[   21.141490] x5 : ffff00017fffce00 x4 : 0000000000000001 x3 : 0000000000000002
>[   21.141741] x2 : 0000000000134510 x1 : 0000000000000000 x0 : ffff0000c08228c0
>[   21.141991] Call trace:
>[   21.142093]  page_table_check_set+0x28c/0x2a8 (P)
>[   21.142265]  __page_table_check_ptes_set+0x144/0x1e8
>[   21.142441]  __set_ptes_anysz.constprop.0+0x160/0x1a8
>[   21.142766]  contpte_set_ptes+0xe8/0x140
>[   21.142907]  try_to_unmap_one+0x10c4/0x10d0
>[   21.143177]  rmap_walk_anon+0x100/0x250
>[   21.143315]  try_to_unmap+0xa0/0xc8
>[   21.143441]  shrink_folio_list+0x59c/0x18a8
>[   21.143759]  shrink_lruvec+0x664/0xbf0
>[   21.144043]  shrink_node+0x218/0x878
>[   21.144285]  __node_reclaim.constprop.0+0x98/0x338
>[   21.144763]  user_proactive_reclaim+0x2a4/0x340
>[   21.145056]  reclaim_store+0x3c/0x60
>[   21.145216]  dev_attr_store+0x20/0x40
>[   21.145585]  sysfs_kf_write+0x84/0xa8
>[   21.145835]  kernfs_fop_write_iter+0x130/0x1c8
>[   21.145994]  vfs_write+0x2b8/0x368
>[   21.146119]  ksys_write+0x70/0x110
>[   21.146240]  __arm64_sys_write+0x24/0x38
>[   21.146380]  invoke_syscall+0x50/0x120
>[   21.146513]  el0_svc_common.constprop.0+0x48/0xf8
>[   21.146679]  do_el0_svc+0x28/0x40
>[   21.146798]  el0_svc+0x34/0x110
>[   21.146926]  el0t_64_sync_handler+0xa0/0xe8
>[   21.147074]  el0t_64_sync+0x198/0x1a0
>[   21.147225] Code: f9400441 b4fff241 17ffff94 d4210000 (d4210000)
>[   21.147440] ---[ end trace 0000000000000000 ]---
>
>
>#define _GNU_SOURCE
>#include <stdio.h>
>#include <unistd.h>
>#include <stdlib.h>
>#include <sys/mman.h>
>#include <string.h>
>#include <sys/wait.h>
>#include <sched.h>
>#include <fcntl.h>
>
>void write_to_reclaim() {
>    const char *path = "/sys/devices/system/node/node0/reclaim";
>    const char *value = "409600000000";
>    int fd = open(path, O_WRONLY);
>    if (fd == -1) {
>        perror("open");
>        exit(EXIT_FAILURE);
>    }
>
>    if (write(fd, value, sizeof("409600000000") - 1) == -1) {
>        perror("write");
>        close(fd);
>        exit(EXIT_FAILURE);
>    }
>
>    printf("Successfully wrote %s to %s\n", value, path);
>    close(fd);
>}
>
>int main()
>{
>	char *ptr = mmap((void *)(1UL << 30), 1UL << 16, PROT_READ | PROT_WRITE,
>			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>	if ((unsigned long)ptr != (1UL << 30)) {
>		perror("mmap");
>		return 1;
>	}
>	
>	/* a 64K folio gets faulted in */
>	memset(ptr, 0, 1UL << 16);
>
>	/* 32K half will not be shared into child */
>	if (madvise(ptr, 1UL << 15, MADV_DONTFORK)) {
>		perror("madvise madv dontfork");
>		return 1;
>	}
>
>	pid_t pid = fork();
>
>	if (pid < 0) {
>		perror("fork");
>		return 1;
>	} else if (pid == 0) {
>		sleep(15);
>	} else {
>		/* merge VMAs. now first half of the 16 ptes are writable, the other half not. */
>		if (madvise(ptr, 1UL << 15, MADV_DOFORK)) {
>			perror("madvise madv fork");
>			return 1;
>		}
>		if (madvise(ptr, (1UL << 16), MADV_FREE)) {
>			perror("madvise madv free");
>			return 1;
>		}
>
>		/* dirty the large folio */
>		(*ptr) += 10;
>
>		write_to_reclaim();
>		// sleep(10);
>		waitpid(pid, NULL, 0);
>
>	}
>}
>
>Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
>Cc: stable <stable@kernel.org>
>Signed-off-by: Dev Jain <dev.jain@arm.com>
>---
>Patch applies on mm-unstable (9af4957ef127).
>
>v2->v3:
> - Don't special case for anon folios
>
>v1->v2:
> - Just respect the writable bit instead of hacking in a pte_wrprotect() in
>   failure path
> - Also handle soft-dirty bit
>
> mm/rmap.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
>diff --git a/mm/rmap.c b/mm/rmap.c
>index bff8f222004e4..5a3e408e3f179 100644
>--- a/mm/rmap.c
>+++ b/mm/rmap.c
>@@ -1955,7 +1955,14 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> 	if (userfaultfd_wp(vma))
> 		return 1;
> 
>-	return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
>+	/*
>+	 * If unmap fails, we need to restore the ptes. To avoid accidentally
>+	 * upgrading write permissions for ptes that were not originally
>+	 * writable, and to avoid losing the soft-dirty bit, use the
>+	 * appropriate FPB flags.
>+	 */
>+	return folio_pte_batch_flags(folio, vma, pvmw->pte, &pte, max_nr,
>+				     FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
> }
> 

Hi, Dev

When reading the code, I got one confusion. Current call flow is like below:

    try_to_unmap_one();
        nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
	..
	pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
	..
	set_ptes(mm, address, pvmw.pte, pteval, nr_pages);

We get pteval by folio_unmap_pte_batch() but it is set again by
get_and_clear_ptes(), which maybe a different value. Then we use this pteval
to restore ptes.

So even we fix folio_unmap_pte_batch(), how this impact on the final restored
value?

Hope I don't miss something.

> /*
>-- 
>2.34.1
>

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3] mm/rmap: fix incorrect pte restoration for lazyfree folios
  2026-03-03 12:17 ` Wei Yang
@ 2026-03-03 12:25   ` Dev Jain
  0 siblings, 0 replies; 8+ messages in thread
From: Dev Jain @ 2026-03-03 12:25 UTC (permalink / raw)
  To: Wei Yang
  Cc: akpm, david, lorenzo.stoakes, riel, Liam.Howlett, vbabka,
	harry.yoo, jannh, baohua, linux-mm, linux-kernel, ryan.roberts,
	anshuman.khandual, stable



On 03/03/26 5:47 pm, Wei Yang wrote:
> On Tue, Mar 03, 2026 at 11:45:28AM +0530, Dev Jain wrote:
>> We batch unmap anonymous lazyfree folios by folio_unmap_pte_batch.
>> If the batch has a mix of writable and non-writable bits, we may end up
>> setting the entire batch writable. Fix this by respecting writable bit
>> during batching.
>> Although on a successful unmap of a lazyfree folio, the soft-dirty bit is
>> lost, preserve it on pte restoration by respecting the bit during batching,
>> to make the fix consistent w.r.t both writable bit and soft-dirty bit.
>>
>> I was able to write the below reproducer and crash the kernel.
>> Explanation of reproducer (set 64K mTHP to always):
>>
>> Fault in a 64K large folio. Split the VMA at mid-point with MADV_DONTFORK.
>> fork() - parent points to the folio with 8 writable ptes and 8 non-writable
>> ptes. Merge the VMAs with MADV_DOFORK so that folio_unmap_pte_batch() can
>> determine all the 16 ptes as a batch. Do MADV_FREE on the range to mark
>> the folio as lazyfree. Write to the memory to dirty the pte, eventually
>> rmap will dirty the folio. Then trigger reclaim, we will hit the pte
>> restoration path, and the kernel will crash with the following trace:
>>
>> [   21.134473] kernel BUG at mm/page_table_check.c:118!
>> [   21.134497] Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
>> [   21.135917] Modules linked in:
>> [   21.136085] CPU: 1 UID: 0 PID: 1735 Comm: dup-lazyfree Not tainted 7.0.0-rc1-00116-g018018a17770 #1028 PREEMPT
>> [   21.136858] Hardware name: linux,dummy-virt (DT)
>> [   21.137019] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
>> [   21.137308] pc : page_table_check_set+0x28c/0x2a8
>> [   21.137607] lr : page_table_check_set+0x134/0x2a8
>> [   21.137885] sp : ffff80008a3b3340
>> [   21.138124] x29: ffff80008a3b3340 x28: fffffdffc3d14400 x27: ffffd1a55e03d000
>> [   21.138623] x26: 0040000000000040 x25: ffffd1a55f7dd000 x24: 0000000000000001
>> [   21.139045] x23: 0000000000000001 x22: 0000000000000001 x21: ffffd1a55f217f30
>> [   21.139629] x20: 0000000000134521 x19: 0000000000134519 x18: 005c43e000040000
>> [   21.140027] x17: 0001400000000000 x16: 0001700000000000 x15: 000000000000ffff
>> [   21.140578] x14: 000000000000000c x13: 005c006000000000 x12: 0000000000000020
>> [   21.140828] x11: 0000000000000000 x10: 005c000000000000 x9 : ffffd1a55c079ee0
>> [   21.141077] x8 : 0000000000000001 x7 : 005c03e000040000 x6 : 000000004000ffff
>> [   21.141490] x5 : ffff00017fffce00 x4 : 0000000000000001 x3 : 0000000000000002
>> [   21.141741] x2 : 0000000000134510 x1 : 0000000000000000 x0 : ffff0000c08228c0
>> [   21.141991] Call trace:
>> [   21.142093]  page_table_check_set+0x28c/0x2a8 (P)
>> [   21.142265]  __page_table_check_ptes_set+0x144/0x1e8
>> [   21.142441]  __set_ptes_anysz.constprop.0+0x160/0x1a8
>> [   21.142766]  contpte_set_ptes+0xe8/0x140
>> [   21.142907]  try_to_unmap_one+0x10c4/0x10d0
>> [   21.143177]  rmap_walk_anon+0x100/0x250
>> [   21.143315]  try_to_unmap+0xa0/0xc8
>> [   21.143441]  shrink_folio_list+0x59c/0x18a8
>> [   21.143759]  shrink_lruvec+0x664/0xbf0
>> [   21.144043]  shrink_node+0x218/0x878
>> [   21.144285]  __node_reclaim.constprop.0+0x98/0x338
>> [   21.144763]  user_proactive_reclaim+0x2a4/0x340
>> [   21.145056]  reclaim_store+0x3c/0x60
>> [   21.145216]  dev_attr_store+0x20/0x40
>> [   21.145585]  sysfs_kf_write+0x84/0xa8
>> [   21.145835]  kernfs_fop_write_iter+0x130/0x1c8
>> [   21.145994]  vfs_write+0x2b8/0x368
>> [   21.146119]  ksys_write+0x70/0x110
>> [   21.146240]  __arm64_sys_write+0x24/0x38
>> [   21.146380]  invoke_syscall+0x50/0x120
>> [   21.146513]  el0_svc_common.constprop.0+0x48/0xf8
>> [   21.146679]  do_el0_svc+0x28/0x40
>> [   21.146798]  el0_svc+0x34/0x110
>> [   21.146926]  el0t_64_sync_handler+0xa0/0xe8
>> [   21.147074]  el0t_64_sync+0x198/0x1a0
>> [   21.147225] Code: f9400441 b4fff241 17ffff94 d4210000 (d4210000)
>> [   21.147440] ---[ end trace 0000000000000000 ]---
>>
>>
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <unistd.h>
>> #include <stdlib.h>
>> #include <sys/mman.h>
>> #include <string.h>
>> #include <sys/wait.h>
>> #include <sched.h>
>> #include <fcntl.h>
>>
>> void write_to_reclaim() {
>>    const char *path = "/sys/devices/system/node/node0/reclaim";
>>    const char *value = "409600000000";
>>    int fd = open(path, O_WRONLY);
>>    if (fd == -1) {
>>        perror("open");
>>        exit(EXIT_FAILURE);
>>    }
>>
>>    if (write(fd, value, sizeof("409600000000") - 1) == -1) {
>>        perror("write");
>>        close(fd);
>>        exit(EXIT_FAILURE);
>>    }
>>
>>    printf("Successfully wrote %s to %s\n", value, path);
>>    close(fd);
>> }
>>
>> int main()
>> {
>> 	char *ptr = mmap((void *)(1UL << 30), 1UL << 16, PROT_READ | PROT_WRITE,
>> 			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> 	if ((unsigned long)ptr != (1UL << 30)) {
>> 		perror("mmap");
>> 		return 1;
>> 	}
>> 	
>> 	/* a 64K folio gets faulted in */
>> 	memset(ptr, 0, 1UL << 16);
>>
>> 	/* 32K half will not be shared into child */
>> 	if (madvise(ptr, 1UL << 15, MADV_DONTFORK)) {
>> 		perror("madvise madv dontfork");
>> 		return 1;
>> 	}
>>
>> 	pid_t pid = fork();
>>
>> 	if (pid < 0) {
>> 		perror("fork");
>> 		return 1;
>> 	} else if (pid == 0) {
>> 		sleep(15);
>> 	} else {
>> 		/* merge VMAs. now first half of the 16 ptes are writable, the other half not. */
>> 		if (madvise(ptr, 1UL << 15, MADV_DOFORK)) {
>> 			perror("madvise madv fork");
>> 			return 1;
>> 		}
>> 		if (madvise(ptr, (1UL << 16), MADV_FREE)) {
>> 			perror("madvise madv free");
>> 			return 1;
>> 		}
>>
>> 		/* dirty the large folio */
>> 		(*ptr) += 10;
>>
>> 		write_to_reclaim();
>> 		// sleep(10);
>> 		waitpid(pid, NULL, 0);
>>
>> 	}
>> }
>>
>> Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
>> Cc: stable <stable@kernel.org>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> Patch applies on mm-unstable (9af4957ef127).
>>
>> v2->v3:
>> - Don't special case for anon folios
>>
>> v1->v2:
>> - Just respect the writable bit instead of hacking in a pte_wrprotect() in
>>   failure path
>> - Also handle soft-dirty bit
>>
>> mm/rmap.c | 9 ++++++++-
>> 1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index bff8f222004e4..5a3e408e3f179 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1955,7 +1955,14 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>> 	if (userfaultfd_wp(vma))
>> 		return 1;
>>
>> -	return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
>> +	/*
>> +	 * If unmap fails, we need to restore the ptes. To avoid accidentally
>> +	 * upgrading write permissions for ptes that were not originally
>> +	 * writable, and to avoid losing the soft-dirty bit, use the
>> +	 * appropriate FPB flags.
>> +	 */
>> +	return folio_pte_batch_flags(folio, vma, pvmw->pte, &pte, max_nr,
>> +				     FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
>> }
>>
> 
> Hi, Dev
> 
> When reading the code, I got one confusion. Current call flow is like below:
> 
>     try_to_unmap_one();
>         nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
> 	..
> 	pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
> 	..
> 	set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
> 
> We get pteval by folio_unmap_pte_batch() but it is set again by

folio_unmap_pte_batch() gives the batch size, not pteval. pteval is
given by get_and_clear_ptes() after accumulating a/d bits.

> get_and_clear_ptes(), which maybe a different value. Then we use this pteval
> to restore ptes.
> 
> So even we fix folio_unmap_pte_batch(), how this impact on the final restored
> value?

By respecting writable bit, we ensure that the ptes in the batch do not
have a mix of writable and non writable ptes.

So, if pteval returned by get_and_clear_ptes() is writable, then it is
guaranteed via folio_unmap_pte_batch() that the all pte values of
these nr_pages consecutive ptes, are writable. And vice versa.

> 
> Hope I don't miss something.
> 
>> /*
>> -- 
>> 2.34.1
>>
> 



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-03-03 12:26 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-03  6:15 [PATCH v3] mm/rmap: fix incorrect pte restoration for lazyfree folios Dev Jain
2026-03-03  8:50 ` David Hildenbrand (Arm)
2026-03-03  9:54 ` Lorenzo Stoakes
2026-03-03 10:22   ` Dev Jain
2026-03-03  9:57 ` Barry Song
2026-03-03 10:32 ` Dev Jain
2026-03-03 12:17 ` Wei Yang
2026-03-03 12:25   ` Dev Jain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox