From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Dev Jain <dev.jain@arm.com>
Cc: akpm@linux-foundation.org, david@kernel.org, riel@surriel.com,
	Liam.Howlett@oracle.com, vbabka@kernel.org, harry.yoo@oracle.com,
	jannh@google.com, baohua@kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, stable <stable@kernel.org>
Subject: Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios
Date: Tue, 24 Feb 2026 11:31:24 +0000
Message-ID: <763ffcc5-8640-4b48-8ace-051ff0ccbdaf@lucifer.local>
In-Reply-To: <20260224110934.881360-1-dev.jain@arm.com>

Thanks Dev.

Andrew - why was commit 354dffd29575 ("mm: support batched unmap for lazyfree
large folios during reclamation") merged?

It had enormous amounts of review commentary at
https://lore.kernel.org/all/146b4cb1-aa1e-4519-9e03-f98cfb1135d2@redhat.com/ and
no tags. This should be a signal to wait for a respin _at least_, and really,
if it's late in the cycle, that suggests it should wait a cycle.

I've said that going forward I'm going to check THP series for tags and, if
they're not present, NAK them if they hit mm-stable. I guess I'll extend that
to rmap also.

It'd be easier for all concerned if we could yank stuff earlier
though. Waiting for the next cycle isn't a bad thing and avoids this kind
of bug.

Dev - I wonder if we shouldn't just revert 354dffd29575. I don't like how
the original patch piles more mess into an already HUGE function and it's
clearly adding risk here.

On Tue, Feb 24, 2026 at 04:39:34PM +0530, Dev Jain wrote:
> We batch unmapping of anonymous lazyfree folios by folio_unmap_pte_batch.
> If the batch has a mix of writable and non-writable bits, we may end up
> setting the entire batch writable. Fix this by write-protecting the ptes
> during pte restoration in the failure path.
>
> I was able to write the below reproducer and crash the kernel.
> Explanation of reproducer (set 64K mTHP to always):
>
> Fault in a 64K large folio. Split the VMA at mid-point with MADV_DONTFORK.
> fork() - parent points to the folio with 8 writable ptes and 8 non-writable
> ptes. Merge the VMAs with MADV_DOFORK so that folio_unmap_pte_batch() can
> determine all the 16 ptes as a batch. Do MADV_FREE on the range to mark
> the folio as lazyfree. Write to the memory to dirty the pte; eventually
> rmap will dirty the folio. Then trigger reclaim: we hit the pte
> restoration path, and the kernel crashes with the following trace:
>
> [ 21.134473] kernel BUG at mm/page_table_check.c:118!
> [ 21.134497] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
> [ 21.135917] Modules linked in:
> [ 21.136085] CPU: 1 UID: 0 PID: 1735 Comm: dup-lazyfree Not tainted 7.0.0-rc1-00116-g018018a17770 #1028 PREEMPT
> [ 21.136858] Hardware name: linux,dummy-virt (DT)
> [ 21.137019] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> [ 21.137308] pc : page_table_check_set+0x28c/0x2a8
> [ 21.137607] lr : page_table_check_set+0x134/0x2a8
> [ 21.137885] sp : ffff80008a3b3340
> [ 21.138124] x29: ffff80008a3b3340 x28: fffffdffc3d14400 x27: ffffd1a55e03d000
> [ 21.138623] x26: 0040000000000040 x25: ffffd1a55f7dd000 x24: 0000000000000001
> [ 21.139045] x23: 0000000000000001 x22: 0000000000000001 x21: ffffd1a55f217f30
> [ 21.139629] x20: 0000000000134521 x19: 0000000000134519 x18: 005c43e000040000
> [ 21.140027] x17: 0001400000000000 x16: 0001700000000000 x15: 000000000000ffff
> [ 21.140578] x14: 000000000000000c x13: 005c006000000000 x12: 0000000000000020
> [ 21.140828] x11: 0000000000000000 x10: 005c000000000000 x9 : ffffd1a55c079ee0
> [ 21.141077] x8 : 0000000000000001 x7 : 005c03e000040000 x6 : 000000004000ffff
> [ 21.141490] x5 : ffff00017fffce00 x4 : 0000000000000001 x3 : 0000000000000002
> [ 21.141741] x2 : 0000000000134510 x1 : 0000000000000000 x0 : ffff0000c08228c0
> [ 21.141991] Call trace:
> [ 21.142093] page_table_check_set+0x28c/0x2a8 (P)
> [ 21.142265] __page_table_check_ptes_set+0x144/0x1e8
> [ 21.142441] __set_ptes_anysz.constprop.0+0x160/0x1a8
> [ 21.142766] contpte_set_ptes+0xe8/0x140
> [ 21.142907] try_to_unmap_one+0x10c4/0x10d0
> [ 21.143177] rmap_walk_anon+0x100/0x250
> [ 21.143315] try_to_unmap+0xa0/0xc8
> [ 21.143441] shrink_folio_list+0x59c/0x18a8
> [ 21.143759] shrink_lruvec+0x664/0xbf0
> [ 21.144043] shrink_node+0x218/0x878
> [ 21.144285] __node_reclaim.constprop.0+0x98/0x338
> [ 21.144763] user_proactive_reclaim+0x2a4/0x340
> [ 21.145056] reclaim_store+0x3c/0x60
> [ 21.145216] dev_attr_store+0x20/0x40
> [ 21.145585] sysfs_kf_write+0x84/0xa8
> [ 21.145835] kernfs_fop_write_iter+0x130/0x1c8
> [ 21.145994] vfs_write+0x2b8/0x368
> [ 21.146119] ksys_write+0x70/0x110
> [ 21.146240] __arm64_sys_write+0x24/0x38
> [ 21.146380] invoke_syscall+0x50/0x120
> [ 21.146513] el0_svc_common.constprop.0+0x48/0xf8
> [ 21.146679] do_el0_svc+0x28/0x40
> [ 21.146798] el0_svc+0x34/0x110
> [ 21.146926] el0t_64_sync_handler+0xa0/0xe8
> [ 21.147074] el0t_64_sync+0x198/0x1a0
> [ 21.147225] Code: f9400441 b4fff241 17ffff94 d4210000 (d4210000)
> [ 21.147440] ---[ end trace 0000000000000000 ]---
>
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <string.h>
> #include <sys/wait.h>
> #include <sched.h>
> #include <fcntl.h>
>
> void write_to_reclaim() {
> 	const char *path = "/sys/devices/system/node/node0/reclaim";
> 	const char *value = "409600000000";
> 	int fd = open(path, O_WRONLY);
> 	if (fd == -1) {
> 		perror("open");
> 		exit(EXIT_FAILURE);
> 	}
>
> 	if (write(fd, value, sizeof("409600000000") - 1) == -1) {
> 		perror("write");
> 		close(fd);
> 		exit(EXIT_FAILURE);
> 	}
>
> 	printf("Successfully wrote %s to %s\n", value, path);
> 	close(fd);
> }
>
> int main()
> {
> 	char *ptr = mmap((void *)(1UL << 30), 1UL << 16, PROT_READ | PROT_WRITE,
> 			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 	if ((unsigned long)ptr != (1UL << 30)) {
> 		perror("mmap");
> 		return 1;
> 	}
>
> 	/* a 64K folio gets faulted in */
> 	memset(ptr, 0, 1UL << 16);
>
> 	/* 32K half will not be shared into child */
> 	if (madvise(ptr, 1UL << 15, MADV_DONTFORK)) {
> 		perror("madvise madv dontfork");
> 		return 1;
> 	}
>
> 	pid_t pid = fork();
>
> 	if (pid < 0) {
> 		perror("fork");
> 		return 1;
> 	} else if (pid == 0) {
> 		sleep(15);
> 	} else {
> 		/* merge VMAs. now first half of the 16 ptes are writable, the other half not. */
> 		if (madvise(ptr, 1UL << 15, MADV_DOFORK)) {
> 			perror("madvise madv fork");
> 			return 1;
> 		}
> 		if (madvise(ptr, (1UL << 16), MADV_FREE)) {
> 			perror("madvise madv free");
> 			return 1;
> 		}
>
> 		/* dirty the large folio */
> 		(*ptr) += 10;
>
> 		write_to_reclaim();
> 		// sleep(10);
> 		waitpid(pid, NULL, 0);
>
> 	}
> }
>
> Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> Applies on mm-new (commit 018018a17770).

Thanks, but please base on mm-unstable, as mm-new is for now considered a
testing base only (yes, we will endure merge conflict pain, but I think
it's worthwhile).

>
> mm/rmap.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index bff8f222004e4..501519844f290 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2235,6 +2235,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  				smp_rmb();
>
>  				if (folio_test_dirty(folio) && !(vma->vm_flags & VM_DROPPABLE)) {
> +					/*
> +					 * The pte batch may have a mix of writable and non-writable
> +					 * ptes. If the first pte of the batch was writable, we may
> +					 * end up restoring the ptes incorrectly by setting the
> +					 * entire batch writable. Avoid this by setting the batch
> +					 * non-writable; this is not optimal, but improbable to
> +					 * reach by virtue of being a failure path.
> +					 */
> +					pteval = pte_wrprotect(pteval);

Is this really a good long-term solution? This feels like a hack.

> +
>  					/*
>  					 * redirtied either using the page table or a previously
>  					 * obtained GUP reference.
> @@ -2243,6 +2253,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  					folio_set_swapbacked(folio);
>  					goto walk_abort;
>  				} else if (ref_count != 1 + map_count) {
> +					/* See comment above */
> +					pteval = pte_wrprotect(pteval);
> +

Again, feels like a hack.

>  					/*
>  					 * Additional reference. Could be a GUP reference or any
>  					 * speculative reference. GUP users must mark the folio
> --
> 2.34.1
>

So maybe a revert + a rethink?

David - what do you think?

Thanks, Lorenzo