From: Dev Jain <dev.jain@arm.com>
To: Wei Yang <richard.weiyang@gmail.com>
Cc: akpm@linux-foundation.org, david@kernel.org,
	lorenzo.stoakes@oracle.com, riel@surriel.com,
	Liam.Howlett@oracle.com, vbabka@kernel.org, harry.yoo@oracle.com,
	jannh@google.com, baohua@kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, ryan.roberts@arm.com,
	anshuman.khandual@arm.com, stable <stable@kernel.org>
Subject: Re: [PATCH v3] mm/rmap: fix incorrect pte restoration for lazyfree folios
Date: Tue, 3 Mar 2026 17:55:58 +0530
Message-ID: <257aa379-64cf-47af-b48d-8817f7bca257@arm.com>
In-Reply-To: <20260303121727.ss3d3gbzituygb6p@master>



On 03/03/26 5:47 pm, Wei Yang wrote:
> On Tue, Mar 03, 2026 at 11:45:28AM +0530, Dev Jain wrote:
>> We batch unmap anonymous lazyfree folios by folio_unmap_pte_batch.
>> If the batch has a mix of writable and non-writable bits, we may end up
>> setting the entire batch writable. Fix this by respecting writable bit
>> during batching.
>> Although on a successful unmap of a lazyfree folio, the soft-dirty bit is
>> lost, preserve it on pte restoration by respecting the bit during batching,
>> to make the fix consistent w.r.t both writable bit and soft-dirty bit.
>>
>> I was able to write the below reproducer and crash the kernel.
>> Explanation of reproducer (set 64K mTHP to always):
>>
>> Fault in a 64K large folio. Split the VMA at mid-point with MADV_DONTFORK.
>> fork() - parent points to the folio with 8 writable ptes and 8 non-writable
>> ptes. Merge the VMAs with MADV_DOFORK so that folio_unmap_pte_batch() can
>> determine all the 16 ptes as a batch. Do MADV_FREE on the range to mark
>> the folio as lazyfree. Write to the memory to dirty the pte, eventually
>> rmap will dirty the folio. Then trigger reclaim, we will hit the pte
>> restoration path, and the kernel will crash with the following trace:
>>
>> [   21.134473] kernel BUG at mm/page_table_check.c:118!
>> [   21.134497] Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
>> [   21.135917] Modules linked in:
>> [   21.136085] CPU: 1 UID: 0 PID: 1735 Comm: dup-lazyfree Not tainted 7.0.0-rc1-00116-g018018a17770 #1028 PREEMPT
>> [   21.136858] Hardware name: linux,dummy-virt (DT)
>> [   21.137019] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
>> [   21.137308] pc : page_table_check_set+0x28c/0x2a8
>> [   21.137607] lr : page_table_check_set+0x134/0x2a8
>> [   21.137885] sp : ffff80008a3b3340
>> [   21.138124] x29: ffff80008a3b3340 x28: fffffdffc3d14400 x27: ffffd1a55e03d000
>> [   21.138623] x26: 0040000000000040 x25: ffffd1a55f7dd000 x24: 0000000000000001
>> [   21.139045] x23: 0000000000000001 x22: 0000000000000001 x21: ffffd1a55f217f30
>> [   21.139629] x20: 0000000000134521 x19: 0000000000134519 x18: 005c43e000040000
>> [   21.140027] x17: 0001400000000000 x16: 0001700000000000 x15: 000000000000ffff
>> [   21.140578] x14: 000000000000000c x13: 005c006000000000 x12: 0000000000000020
>> [   21.140828] x11: 0000000000000000 x10: 005c000000000000 x9 : ffffd1a55c079ee0
>> [   21.141077] x8 : 0000000000000001 x7 : 005c03e000040000 x6 : 000000004000ffff
>> [   21.141490] x5 : ffff00017fffce00 x4 : 0000000000000001 x3 : 0000000000000002
>> [   21.141741] x2 : 0000000000134510 x1 : 0000000000000000 x0 : ffff0000c08228c0
>> [   21.141991] Call trace:
>> [   21.142093]  page_table_check_set+0x28c/0x2a8 (P)
>> [   21.142265]  __page_table_check_ptes_set+0x144/0x1e8
>> [   21.142441]  __set_ptes_anysz.constprop.0+0x160/0x1a8
>> [   21.142766]  contpte_set_ptes+0xe8/0x140
>> [   21.142907]  try_to_unmap_one+0x10c4/0x10d0
>> [   21.143177]  rmap_walk_anon+0x100/0x250
>> [   21.143315]  try_to_unmap+0xa0/0xc8
>> [   21.143441]  shrink_folio_list+0x59c/0x18a8
>> [   21.143759]  shrink_lruvec+0x664/0xbf0
>> [   21.144043]  shrink_node+0x218/0x878
>> [   21.144285]  __node_reclaim.constprop.0+0x98/0x338
>> [   21.144763]  user_proactive_reclaim+0x2a4/0x340
>> [   21.145056]  reclaim_store+0x3c/0x60
>> [   21.145216]  dev_attr_store+0x20/0x40
>> [   21.145585]  sysfs_kf_write+0x84/0xa8
>> [   21.145835]  kernfs_fop_write_iter+0x130/0x1c8
>> [   21.145994]  vfs_write+0x2b8/0x368
>> [   21.146119]  ksys_write+0x70/0x110
>> [   21.146240]  __arm64_sys_write+0x24/0x38
>> [   21.146380]  invoke_syscall+0x50/0x120
>> [   21.146513]  el0_svc_common.constprop.0+0x48/0xf8
>> [   21.146679]  do_el0_svc+0x28/0x40
>> [   21.146798]  el0_svc+0x34/0x110
>> [   21.146926]  el0t_64_sync_handler+0xa0/0xe8
>> [   21.147074]  el0t_64_sync+0x198/0x1a0
>> [   21.147225] Code: f9400441 b4fff241 17ffff94 d4210000 (d4210000)
>> [   21.147440] ---[ end trace 0000000000000000 ]---
>>
>>
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <unistd.h>
>> #include <stdlib.h>
>> #include <sys/mman.h>
>> #include <string.h>
>> #include <sys/wait.h>
>> #include <sched.h>
>> #include <fcntl.h>
>>
>> void write_to_reclaim() {
>>    const char *path = "/sys/devices/system/node/node0/reclaim";
>>    const char *value = "409600000000";
>>    int fd = open(path, O_WRONLY);
>>    if (fd == -1) {
>>        perror("open");
>>        exit(EXIT_FAILURE);
>>    }
>>
>>    if (write(fd, value, sizeof("409600000000") - 1) == -1) {
>>        perror("write");
>>        close(fd);
>>        exit(EXIT_FAILURE);
>>    }
>>
>>    printf("Successfully wrote %s to %s\n", value, path);
>>    close(fd);
>> }
>>
>> int main()
>> {
>> 	char *ptr = mmap((void *)(1UL << 30), 1UL << 16, PROT_READ | PROT_WRITE,
>> 			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> 	if ((unsigned long)ptr != (1UL << 30)) {
>> 		perror("mmap");
>> 		return 1;
>> 	}
>> 	
>> 	/* a 64K folio gets faulted in */
>> 	memset(ptr, 0, 1UL << 16);
>>
>> 	/* 32K half will not be shared into child */
>> 	if (madvise(ptr, 1UL << 15, MADV_DONTFORK)) {
>> 		perror("madvise madv dontfork");
>> 		return 1;
>> 	}
>>
>> 	pid_t pid = fork();
>>
>> 	if (pid < 0) {
>> 		perror("fork");
>> 		return 1;
>> 	} else if (pid == 0) {
>> 		sleep(15);
>> 	} else {
>> 		/* merge VMAs. now first half of the 16 ptes are writable, the other half not. */
>> 		if (madvise(ptr, 1UL << 15, MADV_DOFORK)) {
>> 			perror("madvise madv fork");
>> 			return 1;
>> 		}
>> 		if (madvise(ptr, (1UL << 16), MADV_FREE)) {
>> 			perror("madvise madv free");
>> 			return 1;
>> 		}
>>
>> 		/* dirty the large folio */
>> 		(*ptr) += 10;
>>
>> 		write_to_reclaim();
>> 		// sleep(10);
>> 		waitpid(pid, NULL, 0);
>>
>> 	}
>> }
>>
>> Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
>> Cc: stable <stable@kernel.org>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> Patch applies on mm-unstable (9af4957ef127).
>>
>> v2->v3:
>> - Don't special case for anon folios
>>
>> v1->v2:
>> - Just respect the writable bit instead of hacking in a pte_wrprotect() in
>>   failure path
>> - Also handle soft-dirty bit
>>
>> mm/rmap.c | 9 ++++++++-
>> 1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index bff8f222004e4..5a3e408e3f179 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1955,7 +1955,14 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>> 	if (userfaultfd_wp(vma))
>> 		return 1;
>>
>> -	return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
>> +	/*
>> +	 * If unmap fails, we need to restore the ptes. To avoid accidentally
>> +	 * upgrading write permissions for ptes that were not originally
>> +	 * writable, and to avoid losing the soft-dirty bit, use the
>> +	 * appropriate FPB flags.
>> +	 */
>> +	return folio_pte_batch_flags(folio, vma, pvmw->pte, &pte, max_nr,
>> +				     FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
>> }
>>
> 
> Hi, Dev
> 
> While reading the code, one thing confused me. The current call flow is:
> 
>     try_to_unmap_one();
>         nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
>         ..
>         pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
>         ..
>         set_ptes(mm, address, pvmw.pte, pteval, nr_pages);
> 
> We get pteval by folio_unmap_pte_batch() but it is set again by

folio_unmap_pte_batch() returns the batch size (nr_pages), not pteval.
pteval comes from get_and_clear_ptes(), which accumulates the access/dirty
bits across the batch.

> get_and_clear_ptes(), which may be a different value. Then we use this pteval
> to restore ptes.
> 
> So even if we fix folio_unmap_pte_batch(), how does this affect the final
> restored value?

By respecting the writable bit, we ensure that a batch never contains a
mix of writable and non-writable ptes.

So if the pteval returned by get_and_clear_ptes() is writable, then
folio_unmap_pte_batch() guarantees that all nr_pages consecutive ptes in
the batch were writable, and vice versa. Restoring the batch from that
single pteval via set_ptes() therefore cannot upgrade write permissions.
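
To make the argument concrete, here is a rough userspace model of the
three steps in the call flow. This is not kernel code: the toy_* names,
the toy_pte_t type and the bit layout are all made up for illustration,
and the TOY_RESPECT_* flags only mimic what FPB_RESPECT_WRITE and
FPB_RESPECT_SOFT_DIRTY are meant to do. It assumes a batch is a run of
ptes mapping consecutive pfns whose compared bits are equal.

```c
#include <assert.h>
#include <stdint.h>

/* Toy PTE, NOT the kernel's pte_t. The bit layout is invented. */
typedef uint64_t toy_pte_t;

#define TOY_PTE_WRITE      (1ULL << 0)
#define TOY_PTE_SOFT_DIRTY (1ULL << 1)
#define TOY_PTE_DIRTY      (1ULL << 2)
#define TOY_PTE_YOUNG      (1ULL << 3)
#define TOY_PFN_SHIFT      12

/* Hypothetical stand-ins for FPB_RESPECT_WRITE / FPB_RESPECT_SOFT_DIRTY. */
#define TOY_RESPECT_WRITE      (1U << 0)
#define TOY_RESPECT_SOFT_DIRTY (1U << 1)

/* Mask out the bits that must not influence the batch comparison. */
static toy_pte_t toy_normalize(toy_pte_t pte, unsigned int flags)
{
	pte &= ~(TOY_PTE_DIRTY | TOY_PTE_YOUNG);	/* a/d never compared */
	if (!(flags & TOY_RESPECT_WRITE))
		pte &= ~TOY_PTE_WRITE;
	if (!(flags & TOY_RESPECT_SOFT_DIRTY))
		pte &= ~TOY_PTE_SOFT_DIRTY;
	return pte;
}

/* Step 1: return only the batch size, never a pte value. */
static unsigned int toy_pte_batch(const toy_pte_t *ptes, unsigned int max_nr,
				  unsigned int flags)
{
	toy_pte_t expected = toy_normalize(ptes[0], flags);
	unsigned int nr;

	assert(max_nr >= 1);
	for (nr = 1; nr < max_nr; nr++) {
		expected += 1ULL << TOY_PFN_SHIFT;	/* next pfn */
		if (toy_normalize(ptes[nr], flags) != expected)
			break;
	}
	return nr;
}

/* Step 2: clear nr ptes; return the first one with a/d bits accumulated. */
static toy_pte_t toy_get_and_clear(toy_pte_t *ptes, unsigned int nr)
{
	toy_pte_t pte = ptes[0];
	unsigned int i;

	ptes[0] = 0;
	for (i = 1; i < nr; i++) {
		pte |= ptes[i] & (TOY_PTE_DIRTY | TOY_PTE_YOUNG);
		ptes[i] = 0;
	}
	return pte;
}

/* Step 3 (failure path): restore nr ptes from one value, advancing the pfn. */
static void toy_set_ptes(toy_pte_t *ptes, toy_pte_t pte, unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		ptes[i] = pte;
		pte += 1ULL << TOY_PFN_SHIFT;
	}
}

/* Reproducer-like layout: first half writable, second half not. */
static void toy_fill_mixed(toy_pte_t *ptes, unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++)
		ptes[i] = ((toy_pte_t)(0x100 + i) << TOY_PFN_SHIFT) |
			  (i < nr / 2 ? TOY_PTE_WRITE : 0);
}
```

With 16 ptes filled by toy_fill_mixed(), toy_pte_batch(ptes, 16, 0)
covers all 16 entries, so the clear-and-restore sequence writes the
first entry's writable bit into the 8 read-only entries. Passing
TOY_RESPECT_WRITE stops the batch at 8, so every pte restored from the
single accumulated value had the same writable bit to begin with.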

> 
> Hope I'm not missing something.
> 
>> /*
>> -- 
>> 2.34.1
>>
> 


