Re: [RFC] mm: MAP_POPULATE on writable anonymous mappings marks pte dirty is necessarily?

linux-mm.kvack.org archive mirror

From: David Hildenbrand <david@redhat.com>
To: Pedro Falcato <pfalcato@suse.de>, "wuyifeng (C)" <wuyifeng10@huawei.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org
Subject: Re: [RFC] mm: MAP_POPULATE on writable anonymous mappings marks pte dirty is necessarily?
Date: Mon, 22 Sep 2025 11:07:43 +0200
Message-ID: <adab9c31-c281-4bf7-93ae-89ed9f303d7b@redhat.com>
In-Reply-To: <oypwdcx6j726iffszfayd66xizaw5tfv2lnkk7bx7ibzn37x3m@ulkgb6x6geef>

On 22.09.25 10:45, Pedro Falcato wrote:
> On Mon, Sep 22, 2025 at 02:19:51PM +0800, wuyifeng (C) wrote:
>> Hi all,
>>
>> While reviewing the memory management code, I noticed a potential
>> inefficiency when MAP_POPULATE is used on writable anonymous
>> mappings. I verified the behavior on the mainline kernel and wanted
>> to share it for discussion.
>>
>> Test Environment:
>> Kernel version: 6.17.0-rc4-00083-gb9a10f876409
>> Architecture: aarch64
>>
>> Background:
>> For anonymous mappings with PROT_WRITE | PROT_READ, using MAP_POPULATE
>> is intended to pre-fault pages, so that subsequent accesses do not
>> trigger page faults. However, I observed that when MAP_POPULATE is used
>> on writable anonymous mappings, all pre-faulted pages are immediately
>> marked as dirty, even though the user program has not written to them.
>>
>> Minimal Reproduction:
>>
>> #define _GNU_SOURCE
>> #include <sys/mman.h>
>> #include <unistd.h>
>> #include <stdio.h>
>>
>> int main() {
>>      size_t len = 100*1024*1024; // 100 MiB
>>      void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
>>                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
>>      if (p == MAP_FAILED) {
>>          perror("mmap");
>>          return 1;
>>      }
>>      pause();
>>      return 0;
>> }
>>
>> Observed Output (/proc/<pid>/smaps):
>> ffff7a600000-ffff80a00000 rw-p 00000000 00:00 0
>> Size:             102400 kB
>> KernelPageSize:        4 kB
>> MMUPageSize:           4 kB
>> Rss:              102400 kB
>> Pss:              102400 kB
>> Pss_Dirty:        102400 kB
>> Shared_Clean:          0 kB
>> Shared_Dirty:          0 kB
>> Private_Clean:         0 kB
>> Private_Dirty:    102400 kB
>> Referenced:       102400 kB
>> Anonymous:        102400 kB
>> KSM:                   0 kB
>> LazyFree:              0 kB
>> AnonHugePages:    102400 kB
>> ShmemPmdMapped:        0 kB
>> FilePmdMapped:         0 kB
>> Shared_Hugetlb:        0 kB
>> Private_Hugetlb:       0 kB
>> Swap:                  0 kB
>> SwapPss:               0 kB
>> Locked:                0 kB
>> THPeligible:           1
>> VmFlags: rd wr mr mw me ac
>>
>> Code Path Analysis:
>> The behavior can be traced through the following kernel code path:
>> populate_vma_page_range() is invoked to pre-fault pages for the VMA.
>> Inside it:
>>
>> if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
>>          gup_flags |= FOLL_WRITE;
>>
>> This sets FOLL_WRITE for writable private (non-shared) VMAs, which
>> includes the writable anonymous mapping above.
>>
>> Later, in faultin_page():
>>
>> if (*flags & FOLL_WRITE)
>>          fault_flags |= FAULT_FLAG_WRITE;
>>
>> This effectively marks the page fault as a write.
>> Finally, in do_anonymous_page():
>>
>> if (vma->vm_flags & VM_WRITE)
>>          entry = pte_mkwrite(pte_mkdirty(entry), vma);
>>
>> Here, the PTE is updated to writable and immediately marked dirty.
>> As a result, all pre-faulted pages are marked dirty, even though the
>> user program has not performed any writes.
>> For large anonymous mappings, this can trigger unnecessary swap-out
>> writebacks, generating avoidable I/O.
>>
>> Discussion:
>> Would it be possible to optimize this behavior, for example by
>> populating the PTE as writable but deferring the dirty bit until the
>> user actually writes to the page?
> 
> How would we know if the user wrote to the page, since we marked it writeable?

On write access, either the HW sets the dirty bit (if it supports 
that), or we get another fault and set the dirty bit manually.

On architectures where the HW doesn't support setting the dirty bit, 
pte_mkwrite() checks whether the pte is dirty. If it is not dirty, the 
HW write bit is not set; instead, the next pte_mkdirty() sets the 
actual HW write bit.

See the pte_mkwrite() handling in arch/sparc/include/asm/pgtable_64.h 
or arch/s390/include/asm/pgtable.h.
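
To make the scheme above concrete, here is a minimal userspace sketch of
software dirty tracking. It is a toy model under stated assumptions, not
the sparc or s390 code: the bit layout, the toy_pte_t type, and the
model_* helper names are all made up for illustration, and the real
pte_mkwrite() also takes a vma argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout; real layouts are architecture-specific. */
#define TOY_PTE_SW_WRITE (1u << 0) /* software: mapping may be written */
#define TOY_PTE_SW_DIRTY (1u << 1) /* software: page has been written */
#define TOY_PTE_HW_WRITE (1u << 2) /* hardware: stores do not fault */

typedef uint32_t toy_pte_t;

/* Grant write permission; enable HW write only if already dirty. */
static toy_pte_t model_pte_mkwrite(toy_pte_t pte)
{
	pte |= TOY_PTE_SW_WRITE;
	if (pte & TOY_PTE_SW_DIRTY)
		pte |= TOY_PTE_HW_WRITE;
	return pte;
}

/* Mark dirty; enable HW write if the mapping is software-writable. */
static toy_pte_t model_pte_mkdirty(toy_pte_t pte)
{
	pte |= TOY_PTE_SW_DIRTY;
	if (pte & TOY_PTE_SW_WRITE)
		pte |= TOY_PTE_HW_WRITE;
	return pte;
}

/*
 * Simulate a store through the pte. Returns true if the store faulted.
 * Only the "clean but writable" case is modelled: the fault handler
 * marks the pte dirty, which also turns on the HW write bit.
 */
static bool model_store_faults(toy_pte_t *pte)
{
	if (*pte & TOY_PTE_HW_WRITE)
		return false; /* no fault, nothing left to track */
	if (*pte & TOY_PTE_SW_WRITE)
		*pte = model_pte_mkdirty(*pte);
	return true;
}
```

In this model, a pte built as model_pte_mkwrite(0) takes exactly one
fault on its first store and none afterwards, while
model_pte_mkwrite(model_pte_mkdirty(0)), the order used in the quoted
do_anonymous_page() snippet, takes none.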

Of course, setting the dirty bit either way on later access comes with a 
price.

-- 
Cheers

David / dhildenb
