From: Pedro Falcato <pfalcato@suse.de>
To: David Hildenbrand <david@redhat.com>
Cc: "wuyifeng (C)" <wuyifeng10@huawei.com>,
akpm@linux-foundation.org, linux-mm@kvack.org
Subject: Re: [RFC] mm: MAP_POPULATE on writable anonymous mappings marks pte dirty is necessarily?
Date: Mon, 22 Sep 2025 10:37:22 +0100
Message-ID: <hgbxlsaxe3p6npj3tyrd6u64qel6monttjaadzcnzpcpgi7arp@ztzuvbq55vuz>
In-Reply-To: <adab9c31-c281-4bf7-93ae-89ed9f303d7b@redhat.com>

On Mon, Sep 22, 2025 at 11:07:43AM +0200, David Hildenbrand wrote:
> On 22.09.25 10:45, Pedro Falcato wrote:
> > On Mon, Sep 22, 2025 at 02:19:51PM +0800, wuyifeng (C) wrote:
> > > Hi all,
> > >
> > > While reviewing the memory management code, I noticed a
> > > potential inefficiency related to MAP_POPULATE used on writable
> > > anonymous mappings. I verified the behavior on the mainline kernel
> > > and wanted to share it for discussion.
> > >
> > > Test Environment:
> > > Kernel version: 6.17.0-rc4-00083-gb9a10f876409
> > > Architecture: aarch64
> > >
> > > Background:
> > > For anonymous mappings with PROT_WRITE | PROT_READ, using MAP_POPULATE
> > > is intended to pre-fault pages, so that subsequent accesses do not
> > > trigger page faults. However, I observed that when MAP_POPULATE is used
> > > on writable anonymous mappings, all pre-faulted pages are immediately
> > > marked as dirty, even though the user program has not written to them.
> > >
> > > Minimal Reproduction:
> > >
> > > #define _GNU_SOURCE
> > > #include <sys/mman.h>
> > > #include <unistd.h>
> > > #include <stdio.h>
> > >
> > > int main() {
> > >     size_t len = 100*1024*1024; // 100MB
> > >     void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
> > >                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
> > >     if (p == MAP_FAILED) {
> > >         perror("mmap");
> > >         return 1;
> > >     }
> > >     pause();
> > >     return 0;
> > > }
> > >
> > > Observed Output (/proc/<pid>/smaps):
> > > ffff7a600000-ffff80a00000 rw-p 00000000 00:00 0
> > > Size: 102400 kB
> > > KernelPageSize: 4 kB
> > > MMUPageSize: 4 kB
> > > Rss: 102400 kB
> > > Pss: 102400 kB
> > > Pss_Dirty: 102400 kB
> > > Shared_Clean: 0 kB
> > > Shared_Dirty: 0 kB
> > > Private_Clean: 0 kB
> > > Private_Dirty: 102400 kB
> > > Referenced: 102400 kB
> > > Anonymous: 102400 kB
> > > KSM: 0 kB
> > > LazyFree: 0 kB
> > > AnonHugePages: 102400 kB
> > > ShmemPmdMapped: 0 kB
> > > FilePmdMapped: 0 kB
> > > Shared_Hugetlb: 0 kB
> > > Private_Hugetlb: 0 kB
> > > Swap: 0 kB
> > > SwapPss: 0 kB
> > > Locked: 0 kB
> > > THPeligible: 1
> > > VmFlags: rd wr mr mw me ac
> > >
> > > Code Path Analysis:
> > > The behavior can be traced through the following kernel code path:
> > > populate_vma_page_range() is invoked to pre-fault pages for the VMA.
> > > Inside it:
> > >
> > > if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
> > >     gup_flags |= FOLL_WRITE;
> > >
> > > This sets FOLL_WRITE for writable anonymous VMAs.
> > >
> > > Later, in faultin_page():
> > >
> > > if (*flags & FOLL_WRITE)
> > >     fault_flags |= FAULT_FLAG_WRITE;
> > >
> > > This effectively marks the page fault as a write.
> > > Finally, in do_anonymous_page():
> > >
> > > if (vma->vm_flags & VM_WRITE)
> > >     entry = pte_mkwrite(pte_mkdirty(entry), vma);
> > >
> > > Here, the PTE is updated to writable and immediately marked dirty.
> > > As a result, all pre-faulted pages are marked dirty, even though the
> > > user program has not performed any writes.
> > > For large anonymous mappings, this can trigger unnecessary swap-out
> > > writebacks, generating avoidable I/O.
> > >
> > > Discussion:
> > > Would it be possible to optimize this behavior, for example by
> > > populating the PTE as writable but deferring the dirty bit until the
> > > user actually writes to the page?
> >
> > How would we know if the user wrote to the page, since we marked it writeable?
>
> On access, either HW sets the dirty bit if it supports it, or we get another
> fault and set the dirty bit manually.
>
> What happens on architectures where the HW doesn't support setting the dirty
> bit is that performing a pte_mkwrite() checks whether the pte is dirty. If
> it's not dirty the HW write bit will not be set and instead the next
> pte_mkdirty() will set the actual HW write bit.
>
> See pte_mkwrite() handling in arch/sparc/include/asm/pgtable_64.h or
> arch/s390/include/asm/pgtable.h
>
> Of course, setting the dirty bit either way on later access comes with a
> price.

Ah, yes, the details were a little fuzzy in my head, thanks.
I'm trying to swap in (ha!) the details again. We still proactively mark anon
folios dirty anyway for $reasons, so optimizing it might be difficult? Not sure
if it is _worth_ optimizing for anyway.
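
FWIW, here is how I picture the scheme David describes above for
architectures without a HW-managed dirty bit, written as a userspace toy
model. Every toy_* name and the bit layout are made up for illustration;
this is not the sparc or s390 code, only a sketch of the idea that the HW
write bit is withheld until the PTE is both writable and dirty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for a toy PTE; not any real architecture's. */
#define TOY_SW_WRITE (1u << 0) /* software: the mapping may be written */
#define TOY_SW_DIRTY (1u << 1) /* software: the page has been written */
#define TOY_HW_WRITE (1u << 2) /* hardware: stores do not fault */

typedef uint32_t toy_pte_t;

/* Mark writable; only grant HW write access if already dirty. */
static toy_pte_t toy_pte_mkwrite(toy_pte_t pte)
{
	pte |= TOY_SW_WRITE;
	if (pte & TOY_SW_DIRTY)
		pte |= TOY_HW_WRITE;
	return pte;
}

/* Mark dirty; grant HW write access if the mapping is writable. */
static toy_pte_t toy_pte_mkdirty(toy_pte_t pte)
{
	pte |= TOY_SW_DIRTY;
	if (pte & TOY_SW_WRITE)
		pte |= TOY_HW_WRITE;
	return pte;
}

/*
 * Simulate a store through the PTE. Returns true if a write fault had
 * to be taken to set the dirty bit, false if the store went straight
 * through. The caller must pass a PTE that is at least SW-writable.
 */
static bool toy_store(toy_pte_t *pte)
{
	if (*pte & TOY_HW_WRITE)
		return false;
	assert(*pte & TOY_SW_WRITE);
	*pte = toy_pte_mkdirty(*pte);
	return true;
}
```

In this model a PTE built with toy_pte_mkwrite() alone takes one extra
fault on the first store, while toy_pte_mkwrite(toy_pte_mkdirty()) never
faults, which I take to be the price David mentions.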
--
Pedro