Re: [RFC] mm: MAP_POPULATE on writable anonymous mappings marks pte dirty is necessarily?


From: David Hildenbrand <david@redhat.com>
To: "wuyifeng (C)" <wuyifeng10@huawei.com>, akpm@linux-foundation.org
Cc: linux-mm@kvack.org
Subject: Re: [RFC] mm: MAP_POPULATE on writable anonymous mappings marks pte dirty is necessarily?
Date: Mon, 22 Sep 2025 11:00:22 +0200	[thread overview]
Message-ID: <696c74b1-269a-4dec-989d-5ea74b509f30@redhat.com> (raw)
In-Reply-To: <17ad24e5-9ee0-4d94-be5f-3c28bd57460a@huawei.com>

On 22.09.25 08:19, wuyifeng (C) wrote:
> Hi all, While reviewing the memory management code, I noticed a
> potential inefficiency related to MAP_POPULATE used on writable
> anonymous mappings. I verified the behavior on the mainline kernel
> and wanted to share it for discussion.
> 
> Test Environment:
> Kernel version: 6.17.0-rc4-00083-gb9a10f876409
> Architecture: aarch64
> 
> Background:
> For anonymous mappings with PROT_WRITE | PROT_READ, using MAP_POPULATE
> is intended to pre-fault pages, so that subsequent accesses do not
> trigger page faults. However, I observed that when MAP_POPULATE is used
> on writable anonymous mappings, all pre-faulted pages are immediately
> marked as dirty, even though the user program has not written to them.
> 
> Minimal Reproduction:
> 
> #define _GNU_SOURCE
> #include <sys/mman.h>
> #include <unistd.h>
> #include <stdio.h>
> 
> int main() {
>      size_t len = 100*1024*1024; // 100MB
>      void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
>                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
>      if (p == MAP_FAILED) {
>          perror("mmap");
>          return 1;
>      }
>      pause();
>      return 0;
> }
> 
> Observed Output (/proc/<pid>/smaps):
> ffff7a600000-ffff80a00000 rw-p 00000000 00:00 0
> Size:             102400 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Rss:              102400 kB
> Pss:              102400 kB
> Pss_Dirty:        102400 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:    102400 kB
> Referenced:       102400 kB
> Anonymous:        102400 kB
> KSM:                   0 kB
> LazyFree:              0 kB
> AnonHugePages:    102400 kB
> ShmemPmdMapped:        0 kB
> FilePmdMapped:         0 kB
> Shared_Hugetlb:        0 kB
> Private_Hugetlb:       0 kB
> Swap:                  0 kB
> SwapPss:               0 kB
> Locked:                0 kB
> THPeligible:           1
> VmFlags: rd wr mr mw me ac
> 
> Code Path Analysis:
> The behavior can be traced through the following kernel code path:
> populate_vma_page_range() is invoked to pre-fault pages for the VMA.
> Inside it:
> 
> if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
>          gup_flags |= FOLL_WRITE;
> 
> This sets FOLL_WRITE for writable anonymous VMAs.
> 
> Later, in faultin_page():
> 
> if (*flags & FOLL_WRITE)
>          fault_flags |= FAULT_FLAG_WRITE;
> 
> This effectively marks the page fault as a write.
> Finally, in do_anonymous_page():
> 
> if (vma->vm_flags & VM_WRITE)
>          entry = pte_mkwrite(pte_mkdirty(entry), vma);
> 

Yes, as MAP_POPULATE ends up triggering ordinary write faults through 
GUP, this is expected.

For write faults it makes perfect sense to set the pte dirty as well:
doing so avoids the cost of setting the pte dirty immediately afterwards
(either through another fault or through the hw).
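
To observe this accounting from inside a process rather than by reading
/proc/<pid>/smaps by hand, something along these lines might help. It is
only a sketch: the helper names private_dirty_kb() and
self_private_dirty_kb() are made up for illustration, and the parser
assumes the usual smaps layout shown in the report above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan smaps-formatted text from `f` and return the Private_Dirty value
 * (in kB) of the mapping that contains `addr`, or -1 if none is found.
 * Illustrative helper, not a kernel or libc interface.
 */
static long private_dirty_kb(FILE *f, uintptr_t addr)
{
    char line[512];
    int in_target = 0;

    assert(f != NULL);
    while (fgets(line, sizeof(line), f)) {
        uintptr_t start, end;
        long kb;

        /* A VMA header line starts with "start-end" in hex. */
        if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR " ", &start, &end) == 2) {
            in_target = (addr >= start && addr < end);
            continue;
        }
        if (in_target && sscanf(line, "Private_Dirty: %ld kB", &kb) == 1)
            return kb;
    }
    return -1;
}

/* Same lookup for the calling process; -1 if smaps cannot be opened. */
static long self_private_dirty_kb(const void *p)
{
    FILE *f = fopen("/proc/self/smaps", "r");
    long kb;

    if (!f)
        return -1;
    kb = private_dirty_kb(f, (uintptr_t)p);
    fclose(f);
    return kb;
}
```

Calling self_private_dirty_kb(p) right after the mmap() in the
reproducer should report the populated range as dirty, matching the
smaps output quoted above. Note that adjacent anonymous VMAs may be
merged, so the value can cover more than the one mapping.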

MADV_POPULATE_WRITE has the same behavior, but it's even documented to
behave like that: "Populate (prefault) page tables writable, faulting in
all pages in the range just as if manually writing to each page;"
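
For reference, a minimal userspace sketch of that interface. The helper
name map_prefaulted() is invented here; the fallback define assumes the
uapi value 23 for MADV_POPULATE_WRITE (Linux 5.14 and later), and the
manual-touch fallback is my own choice for older kernels, not something
the man page prescribes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23 /* assumed uapi value, Linux >= 5.14 */
#endif

/*
 * Map `len` bytes of private anonymous memory and prefault it writable.
 * Returns the mapping, or MAP_FAILED if mmap() itself fails.
 * Sketch only: on any madvise() failure (e.g. EINVAL on kernels that
 * lack MADV_POPULATE_WRITE) fall back to writing one byte per page,
 * which is the behavior the advice is documented to emulate.
 */
static void *map_prefaulted(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return MAP_FAILED;

    if (madvise(p, len, MADV_POPULATE_WRITE) != 0) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        volatile unsigned char *c = p;

        /* Fresh anonymous memory reads as zero, so writing 0 is harmless. */
        for (size_t off = 0; off < len; off += page)
            c[off] = 0;
    }
    return p;
}
```

Either path ends with the pages mapped writable and dirty, which is the
same end state the MAP_POPULATE reproducer shows.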

> Here, the PTE is updated to writable and immediately marked dirty.
> As a result, all pre-faulted pages are marked dirty, even though the
> user program has not performed any writes.
> For large anonymous mappings, this can trigger unnecessary swap-out
> writebacks, generating avoidable I/O.

Is this a theoretical issue? Applications are supposed to make use of 
that memory after all, and at that point, the folios will be dirty.

> 
> Discussion:
> Would it be possible to optimize this behavior, for example by
> populating the pte as writable but deferring the dirty bit until the
> user actually writes to the page?

The only way I would see us changing that is by passing from GUP that 
this is not an ordinary write fault but a populate_write fault. We 
certainly don't want to affect other fault+GUP behavior where we can 
avoid the cost of setting the dirty bit immediately afterwards.

But then, it could be counter-productive for workloads that will just 
write to that memory (IOW, use it).
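
To make the trade-off concrete, here is a toy userspace model. It is not
kernel code: struct toy_pte, enum toy_fault and both functions are
hypothetical names invented for this illustration, and the
"populate write" fault kind does not exist in the kernel today.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy PTE carrying only the two bits under discussion. */
struct toy_pte {
    bool writable;
    bool dirty;
};

/* Hypothetical fault kinds; the kernel has no such enum. */
enum toy_fault {
    TOY_FAULT_WRITE,          /* ordinary write fault (current behavior) */
    TOY_FAULT_POPULATE_WRITE, /* imagined "populate only" write fault */
};

/* Install a PTE for a fresh anonymous page in this toy model. */
static struct toy_pte toy_map_anon_page(bool vma_writable, enum toy_fault kind)
{
    struct toy_pte pte = { false, false };

    if (vma_writable) {
        pte.writable = true;
        /* Ordinary write faults pre-set dirty; the imagined kind defers it. */
        pte.dirty = (kind == TOY_FAULT_WRITE);
    }
    return pte;
}

/*
 * Model the first real write by userspace. Returns true if the dirty
 * bit still had to be set at that point (by hw or by another fault),
 * i.e. the cost that pre-setting it at fault time avoids.
 */
static bool toy_first_write(struct toy_pte *pte)
{
    assert(pte->writable);
    bool extra_work = !pte->dirty;

    pte->dirty = true;
    return extra_work;
}
```

In this model a deferred-dirty populate leaves pages clean until first
use, but every page that is later written pays the extra dirty-bit
update, which is the counter-productive case mentioned above.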

-- 
Cheers

David / dhildenb

