* [RFC] mm: MAP_POPULATE on writable anonymous mappings marks ptes dirty, is that necessary?
@ 2025-09-22 6:19 wuyifeng (C)
2025-09-22 8:45 ` Pedro Falcato
2025-09-22 9:00 ` David Hildenbrand
0 siblings, 2 replies; 9+ messages in thread
From: wuyifeng (C) @ 2025-09-22 6:19 UTC (permalink / raw)
To: david, akpm; +Cc: linux-mm
Hi all,

While reviewing the memory management code, I noticed a potential
inefficiency related to MAP_POPULATE used on writable anonymous
mappings. I verified the behavior on the mainline kernel and wanted
to share it for discussion.
Test Environment:
Kernel version: 6.17.0-rc4-00083-gb9a10f876409
Architecture: aarch64
Background:
For anonymous mappings with PROT_WRITE | PROT_READ, using MAP_POPULATE
is intended to pre-fault pages, so that subsequent accesses do not
trigger page faults. However, I observed that when MAP_POPULATE is used
on writable anonymous mappings, all pre-faulted pages are immediately
marked as dirty, even though the user program has not written to them.
Minimal Reproduction:
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>
int main() {
size_t len = 100*1024*1024; // 100MB
void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
if (p == MAP_FAILED) {
perror("mmap");
return 1;
}
pause();
return 0;
}
Observed Output (/proc/<pid>/smaps):
ffff7a600000-ffff80a00000 rw-p 00000000 00:00 0
Size: 102400 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 102400 kB
Pss: 102400 kB
Pss_Dirty: 102400 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 102400 kB
Referenced: 102400 kB
Anonymous: 102400 kB
KSM: 0 kB
LazyFree: 0 kB
AnonHugePages: 102400 kB
ShmemPmdMapped: 0 kB
FilePmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
THPeligible: 1
VmFlags: rd wr mr mw me ac
Code Path Analysis:
The behavior can be traced through the following kernel code path:
populate_vma_page_range() is invoked to pre-fault pages for the VMA.
Inside it:
if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
gup_flags |= FOLL_WRITE;
This sets FOLL_WRITE for writable anonymous VMAs.
Later, in faultin_page():
if (*flags & FOLL_WRITE)
fault_flags |= FAULT_FLAG_WRITE;
This effectively marks the page fault as a write.
Finally, in do_anonymous_page():
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry), vma);
Here, the PTE is updated to writable and immediately marked dirty.
As a result, all pre-faulted pages are marked dirty, even though the
user program has not performed any writes.
For large anonymous mappings, this can trigger unnecessary swap-out
writebacks, generating avoidable I/O.
Discussion:
Would it be possible to optimize this behavior, for example by
populating the pte as writable but deferring the dirty bit until the
user actually writes to the page?
Thanks,
wuyifeng
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] mm: MAP_POPULATE on writable anonymous mappings marks ptes dirty, is that necessary?
2025-09-22 6:19 [RFC] mm: MAP_POPULATE on writable anonymous mappings marks ptes dirty, is that necessary? wuyifeng (C)
@ 2025-09-22 8:45 ` Pedro Falcato
2025-09-22 9:07 ` David Hildenbrand
2025-09-22 9:00 ` David Hildenbrand
1 sibling, 1 reply; 9+ messages in thread
From: Pedro Falcato @ 2025-09-22 8:45 UTC (permalink / raw)
To: wuyifeng (C); +Cc: david, akpm, linux-mm
On Mon, Sep 22, 2025 at 02:19:51PM +0800, wuyifeng (C) wrote:
> [reproducer, smaps output and code-path analysis snipped]
>
> Discussion:
> Would it be possible to optimize this behavior, for example by
> populating the pte as writable but deferring the dirty bit until the
> user actually writes to the page?
How would we know if the user wrote to the page, since we marked it writeable?
--
Pedro
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] mm: MAP_POPULATE on writable anonymous mappings marks ptes dirty, is that necessary?
2025-09-22 6:19 [RFC] mm: MAP_POPULATE on writable anonymous mappings marks ptes dirty, is that necessary? wuyifeng (C)
2025-09-22 8:45 ` Pedro Falcato
@ 2025-09-22 9:00 ` David Hildenbrand
1 sibling, 0 replies; 9+ messages in thread
From: David Hildenbrand @ 2025-09-22 9:00 UTC (permalink / raw)
To: wuyifeng (C), akpm; +Cc: linux-mm
On 22.09.25 08:19, wuyifeng (C) wrote:
> [report intro, reproducer and smaps output snipped]
>
> Code Path Analysis:
> The behavior can be traced through the following kernel code path:
> populate_vma_page_range() is invoked to pre-fault pages for the VMA.
> Inside it:
>
> if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
> gup_flags |= FOLL_WRITE;
>
> This sets FOLL_WRITE for writable anonymous VMAs.
>
> Later, in faultin_page():
>
> if (*flags & FOLL_WRITE)
> fault_flags |= FAULT_FLAG_WRITE;
>
> This effectively marks the page fault as a write.
> Finally, in do_anonymous_page():
>
> if (vma->vm_flags & VM_WRITE)
> entry = pte_mkwrite(pte_mkdirty(entry), vma);
>
Yes, as MAP_POPULATE ends up triggering ordinary write faults through
GUP, this is expected.
For write faults it makes perfect sense to set the pte dirty as well:
it avoids the cost of setting the pte dirty immediately afterwards
(either through another fault or through the hw).
MADV_POPULATE_WRITE has the same behavior, but it's even documented to
behave like that: "Populate (prefault) page tables writable, faulting in
all pages in the range just as if manually writing to each page;"
> Here, the PTE is updated to writable and immediately marked dirty.
> As a result, all pre-faulted pages are marked dirty, even though the
> user program has not performed any writes.
> For large anonymous mappings, this can trigger unnecessary swap-out
> writebacks, generating avoidable I/O.
Is this a theoretical issue? Applications are supposed to make use of
that memory after all, and at that point, the folios will be dirty.
>
> Discussion:
> Would it be possible to optimize this behavior, for example by
> populating the pte as writable but deferring the dirty bit until the
> user actually writes to the page?
The only way I would see us changing that is by passing from GUP that
this is not an ordinary write fault but a populate_write fault. We
certainly don't want to affect other fault+GUP behavior where we can
avoid the cost of setting the dirty bit immediately afterwards.
But then, it could be counter-productive for workloads that will just
write to that memory (IOW, use it).
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] mm: MAP_POPULATE on writable anonymous mappings marks ptes dirty, is that necessary?
2025-09-22 8:45 ` Pedro Falcato
@ 2025-09-22 9:07 ` David Hildenbrand
2025-09-22 9:37 ` Pedro Falcato
0 siblings, 1 reply; 9+ messages in thread
From: David Hildenbrand @ 2025-09-22 9:07 UTC (permalink / raw)
To: Pedro Falcato, wuyifeng (C); +Cc: akpm, linux-mm
On 22.09.25 10:45, Pedro Falcato wrote:
> On Mon, Sep 22, 2025 at 02:19:51PM +0800, wuyifeng (C) wrote:
>> [full report snipped]
>
> How would we know if the user wrote to the page, since we marked it writeable?
On access, either HW sets the dirty bit if it supports it, or we get
another fault and set the dirty bit manually.
On architectures where the HW doesn't support setting the dirty bit,
pte_mkwrite() checks whether the pte is dirty: if it's not, the HW
write bit is left unset, and the next pte_mkdirty() will set the
actual HW write bit instead.
See pte_mkwrite() handling in arch/sparc/include/asm/pgtable_64.h or
arch/s390/include/asm/pgtable.h
Of course, setting the dirty bit either way on later access comes with a
price.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] mm: MAP_POPULATE on writable anonymous mappings marks ptes dirty, is that necessary?
2025-09-22 9:07 ` David Hildenbrand
@ 2025-09-22 9:37 ` Pedro Falcato
2025-09-22 9:49 ` wuyifeng (C)
2025-09-22 12:46 ` David Hildenbrand
0 siblings, 2 replies; 9+ messages in thread
From: Pedro Falcato @ 2025-09-22 9:37 UTC (permalink / raw)
To: David Hildenbrand; +Cc: wuyifeng (C), akpm, linux-mm
On Mon, Sep 22, 2025 at 11:07:43AM +0200, David Hildenbrand wrote:
> On 22.09.25 10:45, Pedro Falcato wrote:
> > On Mon, Sep 22, 2025 at 02:19:51PM +0800, wuyifeng (C) wrote:
> > > [full report snipped]
> >
> > How would we know if the user wrote to the page, since we marked it writeable?
>
> On access, either HW sets the dirty bit if it supports it, or we get another
> fault and set the dirty bit manually.
>
> What happens on architectures where the HW doesn't support setting the dirty
> bit is that performing a pte_mkwrite() checks whether the pte is dirty. If
> it's not dirty the HW write bit will not be set and instead the next
> pte_mkdirty() will set the actual HW write bit.
>
> See pte_mkwrite() handling in arch/sparc/include/asm/pgtable_64.h or
> arch/s390/include/asm/pgtable.h
>
> Of course, setting the dirty bit either way on later access comes with a
> price.
Ah, yes, the details were a little fuzzy in my head, thanks.
I'm trying to swap in (ha!) the details again. We still proactively mark anon
folios dirty anyway for $reasons, so optimizing it might be difficult? Not sure
if it is _worth_ optimizing for anyway.
--
Pedro
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] mm: MAP_POPULATE on writable anonymous mappings marks ptes dirty, is that necessary?
2025-09-22 9:37 ` Pedro Falcato
@ 2025-09-22 9:49 ` wuyifeng (C)
2025-09-22 12:46 ` David Hildenbrand
1 sibling, 0 replies; 9+ messages in thread
From: wuyifeng (C) @ 2025-09-22 9:49 UTC (permalink / raw)
To: Pedro Falcato, David Hildenbrand; +Cc: akpm, linux-mm
I only noticed this behavior while reading the code and haven’t actually
encountered any performance issues caused by swap. I hadn’t initially considered
that marking pages dirty again would incur extra overhead (even with hardware support).
From this perspective, it’s clear that the current design provides a net benefit in the
vast majority of scenarios.
Thank you very much for your explanation!
在 2025/9/22 17:37, Pedro Falcato 写道:
> On Mon, Sep 22, 2025 at 11:07:43AM +0200, David Hildenbrand wrote:
>> On 22.09.25 10:45, Pedro Falcato wrote:
>>> On Mon, Sep 22, 2025 at 02:19:51PM +0800, wuyifeng (C) wrote:
>>>> [full report snipped]
>>>
>>> How would we know if the user wrote to the page, since we marked it writeable?
>>
>> On access, either HW sets the dirty bit if it supports it, or we get another
>> fault and set the dirty bit manually.
>>
>> What happens on architectures where the HW doesn't support setting the dirty
>> bit is that performing a pte_mkwrite() checks whether the pte is dirty. If
>> it's not dirty the HW write bit will not be set and instead the next
>> pte_mkdirty() will set the actual HW write bit.
>>
>> See pte_mkwrite() handling in arch/sparc/include/asm/pgtable_64.h or
>> arch/s390/include/asm/pgtable.h
>>
>> Of course, setting the dirty bit either way on later access comes with a
>> price.
>
> Ah, yes, the details were a little fuzzy in my head, thanks.
> I'm trying to swap in (ha!) the details again. We still proactively mark anon
> folios dirty anyway for $reasons, so optimizing it might be difficult? Not sure
> if it is _worth_ optimizing for anyway.
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] mm: MAP_POPULATE on writable anonymous mappings marks ptes dirty, is that necessary?
2025-09-22 9:37 ` Pedro Falcato
2025-09-22 9:49 ` wuyifeng (C)
@ 2025-09-22 12:46 ` David Hildenbrand
2025-09-22 14:13 ` Pedro Falcato
1 sibling, 1 reply; 9+ messages in thread
From: David Hildenbrand @ 2025-09-22 12:46 UTC (permalink / raw)
To: Pedro Falcato; +Cc: wuyifeng (C), akpm, linux-mm
On 22.09.25 11:37, Pedro Falcato wrote:
> On Mon, Sep 22, 2025 at 11:07:43AM +0200, David Hildenbrand wrote:
>> On 22.09.25 10:45, Pedro Falcato wrote:
>>> On Mon, Sep 22, 2025 at 02:19:51PM +0800, wuyifeng (C) wrote:
>>>> [full report snipped]
>>>
>>> How would we know if the user wrote to the page, since we marked it writeable?
>>
>> On access, either HW sets the dirty bit if it supports it, or we get another
>> fault and set the dirty bit manually.
>>
>> What happens on architectures where the HW doesn't support setting the dirty
>> bit is that performing a pte_mkwrite() checks whether the pte is dirty. If
>> it's not dirty the HW write bit will not be set and instead the next
>> pte_mkdirty() will set the actual HW write bit.
>>
>> See pte_mkwrite() handling in arch/sparc/include/asm/pgtable_64.h or
>> arch/s390/include/asm/pgtable.h
>>
>> Of course, setting the dirty bit either way on later access comes with a
>> price.
>
> Ah, yes, the details were a little fuzzy in my head, thanks.
> I'm trying to swap in (ha!) the details again. We still proactively mark anon
> folios dirty anyway for $reasons, so optimizing it might be difficult? Not sure
> if it is _worth_ optimizing for anyway.
I remembered the same thing (proactively mark anon folios dirty) but I
didn't easily spot it in the code. Did you spot it?
I only found the folio_mark_dirty() calls when unmapping anon pages and
we stumble over a dirty pte.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] mm: MAP_POPULATE on writable anonymous mappings marks ptes dirty, is that necessary?
2025-09-22 12:46 ` David Hildenbrand
@ 2025-09-22 14:13 ` Pedro Falcato
2025-09-22 14:44 ` David Hildenbrand
0 siblings, 1 reply; 9+ messages in thread
From: Pedro Falcato @ 2025-09-22 14:13 UTC (permalink / raw)
To: David Hildenbrand; +Cc: wuyifeng (C), akpm, linux-mm
On Mon, Sep 22, 2025 at 02:46:44PM +0200, David Hildenbrand wrote:
> On 22.09.25 11:37, Pedro Falcato wrote:
snip
> >
> > > What happens on architectures where the HW doesn't support setting the dirty
> > > bit is that performing a pte_mkwrite() checks whether the pte is dirty. If
> > > it's not dirty the HW write bit will not be set and instead the next
> > > pte_mkdirty() will set the actual HW write bit.
> > >
> > > See pte_mkwrite() handling in arch/sparc/include/asm/pgtable_64.h or
> > > arch/s390/include/asm/pgtable.h
> > >
> > > Of course, setting the dirty bit either way on later access comes with a
> > > price.
> >
> > Ah, yes, the details were a little fuzzy in my head, thanks.
> > I'm trying to swap in (ha!) the details again. We still proactively mark anon
> > folios dirty anyway for $reasons, so optimizing it might be difficult? Not sure
> > if it is _worth_ optimizing for anyway.
>
> I remembered the same thing (proactively mark anon folios dirty) but I
> didn't easily spot it in the code. Did you spot it?
>
> I only found the folio_mark_dirty() calls when unmapping anon pages and we
> stumble over a dirty pte.
>
In shrink_folio_list():
if (folio_test_anon(folio) && folio_test_swapbacked(folio)) {
if (!folio_test_swapcache(folio)) {
/* ... */
/*
* Normally the folio will be dirtied in unmap because its
* pte should be dirty. A special case is MADV_FREE page. The
* page's pte could have dirty bit cleared but the folio's
* SwapBacked flag is still set because clearing the dirty bit
* and SwapBacked flag has no lock protected. For such folio,
* unmap will not set dirty bit for it, so folio reclaim will
* not write the folio out. This can cause data corruption when
* the folio is swapped in later. Always setting the dirty flag
* for the folio solves the problem.
*/
folio_mark_dirty(folio);
}
}
So we assume the folio is dirty due to races with MADV_FREE. Seems like a
somewhat heavy-handed solution, but I guess it works nicely for 99.9% of cases.
--
Pedro
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC] mm: MAP_POPULATE on writable anonymous mappings marks ptes dirty, is that necessary?
2025-09-22 14:13 ` Pedro Falcato
@ 2025-09-22 14:44 ` David Hildenbrand
0 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand @ 2025-09-22 14:44 UTC (permalink / raw)
To: Pedro Falcato; +Cc: wuyifeng (C), akpm, linux-mm
On 22.09.25 16:13, Pedro Falcato wrote:
> On Mon, Sep 22, 2025 at 02:46:44PM +0200, David Hildenbrand wrote:
>> On 22.09.25 11:37, Pedro Falcato wrote:
> snip
>>>
>>>> What happens on architectures where the HW doesn't support setting the dirty
>>>> bit is that performing a pte_mkwrite() checks whether the pte is dirty. If
>>>> it's not dirty the HW write bit will not be set and instead the next
>>>> pte_mkdirty() will set the actual HW write bit.
>>>>
>>>> See pte_mkwrite() handling in arch/sparc/include/asm/pgtable_64.h or
>>>> arch/s390/include/asm/pgtable.h
>>>>
>>>> Of course, setting the dirty bit either way on later access comes with a
>>>> price.
>>>
>>> Ah, yes, the details were a little fuzzy in my head, thanks.
>>> I'm trying to swap in (ha!) the details again. We still proactively mark anon
>>> folios dirty anyway for $reasons, so optimizing it might be difficult? Not sure
>>> if it is _worth_ optimizing for anyway.
>>
>> I remembered the same thing (proactively mark anon folios dirty) but I
>> didn't easily spot it in the code. Did you spot it?
>>
>> I only found the folio_mark_dirty() calls when unmapping anon pages and we
>> stumble over a dirty pte.
>>
>
> In shrink_folio_list():
> if (folio_test_anon(folio) && folio_test_swapbacked(folio)) {
> if (!folio_test_swapcache(folio)) {
> /* ... */
> /*
> * Normally the folio will be dirtied in unmap because its
> * pte should be dirty. A special case is MADV_FREE page. The
> * page's pte could have dirty bit cleared but the folio's
> * SwapBacked flag is still set because clearing the dirty bit
> * and SwapBacked flag has no lock protected. For such folio,
> * unmap will not set dirty bit for it, so folio reclaim will
> * not write the folio out. This can cause data corruption when
> * the folio is swapped in later. Always setting the dirty flag
> * for the folio solves the problem.
> */
> folio_mark_dirty(folio);
> }
> }
>
> So we assume the folio is dirty due to races with MADV_FREE. Seems like a
> somewhat heavy handed solution, but I guess it works nicely for 99.9% of cases.
Thanks, yeah that makes sense. Whenever we add a folio to the swapcache
we mark it dirty, so it's guaranteed that whatever content it had (even
if just zeroes) will be written out.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2025-09-22 14:44 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-22 6:19 [RFC] mm: MAP_POPULATE on writable anonymous mappings marks ptes dirty, is that necessary? wuyifeng (C)
2025-09-22 8:45 ` Pedro Falcato
2025-09-22 9:07 ` David Hildenbrand
2025-09-22 9:37 ` Pedro Falcato
2025-09-22 9:49 ` wuyifeng (C)
2025-09-22 12:46 ` David Hildenbrand
2025-09-22 14:13 ` Pedro Falcato
2025-09-22 14:44 ` David Hildenbrand
2025-09-22 9:00 ` David Hildenbrand
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox