* [RFC 0/2] relax anon mTHP PTE Mapping restrictions
@ 2025-01-20 1:22 Lance Yang
2025-01-20 1:22 ` [RFC 1/2] mm/mthp: add pte_range_none_or_zeropfn() helper Lance Yang
2025-01-20 1:22 ` [RFC 2/2] mm/mthp: relax anon mTHP PTE Mapping restrictions Lance Yang
0 siblings, 2 replies; 7+ messages in thread
From: Lance Yang @ 2025-01-20 1:22 UTC (permalink / raw)
To: akpm
Cc: 21cnbao, ryan.roberts, dev.jain, david, shy828301, ziy,
libang.li, baolin.wang, linux-kernel, linux-mm, Lance Yang
Hey all,
Previously, mTHP could only be mapped to PTEs where all entries were none.
With this change, PTEs within the range mapping the demand-zero page can
now be treated as `pte_none` and remapped to a new mTHP, providing more
opportunities to take advantage of mTHP.
Lance Yang (2):
mm/mthp: add pte_range_none_or_zeropfn() helper
mm/mthp: relax anon mTHP PTE Mapping restrictions
mm/memory.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 49 insertions(+), 5 deletions(-)
--
2.45.2
* [RFC 1/2] mm/mthp: add pte_range_none_or_zeropfn() helper
2025-01-20 1:22 [RFC 0/2] relax anon mTHP PTE Mapping restrictions Lance Yang
@ 2025-01-20 1:22 ` Lance Yang
2025-01-20 1:22 ` [RFC 2/2] mm/mthp: relax anon mTHP PTE Mapping restrictions Lance Yang
1 sibling, 0 replies; 7+ messages in thread
From: Lance Yang @ 2025-01-20 1:22 UTC (permalink / raw)
To: akpm
Cc: 21cnbao, ryan.roberts, dev.jain, david, shy828301, ziy,
libang.li, baolin.wang, linux-kernel, linux-mm, Lance Yang,
Mingzhe Yang
In preparation for relaxing the mTHP PTE mapping restriction on the demand-zero
page, introduce the pte_range_none_or_zeropfn() helper. The intent is to provide
more opportunities to take advantage of mTHP.
Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com>
Signed-off-by: Lance Yang <ioworker0@gmail.com>
---
mm/memory.c | 43 ++++++++++++++++++++++++++++++++++++++++---
1 file changed, 40 insertions(+), 3 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 2a20e3810534..4e148309b3e0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4724,18 +4724,55 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
return ret;
}
-static bool pte_range_none(pte_t *pte, int nr_pages)
+/* Flags for __pte_range_check(). */
+typedef int __bitwise prc_t;
+
+/* Check if PTEs are not present. */
+#define PRC_CHECK_NONE ((__force prc_t)BIT(0))
+
+/* Check if PTEs are mapped to the zero page. */
+#define PRC_CHECK_ZEROPFN ((__force prc_t)BIT(1))
+
+static bool __pte_range_check(pte_t *pte, int nr_pages, prc_t flags,
+ bool *any_zeropfn)
{
int i;
+ pte_t ptent;
+
+ if (any_zeropfn)
+ *any_zeropfn = false;
for (i = 0; i < nr_pages; i++) {
- if (!pte_none(ptep_get_lockless(pte + i)))
- return false;
+ ptent = ptep_get_lockless(pte + i);
+
+ if ((flags & PRC_CHECK_NONE) && pte_none(ptent))
+ continue;
+
+ if ((flags & PRC_CHECK_ZEROPFN) &&
+ is_zero_pfn(pte_pfn(ptent))) {
+ if (any_zeropfn)
+ *any_zeropfn = true;
+ continue;
+ }
+
+ return false;
}
return true;
}
+static inline bool pte_range_none(pte_t *pte, int nr_pages)
+{
+ return __pte_range_check(pte, nr_pages, PRC_CHECK_NONE, NULL);
+}
+
+static inline bool pte_range_none_or_zeropfn(pte_t *pte, int nr_pages,
+ bool *any_zeropfn)
+{
+ return __pte_range_check(
+ pte, nr_pages, PRC_CHECK_NONE | PRC_CHECK_ZEROPFN, any_zeropfn);
+}
+
static struct folio *alloc_anon_folio(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
--
2.45.2
* [RFC 2/2] mm/mthp: relax anon mTHP PTE Mapping restrictions
2025-01-20 1:22 [RFC 0/2] relax anon mTHP PTE Mapping restrictions Lance Yang
2025-01-20 1:22 ` [RFC 1/2] mm/mthp: add pte_range_none_or_zeropfn() helper Lance Yang
@ 2025-01-20 1:22 ` Lance Yang
2025-01-20 2:15 ` Barry Song
1 sibling, 1 reply; 7+ messages in thread
From: Lance Yang @ 2025-01-20 1:22 UTC (permalink / raw)
To: akpm
Cc: 21cnbao, ryan.roberts, dev.jain, david, shy828301, ziy,
libang.li, baolin.wang, linux-kernel, linux-mm, Lance Yang,
Mingzhe Yang
Previously, mTHP could only be mapped to PTEs where all entries were none.
With this change, PTEs within the range mapping the demand-zero page can
now be treated as `pte_none` and remapped to a new mTHP, providing more
opportunities to take advantage of mTHP.
Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com>
Signed-off-by: Lance Yang <ioworker0@gmail.com>
---
mm/memory.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 4e148309b3e0..99ec75c6f0fe 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4815,7 +4815,8 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
order = highest_order(orders);
while (orders) {
addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
- if (pte_range_none(pte + pte_index(addr), 1 << order))
+ if (pte_range_none_or_zeropfn(pte + pte_index(addr), 1 << order,
+ NULL))
break;
order = next_order(&orders, order);
}
@@ -4867,6 +4868,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
unsigned long addr = vmf->address;
+ bool any_zeropfn = false;
struct folio *folio;
vm_fault_t ret = 0;
int nr_pages = 1;
@@ -4939,7 +4941,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (nr_pages == 1 && vmf_pte_changed(vmf)) {
update_mmu_tlb(vma, addr, vmf->pte);
goto release;
- } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
+ } else if (nr_pages > 1 && !pte_range_none_or_zeropfn(
+ vmf->pte, nr_pages, &any_zeropfn)) {
update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages);
goto release;
}
@@ -4965,6 +4968,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
entry = pte_mkuffd_wp(entry);
set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
+ /* At least one PTE was mapped to the zero page */
+ if (nr_pages > 1 && any_zeropfn)
+ flush_tlb_range(vma, addr, addr + (nr_pages * PAGE_SIZE));
+
/* No need to invalidate - it was non-present before */
update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
unlock:
--
2.45.2
* Re: [RFC 2/2] mm/mthp: relax anon mTHP PTE Mapping restrictions
2025-01-20 1:22 ` [RFC 2/2] mm/mthp: relax anon mTHP PTE Mapping restrictions Lance Yang
@ 2025-01-20 2:15 ` Barry Song
2025-01-20 13:20 ` Lance Yang
0 siblings, 1 reply; 7+ messages in thread
From: Barry Song @ 2025-01-20 2:15 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, ryan.roberts, dev.jain, david, shy828301, ziy, libang.li,
baolin.wang, linux-kernel, linux-mm, Mingzhe Yang
On Mon, Jan 20, 2025 at 2:23 PM Lance Yang <ioworker0@gmail.com> wrote:
>
> Previously, mTHP could only be mapped to PTEs where all entries were none.
> With this change, PTEs within the range mapping the demand-zero page can
> now be treated as `pte_none` and remapped to a new mTHP, providing more
> opportunities to take advantage of mTHP.
>
> Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com>
> Signed-off-by: Lance Yang <ioworker0@gmail.com>
> ---
> mm/memory.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 4e148309b3e0..99ec75c6f0fe 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4815,7 +4815,8 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> order = highest_order(orders);
> while (orders) {
> addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> - if (pte_range_none(pte + pte_index(addr), 1 << order))
> + if (pte_range_none_or_zeropfn(pte + pte_index(addr), 1 << order,
> + NULL))
> break;
> order = next_order(&orders, order);
> }
> @@ -4867,6 +4868,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> unsigned long addr = vmf->address;
> + bool any_zeropfn = false;
> struct folio *folio;
> vm_fault_t ret = 0;
> int nr_pages = 1;
> @@ -4939,7 +4941,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> if (nr_pages == 1 && vmf_pte_changed(vmf)) {
> update_mmu_tlb(vma, addr, vmf->pte);
> goto release;
> - } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
> + } else if (nr_pages > 1 && !pte_range_none_or_zeropfn(
> + vmf->pte, nr_pages, &any_zeropfn)) {
> update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages);
> goto release;
> }
> @@ -4965,6 +4968,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> entry = pte_mkuffd_wp(entry);
> set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
>
> + /* At least one PTE was mapped to the zero page */
> + if (nr_pages > 1 && any_zeropfn)
> + flush_tlb_range(vma, addr, addr + (nr_pages * PAGE_SIZE));
Do we also need mmu_notifier?
mmu_notifier_range_init(...)
mmu_notifier_invalidate_range_start(&range);
By the way, this is getting much more complex, but are we seeing any real
benefits? I’ve tested this before, and it seems that zeropfn-mapped
anonymous folios are quite rare.
> +
> /* No need to invalidate - it was non-present before */
> update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
> unlock:
> --
> 2.45.2
>
Thanks
Barry
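For reference, the mmu_notifier pattern Barry points at would bracket the PTE
update roughly as in the sketch below. This is only an illustration, not part
of the series; the event type and the exact placement are assumptions, and note
that the start/end calls generally cannot be issued while the page-table lock
is held:

	struct mmu_notifier_range range;

	/* Cover the PTE range whose zeropage mappings are about to be replaced. */
	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
				addr, addr + (nr_pages * PAGE_SIZE));
	mmu_notifier_invalidate_range_start(&range);

	/* ... take the PTL, set_ptes() and flush_tlb_range() for the replaced
	 * zeropage PTEs, then drop the PTL ... */

	mmu_notifier_invalidate_range_end(&range);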
* Re: [RFC 2/2] mm/mthp: relax anon mTHP PTE Mapping restrictions
2025-01-20 2:15 ` Barry Song
@ 2025-01-20 13:20 ` Lance Yang
2025-01-20 13:38 ` David Hildenbrand
0 siblings, 1 reply; 7+ messages in thread
From: Lance Yang @ 2025-01-20 13:20 UTC (permalink / raw)
To: Barry Song
Cc: akpm, ryan.roberts, dev.jain, david, shy828301, ziy, libang.li,
baolin.wang, linux-kernel, linux-mm, Mingzhe Yang
On Mon, Jan 20, 2025 at 10:15 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Jan 20, 2025 at 2:23 PM Lance Yang <ioworker0@gmail.com> wrote:
> >
> > Previously, mTHP could only be mapped to PTEs where all entries were none.
> > With this change, PTEs within the range mapping the demand-zero page can
> > now be treated as `pte_none` and remapped to a new mTHP, providing more
> > opportunities to take advantage of mTHP.
> >
> > Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com>
> > Signed-off-by: Lance Yang <ioworker0@gmail.com>
> > ---
> > mm/memory.c | 11 +++++++++--
> > 1 file changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 4e148309b3e0..99ec75c6f0fe 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4815,7 +4815,8 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> > order = highest_order(orders);
> > while (orders) {
> > addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > - if (pte_range_none(pte + pte_index(addr), 1 << order))
> > + if (pte_range_none_or_zeropfn(pte + pte_index(addr), 1 << order,
> > + NULL))
> > break;
> > order = next_order(&orders, order);
> > }
> > @@ -4867,6 +4868,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> > {
> > struct vm_area_struct *vma = vmf->vma;
> > unsigned long addr = vmf->address;
> > + bool any_zeropfn = false;
> > struct folio *folio;
> > vm_fault_t ret = 0;
> > int nr_pages = 1;
> > @@ -4939,7 +4941,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> > if (nr_pages == 1 && vmf_pte_changed(vmf)) {
> > update_mmu_tlb(vma, addr, vmf->pte);
> > goto release;
> > - } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
> > + } else if (nr_pages > 1 && !pte_range_none_or_zeropfn(
> > + vmf->pte, nr_pages, &any_zeropfn)) {
> > update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages);
> > goto release;
> > }
> > @@ -4965,6 +4968,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> > entry = pte_mkuffd_wp(entry);
> > set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
> >
> > + /* At least one PTE was mapped to the zero page */
> > + if (nr_pages > 1 && any_zeropfn)
> > + flush_tlb_range(vma, addr, addr + (nr_pages * PAGE_SIZE));
>
Thanks for taking time to review!
> Do we also need mmu_notifier?
>
> mmu_notifier_range_init(...)
> mmu_notifier_invalidate_range_start(&range);
>
> By the way, this is getting much more complex, but are we seeing any real
> benefits? I’ve tested this before, and it seems that zeropfn-mapped
> anonymous folios are quite rare.
Hmm... Agreed that it's getting more complex. I don't have any data
showing real benefits yet, so let's put this patch aside for now until I do.
Thanks,
Lance
>
> > +
> > /* No need to invalidate - it was non-present before */
> > update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
> > unlock:
> > --
> > 2.45.2
> >
>
> Thanks
> Barry
* Re: [RFC 2/2] mm/mthp: relax anon mTHP PTE Mapping restrictions
2025-01-20 13:20 ` Lance Yang
@ 2025-01-20 13:38 ` David Hildenbrand
2025-01-21 12:57 ` Lance Yang
0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2025-01-20 13:38 UTC (permalink / raw)
To: Lance Yang, Barry Song
Cc: akpm, ryan.roberts, dev.jain, shy828301, ziy, libang.li,
baolin.wang, linux-kernel, linux-mm, Mingzhe Yang
On 20.01.25 14:20, Lance Yang wrote:
> On Mon, Jan 20, 2025 at 10:15 AM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Mon, Jan 20, 2025 at 2:23 PM Lance Yang <ioworker0@gmail.com> wrote:
>>>
>>> Previously, mTHP could only be mapped to PTEs where all entries were none.
>>> With this change, PTEs within the range mapping the demand-zero page can
>>> now be treated as `pte_none` and remapped to a new mTHP, providing more
>>> opportunities to take advantage of mTHP.
>>>
>>> Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com>
>>> Signed-off-by: Lance Yang <ioworker0@gmail.com>
>>> ---
>>> mm/memory.c | 11 +++++++++--
>>> 1 file changed, 9 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 4e148309b3e0..99ec75c6f0fe 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4815,7 +4815,8 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>> order = highest_order(orders);
>>> while (orders) {
>>> addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>> - if (pte_range_none(pte + pte_index(addr), 1 << order))
>>> + if (pte_range_none_or_zeropfn(pte + pte_index(addr), 1 << order,
>>> + NULL))
>>> break;
>>> order = next_order(&orders, order);
>>> }
>>> @@ -4867,6 +4868,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>> {
>>> struct vm_area_struct *vma = vmf->vma;
>>> unsigned long addr = vmf->address;
>>> + bool any_zeropfn = false;
>>> struct folio *folio;
>>> vm_fault_t ret = 0;
>>> int nr_pages = 1;
>>> @@ -4939,7 +4941,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>> if (nr_pages == 1 && vmf_pte_changed(vmf)) {
>>> update_mmu_tlb(vma, addr, vmf->pte);
>>> goto release;
>>> - } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
>>> + } else if (nr_pages > 1 && !pte_range_none_or_zeropfn(
>>> + vmf->pte, nr_pages, &any_zeropfn)) {
>>> update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages);
>>> goto release;
>>> }
>>> @@ -4965,6 +4968,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>> entry = pte_mkuffd_wp(entry);
>>> set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
>>>
>>> + /* At least one PTE was mapped to the zero page */
>>> + if (nr_pages > 1 && any_zeropfn)
>>> + flush_tlb_range(vma, addr, addr + (nr_pages * PAGE_SIZE));
>>
>
> Thanks for taking time to review!
>
>> Do we also need mmu_notifier?
>>
>> mmu_notifier_range_init(...)
>> mmu_notifier_invalidate_range_start(&range);
>>
>> By the way, this is getting much more complex, but are we seeing any real
>> benefits? I’ve tested this before, and it seems that zeropfn-mapped
>> anonymous folios are quite rare.
>
> Hmm... Agreed that it's getting more complex. I don't have any data
> showing real benefits yet, so let's put this patch aside for now until I do.
The motivation for the huge zeropage case was that we would install the
huge zeropage on read fault, to then PTE-map it on write fault and only
COW a single PTE. It would require khugepaged to collapse in order to
get a THP again.
So read-then-immediately-write would not give us a PMD-mapped THP with
the huge shared zeropage enabled, but would give us a PMD-mapped THP
with the huge shared zeropage disabled, which is rather inconsistent.
In case of mTHP, the behavior is at least always the same, and the
performance gain is likely not as huge as in the PMD case. Maybe we can
just wait until khugepaged will fix it up, where a lot of that
complexity will reside either way.
So I agree that it might be useful, but also that there must be a good
reason to have that complexity here.
It might also be a valid question whether we should even always use the
shared zeropage on read faults with mTHP enabled, because maybe it could
be more efficient to avoid the TLB flush when ripping out the shared
zeropage etc. and just give it an mTHP, if our shrinker works as expected.
Using the huge zeropage is rather "obviously good" because it also
reduces TLB pressure + page table walk and can be zapped fairly easily
and efficiently to install a fresh PMD-mapped THP.
--
Cheers,
David / dhildenb
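To make the read-then-immediately-write pattern described above concrete, here
is a minimal userspace sketch (illustrative only; whether the read fault is
backed by the shared zeropage or the huge zeropage depends on the THP
configuration):

	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 2UL << 20;	/* one PMD-sized region */
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;

		/* Read fault: can be satisfied by the (huge) zeropage. */
		char c = *(volatile char *)p;

		/*
		 * Immediate write fault: per the description above, with the
		 * huge zeropage installed this ends up COWing only a single
		 * page rather than producing a PMD-mapped THP.
		 */
		p[0] = c + 1;

		memset(p, 1, len);	/* touch the rest of the region */
		munmap(p, len);
		return 0;
	}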
* Re: [RFC 2/2] mm/mthp: relax anon mTHP PTE Mapping restrictions
2025-01-20 13:38 ` David Hildenbrand
@ 2025-01-21 12:57 ` Lance Yang
0 siblings, 0 replies; 7+ messages in thread
From: Lance Yang @ 2025-01-21 12:57 UTC (permalink / raw)
To: David Hildenbrand
Cc: Barry Song, akpm, ryan.roberts, dev.jain, shy828301, ziy,
libang.li, baolin.wang, linux-kernel, linux-mm, Mingzhe Yang
On Mon, Jan 20, 2025 at 9:38 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 20.01.25 14:20, Lance Yang wrote:
> > On Mon, Jan 20, 2025 at 10:15 AM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Mon, Jan 20, 2025 at 2:23 PM Lance Yang <ioworker0@gmail.com> wrote:
> >>>
> >>> Previously, mTHP could only be mapped to PTEs where all entries were none.
> >>> With this change, PTEs within the range mapping the demand-zero page can
> >>> now be treated as `pte_none` and remapped to a new mTHP, providing more
> >>> opportunities to take advantage of mTHP.
> >>>
> >>> Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com>
> >>> Signed-off-by: Lance Yang <ioworker0@gmail.com>
> >>> ---
> >>> mm/memory.c | 11 +++++++++--
> >>> 1 file changed, 9 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index 4e148309b3e0..99ec75c6f0fe 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -4815,7 +4815,8 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >>> order = highest_order(orders);
> >>> while (orders) {
> >>> addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >>> - if (pte_range_none(pte + pte_index(addr), 1 << order))
> >>> + if (pte_range_none_or_zeropfn(pte + pte_index(addr), 1 << order,
> >>> + NULL))
> >>> break;
> >>> order = next_order(&orders, order);
> >>> }
> >>> @@ -4867,6 +4868,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>> {
> >>> struct vm_area_struct *vma = vmf->vma;
> >>> unsigned long addr = vmf->address;
> >>> + bool any_zeropfn = false;
> >>> struct folio *folio;
> >>> vm_fault_t ret = 0;
> >>> int nr_pages = 1;
> >>> @@ -4939,7 +4941,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>> if (nr_pages == 1 && vmf_pte_changed(vmf)) {
> >>> update_mmu_tlb(vma, addr, vmf->pte);
> >>> goto release;
> >>> - } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
> >>> + } else if (nr_pages > 1 && !pte_range_none_or_zeropfn(
> >>> + vmf->pte, nr_pages, &any_zeropfn)) {
> >>> update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages);
> >>> goto release;
> >>> }
> >>> @@ -4965,6 +4968,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>> entry = pte_mkuffd_wp(entry);
> >>> set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
> >>>
> >>> + /* At least one PTE was mapped to the zero page */
> >>> + if (nr_pages > 1 && any_zeropfn)
> >>> + flush_tlb_range(vma, addr, addr + (nr_pages * PAGE_SIZE));
> >>
> >
> > Thanks for taking time to review!
> >
> >> Do we also need mmu_notifier?
> >>
> >> mmu_notifier_range_init(...)
> >> mmu_notifier_invalidate_range_start(&range);
> >>
> >> By the way, this is getting much more complex, but are we seeing any real
> >> benefits? I’ve tested this before, and it seems that zeropfn-mapped
> >> anonymous folios are quite rare.
> >
> > Hmm... Agreed that it's getting more complex. I don't have any data
> > showing real benefits yet, so let's put this patch aside for now until I do.
>
Thanks so much for the lesson! It's really helpful to me ;)
> The motivation for the huge zeropage case was that we would install the
> huge zeropage on read fault, to then PTE-map it on write fault and only
> COW a single PTE. It would require khugepaged to collapse in order to
> get a THP again.
One thing I’d like to do is adjust the patch[1] and resend that out.
Case 1: After a read-then-immediately-write operation, we end up with a
PTE-mapped huge zeropage and one PTE-mapped page.
Case 2: Most PTEs within the PMD are set to none.
In either case, we can pre-zero the new THP before khugepaged
collapses the PMD, which will reduce the process-visible downtime ;)
Do you see this as a useful improvement?
[1] https://lore.kernel.org/linux-mm/20240308074921.45752-1-ioworker0@gmail.com/
>
> So read-then-immediately-write would not give us a PMD-mapped THP with
> the huge shared zeropage enabled, but would give us a PMD-mapped THP
> with the huge shared zeropage disabled, which is rather inconsistent.
>
> In case of mTHP, the behavior is at least always the same, and the
> performance gain is likely not as huge as in the PMD case. Maybe we can
> just wait until khugepaged will fix it up, where a lot of that
> complexity will reside either way.
Yeah, it’s better to let khugepaged handle that and keep the complexity on
its side.
>
> So I agree that it might be useful, but also that there must be a good
> reason to have that complexity here.
>
>
> It might also be a valid question if we should even always use the
> shared zeropage on read faults with mTHP enabled, because maybe it could
> be more efficient to avoid the TLB flush when ripping out the shared
> zeropage etc and just give it an mTHP, if our shrinker works as expected
It sounds good to me ;p
Perhaps we could add a switch to just give it an mTHP on read faults with
mTHP enabled, instead of always using the shared zeropage. Would this
switch be per-process or per-cgroup? If it makes sense, I’d like to
get involved.
>
> Using the huge zeropage is rather "obviously good" because it also
> reduces TLB pressure + page table walk and can be zapped fairly easily
> and efficiently to install a fresh PMD-mapped THP.
Yep, as for the huge zeropage, it’s indeed better than the shared zeropage for
reducing TLB pressure and page table walks. TBH, the shared zeropage is less
effective for mTHP than the huge zeropage is for THP.
Thanks,
Lance
>
> --
> Cheers,
>
> David / dhildenb
>
end of thread
Thread overview: 7+ messages
2025-01-20 1:22 [RFC 0/2] relax anon mTHP PTE Mapping restrictions Lance Yang
2025-01-20 1:22 ` [RFC 1/2] mm/mthp: add pte_range_none_or_zeropfn() helper Lance Yang
2025-01-20 1:22 ` [RFC 2/2] mm/mthp: relax anon mTHP PTE Mapping restrictions Lance Yang
2025-01-20 2:15 ` Barry Song
2025-01-20 13:20 ` Lance Yang
2025-01-20 13:38 ` David Hildenbrand
2025-01-21 12:57 ` Lance Yang