* Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
2025-09-15 13:52 [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL Kiryl Shutsemau
@ 2025-09-15 13:58 ` David Hildenbrand
2025-09-15 15:08 ` Dev Jain
` (3 subsequent siblings)
4 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2025-09-15 13:58 UTC (permalink / raw)
To: Kiryl Shutsemau, Andrew Morton, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, linux-mm, linux-kernel
On 15.09.25 15:52, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> PMD page table is installed or not.
>
> Consider following example:
>
> p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> MAP_SHARED, fd, 0);
> err = madvise(p, 2UL << 20, MADV_COLLAPSE);
>
> fd is a populated tmpfs file.
>
> The result depends on the address that the kernel returns on mmap().
> If it is located in an existing PMD table, the madvise() will succeed.
> However, if the table does not exist, it will fail with -EINVAL.
>
> This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> a page table is missing, which causes collapse_pte_mapped_thp() to fail.
>
> SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> tables as needed.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
2025-09-15 13:52 [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL Kiryl Shutsemau
2025-09-15 13:58 ` David Hildenbrand
@ 2025-09-15 15:08 ` Dev Jain
2025-09-15 15:35 ` Zi Yan
` (2 subsequent siblings)
4 siblings, 0 replies; 13+ messages in thread
From: Dev Jain @ 2025-09-15 15:08 UTC (permalink / raw)
To: Kiryl Shutsemau, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
Barry Song, linux-mm, linux-kernel, nd
On 15/09/25 7:22 pm, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> PMD page table is installed or not.
>
> Consider following example:
>
> p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> MAP_SHARED, fd, 0);
> err = madvise(p, 2UL << 20, MADV_COLLAPSE);
>
> fd is a populated tmpfs file.
>
> The result depends on the address that the kernel returns on mmap().
> If it is located in an existing PMD table, the madvise() will succeed.
> However, if the table does not exist, it will fail with -EINVAL.
>
> This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> a page table is missing, which causes collapse_pte_mapped_thp() to fail.
>
> SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> tables as needed.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---
>
> v2:
> - Modify set_huge_pmd() instead of introducing install_huge_pmd();
>
> ---
> mm/khugepaged.c | 20 +++++++++++++++++++-
> 1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b486c1d19b2d..986718599355 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
> static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> pmd_t *pmdp, struct folio *folio, struct page *page)
> {
> + struct mm_struct *mm = vma->vm_mm;
> struct vm_fault vmf = {
> .vma = vma,
> .address = addr,
> .flags = 0,
> - .pmd = pmdp,
> };
> + pgd_t *pgdp;
> + p4d_t *p4dp;
> + pud_t *pudp;
>
> mmap_assert_locked(vma->vm_mm);
I was going to reply to v1 - you could replace vma->vm_mm
with mm here.
Reviewed-by: Dev Jain <dev.jain@arm.com>
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
2025-09-15 13:52 [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL Kiryl Shutsemau
2025-09-15 13:58 ` David Hildenbrand
2025-09-15 15:08 ` Dev Jain
@ 2025-09-15 15:35 ` Zi Yan
2025-09-15 16:51 ` Zach O'Keefe
2025-09-16 9:54 ` Lorenzo Stoakes
2025-09-19 3:14 ` Baolin Wang
4 siblings, 1 reply; 13+ messages in thread
From: Zi Yan @ 2025-09-15 15:35 UTC (permalink / raw)
To: Kiryl Shutsemau, Zach O'Keefe
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
linux-mm, linux-kernel
On 15 Sep 2025, at 9:52, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> PMD page table is installed or not.
>
> Consider following example:
>
> p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> MAP_SHARED, fd, 0);
> err = madvise(p, 2UL << 20, MADV_COLLAPSE);
>
> fd is a populated tmpfs file.
>
> The result depends on the address that the kernel returns on mmap().
> If it is located in an existing PMD table, the madvise() will succeed.
> However, if the table does not exist, it will fail with -EINVAL.
>
> This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> a page table is missing, which causes collapse_pte_mapped_thp() to fail.
>
> SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> tables as needed.
Why does collapse code want to know the difference between SCAN_PMD_NULL and
SCAN_PMD_NONE? Both seems to be treated as “nothing here, install a PMD
leaf”. One difference is that madvise_collapse() will continue
on SCAN_PMD_NULL but bail out on SCAN_PMD_NONE.
I wonder if we could have SCAN_PMD_NULL_OR_NONE instead.
Zach, since you added both, can you share some insight? Thanks.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---
>
> v2:
> - Modify set_huge_pmd() instead of introducing install_huge_pmd();
>
> ---
> mm/khugepaged.c | 20 +++++++++++++++++++-
> 1 file changed, 19 insertions(+), 1 deletion(-)
>
The changes look good to me. Reviewed-by: Zi Yan <ziy@nvidia.com>
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
2025-09-15 15:35 ` Zi Yan
@ 2025-09-15 16:51 ` Zach O'Keefe
0 siblings, 0 replies; 13+ messages in thread
From: Zach O'Keefe @ 2025-09-15 16:51 UTC (permalink / raw)
To: Zi Yan
Cc: Kiryl Shutsemau, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Baolin Wang, Liam R. Howlett, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, linux-mm, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 2146 bytes --]
On Mon, Sep 15, 2025 at 8:35 AM Zi Yan <ziy@nvidia.com> wrote:
> On 15 Sep 2025, at 9:52, Kiryl Shutsemau wrote:
>
> > From: Kiryl Shutsemau <kas@kernel.org>
> >
> > MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> > PMD page table is installed or not.
> >
> > Consider following example:
> >
> > p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> > MAP_SHARED, fd, 0);
> > err = madvise(p, 2UL << 20, MADV_COLLAPSE);
> >
> > fd is a populated tmpfs file.
> >
> > The result depends on the address that the kernel returns on mmap().
> > If it is located in an existing PMD table, the madvise() will succeed.
> > However, if the table does not exist, it will fail with -EINVAL.
> >
> > This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> > a page table is missing, which causes collapse_pte_mapped_thp() to fail.
> >
> > SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> > collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> > tables as needed.
>
> Why does collapse code want to know the difference between SCAN_PMD_NULL
> and
> SCAN_PMD_NONE? Both seems to be treated as “nothing here, install a PMD
> leaf”. One difference is that madvise_collapse() will continue
> on SCAN_PMD_NULL but bail out on SCAN_PMD_NONE.
>
> I wonder if we could have SCAN_PMD_NULL_OR_NONE instead.
>
> Zach, since you added both, can you share some insight? Thanks.
>
> >
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> > ---
> >
> > v2:
> > - Modify set_huge_pmd() instead of introducing install_huge_pmd();
> >
> > ---
> > mm/khugepaged.c | 20 +++++++++++++++++++-
> > 1 file changed, 19 insertions(+), 1 deletion(-)
> >
>
> The changes look good to me. Reviewed-by: Zi Yan <ziy@nvidia.com>
>
> Best Regards,
> Yan, Zi
Thanks Zi. Hugh had also looped me into this. Travelling today but will
respond tomorrow. Generally though, this is a behavioural cleanup I’d had
been meaning to do for a while, but didn’t realize it’d be so
straightforward. Thank you,
Kiryl
>
>
[-- Attachment #2: Type: text/html, Size: 3463 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
2025-09-15 13:52 [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL Kiryl Shutsemau
` (2 preceding siblings ...)
2025-09-15 15:35 ` Zi Yan
@ 2025-09-16 9:54 ` Lorenzo Stoakes
2025-09-16 18:06 ` Zach O'Keefe
2025-09-17 10:43 ` Kiryl Shutsemau
2025-09-19 3:14 ` Baolin Wang
4 siblings, 2 replies; 13+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 9:54 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
linux-mm, linux-kernel
On Mon, Sep 15, 2025 at 02:52:53PM +0100, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> PMD page table is installed or not.
>
> Consider following example:
>
> p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> MAP_SHARED, fd, 0);
> err = madvise(p, 2UL << 20, MADV_COLLAPSE);
>
> fd is a populated tmpfs file.
>
> The result depends on the address that the kernel returns on mmap().
> If it is located in an existing PMD table, the madvise() will succeed.
> However, if the table does not exist, it will fail with -EINVAL.
>
> This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> a page table is missing, which causes collapse_pte_mapped_thp() to fail.
>
> SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> tables as needed.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
There was a v1 with tags, you've not propagated any of them? Did you feel
the change was enough to remove them?
Anyway, LGTM so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>
> v2:
> - Modify set_huge_pmd() instead of introducing install_huge_pmd();
>
> ---
> mm/khugepaged.c | 20 +++++++++++++++++++-
> 1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b486c1d19b2d..986718599355 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
> static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> pmd_t *pmdp, struct folio *folio, struct page *page)
> {
> + struct mm_struct *mm = vma->vm_mm;
> struct vm_fault vmf = {
> .vma = vma,
> .address = addr,
> .flags = 0,
> - .pmd = pmdp,
> };
> + pgd_t *pgdp;
> + p4d_t *p4dp;
> + pud_t *pudp;
>
> mmap_assert_locked(vma->vm_mm);
NIT: you have mm as a local var should use here too. Not a big deal though
obviously...
>
> + if (!pmdp) {
> + pgdp = pgd_offset(mm, addr);
> + p4dp = p4d_alloc(mm, pgdp, addr);
> + if (!p4dp)
> + return SCAN_FAIL;
> + pudp = pud_alloc(mm, p4dp, addr);
> + if (!pudp)
> + return SCAN_FAIL;
> + pmdp = pmd_alloc(mm, pudp, addr);
> + if (!pmdp)
> + return SCAN_FAIL;
> + }
> +
> + vmf.pmd = pmdp;
> if (do_set_pmd(&vmf, folio, page))
> return SCAN_FAIL;
>
> @@ -1556,6 +1573,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> switch (result) {
> case SCAN_SUCCEED:
> break;
> + case SCAN_PMD_NULL:
> case SCAN_PMD_NONE:
> /*
> * All pte entries have been removed and pmd cleared.
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
2025-09-16 9:54 ` Lorenzo Stoakes
@ 2025-09-16 18:06 ` Zach O'Keefe
2025-09-17 10:52 ` Kiryl Shutsemau
2025-09-17 10:43 ` Kiryl Shutsemau
1 sibling, 1 reply; 13+ messages in thread
From: Zach O'Keefe @ 2025-09-16 18:06 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Kiryl Shutsemau, Andrew Morton, David Hildenbrand, Zi Yan,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, linux-mm, linux-kernel
On Tue, Sep 16, 2025 at 2:54 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Sep 15, 2025 at 02:52:53PM +0100, Kiryl Shutsemau wrote:
> > From: Kiryl Shutsemau <kas@kernel.org>
> >
> > MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> > PMD page table is installed or not.
> >
> > Consider following example:
> >
> > p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> > MAP_SHARED, fd, 0);
> > err = madvise(p, 2UL << 20, MADV_COLLAPSE);
> >
> > fd is a populated tmpfs file.
> >
> > The result depends on the address that the kernel returns on mmap().
> > If it is located in an existing PMD table, the madvise() will succeed.
> > However, if the table does not exist, it will fail with -EINVAL.
> >
> > This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> > a page table is missing, which causes collapse_pte_mapped_thp() to fail.
> >
> > SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> > collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> > tables as needed.
> >
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
So, since we are trying to aim for consistency here, I think we ought
to also support the anonymous case.
I don't have a patch, but can spot at least two things we'd need to adjust:
First, we are defeated by the check in __thp_vma_allowable_orders();
/*
* THPeligible bit of smaps should show 1 for proper VMAs even
* though anon_vma is not initialized yet.
*
* Allow page fault since anon_vma may be not initialized until
* the first page fault.
*/
if (!vma->anon_vma)
return (smaps || in_pf) ? orders : 0;
I think we can probably just delete that check, but would need to confirm.
And second, madvise_collapse() doesn't route SCAN_PMD_NULL to
collapse_pte_mapped_thp(). I think we just need to audit places where
we return this code, to make sure it's faithfully describing a
situation where we can go ahead and install a new pmd. As a hasty
check, the return codes in check_pmd_state() don't look to follow
that, with !present and pmd_bad() returning SCAN_PMD_NULL. Likewise,
there are many underlying failure reasons for
pte_offset_map_ro_nolock()=>___pte_offset_map() that aren't "no PMD
entry".
WDYT?
> There was a v1 with tags, you've not propagated any of them? Did you feel
> the change was enough to remove them?
>
> Anyway, LGTM so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> > ---
> >
> > v2:
> > - Modify set_huge_pmd() instead of introducing install_huge_pmd();
> >
> > ---
> > mm/khugepaged.c | 20 +++++++++++++++++++-
> > 1 file changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index b486c1d19b2d..986718599355 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
> > static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> > pmd_t *pmdp, struct folio *folio, struct page *page)
> > {
> > + struct mm_struct *mm = vma->vm_mm;
> > struct vm_fault vmf = {
> > .vma = vma,
> > .address = addr,
> > .flags = 0,
> > - .pmd = pmdp,
> > };
> > + pgd_t *pgdp;
> > + p4d_t *p4dp;
> > + pud_t *pudp;
> >
> > mmap_assert_locked(vma->vm_mm);
>
> NIT: you have mm as a local var should use here too. Not a big deal though
> obviously...
>
> >
> > + if (!pmdp) {
> > + pgdp = pgd_offset(mm, addr);
> > + p4dp = p4d_alloc(mm, pgdp, addr);
> > + if (!p4dp)
> > + return SCAN_FAIL;
> > + pudp = pud_alloc(mm, p4dp, addr);
> > + if (!pudp)
> > + return SCAN_FAIL;
> > + pmdp = pmd_alloc(mm, pudp, addr);
> > + if (!pmdp)
> > + return SCAN_FAIL;
> > + }
> > +
> > + vmf.pmd = pmdp;
> > if (do_set_pmd(&vmf, folio, page))
> > return SCAN_FAIL;
> >
> > @@ -1556,6 +1573,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > switch (result) {
> > case SCAN_SUCCEED:
> > break;
> > + case SCAN_PMD_NULL:
> > case SCAN_PMD_NONE:
> > /*
> > * All pte entries have been removed and pmd cleared.
> > --
> > 2.50.1
> >
>
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
2025-09-16 18:06 ` Zach O'Keefe
@ 2025-09-17 10:52 ` Kiryl Shutsemau
2025-09-17 13:56 ` Zach O'Keefe
2025-09-18 12:27 ` Lorenzo Stoakes
0 siblings, 2 replies; 13+ messages in thread
From: Kiryl Shutsemau @ 2025-09-17 10:52 UTC (permalink / raw)
To: Zach O'Keefe
Cc: Lorenzo Stoakes, Andrew Morton, David Hildenbrand, Zi Yan,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, linux-mm, linux-kernel
On Tue, Sep 16, 2025 at 11:06:30AM -0700, Zach O'Keefe wrote:
> So, since we are trying to aim for consistency here, I think we ought
> to also support the anonymous case.
>
> I don't have a patch, but can spot at least two things we'd need to adjust:
>
> First, we are defeated by the check in __thp_vma_allowable_orders();
>
> /*
> * THPeligible bit of smaps should show 1 for proper VMAs even
> * though anon_vma is not initialized yet.
> *
> * Allow page fault since anon_vma may be not initialized until
> * the first page fault.
> */
> if (!vma->anon_vma)
> return (smaps || in_pf) ? orders : 0;
>
> I think we can probably just delete that check, but would need to confirm.
Do you want MADV_COLLAPSE to work on VMAs that never got a page fault?
I think it should be fine as long as we agree that MADV_COLLAPSE implies
memory population. I think it should, but I want to be sure we are on
the same page.
I also brings a question on holes in the files on MADV_COLLAPSE. We
might want to populate them too. But it means the logic between
MADV_COLLAPSE and khugepaged will diverge. It requires larger
refactoring.
> And second, madvise_collapse() doesn't route SCAN_PMD_NULL to
> collapse_pte_mapped_thp(). I think we just need to audit places where
> we return this code, to make sure it's faithfully describing a
> situation where we can go ahead and install a new pmd. As a hasty
> check, the return codes in check_pmd_state() don't look to follow
> that, with !present and pmd_bad() returning SCAN_PMD_NULL. Likewise,
> there are many underlying failure reasons for
> pte_offset_map_ro_nolock()=>___pte_offset_map() that aren't "no PMD
> entry".
Sounds like a plan :)
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
2025-09-17 10:52 ` Kiryl Shutsemau
@ 2025-09-17 13:56 ` Zach O'Keefe
2025-09-18 12:27 ` Lorenzo Stoakes
1 sibling, 0 replies; 13+ messages in thread
From: Zach O'Keefe @ 2025-09-17 13:56 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Lorenzo Stoakes, Andrew Morton, David Hildenbrand, Zi Yan,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, linux-mm, linux-kernel
On Wed, Sep 17, 2025 at 3:52 AM Kiryl Shutsemau <kirill@shutemov.name> wrote:
>
> On Tue, Sep 16, 2025 at 11:06:30AM -0700, Zach O'Keefe wrote:
> > So, since we are trying to aim for consistency here, I think we ought
> > to also support the anonymous case.
> >
> > I don't have a patch, but can spot at least two things we'd need to adjust:
> >
> > First, we are defeated by the check in __thp_vma_allowable_orders();
> >
> > /*
> > * THPeligible bit of smaps should show 1 for proper VMAs even
> > * though anon_vma is not initialized yet.
> > *
> > * Allow page fault since anon_vma may be not initialized until
> > * the first page fault.
> > */
> > if (!vma->anon_vma)
> > return (smaps || in_pf) ? orders : 0;
> >
> > I think we can probably just delete that check, but would need to confirm.
>
> Do you want MADV_COLLAPSE to work on VMAs that never got a page fault?
>
> I think it should be fine as long as we agree that MADV_COLLAPSE implies
> memory population. I think it should, but I want to be sure we are on
> the same page.
Exactly. I'm always a little embarrassed when telling people about how
to successfully use MADV_COLLAPSE, "oh, but makes sure you fault at
least one page beforehand because of ~reasons~"
> I also brings a question on holes in the files on MADV_COLLAPSE. We
> might want to populate them too. But it means the logic between
> MADV_COLLAPSE and khugepaged will diverge. It requires larger
> refactoring.
Yeah, and taking a look more thorough am perhaps reminded why I didn't
pursue this yet.
> > And second, madvise_collapse() doesn't route SCAN_PMD_NULL to
> > collapse_pte_mapped_thp(). I think we just need to audit places where
> > we return this code, to make sure it's faithfully describing a
> > situation where we can go ahead and install a new pmd. As a hasty
> > check, the return codes in check_pmd_state() don't look to follow
> > that, with !present and pmd_bad() returning SCAN_PMD_NULL. Likewise,
> > there are many underlying failure reasons for
> > pte_offset_map_ro_nolock()=>___pte_offset_map() that aren't "no PMD
> > entry".
>
> Sounds like a plan :)
:) Frankly, I don't have cycles to tackle this at the moment, and
unfair to push the work on you, given it's non-trivial, so can have my
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
For this patch ; though Andrew has already taken it
Hopefully I can look and sneak improvements into 6.18 -- but wouldn't
hold my breath.
> --
> Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
2025-09-17 10:52 ` Kiryl Shutsemau
2025-09-17 13:56 ` Zach O'Keefe
@ 2025-09-18 12:27 ` Lorenzo Stoakes
1 sibling, 0 replies; 13+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 12:27 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Zach O'Keefe, Andrew Morton, David Hildenbrand, Zi Yan,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, linux-mm, linux-kernel
On Wed, Sep 17, 2025 at 11:52:34AM +0100, Kiryl Shutsemau wrote:
> On Tue, Sep 16, 2025 at 11:06:30AM -0700, Zach O'Keefe wrote:
> > So, since we are trying to aim for consistency here, I think we ought
> > to also support the anonymous case.
> >
> > I don't have a patch, but can spot at least two things we'd need to adjust:
> >
> > First, we are defeated by the check in __thp_vma_allowable_orders();
> >
> > /*
> > * THPeligible bit of smaps should show 1 for proper VMAs even
> > * though anon_vma is not initialized yet.
> > *
> > * Allow page fault since anon_vma may be not initialized until
> > * the first page fault.
> > */
> > if (!vma->anon_vma)
> > return (smaps || in_pf) ? orders : 0;
> >
> > I think we can probably just delete that check, but would need to confirm.
>
> Do you want MADV_COLLAPSE to work on VMAs that never got a page fault?
>
> I think it should be fine as long as we agree that MADV_COLLAPSE implies
> memory population. I think it should, but I want to be sure we are on
> the same page.
I'm definitely in favour of MADV_COLLAPSE not requiring pre-faulting like
that.
As long as nothing is assuming that an anon_vma already exists, which
surely is not possible as we ignore in smaps/pf case.
(Oh how I hate how anon_vma works)
Though would this not also impact khugepaged?
thp_vma_allowable_order() called with TVA_KHUGEPAGED are in:
- khugepaged_enter_vma() - does't matter really, one VMA having anon_vma or
not should be immaterial to adding the mm to khugepaged, not hugely
likely _none_ would have anon_vma...
- khugepaged_scan_mm_slot() - well, we will require at least 1 PTE for this
to succeed anyway right? We might waste some time though. But probably
not much. hpage_collapse_scan_pmd() -> collapse_huge_page() tries an
anon VMA lock, but would have to have found PTEs to succeed...
So actually it is fine to just remove the check.
>
> I also brings a question on holes in the files on MADV_COLLAPSE. We
> might want to populate them too. But it means the logic between
> MADV_COLLAPSE and khugepaged will diverge. It requires larger
> refactoring.
I think all of THP requires a larger refactoring :) but interesting point.
>
> > And second, madvise_collapse() doesn't route SCAN_PMD_NULL to
> > collapse_pte_mapped_thp(). I think we just need to audit places where
> > we return this code, to make sure it's faithfully describing a
> > situation where we can go ahead and install a new pmd. As a hasty
> > check, the return codes in check_pmd_state() don't look to follow
> > that, with !present and pmd_bad() returning SCAN_PMD_NULL. Likewise,
> > there are many underlying failure reasons for
> > pte_offset_map_ro_nolock()=>___pte_offset_map() that aren't "no PMD
> > entry".
>
> Sounds like a plan :)
There being differences is kind of horrible. So yes indeed, one to look at
:)
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
2025-09-16 9:54 ` Lorenzo Stoakes
2025-09-16 18:06 ` Zach O'Keefe
@ 2025-09-17 10:43 ` Kiryl Shutsemau
2025-09-18 12:16 ` Lorenzo Stoakes
1 sibling, 1 reply; 13+ messages in thread
From: Kiryl Shutsemau @ 2025-09-17 10:43 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
linux-mm, linux-kernel
On Tue, Sep 16, 2025 at 10:54:12AM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 15, 2025 at 02:52:53PM +0100, Kiryl Shutsemau wrote:
> > From: Kiryl Shutsemau <kas@kernel.org>
> >
> > MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> > PMD page table is installed or not.
> >
> > Consider following example:
> >
> > p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> > MAP_SHARED, fd, 0);
> > err = madvise(p, 2UL << 20, MADV_COLLAPSE);
> >
> > fd is a populated tmpfs file.
> >
> > The result depends on the address that the kernel returns on mmap().
> > If it is located in an existing PMD table, the madvise() will succeed.
> > However, if the table does not exist, it will fail with -EINVAL.
> >
> > This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> > a page table is missing, which causes collapse_pte_mapped_thp() to fail.
> >
> > SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> > collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> > tables as needed.
> >
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
>
> There was a v1 with tags, you've not propagated any of them? Did you feel
> the change was enough to remove them?
I moved code around and was not comfortable to carry tags over.
> Anyway, LGTM so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> > ---
> >
> > v2:
> > - Modify set_huge_pmd() instead of introducing install_huge_pmd();
> >
> > ---
> > mm/khugepaged.c | 20 +++++++++++++++++++-
> > 1 file changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index b486c1d19b2d..986718599355 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
> > static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> > pmd_t *pmdp, struct folio *folio, struct page *page)
> > {
> > + struct mm_struct *mm = vma->vm_mm;
> > struct vm_fault vmf = {
> > .vma = vma,
> > .address = addr,
> > .flags = 0,
> > - .pmd = pmdp,
> > };
> > + pgd_t *pgdp;
> > + p4d_t *p4dp;
> > + pud_t *pudp;
> >
> > mmap_assert_locked(vma->vm_mm);
>
> NIT: you have mm as a local var should use here too. Not a big deal though
> obviously...
Do you want v3 for this?
> >
> > + if (!pmdp) {
> > + pgdp = pgd_offset(mm, addr);
> > + p4dp = p4d_alloc(mm, pgdp, addr);
> > + if (!p4dp)
> > + return SCAN_FAIL;
> > + pudp = pud_alloc(mm, p4dp, addr);
> > + if (!pudp)
> > + return SCAN_FAIL;
> > + pmdp = pmd_alloc(mm, pudp, addr);
> > + if (!pmdp)
> > + return SCAN_FAIL;
> > + }
> > +
> > + vmf.pmd = pmdp;
> > if (do_set_pmd(&vmf, folio, page))
> > return SCAN_FAIL;
> >
> > @@ -1556,6 +1573,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > switch (result) {
> > case SCAN_SUCCEED:
> > break;
> > + case SCAN_PMD_NULL:
> > case SCAN_PMD_NONE:
> > /*
> > * All pte entries have been removed and pmd cleared.
> > --
> > 2.50.1
> >
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
2025-09-17 10:43 ` Kiryl Shutsemau
@ 2025-09-18 12:16 ` Lorenzo Stoakes
0 siblings, 0 replies; 13+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 12:16 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
linux-mm, linux-kernel
Sorry mutt hid this reply from me...
On Wed, Sep 17, 2025 at 11:43:08AM +0100, Kiryl Shutsemau wrote:
> On Tue, Sep 16, 2025 at 10:54:12AM +0100, Lorenzo Stoakes wrote:
> > There was a v1 with tags, you've not propagated any of them? Did you feel
> > the change was enough to remove them?
>
> I moved code around and was not comfortable to carry tags over.
Ack.
>
> > Anyway, LGTM so:
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > > ---
> > >
> > > v2:
> > > - Modify set_huge_pmd() instead of introducing install_huge_pmd();
> > >
> > > ---
> > > mm/khugepaged.c | 20 +++++++++++++++++++-
> > > 1 file changed, 19 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index b486c1d19b2d..986718599355 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1472,15 +1472,32 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
> > > static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > pmd_t *pmdp, struct folio *folio, struct page *page)
> > > {
> > > + struct mm_struct *mm = vma->vm_mm;
> > > struct vm_fault vmf = {
> > > .vma = vma,
> > > .address = addr,
> > > .flags = 0,
> > > - .pmd = pmdp,
> > > };
> > > + pgd_t *pgdp;
> > > + p4d_t *p4dp;
> > > + pud_t *pudp;
> > >
> > > mmap_assert_locked(vma->vm_mm);
> >
> > NIT: you have mm as a local var should use here too. Not a big deal though
> > obviously...
>
> Do you want v3 for this?
No, this is not a big deal.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL
2025-09-15 13:52 [PATCHv2] mm/khugepaged: Do not fail collapse_pte_mapped_thp() on SCAN_PMD_NULL Kiryl Shutsemau
` (3 preceding siblings ...)
2025-09-16 9:54 ` Lorenzo Stoakes
@ 2025-09-19 3:14 ` Baolin Wang
4 siblings, 0 replies; 13+ messages in thread
From: Baolin Wang @ 2025-09-19 3:14 UTC (permalink / raw)
To: Kiryl Shutsemau, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
Cc: Zi Yan, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, linux-mm, linux-kernel
On 2025/9/15 21:52, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
>
> MADV_COLLAPSE on a file mapping behaves inconsistently depending on if
> PMD page table is installed or not.
>
> Consider following example:
>
> p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
> MAP_SHARED, fd, 0);
> err = madvise(p, 2UL << 20, MADV_COLLAPSE);
>
> fd is a populated tmpfs file.
>
> The result depends on the address that the kernel returns on mmap().
> If it is located in an existing PMD table, the madvise() will succeed.
> However, if the table does not exist, it will fail with -EINVAL.
>
> This occurs because find_pmd_or_thp_or_none() returns SCAN_PMD_NULL when
> a page table is missing, which causes collapse_pte_mapped_thp() to fail.
>
> SCAN_PMD_NULL and SCAN_PMD_NONE should be treated the same in
> collapse_pte_mapped_thp(): install the PMD leaf entry and allocate page
> tables as needed.
>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---
LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
^ permalink raw reply [flat|nested] 13+ messages in thread