linux-mm.kvack.org archive mirror
* [PATCH] mm: make page_mapped_in_vma() hugetlb walk aware
@ 2025-01-21  4:18 Jane Chu
  2025-01-21  5:00 ` Matthew Wilcox
  0 siblings, 1 reply; 4+ messages in thread
From: Jane Chu @ 2025-01-21  4:18 UTC (permalink / raw)
  To: akpm, willy, linmiaohe, kirill.shutemov, hughd, linux-mm, linux-kernel

When a process consumes an uncorrectable error (UE) in a page, the memory
failure handler attempts to collect information for a potential SIGBUS.
If the page is an anonymous page, page_mapped_in_vma(page, vma) is
invoked in order to
  1. retrieve the vaddr from the process' address space,
  2. verify that the vaddr is indeed mapped to the poisoned page,
where 'page' is the precise small page with the UE.
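
Roughly, the caller's loop body then uses the returned vaddr like this
(paraphrased sketch of the collect_procs_anon() logic, not the exact code):

	addr = page_mapped_in_vma(page, vma);
	if (addr == -EFAULT)
		continue;	/* vaddr not mapped to the poisoned page */
	/* otherwise record <task, addr> for a possible SIGBUS later */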

It's been observed that when injecting poison into a non-head subpage
of an anonymous hugetlb page, no SIGBUS shows up, while injecting into
the head page produces a SIGBUS. The cause is that, although hugetlb_walk()
returns a valid pmd entry (on x86), check_pte() detects a mismatch
between the head page referenced by the pmd and the input subpage. Thus
the vaddr is considered not mapped to the subpage and the process is not
collected for SIGBUS purposes.  This is the call stack:
      collect_procs_anon
        page_mapped_in_vma
          page_vma_mapped_walk
            hugetlb_walk
              huge_pte_lock
                check_pte
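
For reference, the pfn check at the end of check_pte() boils down to
roughly this (paraphrased; the exact code may differ by kernel version):

	pfn = pte_pfn(ptent);	/* for hugetlb this is the head page pfn */
	return (pfn - pvmw->pfn) < pvmw->nr_pages;

With pvmw->pfn set to the poisoned subpage and nr_pages == 1, the head
page pfn from the hugetlb pte fails this check, so the walk reports the
vaddr as not mapped.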

It seems that the most obvious place to fix the issue is to make
page_mapped_in_vma() hugetlb walk aware. The precise subpage in the
input is useful for providing a PAGE_SIZE-granularity vaddr.

Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 mm/page_vma_mapped.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 81839a9e74f1..bc036060cc68 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -342,15 +342,26 @@ unsigned long page_mapped_in_vma(const struct page *page,
 {
 	const struct folio *folio = page_folio(page);
 	struct page_vma_mapped_walk pvmw = {
-		.pfn = page_to_pfn(page),
 		.nr_pages = 1,
 		.vma = vma,
 		.flags = PVMW_SYNC,
 	};
 
+	/* fine granularity address is always preferred */
 	pvmw.address = vma_address(vma, page_pgoff(folio, page), 1);
 	if (pvmw.address == -EFAULT)
 		goto out;
+
+	/*
+	 * Hugetlb doesn't support partial page-mapping; hugetlb_walk()
+	 * simply assumes a hugetlb pte, hence feed the head page pfn to
+	 * the walk and the pte check.
+	 */
+	if (folio_test_hugetlb(folio))
+		pvmw.pfn = folio_pfn(folio);
+	else
+		pvmw.pfn = page_to_pfn(page);
+
 	if (!page_vma_mapped_walk(&pvmw))
 		return -EFAULT;
 	page_vma_mapped_walk_done(&pvmw);
-- 
2.39.3




* Re: [PATCH] mm: make page_mapped_in_vma() hugetlb walk aware
  2025-01-21  4:18 [PATCH] mm: make page_mapped_in_vma() hugetlb walk aware Jane Chu
@ 2025-01-21  5:00 ` Matthew Wilcox
  2025-01-21  5:20   ` jane.chu
  0 siblings, 1 reply; 4+ messages in thread
From: Matthew Wilcox @ 2025-01-21  5:00 UTC (permalink / raw)
  To: Jane Chu; +Cc: akpm, linmiaohe, kirill.shutemov, hughd, linux-mm, linux-kernel

On Mon, Jan 20, 2025 at 09:18:49PM -0700, Jane Chu wrote:
> When a process consumes a UE in a page, the memory failure handler
> attempts to collect information for a potential SIGBUS.
> If the page is an anonymous page, page_mapped_in_vma(page, vma) is
> invoked in order to
>   1. retrieve the vaddr from the process' address space,
>   2. verify that the vaddr is indeed mapped to the poisoned page,
> where 'page' is the precise small page with UE.
> 
> It's been observed that when injecting poison to a non-head subpage
> of an anonymous hugetlb page, no SIGBUS shows up; while injecting to
> the head page produces a SIGBUS. The cause is that, though hugetlb_walk()
> returns a valid pmd entry (on x86), check_pte() detects mismatch
> between the head page per the pmd and the input subpage. Thus the vaddr
> is considered not mapped to the subpage and the process is not collected
> for SIGBUS purpose.  This is the calling stack
>       collect_procs_anon
>         page_mapped_in_vma
>           page_vma_mapped_walk
>             hugetlb_walk
>               huge_pte_lock
>                 check_pte
> 
> It seems that the most obvious place to fix the issue is by making
> page_mapped_in_vma() hugetlb walk aware. The precise subpage in the
> input is useful in providing PAGE_SIZE granularity vaddr.

I don't like this solution because it adds yet another special case for
hugetlb.  If we don't split a PMD-mapped THP, we'd have the same
problem, right?

check_pte() would succeed if we set pvmw->pfn to folio_pfn() and
pvmw->nr_pages to folio_nr_pages(), right?  I just don't know what else
might be affected by that.
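
For concreteness, that alternative would amount to something like this
in page_mapped_in_vma() (untested sketch, everything else unchanged):

	struct page_vma_mapped_walk pvmw = {
		.pfn = folio_pfn(folio),
		.nr_pages = folio_nr_pages(folio),
		.vma = vma,
		.flags = PVMW_SYNC,
	};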

I like one of these two options:

@@ -206,6 +206,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
                pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
                if (!pvmw->pte)
                        return false;
+               pvmw->pte += pvmw->address & (size - PAGE_SIZE);

                pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
                if (!check_pte(pvmw))

(that needs a bit of tidying up; you can't just do that, but I think
you get the basic idea -- correct the pte to point to the precise page
instead of the hugetlb pfn)


The option I really prefer is much more work but matches our preferred
direction of getting rid of hugetlb specific code.  Something like this:

@@ -192,27 +192,6 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
        if (pvmw->pmd && !pvmw->pte)
                return not_found(pvmw);

-       if (unlikely(is_vm_hugetlb_page(vma))) {
-               struct hstate *hstate = hstate_vma(vma);
-               unsigned long size = huge_page_size(hstate);
-               /* The only possible mapping was handled on last iteration */
[...]
-               pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
-               if (!check_pte(pvmw))
-                       return not_found(pvmw);
-               return true;
-       }
-
        end = vma_address_end(pvmw);
        if (pvmw->pte)
                goto next_pte;
@@ -229,7 +208,19 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
                        continue;
                }
                pud = pud_offset(p4d, pvmw->address);
-               if (!pud_present(*pud)) {
+               pude = *pud;
+               if (pud_trans_huge(pude) ||
+                   (pud_present(pude) && pud_devmap(pude))) {
+                       pvmw->ptl = pud_lock(mm, pvmw->pud);
+                       ...
+                       if (likely(pud_trans_huge(pude) || pud_devmap(pude))) {
+                               if (pvmw->flags & PVMW_MIGRATION)
+                                       return not_found(pvmw);
+                               if (!check_pud(pud_pfn(pude), pvmw))
+                                       return not_found(pvmw);
+                               return true;
+                       }
+               } else if (!pud_present(pude)) {
                        step_forward(pvmw, PUD_SIZE);
                        continue;
                }

ie get rid of all the hugetlb-specific code, and add support for the
PUD level to the common code.  You'd also need to write check_pud().
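
A check_pud() modeled on the existing check_pmd() might look roughly
like this (untested sketch, assuming HPAGE_PUD_NR is usable in this
configuration):

	static bool check_pud(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
	{
		/* does [pfn, pfn + HPAGE_PUD_NR) overlap the pvmw target range? */
		if ((pfn + HPAGE_PUD_NR - 1) < pvmw->pfn)
			return false;
		if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
			return false;
		return true;
	}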

I'll understand if you don't want to do all the extra work.  And
thanks for tracking down this bug.




* Re: [PATCH] mm: make page_mapped_in_vma() hugetlb walk aware
  2025-01-21  5:00 ` Matthew Wilcox
@ 2025-01-21  5:20   ` jane.chu
  2025-02-24 20:45     ` jane.chu
  0 siblings, 1 reply; 4+ messages in thread
From: jane.chu @ 2025-01-21  5:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linmiaohe, kirill.shutemov, hughd, linux-mm, linux-kernel

Thanks for the quick comment!

On 1/20/2025 9:00 PM, Matthew Wilcox wrote:
> On Mon, Jan 20, 2025 at 09:18:49PM -0700, Jane Chu wrote:
>> When a process consumes a UE in a page, the memory failure handler
>> attempts to collect information for a potential SIGBUS.
>> If the page is an anonymous page, page_mapped_in_vma(page, vma) is
>> invoked in order to
>>    1. retrieve the vaddr from the process' address space,
>>    2. verify that the vaddr is indeed mapped to the poisoned page,
>> where 'page' is the precise small page with UE.
>>
>> It's been observed that when injecting poison to a non-head subpage
>> of an anonymous hugetlb page, no SIGBUS shows up; while injecting to
>> the head page produces a SIGBUS. The cause is that, though hugetlb_walk()
>> returns a valid pmd entry (on x86), check_pte() detects mismatch
>> between the head page per the pmd and the input subpage. Thus the vaddr
>> is considered not mapped to the subpage and the process is not collected
>> for SIGBUS purpose.  This is the calling stack
>>        collect_procs_anon
>>          page_mapped_in_vma
>>            page_vma_mapped_walk
>>              hugetlb_walk
>>                huge_pte_lock
>>                  check_pte
>>
>> It seems that the most obvious place to fix the issue is by making
>> page_mapped_in_vma() hugetlb walk aware. The precise subpage in the
>> input is useful in providing PAGE_SIZE granularity vaddr.
> I don't like this solution because it adds yet another special case for
> hugetlb.  If we don't split a PMD-mapped THP, we'd have the same
> problem, right?
>
> check_pte() would succeed if we set pvmw->pfn to folio_pfn() and
> pvmw->nr_pages to folio_nr_pages(), right?  I just don't know what else
> might be affected by that.
>
> I like one of these two options:
>
> @@ -206,6 +206,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>                  pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
>                  if (!pvmw->pte)
>                          return false;
> +               pvmw->pte += pvmw->address & (size - PAGE_SIZE);
>
>                  pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
>                  if (!check_pte(pvmw))
>
> (that needs a bit of tidying up; you can't just do that, but I think
> you get the basic idea -- correct the pte to point to the precise page
> instead of the hugetlb pfn)
That'll work; let me think about how to tidy it up.  More below.
>
>
> The option I really prefer is much more work but matches our preferred
> direction of getting rid of hugetlb specific code.  Something like this:
>
> @@ -192,27 +192,6 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>          if (pvmw->pmd && !pvmw->pte)
>                  return not_found(pvmw);
>
> -       if (unlikely(is_vm_hugetlb_page(vma))) {
> -               struct hstate *hstate = hstate_vma(vma);
> -               unsigned long size = huge_page_size(hstate);
> -               /* The only possible mapping was handled on last iteration */
> [...]
> -               pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
> -               if (!check_pte(pvmw))
> -                       return not_found(pvmw);
> -               return true;
> -       }
> -
>          end = vma_address_end(pvmw);
>          if (pvmw->pte)
>                  goto next_pte;
> @@ -229,7 +208,19 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>                          continue;
>                  }
>                  pud = pud_offset(p4d, pvmw->address);
> -               if (!pud_present(*pud)) {
> +               pude = *pud;
> +               if (pud_trans_huge(pude) ||
> +                   (pud_present(pude) && pud_devmap(pude))) {
> +                       pvmw->ptl = pud_lock(mm, pvmw->pud);
> +                       ...
> +                       if (likely(pud_trans_huge(pude) || pud_devmap(pude))) {
> +                               if (pvmw->flags & PVMW_MIGRATION)
> +                                       return not_found(pvmw);
> +                               if (!check_pud(pud_pfn(pude), pvmw))
> +                                       return not_found(pvmw);
> +                               return true;
> +                       }
> +               } else if (!pud_present(pude)) {
>                          step_forward(pvmw, PUD_SIZE);
>                          continue;
>                  }
>
> ie get rid of all the hugetlb-specific code, and add support for the
> PUD level to the common code.  You'd also need to write check_pud().
Good idea!  I'd like to give this more generic approach a try as well.
>
> I'll understand if you don't want to do all the extra work.  And
> thanks for tracking down this bug.

Thanks a lot!

-jane




* Re: [PATCH] mm: make page_mapped_in_vma() hugetlb walk aware
  2025-01-21  5:20   ` jane.chu
@ 2025-02-24 20:45     ` jane.chu
  0 siblings, 0 replies; 4+ messages in thread
From: jane.chu @ 2025-02-24 20:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linmiaohe, kirill.shutemov, hughd, linux-mm, linux-kernel, peterx

An update below.  Also adding Peter.

On 1/20/2025 9:20 PM, jane.chu@oracle.com wrote:
> Thanks for the quick comment!
>
> On 1/20/2025 9:00 PM, Matthew Wilcox wrote:
>> On Mon, Jan 20, 2025 at 09:18:49PM -0700, Jane Chu wrote:
>>> When a process consumes a UE in a page, the memory failure handler
>>> attempts to collect information for a potential SIGBUS.
>>> If the page is an anonymous page, page_mapped_in_vma(page, vma) is
>>> invoked in order to
>>>    1. retrieve the vaddr from the process' address space,
>>>    2. verify that the vaddr is indeed mapped to the poisoned page,
>>> where 'page' is the precise small page with UE.
>>>
>>> It's been observed that when injecting poison to a non-head subpage
>>> of an anonymous hugetlb page, no SIGBUS shows up; while injecting to
>>> the head page produces a SIGBUS. The cause is that, though
>>> hugetlb_walk()
>>> returns a valid pmd entry (on x86), check_pte() detects mismatch
>>> between the head page per the pmd and the input subpage. Thus the vaddr
>>> is considered not mapped to the subpage and the process is not 
>>> collected
>>> for SIGBUS purpose.  This is the calling stack
>>>        collect_procs_anon
>>>          page_mapped_in_vma
>>>            page_vma_mapped_walk
>>>              hugetlb_walk
>>>                huge_pte_lock
>>>                  check_pte
>>>
>>> It seems that the most obvious place to fix the issue is by making
>>> page_mapped_in_vma() hugetlb walk aware. The precise subpage in the
>>> input is useful in providing PAGE_SIZE granularity vaddr.
>> I don't like this solution because it adds yet another special case for
>> hugetlb.  If we don't split a PMD-mapped THP, we'd have the same
>> problem, right?
>>
>> check_pte() would succeed if we set pvmw->pfn to folio_pfn() and
>> pvmw->nr_pages to folio_nr_pages(), right?  I just don't know what else
>> might be affected by that.
>>
>> I like one of these two options:
>>
>> @@ -206,6 +206,7 @@ bool page_vma_mapped_walk(struct 
>> page_vma_mapped_walk *pvmw)
>>                  pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
>>                  if (!pvmw->pte)
>>                          return false;
>> +               pvmw->pte += pvmw->address & (size - PAGE_SIZE);
>>
>>                  pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
>>                  if (!check_pte(pvmw))
>>
>> (that needs a bit of tidying up; you can't just do that, but I think
>> you get the basic idea -- correct the pte to point to the precise page
>> instead of the hugetlb pfn)
> That'll work, let me think about how to tidy it up.  More below.

It appears that check_pte() is supposed to be able to check range
overlap for a leaf pte, among other things.

But the current implementation doesn't do that.  Fixing this takes
care of the subject issue.
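
A minimal sketch of the idea (a hypothetical helper for illustration,
not the actual patch): treat the leaf pte as covering a range of pfns
and test that range for overlap with the pvmw target, e.g.

	/* does [pte_pfn, pte_pfn + pte_nr) overlap [pvmw->pfn, pvmw->pfn + pvmw->nr_pages)? */
	static bool pte_range_overlaps(unsigned long pte_pfn, unsigned long pte_nr,
				       const struct page_vma_mapped_walk *pvmw)
	{
		if (pte_pfn + pte_nr <= pvmw->pfn)
			return false;
		if (pte_pfn >= pvmw->pfn + pvmw->nr_pages)
			return false;
		return true;
	}

so that a hugetlb leaf pte pointing at the head page still matches a
poisoned subpage pfn.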

More below.

>>
>>
>> The option I really prefer is much more work but matches our preferred
>> direction of getting rid of hugetlb specific code.  Something like this:
>>
>> @@ -192,27 +192,6 @@ bool page_vma_mapped_walk(struct 
>> page_vma_mapped_walk *pvmw)
>>          if (pvmw->pmd && !pvmw->pte)
>>                  return not_found(pvmw);
>>
>> -       if (unlikely(is_vm_hugetlb_page(vma))) {
>> -               struct hstate *hstate = hstate_vma(vma);
>> -               unsigned long size = huge_page_size(hstate);
>> -               /* The only possible mapping was handled on last 
>> iteration */
>> [...]
>> -               pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
>> -               if (!check_pte(pvmw))
>> -                       return not_found(pvmw);
>> -               return true;
>> -       }
>> -
>>          end = vma_address_end(pvmw);
>>          if (pvmw->pte)
>>                  goto next_pte;
>> @@ -229,7 +208,19 @@ bool page_vma_mapped_walk(struct
>> page_vma_mapped_walk *pvmw)
>>                          continue;
>>                  }
>>                  pud = pud_offset(p4d, pvmw->address);
>> -               if (!pud_present(*pud)) {
>> +               pude = *pud;
>> +               if (pud_trans_huge(pude) ||
>> +                   (pud_present(pude) && pud_devmap(pude))) {
>> +                       pvmw->ptl = pud_lock(mm, pvmw->pud);
>> +                       ...
>> +                       if (likely(pud_trans_huge(pude) || 
>> pud_devmap(pude))) {
>> +                               if (pvmw->flags & PVMW_MIGRATION)
>> +                                       return not_found(pvmw);
>> +                               if (!check_pud(pud_pfn(pude), pvmw))
>> +                                       return not_found(pvmw);
>> +                               return true;
>> +                       }
>> +               } else if (!pud_present(pude)) {
>>                          step_forward(pvmw, PUD_SIZE);
>>                          continue;
>>                  }
>>
>> ie get rid of all the hugetlb-specific code, and add support for the
>> PUD level to the common code.  You'd also need to write check_pud().
> Good idea!  I'd like to give this more generic approach a try as well.

I ran a simple experiment on page_vma_mapped_walk() checking a PMD-level
hugetlb page, but with the hugetlb-specific logic commented out.

The experiment includes both test/move_pages and the memory poison
tests; the latter hit a NULL pointer dereference in try_to_unmap_one()
that I am looking into.

Besides, there are broader things to consider.  For example, pxd_trans_huge()
is defined under CONFIG_TRANSPARENT_HUGEPAGE, which doesn't apply to
hugetlb.  Maybe try pxd_leaf()?  But they're not the same thing; does
the difference matter?

Hugetlb doesn't use pxd_lock(); it has its own locking which, after
peeling off the wrappers, appears to be similar but not quite the same
thing.

There are also hugetlb's unique PMD sharing and its implied locking
requirement.

Maybe there is more as I dig further; it's becoming more like a project
of its own, which I will update in a separate thread in the future.

Thanks,

-jane

>>
>> I'll understand if you don't want to do all the extra work.  And
>> thanks for tracking down this bug.
>
> Thanks a lot!
>
> -jane
>
>


