linux-mm.kvack.org archive mirror
* Re: Possible regression in pin_user_pages_fast() behavior after commit 7ac67301e82f ("ext4: enable large folio for regular file")
       [not found] ` <20251020084736.591739-1-karol.wachowski@linux.intel.com>
@ 2025-10-22  2:46   ` Zhang Yi
  2025-10-22  8:30     ` David Hildenbrand
  0 siblings, 1 reply; 5+ messages in thread
From: Zhang Yi @ 2025-10-22  2:46 UTC (permalink / raw)
  To: Karol Wachowski; +Cc: tytso, adilger.kernel, linux-mm, linux-ext4

[add mm list to CC]

On 10/20/2025 4:47 PM, Karol Wachowski wrote:
> Hi,
> 
> I can reproduce this on Intel's x86 (Meteor Lake and Lunar Lake Intel CPUs
> but I believe it's not platform dependent). It reproduces on stable.
> I have bisected this to the mentioned commit, 7ac67301e82f02b77a5c8e7377a1f414ef108b84,
> and it reproduces every time that commit is present. At the end of this message I have
> attached a patch that provides a very simple driver: it creates a character device that
> calls pin_user_pages_fast() on a user-provided pointer. A simple test application
> creates a 2 MB file on a filesystem (you have to ensure its location is on ext4) and
> issues an IOCTL with a pointer obtained through mmap of that file with specific flags
> to reproduce the issue.
> 
> When it reproduces, the user application hangs indefinitely and has to be interrupted.
> 
> I have also noticed that if we don't write to the file prior to mmap, or if the write
> size is less than 2 MB, the issue does not reproduce.
> 
> The patch with the reproducer is attached at the end of this message. Please let me
> know if that helps, or if there's anything else I can provide to help determine
> whether it's a real issue.
> 
> -
> Karol
> 
Thank you for the reproducer. I can reproduce this issue on my x86 virtual
machine. After debugging and analyzing, I found that this is not a
filesystem issue; we can reproduce it on any filesystem that supports
large folios, such as XFS. In any case, IIUC, I think it's a real
issue.

The root cause of this issue is that calling pin_user_pages_fast() triggers
an infinite loop in __get_user_pages() when a PMD-sized (2 MB on x86),
COW-mmapped large folio is passed in to be pinned. To trigger this issue on
x86, the following conditions must be met. The specific triggering process
is as follows:

1. Call mmap with a 2MB size in MAP_PRIVATE mode for a file that has a 2MB
   folio installed in the page cache.

   addr = mmap(NULL, 2 * 1024 * 1024, PROT_READ, MAP_PRIVATE, file_fd, 0);
2. The kernel driver passes this mapped address to pin_user_pages_fast() in
   FOLL_LONGTERM mode.

   pin_user_pages_fast(addr, nr_pages, FOLL_LONGTERM, pages);
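The userspace half of steps 1 and 2 can be sketched as follows. This is a hedged reproduction sketch, not Karol's attached patch: the temp-file path is illustrative (the file must actually live on a filesystem with large-folio support such as ext4 or XFS), and the ioctl into the pinning driver is omitted since that side is kernel code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define FOLIO_SIZE (2UL * 1024 * 1024) /* one PMD-sized folio on x86 */

/*
 * Create a temp file, write the full 2 MB so the page cache can hold a
 * 2 MB large folio, then map it read-only and private (a CoW mapping).
 * Returns the mapped address or MAP_FAILED.
 */
static void *setup_cow_mapping(void)
{
	char path[] = "/tmp/gup-repro-XXXXXX"; /* illustrative location */
	int fd = mkstemp(path);

	if (fd < 0)
		return MAP_FAILED;
	unlink(path); /* temp file lives only as long as the fd/mapping */

	char *buf = malloc(FOLIO_SIZE);
	memset(buf, 0xab, FOLIO_SIZE);
	/* Writing the full 2 MB first is required to hit the bug;
	 * smaller writes (or none) do not reproduce it. */
	if (write(fd, buf, FOLIO_SIZE) != (ssize_t)FOLIO_SIZE) {
		free(buf);
		close(fd);
		return MAP_FAILED;
	}
	free(buf);

	void *addr = mmap(NULL, FOLIO_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd); /* the mapping keeps the file contents reachable */
	/* A driver would now be handed addr and call
	 * pin_user_pages_fast(addr, 512, FOLL_LONGTERM, pages),
	 * e.g. via an ioctl (driver not shown here). */
	return addr;
}
```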

  ->  pin_user_pages_fast()
  |   gup_fast_fallback()
  |    __gup_longterm_locked()
  |     __get_user_pages_locked()
  |      __get_user_pages()
  |       follow_page_mask()
  |        follow_p4d_mask()
  |         follow_pud_mask()
  |          follow_pmd_mask() //pmd_leaf(pmdval) is true since a leaf
  |                            //pmd is installed. This is normal in the
  |                            //first round, but it shouldn't happen in
  |                            //the second round.
  |           follow_huge_pmd() //gup_must_unshare() is always true
  |            return -EMLINK
  |   faultin_page()
  |    handle_mm_fault()
  |     wp_huge_pmd() //split pmd and fault back to PTE
  |    handle_pte_fault()
  |      do_pte_missing()
  |       do_fault()
  |        do_read_fault() //FAULT_FLAG_WRITE is not set
  |         finish_fault()
  |          do_set_pmd() //install leaf pmd again, I think this is wrong!!!
  |      do_wp_page() //copy private anon pages
  <-    goto retry

Because do_read_fault() incorrectly installs a large PMD, follow_pmd_mask()
always returns -EMLINK, causing an infinite loop. Under normal
circumstances, I suppose it should fall back to do_wp_page(), which installs
the anonymous page into the PTE. This is also why mappings smaller than 2 MB
do not trigger this issue. In addition, if you add FOLL_WRITE when calling
pin_user_pages_fast(), it also will not trigger this issue because do_fault()
will call do_cow_fault() to create anonymous pages.

The above is my analysis, and I tried the following fix, which can solve
the issue (I haven't done a full test yet). But I am not an expert in the
MM field; I might have missed something, and this needs to be reviewed by
MM experts.

Best regards,
Yi.

diff --git a/mm/memory.c b/mm/memory.c
index 74b45e258323..64846a030a5b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5342,6 +5342,10 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
 	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
 		return ret;

+	if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE) &&
+	    !pmd_write(*vmf->pmd))
+		return ret;
+
 	if (folio_order(folio) != HPAGE_PMD_ORDER)
 		return ret;
 	page = &folio->page;






* Re: Possible regression in pin_user_pages_fast() behavior after commit 7ac67301e82f ("ext4: enable large folio for regular file")
  2025-10-22  2:46   ` Possible regression in pin_user_pages_fast() behavior after commit 7ac67301e82f ("ext4: enable large folio for regular file") Zhang Yi
@ 2025-10-22  8:30     ` David Hildenbrand
  2025-10-23  3:04       ` Zhang Yi
  0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2025-10-22  8:30 UTC (permalink / raw)
  To: Zhang Yi, Karol Wachowski; +Cc: tytso, adilger.kernel, linux-mm, linux-ext4

On 22.10.25 04:46, Zhang Yi wrote:
> [add mm list to CC]
> 
> On 10/20/2025 4:47 PM, Karol Wachowski wrote:
>> Hi,
>>
>> I can reproduce this on Intel's x86 (Meteor Lake and Lunar Lake Intel CPUs
>> but I believe it's not platform dependent). It reproduces on stable.
>> I have bisected this to the mentioned commit, 7ac67301e82f02b77a5c8e7377a1f414ef108b84,
>> and it reproduces every time that commit is present. At the end of this message I have
>> attached a patch that provides a very simple driver: it creates a character device that
>> calls pin_user_pages_fast() on a user-provided pointer. A simple test application
>> creates a 2 MB file on a filesystem (you have to ensure its location is on ext4) and
>> issues an IOCTL with a pointer obtained through mmap of that file with specific flags
>> to reproduce the issue.
>>
>> When it reproduces, the user application hangs indefinitely and has to be interrupted.
>>
>> I have also noticed that if we don't write to the file prior to mmap, or if the write
>> size is less than 2 MB, the issue does not reproduce.
>>
>> The patch with the reproducer is attached at the end of this message. Please let me
>> know if that helps, or if there's anything else I can provide to help determine
>> whether it's a real issue.
>>
>> -
>> Karol
>>
> Thank you for the reproducer. I can reproduce this issue on my x86 virtual
> machine. After debugging and analyzing, I found that this is not a
> filesystem issue; we can reproduce it on any filesystem that supports
> large folios, such as XFS. In any case, IIUC, I think it's a real
> issue.
> 
> The root cause of this issue is that calling pin_user_pages_fast() triggers
> an infinite loop in __get_user_pages() when a PMD-sized (2 MB on x86),
> COW-mmapped large folio is passed in to be pinned. To trigger this issue on
> x86, the following conditions must be met. The specific triggering process
> is as follows:
> 
> 1. Call mmap with a 2MB size in MAP_PRIVATE mode for a file that has a 2MB
>     folio installed in the page cache.
> 
>     addr = mmap(NULL, 2 * 1024 * 1024, PROT_READ, MAP_PRIVATE, file_fd, 0);
> 2. The kernel driver passes this mapped address to pin_user_pages_fast() in
>     FOLL_LONGTERM mode.
> 
>     pin_user_pages_fast(addr, nr_pages, FOLL_LONGTERM, pages);
> 
>    ->  pin_user_pages_fast()
>    |   gup_fast_fallback()
>    |    __gup_longterm_locked()
>    |     __get_user_pages_locked()
>    |      __get_user_pages()
>    |       follow_page_mask()
>    |        follow_p4d_mask()
>    |         follow_pud_mask()
>    |          follow_pmd_mask() //pmd_leaf(pmdval) is true since a leaf
>    |                            //pmd is installed. This is normal in the
>    |                            //first round, but it shouldn't happen in
>    |                            //the second round.
>    |           follow_huge_pmd() //gup_must_unshare() is always true


FOLL_LONGTERM in a private mapping needs an anon folio, so this checks out.

>    |            return -EMLINK
>    |   faultin_page()
>    |    handle_mm_fault()
>    |     wp_huge_pmd() //split pmd and fault back to PTE

Oh, that's nasty. Indeed, we don't remap the PMD to be mapped by PTEs 
but instead zap it.

__split_huge_pmd_locked() contains that handling.

We have to do that because we did not preallocate a page table we can 
just throw in.

We could do that on this path instead: remap the PMD to be mapped by a 
PTE table. We'd have to preallocate a page table.

That would avoid the do_pte_missing() below for such faults.

That could be done later on top of this fix.

>    |     handle_pte_fault()  //
>    |      do_pte_missing()
>    |       do_fault()
>    |        do_read_fault() //FAULT_FLAG_WRITE is not set
>    |         finish_fault()
>    |          do_set_pmd() //install leaf pmd again, I think this is wrong!!!
>    |      do_wp_page() //copy private anon pages
>    <-    goto retry
> 
> Because do_read_fault() incorrectly installs a large PMD, follow_pmd_mask()
> always returns -EMLINK, causing an infinite loop. Under normal
> circumstances, I suppose it should fall back to do_wp_page(), which installs
> the anonymous page into the PTE. This is also why mappings smaller than 2 MB
> do not trigger this issue. In addition, if you add FOLL_WRITE when calling
> pin_user_pages_fast(), it also will not trigger this issue because do_fault()
> will call do_cow_fault() to create anonymous pages.
> 
> The above is my analysis, and I tried the following fix, which can solve
> the issue (I haven't done a full test yet). But I am not an expert in the
> MM field; I might have missed something, and this needs to be reviewed by
> MM experts.
> 
> Best regards,
> Yi.
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 74b45e258323..64846a030a5b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5342,6 +5342,10 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
>   	if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
>   		return ret;
> 
> +	if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE) &&
> +	    !pmd_write(*vmf->pmd))
> +		return ret;

Likely we would want to make this depend on is_cow_mapping().

/*
 * We're about to trigger CoW, so never map it through a PMD.
 */
if (is_cow_mapping(vma->vm_flags) &&
    vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE))
	return ret;
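For reference, the kernel's is_cow_mapping() in include/linux/mm.h reduces to a single flags test. A userspace sketch of the same predicate follows; the VM_* values here are illustrative stand-ins copied for the sketch, not authoritative:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for two of the kernel's vm_flags bits. */
#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL

/* Mirrors is_cow_mapping(): a mapping is copy-on-write iff it is
 * private (no VM_SHARED) and may become writable. MAP_PRIVATE
 * mappings keep VM_MAYWRITE even when mapped PROT_READ, which is why
 * the read-only mapping in the reproducer still counts as CoW. */
static bool is_cow_mapping(unsigned long vm_flags)
{
	return (vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
```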

-- 
Cheers

David / dhildenb




* Re: Possible regression in pin_user_pages_fast() behavior after commit 7ac67301e82f ("ext4: enable large folio for regular file")
  2025-10-22  8:30     ` David Hildenbrand
@ 2025-10-23  3:04       ` Zhang Yi
  2025-10-23  7:24         ` David Hildenbrand
  0 siblings, 1 reply; 5+ messages in thread
From: Zhang Yi @ 2025-10-23  3:04 UTC (permalink / raw)
  To: David Hildenbrand, Karol Wachowski
  Cc: tytso, adilger.kernel, linux-mm, linux-ext4

On 10/22/2025 4:30 PM, David Hildenbrand wrote:
> On 22.10.25 04:46, Zhang Yi wrote:
>> [add mm list to CC]
>>
>> On 10/20/2025 4:47 PM, Karol Wachowski wrote:
>>> Hi,
>>>
>>> I can reproduce this on Intel's x86 (Meteor Lake and Lunar Lake Intel CPUs
>>> but I believe it's not platform dependent). It reproduces on stable.
>>> I have bisected this to the mentioned commit, 7ac67301e82f02b77a5c8e7377a1f414ef108b84,
>>> and it reproduces every time that commit is present. At the end of this message I have
>>> attached a patch that provides a very simple driver: it creates a character device that
>>> calls pin_user_pages_fast() on a user-provided pointer. A simple test application
>>> creates a 2 MB file on a filesystem (you have to ensure its location is on ext4) and
>>> issues an IOCTL with a pointer obtained through mmap of that file with specific flags
>>> to reproduce the issue.
>>>
>>> When it reproduces, the user application hangs indefinitely and has to be interrupted.
>>>
>>> I have also noticed that if we don't write to the file prior to mmap, or if the write
>>> size is less than 2 MB, the issue does not reproduce.
>>>
>>> The patch with the reproducer is attached at the end of this message. Please let me
>>> know if that helps, or if there's anything else I can provide to help determine
>>> whether it's a real issue.
>>>
>>> -
>>> Karol
>>>
>> Thank you for the reproducer. I can reproduce this issue on my x86 virtual
>> machine. After debugging and analyzing, I found that this is not a
>> filesystem issue; we can reproduce it on any filesystem that supports
>> large folios, such as XFS. In any case, IIUC, I think it's a real
>> issue.
>>
>> The root cause of this issue is that calling pin_user_pages_fast() triggers
>> an infinite loop in __get_user_pages() when a PMD-sized (2 MB on x86),
>> COW-mmapped large folio is passed in to be pinned. To trigger this issue on
>> x86, the following conditions must be met. The specific triggering process
>> is as follows:
>>
>> 1. Call mmap with a 2MB size in MAP_PRIVATE mode for a file that has a 2MB
>>     folio installed in the page cache.
>>
>>     addr = mmap(NULL, 2 * 1024 * 1024, PROT_READ, MAP_PRIVATE, file_fd, 0);
>> 2. The kernel driver passes this mapped address to pin_user_pages_fast() in
>>     FOLL_LONGTERM mode.
>>
>>     pin_user_pages_fast(addr, nr_pages, FOLL_LONGTERM, pages);
>>
>>    ->  pin_user_pages_fast()
>>    |   gup_fast_fallback()
>>    |    __gup_longterm_locked()
>>    |     __get_user_pages_locked()
>>    |      __get_user_pages()
>>    |       follow_page_mask()
>>    |        follow_p4d_mask()
>>    |         follow_pud_mask()
>>    |          follow_pmd_mask() //pmd_leaf(pmdval) is true since a leaf
>>    |                            //pmd is installed. This is normal in the
>>    |                            //first round, but it shouldn't happen in
>>    |                            //the second round.
>>    |           follow_huge_pmd() //gup_must_unshare() is always true
> 
> 
> FOLL_LONGTERM in a private mapping needs an anon folio, so this checks out.
> 
>>    |            return -EMLINK
>>    |   faultin_page()
>>    |    handle_mm_fault()
>>    |     wp_huge_pmd() //split pmd and fault back to PTE
> 
> Oh, that's nasty. Indeed, we don't remap the PMD to be mapped by PTEs but instead zap it.
> 
> __split_huge_pmd_locked() contains that handling.
> 
> We have to do that because we did not preallocate a page table we can just throw in.
> 
> We could do that on this path instead: remap the PMD to be mapped by a PTE table. We'd have to preallocate a page table.
> 
> That would avoid the do_pte_missing() below for such faults.
> 
> That could be done later on top of this fix.

Yeah, thank you for the explanation! I have another question, just curious.
Why do we have to fall back to installing the PTE table instead of creating
a new anonymous large folio (2M) and setting a new leaf huge PMD?

> 
>>    |     handle_pte_fault()  //
>>    |      do_pte_missing()
>>    |       do_fault()
>>    |        do_read_fault() //FAULT_FLAG_WRITE is not set
>>    |         finish_fault()
>>    |          do_set_pmd() //install leaf pmd again, I think this is wrong!!!
>>    |      do_wp_page() //copy private anon pages
>>    <-    goto retry
>>
>> Because do_read_fault() incorrectly installs a large PMD, follow_pmd_mask()
>> always returns -EMLINK, causing an infinite loop. Under normal
>> circumstances, I suppose it should fall back to do_wp_page(), which installs
>> the anonymous page into the PTE. This is also why mappings smaller than 2 MB
>> do not trigger this issue. In addition, if you add FOLL_WRITE when calling
>> pin_user_pages_fast(), it also will not trigger this issue because do_fault()
>> will call do_cow_fault() to create anonymous pages.
>>
>> The above is my analysis, and I tried the following fix, which can solve
>> the issue (I haven't done a full test yet). But I am not an expert in the
>> MM field; I might have missed something, and this needs to be reviewed by
>> MM experts.
>>
>> Best regards,
>> Yi.
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 74b45e258323..64846a030a5b 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5342,6 +5342,10 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
>>       if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
>>           return ret;
>>
>> +    if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE) &&
>> +        !pmd_write(*vmf->pmd))
>> +        return ret;
> 
> Likely we would want to make this depend on is_cow_mapping().
> 
> /*
>  * We're about to trigger CoW, so never map it through a PMD.
>  */
> if (is_cow_mapping(vma->vm_flags) &&
>     vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE))
>     return ret;
> 

Sure, adding a CoW check would be better. I will send out an official patch.

Thanks,
Yi.




* Re: Possible regression in pin_user_pages_fast() behavior after commit 7ac67301e82f ("ext4: enable large folio for regular file")
  2025-10-23  3:04       ` Zhang Yi
@ 2025-10-23  7:24         ` David Hildenbrand
  2025-10-23  7:34           ` Zhang Yi
  0 siblings, 1 reply; 5+ messages in thread
From: David Hildenbrand @ 2025-10-23  7:24 UTC (permalink / raw)
  To: Zhang Yi, Karol Wachowski; +Cc: tytso, adilger.kernel, linux-mm, linux-ext4

>> __split_huge_pmd_locked() contains that handling.
>>
>> We have to do that because we did not preallocate a page table we can just throw in.
>>
>> We could do that on this path instead: remap the PMD to be mapped by a PTE table. We'd have to preallocate a page table.
>>
>> That would avoid the do_pte_missing() below for such faults.
>>
>> that could be done later on top of this fix.
> 
> Yeah, thank you for the explanation! I have another question, just curious.
> Why do we have to fall back to installing the PTE table instead of creating
> a new anonymous large folio (2M) and setting a new leaf huge PMD?

Primarily because it would waste more memory for various use cases, by a
factor of 512.
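The factor is just the ratio of the huge-page size to the base-page size: a CoW write to a single byte would commit a whole PMD-sized anonymous folio instead of one base page. As a trivial sketch (sizes are the x86 defaults discussed in this thread):

```c
#include <assert.h>

/* Worst-case overcommit if CoW always produced a PMD-sized anon
 * folio: pages allocated versus the single base page needed. */
static unsigned long cow_overcommit_factor(unsigned long pmd_size,
                                           unsigned long page_size)
{
	return pmd_size / page_size;
}
```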

> 
>>
>>>     |     handle_pte_fault()  //
>>>     |      do_pte_missing()
>>>     |       do_fault()
>>>     |        do_read_fault() //FAULT_FLAG_WRITE is not set
>>>     |         finish_fault()
>>>     |          do_set_pmd() //install leaf pmd again, I think this is wrong!!!
>>>     |      do_wp_page() //copy private anon pages
>>>     <-    goto retry
>>>
>>> Because do_read_fault() incorrectly installs a large PMD, follow_pmd_mask()
>>> always returns -EMLINK, causing an infinite loop. Under normal
>>> circumstances, I suppose it should fall back to do_wp_page(), which installs
>>> the anonymous page into the PTE. This is also why mappings smaller than 2 MB
>>> do not trigger this issue. In addition, if you add FOLL_WRITE when calling
>>> pin_user_pages_fast(), it also will not trigger this issue because do_fault()
>>> will call do_cow_fault() to create anonymous pages.
>>>
>>> The above is my analysis, and I tried the following fix, which can solve
>>> the issue (I haven't done a full test yet). But I am not an expert in the
>>> MM field; I might have missed something, and this needs to be reviewed by
>>> MM experts.
>>>
>>> Best regards,
>>> Yi.
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 74b45e258323..64846a030a5b 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -5342,6 +5342,10 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
>>>        if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
>>>            return ret;
>>>
>>> +    if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE) &&
>>> +        !pmd_write(*vmf->pmd))
>>> +        return ret;
>>
>> Likely we would want to make this depend on is_cow_mapping().
>>
>> /*
>>   * We're about to trigger CoW, so never map it through a PMD.
>>   */
>> if (is_cow_mapping(vma->vm_flags) &&
>>      vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE))
>>      return ret;
>>
> 
> Sure, adding a cow check would be better. I will send out an official patch.

Thanks!

-- 
Cheers

David / dhildenb




* Re: Possible regression in pin_user_pages_fast() behavior after commit 7ac67301e82f ("ext4: enable large folio for regular file")
  2025-10-23  7:24         ` David Hildenbrand
@ 2025-10-23  7:34           ` Zhang Yi
  0 siblings, 0 replies; 5+ messages in thread
From: Zhang Yi @ 2025-10-23  7:34 UTC (permalink / raw)
  To: David Hildenbrand, Karol Wachowski
  Cc: tytso, adilger.kernel, linux-mm, linux-ext4

On 10/23/2025 3:24 PM, David Hildenbrand wrote:
>>> __split_huge_pmd_locked() contains that handling.
>>>
>>> We have to do that because we did not preallocate a page table we can just throw in.
>>>
>>> We could do that on this path instead: remap the PMD to be mapped by a PTE table. We'd have to preallocate a page table.
>>>
>>> That would avoid the do_pte_missing() below for such faults.
>>>
>>> That could be done later on top of this fix.
>>
>> Yeah, thank you for the explanation! I have another question, just curious.
>> Why do we have to fall back to installing the PTE table instead of creating
>> a new anonymous large folio (2M) and setting a new leaf huge PMD?
> 
> Primarily because it would waste more memory for various use cases, by a factor of 512.
> 

Ha, I got it, that makes sense! :-)

Thanks,
Yi.




end of thread, other threads:[~2025-10-23  7:34 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <ebe38d8f-0b09-47b8-9503-2d8e0585672a@huaweicloud.com>
     [not found] ` <20251020084736.591739-1-karol.wachowski@linux.intel.com>
2025-10-22  2:46   ` Possible regression in pin_user_pages_fast() behavior after commit 7ac67301e82f ("ext4: enable large folio for regular file") Zhang Yi
2025-10-22  8:30     ` David Hildenbrand
2025-10-23  3:04       ` Zhang Yi
2025-10-23  7:24         ` David Hildenbrand
2025-10-23  7:34           ` Zhang Yi
