* [Question] CoW on VM_PFNMAP vma during write fault
@ 2024-02-27 12:28 Wupeng Ma
2024-02-27 13:00 ` David Hildenbrand
0 siblings, 1 reply; 10+ messages in thread
From: Wupeng Ma @ 2024-02-27 12:28 UTC (permalink / raw)
To: akpm, khlebnikov, jaredeh, david, linmiaohe, hpa
Cc: dave.hansen, luto, tglx, peterz, linux-kernel, x86, mawupeng1,
mingo, rdunlap, bhelgaas, linux-mm
We found that a warning is produced during our testing; the detailed log is
shown at the end.
The core problem behind this warning is that the first pfn of this pfnmap vma is
cleared during memory-failure handling. Digging into the source, we found that
the problem can be triggered as follows:
// mmap with MAP_PRIVATE on a device fd whose mmap hook calls remap_pfn_range
mmap(MAP_PRIVATE, fd)
__mmap_region
remap_pfn_range
// the vma is marked VM_PFNMAP and the pte prot is read-only
// memset this memory to trigger a write fault
handle_mm_fault
__handle_mm_fault
handle_pte_fault
// write fault and !pte_write(entry)
do_wp_page
wp_page_copy // this will alloc a new page with valid page struct
// for this pfnmap vma
// inject hwpoison into the first page of this vma
madvise_inject_error
memory_failure
hwpoison_user_mappings
try_to_unmap_one
// mark this pte as invalid (hwpoison)
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
address, range.end);
// during vma unmap, the first pte of this pfnmap vma is already invalid
vm_mmap_pgoff
do_mmap
__do_mmap_mm
__mmap_region
__do_munmap
unmap_region
unmap_vmas
unmap_single_vma
untrack_pfn
follow_phys // pte is already invalidated, WARN_ON here
CoW installing a valid page in a pfnmap vma seems weird to us. Can
remap_pfn_range be used for a private (read-only) vma at all? Once CoW happens
on a pfnmap vma during a write fault, the new page looks normal (its page flags
are valid) to most mm subsystems, such as memory-failure in this case, so extra
work would be needed to handle this special page.
During unmap, if the vma is a pfnmap one, the page should not be unmapped as a
normal page, since struct pages are not supposed to be touched for a pfnmap vma.
But the root question is: can we insert a valid page into a pfnmap vma at all?
Any thoughts on how to solve this warning?
------------[ cut here ]------------
WARNING: CPU: 0 PID: 503 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xed/0x100
Modules linked in: remap_pfn(OE)
CPU: 0 PID: 503 Comm: remap_pfn Tainted: G OE 6.8.0-rc6-dirty #436
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
RIP: 0010:untrack_pfn+0xed/0x100
Code: cc cc cc cc 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 ca 48 8b 7b 30 e8 81 de cf 00 89 6b 28 48 8b 7b 30 e8 05 cc b7 e8 ba c3 ce 00 66 2e 0f 1f 84 00 00 00 00 00 90 90 90
RSP: 0018:ffffb5f683eafc78 EFLAGS: 00010282
RAX: 00000000ffffffea RBX: ffff960b18537658 RCX: 0000000000000043
RDX: bfffffffdcb13e00 RSI: 0000000000000043 RDI: ffff960e45b7a140
RBP: 0000000000000000 R08: 00007f7df7a00000 R09: ffff960a00000fb8
R10: ffff960a00000000 R11: 000fffffffffffff R12: 00007f7df7a08000
R13: 0000000000000000 R14: ffffb5f683eafdc8 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff960e2fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7df7aeb110 CR3: 0000000118b66003 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
? __warn+0x84/0x140
? untrack_pfn+0xed/0x100
? report_bug+0x1bd/0x1d0
? handle_bug+0x3c/0x70
? exc_invalid_op+0x18/0x70
? asm_exc_invalid_op+0x1a/0x20
? untrack_pfn+0xed/0x100
? untrack_pfn+0x5c/0x100
unmap_single_vma+0xa6/0xe0
unmap_vmas+0xb2/0x190
exit_mmap+0xee/0x3c0
mmput+0x68/0x120
do_exit+0x2ec/0xb80
do_group_exit+0x31/0x80
__x64_sys_exit_group+0x18/0x20
do_syscall_64+0x66/0x180
entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7f7df7aeb146
Code: Unable to access opcode bytes at 0x7f7df7aeb11c.
RSP: 002b:00007ffe571100a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007f7df7bf08a0 RCX: 00007f7df7aeb146
RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff80
R10: 0000000000000002 R11: 0000000000000246 R12: 00007f7df7bf08a0
R13: 0000000000000001 R14: 00007f7df7bf92e8 R15: 0000000000000000
</TASK>
---[ end trace 0000000000000000 ]---
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Question] CoW on VM_PFNMAP vma during write fault
2024-02-27 12:28 [Question] CoW on VM_PFNMAP vma during write fault Wupeng Ma
@ 2024-02-27 13:00 ` David Hildenbrand
2024-02-27 13:15 ` David Hildenbrand
0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2024-02-27 13:00 UTC (permalink / raw)
To: Wupeng Ma, akpm, khlebnikov, jaredeh, linmiaohe, hpa
Cc: dave.hansen, luto, tglx, peterz, linux-kernel, x86, mingo,
rdunlap, bhelgaas, linux-mm
On 27.02.24 13:28, Wupeng Ma wrote:
> We find that a warn will be produced during our test, the detail log is
> shown in the end.
>
> The core problem of this warn is that the first pfn of this pfnmap vma is
> cleared during memory-failure. Digging into the source we find that this
> problem can be triggered as following:
>
> // mmap with MAP_PRIVATE and specific fd which hook mmap
> mmap(MAP_PRIVATE, fd)
> __mmap_region
> remap_pfn_range
> // set vma with pfnmap and the prot of pte is read only
>
Okay, so we get a MAP_PRIVATE VM_PFNMAP I assume.
What fd is that exactly? Often, we disallow private mappings in the
mmap() callback (for a good reason).
> // memset this memory with trigger fault
> handle_mm_fault
> __handle_mm_fault
> handle_pte_fault
> // write fault and !pte_write(entry)
> do_wp_page
> wp_page_copy // this will alloc a new page with valid page struct
> // for this pfnmap vma
Here we replace the mapped PFNMAP thingy by a proper anon folio.
>
> // inject a hwpoison to the first page of this vma
I assume this is an anon folio?
> madvise_inject_error
> memory_failure
> hwpoison_user_mappings
> try_to_unmap_one
> // mark this pte as invalid (hwpoison)
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> address, range.end);
>
> // during unmap vma, the first pfn of this pfnmap vma is invalid
> vm_mmap_pgoff
> do_mmap
> __do_mmap_mm
> __mmap_region
> __do_munmap
> unmap_region
> unmap_vmas
> unmap_single_vma
> untrack_pfn
> follow_phys // pte is already invalidate, WARN_ON here
unmap_single_vma()->...->zap_pte_range() should do the right thing when
calling vm_normal_page().
untrack_pfn() is the problematic part.
>
> CoW with a valid page for pfnmap vma is weird to us. Can we use
> remap_pfn_range for private vma(read only)? Once CoW happens on a pfnmap
> vma during write fault, this page is normal(page flag is valid) for most mm
> subsystems, such as memory failure in this case and extra should be done to
> handle this special page.
>
> During unmap, if this vma is pfnmap, unmap shouldn't be done since page
> should not be touched for pfnmap vma.
>
> But the root problem is that can we insert a valid page for pfnmap vma?
>
> Any thoughts to solve this warn?
vm_normal_page() documentation explains how that magic is supposed to
work. vm_normal_page() should be able to correctly identify whether we
want to look at the struct page for an anon folio that was COWed.
untrack_pfn() indeed does not seem to be well prepared for handling
MAP_PRIVATE mappings where we end up having anon folios.
I think it will already *completely mess up* simply when unmapping the
range without the memory failure involved.
See, follow_phys() would get the PFN of the anon folio and then
untrack_pfn() would do some nonsense with that. Completely broken.
The WARN is just a side-effect of the brokenness.
In follow_phys(), we'd likely have to call vm_normal_page(). If we get a
page back, we'd likely have to fail follow_phys() instead of returning a
PFN of an anon folio.
Now, how do we fix untrack_pfn() ? I really don't know. In theory, we
might no longer have *any* PFNMAP PFN in there after COW'ing everything.
Sounds like MAP_PRIVATE VM_PFNMAP + __HAVE_PFNMAP_TRACKING is some
broken garbage (sorry). Can we disallow it?
--
Cheers,
David / dhildenb
* Re: [Question] CoW on VM_PFNMAP vma during write fault
2024-02-27 13:00 ` David Hildenbrand
@ 2024-02-27 13:15 ` David Hildenbrand
2024-02-28 1:55 ` mawupeng
0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2024-02-27 13:15 UTC (permalink / raw)
To: Wupeng Ma, akpm, khlebnikov, jaredeh, linmiaohe, hpa
Cc: dave.hansen, luto, tglx, peterz, linux-kernel, x86, mingo,
rdunlap, bhelgaas, linux-mm
On 27.02.24 14:00, David Hildenbrand wrote:
> On 27.02.24 13:28, Wupeng Ma wrote:
>> We find that a warn will be produced during our test, the detail log is
>> shown in the end.
>>
>> The core problem of this warn is that the first pfn of this pfnmap vma is
>> cleared during memory-failure. Digging into the source we find that this
>> problem can be triggered as following:
>>
>> // mmap with MAP_PRIVATE and specific fd which hook mmap
>> mmap(MAP_PRIVATE, fd)
>> __mmap_region
>> remap_pfn_range
>> // set vma with pfnmap and the prot of pte is read only
>>
>
> Okay, so we get a MAP_PRIVATE VM_PFNMAP I assume.
>
> What fd is that exactly? Often, we disallow private mappings in the
> mmap() callback (for a good reason).
>
>> // memset this memory with trigger fault
>> handle_mm_fault
>> __handle_mm_fault
>> handle_pte_fault
>> // write fault and !pte_write(entry)
>> do_wp_page
>> wp_page_copy // this will alloc a new page with valid page struct
>> // for this pfnmap vma
>
> Here we replace the mapped PFNMAP thingy by a proper anon folio.
>
>>
>> // inject a hwpoison to the first page of this vma
>
> I assume this is an anon folio?
>
>> madvise_inject_error
>> memory_failure
>> hwpoison_user_mappings
>> try_to_unmap_one
>> // mark this pte as invalid (hwpoison)
>> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
>> address, range.end);
>>
>> // during unmap vma, the first pfn of this pfnmap vma is invalid
>> vm_mmap_pgoff
>> do_mmap
>> __do_mmap_mm
>> __mmap_region
>> __do_munmap
>> unmap_region
>> unmap_vmas
>> unmap_single_vma
>> untrack_pfn
>> follow_phys // pte is already invalidate, WARN_ON here
>
> unmap_single_vma()->...->zap_pte_range() should do the right thing when
> calling vm_normal_page().
>
> untrack_pfn() is the problematic part.
>
>>
>> CoW with a valid page for pfnmap vma is weird to us. Can we use
>> remap_pfn_range for private vma(read only)? Once CoW happens on a pfnmap
>> vma during write fault, this page is normal(page flag is valid) for most mm
>> subsystems, such as memory failure in this case and extra should be done to
>> handle this special page.
>>
>> During unmap, if this vma is pfnmap, unmap shouldn't be done since page
>> should not be touched for pfnmap vma.
>>
>> But the root problem is that can we insert a valid page for pfnmap vma?
>>
>> Any thoughts to solve this warn?
>
> vm_normal_page() documentation explains how that magic is supposed to
> work. vm_normal_page() should be able to correctly identify whether we
> want to look at the struct page for an anon folio that was COWed.
>
>
> untrack_pfn() indeed does not seem to be well prepared for handling
> MAP_PRIVATE mappings where we end up having anon folios.
>
> I think it will already *completely mess up* simply when unmapping the
> range without the memory failure involved.
>
> See, follow_phys() would get the PFN of the anon folio and then
> untrack_pfn() would do some nonsense with that. Completely broken.
>
> The WARN is just a side-effect of the brokenness.
>
> In follow_phys(), we'd likely have to call vm_normal_page(). If we get a
> page back, we'd likely have to fail follow_phys() instead of returning a
> PFN of an anon folio.
>
> Now, how do we fix untrack_pfn() ? I really don't know. In theory, we
> might no longer have *any* PFNMAP PFN in there after COW'ing everything.
>
> Sounds like MAP_PRIVATE VM_PFNMAP + __HAVE_PFNMAP_TRACKING is some
> broken garbage (sorry). Can we disallow it?
Staring at track_pfn_copy(), it's maybe similarly broken?
I think we want to do:
diff --git a/mm/memory.c b/mm/memory.c
index 098356b8805ae..da5d1e37c5534 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6050,6 +6050,10 @@ int follow_phys(struct vm_area_struct *vma,
goto out;
pte = ptep_get(ptep);
+ /* Never return addresses of COW'ed anon folios. */
+ if (vm_normal_page(vma, address, pte))
+ goto unlock;
+
if ((flags & FOLL_WRITE) && !pte_write(pte))
goto unlock;
And then, just disallow it with PAT involved:
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 0904d7e8e1260..e4d2b2e8c0281 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -997,6 +997,15 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
&& size == (vma->vm_end - vma->vm_start))) {
int ret;
+ /*
+ * untrack_pfn() and friends cannot handle regions that suddenly
+ * contain anon folios after COW. In particular, follow_phys()
+ * will fail when we have an anon folio at the beginning of the
+ * VMA.
+ */
+ if (vma && is_cow_mapping(vma->vm_flags))
+ return -EINVAL;
+
ret = reserve_pfn_range(paddr, size, prot, 0);
if (ret == 0 && vma)
vm_flags_set(vma, VM_PAT);
I'm afraid that will break something. But well, it's already semi-broken.
As long as VM_PAT is not involved, it should work as expected.
In an ideal world, we'd get rid of follow_phys() completely and just
derive that information from the VMA?
--
Cheers,
David / dhildenb
* Re: [Question] CoW on VM_PFNMAP vma during write fault
2024-02-27 13:15 ` David Hildenbrand
@ 2024-02-28 1:55 ` mawupeng
2024-02-28 2:10 ` Matthew Wilcox
2024-03-04 8:47 ` mawupeng
0 siblings, 2 replies; 10+ messages in thread
From: mawupeng @ 2024-02-28 1:55 UTC (permalink / raw)
To: david, akpm, khlebnikov, jaredeh, linmiaohe, hpa
Cc: mawupeng1, dave.hansen, luto, tglx, peterz, linux-kernel, x86,
mingo, rdunlap, bhelgaas, linux-mm
On 2024/2/27 21:15, David Hildenbrand wrote:
> On 27.02.24 14:00, David Hildenbrand wrote:
>> On 27.02.24 13:28, Wupeng Ma wrote:
>>> We find that a warn will be produced during our test, the detail log is
>>> shown in the end.
>>>
>>> The core problem of this warn is that the first pfn of this pfnmap vma is
>>> cleared during memory-failure. Digging into the source we find that this
>>> problem can be triggered as following:
>>>
>>> // mmap with MAP_PRIVATE and specific fd which hook mmap
>>> mmap(MAP_PRIVATE, fd)
>>> __mmap_region
>>> remap_pfn_range
>>> // set vma with pfnmap and the prot of pte is read only
>>>
>>
>> Okay, so we get a MAP_PRIVATE VM_PFNMAP I assume.
>>
>> What fd is that exactly? Often, we disallow private mappings in the
>> mmap() callback (for a good reason).
Just a device fd with a device-specific mmap hook that uses remap_pfn_range to assign memory.
>>
>>> // memset this memory with trigger fault
>>> handle_mm_fault
>>> __handle_mm_fault
>>> handle_pte_fault
>>> // write fault and !pte_write(entry)
>>> do_wp_page
>>> wp_page_copy // this will alloc a new page with valid page struct
>>> // for this pfnmap vma
>>
>> Here we replace the mapped PFNMAP thingy by a proper anon folio.
My question is: can we replace a pfn with a fully-functional page in a pfnmap vma? This is not a MIXEDMAP vma.
>>
>>>
>>> // inject a hwpoison to the first page of this vma
>>
>> I assume this is an anon folio?
Yes.
>>
>>> madvise_inject_error
>>> memory_failure
>>> hwpoison_user_mappings
>>> try_to_unmap_one
>>> // mark this pte as invalid (hwpoison)
>>> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
>>> address, range.end);
If we can replace the mapped PFNMAP thingy with a proper anon folio, don't we need to make memory_failure()
handle pfnmap vmas properly, since a pfnmap vma should not touch its struct pages?
Currently this page has a valid mapping and can be unmapped.
Maybe there is something wrong with my understanding of CoW on a private pfnmap vma.
>>>
>>> // during unmap vma, the first pfn of this pfnmap vma is invalid
>>> vm_mmap_pgoff
>>> do_mmap
>>> __do_mmap_mm
>>> __mmap_region
>>> __do_munmap
>>> unmap_region
>>> unmap_vmas
>>> unmap_single_vma
>>> untrack_pfn
>>> follow_phys // pte is already invalidate, WARN_ON here
>>
>> unmap_single_vma()->...->zap_pte_range() should do the right thing when
>> calling vm_normal_page().
>>
>> untrack_pfn() is the problematic part.
A pfnmap vma doesn't have a valid page for all of its pfns, so unmapping pages is not expected. In this case it
just checks whether the first address has a valid pte or not, which seems reasonable to me.
>>
>>>
>>> CoW with a valid page for pfnmap vma is weird to us. Can we use
>>> remap_pfn_range for private vma(read only)? Once CoW happens on a pfnmap
>>> vma during write fault, this page is normal(page flag is valid) for most mm
>>> subsystems, such as memory failure in this case and extra should be done to
>>> handle this special page.
>>>
>>> During unmap, if this vma is pfnmap, unmap shouldn't be done since page
>>> should not be touched for pfnmap vma.
>>>
>>> But the root problem is that can we insert a valid page for pfnmap vma?
>>>
>>> Any thoughts to solve this warn?
>>
>> vm_normal_page() documentation explains how that magic is supposed to
>> work. vm_normal_page() should be able to correctly identify whether we
>> want to look at the struct page for an anon folio that was COWed.
vm_normal_page() can find out a CoW mapping but
>>
>>
>> untrack_pfn() indeed does not seem to be well prepared for handling
>> MAP_PRIVATE mappings where we end up having anon folios.
>>
>> I think it will already *completely mess up* simply when unmapping the
>> range without the memory failure involved.
>>
>> See, follow_phys() would get the PFN of the anon folio and then
>> untrack_pfn() would do some nonsense with that. Completely broken.
>>
>> The WARN is just a side-effect of the brokenness.
>>
>> In follow_phys(), we'd likely have to call vm_normal_page(). If we get a
>> page back, we'd likely have to fail follow_phys() instead of returning a
>> PFN of an anon folio.
>>
>> Now, how do we fix untrack_pfn() ? I really don't know. In theory, we
>> might no longer have *any* PFNMAP PFN in there after COW'ing everything.
>>
>> Sounds like MAP_PRIVATE VM_PFNMAP + __HAVE_PFNMAP_TRACKING is some
>> broken garbage (sorry). Can we disallow it?
>
> Staring at track_pfn_copy(), it's maybe similarly broken?
>
> I think we want to do:
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 098356b8805ae..da5d1e37c5534 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6050,6 +6050,10 @@ int follow_phys(struct vm_area_struct *vma,
> goto out;
> pte = ptep_get(ptep);
>
> + /* Never return addresses of COW'ed anon folios. */
> + if (vm_normal_page(vma, address, pte))
> + goto unlock;
> +
> if ((flags & FOLL_WRITE) && !pte_write(pte))
> goto unlock;
>
>
> And then, just disallow it with PAT involved:
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index 0904d7e8e1260..e4d2b2e8c0281 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -997,6 +997,15 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> && size == (vma->vm_end - vma->vm_start))) {
> int ret;
>
> + /*
> + * untrack_pfn() and friends cannot handle regions that suddenly
> + * contain anon folios after COW. In particular, follow_phys()
> + * will fail when we have an anon folio at the beginning of the
> + * VMA.
> + */
> + if (vma && is_cow_mapping(vma->vm_flags))
> + return -EINVAL;
In this case, no user of remap_pfn_range() can have a CoW mapping, which means that if VM_MAYWRITE is set,
VM_SHARED is also required for the vma.
This would solve the CoW-on-private-vma problem.
> +
> ret = reserve_pfn_range(paddr, size, prot, 0);
> if (ret == 0 && vma)
> vm_flags_set(vma, VM_PAT);
>
>
> I'm afraid that will break something. But well, it's already semi-broken.
>
> As long as VM_PAT is not involved, it should work as expected.
>
> In an ideal world, we'd get rid of follow_phys() completely and just
> derive that information from the VMA?
>
* Re: [Question] CoW on VM_PFNMAP vma during write fault
2024-02-28 1:55 ` mawupeng
@ 2024-02-28 2:10 ` Matthew Wilcox
2024-02-28 2:18 ` mawupeng
2024-03-04 8:47 ` mawupeng
1 sibling, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2024-02-28 2:10 UTC (permalink / raw)
To: mawupeng
Cc: david, akpm, khlebnikov, jaredeh, linmiaohe, hpa, dave.hansen,
luto, tglx, peterz, linux-kernel, x86, mingo, rdunlap, bhelgaas,
linux-mm
On Wed, Feb 28, 2024 at 09:55:24AM +0800, mawupeng wrote:
> On 2024/2/27 21:15, David Hildenbrand wrote:
> > On 27.02.24 14:00, David Hildenbrand wrote:
> >> On 27.02.24 13:28, Wupeng Ma wrote:
> >>> We find that a warn will be produced during our test, the detail log is
> >>> shown in the end.
> >>>
> >>> The core problem of this warn is that the first pfn of this pfnmap vma is
> >>> cleared during memory-failure. Digging into the source we find that this
> >>> problem can be triggered as following:
> >>>
> >>> // mmap with MAP_PRIVATE and specific fd which hook mmap
> >>> mmap(MAP_PRIVATE, fd)
> >>> __mmap_region
> >>> remap_pfn_range
> >>> // set vma with pfnmap and the prot of pte is read only
> >>>
> >>
> >> Okay, so we get a MAP_PRIVATE VM_PFNMAP I assume.
> >>
> >> What fd is that exactly? Often, we disallow private mappings in the
> >> mmap() callback (for a good reason).
>
> just a device fd with device-specify mmap which use remap_pfn_range to assign memory.
But what meaning do you want MAP_PRIVATE of this fd to have? Does it
make sense to permit this, or should you rather just return -EINVAL if
somebody tries to mmap() with MAP_PRIVATE set?
* Re: [Question] CoW on VM_PFNMAP vma during write fault
2024-02-28 2:10 ` Matthew Wilcox
@ 2024-02-28 2:18 ` mawupeng
0 siblings, 0 replies; 10+ messages in thread
From: mawupeng @ 2024-02-28 2:18 UTC (permalink / raw)
To: willy
Cc: mawupeng1, david, akpm, khlebnikov, jaredeh, linmiaohe, hpa,
dave.hansen, luto, tglx, peterz, linux-kernel, x86, mingo,
rdunlap, bhelgaas, linux-mm
On 2024/2/28 10:10, Matthew Wilcox wrote:
> On Wed, Feb 28, 2024 at 09:55:24AM +0800, mawupeng wrote:
>> On 2024/2/27 21:15, David Hildenbrand wrote:
>>> On 27.02.24 14:00, David Hildenbrand wrote:
>>>> On 27.02.24 13:28, Wupeng Ma wrote:
>>>>> We find that a warn will be produced during our test, the detail log is
>>>>> shown in the end.
>>>>>
>>>>> The core problem of this warn is that the first pfn of this pfnmap vma is
>>>>> cleared during memory-failure. Digging into the source we find that this
>>>>> problem can be triggered as following:
>>>>>
>>>>> // mmap with MAP_PRIVATE and specific fd which hook mmap
>>>>> mmap(MAP_PRIVATE, fd)
>>>>> __mmap_region
>>>>> remap_pfn_range
>>>>> // set vma with pfnmap and the prot of pte is read only
>>>>>
>>>>
>>>> Okay, so we get a MAP_PRIVATE VM_PFNMAP I assume.
>>>>
>>>> What fd is that exactly? Often, we disallow private mappings in the
>>>> mmap() callback (for a good reason).
>>
>> just a device fd with device-specify mmap which use remap_pfn_range to assign memory.
>
> But what meaning do you want MAP_PRIVATE of this fd to have? Does it
> make sense to permit this, or should you rather just return -EINVAL if
> somebody tries to mmap() with MAP_PRIVATE set?
I think returning -EINVAL if somebody tries to mmap() with MAP_PRIVATE and the mapping may become writable
(VM_MAYWRITE) is reasonable to me.
Reads from such a pfnmap vma will not trigger a CoW fault.
>
* Re: [Question] CoW on VM_PFNMAP vma during write fault
2024-02-28 1:55 ` mawupeng
2024-02-28 2:10 ` Matthew Wilcox
@ 2024-03-04 8:47 ` mawupeng
2024-03-04 8:57 ` David Hildenbrand
1 sibling, 1 reply; 10+ messages in thread
From: mawupeng @ 2024-03-04 8:47 UTC (permalink / raw)
To: david, akpm, khlebnikov, jaredeh, linmiaohe, hpa, tglx, mingo,
dave.hansen, jaredeh, cotte, npiggin
Cc: mawupeng1, luto, peterz, linux-kernel, x86, rdunlap, bhelgaas, linux-mm
Hi Maintainers, kindly ping...
On 2024/2/28 9:55, mawupeng wrote:
>
>
> On 2024/2/27 21:15, David Hildenbrand wrote:
>> On 27.02.24 14:00, David Hildenbrand wrote:
>>> On 27.02.24 13:28, Wupeng Ma wrote:
>>>> We find that a warn will be produced during our test, the detail log is
>>>> shown in the end.
>>>>
>>>> The core problem of this warn is that the first pfn of this pfnmap vma is
>>>> cleared during memory-failure. Digging into the source we find that this
>>>> problem can be triggered as following:
>>>>
>>>> // mmap with MAP_PRIVATE and specific fd which hook mmap
>>>> mmap(MAP_PRIVATE, fd)
>>>> __mmap_region
>>>> remap_pfn_range
>>>> // set vma with pfnmap and the prot of pte is read only
>>>>
>>>
>>> Okay, so we get a MAP_PRIVATE VM_PFNMAP I assume.
>>>
>>> What fd is that exactly? Often, we disallow private mappings in the
>>> mmap() callback (for a good reason).
We originally found this problem in 5.10. Commit 9f78bf330a66 ("xsk: support use vaddr as ring") fixed it
there, while adding vaddr support, by replacing VM_PFNMAP with VM_MIXEDMAP. But other modules that use
remap_pfn_range() may still have this problem.
It does seem weird for private mappings. What is the good reason?
>
> just a device fd with device-specify mmap which use remap_pfn_range to assign memory.
>
>>>
>>>> // memset this memory with trigger fault
>>>> handle_mm_fault
>>>> __handle_mm_fault
>>>> handle_pte_fault
>>>> // write fault and !pte_write(entry)
>>>> do_wp_page
>>>> wp_page_copy // this will alloc a new page with valid page struct
>>>> // for this pfnmap vma
>>>
>>> Here we replace the mapped PFNMAP thingy by a proper anon folio.
>
> My problem is can wen replace a pfn with fully functioned page for pfnmap vma? This is not MIXEDMAP vma.
>
>>>
>>>>
>>>> // inject a hwpoison to the first page of this vma
>>>
>>> I assume this is an anon folio?
>
> Yes.
>
>>>
>>>> madvise_inject_error
>>>> memory_failure
>>>> hwpoison_user_mappings
>>>> try_to_unmap_one
>>>> // mark this pte as invalid (hwpoison)
>>>> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
>>>> address, range.end);
>
> If we can replace the mapped PFNMAP thingy by a proper anon folio, we need to make memory_failure to handle
> pfnmap vma properly since pfnmap vma shoule not touch its struct page?
>
> Current this page have a valid mapping and can be unmap.
>
> Maybe there is something wrong with my understanding of CoW on a private pfnmap vma.
>
>>>>
>>>> // during unmap vma, the first pfn of this pfnmap vma is invalid
>>>> vm_mmap_pgoff
>>>> do_mmap
>>>> __do_mmap_mm
>>>> __mmap_region
>>>> __do_munmap
>>>> unmap_region
>>>> unmap_vmas
>>>> unmap_single_vma
>>>> untrack_pfn
>>>> follow_phys // pte is already invalidate, WARN_ON here
>>>
>>> unmap_single_vma()->...->zap_pte_range() should do the right thing when
>>> calling vm_normal_page().
>>>
>>> untrack_pfn() is the problematic part.
>
> For pfnmap vma, it don't have a valid page for all pfns, so unmap is not expected. In this case, it just
> check wheather the first address have a valid pte or not which seems reasonable to me.
>
>>>
>>>>
>>>> CoW with a valid page for pfnmap vma is weird to us. Can we use
>>>> remap_pfn_range for private vma(read only)? Once CoW happens on a pfnmap
>>>> vma during write fault, this page is normal(page flag is valid) for most mm
>>>> subsystems, such as memory failure in this case and extra should be done to
>>>> handle this special page.
>>>>
>>>> During unmap, if this vma is pfnmap, unmap shouldn't be done since page
>>>> should not be touched for pfnmap vma.
>>>>
>>>> But the root problem is that can we insert a valid page for pfnmap vma?
>>>>
>>>> Any thoughts to solve this warn?
>>>
>>> vm_normal_page() documentation explains how that magic is supposed to
>>> work. vm_normal_page() should be able to correctly identify whether we
>>> want to look at the struct page for an anon folio that was COWed.
>
> vm_normal_page() can find out a CoW mapping but
>
>>>
>>>
>>> untrack_pfn() indeed does not seem to be well prepared for handling
>>> MAP_PRIVATE mappings where we end up having anon folios.
>>>
>>> I think it will already *completely mess up* simply when unmapping the
>>> range without the memory failure involved.
>>>
>>> See, follow_phys() would get the PFN of the anon folio and then
>>> untrack_pfn() would do some nonsense with that. Completely broken.
>>>
>>> The WARN is just a side-effect of the brokenness.
>>>
>>> In follow_phys(), we'd likely have to call vm_normal_page(). If we get a
>>> page back, we'd likely have to fail follow_phys() instead of returning a
>>> PFN of an anon folio.
>>>
>>> Now, how do we fix untrack_pfn() ? I really don't know. In theory, we
>>> might no longer have *any* PFNMAP PFN in there after COW'ing everything.
>>>
>>> Sounds like MAP_PRIVATE VM_PFNMAP + __HAVE_PFNMAP_TRACKING is some
>>> broken garbage (sorry). Can we disallow it?
>>
>> Staring at track_pfn_copy(), it's maybe similarly broken?
>>
>> I think we want to do:
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 098356b8805ae..da5d1e37c5534 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6050,6 +6050,10 @@ int follow_phys(struct vm_area_struct *vma,
>> goto out;
>> pte = ptep_get(ptep);
>>
>> + /* Never return addresses of COW'ed anon folios. */
>> + if (vm_normal_page(vma, address, pte))
>> + goto unlock;
>> +
>> if ((flags & FOLL_WRITE) && !pte_write(pte))
>> goto unlock;
>>
>>
>> And then, just disallow it with PAT involved:
>>
>> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
>> index 0904d7e8e1260..e4d2b2e8c0281 100644
>> --- a/arch/x86/mm/pat/memtype.c
>> +++ b/arch/x86/mm/pat/memtype.c
>> @@ -997,6 +997,15 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>> && size == (vma->vm_end - vma->vm_start))) {
>> int ret;
>>
>> + /*
>> + * untrack_pfn() and friends cannot handle regions that suddenly
>> + * contain anon folios after COW. In particular, follow_phys()
>> + * will fail when we have an anon folio at the beginning of the
>> + * VMA.
>> + */
>> + if (vma && is_cow_mapping(vma->vm_flags))
>> + return -EINVAL;
>
> In this case, anyone use remap_pfn_range can not be cow_maaping which means if VM_MAYWRITE exists, VM_SHARED is
> needed for this vma.
>
> This can solve this CoW on private vma problem.
>
>> +
>> ret = reserve_pfn_range(paddr, size, prot, 0);
>> if (ret == 0 && vma)
>> vm_flags_set(vma, VM_PAT);
>>
>>
>> I'm afraid that will break something. But well, it's already semi-broken.
>>
>> As long as VM_PAT is not involved, it should work as expected.
>>
>> In an ideal world, we'd get rid of follow_phys() completely and just
>> derive that information from the VMA?
>>
* Re: [Question] CoW on VM_PFNMAP vma during write fault
2024-03-04 8:47 ` mawupeng
@ 2024-03-04 8:57 ` David Hildenbrand
2024-03-04 9:04 ` mawupeng
0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2024-03-04 8:57 UTC (permalink / raw)
To: mawupeng, akpm, khlebnikov, jaredeh, linmiaohe, hpa, tglx, mingo,
dave.hansen, cotte, npiggin
Cc: luto, peterz, linux-kernel, x86, rdunlap, bhelgaas, linux-mm
On 04.03.24 09:47, mawupeng wrote:
> Hi Maintainers, kindly ping...
>
> On 2024/2/28 9:55, mawupeng wrote:
>>
>>
>> On 2024/2/27 21:15, David Hildenbrand wrote:
>>> On 27.02.24 14:00, David Hildenbrand wrote:
>>>> On 27.02.24 13:28, Wupeng Ma wrote:
>>>>> We find that a warn will be produced during our test, the detail log is
>>>>> shown in the end.
>>>>>
>>>>> The core problem of this warn is that the first pfn of this pfnmap vma is
>>>>> cleared during memory-failure. Digging into the source we find that this
>>>>> problem can be triggered as following:
>>>>>
>>>>> // mmap with MAP_PRIVATE and specific fd which hook mmap
>>>>> mmap(MAP_PRIVATE, fd)
>>>>> __mmap_region
>>>>> remap_pfn_range
>>>>> // set vma with pfnmap and the prot of pte is read only
>>>>>
>>>>
>>>> Okay, so we get a MAP_PRIVATE VM_PFNMAP I assume.
>>>>
>>>> What fd is that exactly? Often, we disallow private mappings in the
>>>> mmap() callback (for a good reason).
>
> We found this problem in 5.10. Commit 9f78bf330a66 ("xsk: support use vaddr as ring") fixes this
> problem while adding vaddr support, by replacing VM_PFNMAP with VM_MIXEDMAP. But other modules that
> use remap_pfn_range may still have this problem.
I wrote a simple reproducer using MAP_PRIVATE of io_uring queues on Friday.
>
> It does seem weird for private mappings. What is the good reason?
I'm sure there are some use cases that require MAP_PRIVATE of such
areas, and usually there is nothing wrong with that.
It's just that the PAT implementation is incompatible.
I can submit a cleaned-up version of my patches.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Question] CoW on VM_PFNMAP vma during write fault
2024-03-04 8:57 ` David Hildenbrand
@ 2024-03-04 9:04 ` mawupeng
2024-03-04 9:17 ` David Hildenbrand
0 siblings, 1 reply; 10+ messages in thread
From: mawupeng @ 2024-03-04 9:04 UTC (permalink / raw)
To: david, akpm, khlebnikov, jaredeh, linmiaohe, hpa, tglx, mingo,
dave.hansen, cotte, npiggin
Cc: mawupeng1, luto, peterz, linux-kernel, x86, rdunlap, bhelgaas, linux-mm
On 2024/3/4 16:57, David Hildenbrand wrote:
> On 04.03.24 09:47, mawupeng wrote:
>> Hi Maintainers, kindly ping...
>>
>> On 2024/2/28 9:55, mawupeng wrote:
>>>
>>>
>>> On 2024/2/27 21:15, David Hildenbrand wrote:
>>>> On 27.02.24 14:00, David Hildenbrand wrote:
>>>>> On 27.02.24 13:28, Wupeng Ma wrote:
>>>>>> We find that a warn will be produced during our test, the detail log is
>>>>>> shown in the end.
>>>>>>
>>>>>> The core problem of this warn is that the first pfn of this pfnmap vma is
>>>>>> cleared during memory-failure. Digging into the source we find that this
>>>>>> problem can be triggered as following:
>>>>>>
>>>>>> // mmap with MAP_PRIVATE and specific fd which hook mmap
>>>>>> mmap(MAP_PRIVATE, fd)
>>>>>> __mmap_region
>>>>>> remap_pfn_range
>>>>>> // set vma with pfnmap and the prot of pte is read only
>>>>>>
>>>>>
>>>>> Okay, so we get a MAP_PRIVATE VM_PFNMAP I assume.
>>>>>
>>>>> What fd is that exactly? Often, we disallow private mappings in the
>>>>> mmap() callback (for a good reason).
>>
>> We found this problem in 5.10. Commit 9f78bf330a66 ("xsk: support use vaddr as ring") fixes this
>> problem while adding vaddr support, by replacing VM_PFNMAP with VM_MIXEDMAP. But other modules that
>> use remap_pfn_range may still have this problem.
>
> I wrote a simple reproducer using MAP_PRIVATE of io_uring queues on Friday.
>
>>
>> It does seem weird for private mappings. What is the good reason?
>
> I'm sure there are some use cases that require MAP_PRIVATE of such areas, and usually there is nothing wrong with that.
So MAP_PRIVATE for a VM_PFNMAP area with write access is OK? What is the use case for this situation?
>
> It's just that the PAT implementation is incompatible.
PAT does have its problems.
>
> I can submit a cleaned-up version of my patches.
Thanks
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Question] CoW on VM_PFNMAP vma during write fault
2024-03-04 9:04 ` mawupeng
@ 2024-03-04 9:17 ` David Hildenbrand
0 siblings, 0 replies; 10+ messages in thread
From: David Hildenbrand @ 2024-03-04 9:17 UTC (permalink / raw)
To: mawupeng, akpm, khlebnikov, jaredeh, linmiaohe, hpa, tglx, mingo,
dave.hansen, cotte, npiggin
Cc: luto, peterz, linux-kernel, x86, rdunlap, bhelgaas, linux-mm
On 04.03.24 10:04, mawupeng wrote:
>
>
> On 2024/3/4 16:57, David Hildenbrand wrote:
>> On 04.03.24 09:47, mawupeng wrote:
>>> Hi Maintainers, kindly ping...
>>>
>>> On 2024/2/28 9:55, mawupeng wrote:
>>>>
>>>>
>>>> On 2024/2/27 21:15, David Hildenbrand wrote:
>>>>> On 27.02.24 14:00, David Hildenbrand wrote:
>>>>>> On 27.02.24 13:28, Wupeng Ma wrote:
>>>>>>> We find that a warn will be produced during our test, the detail log is
>>>>>>> shown in the end.
>>>>>>>
>>>>>>> The core problem of this warn is that the first pfn of this pfnmap vma is
>>>>>>> cleared during memory-failure. Digging into the source we find that this
>>>>>>> problem can be triggered as following:
>>>>>>>
>>>>>>> // mmap with MAP_PRIVATE and specific fd which hook mmap
>>>>>>> mmap(MAP_PRIVATE, fd)
>>>>>>> __mmap_region
>>>>>>> remap_pfn_range
>>>>>>> // set vma with pfnmap and the prot of pte is read only
>>>>>>>
>>>>>>
>>>>>> Okay, so we get a MAP_PRIVATE VM_PFNMAP I assume.
>>>>>>
>>>>>> What fd is that exactly? Often, we disallow private mappings in the
>>>>>> mmap() callback (for a good reason).
>>>
>>> We found this problem in 5.10. Commit 9f78bf330a66 ("xsk: support use vaddr as ring") fixes this
>>> problem while adding vaddr support, by replacing VM_PFNMAP with VM_MIXEDMAP. But other modules that
>>> use remap_pfn_range may still have this problem.
>>
>> I wrote a simple reproducer using MAP_PRIVATE of io_uring queues on Friday.
>>
>>>
>>> It does seem weird for private mappings. What is the good reason?
>>
>> I'm sure there are some use cases that require MAP_PRIVATE of such areas, and usually there is nothing wrong with that.
>
> So MAP_PRIVATE for a VM_PFNMAP area with write access is OK? What is the use case for this situation?
I recall that MAP_PRIVATE /dev/mem mappings were required for some use
cases. No details/ideas about other users, though.
Likely a sufficiently important use case that people really thought about
ways to get it working -- see vm_normal_page() :)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2024-03-04 9:17 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-27 12:28 [Question] CoW on VM_PFNMAP vma during write fault Wupeng Ma
2024-02-27 13:00 ` David Hildenbrand
2024-02-27 13:15 ` David Hildenbrand
2024-02-28 1:55 ` mawupeng
2024-02-28 2:10 ` Matthew Wilcox
2024-02-28 2:18 ` mawupeng
2024-03-04 8:47 ` mawupeng
2024-03-04 8:57 ` David Hildenbrand
2024-03-04 9:04 ` mawupeng
2024-03-04 9:17 ` David Hildenbrand