Re: [PATCH] mm: Fix mmap_assert_locked() in follow

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH] mm: Fix mmap_assert_locked() in follow_pte()
@ 2024-07-12 12:17 Bert Karwatzki
  2024-07-12 12:17 ` commit 1c29a32ce65f4cd0f1c causes Bad rss-counter state and firefox-esr crash in linux-next-20240613 Bert Karwatzki
  0 siblings, 1 reply; 12+ messages in thread
From: Bert Karwatzki @ 2024-07-12 12:17 UTC (permalink / raw)
  To: Pei Li
  Cc: Bert Karwatzki, Andrew Morton, Shuah Khan, David Hildenbrand,
	linux-mm, linux-kernel, linux-next, syzbot+35a4414f6e247f515443,
	syzkaller-bugs, linux-kernel-mentees, senozhatsky

This is causing a deadlock for me, too. Since linux-next-20240712 I observe
a hang when starting the gui. I bisected this to commit a13252049629 and
reverting this commit in linux-next-20240712 fixes the issue for me.
I do not have any log messages to prove the dead lock, though, as I compiled
everything without CONFIG_LOCKDEP.

Bert Karwatzki

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: commit 1c29a32ce65f4cd0f1c causes Bad rss-counter state and firefox-esr crash in linux-next-20240613
  2024-07-12 12:17 [PATCH] mm: Fix mmap_assert_locked() in follow_pte() Bert Karwatzki
@ 2024-07-12 12:17 ` Bert Karwatzki
  2024-07-12 15:38   ` Liam R. Howlett
  0 siblings, 1 reply; 12+ messages in thread
From: Bert Karwatzki @ 2024-07-12 12:17 UTC (permalink / raw)
  To: Liam R . Howlett
  Cc: Bert Karwatzki, Andrew Morton, Jiri Olsa, linux-mm, linux-kernel,
	linux-next

I did some experiments on the rss counter bug. The next patch is made for linux-next-20240613
with commit 1c29a32ce65f4cd0f1c0f9 reverted. Then I simply inlined the code of do_vmi_unmap()
and do_vmi_align_munmap() into mmap_region(). This version of the code works fine and does not
show the rss counter bug.

diff --git a/mm/mmap.c b/mm/mmap.c
index f95af72ddc9f..0f020c535c83 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -733,7 +733,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	vma_iter_store(vmi, vma);

 	vma_complete(&vp, vmi, vma->vm_mm);
-	validate_mm(vma->vm_mm);
 	return 0;

 nomem:
@@ -2911,6 +2910,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct vm_area_struct *next, *prev, *merge;
 	pgoff_t pglen = len >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	struct vma_munmap_struct vms;
+	struct ma_state mas_detach;
 	unsigned long end = addr + len;
 	unsigned long merge_start = addr, merge_end = end;
 	bool writable_file_mapping = false;
@@ -2933,12 +2934,46 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 			return -ENOMEM;
 	}

-	/* Unmap any existing mapping in the area */
-	error = do_vmi_munmap(&vmi, mm, addr, len, uf, false);
-	if (error == -EPERM)
-		return error;
-	else if (error)
-		return -ENOMEM;
+	/* Find the first overlapping VMA */
+	vma = vma_find(&vmi, end);
+	if (vma) {
+		struct maple_tree mt_detach;
+
+		/*
+		 * Check if memory is sealed before arch_unmap.
+		 * Prevent unmapping a sealed VMA.
+		 * can_modify_mm assumes we have acquired the lock on MM.
+		 */
+		if (unlikely(!can_modify_mm(mm, addr, end))) {
+			return -EPERM;
+		}
+
+		 /* arch_unmap() might do unmaps itself.  */
+		arch_unmap(mm, addr, end);
+
+		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
+		mt_on_stack(mt_detach);
+		mas_init(&mas_detach, &mt_detach, 0);
+
+		init_vma_munmap(&vms, &vmi, vma, addr, end, uf, false);
+		error = vms_gather_munmap_vmas(&vms, &mas_detach);
+		if (error) {
+			validate_mm(mm);
+			return -ENOMEM;
+		}
+
+		vma = NULL;
+		error = vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL);
+		if (error) {
+			abort_munmap_vmas(&mas_detach);
+			return -ENOMEM;
+		}
+
+		/* Point of no return */
+		vms_complete_munmap_vmas(&vms, &mas_detach);
+	} else {
+		// TODO
+	}

 	/*
 	 * Private writable mapping: check memory availability

The next patch now moves the call to vms_complete_munmap_vmas() towards the end of
mmap_region(). This code is also free of the rss counter bug.

commit a4b24bb18dde627792297455befcc465e45be66d
Author: Bert Karwatzki <spasswolf@web.de>
Date:   Thu Jun 20 17:02:08 2024 +0200

    mm: mmap: push back vms_complete_munmap_vmas()

    In order to to debug the rss counter bug we're going to push back
    vms_complete_munmap_vmas() in mmap_region.

    Signed-off-by: Bert Karwatzki <spasswolf@web.de>

diff --git a/mm/mmap.c b/mm/mmap.c
index 0f020c535c83..4fb9dd2e6d6e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2970,9 +2970,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		}

 		/* Point of no return */
-		vms_complete_munmap_vmas(&vms, &mas_detach);
 	} else {
-		// TODO
+		vms.end = 0;
+		vms.nr_pages = 0;
 	}

 	/*
@@ -3016,6 +3016,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		vma_iter_next_range(&vmi);
 	}

+	if (vms.end) {
+		vms_complete_munmap_vmas(&vms, &mas_detach);
+		vms.end = 0; // avoid double unmap below
+	}
+
 	/* Actually expand, if possible */
 	if (vma &&
 	    !vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
@@ -3026,7 +3031,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	if (vma == prev)
 		vma_iter_set(&vmi, addr);
 cannot_expand:
-
+	if (vms.end)
+		vms_complete_munmap_vmas(&vms, &mas_detach);
 	/*
 	 * Determine the object being mapped and call the appropriate
 	 * specific mapper. the address has already been validated, but

The next patch move vms_complete_munmap_vmas() a little further beyond the
call to vma_expand(). This code contain the rss counter bug.

commit 02d6be2410fa503d008f4cc8dcd1518ca56f8793
Author: Bert Karwatzki <spasswolf@web.de>
Date:   Thu Jun 20 20:07:13 2024 +0200

    mm: mmap: push back vms_complete_munmap_vmas()

    This commit actually show the rss counter bug, while the previus does
    not!

    Signed-off-by: Bert Karwatzki <spasswolf@web.de>

diff --git a/mm/mmap.c b/mm/mmap.c
index 4fb9dd2e6d6e..c5f4b4b6fb84 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3016,14 +3016,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		vma_iter_next_range(&vmi);
 	}

-	if (vms.end) {
-		vms_complete_munmap_vmas(&vms, &mas_detach);
-		vms.end = 0; // avoid double unmap below
-	}
-
 	/* Actually expand, if possible */
 	if (vma &&
 	    !vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
+		if (vms.end) {
+			vms_complete_munmap_vmas(&vms, &mas_detach);
+		}
 		khugepaged_enter_vma(vma, vm_flags);
 		goto expanded;
 	}


So there might be some unwanted interaction between vms_complete_munmap_vmas though
I've no yet figured out what exactly is happening. Hope this will be helpful in
solving the problem.

Bert Karwatzki



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: commit 1c29a32ce65f4cd0f1c causes Bad rss-counter state and firefox-esr crash in linux-next-20240613
  2024-07-12 12:17 ` commit 1c29a32ce65f4cd0f1c causes Bad rss-counter state and firefox-esr crash in linux-next-20240613 Bert Karwatzki
@ 2024-07-12 15:38   ` Liam R. Howlett
  2024-07-12 16:26     ` Bert Karwatzki
  0 siblings, 1 reply; 12+ messages in thread
From: Liam R. Howlett @ 2024-07-12 15:38 UTC (permalink / raw)
  To: Bert Karwatzki
  Cc: Andrew Morton, Jiri Olsa, linux-mm, linux-kernel, linux-next

* Bert Karwatzki <spasswolf@web.de> [240712 08:18]:
> I did some experiments on the rss counter bug. The next patch is made for linux-next-20240613
> with commit 1c29a32ce65f4cd0f1c0f9 reverted. Then I simply inlined the code of do_vmi_unmap()
> and do_vmi_align_munmap() into mmap_region(). This version of the code works fine and does not
> show the rss counter bug.

Are you still working with v1 of this patch set?

I root-caused the rss counter issue and seg fault to the fact that next
or prev may be expanded and I was using the new start/end on munmap() in
the completion.  This was fixed in subsequent patches.

I've sent v4 recently, but will have to a v5 to back off the removal of
arch_unmap() for PPC.

...

Thanks,
Liam


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: commit 1c29a32ce65f4cd0f1c causes Bad rss-counter state and firefox-esr crash in linux-next-20240613
  2024-07-12 15:38   ` Liam R. Howlett
@ 2024-07-12 16:26     ` Bert Karwatzki
  0 siblings, 0 replies; 12+ messages in thread
From: Bert Karwatzki @ 2024-07-12 16:26 UTC (permalink / raw)
  To: Liam R. Howlett, Andrew Morton, Jiri Olsa, linux-mm,
	linux-kernel, linux-next

Am Freitag, dem 12.07.2024 um 11:38 -0400 schrieb Liam R. Howlett:
> * Bert Karwatzki <spasswolf@web.de> [240712 08:18]:
> > I did some experiments on the rss counter bug. The next patch is made for linux-next-20240613
> > with commit 1c29a32ce65f4cd0f1c0f9 reverted. Then I simply inlined the code of do_vmi_unmap()
> > and do_vmi_align_munmap() into mmap_region(). This version of the code works fine and does not
> > show the rss counter bug.
>
> Are you still working with v1 of this patch set?
>
> I root-caused the rss counter issue and seg fault to the fact that next
> or prev may be expanded and I was using the new start/end on munmap() in
> the completion.  This was fixed in subsequent patches.
>
> I've sent v4 recently, but will have to a v5 to back off the removal of
> arch_unmap() for PPC.
>
> ...
>
> Thanks,
> Liam

Sorry, that was a mistake when using git send-email while working on another bug
I accidently sent this old mail. I actually have tested the v3 and v4 patchsets
on linux-next up to 2024711 with no error. I havent't tested v4 on linux-
20240712, yet, because next-20240712 has a deadlock issue:
https://lore.kernel.org/all/20240712080414.GA47643@google.com/
(while sending a mail to this issue, I accidently send the old mail, too)

Bert Karwatzki


^ permalink raw reply	[flat|nested] 12+ messages in thread

[parent not found: <ZpBhCHsG39wIVnQN@x1n>]

* Re: [PATCH] mm: Fix mmap_assert_locked() in follow_pte()
       [not found] <ZpBhCHsG39wIVnQN@x1n>
@ 2024-07-12 13:19 ` David Wang
  0 siblings, 0 replies; 12+ messages in thread
From: David Wang @ 2024-07-12 13:19 UTC (permalink / raw)
  To: peterx
  Cc: akpm, david, linux-kernel-mentees, linux-kernel, linux-mm,
	peili.dev, skhan, syzbot+35a4414f6e247f515443, syzkaller-bugs

Hi,

> Ah yes, I had one rfc patch for that, I temporarily put that aside as it
> seemed nobody cared except myself.. it's here:
> 
> https://lore.kernel.org/all/20240523223745.395337-2-peterx@redhat.com
> 
> I didn't know it can already cause real trouble.  It looks like that patch
> should fix this.
> 
> Thanks,
> 
> -- 
> Peter Xu

Just add another user scenario concering this kernel warning.
Ever since 6.10-rc1, when I suspend my system via `systemctl suspend`, nvidia gpu driver would trigger a warning:

             	 Call Trace:
             	  <TASK>
             	  ? __warn+0x7c/0x120
             	  ? follow_pte+0x15b/0x170
             	  ? report_bug+0x18d/0x1c0
             	  ? handle_bug+0x3c/0x80
             	  ? exc_invalid_op+0x13/0x60
             	  ? asm_exc_invalid_op+0x16/0x20
             	  ? follow_pte+0x15b/0x170
             	  follow_phys+0x3a/0xf0
             	  untrack_pfn+0x53/0x120
             	  unmap_single_vma+0xa6/0xe0
             	  zap_page_range_single+0xe4/0x190
             	  ? _nv002569kms+0x17b/0x210 [nvidia_modeset]
             	  ? srso_return_thunk+0x5/0x5f
             	  ? kfree+0x257/0x290
             	  unmap_mapping_range+0x10d/0x130
             	  nv_revoke_gpu_mappings_locked+0x43/0x70 [nvidia]
             	  nv_set_system_power_state+0x1c9/0x470 [nvidia]
             	  nv_procfs_write_suspend+0xd3/0x140 [nvidia]
             	  proc_reg_write+0x58/0xa0
             	  ? srso_return_thunk+0x5/0x5f
             	  vfs_write+0xf6/0x440
             	  ? __count_memcg_events+0x73/0x110
             	  ? srso_return_thunk+0x5/0x5f
             	  ? count_memcg_events.constprop.0+0x1a/0x30
             	  ? srso_return_thunk+0x5/0x5f
             	  ? handle_mm_fault+0xa9/0x2d0
             	  ? srso_return_thunk+0x5/0x5f
             	  ? preempt_count_add+0x47/0xa0
             	  ksys_write+0x63/0xe0
             	  do_syscall_64+0x4b/0x110
             	  entry_SYSCALL_64_after_hwframe+0x76/0x7e
             	 RIP: 0033:0x7f34a3914240
             	 Code: 40 00 48 8b 15 c1 9b 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 80 3d a1 23 0e 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
             	 RSP: 002b:00007ffca2aa2688 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
             	 RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f34a3914240
             	 RDX: 0000000000000008 RSI: 000055a02968ed80 RDI: 0000000000000001
             	 RBP: 000055a02968ed80 R08: 00007f34a39eecd0 R09: 00007f34a39eecd0
             	 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000008
             	 R13: 00007f34a39ef760 R14: 0000000000000008 R15: 00007f34a39ea9e0
             	  </TASK>
             	 ---[ end trace 0000000000000000 ]---
             	 PM: suspend entry (deep)

Considering out-of-tree nature of nvidia gpu driver, and nobody reported this kernel warning before with in-trees,
 I had almost convinced myself that nvidia driver may need "big" rework to live with those "PTE" changes.
So glad to see this thread of discussion/issue/fix now, I have been patching my system manually ever since 6.10-rc1,
hope things got fixed soon...


FYI
David



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: Fix mmap_assert_locked() in follow_pte()
@ 2024-07-12 12:43 Bert Karwatzki
  0 siblings, 0 replies; 12+ messages in thread
From: Bert Karwatzki @ 2024-07-12 12:43 UTC (permalink / raw)
  To: Pei Li
  Cc: Bert Karwatzki, Andrew Morton, Shuah Khan, David Hildenbrand,
	linux-mm, linux-kernel, linux-next, syzbot+35a4414f6e247f515443,
	syzkaller-bugs, linux-kernel-mentees, senozhatsky

diff --git a/mm/memory.c b/mm/memory.c
index 282203363177..2f4b4322ec0e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1817,6 +1817,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 {
 	unsigned long start = max(vma->vm_start, start_addr);
 	unsigned long end;
+	bool mm_read_locked;

 	if (start >= vma->vm_end)
 		return;
@@ -1829,11 +1830,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,

 	if (unlikely(vma->vm_flags & VM_PFNMAP)) {
 		if (!mm_wr_locked)
-			mmap_read_lock(vma->vm_mm);
+			mm_read_locked = !mmap_read_trylock(vma->vm_mm);

 		untrack_pfn(vma, 0, 0, mm_wr_locked);

-		if (!mm_wr_locked)
+		if (!mm_wr_locked && !mm_read_locked)
 			mmap_read_unlock(vma->vm_mm);
 	}


This seems to fix the issue without completely removing the locking.

Bert Karwatzki


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] mm: Fix mmap_assert_locked() in follow_pte()
@ 2024-07-11  5:13 Pei Li
  2024-07-11 21:22 ` Andrew Morton
  2024-07-11 21:33 ` David Hildenbrand
  0 siblings, 2 replies; 12+ messages in thread
From: Pei Li @ 2024-07-11  5:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, skhan, syzkaller-bugs,
	linux-kernel-mentees, syzbot+35a4414f6e247f515443, Pei Li

This patch fixes this warning by acquiring read lock before entering
untrack_pfn() while write lock is not held.

syzbot has tested the proposed patch and the reproducer did not
trigger any issue.

Reported-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=35a4414f6e247f515443
Tested-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
Signed-off-by: Pei Li <peili.dev@gmail.com>
---
Syzbot reported the following warning in follow_pte():

WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 rwsem_assert_held include/linux/rwsem.h:195 [inline]
WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 mmap_assert_locked include/linux/mmap_lock.h:65 [inline]
WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 follow_pte+0x414/0x4c0 mm/memory.c:5980

This is because we are assuming that mm->mmap_lock should be held when
entering follow_pte(). This is added in commit c5541ba378e3 (mm:
follow_pte() improvements).

However, in the following call stack, we are not acquring the lock:
 follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
 get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
 untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
 unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
 zap_page_range_single+0x326/0x560 mm/memory.c:1920

In zap_page_range_single(), we passed mm_wr_locked as false, as we do
not expect write lock to be held.
In the special case where vma->vm_flags is set as VM_PFNMAP, we are
hitting untrack_pfn() which eventually calls into follow_phys.

This patch fixes this warning by acquiring read lock before entering
untrack_pfn() while write lock is not held. 

syzbot has tested the proposed patch and the reproducer did not trigger any issue:

Tested on:

commit:         9d9a2f29 Merge tag 'mm-hotfixes-stable-2024-07-10-13-1..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=13be8021980000
kernel config:  https://syzkaller.appspot.com/x/.config?x=3456bae478301dc8
dashboard link: https://syzkaller.appspot.com/bug?extid=35a4414f6e247f515443
compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
patch:          https://syzkaller.appspot.com/x/patch.diff?x=145e3441980000

Note: testing is done by a robot and is best-effort only.
---
 mm/memory.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index d10e616d7389..75d7959b835b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1815,9 +1815,16 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 	if (vma->vm_file)
 		uprobe_munmap(vma, start, end);

-	if (unlikely(vma->vm_flags & VM_PFNMAP))
+	if (unlikely(vma->vm_flags & VM_PFNMAP)) {
+		if (!mm_wr_locked)
+			mmap_read_lock(vma->vm_mm);
+
 		untrack_pfn(vma, 0, 0, mm_wr_locked);

+		if (!mm_wr_locked)
+			mmap_read_unlock(vma->vm_mm);
+	}
+
 	if (start != end) {
 		if (unlikely(is_vm_hugetlb_page(vma))) {
 			/*

---
base-commit: 734610514cb0234763cc97ddbd235b7981889445
change-id: 20240710-bug12-fecfe4bb0dcb

Best regards,
-- 
Pei Li <peili.dev@gmail.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: Fix mmap_assert_locked() in follow_pte()
  2024-07-11  5:13 Pei Li
@ 2024-07-11 21:22 ` Andrew Morton
  2024-07-11 21:33 ` David Hildenbrand
  1 sibling, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2024-07-11 21:22 UTC (permalink / raw)
  To: Pei Li
  Cc: linux-mm, linux-kernel, skhan, syzkaller-bugs,
	linux-kernel-mentees, syzbot+35a4414f6e247f515443,
	David Hildenbrand

On Wed, 10 Jul 2024 22:13:17 -0700 Pei Li <peili.dev@gmail.com> wrote:

> This patch fixes this warning by acquiring read lock before entering
> untrack_pfn() while write lock is not held.
> 
> syzbot has tested the proposed patch and the reproducer did not
> trigger any issue.
> 

Thanks.

> ---
> Syzbot reported the following warning in follow_pte():
> 
> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 rwsem_assert_held include/linux/rwsem.h:195 [inline]
> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 mmap_assert_locked include/linux/mmap_lock.h:65 [inline]
> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 follow_pte+0x414/0x4c0 mm/memory.c:5980
> 
> This is because we are assuming that mm->mmap_lock should be held when
> entering follow_pte(). This is added in commit c5541ba378e3 (mm:
> follow_pte() improvements).
> 
> However, in the following call stack, we are not acquring the lock:
>  follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
>  get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
>  untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
>  unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
>  zap_page_range_single+0x326/0x560 mm/memory.c:1920
> 
> In zap_page_range_single(), we passed mm_wr_locked as false, as we do
> not expect write lock to be held.
> In the special case where vma->vm_flags is set as VM_PFNMAP, we are
> hitting untrack_pfn() which eventually calls into follow_phys.

I included the above (very relevant) info in the changelog.

And I added 

	Fixes: c5541ba378e3 ("mm: follow_pte() improvements")

and queued the patch for 6.10-rc7.  Hopefully David can review it for us.




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: Fix mmap_assert_locked() in follow_pte()
  2024-07-11  5:13 Pei Li
  2024-07-11 21:22 ` Andrew Morton
@ 2024-07-11 21:33 ` David Hildenbrand
  2024-07-11 21:45   ` David Hildenbrand
  2024-07-12  8:04   ` Sergey Senozhatsky
  1 sibling, 2 replies; 12+ messages in thread
From: David Hildenbrand @ 2024-07-11 21:33 UTC (permalink / raw)
  To: Pei Li, Andrew Morton
  Cc: linux-mm, linux-kernel, skhan, syzkaller-bugs,
	linux-kernel-mentees, syzbot+35a4414f6e247f515443

On 11.07.24 07:13, Pei Li wrote:
> This patch fixes this warning by acquiring read lock before entering
> untrack_pfn() while write lock is not held.
> 
> syzbot has tested the proposed patch and the reproducer did not
> trigger any issue.
> 
> Reported-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=35a4414f6e247f515443
> Tested-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
> Signed-off-by: Pei Li <peili.dev@gmail.com>
> ---
> Syzbot reported the following warning in follow_pte():
> 
> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 rwsem_assert_held include/linux/rwsem.h:195 [inline]
> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 mmap_assert_locked include/linux/mmap_lock.h:65 [inline]
> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 follow_pte+0x414/0x4c0 mm/memory.c:5980
> 
> This is because we are assuming that mm->mmap_lock should be held when
> entering follow_pte(). This is added in commit c5541ba378e3 (mm:
> follow_pte() improvements).
> 
> However, in the following call stack, we are not acquring the lock:
>   follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
>   get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
>   untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
>   unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
>   zap_page_range_single+0x326/0x560 mm/memory.c:1920

That implies that unmap_vmas() is called without the mmap lock in read 
mode, correct?

Do we know how this happens?

* exit_mmap() holds the mmap lock in read mode
* unmap_region is documented to hold the mmap lock in read mode

> 
> In zap_page_range_single(), we passed mm_wr_locked as false, as we do
> not expect write lock to be held.
> In the special case where vma->vm_flags is set as VM_PFNMAP, we are
> hitting untrack_pfn() which eventually calls into follow_phys.
> 
> This patch fixes this warning by acquiring read lock before entering
> untrack_pfn() while write lock is not held.
> 
> syzbot has tested the proposed patch and the reproducer did not trigger any issue:
> 
> Tested on:
> 
> commit:         9d9a2f29 Merge tag 'mm-hotfixes-stable-2024-07-10-13-1..
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=13be8021980000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=3456bae478301dc8
> dashboard link: https://syzkaller.appspot.com/bug?extid=35a4414f6e247f515443
> compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
> patch:          https://syzkaller.appspot.com/x/patch.diff?x=145e3441980000
> 
> Note: testing is done by a robot and is best-effort only.
> ---
>   mm/memory.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index d10e616d7389..75d7959b835b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1815,9 +1815,16 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>   	if (vma->vm_file)
>   		uprobe_munmap(vma, start, end);
>   
> -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> +	if (unlikely(vma->vm_flags & VM_PFNMAP)) {
> +		if (!mm_wr_locked)
> +			mmap_read_lock(vma->vm_mm);
> +
>   		untrack_pfn(vma, 0, 0, mm_wr_locked);
>   
> +		if (!mm_wr_locked)
> +			mmap_read_unlock(vma->vm_mm);
> +	}
> +
>   	if (start != end) {
>   		if (unlikely(is_vm_hugetlb_page(vma))) {

I'm not sure if this is the right fix. I like to understand how we end 
up without the mmap lock at least in read mode in that path?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: Fix mmap_assert_locked() in follow_pte()
  2024-07-11 21:33 ` David Hildenbrand
@ 2024-07-11 21:45   ` David Hildenbrand
  2024-07-11 21:51     ` David Hildenbrand
  2024-07-12  8:04   ` Sergey Senozhatsky
  1 sibling, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2024-07-11 21:45 UTC (permalink / raw)
  To: Pei Li, Andrew Morton
  Cc: linux-mm, linux-kernel, skhan, syzkaller-bugs,
	linux-kernel-mentees, syzbot+35a4414f6e247f515443, Peter Xu

On 11.07.24 23:33, David Hildenbrand wrote:
> On 11.07.24 07:13, Pei Li wrote:
>> This patch fixes this warning by acquiring read lock before entering
>> untrack_pfn() while write lock is not held.
>>
>> syzbot has tested the proposed patch and the reproducer did not
>> trigger any issue.
>>
>> Reported-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
>> Closes: https://syzkaller.appspot.com/bug?extid=35a4414f6e247f515443
>> Tested-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
>> Signed-off-by: Pei Li <peili.dev@gmail.com>
>> ---
>> Syzbot reported the following warning in follow_pte():
>>
>> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 rwsem_assert_held include/linux/rwsem.h:195 [inline]
>> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 mmap_assert_locked include/linux/mmap_lock.h:65 [inline]
>> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 follow_pte+0x414/0x4c0 mm/memory.c:5980
>>
>> This is because we are assuming that mm->mmap_lock should be held when
>> entering follow_pte(). This is added in commit c5541ba378e3 (mm:
>> follow_pte() improvements).
>>
>> However, in the following call stack, we are not acquring the lock:
>>    follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
>>    get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
>>    untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
>>    unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
>>    zap_page_range_single+0x326/0x560 mm/memory.c:1920
> 
> That implies that unmap_vmas() is called without the mmap lock in read
> mode, correct?
> 
> Do we know how this happens?
> 
> * exit_mmap() holds the mmap lock in read mode
> * unmap_region is documented to hold the mmap lock in read mode

I think this is it (missed the call from zap_page_range_single()):

  follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
  get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
  untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
  unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
  zap_page_range_single+0x326/0x560 mm/memory.c:1920
  unmap_mapping_range_vma mm/memory.c:3684 [inline]
  unmap_mapping_range_tree mm/memory.c:3701 [inline]
  unmap_mapping_pages mm/memory.c:3767 [inline]
  unmap_mapping_range+0x1ee/0x280 mm/memory.c:3804
  truncate_pagecache+0x53/0x90 mm/truncate.c:731
  simple_setattr+0xf2/0x120 fs/libfs.c:886
  notify_change+0xec6/0x11f0 fs/attr.c:499
  do_truncate+0x15c/0x220 fs/open.c:65
  handle_truncate fs/namei.c:3308 [inline]

I think Peter recently questioned whether untrack_pfn() should be even 
called from the place, but I might misremember things.

Fix should work (I suspect we are not violating some locking rules?), 
PFNMAP should not happen there too often that we really care.

If everything fails, we could drop the assert, but I'm hoping we can 
avoid that. We really want most users of follow_pte() to do the right thing.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: Fix mmap_assert_locked() in follow_pte()
  2024-07-11 21:45   ` David Hildenbrand
@ 2024-07-11 21:51     ` David Hildenbrand
  0 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2024-07-11 21:51 UTC (permalink / raw)
  To: Pei Li, Andrew Morton
  Cc: linux-mm, linux-kernel, skhan, syzkaller-bugs,
	linux-kernel-mentees, syzbot+35a4414f6e247f515443, Peter Xu

On 11.07.24 23:45, David Hildenbrand wrote:
> On 11.07.24 23:33, David Hildenbrand wrote:
>> On 11.07.24 07:13, Pei Li wrote:
>>> This patch fixes this warning by acquiring read lock before entering
>>> untrack_pfn() while write lock is not held.
>>>
>>> syzbot has tested the proposed patch and the reproducer did not
>>> trigger any issue.
>>>
>>> Reported-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
>>> Closes: https://syzkaller.appspot.com/bug?extid=35a4414f6e247f515443
>>> Tested-by: syzbot+35a4414f6e247f515443@syzkaller.appspotmail.com
>>> Signed-off-by: Pei Li <peili.dev@gmail.com>
>>> ---
>>> Syzbot reported the following warning in follow_pte():
>>>
>>> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 rwsem_assert_held include/linux/rwsem.h:195 [inline]
>>> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 mmap_assert_locked include/linux/mmap_lock.h:65 [inline]
>>> WARNING: CPU: 3 PID: 5192 at include/linux/rwsem.h:195 follow_pte+0x414/0x4c0 mm/memory.c:5980
>>>
>>> This is because we are assuming that mm->mmap_lock should be held when
>>> entering follow_pte(). This is added in commit c5541ba378e3 (mm:
>>> follow_pte() improvements).
>>>
>>> However, in the following call stack, we are not acquring the lock:
>>>     follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
>>>     get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
>>>     untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
>>>     unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
>>>     zap_page_range_single+0x326/0x560 mm/memory.c:1920
>>
>> That implies that unmap_vmas() is called without the mmap lock in read
>> mode, correct?
>>
>> Do we know how this happens?
>>
>> * exit_mmap() holds the mmap lock in read mode
>> * unmap_region is documented to hold the mmap lock in read mode
> 
> I think this is it (missed the call from zap_page_range_single()):
> 
>    follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
>    get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
>    untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
>    unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
>    zap_page_range_single+0x326/0x560 mm/memory.c:1920
>    unmap_mapping_range_vma mm/memory.c:3684 [inline]
>    unmap_mapping_range_tree mm/memory.c:3701 [inline]
>    unmap_mapping_pages mm/memory.c:3767 [inline]
>    unmap_mapping_range+0x1ee/0x280 mm/memory.c:3804
>    truncate_pagecache+0x53/0x90 mm/truncate.c:731
>    simple_setattr+0xf2/0x120 fs/libfs.c:886
>    notify_change+0xec6/0x11f0 fs/attr.c:499
>    do_truncate+0x15c/0x220 fs/open.c:65
>    handle_truncate fs/namei.c:3308 [inline]
> 
> I think Peter recently questioned whether untrack_pfn() should be even
> called from the place, but I might misremember things.
> 
> Fix should work (I suspect we are not violating some locking rules?),
> PFNMAP should not happen there too often that we really care.

... thinking again, likely we reach this point with "!mm_wr_locked" and 
the mmap lock already held in read mode. So I suspect the fix won't work 
as is.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: Fix mmap_assert_locked() in follow_pte()
  2024-07-11 21:33 ` David Hildenbrand
  2024-07-11 21:45   ` David Hildenbrand
@ 2024-07-12  8:04   ` Sergey Senozhatsky
  1 sibling, 0 replies; 12+ messages in thread
From: Sergey Senozhatsky @ 2024-07-12  8:04 UTC (permalink / raw)
  To: Pei Li, Andrew Morton, David Hildenbrand
  Cc: linux-mm, linux-kernel, skhan, syzkaller-bugs,
	linux-kernel-mentees, syzbot+35a4414f6e247f515443

On (24/07/11 23:33), David Hildenbrand wrote:
[..]
> > @@ -1815,9 +1815,16 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> >   	if (vma->vm_file)
> >   		uprobe_munmap(vma, start, end);
> > -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> > +	if (unlikely(vma->vm_flags & VM_PFNMAP)) {
> > +		if (!mm_wr_locked)
> > +			mmap_read_lock(vma->vm_mm);
> > +
> >   		untrack_pfn(vma, 0, 0, mm_wr_locked);
> > +		if (!mm_wr_locked)
> > +			mmap_read_unlock(vma->vm_mm);
> > +	}
> > +
> >   	if (start != end) {
> >   		if (unlikely(is_vm_hugetlb_page(vma))) {
>
> I'm not sure if this is the right fix. I like to understand how we end up
> without the mmap lock at least in read mode in that path?

I suspect this is causing a deadlock:

[   10.263161] ============================================
[   10.263165] WARNING: possible recursive locking detected
[   10.263170] 6.10.0-rc7-next-20240712+ #645 Tainted: G                 N
[   10.263177] --------------------------------------------
[   10.263179] (direxec)/166 is trying to acquire lock:
[   10.263184] ffff88810b4f0198 (&mm->mmap_lock){++++}-{3:3}, at: mmap_read_lock+0x12/0x40
[   10.263217]
[   10.263217] but task is already holding lock:
[   10.263219] ffff88810b4f0198 (&mm->mmap_lock){++++}-{3:3}, at: exit_mmap+0x9c/0x830
[   10.263238]
[   10.263238] other info that might help us debug this:
[   10.263241]  Possible unsafe locking scenario:
[   10.263241]
[   10.263243]        CPU0
[   10.263245]        ----
[   10.263247]   lock(&mm->mmap_lock);
[   10.263252]   lock(&mm->mmap_lock);
[   10.263257]
[   10.263257]  *** DEADLOCK ***
[   10.263257]
[   10.263259]  May be due to missing lock nesting notation
[   10.263259]
[   10.263262] 3 locks held by (direxec)/166:
[   10.263267]  #0: ffff88810b4e8548 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0x70/0x1110
[   10.263286]  #1: ffff88810b4e85e0 (&sig->exec_update_lock){+.+.}-{3:3}, at: exec_mmap+0x9f/0x510
[   10.263302]  #2: ffff88810b4f0198 (&mm->mmap_lock){++++}-{3:3}, at: exit_mmap+0x9c/0x830
[   10.263318]
[   10.263318] stack backtrace:
[   10.263329] CPU: 6 UID: 0 PID: 166 Comm: (direxec) Tainted: G                 N 6.10.0-rc7-next-20240712+ #645
[   10.263340] Tainted: [N]=TEST
[   10.263349] Call Trace:
[   10.263355]  <TASK>
[   10.263360]  dump_stack_lvl+0xa3/0xeb
[   10.263375]  print_deadlock_bug+0x4d5/0x680
[   10.263387]  __lock_acquire+0x65fb/0x7830
[   10.263408]  ? lock_is_held_type+0xdd/0x150
[   10.263425]  lock_acquire+0x14c/0x3e0
[   10.263433]  ? mmap_read_lock+0x12/0x40
[   10.263445]  ? lock_is_held_type+0xdd/0x150
[   10.263454]  down_read+0x58/0x9a0
[   10.263461]  ? mmap_read_lock+0x12/0x40
[   10.263476]  mmap_read_lock+0x12/0x40
[   10.263485]  unmap_single_vma+0x1bf/0x240
[   10.263497]  unmap_vmas+0x146/0x1c0
[   10.263511]  exit_mmap+0x13d/0x830
[   10.263533]  __mmput+0xc2/0x2c0
[   10.263556]  exec_mmap+0x4cb/0x510
[   10.263580]  begin_new_exec+0xfe6/0x1ba0
[   10.263612]  load_elf_binary+0x797/0x22a0
[   10.263637]  ? load_misc_binary+0x53a/0x930
[   10.263656]  ? lock_release+0x50f/0x830
[   10.263673]  ? bprm_execve+0x6d7/0x1110
[   10.263693]  bprm_execve+0x70d/0x1110
[   10.263730]  do_execveat_common+0x44b/0x600
[   10.263745]  __x64_sys_execve+0x8e/0xa0
[   10.263754]  do_syscall_64+0x71/0x110
[   10.263764]  entry_SYSCALL_64_after_hwframe+0x4b/0x53


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-07-12 16:26 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-12 12:17 [PATCH] mm: Fix mmap_assert_locked() in follow_pte() Bert Karwatzki
2024-07-12 12:17 ` commit 1c29a32ce65f4cd0f1c causes Bad rss-counter state and firefox-esr crash in linux-next-20240613 Bert Karwatzki
2024-07-12 15:38   ` Liam R. Howlett
2024-07-12 16:26     ` Bert Karwatzki
     [not found] <ZpBhCHsG39wIVnQN@x1n>
2024-07-12 13:19 ` [PATCH] mm: Fix mmap_assert_locked() in follow_pte() David Wang
  -- strict thread matches above, loose matches on Subject: below --
2024-07-12 12:43 Bert Karwatzki
2024-07-11  5:13 Pei Li
2024-07-11 21:22 ` Andrew Morton
2024-07-11 21:33 ` David Hildenbrand
2024-07-11 21:45   ` David Hildenbrand
2024-07-11 21:51     ` David Hildenbrand
2024-07-12  8:04   ` Sergey Senozhatsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox