* [PATCH linux-next] mm/madvise: prefer VMA lock for MADV_REMOVE
@ 2026-01-14 3:24 wang.yaxin
2026-01-14 4:18 ` Matthew Wilcox
` (3 more replies)
0 siblings, 4 replies; 6+ messages in thread
From: wang.yaxin @ 2026-01-14 3:24 UTC (permalink / raw)
To: akpm, liam.howlett, lorenzo.stoakes, david
Cc: vbabka, jannh, linux-mm, linux-kernel, xu.xin16, yang.yang29,
wang.yaxin, fan.yu9, he.peilin, tu.qiang35, qiu.yutan,
jiang.kun2, lu.zhongjun
From: Jiang Kun <jiang.kun2@zte.com.cn>
MADV_REMOVE currently runs under the process-wide mmap_read_lock() and
temporarily drops and reacquires it around filesystem hole-punching.
For single-VMA, local-mm, non-UFFD-armed ranges we can safely operate
under the finer-grained per-VMA read lock to reduce contention and lock
hold time, while preserving semantics.
This patch:
- Switches MADV_REMOVE to prefer MADVISE_VMA_READ_LOCK via get_lock_mode().
- Adds a branch in madvise_remove():
* Under VMA lock: avoid mark_mmap_lock_dropped() and mmap lock churn;
take a file reference and call vfs_fallocate() directly.
* Under mmap read lock fallback: preserve existing behavior including
userfaultfd_remove() coordination and temporary mmap_read_unlock/lock
around vfs_fallocate().
Constraints and fallback:
- try_vma_read_lock() enforces single VMA, local mm, and userfaultfd
not armed (userfaultfd_armed(vma) == false). If any condition fails,
we fall back to mmap_read_lock(mm) and use the original path.
- Semantics are unchanged: permission checks, VM_LOCKED rejection,
shared-may-write requirement, error propagation all remain as before.
Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn>
Signed-off-by: Yaxin Wang <wang.yaxin@zte.com.cn>
---
mm/madvise.c | 31 +++++++++++++++++++++++++------
1 file changed, 25 insertions(+), 6 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 6bf7009fa5ce..279ec5169879 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1015,7 +1015,19 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
unsigned long start = madv_behavior->range.start;
unsigned long end = madv_behavior->range.end;
- mark_mmap_lock_dropped(madv_behavior);
+ /*
+ * Prefer VMA read lock path: when operating under VMA lock, we avoid
+ * dropping/reacquiring the mmap lock and directly perform the filesystem
+ * operation while the VMA is read-locked. We still take and drop a file
+ * reference to protect against concurrent file changes.
+ *
+ * When operating under mmap read lock (fallback), preserve existing
+ * behaviour: mark lock dropped, coordinate with userfaultfd_remove(),
+ * temporarily drop mmap_read_lock around vfs_fallocate(), and then
+ * reacquire it.
+ */
+ if (madv_behavior->lock_mode == MADVISE_MMAP_READ_LOCK)
+ mark_mmap_lock_dropped(madv_behavior);
if (vma->vm_flags & VM_LOCKED)
return -EINVAL;
@@ -1033,12 +1045,19 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
/*
- * Filesystem's fallocate may need to take i_rwsem. We need to
- * explicitly grab a reference because the vma (and hence the
- * vma's reference to the file) can go away as soon as we drop
- * mmap_lock.
+ * Execute filesystem punch-hole under appropriate locking.
+ * - VMA lock path: no mmap lock held; call vfs_fallocate() directly.
+ * - mmap lock path: follow existing protocol including UFFD coordination
+ * and temporary mmap_read_unlock/lock around the filesystem call.
*/
get_file(f);
+ if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
+ error = vfs_fallocate(f,
+ FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+ offset, end - start);
+ fput(f);
+ return error;
+ }
if (userfaultfd_remove(vma, start, end)) {
/* mmap_lock was not released by userfaultfd_remove() */
mmap_read_unlock(mm);
@@ -1754,7 +1773,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
return MADVISE_NO_LOCK;
switch (madv_behavior->behavior) {
- case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_COLD:
case MADV_PAGEOUT:
@@ -1762,6 +1780,7 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
case MADV_POPULATE_WRITE:
case MADV_COLLAPSE:
return MADVISE_MMAP_READ_LOCK;
+ case MADV_REMOVE:
case MADV_GUARD_INSTALL:
case MADV_GUARD_REMOVE:
case MADV_DONTNEED:
--
2.43.5
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH linux-next] mm/madvise: prefer VMA lock for MADV_REMOVE
2026-01-14 3:24 [PATCH linux-next] mm/madvise: prefer VMA lock for MADV_REMOVE wang.yaxin
@ 2026-01-14 4:18 ` Matthew Wilcox
2026-01-14 7:00 ` wang.yaxin
2026-01-14 4:19 ` Matthew Wilcox
` (2 subsequent siblings)
3 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2026-01-14 4:18 UTC (permalink / raw)
To: wang.yaxin
Cc: akpm, liam.howlett, lorenzo.stoakes, david, vbabka, jannh,
linux-mm, linux-kernel, xu.xin16, yang.yang29, fan.yu9,
he.peilin, tu.qiang35, qiu.yutan, jiang.kun2, lu.zhongjun
On Wed, Jan 14, 2026 at 11:24:17AM +0800, wang.yaxin@zte.com.cn wrote:
> - mark_mmap_lock_dropped(madv_behavior);
> + /*
> + * Prefer VMA read lock path: when operating under VMA lock, we avoid
> + * dropping/reacquiring the mmap lock and directly perform the filesystem
> + * operation while the VMA is read-locked. We still take and drop a file
> + * reference to protect against concurrent file changes.
How does taking a reference prevent file changes? What do you mean by
"file changes" anyway?
> + * When operating under mmap read lock (fallback), preserve existing
> + * behaviour: mark lock dropped, coordinate with userfaultfd_remove(),
> + * temporarily drop mmap_read_lock around vfs_fallocate(), and then
> + * reacquire it.
This is not the way to write an inline comment; that's how you describe
what you've done in the changelog.
> @@ -1033,12 +1045,19 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
> + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>
> /*
> - * Filesystem's fallocate may need to take i_rwsem. We need to
> - * explicitly grab a reference because the vma (and hence the
> - * vma's reference to the file) can go away as soon as we drop
> - * mmap_lock.
> + * Execute filesystem punch-hole under appropriate locking.
> + * - VMA lock path: no mmap lock held; call vfs_fallocate() directly.
> + * - mmap lock path: follow existing protocol including UFFD coordination
> + * and temporary mmap_read_unlock/lock around the filesystem call.
Again, I don't like what you've done here with the comments.
* Re: [PATCH linux-next] mm/madvise: prefer VMA lock for MADV_REMOVE
2026-01-14 3:24 [PATCH linux-next] mm/madvise: prefer VMA lock for MADV_REMOVE wang.yaxin
2026-01-14 4:18 ` Matthew Wilcox
@ 2026-01-14 4:19 ` Matthew Wilcox
2026-01-14 8:03 ` [syzbot ci] " syzbot ci
2026-01-14 9:04 ` [PATCH linux-next] " Lorenzo Stoakes
3 siblings, 0 replies; 6+ messages in thread
From: Matthew Wilcox @ 2026-01-14 4:19 UTC (permalink / raw)
To: wang.yaxin
Cc: akpm, liam.howlett, lorenzo.stoakes, david, vbabka, jannh,
linux-mm, linux-kernel, xu.xin16, yang.yang29, fan.yu9,
he.peilin, tu.qiang35, qiu.yutan, jiang.kun2, lu.zhongjun
On Wed, Jan 14, 2026 at 11:24:17AM +0800, wang.yaxin@zte.com.cn wrote:
> From: Jiang Kun <jiang.kun2@zte.com.cn>
>
> MADV_REMOVE currently runs under the process-wide mmap_read_lock() and
> temporarily drops and reacquires it around filesystem hole-punching.
> For single-VMA, local-mm, non-UFFD-armed ranges we can safely operate
> under the finer-grained per-VMA read lock to reduce contention and lock
> hold time, while preserving semantics.
Oh, and do you have any performance measurements? You're introducing
complexity, so it'd be good to quantify what performance we're getting
in return for this complexity. A real workload would be best, but even
an artificial benchmark would be better than nothing.
* Re: [PATCH linux-next] mm/madvise: prefer VMA lock for MADV_REMOVE
2026-01-14 4:18 ` Matthew Wilcox
@ 2026-01-14 7:00 ` wang.yaxin
0 siblings, 0 replies; 6+ messages in thread
From: wang.yaxin @ 2026-01-14 7:00 UTC (permalink / raw)
To: willy
Cc: akpm, liam.howlett, lorenzo.stoakes, david, vbabka, jannh,
linux-mm, linux-kernel, xu.xin16, yang.yang29, fan.yu9,
he.peilin, tu.qiang35, qiu.yutan, jiang.kun2, lu.zhongjun
>> - mark_mmap_lock_dropped(madv_behavior);
>> + /*
>> + * Prefer VMA read lock path: when operating under VMA lock, we avoid
>> + * dropping/reacquiring the mmap lock and directly perform the filesystem
>> + * operation while the VMA is read-locked. We still take and drop a file
>> + * reference to protect against concurrent file changes.
>
>How does taking a reference prevent file changes? What do you mean by
"file changes" anyway?
Thanks for the review.
Taking a reference with get_file(f) does not prevent file content or
metadata changes; it pins the struct file so the pointer remains valid
while we temporarily drop mmap_lock. The VMA can be split, unmapped, or
otherwise changed during that window, and without a ref the vma->vm_file
could be freed.
By “file changes” I meant lifetime changes to the VMA → file association
(e.g. concurrent munmap/mremap or VMA splits) while mmap_lock is dropped,
not changes to the file’s data or inode state. Those are serialized by
the filesystem (e.g. inode locks) and are unrelated to the refcount.
>> + * When operating under mmap read lock (fallback), preserve existing
>> + * behaviour: mark lock dropped, coordinate with userfaultfd_remove(),
>> + * temporarily drop mmap_read_lock around vfs_fallocate(), and then
>> + * reacquire it.
>
>This is not the way to write an inline comment; that's how you describe
>what you've done in the changelog.
>
>> @@ -1033,12 +1045,19 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
>> + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>>
>> /*
>> - * Filesystem's fallocate may need to take i_rwsem. We need to
>> - * explicitly grab a reference because the vma (and hence the
>> - * vma's reference to the file) can go away as soon as we drop
>> - * mmap_lock.
>> + * Execute filesystem punch-hole under appropriate locking.
>> + * - VMA lock path: no mmap lock held; call vfs_fallocate() directly.
>> + * - mmap lock path: follow existing protocol including UFFD coordination
>> + * and temporary mmap_read_unlock/lock around the filesystem call.
>
>Again, I don't like what you've done here with the comments.
I’ll fix the inline comment to say: “Pin struct file across potential
mmap_lock drop so the file pointer remains valid even if the VMA is
modified or freed,” and remove the changelog-style prose.
Thanks,
Jiang Kun
* [syzbot ci] Re: mm/madvise: prefer VMA lock for MADV_REMOVE
2026-01-14 3:24 [PATCH linux-next] mm/madvise: prefer VMA lock for MADV_REMOVE wang.yaxin
2026-01-14 4:18 ` Matthew Wilcox
2026-01-14 4:19 ` Matthew Wilcox
@ 2026-01-14 8:03 ` syzbot ci
2026-01-14 9:04 ` [PATCH linux-next] " Lorenzo Stoakes
3 siblings, 0 replies; 6+ messages in thread
From: syzbot ci @ 2026-01-14 8:03 UTC (permalink / raw)
To: akpm, david, fan.yu9, he.peilin, jannh, jiang.kun2, liam.howlett,
linux-kernel, linux-mm, lorenzo.stoakes, lu.zhongjun, qiu.yutan,
tu.qiang35, vbabka, wang.yaxin, xu.xin16, yang.yang29
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v1] mm/madvise: prefer VMA lock for MADV_REMOVE
https://lore.kernel.org/all/202601141124178748cM66DJW2fzNea7Uym1mG@zte.com.cn
* [PATCH linux-next] mm/madvise: prefer VMA lock for MADV_REMOVE
and found the following issue:
possible deadlock in blkdev_fallocate
Full report is available here:
https://ci.syzbot.org/series/30acb9df-ca55-4cbf-81ed-89b84da8edc1
***
possible deadlock in blkdev_fallocate
tree: linux-next
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next
base: 4c81c9dc03a068745eeb56984ea9836d86fa3eb2
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/eaa5a44b-5d87-4428-b098-72b46efdb983/config
C repro: https://ci.syzbot.org/findings/e3240f02-6d2c-4c6d-bc6e-e7f4be22e433/c_repro
syz repro: https://ci.syzbot.org/findings/e3240f02-6d2c-4c6d-bc6e-e7f4be22e433/syz_repro
======================================================
WARNING: possible circular locking dependency detected
syzkaller #0 Not tainted
------------------------------------------------------
syz.0.17/5976 is trying to acquire lock:
ffff888167a287a8 (&sb->s_type->i_mutex_key#10){++++}-{4:4}, at: inode_lock include/linux/fs.h:1027 [inline]
ffff888167a287a8 (&sb->s_type->i_mutex_key#10){++++}-{4:4}, at: blkdev_fallocate+0x260/0x530 block/fops.c:908
but task is already holding lock:
ffff8881043c3b88 (vm_lock){++++}-{0:0}, at: lock_vma_under_rcu+0x1d1/0x500 mm/mmap_lock.c:259
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (vm_lock){++++}-{0:0}:
__vma_enter_locked+0x243/0x710 mm/mmap_lock.c:72
__vma_start_write+0x23/0x140 mm/mmap_lock.c:104
vma_start_write include/linux/mmap_lock.h:213 [inline]
mprotect_fixup+0x5e1/0xa50 mm/mprotect.c:768
setup_arg_pages+0x565/0xae0 fs/exec.c:670
load_elf_binary+0xc5e/0x2980 fs/binfmt_elf.c:1028
search_binary_handler fs/exec.c:1669 [inline]
exec_binprm fs/exec.c:1701 [inline]
bprm_execve+0x93d/0x1410 fs/exec.c:1753
kernel_execve+0x8ef/0x9e0 fs/exec.c:1919
try_to_run_init_process+0x13/0x60 init/main.c:1506
kernel_init+0xad/0x1d0 init/main.c:1634
ret_from_fork+0x51b/0xa40 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
-> #1 (&mm->mmap_lock){++++}-{4:4}:
__might_fault+0xcb/0x130 mm/memory.c:7174
_copy_to_iter+0xf9/0x17d0 lib/iov_iter.c:196
copy_page_to_iter+0x10c/0x1c0 lib/iov_iter.c:374
copy_folio_to_iter include/linux/uio.h:204 [inline]
filemap_read+0x811/0x1230 mm/filemap.c:2851
blkdev_read_iter+0x30a/0x440 block/fops.c:856
new_sync_read fs/read_write.c:491 [inline]
vfs_read+0x582/0xa70 fs/read_write.c:572
ksys_read+0x150/0x270 fs/read_write.c:715
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #0 (&sb->s_type->i_mutex_key#10){++++}-{4:4}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x15a5/0x2cf0 kernel/locking/lockdep.c:5237
lock_acquire+0x106/0x330 kernel/locking/lockdep.c:5868
down_write+0x96/0x200 kernel/locking/rwsem.c:1590
inode_lock include/linux/fs.h:1027 [inline]
blkdev_fallocate+0x260/0x530 block/fops.c:908
vfs_fallocate+0x669/0x7e0 fs/open.c:339
madvise_remove mm/madvise.c:1055 [inline]
madvise_vma_behavior+0x120d/0x4460 mm/madvise.c:1379
madvise_walk_vmas+0x90e/0xaf0 mm/madvise.c:1706
madvise_do_behavior+0x386/0x540 mm/madvise.c:1956
do_madvise+0x1fa/0x2e0 mm/madvise.c:2049
__do_sys_madvise mm/madvise.c:2058 [inline]
__se_sys_madvise mm/madvise.c:2056 [inline]
__x64_sys_madvise+0xa6/0xc0 mm/madvise.c:2056
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
&sb->s_type->i_mutex_key#10 --> &mm->mmap_lock --> vm_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
rlock(vm_lock);
lock(&mm->mmap_lock);
lock(vm_lock);
lock(&sb->s_type->i_mutex_key#10);
*** DEADLOCK ***
1 lock held by syz.0.17/5976:
#0: ffff8881043c3b88 (vm_lock){++++}-{0:0}, at: lock_vma_under_rcu+0x1d1/0x500 mm/mmap_lock.c:259
stack backtrace:
CPU: 1 UID: 0 PID: 5976 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
print_circular_bug+0x2e1/0x300 kernel/locking/lockdep.c:2043
check_noncircular+0x12e/0x150 kernel/locking/lockdep.c:2175
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x15a5/0x2cf0 kernel/locking/lockdep.c:5237
lock_acquire+0x106/0x330 kernel/locking/lockdep.c:5868
down_write+0x96/0x200 kernel/locking/rwsem.c:1590
inode_lock include/linux/fs.h:1027 [inline]
blkdev_fallocate+0x260/0x530 block/fops.c:908
vfs_fallocate+0x669/0x7e0 fs/open.c:339
madvise_remove mm/madvise.c:1055 [inline]
madvise_vma_behavior+0x120d/0x4460 mm/madvise.c:1379
madvise_walk_vmas+0x90e/0xaf0 mm/madvise.c:1706
madvise_do_behavior+0x386/0x540 mm/madvise.c:1956
do_madvise+0x1fa/0x2e0 mm/madvise.c:2049
__do_sys_madvise mm/madvise.c:2058 [inline]
__se_sys_madvise mm/madvise.c:2056 [inline]
__x64_sys_madvise+0xa6/0xc0 mm/madvise.c:2056
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5ceb59acb9
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd12a05518 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007f5ceb815fa0 RCX: 00007f5ceb59acb9
RDX: 0000000000000009 RSI: 0000000000004000 RDI: 0000200000119000
RBP: 00007f5ceb608bf7 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f5ceb815fac R14: 00007f5ceb815fa0 R15: 00007f5ceb815fa0
</TASK>
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
* Re: [PATCH linux-next] mm/madvise: prefer VMA lock for MADV_REMOVE
2026-01-14 3:24 [PATCH linux-next] mm/madvise: prefer VMA lock for MADV_REMOVE wang.yaxin
` (2 preceding siblings ...)
2026-01-14 8:03 ` [syzbot ci] " syzbot ci
@ 2026-01-14 9:04 ` Lorenzo Stoakes
3 siblings, 0 replies; 6+ messages in thread
From: Lorenzo Stoakes @ 2026-01-14 9:04 UTC (permalink / raw)
To: wang.yaxin
Cc: akpm, liam.howlett, david, vbabka, jannh, linux-mm, linux-kernel,
xu.xin16, yang.yang29, fan.yu9, he.peilin, tu.qiang35, qiu.yutan,
jiang.kun2, lu.zhongjun
Not sure why you're basing on linux-next, use mm-unstable.
However, syzbot is immediately reporting a deadlock bug here. Maybe a 'neat
hack' to get the bots to actually check things...? Very strange that
happened for something not in any tree.
See below for my rough analysis of it.
Also can you make sure to test changes to _really fiddly_ locks like this
with lockdep please?
Overall I don't think this change is justified in any way even if you
somehow resolve the existing lock mess unless you can provide significant
justification so am minded for this concept to be rejected.
But obviously this patch as-is is broken, so no to it clearly.
On Wed, Jan 14, 2026 at 11:24:17AM +0800, wang.yaxin@zte.com.cn wrote:
> From: Jiang Kun <jiang.kun2@zte.com.cn>
>
> MADV_REMOVE currently runs under the process-wide mmap_read_lock() and
> temporarily drops and reacquires it around filesystem hole-punching.
> For single-VMA, local-mm, non-UFFD-armed ranges we can safely operate
> under the finer-grained per-VMA read lock to reduce contention and lock
> hold time, while preserving semantics.
>
> This patch:
> - Switches MADV_REMOVE to prefer MADVISE_VMA_READ_LOCK via get_lock_mode().
> - Adds a branch in madvise_remove():
> * Under VMA lock: avoid mark_mmap_lock_dropped() and mmap lock churn;
> take a file reference and call vfs_fallocate() directly.
> * Under mmap read lock fallback: preserve existing behavior including
> userfaultfd_remove() coordination and temporary mmap_read_unlock/lock
> around vfs_fallocate().
>
> Constraints and fallback:
> - try_vma_read_lock() enforces single VMA, local mm, and userfaultfd
> not armed (userfaultfd_armed(vma) == false). If any condition fails,
> we fall back to mmap_read_lock(mm) and use the original path.
> - Semantics are unchanged: permission checks, VM_LOCKED rejection,
> shared-may-write requirement, error propagation all remain as before.
You're not giving any justification for doing this, you really do have to
for this to be acceptable, as Matthew notes. Especially given the locking
complexity here.
>
> Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn>
> Signed-off-by: Yaxin Wang <wang.yaxin@zte.com.cn>
> ---
> mm/madvise.c | 31 +++++++++++++++++++++++++------
> 1 file changed, 25 insertions(+), 6 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6bf7009fa5ce..279ec5169879 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1015,7 +1015,19 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
> unsigned long start = madv_behavior->range.start;
> unsigned long end = madv_behavior->range.end;
>
> - mark_mmap_lock_dropped(madv_behavior);
> + /*
> + * Prefer VMA read lock path: when operating under VMA lock, we avoid
> + * dropping/reacquiring the mmap lock and directly perform the filesystem
> + * operation while the VMA is read-locked. We still take and drop a file
> + * reference to protect against concurrent file changes.
Hmm sounds iffy.
> + *
> + * When operating under mmap read lock (fallback), preserve existing
> + * behaviour: mark lock dropped, coordinate with userfaultfd_remove(),
> + * temporarily drop mmap_read_lock around vfs_fallocate(), and then
> + * reacquire it.
> + */
> + if (madv_behavior->lock_mode == MADVISE_MMAP_READ_LOCK)
> + mark_mmap_lock_dropped(madv_behavior);
>
> if (vma->vm_flags & VM_LOCKED)
> return -EINVAL;
> @@ -1033,12 +1045,19 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
> + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>
> /*
> - * Filesystem's fallocate may need to take i_rwsem. We need to
> - * explicitly grab a reference because the vma (and hence the
> - * vma's reference to the file) can go away as soon as we drop
> - * mmap_lock.
> + * Execute filesystem punch-hole under appropriate locking.
> + * - VMA lock path: no mmap lock held; call vfs_fallocate() directly.
> + * - mmap lock path: follow existing protocol including UFFD coordination
> + * and temporary mmap_read_unlock/lock around the filesystem call.
Turns out this isn't correct.
> */
> get_file(f);
> + if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
> + error = vfs_fallocate(f,
> + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> + offset, end - start);
> + fput(f);
> + return error;
> + }
You're duplicating code here (including the god-awful indentation
here... not your fault but still) it's really not good.
Please do not send patches that literally copy/paste entire blocks of code
like this, it's not acceptable.
Anyway, on to the deadlock... far more fatally, the syzkaller report suggests
that you are now in trouble and deadlocking on the inode lock...
You have:
- A vfs_read() .. inode lock + mmap read lock ....
- A cheeky setup_arg_pages() -> mprotect_fixup() VMA _write_ lock...
- And now this vfs_fallocate() which is waiting on the inode lock DEADLOCK.
This is my rough reading.
This strikes me as rendering MADV_REMOVE not amenable to use of the VMA
read lock.
The locking here is already very complicated due to synchronising the rmap
lock, so can we please just not?
> if (userfaultfd_remove(vma, start, end)) {
> /* mmap_lock was not released by userfaultfd_remove() */
> mmap_read_unlock(mm);
I mean you could have done e.g.:
/* VMA locking doesn't support UFFD here. */
if (!using_vma_lock && userfaultfd_remove(...)) {
...
}
Anyway it's kinda moot now obviously!
> @@ -1754,7 +1773,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
> return MADVISE_NO_LOCK;
>
> switch (madv_behavior->behavior) {
> - case MADV_REMOVE:
> case MADV_WILLNEED:
> case MADV_COLD:
> case MADV_PAGEOUT:
> @@ -1762,6 +1780,7 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
> case MADV_POPULATE_WRITE:
> case MADV_COLLAPSE:
> return MADVISE_MMAP_READ_LOCK;
> + case MADV_REMOVE:
> case MADV_GUARD_INSTALL:
> case MADV_GUARD_REMOVE:
> case MADV_DONTNEED:
> --
> 2.43.5
Cheers, Lorenzo