Re: [PATCH linux-next] mm/madvise: prefer VMA lock for MADV_REMOVE

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: wang.yaxin@zte.com.cn
Cc: akpm@linux-foundation.org, liam.howlett@oracle.com,
	david@kernel.org, vbabka@suse.cz, jannh@google.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	xu.xin16@zte.com.cn, yang.yang29@zte.com.cn, fan.yu9@zte.com.cn,
	he.peilin@zte.com.cn, tu.qiang35@zte.com.cn,
	qiu.yutan@zte.com.cn, jiang.kun2@zte.com.cn,
	lu.zhongjun@zte.com.cn
Subject: Re: [PATCH linux-next] mm/madvise: prefer VMA lock for MADV_REMOVE
Date: Wed, 14 Jan 2026 09:04:03 +0000	[thread overview]
Message-ID: <29af26de-7339-4dda-a3ee-00eb19993493@lucifer.local> (raw)
In-Reply-To: <202601141124178748cM66DJW2fzNea7Uym1mG@zte.com.cn>

Not sure why you're basing on linux-next, use mm-unstable.

However, syzbot is immediately reporting a deadlock bug here. Maybe a 'neat
hack' to get the bots to actually check things...? Very strange that
happened for something not in any tree.

See below for my rough analysis of it.

Also can you make sure to test changes to _really fiddly_ locks like this
with lockdep please?

Overall I don't think this change is justified in any way even if you
somehow resolve the existing lock mess unless you can provide significant
justification so am minded for this concept to be rejected.

But obviously this patch as-is is broken, so no to it clearly.

On Wed, Jan 14, 2026 at 11:24:17AM +0800, wang.yaxin@zte.com.cn wrote:
> From: Jiang Kun <jiang.kun2@zte.com.cn>
>
> MADV_REMOVE currently runs under the process-wide mmap_read_lock() and
> temporarily drops and reacquires it around filesystem hole-punching.
> For single-VMA, local-mm, non-UFFD-armed ranges we can safely operate
> under the finer-grained per-VMA read lock to reduce contention and lock
> hold time, while preserving semantics.
>
> This patch:
> - Switches MADV_REMOVE to prefer MADVISE_VMA_READ_LOCK via get_lock_mode().
> - Adds a branch in madvise_remove():
>   * Under VMA lock: avoid mark_mmap_lock_dropped() and mmap lock churn;
>     take a file reference and call vfs_fallocate() directly.
>   * Under mmap read lock fallback: preserve existing behavior including
>     userfaultfd_remove() coordination and temporary mmap_read_unlock/lock
>     around vfs_fallocate().
>
> Constraints and fallback:
> - try_vma_read_lock() enforces single VMA, local mm, and userfaultfd
>   not armed (userfaultfd_armed(vma) == false). If any condition fails,
>   we fall back to mmap_read_lock(mm) and use the original path.
> - Semantics are unchanged: permission checks, VM_LOCKED rejection,
>   shared-may-write requirement, error propagation all remain as before.

You're not giving any justification for doing this, you really do have to
for this to be acceptable, as Matthew notes. Especially given the locking
complexity here.

>
> Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn>
> Signed-off-by: Yaxin Wang <wang.yaxin@zte.com.cn>
> ---
>  mm/madvise.c | 31 +++++++++++++++++++++++++------
>  1 file changed, 25 insertions(+), 6 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6bf7009fa5ce..279ec5169879 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1015,7 +1015,19 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
>  	unsigned long start = madv_behavior->range.start;
>  	unsigned long end = madv_behavior->range.end;
>
> -	mark_mmap_lock_dropped(madv_behavior);
> +	/*
> +	 * Prefer VMA read lock path: when operating under VMA lock, we avoid
> +	 * dropping/reacquiring the mmap lock and directly perform the filesystem
> +	 * operation while the VMA is read-locked. We still take and drop a file
> +	 * reference to protect against concurrent file changes.

Hmm sounds iffy.

> +	 *
> +	 * When operating under mmap read lock (fallback), preserve existing
> +	 * behaviour: mark lock dropped, coordinate with userfaultfd_remove(),
> +	 * temporarily drop mmap_read_lock around vfs_fallocate(), and then
> +	 * reacquire it.
> +	 */
> +	if (madv_behavior->lock_mode == MADVISE_MMAP_READ_LOCK)
> +		mark_mmap_lock_dropped(madv_behavior);
>
>  	if (vma->vm_flags & VM_LOCKED)
>  		return -EINVAL;
> @@ -1033,12 +1045,19 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
>  			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>
>  	/*
> -	 * Filesystem's fallocate may need to take i_rwsem.  We need to
> -	 * explicitly grab a reference because the vma (and hence the
> -	 * vma's reference to the file) can go away as soon as we drop
> -	 * mmap_lock.
> +	 * Execute filesystem punch-hole under appropriate locking.
> +	 * - VMA lock path: no mmap lock held; call vfs_fallocate() directly.
> +	 * - mmap lock path: follow existing protocol including UFFD coordination
> +	 *   and temporary mmap_read_unlock/lock around the filesystem call.

Turns out this isn't correct.

>  	 */
>  	get_file(f);
> +	if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
> +		error = vfs_fallocate(f,
> +					FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +					offset, end - start);
> +		fput(f);
> +		return error;
> +	}

You're duplicating code here (including the god-awful indentation
here... not your fault but still) it's really not good.

Please do not send patches that literally copy/paste entire blocks of code
like this, it's not acceptable.

Anyway on to the deadlock... far more fatally the syzkaller suggests that
you are now in trouble and deadlocking on the inode lock...

You have:

- A vfs_read() .. inode lock + mmap read lock ....
- A cheeky setup_arg_pages() -> mprotect_fixup() VMA _write_ lock...
- And now this vfs_fallocate() which is waiting on the inode lock DEADLOCK.

This is my rough reading.

This strikes me as rendering MADV_REMOVE not amenable to use of the VMA
read lock.

The locking here is already very complicated due to synchronising the rmap
lock, so can we please just not?

>  	if (userfaultfd_remove(vma, start, end)) {
>  		/* mmap_lock was not released by userfaultfd_remove() */
>  		mmap_read_unlock(mm);

I mean you could have done e.g.:

	/* VMA locking doesn't support UFFD here. */
	if (!using_vma_lock && userfaultfd_remove(...)) {
		...
	}

Anyway it's kinda moot now obviously!

> @@ -1754,7 +1773,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
>  		return MADVISE_NO_LOCK;
>
>  	switch (madv_behavior->behavior) {
> -	case MADV_REMOVE:
>  	case MADV_WILLNEED:
>  	case MADV_COLD:
>  	case MADV_PAGEOUT:
> @@ -1762,6 +1780,7 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
>  	case MADV_POPULATE_WRITE:
>  	case MADV_COLLAPSE:
>  		return MADVISE_MMAP_READ_LOCK;
> +	case MADV_REMOVE:
>  	case MADV_GUARD_INSTALL:
>  	case MADV_GUARD_REMOVE:
>  	case MADV_DONTNEED:
> --
> 2.43.5

Cheers, Lorenzo

     prev parent reply	other threads:[~2026-01-14  9:04 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-14  3:24 wang.yaxin
2026-01-14  4:18 ` Matthew Wilcox
2026-01-14  7:00   ` wang.yaxin
2026-01-14  4:19 ` Matthew Wilcox
2026-01-14  8:03 ` [syzbot ci] " syzbot ci
2026-01-14  9:04 ` Lorenzo Stoakes [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=29af26de-7339-4dda-a3ee-00eb19993493@lucifer.local \
    --to=lorenzo.stoakes@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@kernel.org \
    --cc=fan.yu9@zte.com.cn \
    --cc=he.peilin@zte.com.cn \
    --cc=jannh@google.com \
    --cc=jiang.kun2@zte.com.cn \
    --cc=liam.howlett@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lu.zhongjun@zte.com.cn \
    --cc=qiu.yutan@zte.com.cn \
    --cc=tu.qiang35@zte.com.cn \
    --cc=vbabka@suse.cz \
    --cc=wang.yaxin@zte.com.cn \
    --cc=xu.xin16@zte.com.cn \
    --cc=yang.yang29@zte.com.cn \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox