From: <wang.yaxin@zte.com.cn>
To: <akpm@linux-foundation.org>, <liam.howlett@oracle.com>,
    <lorenzo.stoakes@oracle.com>, <david@kernel.org>
Cc: <vbabka@suse.cz>, <jannh@google.com>, <linux-mm@kvack.org>,
    <linux-kernel@vger.kernel.org>, <xu.xin16@zte.com.cn>,
    <yang.yang29@zte.com.cn>, <wang.yaxin@zte.com.cn>,
    <fan.yu9@zte.com.cn>, <he.peilin@zte.com.cn>,
    <tu.qiang35@zte.com.cn>, <qiu.yutan@zte.com.cn>,
    <jiang.kun2@zte.com.cn>, <lu.zhongjun@zte.com.cn>
Subject: [PATCH linux-next] mm/madvise: prefer VMA lock for MADV_REMOVE
Date: Wed, 14 Jan 2026 11:24:17 +0800 (CST)
Message-ID: <202601141124178748cM66DJW2fzNea7Uym1mG@zte.com.cn>

From: Jiang Kun <jiang.kun2@zte.com.cn>

MADV_REMOVE currently runs under the process-wide mmap_read_lock() and
temporarily drops and reacquires it around the filesystem hole punch.
For ranges that lie within a single VMA of the caller's own mm and whose
VMA is not userfaultfd-armed, we can instead operate under the
finer-grained per-VMA read lock, reducing mmap_lock contention and hold
time while preserving semantics.
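As a point of reference, the behaviour this path implements is visible from userspace. The following is a minimal sketch, not part of the patch: madv_remove_demo() is a hypothetical helper, and it assumes Linux with glibc 2.27 or later for memfd_create(). It maps a two-page memfd MAP_SHARED and writable (satisfying the shared-may-write check), fills it, and punches out the first page with MADV_REMOVE. A single-VMA range in the caller's own mm with no userfaultfd registration, like this one, is the case the proposed VMA lock path targets.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo helper: map a two-page memfd MAP_SHARED, fill both
 * pages with 0xAB, then punch out page 0 with MADV_REMOVE. Reports the
 * first byte of each page and the file size afterwards.
 * Returns 0 on success, -1 on any failure.
 */
int madv_remove_demo(unsigned char *first, unsigned char *second,
		     off_t *size)
{
	long page = sysconf(_SC_PAGESIZE);
	struct stat st;
	unsigned char *p;
	int ret = -1;
	int fd;

	if (page <= 0)
		return -1;
	fd = memfd_create("madv-remove-demo", 0);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, 2 * page) < 0)
		goto out_close;

	/* Shared and writable: madvise_remove() rejects anything else. */
	p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		goto out_close;

	memset(p, 0xAB, 2 * page);

	/* Punch a hole over page 0 only; the file size is kept. */
	if (madvise(p, page, MADV_REMOVE) < 0)
		goto out_unmap;

	if (fstat(fd, &st) < 0)
		goto out_unmap;

	*first = p[0];		/* the hole reads back as zero */
	*second = p[page];	/* the untouched page keeps its data */
	*size = st.st_size;
	ret = 0;

out_unmap:
	munmap(p, 2 * page);
out_close:
	close(fd);
	return ret;
}
```

Called from a test program, it should report 0 for the first page (holes read back as zeroes), 0xAB for the second, and an unchanged size of two pages, because madvise_remove() passes FALLOC_FL_KEEP_SIZE to vfs_fallocate().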

This patch:
- Switches MADV_REMOVE to prefer MADVISE_VMA_READ_LOCK via get_lock_mode().
- Adds a branch in madvise_remove():
  * Under the VMA lock: skip mark_mmap_lock_dropped() and the mmap lock
    drop/reacquire; take a file reference and call vfs_fallocate()
    directly.
  * Under the mmap read lock fallback: preserve the existing behavior,
    including userfaultfd_remove() coordination and the temporary
    mmap_read_unlock()/mmap_read_lock() around vfs_fallocate().

Constraints and fallback:
- try_vma_read_lock() only takes the VMA lock if the range lies within a
  single VMA, the target mm is current->mm, and the VMA is not
  userfaultfd-armed (userfaultfd_armed(vma) == false). If any condition
  fails, it falls back to mmap_read_lock(mm) and madvise_remove() takes
  the original path.
- Semantics are unchanged: the permission checks, VM_LOCKED rejection,
  shared-may-write requirement, and error propagation all remain as
  before.
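To make the fallback rule concrete, here is a hypothetical userspace model of the decision; it is not kernel code. The model_* names and the struct fields are invented for illustration, each field standing in for one check that try_vma_read_lock() performs. Any failed check selects the mmap read lock, and only that mode marks the lock dropped and releases mmap_lock around vfs_fallocate().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's madvise lock modes. */
enum model_lock_mode { MODEL_VMA_READ_LOCK, MODEL_MMAP_READ_LOCK };

/* Hypothetical summary of the checks try_vma_read_lock() makes. */
struct model_range {
	bool vma_found;		/* lock_vma_under_rcu() returned a VMA */
	bool within_one_vma;	/* range end <= vma->vm_end */
	bool local_mm;		/* current->mm == mm */
	bool uffd_armed;	/* userfaultfd_armed(vma) */
};

/* Any failed condition falls back to the mmap read lock. */
enum model_lock_mode model_madv_remove_lock(const struct model_range *r)
{
	if (!r->vma_found || !r->within_one_vma || !r->local_mm ||
	    r->uffd_armed)
		return MODEL_MMAP_READ_LOCK;
	return MODEL_VMA_READ_LOCK;
}

/*
 * Only the fallback keeps the old protocol of marking the lock dropped
 * and releasing mmap_lock around the filesystem call.
 */
bool model_drops_mmap_lock(enum model_lock_mode mode)
{
	return mode == MODEL_MMAP_READ_LOCK;
}
```

Keeping the checks in one predicate mirrors the patch's design choice: madvise_remove() never re-evaluates eligibility itself, it only branches on the lock mode it was handed.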

Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn>
Signed-off-by: Yaxin Wang <wang.yaxin@zte.com.cn>
---
mm/madvise.c | 31 +++++++++++++++++++++++++------
1 file changed, 25 insertions(+), 6 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 6bf7009fa5ce..279ec5169879 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1015,7 +1015,19 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
 	unsigned long start = madv_behavior->range.start;
 	unsigned long end = madv_behavior->range.end;
 
-	mark_mmap_lock_dropped(madv_behavior);
+	/*
+	 * Prefer the VMA read lock path: under the VMA lock we avoid
+	 * dropping/reacquiring the mmap lock and perform the filesystem
+	 * operation directly while the VMA is read-locked. We still hold a
+	 * file reference across vfs_fallocate(), as the fallback path does.
+	 *
+	 * Under the mmap read lock (fallback), preserve the existing
+	 * behaviour: mark the lock dropped, coordinate with
+	 * userfaultfd_remove(), temporarily drop mmap_read_lock around
+	 * vfs_fallocate(), and then reacquire it.
+	 */
+	if (madv_behavior->lock_mode == MADVISE_MMAP_READ_LOCK)
+		mark_mmap_lock_dropped(madv_behavior);
 
 	if (vma->vm_flags & VM_LOCKED)
 		return -EINVAL;
@@ -1033,12 +1045,19 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
 			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
 
 	/*
-	 * Filesystem's fallocate may need to take i_rwsem. We need to
-	 * explicitly grab a reference because the vma (and hence the
-	 * vma's reference to the file) can go away as soon as we drop
-	 * mmap_lock.
+	 * Execute filesystem punch-hole under appropriate locking.
+	 * - VMA lock path: no mmap lock held; call vfs_fallocate() directly.
+	 * - mmap lock path: follow existing protocol including UFFD coordination
+	 *   and temporary mmap_read_unlock/lock around the filesystem call.
 	 */
 	get_file(f);
+	if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
+		error = vfs_fallocate(f,
+				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+				offset, end - start);
+		fput(f);
+		return error;
+	}
 	if (userfaultfd_remove(vma, start, end)) {
 		/* mmap_lock was not released by userfaultfd_remove() */
 		mmap_read_unlock(mm);
@@ -1754,7 +1773,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
 		return MADVISE_NO_LOCK;
 
 	switch (madv_behavior->behavior) {
-	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_COLD:
 	case MADV_PAGEOUT:
@@ -1762,6 +1780,7 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
 	case MADV_POPULATE_WRITE:
 	case MADV_COLLAPSE:
 		return MADVISE_MMAP_READ_LOCK;
+	case MADV_REMOVE:
 	case MADV_GUARD_INSTALL:
 	case MADV_GUARD_REMOVE:
 	case MADV_DONTNEED:
--
2.43.5