* [PATCH v2] mm/madvise: prefer VMA lock for MADV_REMOVE
@ 2026-04-10 8:02 jiang.kun2
2026-04-10 13:40 ` Lorenzo Stoakes
0 siblings, 1 reply; 2+ messages in thread
From: jiang.kun2 @ 2026-04-10 8:02 UTC (permalink / raw)
To: akpm, liam.howlett, ljs, david, vbabka, jannh
Cc: linux-mm, linux-kernel, xu.xin16, wang.yaxin, jiang.kun2, lu.zhongjun
From: Jiang Kun <jiang.kun2@zte.com.cn>
MADV_REMOVE prefers the per-VMA read lock for single-VMA, local-mm,
non-UFFD-armed ranges, avoiding mmap_lock contention for such ranges.
However, calling into the filesystem while holding vm_lock (VMA lock) can
create lock ordering issues. syzbot reported a possible deadlock in
blkdev_fallocate() when vfs_fallocate() is called under vm_lock.
Fix this by dropping the VMA lock before invoking vfs_fallocate(), after
taking an extra reference to the file. Keep the existing mmap_lock fallback
path and its userfaultfd coordination unchanged.
Repeated benchmark runs show no regression in the uncontended case, and show
benefit once mmap_lock contention is introduced.
Link: https://ci.syzbot.org/series/30acb9df-ca55-4cbf-81ed-89b84da8edc1
Link: https://lore.kernel.org/all/aWcZCwz__qwwKbxw@casper.infradead.org/
Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn>
Signed-off-by: Yaxin Wang <wang.yaxin@zte.com.cn>
---
mm/madvise.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 69708e953cf5..0932579bccb4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1008,8 +1008,6 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
unsigned long start = madv_behavior->range.start;
unsigned long end = madv_behavior->range.end;
- mark_mmap_lock_dropped(madv_behavior);
-
if (vma->vm_flags & VM_LOCKED)
return -EINVAL;
@@ -1025,6 +1023,20 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
offset = (loff_t)(start - vma->vm_start)
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
+ /* Avoid calling into the filesystem while holding a VMA lock. */
+ if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
+ get_file(f);
+ vma_end_read(vma);
+ madv_behavior->vma = NULL;
+ error = vfs_fallocate(f,
+ FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+ offset, end - start);
+ fput(f);
+ return error;
+ }
+
+ mark_mmap_lock_dropped(madv_behavior);
+
/*
* Filesystem's fallocate may need to take i_rwsem. We need to
* explicitly grab a reference because the vma (and hence the
@@ -1677,7 +1689,8 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK &&
try_vma_read_lock(madv_behavior)) {
error = madvise_vma_behavior(madv_behavior);
- vma_end_read(madv_behavior->vma);
+ if (madv_behavior->vma)
+ vma_end_read(madv_behavior->vma);
return error;
}
@@ -1746,7 +1759,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
return MADVISE_NO_LOCK;
switch (madv_behavior->behavior) {
- case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_COLD:
case MADV_PAGEOUT:
@@ -1754,6 +1766,7 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
case MADV_POPULATE_WRITE:
case MADV_COLLAPSE:
return MADVISE_MMAP_READ_LOCK;
+ case MADV_REMOVE:
case MADV_GUARD_INSTALL:
case MADV_GUARD_REMOVE:
case MADV_DONTNEED:
--
2.53.0
^ permalink raw reply [flat|nested] 2+ messages in thread
* Re: [PATCH v2] mm/madvise: prefer VMA lock for MADV_REMOVE
2026-04-10 8:02 [PATCH v2] mm/madvise: prefer VMA lock for MADV_REMOVE jiang.kun2
@ 2026-04-10 13:40 ` Lorenzo Stoakes
0 siblings, 0 replies; 2+ messages in thread
From: Lorenzo Stoakes @ 2026-04-10 13:40 UTC (permalink / raw)
To: jiang.kun2
Cc: akpm, liam.howlett, david, vbabka, jannh, linux-mm, linux-kernel,
xu.xin16, wang.yaxin, lu.zhongjun
On Fri, Apr 10, 2026 at 04:02:49PM +0800, jiang.kun2@zte.com.cn wrote:
> From: Jiang Kun <jiang.kun2@zte.com.cn>
>
> MADV_REMOVE prefers the per-VMA read lock for single-VMA, local-mm,
> non-UFFD-armed ranges, avoiding mmap_lock contention for such ranges.
I'm not sure what 'MADV_REMOVE prefers' means here? You're making the change,
you should be making a case for it, ideally with performance numbers.
I'm really not sure we should just be enabling the VMA lock for things
arbitrarily unless there's a real need for it.
>
> However, calling into the filesystem while holding vm_lock (VMA lock) can
Not sure what vm_lock is? :)
> create lock ordering issues. syzbot reported a possible deadlock in
> blkdev_fallocate() when vfs_fallocate() is called under vm_lock.
It's not really worth mentioning an issue that came up in a v1 of this code in
the commit message. The commit message should describe what you're actually
doing, why, and any details that might be helpful to people looking at the
patch.
>
> Fix this by dropping the VMA lock before invoking vfs_fallocate(), after
> taking an extra reference to the file. Keep the existing mmap_lock fallback
> path and its userfaultfd coordination unchanged.
It's worth mentioning this, but not in terms of a bug that happened in v1
>
> Repeated benchmark runs show no regression in the uncontended case, and show
> benefit once mmap_lock contention is introduced.
Numbers please :) ideally with some statistics like standard deviation etc.
>
> Link: https://ci.syzbot.org/series/30acb9df-ca55-4cbf-81ed-89b84da8edc1
> Link: https://lore.kernel.org/all/aWcZCwz__qwwKbxw@casper.infradead.org/
> Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn>
> Signed-off-by: Yaxin Wang <wang.yaxin@zte.com.cn>
Some things to address, I have attached a patch that reworks what you have
here in a way that avoids duplication, so if you're good with that feel
free to send as v3, I don't need attribution other than a Suggested-by if
you like.
The logic in general seems correct, but we need numbers! And the commit
message to be improved as per the above also.
Thanks, Lorenzo
> ---
> mm/madvise.c | 21 +++++++++++++++++----
> 1 file changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 69708e953cf5..0932579bccb4 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1008,8 +1008,6 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
> unsigned long start = madv_behavior->range.start;
> unsigned long end = madv_behavior->range.end;
>
> - mark_mmap_lock_dropped(madv_behavior);
> -
> if (vma->vm_flags & VM_LOCKED)
> return -EINVAL;
>
> @@ -1025,6 +1023,20 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
> offset = (loff_t)(start - vma->vm_start)
> + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>
> + /* Avoid calling into the filesystem while holding a VMA lock. */
> + if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
> + get_file(f);
> + vma_end_read(vma);
> + madv_behavior->vma = NULL;
> + error = vfs_fallocate(f,
> + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> + offset, end - start);
> + fput(f);
> + return error;
> + }
> +
> + mark_mmap_lock_dropped(madv_behavior);
I don't like this code duplication, you're literally copy/pasting code here.
Let's use what we already have.
We can avoid duplicating like this, make mark_mmap_lock_dropped() valid for VMA
locks too and rename it to mark_lock_dropped(), then change the logic around a
bit and have something a bit nicer I think.
See attached patch, you're welcome to use this as-is (you can add a Suggested-by
if you like), but that avoids the duplication and keeps things unified.
> +
> /*
> * Filesystem's fallocate may need to take i_rwsem. We need to
> * explicitly grab a reference because the vma (and hence the
> @@ -1677,7 +1689,8 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
> if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK &&
> try_vma_read_lock(madv_behavior)) {
> error = madvise_vma_behavior(madv_behavior);
> - vma_end_read(madv_behavior->vma);
> + if (madv_behavior->vma)
> + vma_end_read(madv_behavior->vma);
> return error;
> }
>
> @@ -1746,7 +1759,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
> return MADVISE_NO_LOCK;
>
> switch (madv_behavior->behavior) {
> - case MADV_REMOVE:
> case MADV_WILLNEED:
> case MADV_COLD:
> case MADV_PAGEOUT:
> @@ -1754,6 +1766,7 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
> case MADV_POPULATE_WRITE:
> case MADV_COLLAPSE:
> return MADVISE_MMAP_READ_LOCK;
> + case MADV_REMOVE:
> case MADV_GUARD_INSTALL:
> case MADV_GUARD_REMOVE:
> case MADV_DONTNEED:
> --
> 2.53.0
Suggested reworked version below:
----8<----
From 97f6ff9749a38256e919a24ff2867a077a6f8cfc Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <ljs@kernel.org>
Date: Fri, 10 Apr 2026 14:37:23 +0100
Subject: [PATCH] slightly reworked
---
mm/madvise.c | 31 ++++++++++++++++---------------
1 file changed, 16 insertions(+), 15 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 69708e953cf5..4af0565809f2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -269,9 +269,8 @@ static void shmem_swapin_range(struct vm_area_struct *vma,
}
#endif /* CONFIG_SWAP */
-static void mark_mmap_lock_dropped(struct madvise_behavior *madv_behavior)
+static void mark_lock_dropped(struct madvise_behavior *madv_behavior)
{
- VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK);
madv_behavior->lock_dropped = true;
}
@@ -315,7 +314,7 @@ static long madvise_willneed(struct madvise_behavior *madv_behavior)
* vma's reference to the file) can go away as soon as we drop
* mmap_lock.
*/
- mark_mmap_lock_dropped(madv_behavior);
+ mark_lock_dropped(madv_behavior);
get_file(file);
offset = (loff_t)(start - vma->vm_start)
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
@@ -909,7 +908,7 @@ static long madvise_dontneed_free(struct madvise_behavior *madv_behavior)
if (!userfaultfd_remove(madv_behavior->vma, range->start, range->end)) {
struct vm_area_struct *vma;
- mark_mmap_lock_dropped(madv_behavior);
+ mark_lock_dropped(madv_behavior);
mmap_read_lock(mm);
madv_behavior->vma = vma = vma_lookup(mm, range->start);
if (!vma)
@@ -1005,10 +1004,9 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
struct file *f;
struct mm_struct *mm = madv_behavior->mm;
struct vm_area_struct *vma = madv_behavior->vma;
- unsigned long start = madv_behavior->range.start;
- unsigned long end = madv_behavior->range.end;
-
- mark_mmap_lock_dropped(madv_behavior);
+ const unsigned long start = madv_behavior->range.start;
+ const unsigned long end = madv_behavior->range.end;
+ const bool vma_locked = madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK;
if (vma->vm_flags & VM_LOCKED)
return -EINVAL;
@@ -1029,18 +1027,20 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
* Filesystem's fallocate may need to take i_rwsem. We need to
* explicitly grab a reference because the vma (and hence the
* vma's reference to the file) can go away as soon as we drop
- * mmap_lock.
+ * its lock.
*/
get_file(f);
- if (userfaultfd_remove(vma, start, end)) {
- /* mmap_lock was not released by userfaultfd_remove() */
+ mark_lock_dropped(madv_behavior);
+ if (vma_locked)
+ vma_end_read(vma);
+ else if (userfaultfd_remove(vma, start, end)) /* uffd didn't drop? */
mmap_read_unlock(mm);
- }
error = vfs_fallocate(f,
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
offset, end - start);
fput(f);
- mmap_read_lock(mm);
+ if (!vma_locked)
+ mmap_read_lock(mm);
return error;
}
@@ -1677,7 +1677,8 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK &&
try_vma_read_lock(madv_behavior)) {
error = madvise_vma_behavior(madv_behavior);
- vma_end_read(madv_behavior->vma);
+ if (!madv_behavior->lock_dropped)
+ vma_end_read(madv_behavior->vma);
return error;
}
@@ -1746,7 +1747,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
return MADVISE_NO_LOCK;
switch (madv_behavior->behavior) {
- case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_COLD:
case MADV_PAGEOUT:
@@ -1754,6 +1754,7 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
case MADV_POPULATE_WRITE:
case MADV_COLLAPSE:
return MADVISE_MMAP_READ_LOCK;
+ case MADV_REMOVE:
case MADV_GUARD_INSTALL:
case MADV_GUARD_REMOVE:
case MADV_DONTNEED:
--
2.53.0
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2026-04-10 13:40 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-10 8:02 [PATCH v2] mm/madvise: prefer VMA lock for MADV_REMOVE jiang.kun2
2026-04-10 13:40 ` Lorenzo Stoakes
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox