From: Mateusz Guzik <mjguzik@gmail.com>
To: Suren Baghdasaryan <surenb@google.com>
Cc: linux-mm <linux-mm@kvack.org>,
"Liam R. Howlett" <liam.howlett@oracle.com>
Subject: not issuing vma_start_write() in dup_mmap() if the caller is single-threaded
Date: Thu, 27 Mar 2025 06:46:27 +0100 [thread overview]
Message-ID: <CAGudoHFdPd5a7fiAutTFAxQz5f1fP2n9rwORm7Hj9fPzuVhiKw@mail.gmail.com> (raw)
Hi Suren,
I brought this up long time ago, you agreed the idea works and stated
you will sort it out, but it apparently fell to the wayside (no hard
feelings though :>)
Here is the original:
https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/
the thread is weirdly long and I recommend not opening it without a
good reason, I link it for reference if needed.
At the time there were woes concerning stability of the new locking
scheme, resulting in CC list being rather excessive. As such I'm
starting a new thread with a modest list, not sure who to add though.
So dup_mmap() holds the caller mmap_sem for writing and calls
vma_start_write() to protect against fault handling in another threads
using the same mm.
If this is the only thread with this ->mm in place, there are no
sibling threads to worry about and this can be checked with mm_users
== 1.
AFAICS all remote accesses require the mmap_sem to also be held, which
provides exclusion against dup_mmap, meaning they don't pose a problem
either.
The patch to merely skip locking is a few liner and I would officially
submit it myself, but for my taste an assert is needed in fault
handling to runtime test the above invariant. So happens I really
can't be bothered to figure out how to sort it out and was hoping you
would step in. ;) Alternatively if you guys don't think the assert is
warranted, that's your business.
As for whether this can save any locking -- yes:
I added a probe (below for reference) with two args: whether we are
single-threaded and vma_start_write() returning whether it took the
down/up cycle and ran make -j 20 in the kernel dir.
The lock was taken for every single vma (377913 in total), while all
forking processes were single-threaded. Or to put it differently all
of these were skippable.
the probe (total hack):
bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }'
probe diff:
diff --git a/fs/namei.c b/fs/namei.c
index ecb7b95c2ca3..d6cde76eda81 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -5459,3 +5459,7 @@ const struct inode_operations
page_symlink_inode_operations = {
.get_link = page_get_link,
};
EXPORT_SYMBOL(page_symlink_inode_operations);
+
+void dup_probe(int, int);
+void dup_probe(int, int) { }
+EXPORT_SYMBOL(dup_probe);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f80baddacc5..f7b1f0a02f2e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct
vm_area_struct *vma, unsigned int *mm_l
* Exclude concurrent readers under the per-VMA lock until the currently
* write-locked mmap_lock is dropped or downgraded.
*/
-static inline void vma_start_write(struct vm_area_struct *vma)
+static inline bool vma_start_write(struct vm_area_struct *vma)
{
unsigned int mm_lock_seq;
if (__is_vma_write_locked(vma, &mm_lock_seq))
- return;
+ return false;
down_write(&vma->vm_lock->lock);
/*
@@ -776,6 +776,7 @@ static inline void vma_start_write(struct
vm_area_struct *vma)
*/
WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
up_write(&vma->vm_lock->lock);
+ return true;
}
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a9c5f3..0cc56255a339 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm,
struct mm_struct *oldmm)
pr_warn_once("exe_file_deny_write_access() failed in
%s\n", __func__);
}
+void dup_probe(int, int);
+
#ifdef CONFIG_MMU
static __latent_entropy int dup_mmap(struct mm_struct *mm,
struct mm_struct *oldmm)
@@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
unsigned long charge = 0;
LIST_HEAD(uf);
VMA_ITERATOR(vmi, mm, 0);
+ bool only_user;
if (mmap_write_lock_killable(oldmm))
return -EINTR;
+ only_user = atomic_read(&oldmm->mm_users) == 1;
flush_cache_dup_mm(oldmm);
uprobe_dup_mmap(oldmm, mm);
/*
@@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
mt_clear_in_rcu(vmi.mas.tree);
for_each_vma(vmi, mpnt) {
struct file *file;
+ bool locked;
+
+ locked = vma_start_write(mpnt);
+ dup_probe(only_user ? 1 :0, locked ? 1 : 0);
- vma_start_write(mpnt);
if (mpnt->vm_flags & VM_DONTCOPY) {
retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
mpnt->vm_end, GFP_KERNEL);
--
Mateusz Guzik <mjguzik gmail.com>
next reply other threads:[~2025-03-27 5:46 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-03-27 5:46 Mateusz Guzik [this message]
2025-03-29 0:56 ` Suren Baghdasaryan
2025-03-29 1:15 ` Mateusz Guzik
2025-03-29 1:35 ` Suren Baghdasaryan
2025-03-29 1:51 ` Mateusz Guzik
2025-03-30 19:23 ` Suren Baghdasaryan
2025-03-30 19:25 ` Suren Baghdasaryan
2025-03-30 19:43 ` Mateusz Guzik
2025-03-31 16:43 ` Liam R. Howlett
2025-03-31 17:50 ` Mateusz Guzik
2025-03-31 18:42 ` Suren Baghdasaryan
2025-03-31 19:24 ` Liam R. Howlett
2025-03-31 20:27 ` Mateusz Guzik
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAGudoHFdPd5a7fiAutTFAxQz5f1fP2n9rwORm7Hj9fPzuVhiKw@mail.gmail.com \
--to=mjguzik@gmail.com \
--cc=liam.howlett@oracle.com \
--cc=linux-mm@kvack.org \
--cc=surenb@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox