linux-mm.kvack.org archive mirror
From: Suren Baghdasaryan <surenb@google.com>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: linux-mm <linux-mm@kvack.org>,
	"Liam R. Howlett" <liam.howlett@oracle.com>,
	 Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: not issuing vma_start_write() in dup_mmap() if the caller is single-threaded
Date: Fri, 28 Mar 2025 17:56:54 -0700	[thread overview]
Message-ID: <CAJuCfpHB42wZ+V74ny-KQGV4+R98YkOE2mEQSJ9LvripaDRfXA@mail.gmail.com> (raw)
In-Reply-To: <CAGudoHFdPd5a7fiAutTFAxQz5f1fP2n9rwORm7Hj9fPzuVhiKw@mail.gmail.com>

Hi Mateusz,

On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> Hi Suren,
>
> I brought this up a long time ago, you agreed the idea works and stated
> you would sort it out, but it apparently fell by the wayside (no hard
> feelings though :>)

Sorry about that. Yeah, that was a hectic time and I completely forgot
to look into the single-user case.

>
> Here is the original:
> https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/
>
> The thread is weirdly long and I recommend not opening it without a
> good reason; I link it for reference in case it's needed.

I had to re-read it to remember what it was all about :) To bring
others up-to-speed, the suggested change would look something like
this:

@@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
        unsigned long charge = 0;
        LIST_HEAD(uf);
        VMA_ITERATOR(vmi, mm, 0);
+       bool only_user;

        if (mmap_write_lock_killable(oldmm))
                return -EINTR;
+       only_user = atomic_read(&oldmm->mm_users) == 1;
        flush_cache_dup_mm(oldmm);
        uprobe_dup_mmap(oldmm, mm);
        /*
@@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
        mt_clear_in_rcu(vmi.mas.tree);
        for_each_vma(vmi, mpnt) {
                struct file *file;

-               vma_start_write(mpnt);
+               if (!only_user)
+                       vma_start_write(mpnt);
                if (mpnt->vm_flags & VM_DONTCOPY) {
                        retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
                                                    mpnt->vm_end, GFP_KERNEL);

>
> At the time there were woes concerning stability of the new locking
> scheme, resulting in CC list being rather excessive. As such I'm
> starting a new thread with a modest list, not sure who to add though.
>
> So dup_mmap() holds the caller's mmap_sem for writing and calls
> vma_start_write() to protect against fault handling in other threads
> using the same mm.
>
> If this is the only thread with this ->mm in place, there are no
> sibling threads to worry about and this can be checked with mm_users
> == 1.
>
> AFAICS all remote accesses require the mmap_sem to also be held, which
> provides exclusion against dup_mmap, meaning they don't pose a problem
> either.

Yes, I believe the optimization you proposed would be safe.

>
> The patch to merely skip locking is a few-liner and I would officially
> submit it myself, but for my taste an assert is needed in fault
> handling to runtime-test the above invariant.

Hmm. Adding an assert in the pagefault path that checks whether the
fault is triggered during dup_mmap() && mm_users==1 would not be
trivial. We would need to indicate that we expect no page faults while
in that section of dup_mmap() (maybe a flag inside mm_struct) and then
assert that this flag is not set inside the pagefault handling path.
I'm not sure it's worth the complexity... As was discussed in that
thread, the only other task that might fault a page would be external
and therefore would have to go through
access_remote_vm()->mmap_read_lock_killable() and since dup_mmap()
already holds mmap_write_lock the new user would have to wait.
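To make the flag idea concrete, here is a userspace sketch of what such an
assert could look like (the flag name and the begin/end helpers are entirely
hypothetical illustrations, not a patch proposal):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace sketch of the assertion idea: a hypothetical flag marks
 * the dup_mmap() section in which no page fault on this mm should be
 * possible, and the fault path checks it.
 */
struct mm_model {
	atomic_int mm_users;
	atomic_bool dup_mmap_no_fault;	/* hypothetical flag */
};

/* Entered with mmap_write_lock held; only meaningful when the
 * per-VMA locks were skipped because mm_users == 1. */
static void dup_mmap_begin(struct mm_model *mm)
{
	if (atomic_load(&mm->mm_users) == 1)
		atomic_store(&mm->dup_mmap_no_fault, true);
}

static void dup_mmap_end(struct mm_model *mm)
{
	atomic_store(&mm->dup_mmap_no_fault, false);
}

/* What the fault handler would assert: faulting while the flag is
 * set would mean the mm_users == 1 invariant was violated. */
static bool fault_allowed(struct mm_model *mm)
{
	return !atomic_load(&mm->dup_mmap_no_fault);
}
```

The complexity Suren mentions is real: the flag has to live somewhere in
mm_struct and every fault entry path has to check it, all for an invariant
that mmap_write_lock already enforces against remote accessors.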

> So happens I really
> can't be bothered to figure out how to sort it out and was hoping you
> would step in. ;) Alternatively if you guys don't think the assert is
> warranted, that's your business.

I don't think it's worth it but I'll CC Matthew and Lorenzo (you
already CC'ed Liam) to get their opinion.

>
> As for whether this can save any locking -- yes:

Yeah, I'm sure it will make a difference in performance. While forking,
we are currently locking each VMA separately, so skipping that would
be nice.
Folks, WDYT? Do we need a separate assertion that a page fault can't
happen if mm_users==1 and we are holding mmap_write_lock
(access_remote_vm() will block)?
Thanks,
Suren.

>
> I added a probe (below for reference) with two args: whether we are
> single-threaded and vma_start_write() returning whether it took the
> down/up cycle and ran make -j 20 in the kernel dir.
>
> The lock was taken for every single vma (377913 in total), while all
> forking processes were single-threaded. Or to put it differently, all
> of these were skippable.
>
> the probe (total hack):
> bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }'
>
> probe diff:
> diff --git a/fs/namei.c b/fs/namei.c
> index ecb7b95c2ca3..d6cde76eda81 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -5459,3 +5459,7 @@ const struct inode_operations page_symlink_inode_operations = {
>         .get_link       = page_get_link,
>  };
>  EXPORT_SYMBOL(page_symlink_inode_operations);
> +
> +void dup_probe(int, int);
> +void dup_probe(int, int) { }
> +EXPORT_SYMBOL(dup_probe);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1f80baddacc5..f7b1f0a02f2e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_lock_seq)
>   * Exclude concurrent readers under the per-VMA lock until the currently
>   * write-locked mmap_lock is dropped or downgraded.
>   */
> -static inline void vma_start_write(struct vm_area_struct *vma)
> +static inline bool vma_start_write(struct vm_area_struct *vma)
>  {
>         unsigned int mm_lock_seq;
>
>         if (__is_vma_write_locked(vma, &mm_lock_seq))
> -               return;
> +               return false;
>
>         down_write(&vma->vm_lock->lock);
>         /*
> @@ -776,6 +776,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
>          */
>         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
>         up_write(&vma->vm_lock->lock);
> +       return true;
>  }
>
>  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 735405a9c5f3..0cc56255a339 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
>                 pr_warn_once("exe_file_deny_write_access() failed in %s\n", __func__);
>  }
>
> +void dup_probe(int, int);
> +
>  #ifdef CONFIG_MMU
>  static __latent_entropy int dup_mmap(struct mm_struct *mm,
>                                         struct mm_struct *oldmm)
> @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>         unsigned long charge = 0;
>         LIST_HEAD(uf);
>         VMA_ITERATOR(vmi, mm, 0);
> +       bool only_user;
>
>         if (mmap_write_lock_killable(oldmm))
>                 return -EINTR;
> +       only_user = atomic_read(&oldmm->mm_users) == 1;
>         flush_cache_dup_mm(oldmm);
>         uprobe_dup_mmap(oldmm, mm);
>         /*
> @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>         mt_clear_in_rcu(vmi.mas.tree);
>         for_each_vma(vmi, mpnt) {
>                 struct file *file;
> +               bool locked;
> +
> +               locked = vma_start_write(mpnt);
> +               dup_probe(only_user ? 1 : 0, locked ? 1 : 0);
>
> -               vma_start_write(mpnt);
>                 if (mpnt->vm_flags & VM_DONTCOPY) {
>                         retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
>                                                     mpnt->vm_end, GFP_KERNEL);
>
> --
> Mateusz Guzik <mjguzik gmail.com>



Thread overview: 13+ messages
2025-03-27  5:46 Mateusz Guzik
2025-03-29  0:56 ` Suren Baghdasaryan [this message]
2025-03-29  1:15   ` Mateusz Guzik
2025-03-29  1:35     ` Suren Baghdasaryan
2025-03-29  1:51       ` Mateusz Guzik
2025-03-30 19:23         ` Suren Baghdasaryan
2025-03-30 19:25           ` Suren Baghdasaryan
2025-03-30 19:43           ` Mateusz Guzik
2025-03-31 16:43             ` Liam R. Howlett
2025-03-31 17:50               ` Mateusz Guzik
2025-03-31 18:42                 ` Suren Baghdasaryan
2025-03-31 19:24                 ` Liam R. Howlett
2025-03-31 20:27                   ` Mateusz Guzik
