From: Suren Baghdasaryan <surenb@google.com>
Date: Sun, 30 Mar 2025 12:25:21 -0700
Subject: Re: not issuing vma_start_write() in dup_mmap() if the caller is single-threaded
To: Mateusz Guzik
Cc: linux-mm <linux-mm@kvack.org>, "Liam R. Howlett", Lorenzo Stoakes, Matthew Wilcox

On Sun, Mar 30, 2025 at 12:23 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Mar 28, 2025 at 6:51 PM Mateusz Guzik wrote:
> >
> > top post warning
> >
> > It just hit me that userfaultfd() may be a problem. Creating it only
> > bumps mm_count, while mm_users gets transiently modified during
> > various ops.
> >
> > So in particular you may end up pounding off the userfaultfd instance
> > to another process while being single-threaded, then mm_users == 1 and
> > userfaultfd may be trying to do something. I have no idea if this one
> > is guaranteed to take the lock.
>
> Hmm, yeah. uffd is nasty. Even in the cases when it does
> mmget_not_zero(), there is still the possibility of a race. For
> example:
>
> dup_mmap
>   only_user = atomic_read(&oldmm->mm_users) == 1;
>                                 userfaultfd_move()
>                                   mmget_not_zero() <-- inc mm_users
>   if (!only_user)
>     vma_start_write(mpnt); <-- gets skipped
>                                 move_pages()
>                                   uffd_move_lock()
>                                     uffd_lock_vma()
>                                       lock_vma_under_rcu() <-- succeeds
>                                   move_pages_pte()
>   copy_page_range()

Sorry about formatting. This might look a bit better:

Task 1                                        Task 2
dup_mmap
  only_user = atomic_read(&oldmm->mm_users) == 1;
                                              userfaultfd_move()
                                                mmget_not_zero()
  if (!only_user)
    vma_start_write(mpnt); <-- gets skipped
                                              move_pages()
                                                uffd_move_lock()
                                                  uffd_lock_vma()
                                                    lock_vma_under_rcu()
                                                move_pages_pte()
  copy_page_range()

> So, userfaultfd_move() will happily move pages while dup_mmap() is
> doing copy_page_range(). The copied range might look quite
> interesting...
>
> >
> > However, the good news is that mm_count tends to be 1. If both
> > mm_count and mm_users are 1, then there is no userfaultfd in use and
> > nobody to add it either.
>
> I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while
> calling mmgrab(), therefore I think it can race with the code checking
> its value.
> >
> > State of mm_count verified with: bpftrace -e 'kprobe:copy_process {
> > @[curtask->mm->mm_count.counter] = count(); }'
> >
> > On Sat, Mar 29, 2025 at 2:35 AM Suren Baghdasaryan wrote:
> > >
> > > On Fri, Mar 28, 2025 at 6:16 PM Mateusz Guzik wrote:
> > > >
> > > > On Sat, Mar 29, 2025 at 1:57 AM Suren Baghdasaryan wrote:
> > > > > On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik wrote:
> > > > > > Here is the original:
> > > > > > https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/
> > > > > >
> > > > > > the thread is weirdly long and I recommend not opening it without a
> > > > > > good reason, I link it for reference if needed.
> > > > >
> > > > > I had to re-read it to remember what it was all about :) To bring
> > > > > others up-to-speed, the suggested change would look something like
> > > > > this:
> > > > >
> > > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > > >         unsigned long charge = 0;
> > > > >         LIST_HEAD(uf);
> > > > >         VMA_ITERATOR(vmi, mm, 0);
> > > > > +       bool only_user;
> > > > >
> > > > >         if (mmap_write_lock_killable(oldmm))
> > > > >                 return -EINTR;
> > > > > +       only_user = atomic_read(&oldmm->mm_users) == 1;
> > > > >         flush_cache_dup_mm(oldmm);
> > > > >         uprobe_dup_mmap(oldmm, mm);
> > > > >         /*
> > > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > > >         mt_clear_in_rcu(vmi.mas.tree);
> > > > >         for_each_vma(vmi, mpnt) {
> > > > >                 struct file *file;
> > > > >
> > > > > -               vma_start_write(mpnt);
> > > > > +               if (!only_user)
> > > > > +                       vma_start_write(mpnt);
> > > > >                 if (mpnt->vm_flags & VM_DONTCOPY) {
> > > > >                         retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
> > > > >                                                     mpnt->vm_end, GFP_KERNEL);
> > > > >
> > > > > > At the time there were woes concerning stability of the new locking
> > > > > > scheme, resulting in CC list being rather excessive. As such I'm
> > > > > > starting a new thread with a modest list, not sure who to add though.
> > > > > >
> > > > > > So dup_mmap() holds the caller mmap_sem for writing and calls
> > > > > > vma_start_write() to protect against fault handling in other threads
> > > > > > using the same mm.
> > > > > >
> > > > > > If this is the only thread with this ->mm in place, there are no
> > > > > > sibling threads to worry about and this can be checked with mm_users
> > > > > > == 1.
> > > > > >
> > > > > > AFAICS all remote accesses require the mmap_sem to also be held, which
> > > > > > provides exclusion against dup_mmap, meaning they don't pose a problem
> > > > > > either.
> > > > >
> > > > > Yes, I believe the optimization you proposed would be safe.
> > > > >
> > > > > > The patch to merely skip locking is a few-liner and I would officially
> > > > > > submit it myself, but for my taste an assert is needed in fault
> > > > > > handling to runtime test the above invariant.
> > > > >
> > > > > Hmm. Adding an assert in the pagefault path that checks whether the
> > > > > fault is triggered during dup_mmap() && mm_users==1 would not be
> > > > > trivial. We would need to indicate that we expect no page faults while
> > > > > in that section of dup_mmap() (maybe a flag inside mm_struct) and then
> > > > > assert that this flag is not set inside the pagefault handling path.
> > > > > I'm not sure it's worth the complexity... As was discussed in that
> > > > > thread, the only other task that might fault a page would be external
> > > > > and therefore would have to go through
> > > > > access_remote_vm()->mmap_read_lock_killable() and since dup_mmap()
> > > > > already holds mmap_write_lock the new user would have to wait.
> > > > >
> > > > > > So happens I really
> > > > > > can't be bothered to figure out how to sort it out and was hoping you
> > > > > > would step in. ;) Alternatively if you guys don't think the assert is
> > > > > > warranted, that's your business.
> > > > >
> > > > > I don't think it's worth it but I'll CC Matthew and Lorenzo (you
> > > > > already CC'ed Liam) to get their opinion.
> > > > >
> > > > > > As for whether this can save any locking -- yes:
> > > > >
> > > > > Yeah, I'm sure it will make a difference in performance. While forking
> > > > > we are currently locking each VMA separately, so skipping that would
> > > > > be nice.
> > > > > Folks, WDYT? Do we need a separate assertion that pagefault can't
> > > > > happen if mm_users==1 and we are holding mmap_write_lock
> > > > > (access_remote_vm() will block)?
> > > >
> > > > pseudocode-wise I was thinking in the lines of the following in the fault path:
> > > >
> > > > if (current->mm != vma->vm_mm)
> > > >         mmap_assert_locked(vma->vm_mm);
> > > >
> > > > with the assumption that mmap_assert_locked expands to nothing without debug
> > >
> > > I see. IOW, if someone external is faulting a page then it has to be
> > > holding at least mmap_read_lock. So, it's a more general check but
> > > seems reasonable. I think adding it at the end of lock_vma_under_rcu()
> > > under the "#ifdef CONFIG_DEBUG_VM" condition would be enough.
> > >
> > > > > > I added a probe (below for reference) with two args: whether we are
> > > > > > single-threaded and vma_start_write() returning whether it took the
> > > > > > down/up cycle and ran make -j 20 in the kernel dir.
> > > > > >
> > > > > > The lock was taken for every single vma (377913 in total), while all
> > > > > > forking processes were single-threaded. Or to put it differently all
> > > > > > of these were skippable.
> > > > > >
> > > > > > the probe (total hack):
> > > > > > bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }'
> > > > > >
> > > > > > probe diff:
> > > > > > diff --git a/fs/namei.c b/fs/namei.c
> > > > > > index ecb7b95c2ca3..d6cde76eda81 100644
> > > > > > --- a/fs/namei.c
> > > > > > +++ b/fs/namei.c
> > > > > > @@ -5459,3 +5459,7 @@ const struct inode_operations page_symlink_inode_operations = {
> > > > > >         .get_link       = page_get_link,
> > > > > > };
> > > > > > EXPORT_SYMBOL(page_symlink_inode_operations);
> > > > > > +
> > > > > > +void dup_probe(int, int);
> > > > > > +void dup_probe(int, int) { }
> > > > > > +EXPORT_SYMBOL(dup_probe);
> > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > > index 1f80baddacc5..f7b1f0a02f2e 100644
> > > > > > --- a/include/linux/mm.h
> > > > > > +++ b/include/linux/mm.h
> > > > > > @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_l
> > > > > >  * Exclude concurrent readers under the per-VMA lock until the currently
> > > > > >  * write-locked mmap_lock is dropped or downgraded.
> > > > > >  */
> > > > > > -static inline void vma_start_write(struct vm_area_struct *vma)
> > > > > > +static inline bool vma_start_write(struct vm_area_struct *vma)
> > > > > > {
> > > > > >         unsigned int mm_lock_seq;
> > > > > >
> > > > > >         if (__is_vma_write_locked(vma, &mm_lock_seq))
> > > > > > -               return;
> > > > > > +               return false;
> > > > > >
> > > > > >         down_write(&vma->vm_lock->lock);
> > > > > >         /*
> > > > > > @@ -776,6 +776,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
> > > > > >          */
> > > > > >         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > > > > >         up_write(&vma->vm_lock->lock);
> > > > > > +       return true;
> > > > > > }
> > > > > >
> > > > > > static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > > index 735405a9c5f3..0cc56255a339 100644
> > > > > > --- a/kernel/fork.c
> > > > > > +++ b/kernel/fork.c
> > > > > > @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
> > > > > >                 pr_warn_once("exe_file_deny_write_access() failed in %s\n", __func__);
> > > > > > }
> > > > > >
> > > > > > +void dup_probe(int, int);
> > > > > > +
> > > > > > #ifdef CONFIG_MMU
> > > > > > static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > > > >                                      struct mm_struct *oldmm)
> > > > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > > > >         unsigned long charge = 0;
> > > > > >         LIST_HEAD(uf);
> > > > > >         VMA_ITERATOR(vmi, mm, 0);
> > > > > > +       bool only_user;
> > > > > >
> > > > > >         if (mmap_write_lock_killable(oldmm))
> > > > > >                 return -EINTR;
> > > > > > +       only_user = atomic_read(&oldmm->mm_users) == 1;
> > > > > >         flush_cache_dup_mm(oldmm);
> > > > > >         uprobe_dup_mmap(oldmm, mm);
> > > > > >         /*
> > > > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > > > >         mt_clear_in_rcu(vmi.mas.tree);
> > > > > >         for_each_vma(vmi, mpnt) {
> > > > > >                 struct file *file;
> > > > > > +               bool locked;
> > > > > > +
> > > > > > +               locked = vma_start_write(mpnt);
> > > > > > +               dup_probe(only_user ? 1 : 0, locked ? 1 : 0);
> > > > > >
> > > > > > -               vma_start_write(mpnt);
> > > > > >                 if (mpnt->vm_flags & VM_DONTCOPY) {
> > > > > >                         retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
> > > > > >                                                     mpnt->vm_end, GFP_KERNEL);
> > > > > >
> > > > > > --
> > > > > > Mateusz Guzik
> > > >
> > > > --
> > > > Mateusz Guzik
> >
> > --
> > Mateusz Guzik