From: Mateusz Guzik
Date: Sat, 29 Mar 2025 02:51:04 +0100
Subject: Re: not issuing vma_start_write() in dup_mmap() if the caller is single-threaded
To: Suren Baghdasaryan
Cc: linux-mm, "Liam R. Howlett", Lorenzo Stoakes, Matthew Wilcox
top post warning

It just hit me that userfaultfd() may be a problem.

Creating one only bumps mm_count, while mm_users gets transiently modified
during various ops. So in particular you may end up handing off the
userfaultfd instance to another process while being single-threaded; then
mm_users == 1 while userfaultfd may be trying to do something. I have no
idea whether that path is guaranteed to take the lock.

However, the good news is that mm_count tends to be 1. If both mm_count and
mm_users are 1, then there is no userfaultfd in use and nobody to add one
either.

State of mm_count verified with:
bpftrace -e 'kprobe:copy_process { @[curtask->mm->mm_count.counter] = count(); }'
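If that reasoning holds, the only_user check from the diff quoted below
could be tightened to also require mm_count == 1. A rough, untested sketch
(same spot in dup_mmap(), right after taking the mmap lock for writing):

        /*
         * Untested sketch: skip the per-VMA write locks only when nothing
         * else can reach this mm -- no other users and no extra references
         * (e.g. a userfaultfd handed to another process only pins mm_count).
         */
        only_user = atomic_read(&oldmm->mm_users) == 1 &&
                    atomic_read(&oldmm->mm_count) == 1;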
On Sat, Mar 29, 2025 at 2:35 AM Suren Baghdasaryan wrote:
>
> On Fri, Mar 28, 2025 at 6:16 PM Mateusz Guzik wrote:
> >
> > On Sat, Mar 29, 2025 at 1:57 AM Suren Baghdasaryan wrote:
> > > On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik wrote:
> > > > Here is the original:
> > > > https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/
> > > >
> > > > the thread is weirdly long and I recommend not opening it without a
> > > > good reason; I link it for reference if needed.
> > >
> > > I had to re-read it to remember what it was all about :) To bring
> > > others up to speed, the suggested change would look something like
> > > this:
> > >
> > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >         unsigned long charge = 0;
> > >         LIST_HEAD(uf);
> > >         VMA_ITERATOR(vmi, mm, 0);
> > > +       bool only_user;
> > >
> > >         if (mmap_write_lock_killable(oldmm))
> > >                 return -EINTR;
> > > +       only_user = atomic_read(&oldmm->mm_users) == 1;
> > >         flush_cache_dup_mm(oldmm);
> > >         uprobe_dup_mmap(oldmm, mm);
> > >         /*
> > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >         mt_clear_in_rcu(vmi.mas.tree);
> > >         for_each_vma(vmi, mpnt) {
> > >                 struct file *file;
> > >
> > > -               vma_start_write(mpnt);
> > > +               if (!only_user)
> > > +                       vma_start_write(mpnt);
> > >                 if (mpnt->vm_flags & VM_DONTCOPY) {
> > >                         retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
> > >                                                     mpnt->vm_end, GFP_KERNEL);
> > >
> > >
> > > > At the time there were woes concerning stability of the new locking
> > > > scheme, resulting in the CC list being rather excessive. As such I'm
> > > > starting a new thread with a modest list; not sure who to add though.
> > > >
> > > > So dup_mmap() holds the caller's mmap_sem for writing and calls
> > > > vma_start_write() to protect against fault handling in other threads
> > > > using the same mm.
> > > >
> > > > If this is the only thread with this ->mm in place, there are no
> > > > sibling threads to worry about and this can be checked with mm_users
> > > > == 1.
> > > >
> > > > AFAICS all remote accesses require the mmap_sem to also be held, which
> > > > provides exclusion against dup_mmap, meaning they don't pose a problem
> > > > either.
> > >
> > > Yes, I believe the optimization you proposed would be safe.
> > >
> > > >
> > > > The patch to merely skip locking is a few-liner and I would officially
> > > > submit it myself, but for my taste an assert is needed in fault
> > > > handling to runtime-test the above invariant.
> > >
> > > Hmm. Adding an assert in the pagefault path that checks whether the
> > > fault is triggered during dup_mmap() && mm_users == 1 would not be
> > > trivial. We would need to indicate that we expect no page faults while
> > > in that section of dup_mmap() (maybe a flag inside mm_struct) and then
> > > assert that this flag is not set inside the pagefault handling path.
> > > I'm not sure it's worth the complexity... As was discussed in that
> > > thread, the only other task that might fault a page would be external
> > > and therefore would have to go through
> > > access_remote_vm()->mmap_read_lock_killable(), and since dup_mmap()
> > > already holds mmap_write_lock the new user would have to wait.
> > >
> > > > So happens I really
> > > > can't be bothered to figure out how to sort it out and was hoping you
> > > > would step in. ;) Alternatively, if you guys don't think the assert is
> > > > warranted, that's your business.
> > >
> > > I don't think it's worth it, but I'll CC Matthew and Lorenzo (you
> > > already CC'ed Liam) to get their opinion.
> > >
> > > >
> > > > As for whether this can save any locking -- yes:
> > >
> > > Yeah, I'm sure it will make a difference in performance. While forking
> > > we are currently locking each VMA separately, so skipping that would
> > > be nice.
> > > Folks, WDYT? Do we need a separate assertion that a pagefault can't
> > > happen if mm_users == 1 and we are holding mmap_write_lock
> > > (access_remote_vm() will block)?
> >
> > pseudocode-wise I was thinking along the lines of the following in the
> > fault path:
> >
> > if (current->mm != vma->vm_mm)
> >         mmap_assert_locked(vma->vm_mm);
> >
> > with the assumption that mmap_assert_locked expands to nothing without
> > debug
>
> I see. IOW, if someone external is faulting a page then it has to be
> holding at least mmap_read_lock. So it's a more general check, but it
> seems reasonable. I think adding it at the end of lock_vma_under_rcu()
> under the "#ifdef CONFIG_DEBUG_VM" condition would be enough.
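To make that concrete, the debug-only check could look roughly like the
following (untested sketch; whether the end of lock_vma_under_rcu() is the
right spot, as suggested above, is an assumption and still up for
discussion):

#ifdef CONFIG_DEBUG_VM
        /*
         * Anyone faulting on an mm that is not their own (e.g. via
         * access_remote_vm()) is expected to hold the mmap lock, which
         * dup_mmap() excludes by taking it for writing.
         */
        if (current->mm != vma->vm_mm)
                mmap_assert_locked(vma->vm_mm);
#endif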
>
> > > >
> > > > I added a probe (below for reference) with two args: whether we are
> > > > single-threaded and vma_start_write() returning whether it took the
> > > > down/up cycle, and ran make -j 20 in the kernel dir.
> > > >
> > > > The lock was taken for every single vma (377913 in total), while all
> > > > forking processes were single-threaded. Or to put it differently, all
> > > > of these were skippable.
> > > >
> > > > the probe (total hack):
> > > > bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }'
> > > >
> > > > probe diff:
> > > > diff --git a/fs/namei.c b/fs/namei.c
> > > > index ecb7b95c2ca3..d6cde76eda81 100644
> > > > --- a/fs/namei.c
> > > > +++ b/fs/namei.c
> > > > @@ -5459,3 +5459,7 @@ const struct inode_operations page_symlink_inode_operations = {
> > > >         .get_link       = page_get_link,
> > > > };
> > > > EXPORT_SYMBOL(page_symlink_inode_operations);
> > > > +
> > > > +void dup_probe(int, int);
> > > > +void dup_probe(int, int) { }
> > > > +EXPORT_SYMBOL(dup_probe);
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index 1f80baddacc5..f7b1f0a02f2e 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_l
> > > >  * Exclude concurrent readers under the per-VMA lock until the currently
> > > >  * write-locked mmap_lock is dropped or downgraded.
> > > >  */
> > > > -static inline void vma_start_write(struct vm_area_struct *vma)
> > > > +static inline bool vma_start_write(struct vm_area_struct *vma)
> > > > {
> > > >         unsigned int mm_lock_seq;
> > > >
> > > >         if (__is_vma_write_locked(vma, &mm_lock_seq))
> > > > -               return;
> > > > +               return false;
> > > >
> > > >         down_write(&vma->vm_lock->lock);
> > > >         /*
> > > > @@ -776,6 +776,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
> > > >  */
> > > >         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > > >         up_write(&vma->vm_lock->lock);
> > > > +       return true;
> > > > }
> > > >
> > > > static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > index 735405a9c5f3..0cc56255a339 100644
> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
> > > >                 pr_warn_once("exe_file_deny_write_access() failed in %s\n", __func__);
> > > > }
> > > >
> > > > +void dup_probe(int, int);
> > > > +
> > > > #ifdef CONFIG_MMU
> > > > static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > >                                      struct mm_struct *oldmm)
> > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > >         unsigned long charge = 0;
> > > >         LIST_HEAD(uf);
> > > >         VMA_ITERATOR(vmi, mm, 0);
> > > > +       bool only_user;
> > > >
> > > >         if (mmap_write_lock_killable(oldmm))
> > > >                 return -EINTR;
> > > > +       only_user = atomic_read(&oldmm->mm_users) == 1;
> > > >         flush_cache_dup_mm(oldmm);
> > > >         uprobe_dup_mmap(oldmm, mm);
> > > >         /*
> > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > >         mt_clear_in_rcu(vmi.mas.tree);
> > > >         for_each_vma(vmi, mpnt) {
> > > >                 struct file *file;
> > > > +               bool locked;
> > > > +
> > > > +               locked = vma_start_write(mpnt);
> > > > +               dup_probe(only_user ? 1 : 0, locked ? 1 : 0);
> > > >
> > > > -               vma_start_write(mpnt);
> > > >                 if (mpnt->vm_flags & VM_DONTCOPY) {
> > > >                         retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
> > > >                                                     mpnt->vm_end, GFP_KERNEL);
> > > >
> > > > --
> > > > Mateusz Guzik
> >
> > --
> > Mateusz Guzik

--
Mateusz Guzik