From: Suren Baghdasaryan <surenb@google.com>
Date: Sun, 30 Mar 2025 12:23:02 -0700
Subject: Re: not issuing vma_start_write() in dup_mmap() if the caller is single-threaded
To: Mateusz Guzik
Cc: linux-mm <linux-mm@kvack.org>, "Liam R. Howlett", Lorenzo Stoakes, Matthew Wilcox
On Fri, Mar 28, 2025 at 6:51 PM Mateusz Guzik wrote:
>
> top post warning
>
> It just hit me that userfaultfd() may be a problem. Creating it only
> bumps mm_count, while mm_users gets transiently modified during
> various ops.
>
> So in particular you may end up pounding off the userfaultfd instance
> to another process while being single-threaded, then mm_users == 1 and
> userfaultfd may be trying to do something. I have no idea if this one
> is guaranteed to take the lock.

Hmm, yeah. uffd is nasty. Even in the cases when it does
mmget_not_zero(), there is still the possibility of a race. For
example:

dup_mmap
    only_user = atomic_read(&oldmm->mm_users) == 1;
                                    userfaultfd_move()
                                        mmget_not_zero() <-- inc mm_users
    if (!only_user)
        vma_start_write(mpnt); <-- gets skipped
                                        move_pages()
                                            uffd_move_lock()
                                                uffd_lock_vma()
                                                    lock_vma_under_rcu() <-- succeeds
                                            move_pages_pte()
    copy_page_range()

So, userfaultfd_move() will happily move pages while dup_mmap() is
doing copy_page_range(). The copied range might look quite
interesting...

> However, the good news is that mm_count tends to be 1. If both
> mm_count and mm_users are 1, then there is no userfaultfd in use and
> nobody to add it either.

I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while
calling mmgrab(), therefore I think it can race with the code checking
its value. (A sketch of what such a combined check would look like is
appended at the end of this mail.)

> State of mm_count verified with: bpftrace -e 'kprobe:copy_process {
> @[curtask->mm->mm_count.counter] = count(); }'
>
> On Sat, Mar 29, 2025 at 2:35 AM Suren Baghdasaryan wrote:
> >
> > On Fri, Mar 28, 2025 at 6:16 PM Mateusz Guzik wrote:
> > >
> > > On Sat, Mar 29, 2025 at 1:57 AM Suren Baghdasaryan wrote:
> > > > On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik wrote:
> > > > > Here is the original:
> > > > > https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/
> > > > >
> > > > > the thread is weirdly long and I recommend not opening it without a
> > > > > good reason; I link it for reference if needed.
> > > >
> > > > I had to re-read it to remember what it was all about :) To bring
> > > > others up to speed, the suggested change would look something like
> > > > this:
> > > >
> > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > >         unsigned long charge = 0;
> > > >         LIST_HEAD(uf);
> > > >         VMA_ITERATOR(vmi, mm, 0);
> > > > +       bool only_user;
> > > >
> > > >         if (mmap_write_lock_killable(oldmm))
> > > >                 return -EINTR;
> > > > +       only_user = atomic_read(&oldmm->mm_users) == 1;
> > > >         flush_cache_dup_mm(oldmm);
> > > >         uprobe_dup_mmap(oldmm, mm);
> > > >         /*
> > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > >         mt_clear_in_rcu(vmi.mas.tree);
> > > >         for_each_vma(vmi, mpnt) {
> > > >                 struct file *file;
> > > >
> > > > -               vma_start_write(mpnt);
> > > > +               if (!only_user)
> > > > +                       vma_start_write(mpnt);
> > > >                 if (mpnt->vm_flags & VM_DONTCOPY) {
> > > >                         retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
> > > >                                                     mpnt->vm_end, GFP_KERNEL);
> > > >
> > > > > At the time there were woes concerning the stability of the new locking
> > > > > scheme, resulting in the CC list being rather excessive. As such I'm
> > > > > starting a new thread with a modest list, not sure who to add though.
> > > > >
> > > > > So dup_mmap() holds the caller's mmap_sem for writing and calls
> > > > > vma_start_write() to protect against fault handling in other threads
> > > > > using the same mm.
> > > > >
> > > > > If this is the only thread with this ->mm in place, there are no
> > > > > sibling threads to worry about and this can be checked with mm_users
> > > > > == 1.
> > > > >
> > > > > AFAICS all remote accesses require the mmap_sem to also be held, which
> > > > > provides exclusion against dup_mmap, meaning they don't pose a problem
> > > > > either.
> > > >
> > > > Yes, I believe the optimization you proposed would be safe.
> > > >
> > > > > The patch to merely skip locking is a few-liner and I would officially
> > > > > submit it myself, but for my taste an assert is needed in fault
> > > > > handling to runtime-test the above invariant.
> > > >
> > > > Hmm. Adding an assert in the pagefault path that checks whether the
> > > > fault is triggered during dup_mmap() && mm_users == 1 would not be
> > > > trivial. We would need to indicate that we expect no page faults while
> > > > in that section of dup_mmap() (maybe a flag inside mm_struct) and then
> > > > assert that this flag is not set inside the pagefault handling path.
> > > > I'm not sure it's worth the complexity... As was discussed in that
> > > > thread, the only other task that might fault a page would be external
> > > > and therefore would have to go through
> > > > access_remote_vm()->mmap_read_lock_killable(), and since dup_mmap()
> > > > already holds mmap_write_lock the new user would have to wait.
> > > >
> > > > > So happens I really
> > > > > can't be bothered to figure out how to sort it out and was hoping you
> > > > > would step in. ;) Alternatively, if you guys don't think the assert is
> > > > > warranted, that's your business.
> > > >
> > > > I don't think it's worth it but I'll CC Matthew and Lorenzo (you
> > > > already CC'ed Liam) to get their opinion.
> > > >
> > > > > As for whether this can save any locking -- yes:
> > > >
> > > > Yeah, I'm sure it will make a difference in performance. While forking
> > > > we are currently locking each VMA separately, so skipping that would
> > > > be nice.
> > > > Folks, WDYT? Do we need a separate assertion that a pagefault can't
> > > > happen if mm_users == 1 and we are holding mmap_write_lock
> > > > (access_remote_vm() will block)?
> > >
> > > pseudocode-wise I was thinking along the lines of the following in the fault path:
> > >
> > > if (current->mm != vma->vm_mm)
> > >         mmap_assert_locked(vma->vm_mm);
> > >
> > > with the assumption that mmap_assert_locked expands to nothing without debug
> >
> > I see. IOW, if someone external is faulting a page then it has to be
> > holding at least mmap_read_lock. So, it's a more general check but
> > seems reasonable. I think adding it at the end of lock_vma_under_rcu()
> > under the "#ifdef CONFIG_DEBUG_VM" condition would be enough. (A
> > fleshed-out sketch of that assertion is appended at the end of this
> > mail.)
> >
> > > > > I added a probe (below for reference) with two args: whether we are
> > > > > single-threaded and vma_start_write() returning whether it took the
> > > > > down/up cycle, and ran make -j 20 in the kernel dir.
> > > > >
> > > > > The lock was taken for every single vma (377913 in total), while all
> > > > > forking processes were single-threaded. Or to put it differently, all
> > > > > of these were skippable.
> > > > >
> > > > > the probe (total hack):
> > > > > bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }'
> > > > >
> > > > > probe diff:
> > > > > diff --git a/fs/namei.c b/fs/namei.c
> > > > > index ecb7b95c2ca3..d6cde76eda81 100644
> > > > > --- a/fs/namei.c
> > > > > +++ b/fs/namei.c
> > > > > @@ -5459,3 +5459,7 @@ const struct inode_operations page_symlink_inode_operations = {
> > > > >         .get_link       = page_get_link,
> > > > > };
> > > > > EXPORT_SYMBOL(page_symlink_inode_operations);
> > > > > +
> > > > > +void dup_probe(int, int);
> > > > > +void dup_probe(int, int) { }
> > > > > +EXPORT_SYMBOL(dup_probe);
> > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > index 1f80baddacc5..f7b1f0a02f2e 100644
> > > > > --- a/include/linux/mm.h
> > > > > +++ b/include/linux/mm.h
> > > > > @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_lock_seq)
> > > > >   * Exclude concurrent readers under the per-VMA lock until the currently
> > > > >   * write-locked mmap_lock is dropped or downgraded.
> > > > >   */
> > > > > -static inline void vma_start_write(struct vm_area_struct *vma)
> > > > > +static inline bool vma_start_write(struct vm_area_struct *vma)
> > > > > {
> > > > >         unsigned int mm_lock_seq;
> > > > >
> > > > >         if (__is_vma_write_locked(vma, &mm_lock_seq))
> > > > > -               return;
> > > > > +               return false;
> > > > >
> > > > >         down_write(&vma->vm_lock->lock);
> > > > >         /*
> > > > > @@ -776,6 +776,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
> > > > >          */
> > > > >         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > > > >         up_write(&vma->vm_lock->lock);
> > > > > +       return true;
> > > > > }
> > > > >
> > > > > static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > index 735405a9c5f3..0cc56255a339 100644
> > > > > --- a/kernel/fork.c
> > > > > +++ b/kernel/fork.c
> > > > > @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
> > > > >                 pr_warn_once("exe_file_deny_write_access() failed in %s\n", __func__);
> > > > > }
> > > > >
> > > > > +void dup_probe(int, int);
> > > > > +
> > > > > #ifdef CONFIG_MMU
> > > > > static __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> > > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > > >         unsigned long charge = 0;
> > > > >         LIST_HEAD(uf);
> > > > >         VMA_ITERATOR(vmi, mm, 0);
> > > > > +       bool only_user;
> > > > >
> > > > >         if (mmap_write_lock_killable(oldmm))
> > > > >                 return -EINTR;
> > > > > +       only_user = atomic_read(&oldmm->mm_users) == 1;
> > > > >         flush_cache_dup_mm(oldmm);
> > > > >         uprobe_dup_mmap(oldmm, mm);
> > > > >         /*
> > > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > > >         mt_clear_in_rcu(vmi.mas.tree);
> > > > >         for_each_vma(vmi, mpnt) {
> > > > >                 struct file *file;
> > > > > +               bool locked;
> > > > > +
> > > > > +               locked = vma_start_write(mpnt);
> > > > > +               dup_probe(only_user ? 1 : 0, locked ? 1 : 0);
> > > > >
> > > > > -               vma_start_write(mpnt);
> > > > >                 if (mpnt->vm_flags & VM_DONTCOPY) {
> > > > >                         retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
> > > > >                                                     mpnt->vm_end, GFP_KERNEL);
> > > > >
> > > > > --
> > > > > Mateusz Guzik
> > >
> > > --
> > > Mateusz Guzik
>
> --
> Mateusz Guzik
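
For concreteness, here is an untested sketch of the combined-counter
check (the mm_is_exclusive() helper and its name are mine, not a
submitted patch). Note it is still racy against new_userfaultfd() for
the reason above -- mmgrab() there is not serialized by mmap_lock --
so treat it as illustration only:

static bool mm_is_exclusive(struct mm_struct *oldmm)
{
	/* Both reads are done under mmap_write_lock(oldmm). */
	mmap_assert_write_locked(oldmm);

	/*
	 * mm_users == 1 rules out sibling threads; mm_count == 1 would
	 * additionally rule out pinned references such as a userfaultfd.
	 * new_userfaultfd() takes its reference without mmap_lock, so
	 * mm_count can still change under us.
	 */
	return atomic_read(&oldmm->mm_users) == 1 &&
	       atomic_read(&oldmm->mm_count) == 1;
}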
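
And a sketch of the remote-fault assertion discussed above, placed at
the end of lock_vma_under_rcu() just before the successful return
(exact placement is my guess; the whole thing compiles away without
CONFIG_DEBUG_VM):

#ifdef CONFIG_DEBUG_VM
	/*
	 * A remote fault (current->mm != vma->vm_mm) must hold at least
	 * mmap_read_lock, e.g. via
	 * access_remote_vm() -> mmap_read_lock_killable(). dup_mmap()
	 * holds mmap_write_lock, so a properly locked remote accessor
	 * cannot race with a fork that skipped vma_start_write().
	 */
	if (current->mm != vma->vm_mm)
		mmap_assert_locked(vma->vm_mm);
#endif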