From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C6AA4EB64DA for ; Wed, 5 Jul 2023 08:08:32 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 233A68D0001; Wed, 5 Jul 2023 04:08:32 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 1E36E6B0075; Wed, 5 Jul 2023 04:08:32 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 084198D0001; Wed, 5 Jul 2023 04:08:32 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id E95F26B0074 for ; Wed, 5 Jul 2023 04:08:31 -0400 (EDT) Received: from smtpin28.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id AABFD140A4A for ; Wed, 5 Jul 2023 08:08:31 +0000 (UTC) X-FDA: 80976831222.28.06F692A Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf15.hostedemail.com (Postfix) with ESMTP id 59F5EA0012 for ; Wed, 5 Jul 2023 08:08:29 +0000 (UTC) Authentication-Results: imf15.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=VeHQJMm1; dmarc=pass (policy=none) header.from=redhat.com; spf=pass (imf15.hostedemail.com: domain of david@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=david@redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1688544509; a=rsa-sha256; cv=none; b=fgQ2vGMpnd8oj8Xr3XSQFxM9GhTMZnWPJRz/BgMUDUq9x1d+Czrs0n3h0KIUX71fDceL7U k6S1sBwUwFhkdwDYtNsGnNZJ2QVT5W2kXgiwe4iBKoArY8r0EZuuCgQqTMiGVvuR7FR8xA YZiXHqz+lcdElLvhyDlvVOqmuLQtuLg= ARC-Authentication-Results: i=1; imf15.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=VeHQJMm1; dmarc=pass (policy=none) header.from=redhat.com; spf=pass (imf15.hostedemail.com: domain of david@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=david@redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1688544509; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=zykKd6o7uQBadhK+QOJVEiiMMOWwoNrDKmhn2jAEWOk=; b=OhCBuDbPxqMNuoOFRYUU202mnyzYZbYYbqO1uySjVBbj+8uJy3eS/tVo7kzB+IHGXq6C++ krZMUlYbtYwOoF94yzmFxVGLOf86FvMt49PFPt7yctfxtwbitMIQQWY8yTUVmG8jr9/sVI SUUjOD2odJQdO5bB/GZwxGxFWOFpv5w= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1688544508; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=zykKd6o7uQBadhK+QOJVEiiMMOWwoNrDKmhn2jAEWOk=; b=VeHQJMm1FHPYJ+TjsmOOFkfN2YkOgnq0KnyR07cYiEUYKjIs4UEbX8OGIQGigbbO1S4GmN j9l8oWiCkLya7VHcBItzUZ28lwqWlF4rn+h6sKytLcmbkm7bOhGD2829DMYl/ZAaONebjl urwnj+5b5tA2jeUgBBlLqCjhN46zTc4= Received: from mail-wr1-f69.google.com (mail-wr1-f69.google.com [209.85.221.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-622-PB7WByvUMx2nSNbTJXNytQ-1; Wed, 05 Jul 2023 04:08:27 -0400 X-MC-Unique: PB7WByvUMx2nSNbTJXNytQ-1 Received: by mail-wr1-f69.google.com with SMTP id ffacd0b85a97d-3143c9a3519so1490373f8f.2 for ; Wed, 05 Jul 2023 01:08:27 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1688544506; x=1691136506; h=content-transfer-encoding:in-reply-to:subject:organization:from :content-language:references:cc:to:user-agent:mime-version:date :message-id:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=zykKd6o7uQBadhK+QOJVEiiMMOWwoNrDKmhn2jAEWOk=; b=Ou58tWHEaP6ZOmHdFsRlg/497nNKnXkBwEVcuPFQJ0wDHsAUmE5ckp5p1BGUT/9T+L t3yhVMHD6u0lLAPzOfufGp7HGzubA+p6fiHuUbPjBhyiglw9vPVot/AR8D2JIo6I6P+p UKPcyegzU57yJtOMzGzwkz2YzL9aAak4fOUCTswYBJzym1pff26dMC4PpecWF0DxjqAA Krx3SKu2OVM6tRmcFBgvHu+2fIWmovuo9Gf/gK6PDjhUF7MEuvFUGITfcIYbLfIQdfsz YKR9mvMhcICK4EIsZhb4ft6glcVshKfD2+xrwuo0Dj19MfBtCs9WdHJnVuZ+Ng63OV8p x5+w== X-Gm-Message-State: ABy/qLZmr7ABnXRDW4NEM71GEfN9dViTuzkUSJ842maJO8IGLPUfHKQj fByhQqo5JILysNb1nVJLUgIaxvxio5bDXRaOyl9jpdlxVuu6AK4xSX18tvoXwAn8x4kqdJJpzHf zJOHbHz9FXyU= X-Received: by 2002:a5d:5691:0:b0:314:46af:e1e7 with SMTP id f17-20020a5d5691000000b0031446afe1e7mr1875268wrv.34.1688544506565; Wed, 05 Jul 2023 01:08:26 -0700 (PDT) X-Google-Smtp-Source: APBJJlFCmoL+C20ubL9YTlE9vV43mW/bpliGveeEn8UIIwaQ8V9PTsaOY3aTQ+a0f2fYsRMGhFrQ+Q== X-Received: by 2002:a5d:5691:0:b0:314:46af:e1e7 with SMTP id f17-20020a5d5691000000b0031446afe1e7mr1875229wrv.34.1688544506135; Wed, 05 Jul 2023 01:08:26 -0700 (PDT) Received: from ?IPV6:2003:cb:c71a:1c00:e2b1:fc33:379b:a713? (p200300cbc71a1c00e2b1fc33379ba713.dip0.t-ipconnect.de. [2003:cb:c71a:1c00:e2b1:fc33:379b:a713]) by smtp.gmail.com with ESMTPSA id v11-20020a5d610b000000b00313e2abfb8dsm29649877wrt.92.2023.07.05.01.08.23 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 05 Jul 2023 01:08:25 -0700 (PDT) Message-ID: <970295ab-e85d-7af3-76e6-df53a5c52f8b@redhat.com> Date: Wed, 5 Jul 2023 10:08:23 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.12.0 To: Suren Baghdasaryan , akpm@linux-foundation.org Cc: jirislaby@kernel.org, jacobly.alt@gmail.com, holger@applied-asynchrony.com, hdegoede@redhat.com, michel@lespinasse.org, jglisse@google.com, mhocko@suse.com, vbabka@suse.cz, hannes@cmpxchg.org, mgorman@techsingularity.net, dave@stgolabs.net, willy@infradead.org, liam.howlett@oracle.com, peterz@infradead.org, ldufour@linux.ibm.com, paulmck@kernel.org, mingo@redhat.com, will@kernel.org, luto@kernel.org, songliubraving@fb.com, peterx@redhat.com, dhowells@redhat.com, hughd@google.com, bigeasy@linutronix.de, kent.overstreet@linux.dev, punit.agrawal@bytedance.com, lstoakes@gmail.com, peterjung1337@gmail.com, rientjes@google.com, chriscli@google.com, axelrasmussen@google.com, joelaf@google.com, minchan@google.com, rppt@kernel.org, jannh@google.com, shakeelb@google.com, tatashin@google.com, edumazet@google.com, gthelen@google.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org References: <20230705063711.2670599-1-surenb@google.com> <20230705063711.2670599-2-surenb@google.com> From: David Hildenbrand Organization: Red Hat Subject: Re: [PATCH v2 1/2] fork: lock VMAs of the parent process when forking In-Reply-To: <20230705063711.2670599-2-surenb@google.com> X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Rspam-User: X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 59F5EA0012 X-Stat-Signature: 9iynfqohw6em87435zqu7ugdopsoy18h X-HE-Tag: 1688544509-95889 X-HE-Meta: U2FsdGVkX1+WthSRDdExZVhTQfT9fP3HZmbMd9dVAxMOW0GFDiTjqqb++1ageErjXdS66XME0199ASGHvAZJQBdwTcGlxre+WNTBHdlL291xK5FG/bjvrQ0ovbJa3y0orVmCQVPrgQEV1Y/7cpZQtOEV+WXWPCEKEQeCyGl2bdLGNLR0xcKfb6E7wfa+dDSGNjk3ErQUpYA47MjCBE8KXNaThzZi+nimkfTO4NzeJYNxzcM5ZVYGkpU5zm7oaihczhLks9ewQ0u1k5gpkQLz229xPRAMVUpT1gLZjq1w4MkyvYVgrbOr7wa5wm7r67zFo6mG3scbt3PDs+O+IZT5PDTGddiCDHn5m567ipimUQ2jW2Djo99WL/GA0DNxnH0yX4NdV7+mwPMB6r/47ZTErofon7f56LpnuOY8wDzr1gn4FnjOsuLHEJWjN2m1cD+rXd2W6J8NUCZLoWPesx/GI/Qje9Vi53hUMl918BEIw8BkxSVnN64U/BV9/jQaIUBBS2hhmCKP9rcOqrLEnoK50kEvgo7tCVsawRnivvDMkaVRw2+SJkZbXN1Hnzb1I6+WqlBDBh78U06+73m51sqHME/HZ1kWw6I5CImgatiN8DnTxCYgicf/p8by59TWRsVyxkh7Mvc8/16xt/LxGrd05dpeFq1Y9THdQCd9sHPfmG4Bg16NEJ8c3uTBYC+RxzIepIHi3Q22uUKHUmefDNAU2ExUD2T8XF7EtmAVXYAC8kDJbwJ/j1DCE+EJOg+ogKI4DFpGXp005mmAFt33ZhH7CmAtj00wgNl1HeMCT8g3QliUflbuYUltl8BRid0ra5gweuBif5fgSXqV4kLQn5vGooIBVJva7NdtASXMVZ+5AwLz8sy7wqfWLgEyhz+ZtZIoyVouGmA9gO+loKAP+B/SD7AGoXfnav6R6TTnGrKe9ngbLi1Kmx9QDH6r7hfEqu/dwDLk8lQtdxwO1vprt6K FPHgrg6i LSUW5aR+nGY2TAeyYHkE8miGimsT0ILDfsIcz7ktZW47Qb81CxOvoLX1yh7y+JIrtgDIPnPsuGiYa2s8CYmSTSDJGvblUbirW5cbrl0QwJDF2SFRRVqqGxarzE4LtcKOdznNAlIdGTyUeszJ9+z1ZzF89JFWxs0KTJpNhFkriRMyTXw9h3yhImxGBcX0GBnaTCVGOM/Ei58ZFU1qJ7Oeq7T93CyJrHdb2xhqgCv1WH1VhqkmWHaEcq99NSELq+8DeKH1M8M4eU5Zk8R4PKMtn0p5GXQ598ToUvfF3VAQAXqMD9KXPOqo3T4t5wOCzn0qQbOe0MkJROac7j7QSBD0gRR0w5MS75eReI/KpqlgeKtb8GJ5F04IsBDvdN1P1dFT4U/vZQqNWCxzst+kdbeUeg6O5SwXp2O1KShIKNQwFE0QMMQuA3YLHy9M2AIFyydyaSxSQN/V897bogfEw8Kqc6+bSeNsBSdknp0XD+Dcs8zQ5YjS3W1LDZZXUyL6mv2pajgWl8Gj/B2snIjIkBwqhyCb6PNYgw9I7PQoOPWEAPGcW9JFkuCPGJ0a+wHU18E/yiursFSLSyU9J2y0aqLagOK9gnhvwPpX77PqBKVW/Z2BDluaZeEtd951W69UIzYnN05amykyl6OaM3ho= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 05.07.23 08:37, Suren Baghdasaryan wrote: > When forking a child process, parent write-protects an anonymous page > and COW-shares it with the child being forked using copy_present_pte(). > Parent's TLB is flushed right before we drop the parent's mmap_lock in > dup_mmap(). If we get a write-fault before that TLB flush in the parent, > and we end up replacing that anonymous page in the parent process in > do_wp_page() (because, COW-shared with the child), this might lead to > some stale writable TLB entries targeting the wrong (old) page. > Similar issue happened in the past with userfaultfd (see flush_tlb_page() > call inside do_wp_page()). > Lock VMAs of the parent process when forking a child, which prevents > concurrent page faults during fork operation and avoids this issue. > This fix can potentially regress some fork-heavy workloads. Kernel build > time did not show noticeable regression on a 56-core machine while a > stress test mapping 10000 VMAs and forking 5000 times in a tight loop > shows ~5% regression. If such fork time regression is unacceptable, > disabling CONFIG_PER_VMA_LOCK should restore its performance. Further > optimizations are possible if this regression proves to be problematic. Out of interest, did you also populate page tables / pages for some of these VMAs, or is this primarily looping over 10000 VMAs that don't actually copy any page tables? > > Suggested-by: David Hildenbrand > Reported-by: Jiri Slaby > Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/ > Reported-by: Holger Hoffstätte > Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/ > Reported-by: Jacob Young > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624 > Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first") > Cc: stable@vger.kernel.org > Signed-off-by: Suren Baghdasaryan > --- > kernel/fork.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/kernel/fork.c b/kernel/fork.c > index b85814e614a5..d2e12b6d2b18 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > for_each_vma(old_vmi, mpnt) { > struct file *file; > > + vma_start_write(mpnt); > if (mpnt->vm_flags & VM_DONTCOPY) { > vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt)); > continue; After the mmap_write_lock_killable(), there will still be a period where page faults can happen. Essentially, page faults can happen for a VMA until we lock that VMA. I cannot immediately name something that is broken allowing for that, and this change should fix the issue at hand, but exotic things like flush_cache_dup_mm(oldmm); make me wonder if we really want to allow for that or if there is some other corner case in fork() handling that really doesn't expect concurrent page faults (and, thereby, page table modifications) with fork. For example, documentation/core-api/cachetlb.rst says 2) ``void flush_cache_dup_mm(struct mm_struct *mm)`` This interface flushes an entire user address space from the caches. That is, after running, there will be no cache lines associated with 'mm'. This interface is used to handle whole address space page table operations such as what happens during fork. This option is separate from flush_cache_mm to allow some optimizations for VIPT caches. An alternative that requires another VMA walk would be diff --git a/kernel/fork.c b/kernel/fork.c index 41c964104b58..0f182d3f049b 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -662,6 +662,13 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, retval = -EINTR; goto fail_uprobe_end; } + + /* Disallow any page faults early by locking all VMAs. */ + if (IS_ENABLED(CONFIG_PER_VMA_LOCK)) { + for_each_vma(old_vmi, mpnt) + vma_start_write(mpnt); + vma_iter_init(old_vmi, old_mm, 0); + } flush_cache_dup_mm(oldmm); uprobe_dup_mmap(oldmm, mm); /* -- 2.41.0 Unless there are other thoughts, I guess you change is fine regarding the problem at hand. Not so sure regarding any other corner cases, that's why I'm spelling it out. Acked-by: David Hildenbrand -- Cheers, David / dhildenb