From: Linus Torvalds
Date: Sat, 8 Jul 2023 12:05:51 -0700
Subject: Re: Fwd: Memory corruption in multithreaded user space program while calling fork
To: Suren Baghdasaryan
Cc: Andrew Morton, Thorsten Leemhuis, Bagas Sanjaya, Jacob Young, Laurent Dufour,
    Linux Kernel Mailing List, Linux Memory Management, Linux PowerPC, Linux ARM,
    Greg KH, Linux regressions mailing list

On Sat, 8 Jul 2023 at 11:40, Suren Baghdasaryan wrote:
>
> My understanding was that flush_cache_dup_mm() is there to ensure
> nothing is in the cache, so locking VMAs before doing that would
> ensure that no page faults would pollute the caches after we flushed
> them. Is that reasoning incorrect?

It is indeed incorrect. The VIVT caches are fundamentally broken, and
we have various random hacks for them to make them work in legacy
situations.

And that flush_cache_dup_mm() is exactly that: a band-aid to make sure
that when we do a fork(), any previous writes that are dirty in the
caches will have made it to memory, so that they will show up in the
*new* process that has a different virtual mapping.

BUT!

This has nothing to do with page faults, or other threads. If you have
a threaded application that does fork(), it can - and will - dirty the
VIVT caches *during* the fork, and so the whole "flush_cache_dup_mm()"
is completely and fundamentally racy wrt any *new* activity.

That's not even what it is trying to deal with. All it tries to do is
to make sure that the newly forked child AT LEAST sees all the changes
that the parent did up to the point of the fork. Anything after that is
simply not relevant at all.

So think of all this not as some kind of absolute synchronization and
cache coherency (because you will never get that on a VIVT architecture
anyway), but as a "for the simple cases, this will at least get you the
expected behavior".

But as mentioned, for the issue of PER_VMA_LOCK, this is all *doubly*
irrelevant. Not only was it not relevant to begin with (ie that cache
flush only synchronizes parent -> child, not other-threads -> child),
but VIVT caches don't even exist on any relevant architecture, because
they are fundamentally broken in so many other ways.

So all our "synchronize caches by hand" code is literally just a
band-aid for legacy architectures. I think it's mostly things like the
old broken MIPS chips, some sparc32, and pa-risc: the "old RISC" stuff,
where people simplified the hardware a bit too much.

VIVT is lovely for hardware people because they get a shortcut. But
it's "lovely" in the same way that "PI=3" is lovely. It's simpler - but
it's _wrong_. And it's almost entirely useless if you ever do SMP. I
guarantee we have tons of races with it for very fundamental reasons -
the problems it causes for software are not fixable, they are "hidable
for the simple case".

So you'll also find things like dcache_page_flush(), which flushes
writes to a page to memory. And exactly like the fork() case, it's
*not* real cache coherency, and it's *not* some kind of true global
serialization. It's used in cases where we have a particular user that
wants the changes *it* made to be made visible. And exactly like
flush_cache_dup_mm(), it cannot deal with concurrent changes that other
threads make.
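
For reference, here is a heavily abridged sketch of where
flush_cache_dup_mm() sits in the fork path, based on dup_mmap() in
kernel/fork.c around v6.4. It is not the verbatim kernel code (error
handling and the body of the copy loop are elided); it only illustrates
the ordering described above: the flush is a one-shot, parent->child
affair done under the parent's mmap write lock, before the VMAs are
copied, and nothing about it constrains what other threads of the
parent do afterwards.

/*
 * Abridged sketch of dup_mmap() (kernel/fork.c, ~v6.4; not verbatim).
 */
static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
{
	struct vm_area_struct *mpnt;
	VMA_ITERATOR(vmi, oldmm, 0);

	if (mmap_write_lock_killable(oldmm))
		return -EINTR;
	/*
	 * Write back whatever the parent has dirtied in a virtually
	 * indexed cache so the child, which gets a brand-new virtual
	 * mapping, will see it.  Other threads of the parent can keep
	 * dirtying the cache after this point; the flush cannot order
	 * against them.
	 */
	flush_cache_dup_mm(oldmm);

	mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING);

	for_each_vma(vmi, mpnt) {
		/* copy each VMA and its page table entries into the child */
	}

	mmap_write_unlock(mm);
	mmap_write_unlock(oldmm);
	return 0;
}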
> Ok, I think these two are non-controversial:
> https://lkml.kernel.org/r/20230707043211.3682710-1-surenb@google.com
> https://lkml.kernel.org/r/20230707043211.3682710-2-surenb@google.com

These look sane to me.

I wonder if the vma_start_write() should have been somewhere else, but
at least it makes sense in context, even if I get the feeling that
maybe it should have been done in some helper earlier. As it is, we
randomly do it in other helpers like vm_flags_set(), and I've often had
the reaction that these vma_start_write() calls are randomly sprinkled
around without any clear _design_ for where they are.

> and the question now is how we fix the fork() case:
> https://lore.kernel.org/all/20230706011400.2949242-2-surenb@google.com/
> (if my above explanation makes sense to you)

See above. That patch is nonsensical. flush_cache_dup_mm() is not about
page faults, and trying to order it against other threads is
fundamentally not doable anyway.

> https://lore.kernel.org/all/20230705063711.2670599-2-surenb@google.com/

This is the one that makes sense to me.

              Linus
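
For context on the vm_flags_set() remark above, this is roughly what
those helpers look like (abridged from include/linux/mm.h in the v6.4
timeframe, not verbatim): each flag-modifying helper write-locks the
VMA first, so concurrent lockless page faults under PER_VMA_LOCK fall
back to taking mmap_lock before they can observe the update.

/*
 * Abridged sketch of the vm_flags helpers (include/linux/mm.h, ~v6.4;
 * not verbatim).  The vma_start_write() call is the "sprinkling"
 * discussed above.
 */
static inline void vm_flags_set(struct vm_area_struct *vma, vm_flags_t flags)
{
	vma_start_write(vma);	/* lockless faults now fall back to mmap_lock */
	ACCESS_PRIVATE(vma, __vm_flags) |= flags;
}

static inline void vm_flags_clear(struct vm_area_struct *vma, vm_flags_t flags)
{
	vma_start_write(vma);
	ACCESS_PRIVATE(vma, __vm_flags) &= ~flags;
}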