From: Jann Horn
Date: Fri, 6 Jun 2025 16:11:52 +0200
Subject: Re: [PATCH 2/2] mm/memory: Document how we make a coherent memory snapshot
To: Peter Xu
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, "Liam R. Howlett",
    Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20250603-fork-tearing-v1-0-a7f64b7cfc96@google.com>
    <20250603-fork-tearing-v1-2-a7f64b7cfc96@google.com>

On Wed, Jun 4, 2025 at 10:10 PM Peter Xu wrote:
> On Wed, Jun 04, 2025 at 08:11:08PM +0200, Jann Horn wrote:
> > On Wed, Jun 4, 2025 at 7:04 PM Peter Xu wrote:
> > >
> > > On Tue, Jun 03, 2025 at 08:21:03PM +0200, Jann Horn wrote:
> > > > It is not currently documented that the child of fork() should receive a
> > > > coherent snapshot of the parent's memory, or how we get such a snapshot.
> > > > Add a comment block to explain this.
> > > >
> > > > Signed-off-by: Jann Horn
> > > > ---
> > > >  kernel/fork.c | 34 ++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 34 insertions(+)
> > > >
> > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > index 85afccfdf3b1..f78f5df596a9 100644
> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -604,6 +604,40 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
> > > >  }
> > > >
> > > >  #ifdef CONFIG_MMU
> > > > +/*
> > > > + * Anonymous memory inherited by the child MM must, on success, contain a
> > > > + * coherent snapshot of corresponding anonymous memory in the parent MM.
> > >
> > > Should we better define what is a coherent snapshot?  Or maybe avoid using
> > > this term which seems to apply to the whole mm?
> > >
> > > I think it's at least not a snapshot of whole mm at a specific time,
> > > because as long as there can be more than one concurrent writers (hence, it
> > > needs to be at least 3 threads in the parent process, 1 in charge of fork),
> > > this can happen:
> > >
> > >   parent writer 1      parent writer 2      parent fork thr
> > >   ---------------      ---------------      ---------------
> > >                                             wr-protect P1
> > >   write P1  <---- T1
> > >   (trapped, didn't happen)
> > >                        write PN  <---- T2
> > >                        (went through)
> > >                                             ...
> > >                                             wr-protect PN
> > >
> > > The result of above would be that child process will see a mixture of old
> > > P1 (at timestamp T1) but updated P2 (timestamp T2).  I don't think it's
> > > impossible that the userapp could try to serialize "write P1" and "write
> > > PN" operations in a way that it would also get a surprise seeing in the
> > > child PN updated but P1 didn't.
> >
> > If the write at T1 hits a page fault, then it doesn't actually happen
> > at T1. The write instruction starts doing something at T1, but it does
> > not fully retire, and the architectural register state does not
> > change, and in particular the instruction pointer does not advance
> > past this instruction; just like when speculative execution is aborted
> > after a branch misprediction, except that the CPU raises an exception
> > and we enter the page fault handler. The write actually happens when
> > the instruction is executed a second time after page fault handling
> > has completed after the mmap lock is dropped. (Unless something during
> > page fault handling raises a signal, in which case the instruction
> > might never architecturally execute.)
>
> Fair enough.  So maybe that's something like a best-effort whole mm
> snapshot anytime happened during the fork() but before releasing mmap write
> lock.
>
> Your comment did mention one exception on the kernel, is it still pretty
> easy to happen?  I'm thinking this use case of trying to load some data
> from a O_DIRECT fd and then set the var to show it's loaded:
>
>   bool data_read=0
>   read(...);
>   data_read=1;
>
> Then IIUC this can happen:
>
>   parent thread 1                        parent fork thr
>   ---------------                        ---------------
>   read(...)
>     using O_DIRECT on priv-anon buffers P1
>       pin_user_pages
>                                          fork() happens
>                                            Sees P1 pinned
>                                            P1 early COW (child sees no data loaded)
>       memcpy()
>   set data_read=1
>   (data_read can be a global private var on P2)
>                                          P2 wr-protected (child sees data_read=1)
>
> Hence in child even if it sees data_read=1 it is possible the buffer may be
> uninitialized, or the buffer is partly loaded, still racing with the kernel
> early COW.

Urgh. True, I had not considered that case.

> I'm not sure if I understand it correct this time as you discussed in the
> comment.
> If so, should we still not emphasize too much on the kernel
> providing coherent mm snapshot, at least emphasize the best-effort part
> (both in comment of patch 2, but also in patch subjects)?  After all, it
> seems it isn't straightforward for any userapp to see when that coherency
> will be violated.

Yeah, at least I should add a big caveat in this comment about how
O_DIRECT read buffers might be stale in such a case, and that we're
doing the best we can with a single-pass approach. Urgh.

> From that POV, maybe it's better we should still suggest the undefined
> behavior, even if it'll recover the old behavior some existing use case?
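
For reference, here is a minimal userspace sketch of the baseline
expectation under discussion, in the trivial case where none of the races
above apply (single-threaded parent, no O_DIRECT, no pinned pages). It only
shows that a write the parent makes to private anonymous memory after
fork() is not visible in the child. The helper name
child_view_after_parent_write() and the two-pipe handshake are made up for
illustration; nothing here is taken from the patch, and it does not
exercise the multi-threaded or O_DIRECT cases.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>		/* for callers that assert on the result */
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only: fork a child, then modify
 * private anonymous memory in the parent, and report which value the
 * child observed.  Returns the observed byte, or -1 on error.  Early
 * error paths leak the mapping/fds to keep the sketch short.
 */
int child_view_after_parent_write(void)
{
	int go[2], back[2], status, ok;
	unsigned char seen = 0;
	char token = 'x';
	unsigned char *buf;
	pid_t pid;

	buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;
	if (pipe(go) != 0 || pipe(back) != 0)
		return -1;

	buf[0] = 1;		/* contents at the time of fork() */

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: wait until the parent has written, then report. */
		close(go[1]);
		close(back[0]);
		if (read(go[0], &token, 1) != 1)
			_exit(1);
		if (write(back[1], &buf[0], 1) != 1)
			_exit(1);
		_exit(0);
	}

	close(go[0]);
	close(back[1]);

	buf[0] = 2;		/* post-fork write: parent gets its own copy */

	/* Tell the child to look, then collect what it saw. */
	ok = write(go[1], &token, 1) == 1 && read(back[0], &seen, 1) == 1;

	close(go[1]);
	close(back[0]);
	waitpid(pid, &status, 0);
	munmap(buf, 4096);

	if (!ok || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;
	return seen;
}
```

Calling child_view_after_parent_write() should return 1: the child keeps
the value from the time of fork() even though the parent has since
written 2.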