From: Jann Horn
Date: Wed, 4 Jun 2025 20:11:08 +0200
Subject: Re: [PATCH 2/2] mm/memory: Document how we make a coherent memory snapshot
To: Peter Xu
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, "Liam R. Howlett", Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20250603-fork-tearing-v1-0-a7f64b7cfc96@google.com> <20250603-fork-tearing-v1-2-a7f64b7cfc96@google.com>

On Wed, Jun 4, 2025 at 7:04 PM Peter Xu wrote:
> On Tue, Jun 03, 2025 at 08:21:03PM +0200, Jann Horn wrote:
> > It is not currently documented that the child of fork() should receive a
> > coherent snapshot of the parent's memory, or how we get such a snapshot.
> > Add a comment block to explain this.
> >
> > Signed-off-by: Jann Horn
> > ---
> >  kernel/fork.c | 34 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 34 insertions(+)
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 85afccfdf3b1..f78f5df596a9 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -604,6 +604,40 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
> >  }
> >
> >  #ifdef CONFIG_MMU
> > +/*
> > + * Anonymous memory inherited by the child MM must, on success, contain a
> > + * coherent snapshot of corresponding anonymous memory in the parent MM.
>
> Should we better define what is a coherent snapshot?  Or maybe avoid using
> this term which seems to apply to the whole mm?
>
> I think it's at least not a snapshot of whole mm at a specific time,
> because as long as there can be more than one concurrent writers (hence, it
> needs to be at least 3 threads in the parent process, 1 in charge of fork),
> this can happen:
>
>   parent writer 1      parent writer 2      parent fork thr
>   ---------------      ---------------      ---------------
>                                             wr-protect P1
>   write P1                                                    <---- T1
>     (trapped, didn't happen)
>                        write PN                               <---- T2
>                          (went through)
>                                             ...
>                                             wr-protect PN
>
> The result of above would be that child process will see a mixture of old
> P1 (at timestamp T1) but updated P2 (timestamp T2).  I don't think it's
> impossible that the userapp could try to serialize "write P1" and "write
> PN" operations in a way that it would also get a surprise seeing in the
> child PN updated but P1 didn't.

If the write at T1 hits a page fault, then it doesn't actually happen at T1.
The write instruction starts doing something at T1, but it does not
fully retire, and the architectural register state does not change,
and in particular the instruction pointer does not advance past this
instruction; just like when speculative execution is aborted after a
branch misprediction, except that the CPU raises an exception and we
enter the page fault handler. The write actually happens when the
instruction is executed a second time, after page fault handling has
completed, which in turn happens after the mmap lock is dropped.
(Unless something during page fault handling raises a signal, in which
case the instruction might never architecturally execute.)

(There is a caveat to what I just said, which makes this more complex
but does not fundamentally change the outcome: An instruction that
performs multiple memory writes without specific atomicity guarantees
can successfully do some writes and then fail on a later write. In
this case, after the page fault handler resolves the fault, the entire
instruction will run from the start again, including re-doing the
writes that were already done on the first execution, and this works
because such instructions are designed to be idempotent in this regard
and they don't make atomicity guarantees.)

> I do agree it at least recovered the per-page coherence, though, no matter
> what is the POSIX definition of that.  IIUC an userapp can always fix such
> problem, but maybe it's too complicated in some cases, and if Linux used to
> at least maintain per-page coherency, then it may make sense to recover the
> behavior especially when it only affects pinned.
>
> Said that, maybe we still want to be specific on the goal of the change.
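
To make the fault-and-restart point above concrete, here is a rough
userspace sketch (not kernel code, and not part of the patch). It
write-protects a page with mprotect(), attempts a store, and uses a
SIGSEGV handler as a stand-in for the page fault handler. The names
faulting_write_demo() and on_segv() are made up for illustration. It
assumes Linux, where a store to a read-only mapping raises SIGSEGV and
where calling mprotect() from a handler for a synchronous fault works in
practice, although POSIX does not list mprotect() as async-signal-safe.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile unsigned char *page;     /* the write-protected page */
static size_t page_size;
static volatile sig_atomic_t faults;     /* number of times the handler ran */
static volatile sig_atomic_t in_handler; /* page[0] as observed in the handler */

/* Stand-in for the page fault handler: record state, then allow writes. */
static void on_segv(int sig, siginfo_t *info, void *uctx)
{
	uintptr_t addr = (uintptr_t)info->si_addr;
	uintptr_t base = (uintptr_t)page;

	(void)sig;
	(void)uctx;
	if (addr < base || addr >= base + page_size)
		_exit(2); /* a fault we did not provoke */

	faults++;
	in_handler = page[0]; /* reads are still allowed: old value expected */
	if (mprotect((void *)page, page_size, PROT_READ | PROT_WRITE) != 0)
		_exit(3);
	/* Returning restarts the faulting store instruction from scratch. */
}

/* Hypothetical helper: returns 0 on success, -1 on setup failure. */
int faulting_write_demo(int *seen_in_handler, int *seen_after, int *fault_count)
{
	struct sigaction sa, old;
	long ps = sysconf(_SC_PAGESIZE);
	void *p;

	assert(seen_in_handler && seen_after && fault_count);
	if (ps <= 0)
		return -1;
	page_size = (size_t)ps;

	p = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	page = p;
	page[0] = 1; /* old value, written before the page is protected */
	faults = 0;
	in_handler = -1;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_segv;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, &old) != 0 ||
	    mprotect(p, page_size, PROT_READ) != 0) {
		munmap(p, page_size);
		return -1;
	}

	page[0] = 2; /* "T1": traps; the store only lands after the handler */

	*seen_in_handler = in_handler;
	*seen_after = page[0];
	*fault_count = faults;

	sigaction(SIGSEGV, &old, NULL);
	munmap(p, page_size);
	return 0;
}
```

Calling faulting_write_demo() from a small main() should show the handler
still observing the old value while the store is trapped, and the new
value only after the handler has returned and the store has been
re-executed. This is only an analogy: in the fork() case the fault is
resolved inside the kernel rather than by a signal handler.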