From: Jann Horn
Date: Wed, 4 Jun 2025 17:41:47 +0200
Subject: Re: [PATCH 1/2] mm/memory: ensure fork child sees coherent memory snapshot
To: Pedro Falcato
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, "Liam R. Howlett",
 Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
 linux-mm@kvack.org, Peter Xu, linux-kernel@vger.kernel.org,
 stable@vger.kernel.org
References: <20250603-fork-tearing-v1-0-a7f64b7cfc96@google.com>
 <20250603-fork-tearing-v1-1-a7f64b7cfc96@google.com>

On Tue, Jun 3, 2025 at 10:32 PM Pedro Falcato wrote:
> On Tue, Jun 03, 2025 at 08:21:02PM +0200, Jann Horn wrote:
> > When fork() encounters possibly-pinned pages, those pages are immediately
> > copied instead of just marking PTEs to make CoW happen later. If the parent
> > is multithreaded, this can cause the child to see memory contents that are
> > inconsistent in multiple ways:
> >
> > 1. We are copying the contents of a page with a memcpy() while userspace
> >    may be writing to it. This can cause the resulting data in the child to
> >    be inconsistent.
>
> This is an interesting problem, but we'll get to it later.
>
> > 2. After we've copied this page, future writes to other pages may
> >    continue to be visible to the child while future writes to this page are
> >    no longer visible to the child.
> >
>
> Yes, and this is not fixable. It's also a problem for the regular write-protect
> pte path where inevitably only a part of the address space will be write-protected.

I don't understand what you mean by "inevitably only a part of the
address space will be write-protected". Are you talking about how
shared pages are kept shared between parent and child? Or are you
talking about how there is a point in time at which part of the
address space is write-protected while another part is not yet
write-protected? In that case: Yes, that can happen, but that's not a
problem.

> This would only be fixable if e.g we suspended every thread on a multi-threaded fork.

No, I think it is fine to keep threads running in parallel on a
multi-threaded fork as long as all the writes they do are guaranteed
to also be observable in the child. Such writes are no different from
writes performed before fork().

It would only get problematic if something in the parent first wrote
to page A, which has already been copied to the child (so the child no
longer sees the write) and then wrote to page B, which is CoWed (so
the child would see the write). I prevent this scenario by effectively
suspending the thread that tries to write to page A until the fork is
over (by making it block on the mmap lock in the fault handling path).
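
The page A / page B ordering hazard described above can be illustrated
with a toy model. This is only a sketch and not kernel code: the
function name fork_child_view(), its parameters, and the simplified
interleaving (one writer thread that runs after each page fork()
handles, until it would fault) are all assumptions made for
illustration.

```python
def fork_child_view(parent, pinned, writer_ops, block_on_copied):
    """Toy model of fork() racing with a single writer thread.

    parent:          dict page -> value (hypothetical page names, sorted
                     order stands in for address order)
    pinned:          set of pages that fork() copies immediately
    writer_ops:      list of (page, value) in the writer's program order
    block_on_copied: if True, a write to an already-copied page blocks
                     until fork() is over (the behavior the reply describes)
    Returns the memory contents the child ends up seeing.
    """
    parent = dict(parent)        # do not mutate the caller's dict
    child = {}
    handled = set()              # pages fork() has already processed
    pending = list(writer_ops)

    for page in sorted(parent):
        # fork() handles one page: an early copy if pinned, otherwise
        # write-protect and share. In both cases the child ends up with
        # the value the page holds at this instant.
        child[page] = parent[page]
        handled.add(page)

        # Let the writer run until it hits a write that would block.
        while pending:
            target, value = pending[0]
            if target in handled:
                early_copied = target in pinned
                if not early_copied or block_on_copied:
                    break        # faults and waits for fork() to finish
            parent[target] = value
            pending.pop(0)

    # fork() is over: blocked writes now complete in the parent only.
    for target, value in pending:
        parent[target] = value
    return child


ops = [("A", 1), ("B", 1)]       # writer stores to A first, then to B
racy = fork_child_view({"A": 0, "B": 0}, {"A"}, ops, block_on_copied=False)
fixed = fork_child_view({"A": 0, "B": 0}, {"A"}, ops, block_on_copied=True)
print(racy, fixed)
```

In this model the unblocked case gives the child the new value of B
but the old value of A, a state the parent never had. With blocking,
the write to A stalls, so the later write to B cannot overtake it.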

> > This means the child could theoretically see incoherent states where
> > allocator freelists point to objects that are actually in use or stuff like
> > that. A mitigating factor is that, unless userspace already has a deadlock
> > bug, userspace can pretty much only observe such issues when fancy lockless
> > data structures are used (because if another thread was in the middle of
> > mutating data during fork() and the post-fork child tried to take the mutex
> > protecting that data, it might wait forever).
> >
>
> Ok, so the issue here is that atomics + memcpy (or our kernel variants) will
> possibly observe tearing. This is indeed a problem, and POSIX doesn't _really_
> tell us anything about this. _However_:
>
> POSIX says:
> > Any locks held by any thread in the calling process that have been set to be process-shared
> > shall not be held by the child process. For locks held by any thread in the calling process
> > that have not been set to be process-shared, any attempt by the child process to perform
> > any operation on the lock results in undefined behavior (regardless of whether the calling
> > process is single-threaded or multi-threaded).
>
> The interesting bit here is "For locks held by any thread [...] any attempt by
> the child [...] results in UB". I don't think it's entirely far-fetched to say
> the spirit of the law is that atomics may also be UB (just like a lock[1] that was
> held by a separate thread, then unlocked mid-concurrent-fork is in a UB state).

I think interpreting atomic operations as locks is far-fetched. Also,
POSIX is a sort of minimal bar, and if we only implemented things
explicitly required by POSIX, we might not have a particularly useful
operating system.

Besides, I think things specified by the C standard override whatever
POSIX says, and C23 specifies that there are atomic operations, and I
haven't seen anything in C23 that restricts availability of those to
before fork().

> In any way, I think the bottom-line is that fork memory snapshot coherency is
> a fallacy. It's really impossible to reach without adding insane constraints
> (like the aforementioned thread suspending + resume). It's not even possible
> when going through normal write-protect paths that have been conceptually stable since
> the BSDs in the 1980s (due to the write-protect-a-page-at-a-time-problem).

No, Linux already had memory snapshot coherency before commit
70e806e4e645 ("mm: Do early cow for pinned pages during fork() for
ptes"). Write-protecting a page at a time does not cause coherency
issues, because letting a concurrent thread write into such memory
during fork() is no different from letting it do so before fork() from
a memory coherency perspective, as long as fork() write-locks memory
management for the process.
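
The claim that page-at-a-time write protection stays coherent can be
checked on a small model. The sketch below is an assumption-laden
illustration, not a description of the kernel's implementation: all
names are hypothetical, there is a single writer thread, and a write
to an already write-protected page is taken to block until fork() ends
(standing in for the fault path waiting on the mmap lock). It
enumerates interleavings and checks that the child always sees the
initial memory plus some prefix of the writer's program-ordered writes.

```python
from itertools import product


def apply_prefix(initial, writes, n):
    """Memory after the first n writes, i.e. a state the parent really had."""
    state = dict(initial)
    for page, value in writes[:n]:
        state[page] = value
    return state


def cow_fork_view(initial, writes, schedule):
    """Child's view when fork() write-protects pages one at a time.

    schedule[i] is how many writes the writer attempts right after
    fork() has write-protected the i-th page (in sorted order).
    Returns (child_view, number_of_writes_completed_during_fork).
    """
    parent = dict(initial)
    child = {}
    done = 0
    for page, budget in zip(sorted(parent), schedule):
        child[page] = parent[page]   # protect + share: child sees current value
        for _ in range(budget):
            if done == len(writes):
                break
            target, value = writes[done]
            if target in child:      # already protected: blocks until fork() ends
                break
            parent[target] = value
            done += 1
    return child, done


def always_coherent(initial, writes, max_budget=3):
    """True if every modeled interleaving yields a prefix-consistent child."""
    for schedule in product(range(max_budget + 1), repeat=len(initial)):
        child, done = cow_fork_view(initial, writes, schedule)
        if child != apply_prefix(initial, writes, done):
            return False
    return True


writes = [("C", 1), ("A", 1), ("B", 1), ("C", 2)]
print(always_coherent({"A": 0, "B": 0, "C": 0}, writes))
```

The check passes in this model because every write that completes
during fork() targets a page that has not been protected yet, so the
child picks it up later, exactly as if the write had happened before
fork().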