Date: Wed, 4 Jun 2025 13:03:56 -0400
From: Peter Xu
To: Jann Horn
Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, "Liam R. Howlett",
    Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] mm/memory: Document how we make a coherent memory snapshot
References: <20250603-fork-tearing-v1-0-a7f64b7cfc96@google.com>
    <20250603-fork-tearing-v1-2-a7f64b7cfc96@google.com>
In-Reply-To: <20250603-fork-tearing-v1-2-a7f64b7cfc96@google.com>

On Tue, Jun 03, 2025 at 08:21:03PM +0200, Jann Horn wrote:
> It is not currently documented that the child of fork() should receive a
> coherent snapshot of the parent's memory, or how we get such a snapshot.
> Add a comment block to explain this.
>
> Signed-off-by: Jann Horn
> ---
>  kernel/fork.c | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 85afccfdf3b1..f78f5df596a9 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -604,6 +604,40 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
>  }
>
>  #ifdef CONFIG_MMU
> +/*
> + * Anonymous memory inherited by the child MM must, on success, contain a
> + * coherent snapshot of corresponding anonymous memory in the parent MM.

Should we better define what a coherent snapshot is?  Or maybe avoid using
this term, which seems to apply to the whole mm?

I think it's at least not a snapshot of the whole mm at a specific time,
because as long as there can be more than one concurrent writer (hence at
least 3 threads in the parent process, one of them in charge of fork), this
can happen:

  parent writer 1    parent writer 2    parent fork thr
  ---------------    ---------------    ---------------
                                        wr-protect P1
  write P1  <---- T1 (trapped, didn't happen)
                     write PN  <---- T2 (went through)
                                        ...
                                        wr-protect PN

The result of the above is that the child process will see a mixture of old
P1 (as of timestamp T1) and updated PN (as of timestamp T2).  I don't think
it's impossible for a userapp to serialize the "write P1" and "write PN"
operations in such a way that it would still be surprised to see, in the
child, PN updated but P1 not.

I do agree it at least recovers the per-page coherence, though, whatever the
POSIX definition of that is.  IIUC a userapp can always fix such a problem
itself, but maybe that's too complicated in some cases, and if Linux used to
at least maintain per-page coherency, then it may make sense to recover that
behavior, especially when it only affects pinned pages.

That said, maybe we still want to be specific about the goal of the change.

Thanks,

> + * (An exception are anonymous memory regions which are concurrently written
> + * by kernel code or hardware devices through page references obtained via GUP.)
> + * We effectively snapshot the parent's memory just before
> + * mmap_write_unlock(oldmm); any writes after that point are invisible to the
> + * child, while attempted writes before that point are either visible to the
> + * child or delayed until after mmap_write_unlock(oldmm).
> + *
> + * To make that work while only needing a single pass through the parent's VMA
> + * tree and page tables, we follow these rules:
> + *
> + * - Before mmap_write_unlock(), a TLB flush ensures that parent threads can't
> + *   write to copy-on-write pages anymore.
> + * - Before dup_mmap() copies page contents (which happens rarely), the
> + *   parent's PTE for the page is made read-only and a TLB flush is issued, so
> + *   subsequent writes are delayed until mmap_write_unlock().
> + * - Before dup_mmap() starts walking the page tables of a VMA in the parent,
> + *   the VMA is write-locked to ensure that the parent can't perform writes
> + *   that won't be visible in the child before mmap_write_unlock():
> + *   a) through concurrent copy-on-write handling
> + *   b) by upgrading read-only PTEs to writable
> + *
> + * Not following these rules, and giving the child a torn copy of the parent's
> + * memory contents where different segments come from different points in time,
> + * would likely _mostly_ work:
> + * Any memory to which a concurrent parent thread could be writing under a lock
> + * can't be accessed from the child without risking deadlocks (since the child
> + * might inherit the lock in a locked state, in which case the lock will stay
> + * locked forever in the child).
> + * But if userspace is using trylock or lock-free algorithms, providing a torn
> + * view of memory could break the child.
> + */
>  static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  					struct mm_struct *oldmm)
>  {
>
> --
> 2.49.0.1204.g71687c7c1d-goog
>

-- 
Peter Xu
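
For illustration only, the interleaving from the diagram in this mail
(wr-protect P1, trapped write to P1 at T1, successful write to PN at T2,
wr-protect PN) can be replayed deterministically in plain userspace C.  This
is a rough sketch and not kernel code: struct page_sim, try_write(),
wrprotect_and_copy() and replay_interleaving() are hypothetical names made up
for this example, a "page" is reduced to a single int, and the real threads
are replaced by a fixed sequence of calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 2 /* index 0 plays P1, index NPAGES - 1 plays PN */

/* Hypothetical stand-in for one anonymous page of the parent. */
struct page_sim {
	int value;        /* page contents, reduced to one int */
	bool wrprotected; /* set once the fork thread has write-protected it */
};

/*
 * A parent write: it goes through unless the page is already
 * write-protected, in which case it "traps" and does not happen
 * before the snapshot is complete.
 */
static bool try_write(struct page_sim *pg, int new_value)
{
	assert(pg != NULL);
	if (pg->wrprotected)
		return false;
	pg->value = new_value;
	return true;
}

/* Fork thread, per page: write-protect, then the child observes the contents. */
static void wrprotect_and_copy(struct page_sim *pg, int *child_value)
{
	pg->wrprotected = true;
	*child_value = pg->value;
}

/* Replays the T1/T2 ordering and stores what the child would see. */
void replay_interleaving(int child[NPAGES])
{
	struct page_sim parent[NPAGES] = {0};

	wrprotect_and_copy(&parent[0], &child[0]);   /* wr-protect P1 */
	(void)try_write(&parent[0], 1);              /* T1: trapped */
	(void)try_write(&parent[NPAGES - 1], 1);     /* T2: goes through */
	wrprotect_and_copy(&parent[NPAGES - 1],      /* wr-protect PN */
			   &child[NPAGES - 1]);
}
```

Calling replay_interleaving() on an int[NPAGES] array leaves the old value in
the slot for P1 and the new value in the slot for PN: each page on its own is
coherent, but the pair does not correspond to any single point in time in the
parent.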