From: Matthew Wilcox
To: Jann Horn
Cc: Linus Torvalds, Peter Zijlstra, John Hubbard, x86@kernel.org,
    akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, aarcange@redhat.com,
    kirill.shutemov@linux.intel.com, jroedel@suse.de, ubizjak@gmail.com
Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
Date: Tue, 25 Oct 2022 04:21:39 +0100
References: <20221022111403.531902164@infradead.org>
    <20221022114424.515572025@infradead.org>
    <2c800ed1-d17a-def4-39e1-09281ee78d05@nvidia.com>

On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote:
> """
> This guarantees that the page tables that are being walked
> aren't freed concurrently, but at the end of the walk, we
> have to grab a stable reference to the referenced page.
> For this we use the grab-reference-and-revalidate trick
> from above again:
> First we (locklessly) load the page
> table entry, then we grab a reference to the page that it
> points to (which can fail if the refcount is zero, in that
> case we bail), then we recheck that the page table entry
> is still the same, and if it changed in between, we drop the
> page reference and bail.
> This can, again, grab a reference to a page after it has
> already been freed and reallocated. The reason why this is
> fine is that the metadata structure that holds this refcount,
> `struct folio` (or `struct page`, depending on which kernel
> version you're looking at; in current kernels it's `folio`
> but `struct page` and `struct folio` are actually aliases for
> the same memory, basically, though that is supposed to maybe
> change at some point) is never freed; even when a page is
> freed and reallocated, the corresponding `struct folio`
> stays. This does have the fun consequence that whenever a
> page/folio has a non-zero refcount, the refcount can
> spuriously go up and then back down for a little bit.
> (Also it's technically not as simple as I just described it,
> because the `struct page` that the PTE points to might be
> a "tail page" of a `struct folio`.
> So actually we first read the PTE, the PTE gives us the
> `page*`, then from that we go to the `folio*`, then we
> try to grab a reference to the `folio`, then if that worked
> we check that the `page` still points to the same `folio`,
> and then we recheck that the PTE is still the same.)
> """

Nngh.  In trying to make this description fit all kernels (with
both pages and folios), you've complicated it maximally.  Let's
try a simpler explanation:

First we (locklessly) load the page table entry, then we grab a
reference to the folio that contains the page it points to (which
can fail if the refcount is zero, in that case we bail), then we
recheck that the page table entry is still the same, and if it
changed in between, we drop the folio reference and bail.

This can, again, grab a reference to a folio after it has already
been freed and reallocated.  The reason why this is fine is that
the metadata structure that holds this refcount, `struct folio`,
is never freed; even when a folio is freed and reallocated, the
corresponding `struct folio` stays.  This does have the fun
consequence that whenever a folio has a non-zero refcount, the
refcount can spuriously go up and then back down for a little bit.

(Also it's slightly more complex than I just described, because
the page that the PTE we just loaded points to might be in the
middle of being reallocated into a different folio.  So actually
we first read the PTE, translate the PTE into the `page*`, then
from that we go to the `folio*`, then we try to grab a reference
to the `folio`, then if that worked we check that the `page` is
still in the same `folio`, and then we recheck that the PTE is
still the same.  Older kernels did not make a clear distinction
between pages and folios, so it was even more confusing.)

Better?
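
For anyone who prefers to read it as code, here is a minimal
user-space sketch of the sequence in the parenthetical above.  It is
not the kernel's implementation: the `toy_` functions, the layout of
`struct folio` and `struct page`, and treating a PTE as a plain
pointer to a page are all assumptions made for illustration, and C11
atomics stand in for the kernel's own primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the kernel's real definitions. */
struct folio {
	atomic_int refcount;		/* 0 means the folio is free */
};

struct page {
	_Atomic(struct folio *) folio;	/* folio this page currently belongs to */
};

/* Assumption: a "PTE" is just a pointer to a page, NULL when not present. */
typedef _Atomic(struct page *) pte_t;

/* Take a reference unless the refcount is zero. */
static bool toy_folio_try_get(struct folio *folio)
{
	int old = atomic_load(&folio->refcount);

	do {
		if (old == 0)
			return false;	/* folio is free: bail */
	} while (!atomic_compare_exchange_weak(&folio->refcount, &old, old + 1));
	return true;
}

static void toy_folio_put(struct folio *folio)
{
	atomic_fetch_sub(&folio->refcount, 1);
}

/* Returns the folio with a reference held, or NULL if we had to bail. */
static struct folio *toy_get_folio(pte_t *ptep)
{
	struct page *page = atomic_load(ptep);	/* 1. lockless PTE load */
	struct folio *folio;

	if (!page)
		return NULL;
	folio = atomic_load(&page->folio);	/* 2. page -> folio */
	if (!toy_folio_try_get(folio))		/* 3. fails if refcount is 0 */
		return NULL;
	if (atomic_load(&page->folio) != folio ||	/* 4. page still in folio? */
	    atomic_load(ptep) != page) {		/* 5. PTE unchanged? */
		toy_folio_put(folio);	/* the spurious up-then-down */
		return NULL;
	}
	return folio;
}
```

The sketch only works because a `struct folio` is never freed, as
described above: step 3 may touch the refcount of a folio that has
since been reallocated, and steps 4 and 5 are what catch that case
and undo the reference.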