From: Qi Zheng <zhengqi.arch@bytedance.com>
To: David Hildenbrand
Cc: hughd@google.com, willy@infradead.org, muchun.song@linux.dev,
    vbabka@kernel.org, akpm@linux-foundation.org, rppt@kernel.org,
    vishal.moola@gmail.com, peterx@redhat.com, ryan.roberts@arm.com,
    christophe.leroy2@cs-soprasteria.com, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
    linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH v2 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
Date: Tue, 27 Aug 2024 12:33:53 +0800
Message-ID: <4481a0e4-a7a5-4223-a8ab-d1215d7c6352@bytedance.com>
Hi David,

On 2024/8/26 23:21, David Hildenbrand wrote:
> On 22.08.24 09:13, Qi Zheng wrote:
>> Currently, the usage of pte_offset_map_nolock() can be divided into the
>> following two cases:
>>
>> 1) After acquiring PTL, only read-only operations are performed on the
>>    PTE page. In this case, the RCU lock in pte_offset_map_nolock() will
>>    ensure that the PTE page will not be freed, and there is no need to
>>    worry about whether the pmd entry is modified.
>
> There is also the usage where we don't grab the PTL at all, and only do
> a racy (read-only) lookup.

IIUC, pte_offset_map() should be used instead of pte_offset_map_nolock()
in this case.

>
>> 2) After acquiring PTL, the pte or pmd entries may be modified. At this
>>    time, we need to ensure that the pmd entry has not been modified
>>    concurrently.
>>
>> To more clearly distinguish between these two cases, this commit
>> introduces two new helper functions to replace pte_offset_map_nolock().
>> For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
>> to changing the name to pte_offset_map_rw_nolock(), it also outputs the
>> pmdval when successful. This can help the caller recheck *pmd once the
>> PTL is taken. In some cases, either the mmap_lock for write, or a
>> pte_same() check on contents, is also enough to ensure that the pmd
>> entry is stable. But in order to prevent the interface from being
>> abused, we choose to pass in a dummy local variable instead of NULL.
>>
>> Subsequent commits will convert pte_offset_map_nolock() into the above
>> two functions one by one, and finally completely delete it.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   Documentation/mm/split_page_table_lock.rst |  7 ++++
>>   include/linux/mm.h                         |  5 +++
>>   mm/pgtable-generic.c                       | 43 ++++++++++++++++++++++
>>   3 files changed, 55 insertions(+)
>>
>> diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
>> index e4f6972eb6c04..08d0e706a32db 100644
>> --- a/Documentation/mm/split_page_table_lock.rst
>> +++ b/Documentation/mm/split_page_table_lock.rst
>> @@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other
>>  accessor functions:
>>   - pte_offset_map_nolock()
>>      maps PTE, returns pointer to PTE with pointer to its PTE table
>>      lock (not taken), or returns NULL if no PTE table;
>
> What will happen to pte_offset_map_nolock() after this series? Does it
> still exist or will it become an internal helper?

I choose to remove it completely in [PATCH v2 13/14].
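To make case 1) concrete, here is a minimal caller-side sketch. This is
my own illustration, not code from this series; the function name and the
pte_present() check are made up:

/* Hypothetical example: read-only access under the returned PTL. */
static bool example_pte_present(struct mm_struct *mm, pmd_t *pmd,
				unsigned long addr)
{
	spinlock_t *ptl;
	pte_t *pte;
	bool ret;

	pte = pte_offset_map_ro_nolock(mm, pmd, addr, &ptl);
	if (!pte)
		return false;	/* no PTE table */

	spin_lock(ptl);
	/*
	 * Read-only case: the RCU read lock taken inside the map call
	 * keeps the PTE page from being freed, so there is no need to
	 * recheck *pmd here.
	 */
	ret = pte_present(ptep_get(pte));
	spin_unlock(ptl);
	pte_unmap(pte);
	return ret;
}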
>
>> + - pte_offset_map_ro_nolock()
>> +    maps PTE, returns pointer to PTE with pointer to its PTE table
>> +    lock (not taken), or returns NULL if no PTE table;
>> + - pte_offset_map_rw_nolock()
>> +    maps PTE, returns pointer to PTE with pointer to its PTE table
>> +    lock (not taken) and the value of its pmd entry, or returns NULL
>> +    if no PTE table;
>
> [...]
>
>> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +				unsigned long addr, pmd_t *pmdvalp,
>> +				spinlock_t **ptlp)
>> +{
>> +	pmd_t pmdval;
>> +	pte_t *pte;
>> +
>> +	BUG_ON(!pmdvalp);
>
> As raised, no BUG_ON please. VM_WARN_ON_ONCE() is helpful during early
> testing and should catch this kind of thing.

OK, this patch was sent before you pointed that out; I will use
VM_WARN_ON_ONCE() instead of BUG_ON() in v3.

>
> If someone thinks not requiring a non-NULL pointer is better, please
> speak up, I'm not married to that idea :)
>
>> +	pte = __pte_offset_map(pmd, addr, &pmdval);
>> +	if (likely(pte))
>> +		*ptlp = pte_lockptr(mm, &pmdval);
>> +	*pmdvalp = pmdval;
>> +	return pte;
>> +}
>> +
>>   /*
>>    * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
>>    * __pte_offset_map_lock() below, is usually called with the pmd pointer for
>> @@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>>    * recheck *pmd once the lock is taken; in practice, no callsite needs that -
>>    * either the mmap_lock for write, or pte_same() check on contents, is enough.
>>    *
>> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
>> + * pte_offset_map(); but when successful, it also outputs a pointer to the
>> + * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
>> + * locking it.  This helps the caller to avoid a later pte_lockptr(mm, *pmd),
>> + * which might by that time act on a changed *pmd: pte_offset_map_ro_nolock()
>> + * provides the correct spinlock pointer for the page table that it returns.
>> + * For the read-only case, the caller does not need to recheck *pmd after the
>> + * lock is taken, because the RCU lock will ensure that the PTE page will not
>> + * be freed.
>> + *
>> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
>> + * pte_offset_map_ro_nolock(); but when successful, it also outputs the
>> + * pmdval. For cases where pte or pmd entries may be modified, that is, the
>> + * maywrite case, this can help the caller recheck *pmd once the lock is
>> + * taken. In some cases, either the mmap_lock for write, or a pte_same()
>> + * check on contents, is also enough to ensure that the pmd entry is stable.
>> + *
>>    * Note that free_pgtables(), used after unmapping detached vmas, or when
>>    * exiting the whole mm, does not take page table lock before freeing a page
>>    * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
>
> In general, to me this is a step in the right direction. The documentation
> could likely be further clarified in some aspects:
>
> Like that the use of pte_offset_map_ro_nolock() does not allow to easily
> identify if the page table was replaced in the meantime. Even after
> grabbing the PTL, we might be looking either at a page table that is
> still mapped or one that was unmapped and is about to get freed. But for
> R/O access this is usually sufficient AFAIUK.
>
> Or that "RO" / "RW" expresses the intended semantics, not that the
> *kmap* will be RO/RW protected.

How about the following:

pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
pte_offset_map(); but when successful, it also outputs a pointer to the
spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
locking it. This helps the caller to avoid a later pte_lockptr(mm, *pmd),
which might by that time act on a changed *pmd: pte_offset_map_ro_nolock()
provides the correct spinlock pointer for the page table that it returns.
Even after grabbing the spinlock, we might be looking either at a page
table that is still mapped or one that was unmapped and is about to get
freed. But for R/O access this is usually sufficient AFAIUK.

pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
pte_offset_map_ro_nolock(); but when successful, it also outputs the
pmdval. For R/W access, the caller cannot accept that the page table it
sees has been unmapped and is about to get freed. The pmdval can help the
caller to recheck pmd_same() to identify this case once the spinlock is
taken. There is no need to recheck pmdval where exclusivity is already
guaranteed, such as when holding the write lock of mmap_lock, or where a
later check is sufficient, such as when a !pte_none() pte will be
rechecked with pte_same() after the spinlock is taken.

Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be RO/RW protected.

Thanks,
Qi
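P.S. To make the maywrite case concrete, a minimal caller-side sketch of
the pmd_same() recheck pattern described above. Again, this is my own
illustration, not code from this series; the function name and the
-EAGAIN retry handling are made up:

/* Hypothetical example: clear a pte, rechecking pmd_same() under the PTL. */
static int example_clear_pte(struct mm_struct *mm, pmd_t *pmd,
			     unsigned long addr)
{
	spinlock_t *ptl;
	pmd_t pmdval;
	pte_t *pte;

	pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
	if (!pte)
		return -EAGAIN;	/* no PTE table */

	spin_lock(ptl);
	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
		/* The PTE page was replaced or unmapped under us; retry. */
		spin_unlock(ptl);
		pte_unmap(pte);
		return -EAGAIN;
	}
	pte_clear(mm, addr, pte);
	spin_unlock(ptl);
	pte_unmap(pte);
	return 0;
}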