From: Qi Zheng <zhengqi.arch@bytedance.com>
To: David Hildenbrand, muchun.song@linux.dev
Cc: hughd@google.com, willy@infradead.org, vbabka@kernel.org,
	akpm@linux-foundation.org, rppt@kernel.org, vishal.moola@gmail.com,
	peterx@redhat.com, ryan.roberts@arm.com,
	christophe.leroy2@cs-soprasteria.com, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
	linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH v2 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
Date: Thu, 29 Aug 2024 11:27:46 +0800
Message-ID: <469e0ab5-af2c-4996-bfd4-fe7ab6a7bc8c@bytedance.com>
In-Reply-To: <42aba316-2b01-4fdb-9aff-9e670aac4c6e@redhat.com>
References: <4481a0e4-a7a5-4223-a8ab-d1215d7c6352@bytedance.com>
 <42aba316-2b01-4fdb-9aff-9e670aac4c6e@redhat.com>

On 2024/8/28 18:48, David Hildenbrand wrote:
> On 27.08.24 06:33, Qi Zheng wrote:
>> Hi David,
>>
>> On 2024/8/26 23:21, David Hildenbrand wrote:
>>> On 22.08.24 09:13, Qi Zheng wrote:
>>>> Currently, the usage of pte_offset_map_nolock() can be divided into
>>>> the following two cases:
>>>>
>>>> 1) After acquiring PTL, only read-only operations are performed on
>>>>    the PTE page. In this case, the RCU lock in pte_offset_map_nolock()
>>>>    will ensure that the PTE page will not be freed, and there is no
>>>>    need to worry about whether the pmd entry is modified.
>>>
>>> There is also the usage where we don't grab the PTL at all, and only do
>>> a racy (read-only) lookup.
>>
>> IIUC, pte_offset_map() should be used instead of pte_offset_map_nolock()
>> in this case.
>
> Yes, but the filemap.c thingy conditionally wants to lock later. But I
> agree that pte_offset_map() is better when not even wanting to lock.
>
> [...]
>
>>>> accessor functions:
>>>>    - pte_offset_map_nolock()
>>>>       maps PTE, returns pointer to PTE with pointer to its PTE table
>>>>       lock (not taken), or returns NULL if no PTE table;
>>>
>>> What will happen to pte_offset_map_nolock() after this series? Does it
>>> still exist or will it become an internal helper?
>>
>> I chose to remove it completely in [PATCH v2 13/14].
>>
>
> Ah, great.
>
> [...]
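
By the way, for the no-PTL lookup mentioned above, the pattern is just a
racy RCU-protected read through pte_offset_map(). A minimal sketch (the
helper below is hypothetical, only meant to illustrate, not code from
this series):

static bool pte_is_present_racy(pmd_t *pmd, unsigned long addr)
{
	pte_t *pte = pte_offset_map(pmd, addr);
	bool present;

	if (!pte)
		return false;	/* no PTE table here (anymore) */
	/*
	 * The RCU read side inside pte_offset_map() keeps the PTE page
	 * from being freed while it is mapped, but the value read may
	 * already be stale by the time we return it.
	 */
	present = pte_present(ptep_get(pte));
	pte_unmap(pte);
	return present;
}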
>
>>> If someone thinks not requiring a non-NULL pointer is better, please
>>> speak up, I'm not married to that idea :)
>>>
>>>> +	pte = __pte_offset_map(pmd, addr, &pmdval);
>>>> +	if (likely(pte))
>>>> +		*ptlp = pte_lockptr(mm, &pmdval);
>>>> +	*pmdvalp = pmdval;
>>>> +	return pte;
>>>> +}
>>>> +
>>>>  /*
>>>>   * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
>>>>   * __pte_offset_map_lock() below, is usually called with the pmd pointer for
>>>> @@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>>>>   * recheck *pmd once the lock is taken; in practice, no callsite needs that -
>>>>   * either the mmap_lock for write, or pte_same() check on contents, is enough.
>>>>   *
>>>> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
>>>> + * pte_offset_map(); but when successful, it also outputs a pointer to the
>>>> + * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
>>>> + * locking it.  This helps the caller to avoid a later pte_lockptr(mm, *pmd),
>>>> + * which might by that time act on a changed *pmd: pte_offset_map_ro_nolock()
>>>> + * provides the correct spinlock pointer for the page table that it returns.
>>>> + * For the read-only case, the caller does not need to recheck *pmd after the
>>>> + * lock is taken, because the RCU lock will ensure that the PTE page will not
>>>> + * be freed.
>>>> + *
>>>> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
>>>> + * pte_offset_map_ro_nolock(); but when successful, it also outputs the
>>>> + * pmdval. For cases where pte or pmd entries may be modified, that is, the
>>>> + * maywrite case, this can help the caller recheck *pmd once the lock is
>>>> + * taken. In some cases, either the mmap_lock for write, or a pte_same()
>>>> + * check on contents, is also enough to ensure that the pmd entry is stable.
>>>> + *
>>>>  * Note that free_pgtables(), used after unmapping detached vmas, or when
>>>>  * exiting the whole mm, does not take page table lock before freeing a page
>>>>  * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
>>>
>>> In general, to me a step in the right direction. Likely the
>>> documentation could be further clarified in some aspects:
>>>
>>> Like that the use of pte_offset_map_ro_nolock() does not allow to easily
>>> identify if the page table was replaced in the meantime. Even after
>>> grabbing the PTL, we might be looking either at a page table that is
>>> still mapped or one that was unmapped and is about to get freed. But for
>>> R/O access this is usually sufficient AFAIUK.
>>>
>>> Or that "RO" / "RW" expresses the intended semantics, not that the
>>> *kmap* will be RO/RW protected.
>>
>> How about the following:
>>
>> pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
>> pte_offset_map(); but when successful, it also outputs a pointer to the
>> spinlock in ptlp - as pte_offset_map_lock() does, but in this case
>> without locking it.  This helps the caller to avoid a later
>> pte_lockptr(mm, *pmd), which might by that time act on a changed *pmd:
>> pte_offset_map_ro_nolock() provides the correct spinlock pointer for the
>> page table that it returns. Even after grabbing the spinlock, we might
>> be looking either at a page table that is still mapped or one that was
>> unmapped and is about to get freed.
>> But for R/O access this is usually
>> sufficient AFAIUK.
>
> Drop the "AFAIUK" :)
>
> "For R/O access this is sufficient."

OK.

>
>>
>> pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
>> pte_offset_map_ro_nolock(); but when successful, it also outputs the
>> pmdval. For R/W access, the callers cannot accept that the page table
>> they see has been unmapped and is about to get freed. The pmdval can
>> help callers to recheck pmd_same() to identify this case once the
>> spinlock is taken. For some cases where exclusivity is already
>> guaranteed, such as holding the write lock of mmap_lock, or in cases
>> where checking is sufficient, such as a !pte_none() pte that will be
>> rechecked after the spinlock is taken, there is no need to recheck
>> pmdval.
>
> Right, using pte_same() one can achieve a similar result, assuming that
> the freed page table gets all ptes set to pte_none().
>
> page_table_check_pte_clear_range() before pte_free_defer() in
> retract_page_tables()/collapse_pte_mapped_thp() sanity-checks that, I
> think.
>
> In collapse_huge_page() that is not the case. But there, we also
> currently grab all the heavy locks, to prevent any concurrent page
> table walker.

Yes.

>
>>
>> Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
>> will be RO/RW protected.
>
> Good. Please also incorporate the feedback from Muchun.

OK, I will change it in v3 to the following:

pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
pte_offset_map(); but when successful, it also outputs a pointer to the
spinlock in ptlp - as pte_offset_map_lock() does, but in this case
without locking it.  This helps the caller to avoid a later
pte_lockptr(mm, *pmd), which might by that time act on a changed *pmd:
pte_offset_map_ro_nolock() provides the correct spinlock pointer for the
page table that it returns. Even after grabbing the spinlock, we might
be looking either at a page table that is still mapped or one that was
unmapped and is about to get freed. But for R/O access this is
sufficient. So it is only applicable to read-only cases, where no
modification of the page table is allowed even if the corresponding
spinlock is held afterwards.

pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
pte_offset_map_ro_nolock(); but when successful, it also outputs the
pmdval. It is applicable to may-write cases, where the page table may
be modified after the corresponding spinlock is held. But the users
should make sure the page table is stable, for example by holding
mmap_lock for write, by checking pte_same(), or by checking pmd_same()
against the output pmdval, before performing any write operations.

Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be read-only/read-write protected.
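
To make the intended semantics concrete, a minimal sketch of the
read-only pattern (the caller below is hypothetical, not code from this
series):

static pte_t read_pte_ro(struct mm_struct *mm, pmd_t *pmd,
			 unsigned long addr)
{
	spinlock_t *ptl;
	pte_t entry = __pte(0);
	pte_t *pte = pte_offset_map_ro_nolock(mm, pmd, addr, &ptl);

	if (!pte)
		return entry;	/* no PTE table */
	spin_lock(ptl);
	/*
	 * No pmd_same() recheck here: the PTE page cannot be freed
	 * while it is mapped, and for a pure read it does not matter
	 * if it was concurrently unmapped from *pmd.
	 */
	entry = ptep_get(pte);
	pte_unmap_unlock(pte, ptl);
	return entry;
}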
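
And the may-write pattern with the pmd_same() recheck (again a
hypothetical sketch; set_pte_at() stands in for whatever modification a
real caller would perform):

static bool write_pte_rw(struct mm_struct *mm, pmd_t *pmd,
			 unsigned long addr, pte_t newpte)
{
	spinlock_t *ptl;
	pmd_t pmdval;
	pte_t *pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);

	if (!pte)
		return false;	/* no PTE table */
	spin_lock(ptl);
	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
		/*
		 * The PTE page has been unmapped and may be about to
		 * be freed: writing through 'pte' would be unsafe.
		 */
		pte_unmap_unlock(pte, ptl);
		return false;
	}
	set_pte_at(mm, addr, pte, newpte);
	pte_unmap_unlock(pte, ptl);
	return true;
}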