Subject: Re: [PATCH v3 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
Date: Thu, 12 Sep 2024 17:28:39 +0800
From: Qi Zheng
To: Muchun Song
Cc: LKML, Linux Memory Management List, linux-arm-kernel@lists.infradead.org,
 linuxppc-dev@lists.ozlabs.org, David Hildenbrand, hughd@google.com,
 willy@infradead.org, vbabka@kernel.org, akpm@linux-foundation.org,
 rppt@kernel.org, vishal.moola@gmail.com, peterx@redhat.com,
 ryan.roberts@arm.com, christophe.leroy2@cs-soprasteria.com
References: <20240904084022.32728-1-zhengqi.arch@bytedance.com>
 <20240904084022.32728-2-zhengqi.arch@bytedance.com>

Hi Muchun,

On 2024/9/6 15:20, Muchun Song wrote:
>
>
> On 2024/9/4 16:40, Qi Zheng wrote:
>> Currently, the usage of pte_offset_map_nolock() can be divided into the
>> following two cases:
>>
>> 1) After acquiring PTL, only read-only operations are performed on the
>>    PTE page. In this case, the RCU lock in pte_offset_map_nolock() will
>>    ensure that the PTE page will not be freed, and there is no need to
>>    worry about whether the pmd entry is modified.
>>
>> 2) After acquiring PTL, the pte or pmd entries may be modified. At this
>>    time, we need to ensure that the pmd entry has not been modified
>>    concurrently.
>>
>> To more clearly distinguish between these two cases, this commit
>> introduces two new helper functions to replace pte_offset_map_nolock().
>> For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
>> to changing the name to pte_offset_map_rw_nolock(), it also outputs the
>> pmdval when successful. It is applicable for may-write cases where any
>> modification operations to the page table may happen after the
>> corresponding spinlock is held afterwards. But the users should make sure
>> the page table is stable, by checking pte_same() or checking pmd_same()
>> against the output pmdval, before performing the write operations.
>>
>> Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
>> will be read-only/read-write protected.
>>
>> Subsequent commits will convert pte_offset_map_nolock() into the above
>> two functions one by one, and finally completely delete it.
>>
>> Signed-off-by: Qi Zheng
>> ---
>>  Documentation/mm/split_page_table_lock.rst |  7 +++
>>  include/linux/mm.h                         |  5 +++
>>  mm/pgtable-generic.c                       | 50 ++++++++++++++++++++++
>>  3 files changed, 62 insertions(+)
>>
>> diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
>> index e4f6972eb6c04..08d0e706a32db 100644
>> --- a/Documentation/mm/split_page_table_lock.rst
>> +++ b/Documentation/mm/split_page_table_lock.rst
>> @@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
>>   - pte_offset_map_nolock()
>>      maps PTE, returns pointer to PTE with pointer to its PTE table
>>      lock (not taken), or returns NULL if no PTE table;
>> + - pte_offset_map_ro_nolock()
>> +    maps PTE, returns pointer to PTE with pointer to its PTE table
>> +    lock (not taken), or returns NULL if no PTE table;
>> + - pte_offset_map_rw_nolock()
>> +    maps PTE, returns pointer to PTE with pointer to its PTE table
>> +    lock (not taken) and the value of its pmd entry, or returns NULL
>> +    if no PTE table;
>>   - pte_offset_map()
>>      maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
>>   - pte_unmap()
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index a7c74a840249a..1fde9242231c9 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3006,6 +3006,11 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
>>  pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>>              unsigned long addr, spinlock_t **ptlp);
>> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +                unsigned long addr, spinlock_t **ptlp);
>> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +                unsigned long addr, pmd_t *pmdvalp,
>> +                spinlock_t **ptlp);
>>
>>  #define pte_unmap_unlock(pte, ptl)    \
>>  do {        \
>>      spin_unlock(ptl);                \
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index a78a4adf711ac..262b7065a5a2e 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -317,6 +317,33 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>>      return pte;
>>  }
>>
>> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +                unsigned long addr, spinlock_t **ptlp)
>> +{
>> +    pmd_t pmdval;
>> +    pte_t *pte;
>> +
>> +    pte = __pte_offset_map(pmd, addr, &pmdval);
>> +    if (likely(pte))
>> +        *ptlp = pte_lockptr(mm, &pmdval);
>> +    return pte;
>> +}
>> +
>> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +                unsigned long addr, pmd_t *pmdvalp,
>> +                spinlock_t **ptlp)
>> +{
>> +    pmd_t pmdval;
>> +    pte_t *pte;
>> +
>> +    VM_WARN_ON_ONCE(!pmdvalp);
>> +    pte = __pte_offset_map(pmd, addr, &pmdval);
>> +    if (likely(pte))
>> +        *ptlp = pte_lockptr(mm, &pmdval);
>> +    *pmdvalp = pmdval;
>> +    return pte;
>> +}
>> +
>>  /*
>>   * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
>>   * __pte_offset_map_lock() below, is usually called with the pmd pointer for
>> @@ -356,6 +383,29 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>>   * recheck *pmd once the lock is taken; in practice, no callsite needs that -
>>   * either the mmap_lock for write, or pte_same() check on contents, is enough.
>>   *
>> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
>> + * but when successful, it also outputs a pointer to the spinlock in ptlp - as
>> + * pte_offset_map_lock() does, but in this case without locking it.
>> + * This helps
>> + * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
>> + * act on a changed *pmd: pte_offset_map_ro_nolock() provides the correct spinlock
>> + * pointer for the page table that it returns. Even after grabbing the spinlock,
>> + * we might be looking either at a page table that is still mapped or one that
>> + * was unmapped and is about to get freed. But for R/O access this is sufficient.
>> + * So it is only applicable for read-only cases where any modification operations
>> + * to the page table are not allowed even if the corresponding spinlock is held
>> + * afterwards.
>> + *
>> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
>> + * pte_offset_map_ro_nolock(); but when successful, it also outputs the pmdval.
>> + * It is applicable for may-write cases where any modification operations to the
>> + * page table may happen after the corresponding spinlock is held afterwards.
>> + * But the users should make sure the page table is stable, by checking pte_same()
>> + * or checking pmd_same() against the output pmdval, before performing the write
>> + * operations.
>
> Now we have two options to ensure the stability of the PTE for users
> of pte_offset_map_rw_nolock(). To ease this operation, how about
> proposing a new helper (or two, one for pmd_same, another for
> pte_same) like pte_lock_stability (I am not good at naming, maybe
> you can do better) which helps users 1) hold the PTL, 2) check if the
> PTE is stable, and 3) return true if the PTE is stable, otherwise
> return false?

I've been trying to do this these days, but I found it was not very
convenient.
I introduced the following helpers:

#define __PTE_STABILITY(lock)						\
bool __pte_stability_##lock(pmd_t *pmd, pmd_t *orig_pmd, pte_t *pte,	\
			    pte_t *orig_pte, spinlock_t *ptlp)		\
{									\
	pte_spin_##lock(ptlp);						\
	if (orig_pte) {							\
		VM_WARN_ON_ONCE(pte_none(*orig_pte));			\
		return pte_same(*orig_pte, ptep_get(pte));		\
	}								\
	if (orig_pmd) {							\
		VM_WARN_ON_ONCE(pmd_none(*orig_pmd));			\
		return pmd_same(*orig_pmd, pmdp_get_lockless(pmd));	\
	}								\
	VM_WARN_ON_ONCE(1);						\
	return false;							\
}
__PTE_STABILITY(lock)
__PTE_STABILITY(lock_nested)

static inline bool pte_stability_lock(pmd_t *pmd, pmd_t *orig_pmd,
				      pte_t *pte, pte_t *orig_pte,
				      spinlock_t *ptlp)
	__acquires(ptlp)
{
	return __pte_stability_lock(pmd, orig_pmd, pte, orig_pte, ptlp);
}

#ifdef CONFIG_SPLIT_PTE_PTLOCKS
static inline bool pte_stability_lock_nested(pmd_t *pmd, pmd_t *orig_pmd,
					     pte_t *pte, pte_t *orig_pte,
					     spinlock_t *ptlp)
	__acquires(ptlp)
{
	return __pte_stability_lock_nested(pmd, orig_pmd, pte, orig_pte, ptlp);
}

static inline void pte_stability_unlock_nested(spinlock_t *ptlp)
	__releases(ptlp)
{
	spin_unlock(ptlp);
}
#else
static inline bool pte_stability_lock_nested(pmd_t *pmd, pmd_t *orig_pmd,
					     pte_t *pte, pte_t *orig_pte,
					     spinlock_t *ptlp)
{
	return true;
}

static inline void pte_stability_unlock_nested(spinlock_t *ptlp)
{
}
#endif /* CONFIG_SPLIT_PTE_PTLOCKS */

and tried to use them with pte_offset_map_rw_nolock() in the following
functions:

1. collapse_pte_mapped_thp
2. handle_pte_fault
3. map_pte
4. move_pages_pte
5. walk_pte_range

For 1, 2 and 3, the conversion is relatively simple, but 2 already does
a pte_same() check, so it does not reduce the amount of code much. For
4, the pte_same() checks have already been done, and it is not easy to
convert double_pt_lock() to use pte_stability_lock{_nested}(). For 5,
it calls spin_trylock(), so we would have to introduce yet another
pte_stability_trylock() helper for it, which feels unnecessary.
There are not many places where pte_offset_map_rw_nolock() is called,
and some of them already do pte_same() checks, so maybe open coding is
enough and there is no need to introduce more helper functions.

Thanks,
Qi

>
> Muchun,
> Thanks.
>
>> + *
>> + * Note: "RO" / "RW" expresses the intended semantics, not that the *kmap* will
>> + * be read-only/read-write protected.
>> + *
>>   * Note that free_pgtables(), used after unmapping detached vmas, or when
>>   * exiting the whole mm, does not take page table lock before freeing a page
>>   * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
>