From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 41B8DC54E67 for ; Wed, 27 Mar 2024 09:57:54 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 94A806B0082; Wed, 27 Mar 2024 05:57:53 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 8D3EC6B0083; Wed, 27 Mar 2024 05:57:53 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 79B636B0085; Wed, 27 Mar 2024 05:57:53 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 591576B0082 for ; Wed, 27 Mar 2024 05:57:53 -0400 (EDT) Received: from smtpin29.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 12EDC16099C for ; Wed, 27 Mar 2024 09:57:53 +0000 (UTC) X-FDA: 81942367626.29.C119392 Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by imf10.hostedemail.com (Postfix) with ESMTP id 40453C000C for ; Wed, 27 Mar 2024 09:57:51 +0000 (UTC) Authentication-Results: imf10.hostedemail.com; dkim=none; spf=pass (imf10.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com; dmarc=pass (policy=none) header.from=arm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1711533471; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=29BMCCnETrM4F7YLpcZhkqc0ZpxwH5n866sLjOhjeF0=; b=AVBkP6lmulA9492G3/wa/HaVuAytXDnr0AzmBM/tOzjKP4dI9jKFSc/nH6dHRWpyxcaC5a pr/atoE5Tk++mb6Np+j1xXongZFowSiBP8VqpmyJIOPG3hezZv3EB7OSZXJyN1dfykkx1O yPa7GqhLYH2Jxb+QjnjW5gtYTATRqeA= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1711533471; a=rsa-sha256; cv=none; b=dxvDEa8Bx6omlrH2Fml3PBhkUqxg0H/8e/GO2bmd4WgwJm29m3sC8dFeSIkTQw5wxT2oPA 3EeSOVr9o4VT18duVMa0WpopJmZK0k8YALWDjYOkUlKSUhLl7GB8B/+f8tfC+Qf6PLPR90 spwmRWnhIrXKlnmWwdR9tI9zTRxrYfg= ARC-Authentication-Results: i=1; imf10.hostedemail.com; dkim=none; spf=pass (imf10.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com; dmarc=pass (policy=none) header.from=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 6DCE52F4; Wed, 27 Mar 2024 02:58:24 -0700 (PDT) Received: from [10.57.72.121] (unknown [10.57.72.121]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 5E30A3F694; Wed, 27 Mar 2024 02:57:47 -0700 (PDT) Message-ID: <31b903f8-99dd-4790-8338-f4b9950b1ee6@arm.com> Date: Wed, 27 Mar 2024 09:57:45 +0000 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [RFC PATCH v1 1/4] mm: Introduce ptep_get_lockless_norecency() Content-Language: en-GB To: David Hildenbrand , Mark Rutland , Catalin Marinas , Will Deacon , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Andrew Morton , Muchun Song Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <20240215121756.2734131-1-ryan.roberts@arm.com> <20240215121756.2734131-2-ryan.roberts@arm.com> <7aefa967-43aa-490b-ae0d-7d1455402e89@redhat.com> <5c8dbda1-5b89-4599-9bc2-f840e7bc6a74@arm.com> <6777213f-6273-4942-86be-e712ee5ebd1a@redhat.com> From: Ryan Roberts In-Reply-To: <6777213f-6273-4942-86be-e712ee5ebd1a@redhat.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Stat-Signature: dhhuhhhbkyijjiqq7x6q3sxj4b853u8m X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: 40453C000C X-Rspam-User: X-HE-Tag: 1711533471-832216 X-HE-Meta: U2FsdGVkX19bma4U8JObO4aSNSZ+NMmZGeqcF4JoFb8cmPrDCaeL4LmV5gmE+Xkdy5YZlHXvh8FH2JscwYEa4wMQ57NByQePW1QzHchd270ArcoQpGp5WLm0APlM+abN1q9m9hRueRNawur0ppbUtz/TAiXPkrT5JWmNk7JDTEtcNC7Rh24rvesgaC5hwqz9Dd8ZDvOWhiFH1HErNeC21Cx2a9N1nysMfmvrANkvsriJjrLgIqccl9q0HxJARm/uuefNj8/eM53S5hN53Uk2V5veS1t8OpGSQr0DDRKo6nYPjaNpnzLxDaOG6ArsOyHxgNuH4TVMkrYvq2QT3IiwxHxf0QRAXHpAOsYQT9gB/SkMKPwvLTaFSuxA+aGxHuLfF6lqlp6WOmeCsrF+NsDQGRVtKGcebR3OVEBS4SCE/meQV+bbv2B+wGkYs7TfTL7QKwkBMHwUnSLYTs5YYDkc38CtxI6tLblZ/KPVeAve7v6Jk+W/J71oxQbM61McZjdLK7RLEEHKrTLh8fWfATT9fhAwQZpoh/qxsz3jeAX9K4PDlYkUYY3quAPEo3VMyZ073Kq4YC46IcoHYoMkg5yZG26cmWjm1Vt8DhPPhdK9oenL5PkKCure3z1nUk0a6dsd8b4WPCrFScAkualOLV1CjDs2qk4Raitzaz1LI143DiY3X5/l+NYuoIkTC8BTExKAXUcZwdy7ce+Vbdpw1pvI+VPtCZK8ibTNCNXBwasojBk/YKsg+rssktR+pBH1SBzEo1TOskXyE3HVHC9hIScQ+K/loILa4Otz9RANkteYtvA1rlFzBslVmaeTvQlCgDso2ZVErPp/eHgzJPjSeMI/008DuOfMN+JzJhmXhy1w3YcD95aVVdn0d7i9+mrb1YSoKrN4u6GKVXwC8PXFWa035V1tiepx14QmlVyett79lf7qulq2fI9WgC122uhucMRihNZjKFMA+zvxUiarpDK jVtGozn3 nOzVwyOxBamfPuREMo51OMkN82Ao1d+guy7kDoP2hGjz9zJHhudDCWj5rZ9/Fj7CukB+IX831mGni1qjK0Q/SuEDD0Z5sNW732mgHF7OZcdu9jm0amumzCJvR5iXa8WDB5i92iHO+c/ETxP0FPbmWTN9pyV/e+UEbNv7hua5i4dAZN/+6CHhdw0L2jcmXsRubrNQMmc01MFuocv5G+7n4wbvmKtqUZ5YEWZEHy8opWQDSRnbQgrHtuJU9mb+tJfXm1jMIFYA4IdzNinYkmLPxAqkebA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 27/03/2024 09:28, David Hildenbrand wrote: > On 26.03.24 17:39, Ryan Roberts wrote: >> On 26/03/2024 16:27, David Hildenbrand wrote: >>> On 15.02.24 13:17, Ryan Roberts wrote: >>>> With the introduction of contpte mapping support for arm64, that >>>> architecture's implementation of ptep_get_lockless() has become very >>>> complex due to the need to gather access and dirty bits from across all >>>> of the ptes in the contpte block. This requires careful implementation >>>> to ensure the returned value is consistent (because its not possible to >>>> read all ptes atomically), but even in the common case when there is no >>>> racing modification, we have to read all ptes, which gives an ~O(n^2) >>>> cost if the core-mm is iterating over a range, and performing a >>>> ptep_get_lockless() on each pte. >>>> >>>> Solve this by introducing ptep_get_lockless_norecency(), which does not >>>> make any guarantees about access and dirty bits. Therefore it can simply >>>> read the single target pte. >>>> >>>> At the same time, convert all call sites that previously used >>>> ptep_get_lockless() but don't care about access and dirty state. >>>> >>> >>> I'd probably split that part off. >> >> I thought the general guidance was to introduce new APIs in same patch they are >> first used in? If I split this off, I'll have one patch for a new (unused) API, >> then another for the first users. > > I don't know what exact guidance there is, but I tend to leave "non trivial > changes" to separate patches. > > Some of the changes here are rather trivial (mm/hugetlb.c), and I agree that we > can perform them here. > > At least the "vmf.orig_pte" looked "non-trivial" to me, thus my comment. got it. > >> >>> >>>> We may want to do something similar for ptep_get() (i.e. >>>> ptep_get_norecency()) in future; it doesn't suffer from the consistency >>>> problem because the PTL serializes it with any modifications, but does >>>> suffer the same O(n^2) cost. >>>> >>>> Signed-off-by: Ryan Roberts >>>> --- >>>>    include/linux/pgtable.h | 37 ++++++++++++++++++++++++++++++++++--- >>>>    kernel/events/core.c    |  2 +- >>>>    mm/hugetlb.c            |  2 +- >>>>    mm/khugepaged.c         |  2 +- >>>>    mm/memory.c             |  2 +- >>>>    mm/swap_state.c         |  2 +- >>>>    mm/swapfile.c           |  2 +- >>>>    7 files changed, 40 insertions(+), 9 deletions(-) >>>> >>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h >>>> index a36cf4e124b0..9dd40fdbd825 100644 >>>> --- a/include/linux/pgtable.h >>>> +++ b/include/linux/pgtable.h >>>> @@ -528,16 +528,47 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp) >>>>    #endif /* CONFIG_PGTABLE_LEVELS > 2 */ >>>>    #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */ >>>> >>>> -/* >>>> - * We require that the PTE can be read atomically. >>>> - */ >>>>    #ifndef ptep_get_lockless >>>> +/** >>>> + * ptep_get_lockless - Get a pte without holding the page table lock. Young >>>> and >>>> + *                     dirty bits are guaranteed to accurately reflect the >>>> state >>>> + *                     of the pte at the time of the call. >>>> + * @ptep: Page table pointer for pte to get. >>>> + * >>>> + * If young and dirty information is not required, use >>>> + * ptep_get_lockless_norecency() which can be faster on some architectures. >>>> + * >>>> + * May be overridden by the architecture; otherwise, implemented using >>>> + * ptep_get(), on the assumption that it is atomic. >>>> + * >>>> + * Context: Any. >>>> + */ >>> >>> I think we usually say "Any context.". But I would just do it like idr.h: >>> >>> "Any context. It is safe to call this function without locking in your code." >>> >>> ... but is this true? We really want to say "without page table lock". Because >>> there must be some way to prevent against concurrent page table freeing. For >>> example, GUP-fast disables IRQs, whereby page table freeing code frees using >>> RCU. >> >> How about: >> >> " >> Context: Any context that guarrantees the page table can't be freed > > s/guarrantees/guarantees/ > >> concurrently. The page table lock is not required. >> " >> > > Sounds good. Great! > >>> >>>>    static inline pte_t ptep_get_lockless(pte_t *ptep) >>>>    { >>>>        return ptep_get(ptep); >>>>    } >>>>    #endif >>>> >>>> +#ifndef ptep_get_lockless_norecency >>>> +/** >>>> + * ptep_get_lockless_norecency - Get a pte without holding the page table >>>> lock. >>>> + *                 Young and dirty bits may not be accurate. >>>> + * @ptep: Page table pointer for pte to get. >>>> + * >>>> + * Prefer this over ptep_get_lockless() when young and dirty information is >>>> not >>>> + * required since it can be faster on some architectures. >>>> + * >>>> + * May be overridden by the architecture; otherwise, implemented using the >>>> more >>>> + * precise ptep_get_lockless(). >>>> + * >>>> + * Context: Any. >>> >>> Same comment. >>> >>>> + */ >>>> +static inline pte_t ptep_get_lockless_norecency(pte_t *ptep) >>>> +{ >>>> +    return ptep_get_lockless(ptep); >>>> +} >>>> +#endif >>> >>> [...] >>> >>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >>>> index 68283e54c899..41dc44eb8454 100644 >>>> --- a/mm/hugetlb.c >>>> +++ b/mm/hugetlb.c >>>> @@ -7517,7 +7517,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct >>>> vm_area_struct *vma, >>>>        } >>>> >>>>        if (pte) { >>>> -        pte_t pteval = ptep_get_lockless(pte); >>>> +        pte_t pteval = ptep_get_lockless_norecency(pte); >>>> >>>>            BUG_ON(pte_present(pteval) && !pte_huge(pteval)); >>>>        } >>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >>>> index 2771fc043b3b..1a6c9ed8237a 100644 >>>> --- a/mm/khugepaged.c >>>> +++ b/mm/khugepaged.c >>>> @@ -1019,7 +1019,7 @@ static int __collapse_huge_page_swapin(struct mm_struct >>>> *mm, >>>>                } >>>>            } >>>> >>>> -        vmf.orig_pte = ptep_get_lockless(pte); >>>> +        vmf.orig_pte = ptep_get_lockless_norecency(pte); >>>>            if (!is_swap_pte(vmf.orig_pte)) >>>>                continue; >>> >>> >>> Hm, I think you mentioned that we want to be careful with vmf.orig_pte. >> >> Yeah good point. So I guess this should move to patch 3 (which may be dropped - >> tbd)? >> > > Yes. Or a separate one where you explain in detail why do_swap_page() can handle > it just fine. Ahh no wait - I remember now; the reason I believe this is a "trivial" case is because we only leak vmf.orig_pte to the rest of the world if its a swap entry. And if its a swap entry, then ptep_get_lockless_norecency() is equivalent to ptep_get_lockless() - the pte is not present so there are no access or dirty bits. So I think this can stay here?