From: Ryan Roberts <ryan.roberts@arm.com>
To: David Hildenbrand, Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox, Mark Rutland, Kefeng Wang, John Hubbard, Zi Yan, Barry Song <21cnbao@gmail.com>, Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy, "Aneesh Kumar K.V", "Naveen N. Rao", Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, "H. Peter Anvin"
Cc: linux-arm-kernel@lists.infradead.org, x86@kernel.org, linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 22/25] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()
Date: Mon, 12 Feb 2024 15:47:44 +0000
Message-ID: <82c59a7f-328e-4521-8855-ccacc3dc4ce5@arm.com>
In-Reply-To: <6d452a1a-1edc-4e97-8b39-99dc48315bb8@redhat.com>
References: <20240202080756.1453939-1-ryan.roberts@arm.com> <20240202080756.1453939-23-ryan.roberts@arm.com> <6d452a1a-1edc-4e97-8b39-99dc48315bb8@redhat.com>

On 12/02/2024 13:43, David Hildenbrand wrote:
> On 02.02.24 09:07, Ryan Roberts wrote:
>> Some architectures (e.g.
>> arm64) can tell from looking at a pte, if some
>> follow-on ptes also map contiguous physical memory with the same pgprot.
>> (for arm64, these are contpte mappings).
>>
>> Take advantage of this knowledge to optimize folio_pte_batch() so that
>> it can skip these ptes when scanning to create a batch. By default, if
>> an arch does not opt-in, folio_pte_batch() returns a compile-time 1, so
>> the changes are optimized out and the behaviour is as before.
>>
>> arm64 will opt-in to providing this hint in the next patch, which will
>> greatly reduce the cost of ptep_get() when scanning a range of contptes.
>>
>> Tested-by: John Hubbard
>> Signed-off-by: Ryan Roberts
>> ---
>>   include/linux/pgtable.h | 18 ++++++++++++++++++
>>   mm/memory.c             | 20 +++++++++++++-------
>>   2 files changed, 31 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 50f32cccbd92..cba31f177d27 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -212,6 +212,24 @@ static inline int pmd_dirty(pmd_t pmd)
>>   #define arch_flush_lazy_mmu_mode()    do {} while (0)
>>   #endif
>>
>> +#ifndef pte_batch_hint
>> +/**
>> + * pte_batch_hint - Number of pages that can be added to batch without scanning.
>> + * @ptep: Page table pointer for the entry.
>> + * @pte: Page table entry.
>> + *
>> + * Some architectures know that a set of contiguous ptes all map the same
>> + * contiguous memory with the same permissions. In this case, it can provide a
>> + * hint to aid pte batching without the core code needing to scan every pte.
>
> I think we might want to document here the expectation regarding
> dirty/accessed bits. folio_pte_batch() will ignore dirty bits only with
> FPB_IGNORE_DIRTY. But especially for arm64, it makes sense to ignore them
> always when batching, because the dirty bit may target any pte part of the
> cont-pte group either way.
>
> Maybe something like:
>
> "
> An architecture implementation may only ignore the PTE accessed and dirty bits.
> Further, it may only ignore the dirty bit if that bit is already not
> maintained with precision per PTE inside the hinted batch, and ptep_get()
> would already have to collect it from various PTEs.
> "

I'm proposing to simplify this to:

"
An architecture implementation may ignore the PTE accessed state. Further,
the dirty state must apply atomically to all the PTEs described by the hint.
"

Which I think more accurately describes the requirement. Shout if you disagree.

> I think there are some more details to it, but I'm hoping something along
> the lines above is sufficient.
>
>
>> +
>>   #ifndef pte_advance_pfn
>>   static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
>>   {
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 65fbe4f886c1..902665b27702 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -988,16 +988,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>   {
>>       unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>       const pte_t *end_ptep = start_ptep + max_nr;
>> -    pte_t expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, 1), flags);
>> -    pte_t *ptep = start_ptep + 1;
>> +    pte_t expected_pte = __pte_batch_clear_ignored(pte, flags);
>> +    pte_t *ptep = start_ptep;
>>       bool writable;
>> +    int nr;
>>
>>       if (any_writable)
>>           *any_writable = false;
>>
>>       VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>
>> -    while (ptep != end_ptep) {
>> +    nr = pte_batch_hint(ptep, pte);
>> +    expected_pte = pte_advance_pfn(expected_pte, nr);
>> +    ptep += nr;
>> +
>
> *Maybe* it's easier to get when initializing expected_pte+ptep only once.
>
> Like:
>
> [...]
> pte_t expected_pte, *ptep;
> [...]
>
> nr = pte_batch_hint(start_ptep, pte);
> expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
> ptep = start_ptep + nr;
>
>> +    while (ptep < end_ptep) {
>>           pte = ptep_get(ptep);
>>           if (any_writable)
>>               writable = !!pte_write(pte);
>> @@ -1011,17 +1016,18 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>            * corner cases the next PFN might fall into a different
>>            * folio.
>>            */
>> -        if (pte_pfn(pte) == folio_end_pfn)
>> +        if (pte_pfn(pte) >= folio_end_pfn)
>>               break;
>>
>>           if (any_writable)
>>               *any_writable |= writable;
>>
>> -        expected_pte = pte_advance_pfn(expected_pte, 1);
>> -        ptep++;
>> +        nr = pte_batch_hint(ptep, pte);
>> +        expected_pte = pte_advance_pfn(expected_pte, nr);
>> +        ptep += nr;
>>       }
>>
>> -    return ptep - start_ptep;
>> +    return min(ptep - start_ptep, max_nr);
>>   }
>
> Acked-by: David Hildenbrand
>