From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ryan Roberts <ryan.roberts@arm.com>
To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier, Oliver Upton,
	James Morse, Suzuki K Poulose, Zenghui Yu, Andrey Ryabinin,
	Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino,
	Andrew Morton, Anshuman Khandual, Matthew Wilcox, Yu Zhao, Mark Rutland,
	David Hildenbrand, Kefeng Wang, John Hubbard, Zi Yan,
	Barry Song <21cnbao@gmail.com>, Alistair Popple, Yang Shi
Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-arm-kernel@lists.infradead.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v3 02/15] mm: Batch-clear PTE ranges during zap_pte_range()
Date: Mon, 4 Dec 2023 10:54:27 +0000
Message-Id: <20231204105440.61448-3-ryan.roberts@arm.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20231204105440.61448-1-ryan.roberts@arm.com>
References: <20231204105440.61448-1-ryan.roberts@arm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Convert zap_pte_range() to clear a set of ptes in a batch. A given batch
maps a physically contiguous block of memory, all belonging to the same
folio. This will likely improve performance by a small amount by removing
duplicate calls to mark the folio dirty and accessed, and it also provides
a future opportunity to batch the rmap removal.

However, the primary motivation for this change is to reduce the number of
TLB maintenance operations that the arm64 backend has to perform during
exit and other syscalls that invoke zap_pte_range() (e.g. munmap,
madvise(DONTNEED), etc.), as it is about to add transparent support for
the "contiguous bit" in its ptes. By clearing ptes with the new
clear_ptes() API, the backend doesn't have to perform an expensive unfold
operation when a pte being cleared is part of a contpte block; instead it
can just clear the whole block immediately.

This change addresses only the core-mm refactoring, and introduces
clear_ptes() with a default implementation that calls
ptep_get_and_clear_full() for each pte in the range. Note that this API
returns the pte at the beginning of the batch, but with the dirty and
young bits set if ANY of the ptes in the cleared batch had those bits set;
this information is applied to the folio by the core-mm. Given the batch
is guaranteed to cover only a single folio, collapsing this state does not
lose any useful information.

A separate change will implement clear_ptes() in the arm64 backend to
realize the performance improvement as part of the work to enable contpte
mappings.
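For illustration only (not part of this patch): the pattern a caller
follows is to clear a whole batch and then apply the collapsed dirty/young
state to the folio once. The helper below is a hypothetical sketch of what
zap_pte_range() does further down, minus the rmap and TLB bookkeeping:

	/*
	 * Hypothetical sketch: clear 'nr' ptes that all map the same folio
	 * and apply the collapsed dirty/young state from the returned pte.
	 */
	static void example_zap_folio_batch(struct mm_struct *mm,
					    struct vm_area_struct *vma,
					    struct folio *folio,
					    unsigned long addr, pte_t *pte,
					    unsigned int nr, int full)
	{
		/* Dirty/young are set if ANY pte in the batch had them set. */
		pte_t ptent = clear_ptes(mm, addr, pte, full, nr);

		if (!folio_test_anon(folio)) {
			if (pte_dirty(ptent))
				folio_mark_dirty(folio);
			if (pte_young(ptent) && vma_has_recency(vma))
				folio_mark_accessed(folio);
		}
	}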
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/asm-generic/tlb.h |  9 ++++++
 include/linux/pgtable.h   | 26 ++++++++++++++++
 mm/memory.c               | 63 ++++++++++++++++++++++++++-------------
 mm/mmu_gather.c           | 14 +++++++++
 4 files changed, 92 insertions(+), 20 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 129a3a759976..b84ba3aa1f6e 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -75,6 +75,9 @@
  * boolean indicating if the queue is (now) full and a call to
  * tlb_flush_mmu() is required.
  *
+ * tlb_get_guaranteed_space() returns the minimum guaranteed number of pages
+ * that can be queued without overflow.
+ *
  * tlb_remove_page() and tlb_remove_page_size() imply the call to
  * tlb_flush_mmu() when required and has no return value.
  *
@@ -263,6 +266,7 @@ struct mmu_gather_batch {
 
 extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
 				   struct encoded_page *page, int page_size);
+extern unsigned int tlb_get_guaranteed_space(struct mmu_gather *tlb);
 
 #ifdef CONFIG_SMP
 /*
@@ -273,6 +277,11 @@ extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
 extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
 #endif
 
+#else
+static inline unsigned int tlb_get_guaranteed_space(struct mmu_gather *tlb)
+{
+	return 1;
+}
 #endif
 
 /*
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 1c50f8a0fdde..e998080eb7ae 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -635,6 +635,32 @@ static inline void ptep_set_wrprotects(struct mm_struct *mm,
 }
 #endif
 
+#ifndef clear_ptes
+struct mm_struct;
+static inline pte_t clear_ptes(struct mm_struct *mm,
+				unsigned long address, pte_t *ptep,
+				int full, unsigned int nr)
+{
+	unsigned int i;
+	pte_t pte;
+	pte_t orig_pte = ptep_get_and_clear_full(mm, address, ptep, full);
+
+	for (i = 1; i < nr; i++) {
+		address += PAGE_SIZE;
+		ptep++;
+		pte = ptep_get_and_clear_full(mm, address, ptep, full);
+
+		if (pte_dirty(pte))
+			orig_pte = pte_mkdirty(orig_pte);
+
+		if (pte_young(pte))
+			orig_pte = pte_mkyoung(orig_pte);
+	}
+
+	return orig_pte;
+}
+#endif
+
 /*
  * On some architectures hardware does not set page access bit when accessing
  * memory page, it is responsibility of software setting this bit. It brings
diff --git a/mm/memory.c b/mm/memory.c
index 8a87a488950c..60f030700a3f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1515,6 +1515,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	pte_t *start_pte;
 	pte_t *pte;
 	swp_entry_t entry;
+	int nr;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	init_rss_vec(rss);
@@ -1527,6 +1528,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	do {
 		pte_t ptent = ptep_get(pte);
 		struct page *page;
+		int i;
+
+		nr = 1;
 
 		if (pte_none(ptent))
 			continue;
@@ -1535,45 +1539,64 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			break;
 
 		if (pte_present(ptent)) {
-			unsigned int delay_rmap;
+			unsigned int delay_rmap = 0;
+			bool tlb_full = false;
+			struct folio *folio = NULL;
 
 			page = vm_normal_page(vma, addr, ptent);
 			if (unlikely(!should_zap_page(details, page)))
 				continue;
-			ptent = ptep_get_and_clear_full(mm, addr, pte,
-							tlb->fullmm);
+
+			if (likely(page)) {
+				folio = page_folio(page);
+				nr = folio_nr_pages_cont_mapped(folio, page,
+								pte, addr, end,
+								ptent, true, &i, &i);
+				nr = min_t(int, nr, tlb_get_guaranteed_space(tlb));
+			}
+
+			ptent = clear_ptes(mm, addr, pte, tlb->fullmm, nr);
 			arch_check_zapped_pte(vma, ptent);
-			tlb_remove_tlb_entry(tlb, pte, addr);
-			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
-						      ptent);
+
+			for (i = 0; i < nr; i++) {
+				unsigned long subaddr = addr + PAGE_SIZE * i;
+
+				tlb_remove_tlb_entry(tlb, &pte[i], subaddr);
+				zap_install_uffd_wp_if_needed(vma, subaddr,
+							      &pte[i], details, ptent);
+			}
 			if (unlikely(!page)) {
 				ksm_might_unmap_zero_page(mm, ptent);
 				continue;
 			}
 
-			delay_rmap = 0;
-			if (!PageAnon(page)) {
+			if (!folio_test_anon(folio)) {
 				if (pte_dirty(ptent)) {
-					set_page_dirty(page);
+					folio_mark_dirty(folio);
 					if (tlb_delay_rmap(tlb)) {
 						delay_rmap = 1;
 						force_flush = 1;
 					}
 				}
 				if (pte_young(ptent) && likely(vma_has_recency(vma)))
-					mark_page_accessed(page);
+					folio_mark_accessed(folio);
 			}
-			rss[mm_counter(page)]--;
-			if (!delay_rmap) {
-				page_remove_rmap(page, vma, false);
-				if (unlikely(page_mapcount(page) < 0))
-					print_bad_pte(vma, addr, ptent, page);
+			for (i = 0; i < nr; i++, page++) {
+				rss[mm_counter(page)]--;
+				if (!delay_rmap) {
+					page_remove_rmap(page, vma, false);
+					if (unlikely(page_mapcount(page) < 0))
+						print_bad_pte(vma, addr, ptent, page);
+				}
+				if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
+					tlb_full = true;
+					force_flush = 1;
+					addr += PAGE_SIZE * (i + 1);
+					break;
+				}
 			}
-			if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
-				force_flush = 1;
-				addr += PAGE_SIZE;
+			if (unlikely(tlb_full))
 				break;
-			}
 			continue;
 		}
 
@@ -1624,7 +1647,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		}
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 		zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
-	} while (pte++, addr += PAGE_SIZE, addr != end);
+	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
 
 	add_mm_rss_vec(mm, rss);
 	arch_leave_lazy_mmu_mode();
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 4f559f4ddd21..57b4d5f0dfa4 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -47,6 +47,20 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
 	return true;
 }
 
+unsigned int tlb_get_guaranteed_space(struct mmu_gather *tlb)
+{
+	struct mmu_gather_batch *batch = tlb->active;
+	unsigned int nr_next = 0;
+
+	/* Allocate next batch so we can guarantee at least one batch. */
+	if (tlb_next_batch(tlb)) {
+		tlb->active = batch;
+		nr_next = batch->next->max;
+	}
+
+	return batch->max - batch->nr + nr_next;
+}
+
 #ifdef CONFIG_SMP
 static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch, struct vm_area_struct *vma)
 {
-- 
2.25.1