Subject: Re: [PATCH v4 00/11] Perf improvements for hugetlb and vmalloc on arm64
From: Ryan Roberts <ryan.roberts@arm.com>
Date: Thu, 8 May 2025 15:00:42 +0100
Message-ID: <9689bab0-fe44-40e4-a24d-72b778a521e6@arm.com>
In-Reply-To: <20250422081822.1836315-1-ryan.roberts@arm.com>
References: <20250422081822.1836315-1-ryan.roberts@arm.com>
To: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
    Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
    "Matthew Wilcox (Oracle)", Mark Rutland, Anshuman Khandual,
    Alexandre Ghiti, Kevin Brodsky
Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org

Hi Will,

Just a bump on this; I believe it's had review by all the relevant folks
(and has the R-b tags). I was hoping to get this into v6.16, but I'm getting
nervous that time is running out to soak it in linux-next. Any chance you
could consider pulling?

Thanks,
Ryan

On 22/04/2025 09:18, Ryan Roberts wrote:
> Hi All,
>
> This is v4 of a series to improve performance for hugetlb and vmalloc on
> arm64. Although some of these patches are core-mm, advice from Andrew was
> to go via the arm64 tree. All patches are now acked/reviewed by the
> relevant maintainers, so I believe this should be good to go.
>
> The 2 key performance improvements are: 1) enabling the use of
> contpte-mapped blocks in the vmalloc space when appropriate, which reduces
> TLB pressure. There were already hooks for this (used by powerpc) but they
> required some tidying and extending for arm64. And 2) batching up barriers
> when modifying the vmalloc address space, for up to a 30% reduction in the
> time taken in vmalloc().
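>
> As a rough illustration of (2): with the series applied, a caller that
> updates a run of kernel ptes inside a lazy MMU section pays for the
> dsb/isb once rather than per pte. A minimal sketch of the pattern (the
> wrapper function below is hypothetical; see the actual patches for the
> real arm64 hooks):
>
>   /* Hypothetical kernel-context sketch: batch kernel pte updates. */
>   static void set_kernel_ptes_batched(pte_t *ptep, pte_t pte, int nr)
>   {
>           int i;
>
>           arch_enter_lazy_mmu_mode();       /* start deferring barriers */
>           for (i = 0; i < nr; i++)          /* pfn advance elided */
>                   __set_pte(ptep + i, pte); /* no dsb/isb per pte */
>           arch_leave_lazy_mmu_mode();       /* one dsb(ishst) + isb */
>   }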
>
> vmalloc() performance was measured using the test_vmalloc.ko module.
> Tested on Apple M2 and Ampere Altra. Each test had its loop count set to
> 500000 and the whole test was repeated 10 times.
>
> legend:
> - p: nr_pages (pages to allocate)
> - h: use_huge (vmalloc() vs vmalloc_huge())
> - (I): statistically significant improvement (95% CI does not overlap)
> - (R): statistically significant regression (95% CI does not overlap)
> - measurements are times; smaller is better
>
> +--------------------------------------------------+-------------+--------------+
> | Benchmark                                        |             |              |
> | Result Class                                     | Apple M2    | Ampere Altra |
> +==================================================+=============+==============+
> | micromm/vmalloc                                  |             |              |
> | fix_align_alloc_test: p:1, h:0 (usec)            | (I) -11.53% | -2.57%       |
> | fix_size_alloc_test: p:1, h:0 (usec)             | 2.14%       | 1.79%        |
> | fix_size_alloc_test: p:4, h:0 (usec)             | (I) -9.93%  | (I) -4.80%   |
> | fix_size_alloc_test: p:16, h:0 (usec)            | (I) -25.07% | (I) -14.24%  |
> | fix_size_alloc_test: p:16, h:1 (usec)            | (I) -14.07% | (R) 7.93%    |
> | fix_size_alloc_test: p:64, h:0 (usec)            | (I) -29.43% | (I) -19.30%  |
> | fix_size_alloc_test: p:64, h:1 (usec)            | (I) -16.39% | (R) 6.71%    |
> | fix_size_alloc_test: p:256, h:0 (usec)           | (I) -31.46% | (I) -20.60%  |
> | fix_size_alloc_test: p:256, h:1 (usec)           | (I) -16.58% | (R) 6.70%    |
> | fix_size_alloc_test: p:512, h:0 (usec)           | (I) -31.96% | (I) -20.04%  |
> | fix_size_alloc_test: p:512, h:1 (usec)           | 2.30%       | 0.71%        |
> | full_fit_alloc_test: p:1, h:0 (usec)             | -2.94%      | 1.77%        |
> | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0 (usec)   | -7.75%      | 1.71%        |
> | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0 (usec)   | -9.07%      | (R) 2.34%    |
> | long_busy_list_alloc_test: p:1, h:0 (usec)       | (I) -29.18% | (I) -17.91%  |
> | pcpu_alloc_test: p:1, h:0 (usec)                 | -14.71%     | -3.14%       |
> | random_size_align_alloc_test: p:1, h:0 (usec)    | (I) -11.08% | (I) -4.62%   |
> | random_size_alloc_test: p:1, h:0 (usec)          | (I) -30.25% | (I) -17.95%  |
> | vm_map_ram_test: p:1, h:0 (usec)                 | 5.06%       | (R) 6.63%    |
> +--------------------------------------------------+-------------+--------------+
>
> So there are some nice improvements but also some regressions to explain:
>
> fix_size_alloc_test with h:1 and p:16,64,256 regresses by ~6% on Altra.
> The regression is actually introduced by enabling contpte-mapped 64K
> blocks in these tests, and it is reduced (from about 8%, if memory serves)
> by the barrier batching. I don't have a definite conclusion on the root
> cause, but I've ruled out differences in the mapping paths in vmalloc. I
> strongly suspect the difference is in the allocation path: 64K blocks are
> not cached per-cpu, so we have to go all the way to the buddy allocator.
> I'm not sure why this doesn't show up on M2, though. Regardless, I'd
> assert that it's better to choose a 16x reduction in TLB pressure over a
> 6% hit to the vmalloc() allocation call duration.
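>
> For context on how the contpte-mapped blocks get selected: vmalloc asks
> the arch for the supported mapping shift via the pre-existing
> arch_vmap_pte_supported_shift() hook. A plausible sketch of its arm64
> shape (illustrative only, not the actual patch):
>
>   #define arch_vmap_pte_supported_shift arch_vmap_pte_supported_shift
>   static inline int arch_vmap_pte_supported_shift(unsigned long size)
>   {
>           /* Large enough for a contiguous-pte block (64K with 4K pages)? */
>           if (size >= CONT_PTE_SIZE)
>                   return CONT_PTE_SHIFT;
>
>           return PAGE_SHIFT;
>   }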
>
> Changes since v3 [3]
> ====================
> - Applied R-bs (thanks all!)
> - Renamed set_ptes_anysz() -> __set_ptes_anysz() (Catalin)
> - Renamed ptep_get_and_clear_anysz() -> __ptep_get_and_clear_anysz()
>   (Catalin)
> - Only set TIF_LAZY_MMU_PENDING if not already set, to avoid atomic ops
>   (Catalin)
> - Fixed comment typos (Anshuman)
> - Fixed build warnings when PMD is folded (buildbot)
> - Reverse xmas tree ordering for variables in
>   __page_table_check_p[mu]ds_set() (Pasha)
>
> Changes since v2 [2]
> ====================
> - Removed the new arch_update_kernel_mappings_[begin|end]() API
> - Switched to arch_[enter|leave]_lazy_mmu_mode() instead for barrier
>   batching
> - Removed clean-up to avoid barriers for invalid or user mappings
>
> Changes since v1 [1]
> ====================
> - Split out the fixes into their own series
> - Added R-bs from Anshuman - thanks!
> - Added a patch to clean up the methods by which huge_pte size is
>   determined
> - Added "#ifndef __PAGETABLE_PMD_FOLDED" around PUD_SIZE in
>   flush_hugetlb_tlb_range()
> - Renamed ___set_ptes() -> set_ptes_anysz()
> - Renamed ___ptep_get_and_clear() -> ptep_get_and_clear_anysz()
> - Fixed typos in commit logs
> - Refactored pXd_valid_not_user() for better reuse
> - Removed TIF_KMAP_UPDATE_PENDING after concluding that a single flag is
>   sufficient
> - Concluded the extra isb() in __switch_to() is not required
> - Only call arch_update_kernel_mappings_[begin|end]() for kernel mappings
>
> Applies on top of v6.15-rc3. All mm selftests run and no regressions
> observed.
>
> [1] https://lore.kernel.org/all/20250205151003.88959-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/all/20250217140809.1702789-1-ryan.roberts@arm.com/
> [3] https://lore.kernel.org/all/20250304150444.3788920-1-ryan.roberts@arm.com/
>
> Thanks,
> Ryan
>
> Ryan Roberts (11):
>   arm64: hugetlb: Cleanup huge_pte size discovery mechanisms
>   arm64: hugetlb: Refine tlb maintenance scope
>   mm/page_table_check: Batch-check pmds/puds just like ptes
>   arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
>   arm64: hugetlb: Use __set_ptes_anysz() and
>     __ptep_get_and_clear_anysz()
>   arm64/mm: Hoist barriers out of set_ptes_anysz() loop
>   mm/vmalloc: Warn on improper use of vunmap_range()
>   mm/vmalloc: Gracefully unmap huge ptes
>   arm64/mm: Support huge pte-mapped pages in vmap
>   mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes
>   arm64/mm: Batch barriers when updating kernel mappings
>
>  arch/arm64/include/asm/hugetlb.h     |  29 ++--
>  arch/arm64/include/asm/pgtable.h     | 209 +++++++++++++++++++--------
>  arch/arm64/include/asm/thread_info.h |   2 +
>  arch/arm64/include/asm/vmalloc.h     |  45 ++++++
>  arch/arm64/kernel/process.c          |   9 +-
>  arch/arm64/mm/hugetlbpage.c          |  73 ++++------
>  include/linux/page_table_check.h     |  30 ++--
>  include/linux/vmalloc.h              |   8 +
>  mm/page_table_check.c                |  34 +++--
>  mm/vmalloc.c                         |  40 ++++-
>  10 files changed, 329 insertions(+), 150 deletions(-)
>
> --
> 2.43.0
>