From: Luiz Capitulino <luizcap@redhat.com>
Date: Wed, 23 Apr 2025 15:18:54 -0400
Subject: Re: [PATCH v4 00/11] Perf improvements for hugetlb and vmalloc on arm64
To: Ryan Roberts, Catalin Marinas, Will Deacon, Pasha Tatashin,
 Andrew Morton, Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
 Matthew Wilcox (Oracle), Mark Rutland, Anshuman Khandual,
 Alexandre Ghiti, Kevin Brodsky
Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
Message-ID: <4d18462f-8bc6-459e-9eeb-09e63559d854@redhat.com>
In-Reply-To: <20250422081822.1836315-1-ryan.roberts@arm.com>
References: <20250422081822.1836315-1-ryan.roberts@arm.com>
On 2025-04-22 04:18, Ryan Roberts wrote:
> Hi All,
> 
> This is v4 of a series to improve performance for hugetlb and vmalloc on
> arm64. Although some of these patches are core-mm, advice from Andrew was to
> go via the arm64 tree. All patches are now acked/reviewed by the relevant
> maintainers, so I believe this should be good to go.
> 
> The 2 key performance improvements are 1) enabling the use of contpte-mapped
> blocks in the vmalloc space when appropriate (which reduces TLB pressure).
> There were already hooks for this (used by powerpc) but they required some
> tidying and extending for arm64. And 2) batching up barriers when modifying
> the vmalloc address space, for up to a 30% reduction in the time taken by
> vmalloc().
> 
> vmalloc() performance was measured using the test_vmalloc.ko module. Tested
> on Apple M2 and Ampere Altra. Each test had its loop count set to 500000 and
> the whole test was repeated 10 times.
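
For anyone who hasn't seen the powerpc side: the hooks referred to above are
arch_vmap_pte_supported_shift() and arch_vmap_pte_range_map_size() from
<asm/vmalloc.h>. As a rough sketch of the idea (an illustration, not the
actual patch; the real hooks land in arch/arm64/include/asm/vmalloc.h per the
diffstat below, and the exact size/alignment checks here are my assumptions),
an arm64 flavour could look like:

  #define arch_vmap_pte_supported_shift arch_vmap_pte_supported_shift
  static inline int arch_vmap_pte_supported_shift(unsigned long size)
  {
          /* A contpte block is CONT_PTE_SIZE, e.g. 16 x 4K pages = 64K */
          if (size >= CONT_PTE_SIZE)
                  return CONT_PTE_SHIFT;
          return PAGE_SHIFT;
  }

  #define arch_vmap_pte_range_map_size arch_vmap_pte_range_map_size
  static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
                          unsigned long end, u64 pfn,
                          unsigned int max_page_shift)
  {
          /* Map a whole contpte block only if range, VA and PA all line up */
          if (max_page_shift >= CONT_PTE_SHIFT &&
              end - addr >= CONT_PTE_SIZE &&
              IS_ALIGNED(addr, CONT_PTE_SIZE) &&
              IS_ALIGNED(PFN_PHYS(pfn), CONT_PTE_SIZE))
                  return CONT_PTE_SIZE;
          return PAGE_SIZE;
  }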
> 
> legend:
> - p: nr_pages (pages to allocate)
> - h: use_huge (vmalloc() vs vmalloc_huge())
> - (I): statistically significant improvement (95% CI does not overlap)
> - (R): statistically significant regression (95% CI does not overlap)
> - measurements are times; smaller is better
> 
> +--------------------------------------------------+-------------+-------------+
> | Benchmark                                        |             |             |
> | Result Class                                     |    Apple M2 | Ampere Altra|
> +==================================================+=============+=============+
> | micromm/vmalloc                                  |             |             |
> | fix_align_alloc_test: p:1, h:0 (usec)            | (I) -11.53% |      -2.57% |
> | fix_size_alloc_test: p:1, h:0 (usec)             |       2.14% |       1.79% |
> | fix_size_alloc_test: p:4, h:0 (usec)             |  (I) -9.93% |  (I) -4.80% |
> | fix_size_alloc_test: p:16, h:0 (usec)            | (I) -25.07% | (I) -14.24% |
> | fix_size_alloc_test: p:16, h:1 (usec)            | (I) -14.07% |   (R) 7.93% |
> | fix_size_alloc_test: p:64, h:0 (usec)            | (I) -29.43% | (I) -19.30% |
> | fix_size_alloc_test: p:64, h:1 (usec)            | (I) -16.39% |   (R) 6.71% |
> | fix_size_alloc_test: p:256, h:0 (usec)           | (I) -31.46% | (I) -20.60% |
> | fix_size_alloc_test: p:256, h:1 (usec)           | (I) -16.58% |   (R) 6.70% |
> | fix_size_alloc_test: p:512, h:0 (usec)           | (I) -31.96% | (I) -20.04% |
> | fix_size_alloc_test: p:512, h:1 (usec)           |       2.30% |       0.71% |
> | full_fit_alloc_test: p:1, h:0 (usec)             |      -2.94% |       1.77% |
> | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0 (usec)   |      -7.75% |       1.71% |
> | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0 (usec)   |      -9.07% |   (R) 2.34% |
> | long_busy_list_alloc_test: p:1, h:0 (usec)       | (I) -29.18% | (I) -17.91% |
> | pcpu_alloc_test: p:1, h:0 (usec)                 |     -14.71% |      -3.14% |
> | random_size_align_alloc_test: p:1, h:0 (usec)    | (I) -11.08% |  (I) -4.62% |
> | random_size_alloc_test: p:1, h:0 (usec)          | (I) -30.25% | (I) -17.95% |
> | vm_map_ram_test: p:1, h:0 (usec)                 |       5.06% |   (R) 6.63% |
> +--------------------------------------------------+-------------+-------------+
> 
> So there are some nice improvements but also some regressions to explain:
> 
> fix_size_alloc_test with h:1 and p:16,64,256 regresses by ~6% on Altra. The
> regression is actually introduced by enabling contpte-mapped 64K blocks in
> these tests, and that regression is reduced (from about 8%, if memory serves)
> by doing the barrier batching. I don't have a definite conclusion on the root
> cause, but I've ruled out the differences in the mapping paths in vmalloc. I
> strongly suspect the difference is in the allocation path; 64K blocks are not
> cached per-cpu so we have to go all the way to the buddy allocator. I'm not
> sure why this doesn't show up on M2, though. Regardless, I'm going to assert
> that a 16x reduction in TLB pressure is worth a 6% hit on the duration of the
> vmalloc() allocation call.

I ran a couple of basic functional tests for HugeTLB with 1G, 2M, 32M and 64k
pages on Ampere Mt. Jade and Jetson AGX Orin systems and didn't hit any
issues, so:

Tested-by: Luiz Capitulino <luizcap@redhat.com>
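
For reference, the barrier batching uses the arch_[enter|leave]_lazy_mmu_mode()
hooks (see the changelog below). Here is a simplified sketch of the shape of
the thing, not the actual patch: TIF_LAZY_MMU_PENDING comes from the changelog,
while TIF_LAZY_MMU and the other details are illustrative assumptions.

  static inline void emit_pte_barriers(void)
  {
          /* Make the batched PTE stores visible, then resync the pipeline */
          dsb(ishst);
          isb();
  }

  static inline void arch_enter_lazy_mmu_mode(void)
  {
          /*
           * From here on, __set_ptes() sets TIF_LAZY_MMU_PENDING instead
           * of issuing dsb/isb on every call.
           */
          set_thread_flag(TIF_LAZY_MMU);
  }

  static inline void arch_leave_lazy_mmu_mode(void)
  {
          /* One barrier pair covers every PTE update made in the batch */
          if (test_and_clear_thread_flag(TIF_LAZY_MMU_PENDING))
                  emit_pte_barriers();
          clear_thread_flag(TIF_LAZY_MMU);
  }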
> 
> Changes since v3 [3]
> ====================
> - Applied R-bs (thanks all!)
> - Renamed set_ptes_anysz() -> __set_ptes_anysz() (Catalin)
> - Renamed ptep_get_and_clear_anysz() -> __ptep_get_and_clear_anysz() (Catalin)
> - Only set TIF_LAZY_MMU_PENDING if not already set, to avoid atomic ops (Catalin)
> - Fixed comment typos (Anshuman)
> - Fixed build warnings when PMD is folded (buildbot)
> - Reverse xmas tree for variables in __page_table_check_p[mu]ds_set() (Pasha)
> 
> Changes since v2 [2]
> ====================
> - Removed the new arch_update_kernel_mappings_[begin|end]() API
> - Switched to arch_[enter|leave]_lazy_mmu_mode() instead for barrier batching
> - Removed clean-up to avoid barriers for invalid or user mappings
> 
> Changes since v1 [1]
> ====================
> - Split out the fixes into their own series
> - Added R-bs from Anshuman - Thanks!
> - Added patch to clean up the methods by which huge_pte size is determined
> - Added "#ifndef __PAGETABLE_PMD_FOLDED" around PUD_SIZE in
>   flush_hugetlb_tlb_range()
> - Renamed ___set_ptes() -> set_ptes_anysz()
> - Renamed ___ptep_get_and_clear() -> ptep_get_and_clear_anysz()
> - Fixed typos in commit logs
> - Refactored pXd_valid_not_user() for better reuse
> - Removed TIF_KMAP_UPDATE_PENDING after concluding that a single flag is
>   sufficient
> - Concluded the extra isb() in __switch_to() is not required
> - Only call arch_update_kernel_mappings_[begin|end]() for kernel mappings
> 
> Applies on top of v6.15-rc3. All mm selftests run and no regressions observed.
> 
> [1] https://lore.kernel.org/all/20250205151003.88959-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/all/20250217140809.1702789-1-ryan.roberts@arm.com/
> [3] https://lore.kernel.org/all/20250304150444.3788920-1-ryan.roberts@arm.com/
> 
> Thanks,
> Ryan
> 
> Ryan Roberts (11):
>   arm64: hugetlb: Cleanup huge_pte size discovery mechanisms
>   arm64: hugetlb: Refine tlb maintenance scope
>   mm/page_table_check: Batch-check pmds/puds just like ptes
>   arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
>   arm64: hugetlb: Use __set_ptes_anysz() and __ptep_get_and_clear_anysz()
>   arm64/mm: Hoist barriers out of set_ptes_anysz() loop
>   mm/vmalloc: Warn on improper use of vunmap_range()
>   mm/vmalloc: Gracefully unmap huge ptes
>   arm64/mm: Support huge pte-mapped pages in vmap
>   mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes
>   arm64/mm: Batch barriers when updating kernel mappings
> 
>  arch/arm64/include/asm/hugetlb.h     |  29 ++--
>  arch/arm64/include/asm/pgtable.h     | 209 +++++++++++++++++++--------
>  arch/arm64/include/asm/thread_info.h |   2 +
>  arch/arm64/include/asm/vmalloc.h     |  45 ++++++
>  arch/arm64/kernel/process.c          |   9 +-
>  arch/arm64/mm/hugetlbpage.c          |  73 ++++------
>  include/linux/page_table_check.h     |  30 ++--
>  include/linux/vmalloc.h              |   8 +
>  mm/page_table_check.c                |  34 +++--
>  mm/vmalloc.c                         |  40 ++++-
>  10 files changed, 329 insertions(+), 150 deletions(-)
> 
> -- 
> 2.43.0