Date: Mon, 3 Mar 2025 11:57:00 +0100
From: Borislav Petkov
To: Rik van Riel
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, peterz@infradead.org,
	dave.hansen@linux.intel.com, zhengqi.arch@bytedance.com,
	nadav.amit@gmail.com, thomas.lendacky@amd.com, kernel-team@meta.com,
	linux-mm@kvack.org, akpm@linux-foundation.org, jackmanb@google.com,
	jannh@google.com, mhklinux@outlook.com, andrew.cooper3@citrix.com,
	Manali.Shukla@amd.com, mingo@kernel.org
Subject: Re: [PATCH v14 10/13] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
Message-ID: <20250303105700.GAZ8WK_Bkq_r6lBNVc@fat_crate.local>
References: <20250226030129.530345-1-riel@surriel.com>
	<20250226030129.530345-11-riel@surriel.com>
In-Reply-To: <20250226030129.530345-11-riel@surriel.com>

On Tue, Feb 25, 2025 at 10:00:45PM -0500, Rik van Riel wrote:
> +/*
> + * x86 has 4k ASIDs (2k when compiled with KPTI), but the largest
> + * x86 systems have over 8k CPUs. Because of this potential ASID
> + * shortage, global ASIDs are handed out to processes that have
> + * frequent TLB flushes and are active on 4 or more CPUs simultaneously.
> + */
> +static void consider_global_asid(struct mm_struct *mm)
> +{
> +	if (!static_cpu_has(X86_FEATURE_INVLPGB))
> +		return;
> +
> +	/* Check every once in a while. */
> +	if ((current->pid & 0x1f) != (jiffies & 0x1f))
> +		return;

Uff, this looks funky.

> +
> +	if (!READ_ONCE(global_asid_available))
> +		return;

use_global_asid() will do that check for us already and it'll even warn.

> +
> +	/*
> +	 * Assign a global ASID if the process is active on
> +	 * 4 or more CPUs simultaneously.
> +	 */
> +	if (mm_active_cpus_exceeds(mm, 3))
> +		use_global_asid(mm);
> +}
> +

...

> +static void broadcast_tlb_flush(struct flush_tlb_info *info)
> +{
> +	bool pmd = info->stride_shift == PMD_SHIFT;
> +	unsigned long asid = mm_global_asid(info->mm);
> +	unsigned long addr = info->start;
> +
> +	/*
> +	 * TLB flushes with INVLPGB are kicked off asynchronously.
> +	 * The inc_mm_tlb_gen() guarantees page table updates are done
> +	 * before these TLB flushes happen.
> +	 */
> +	if (info->end == TLB_FLUSH_ALL) {
> +		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
> +		/* Do any CPUs supporting INVLPGB need PTI? */

I hope not. :)

However, I think one can force-enable PTI on AMD so yeah, let's keep that.

...

Final result:

From: Rik van Riel
Date: Tue, 25 Feb 2025 22:00:45 -0500
Subject: [PATCH] x86/mm: Enable broadcast TLB invalidation for multi-threaded processes

There is not enough room in the 12-bit ASID address space to hand out
broadcast ASIDs to every process. Only hand out broadcast ASIDs to
processes when they are observed to be simultaneously running on 4 or
more CPUs.

This also allows single-threaded processes to continue using the cheaper,
local TLB invalidation instructions like INVLPG.

Due to the structure of flush_tlb_mm_range(), the INVLPGB flushing is done in
a generically named broadcast_tlb_flush() function which can later also be
used for Intel RAR.

Combined with the removal of unnecessary lru_add_drain() calls (see
https://lore.kernel.org/r/20241219153253.3da9e8aa@fangorn) this results in a
nice performance boost for the will-it-scale tlb_flush2_threads test on an AMD
Milan system with 36 cores:

  - vanilla kernel:           527k loops/second
  - lru_add_drain removal:    731k loops/second
  - only INVLPGB:             527k loops/second
  - lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that while TLB invalidation
went down from 40% of the total CPU time to only around 4% of CPU time, the
contention simply moved to the LRU lock.

Fixing both at the same time about doubles the number of iterations per second
in this case.

Comparing will-it-scale tlb_flush2_threads with several different numbers of
threads on a 72 CPU AMD Milan shows similar results. The number represents the
total number of loops per second across all the threads:

  threads     tip     INVLPGB

   1          315k     304k
   2          423k     424k
   4          644k    1032k
   8          652k    1267k
  16          737k    1368k
  32          759k    1199k
  64          636k    1094k
  72          609k     993k

1 and 2 thread performance is similar with and without INVLPGB, because
INVLPGB is only used on processes using 4 or more CPUs simultaneously.

The number is the median across 5 runs.

Some numbers closer to real world performance can be found at Phoronix, thanks
to Michael:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

  [ bp:
   - Massage
   - :%s/\/cpu_feature_enabled/cgi
   - :%s/\/mm_clear_asid_transition/cgi
   ]

Signed-off-by: Rik van Riel
Signed-off-by: Borislav Petkov (AMD)
Reviewed-by: Nadav Amit
Link: https://lore.kernel.org/r/20250226030129.530345-11-riel@surriel.com
---
 arch/x86/include/asm/tlbflush.h |   5 ++
 arch/x86/mm/tlb.c               | 104 +++++++++++++++++++++++++++++++-
 2 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index e6c3be06dd21..8c21030269ff 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -280,6 +280,11 @@ static inline void mm_assign_global_asid(struct mm_struct *mm, u16 asid)
 	smp_store_release(&mm->context.global_asid, asid);
 }
 
+static inline void mm_clear_asid_transition(struct mm_struct *mm)
+{
+	WRITE_ONCE(mm->context.asid_transition, false);
+}
+
 static inline bool mm_in_asid_transition(struct mm_struct *mm)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b5681e6f2333..0efd99053c09 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -430,6 +430,105 @@ static bool mm_needs_global_asid(struct mm_struct *mm, u16 asid)
 	return false;
 }
 
+/*
+ * x86 has 4k ASIDs (2k when compiled with KPTI), but the largest x86
+ * systems have over 8k CPUs. Because of this potential ASID shortage,
+ * global ASIDs are handed out to processes that have frequent TLB
+ * flushes and are active on 4 or more CPUs simultaneously.
+ */
+static void consider_global_asid(struct mm_struct *mm)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return;
+
+	/* Check every once in a while. */
+	if ((current->pid & 0x1f) != (jiffies & 0x1f))
+		return;
+
+	/*
+	 * Assign a global ASID if the process is active on
+	 * 4 or more CPUs simultaneously.
+	 */
+	if (mm_active_cpus_exceeds(mm, 3))
+		use_global_asid(mm);
+}
+
+static void finish_asid_transition(struct flush_tlb_info *info)
+{
+	struct mm_struct *mm = info->mm;
+	int bc_asid = mm_global_asid(mm);
+	int cpu;
+
+	if (!mm_in_asid_transition(mm))
+		return;
+
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		/*
+		 * The remote CPU is context switching. Wait for that to
+		 * finish, to catch the unlikely case of it switching to
+		 * the target mm with an out of date ASID.
+		 */
+		while (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == LOADED_MM_SWITCHING)
+			cpu_relax();
+
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
+			continue;
+
+		/*
+		 * If at least one CPU is not using the global ASID yet,
+		 * send a TLB flush IPI. The IPI should cause stragglers
+		 * to transition soon.
+		 *
+		 * This can race with the CPU switching to another task;
+		 * that results in a (harmless) extra IPI.
+		 */
+		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) != bc_asid) {
+			flush_tlb_multi(mm_cpumask(info->mm), info);
+			return;
+		}
+	}
+
+	/* All the CPUs running this process are using the global ASID. */
+	mm_clear_asid_transition(mm);
+}
+
+static void broadcast_tlb_flush(struct flush_tlb_info *info)
+{
+	bool pmd = info->stride_shift == PMD_SHIFT;
+	unsigned long asid = mm_global_asid(info->mm);
+	unsigned long addr = info->start;
+
+	/*
+	 * TLB flushes with INVLPGB are kicked off asynchronously.
+	 * The inc_mm_tlb_gen() guarantees page table updates are done
+	 * before these TLB flushes happen.
+	 */
+	if (info->end == TLB_FLUSH_ALL) {
+		invlpgb_flush_single_pcid_nosync(kern_pcid(asid));
+		/* Do any CPUs supporting INVLPGB need PTI? */
+		if (cpu_feature_enabled(X86_FEATURE_PTI))
+			invlpgb_flush_single_pcid_nosync(user_pcid(asid));
+	} else do {
+		unsigned long nr = 1;
+
+		if (info->stride_shift <= PMD_SHIFT) {
+			nr = (info->end - addr) >> info->stride_shift;
+			nr = clamp_val(nr, 1, invlpgb_count_max);
+		}
+
+		invlpgb_flush_user_nr_nosync(kern_pcid(asid), addr, nr, pmd);
+		if (cpu_feature_enabled(X86_FEATURE_PTI))
+			invlpgb_flush_user_nr_nosync(user_pcid(asid), addr, nr, pmd);
+
+		addr += nr << info->stride_shift;
+	} while (addr < info->end);
+
+	finish_asid_transition(info);
+
+	/* Wait for the INVLPGBs kicked off above to finish. */
+	__tlbsync();
+}
+
 /*
  * Given an ASID, flush the corresponding user ASID.  We can delay this
  * until the next time we switch to it.
@@ -1260,9 +1359,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+	if (mm_global_asid(mm)) {
+		broadcast_tlb_flush(info);
+	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
 		info->trim_cpumask = should_trim_cpumask(mm);
 		flush_tlb_multi(mm_cpumask(mm), info);
+		consider_global_asid(mm);
 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-- 
2.43.0

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette