Hello.

On Friday, 21 February 2025, 1:52:59 Central European Standard Time, Rik van Riel wrote:
> Add support for broadcast TLB invalidation using AMD's INVLPGB instruction.
>
> This allows the kernel to invalidate TLB entries on remote CPUs without
> needing to send IPIs, without having to wait for remote CPUs to handle
> those interrupts, and with less interruption to what was running on
> those CPUs.
>
> Because x86 PCID space is limited, and there are some very large
> systems out there, broadcast TLB invalidation is only used for
> processes that are active on 3 or more CPUs, with the threshold
> being gradually increased the more the PCID space gets exhausted.
>
> Combined with the removal of unnecessary lru_add_drain calls
> (see https://lkml.org/lkml/2024/12/19/1388) this results in a
> nice performance boost for the will-it-scale tlb_flush2_threads
> test on an AMD Milan system with 36 cores:
>
> - vanilla kernel: 527k loops/second
> - lru_add_drain removal: 731k loops/second
> - only INVLPGB: 527k loops/second
> - lru_add_drain + INVLPGB: 1157k loops/second
>
> Profiling with only the INVLPGB changes showed while
> TLB invalidation went down from 40% of the total CPU
> time to only around 4% of CPU time, the contention
> simply moved to the LRU lock.
>
> Fixing both at the same time about doubles the
> number of iterations per second from this case.
>
> Some numbers closer to real world performance
> can be found at Phoronix, thanks to Michael:
>
> https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits
>
> My current plan is to implement support for Intel's RAR
> (Remote Action Request) TLB flushing in a follow-up series,
> after this thing has been merged into -tip. Making things
> any larger would just be unwieldy for reviewers.
>
> v12:
> - make sure "nopcid" command line option turns off invlpgb (Brendan)
> - add "noinvlpgb" kernel command line option
> - split out kernel TLB flushing differently (Dave & Yosry)
> - split up the patch that does invlpgb flushing for user processes (Dave)
> - clean up get_flush_tlb_info (Boris)
> - move invlpgb_count_max initialization to get_cpu_cap (Boris)
> - bunch more comments as requested

Somehow, this iteration breaks resume from S3. I can see it even in a QEMU VM:

```
[ 24.373391] ACPI: PM: Low-level resume complete
[ 24.373929] ACPI: PM: Restoring platform NVS memory
[ 24.375024] Enabling non-boot CPUs ...
[ 24.375777] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 24.376463] BUG: unable to handle page fault for address: ffffffffa3ba4d60
[ 24.377383] #PF: supervisor write access in kernel mode
[ 24.377912] #PF: error_code(0x0003) - permissions violation
[ 24.378413] PGD 25427067 P4D 25427067 PUD 25428063 PMD 8000000024c001a1
[ 24.379020] Oops: Oops: 0003 [#1] PREEMPT SMP NOPTI
[ 24.379503] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Kdump: loaded Not tainted 6.14.0-pf0 #1 161e4891fb5044b2d7438cd1852eeaac0cdffab5
[ 24.380650] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[ 24.381400] RIP: 0010:get_cpu_cap+0x39b/0x4f0
[ 24.381810] Code: 08 c7 44 24 08 00 00 00 00 48 8d 4c 24 0c e8 3c 00 04 00 90 8b 44 24 04 89 43 64 0f b7 44 24 0c 83 c0 01 81 7b 24 09 00 00 80 <66> 89 05 0e ab 8b 01 0f 86 18 fd ff ff c7 44 24 14 00 00 00 00 4c
[ 24.383629] RSP: 0000:ffffafbec00efe70 EFLAGS: 00010012
[ 24.384155] RAX: 0000000000000001 RBX: ffff8b3fbcb19020 RCX: 0000000000001001
[ 24.384862] RDX: 0000000000000000 RSI: ffffafbec00efe74 RDI: ffffafbec00efe78
[ 24.385603] RBP: ffffafbec00efe88 R08: ffffafbec00efe70 R09: ffffafbec00efe7c
[ 24.386318] R10: 0000000000002430 R11: ffff8b3fa5428000 R12: ffffafbec00efe8c
[ 24.387014] R13: ffffafbec00efe84 R14: ffffafbec00efe80 R15: ffffafbec00efe70
[ 24.387713] FS: 0000000000000000(0000) GS:ffff8b3fbcb00000(0000) knlGS:0000000000000000
[ 24.388502] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 24.389074] CR2: ffffffffa3ba4d60 CR3: 0000000025422000 CR4: 0000000000350ef0
[ 24.389769] Call Trace:
[ 24.390020]
[ 24.392234] identify_cpu+0xd4/0x890
[ 24.392593] identify_secondary_cpu+0x12/0x40
[ 24.393032] smp_store_cpu_info+0x49/0x60
[ 24.393430] start_secondary+0x7f/0x140
[ 24.393810] common_startup_64+0x13e/0x141
[ 24.394218]

$ scripts/faddr2line arch/x86/kernel/cpu/common.o get_cpu_cap+0x39b
get_cpu_cap+0x39b/0x500:

get_cpu_cap at …/arch/x86/kernel/cpu/common.c:1063
 1060         if (c->extended_cpuid_level >= 0x80000008) {
 1061                 cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
 1062                 c->x86_capability[CPUID_8000_0008_EBX] = ebx;
 1063                 invlpgb_count_max = (edx & 0xffff) + 1;
 1064         }
```

Any idea what I'm looking at?

Thank you.

> v11:
> - resolve conflict with CONFIG_PT_RECLAIM code
> - a few more cleanups (Peter, Brendan, Nadav)
> v10:
> - simplify partial pages with min(nr, 1) in the invlpgb loop (Peter)
> - document x86 paravirt, AMD invlpgb, and ARM64 flush without IPI (Brendan)
> - remove IS_ENABLED(CONFIG_X86_BROADCAST_TLB_FLUSH) (Brendan)
> - various cleanups (Brendan)
> v9:
> - print warning when start or end address was rounded (Peter)
> - in the reclaim code, tlbsync at context switch time (Peter)
> - fix !CONFIG_CPU_SUP_AMD compile error in arch_tlbbatch_add_pending (Jan)
> v8:
> - round start & end to handle non-page-aligned callers (Steven & Jan)
> - fix up changelog & add tested-by tags (Manali)
> v7:
> - a few small code cleanups (Nadav)
> - fix spurious VM_WARN_ON_ONCE in mm_global_asid
> - code simplifications & better barriers (Peter & Dave)
> v6:
> - fix info->end check in flush_tlb_kernel_range (Michael)
> - disable broadcast TLB flushing on 32 bit x86
> v5:
> - use byte assembly for compatibility with older toolchains (Borislav, Michael)
> - ensure a panic on an invalid number of extra pages (Dave, Tom)
> - add cant_migrate() assertion to
>   tlbsync (Jann)
> - a bunch more cleanups (Nadav)
> - key TCE enabling off X86_FEATURE_TCE (Andrew)
> - fix a race between reclaim and ASID transition (Jann)
> v4:
> - Use only bitmaps to track free global ASIDs (Nadav)
> - Improved AMD initialization (Borislav & Tom)
> - Various naming and documentation improvements (Peter, Nadav, Tom, Dave)
> - Fixes for subtle race conditions (Jann)
> v3:
> - Remove paravirt tlb_remove_table call (thank you Qi Zheng)
> - More suggested cleanups and changelog fixes by Peter and Nadav
> v2:
> - Apply suggestions by Peter and Borislav (thank you!)
> - Fix bug in arch_tlbbatch_flush, where we need to do both
>   the TLBSYNC, and flush the CPUs that are in the cpumask.
> - Some updates to comments and changelogs based on questions.
>

--
Oleksandr Natalenko, MSE