From: Jann Horn <jannh@google.com>
Date: Fri, 3 Jan 2025 18:36:06 +0100
Subject: Re: [PATCH 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes
To: Rik van Riel
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, kernel-team@meta.com, dave.hansen@linux.intel.com, luto@kernel.org, peterz@infradead.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, hpa@zytor.com, akpm@linux-foundation.org, nadav.amit@gmail.com, zhengqi.arch@bytedance.com, linux-mm@kvack.org
In-Reply-To: <20241230175550.4046587-10-riel@surriel.com>
References: <20241230175550.4046587-1-riel@surriel.com> <20241230175550.4046587-10-riel@surriel.com>

On Mon, Dec 30, 2024 at 6:53 PM Rik van Riel wrote:
> Use broadcast TLB invalidation, using the INVLPGB instruction, on AMD EPYC 3
> and newer CPUs.
>
> In order to not exhaust PCID space, and keep TLB flushes local for single
> threaded processes, we only hand out broadcast ASIDs to processes active on
> 3 or more CPUs, and gradually increase the threshold as broadcast ASID space
> is depleted.
[...]
> ---
>  arch/x86/include/asm/mmu.h         |   6 +
>  arch/x86/include/asm/mmu_context.h |  12 ++
>  arch/x86/include/asm/tlbflush.h    |  17 ++
>  arch/x86/mm/tlb.c                  | 310 ++++++++++++++++++++++++++++-
>  4 files changed, 336 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
> index 3b496cdcb74b..a8e8dfa5a520 100644
> --- a/arch/x86/include/asm/mmu.h
> +++ b/arch/x86/include/asm/mmu.h
> @@ -48,6 +48,12 @@ typedef struct {
>         unsigned long flags;
>  #endif
>
> +#ifdef CONFIG_CPU_SUP_AMD
> +       struct list_head broadcast_asid_list;
> +       u16 broadcast_asid;
> +       bool asid_transition;

Please add a comment on the semantics of the "asid_transition" field
here after addressing the comments below.
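(Based on how the rest of this patch uses the field, my understanding
is roughly the following - an untested sketch of what such a comment
could say, to be adjusted once the semantics are settled:)

	/*
	 * asid_transition: when true, this mm has been assigned a
	 * broadcast ASID, but some CPUs may still be running the mm
	 * with a per-CPU dynamic ASID; those CPUs still need IPI-based
	 * flushes until finish_asid_transition() observes that every
	 * CPU running this mm uses the broadcast ASID.
	 */
	bool asid_transition;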
> +#endif
> +
>  #ifdef CONFIG_ADDRESS_MASKING
>         /* Active LAM mode: X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or 0 (disabled) */
>         unsigned long lam_cr3_mask;
[...]
> +#ifdef CONFIG_CPU_SUP_AMD
> +/*
> + * Logic for AMD INVLPGB support.
> + */
> +static DEFINE_RAW_SPINLOCK(broadcast_asid_lock);
> +static u16 last_broadcast_asid = TLB_NR_DYN_ASIDS;

I wonder if this should be set to MAX_ASID_AVAILABLE or such to
ensure that we do a flush before we start using the broadcast ASID
space the first time... Or is there something else that already
guarantees that all ASIDs of the TLB are flushed during kernel boot?

> +static DECLARE_BITMAP(broadcast_asid_used, MAX_ASID_AVAILABLE) = { 0 };
> +static LIST_HEAD(broadcast_asid_list);
> +static int broadcast_asid_available = MAX_ASID_AVAILABLE - TLB_NR_DYN_ASIDS - 1;
> +
> +static void reset_broadcast_asid_space(void)
> +{
> +       mm_context_t *context;
> +
> +       lockdep_assert_held(&broadcast_asid_lock);
> +
> +       /*
> +        * Flush once when we wrap around the ASID space, so we won't need
> +        * to flush every time we allocate an ASID for boradcast flushing.

nit: typoed "broadcast"

> +        */
> +       invlpgb_flush_all_nonglobals();
> +       tlbsync();
> +
> +       /*
> +        * Leave the currently used broadcast ASIDs set in the bitmap, since
> +        * those cannot be reused before the next wraparound and flush..
> +        */
> +       bitmap_clear(broadcast_asid_used, 0, MAX_ASID_AVAILABLE);
> +       list_for_each_entry(context, &broadcast_asid_list, broadcast_asid_list)
> +               __set_bit(context->broadcast_asid, broadcast_asid_used);
> +
> +       last_broadcast_asid = TLB_NR_DYN_ASIDS;
> +}
> +
> +static u16 get_broadcast_asid(void)
> +{
> +       lockdep_assert_held(&broadcast_asid_lock);
> +
> +       do {
> +               u16 start = last_broadcast_asid;
> +               u16 asid = find_next_zero_bit(broadcast_asid_used, MAX_ASID_AVAILABLE, start);
> +
> +               if (asid >= MAX_ASID_AVAILABLE) {
> +                       reset_broadcast_asid_space();
> +                       continue;

Can this loop endlessly without making forward progress if we have a
few thousand processes on the system that are multi-threaded (or used
to be multi-threaded) and race the wrong way?
meets_broadcast_asid_threshold() checks if we have free IDs remaining,
but that check happens before broadcast_asid_lock is held, so we could
theoretically race such that no free IDs are available, right?
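If that can happen, get_broadcast_asid() probably has to be able to
fail, e.g. along these lines (untested sketch; the caller would then
have to leave the mm on per-CPU ASIDs when it gets 0 back, which this
patch never hands out as a broadcast ASID):

	static u16 get_broadcast_asid(void)
	{
		lockdep_assert_held(&broadcast_asid_lock);

		/*
		 * Re-check availability now that the lock is held, so
		 * a racing allocation can't make us spin forever below.
		 */
		if (broadcast_asid_available <= 0)
			return 0;

		do {
		...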
> +               }
> +
> +               /* Try claiming this broadcast ASID. */
> +               if (!test_and_set_bit(asid, broadcast_asid_used)) {
> +                       last_broadcast_asid = asid;
> +                       return asid;
> +               }
> +       } while (1);
> +}
[...]
> +/*
> + * Assign a broadcast ASID to the current process, protecting against
> + * races between multiple threads in the process.
> + */
> +static void use_broadcast_asid(struct mm_struct *mm)
> +{
> +       guard(raw_spinlock_irqsave)(&broadcast_asid_lock);
> +
> +       /* This process is already using broadcast TLB invalidation. */
> +       if (mm->context.broadcast_asid)
> +               return;
> +
> +       mm->context.broadcast_asid = get_broadcast_asid();
> +       mm->context.asid_transition = true;

This looks buggy to me: If we first set mm->context.broadcast_asid
and then later set mm->context.asid_transition, then a
flush_tlb_mm_range() that happens in between will observe
mm_broadcast_asid() being true (meaning broadcast invalidation should
be used) while mm->context.asid_transition is false (meaning broadcast
invalidation alone is sufficient); but actually we haven't even
started transitioning CPUs over to the new ASID yet, so I think the
flush does nothing?

Maybe change how mm->context.asid_transition works such that it is
immediately set on mm creation and cleared when the transition is
done, so that you don't have to touch it here?

Also, please use at least WRITE_ONCE() for writes here, and add
comments documenting ordering requirements.
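If you keep the current scheme, I think the minimal fix is to publish
the two fields in the opposite order, something like (untested sketch;
the read side in flush_tlb_mm_range()/finish_asid_transition() would
need matching ordering, which I think x86 gives us for free):

	/*
	 * Make sure anyone who observes mm_broadcast_asid() != 0
	 * also observes asid_transition == true.
	 */
	WRITE_ONCE(mm->context.asid_transition, true);
	smp_wmb();
	WRITE_ONCE(mm->context.broadcast_asid, get_broadcast_asid());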
> +       list_add(&mm->context.broadcast_asid_list, &broadcast_asid_list);
> +       broadcast_asid_available--;
> +}
[...]
> +static void finish_asid_transition(struct flush_tlb_info *info)
> +{
> +       struct mm_struct *mm = info->mm;
> +       int bc_asid = mm_broadcast_asid(mm);
> +       int cpu;
> +
> +       if (!mm->context.asid_transition)

AFAIU this can be accessed concurrently - please use at least
READ_ONCE(). (I think in the current version of the patch, this needs
to be ordered against the preceding mm_broadcast_asid() read, but
that's implicit on x86, so I guess writing a barrier here would be
superfluous.)

> +               return;
> +
> +       for_each_cpu(cpu, mm_cpumask(mm)) {
> +               if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
> +                       continue;

switch_mm_irqs_off() picks an ASID and writes CR3 before writing
loaded_mm: "/* Make sure we write CR3 before loaded_mm. */"
Can we race with a concurrent switch_mm_irqs_off() on the other CPU
such that the other CPU has already switched CR3 to our MM using the
old ASID, but has not yet written loaded_mm, such that we skip it
here? And then we'll think we finished the ASID transition, and the
next time we do a flush, we'll wrongly omit the flush for that other
CPU even though it's still using the old ASID?

> +
> +               /*
> +                * If at least one CPU is not using the broadcast ASID yet,
> +                * send a TLB flush IPI. The IPI should cause stragglers
> +                * to transition soon.
> +                */
> +               if (per_cpu(cpu_tlbstate.loaded_mm_asid, cpu) != bc_asid) {

READ_ONCE()? Also, I think this needs a comment explaining that this
can race with concurrent MM switches such that we wrongly think that
there's a straggler (because we're not reading the loaded_mm and the
loaded_mm_asid as one atomic combination).

> +                       flush_tlb_multi(mm_cpumask(info->mm), info);
> +                       return;
> +               }
> +       }
> +
> +       /* All the CPUs running this process are using the broadcast ASID. */
> +       mm->context.asid_transition = 0;

WRITE_ONCE()? Also: This is a bool, please use "false".

> +}
> +
> +static void broadcast_tlb_flush(struct flush_tlb_info *info)
> +{
> +       bool pmd = info->stride_shift == PMD_SHIFT;
> +       unsigned long maxnr = invlpgb_count_max;
> +       unsigned long asid = info->mm->context.broadcast_asid;
> +       unsigned long addr = info->start;
> +       unsigned long nr;
> +
> +       /* Flushing multiple pages at once is not supported with 1GB pages. */
> +       if (info->stride_shift > PMD_SHIFT)
> +               maxnr = 1;
> +
> +       if (info->end == TLB_FLUSH_ALL) {
> +               invlpgb_flush_single_pcid(kern_pcid(asid));

What orders this flush with the preceding page table update? Does the
instruction implicitly get ordered after preceding memory writes, or
do we get that ordering from inc_mm_tlb_gen() or something like that?

> +               /* Do any CPUs supporting INVLPGB need PTI? */
> +               if (static_cpu_has(X86_FEATURE_PTI))
> +                       invlpgb_flush_single_pcid(user_pcid(asid));
> +       } else do {
> +               /*
> +                * Calculate how many pages can be flushed at once; if the
> +                * remainder of the range is less than one page, flush one.
> +                */
> +               nr = min(maxnr, (info->end - addr) >> info->stride_shift);
> +               nr = max(nr, 1);
> +
> +               invlpgb_flush_user_nr(kern_pcid(asid), addr, nr, pmd);
> +               /* Do any CPUs supporting INVLPGB need PTI? */
> +               if (static_cpu_has(X86_FEATURE_PTI))
> +                       invlpgb_flush_user_nr(user_pcid(asid), addr, nr, pmd);
> +               addr += nr << info->stride_shift;
> +       } while (addr < info->end);
> +
> +       finish_asid_transition(info);
> +
> +       /* Wait for the INVLPGBs kicked off above to finish. */
> +       tlbsync();
> +}
> +#endif /* CONFIG_CPU_SUP_AMD */
[...]
> @@ -769,6 +1042,16 @@ static void flush_tlb_func(void *info)
>         if (unlikely(loaded_mm == &init_mm))
>                 return;
>
> +       /* Reload the ASID if transitioning into or out of a broadcast ASID */
> +       if (needs_broadcast_asid_reload(loaded_mm, loaded_mm_asid)) {
> +               switch_mm_irqs_off(NULL, loaded_mm, NULL);
> +               loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> +       }
> +
> +       /* Broadcast ASIDs are always kept up to date with INVLPGB. */
> +       if (is_broadcast_asid(loaded_mm_asid))
> +               return;

This relies on the mm_broadcast_asid() read in flush_tlb_mm_range()
being ordered after the page table update, correct? And we get that
required ordering from the inc_mm_tlb_gen(), which implies a full
barrier? It might be nice if there were some more comments on this.

>         VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
>                    loaded_mm->context.ctx_id);
>
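(Re that last point: if my understanding of the ordering is right, the
kind of comment I'm imagining above the is_broadcast_asid() check
would be something like:)

	/*
	 * Ordering: the mm_broadcast_asid() read in flush_tlb_mm_range()
	 * happens after the page table update thanks to the full barrier
	 * implied by the atomic64_inc_return() in inc_mm_tlb_gen().
	 */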