From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 52B3FC02181 for ; Mon, 20 Jan 2025 20:05:03 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D24CA280003; Mon, 20 Jan 2025 15:05:02 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id CD4E3280001; Mon, 20 Jan 2025 15:05:02 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B7445280003; Mon, 20 Jan 2025 15:05:02 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 965BE280001 for ; Mon, 20 Jan 2025 15:05:02 -0500 (EST) Received: from smtpin02.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 3DDDBA035C for ; Mon, 20 Jan 2025 20:05:02 +0000 (UTC) X-FDA: 83028908844.02.D954EA7 Received: from mail-ed1-f49.google.com (mail-ed1-f49.google.com [209.85.208.49]) by imf21.hostedemail.com (Postfix) with ESMTP id 3C0871C0015 for ; Mon, 20 Jan 2025 20:04:59 +0000 (UTC) Authentication-Results: imf21.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Asi80NXm; spf=pass (imf21.hostedemail.com: domain of nadav.amit@gmail.com designates 209.85.208.49 as permitted sender) smtp.mailfrom=nadav.amit@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1737403500; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=/3zA6/rhJygiXhlmQObZDKbmJPf5h9J80NCk24OJ8UQ=; b=7UfwvE8l7unQQkFEy6ukXu84ODAlToTRPT53DQcjjmhe2wFM7ACVudzat4kU29+7Wc4/Ce n9DfNbn0DfmrTv0GKyaRmv7r8gpTe+0++DMhR+qgiLnL0swS+ZKfEXQiCtlnugcaMo6wPQ MAIPm14mJIVc0bsLNya+IzfoS1jRlXE= ARC-Authentication-Results: i=1; imf21.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Asi80NXm; spf=pass (imf21.hostedemail.com: domain of nadav.amit@gmail.com designates 209.85.208.49 as permitted sender) smtp.mailfrom=nadav.amit@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1737403500; a=rsa-sha256; cv=none; b=SJzuN/97gZV7ua35kz+ZNCUqpWPIU6ZEwVU2aXC7yDbxBoX8nLgQnkyIVJ1gmDn8gTuxkc elzu9Q3MQT1jaudqzjUwiKu4dT1x2+UoJotOa6i2MFysWMX7S0QM8NjspcjyUYCFUlY+NW DRiA6s2JLxq+LA4oz+B2V3sWfnNm2us= Received: by mail-ed1-f49.google.com with SMTP id 4fb4d7f45d1cf-5d96944401dso8704440a12.0 for ; Mon, 20 Jan 2025 12:04:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1737403499; x=1738008299; darn=kvack.org; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=/3zA6/rhJygiXhlmQObZDKbmJPf5h9J80NCk24OJ8UQ=; b=Asi80NXmkPvWpoIMUzrfN+cVTbXoBzMB/yAtiWntwaA55AMxfti9MUkiQLjaeTHlLx 8/iN6u50NkU4xudhOdQDUo30+JyA8eBKOWeXyCmMxWvQZYQejyfOnL2Anj6jglfMdlJq eL9MccekBCsvBXhGkpP9Sl4iSWnYbk/z042uuW4P4G587MI/mXuTlNOfyDvIcgu8LY0C zq6c/YnlY9TkkPGN6nfN2kIkJMyf7tVfVBOUNoVY1LdVqe+AP645K1xqfTi/pmqxIFNG dLujh//nmRldY6n30v4G9ctYARwYTTxhNzuWcgx4Ny93K+dyFbXTkESxyfEpgzc0c7Kl /qeg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1737403499; x=1738008299; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=/3zA6/rhJygiXhlmQObZDKbmJPf5h9J80NCk24OJ8UQ=; b=GFAW4V1S2NyjFhs798NgKO/ySIOmeVEhU2ez9AUPz7mH25Uley1luczoIoDzsjgaAj FZGb30sKKtBTaAdmSo+Wnrcpqx8u8PVYVfaWw7kzHROQfcUi5jN484tBFBROwXZlbUvY T6SBMyh34O1JZREkooebrB4V4StRFJoIna8yCgKz+LmI0TGw5IlGwdJhSRVcKFnRi70a m98TE8bYCdy148cx8ENZu5z0RhIXlBnUSj0ZHdRliYUtfDBVy4zXIV7Bc+Zh9Xeyucdq uEo6uBaAjQmPErjFr0D2TXfiPeIqpGA9PvnKWGKB6XXffLv/0IPuYwSWLgsbtth9dC4Y ySlg== X-Forwarded-Encrypted: i=1; AJvYcCXttRd0fhyMaqMuZqifFi6L5qXk8E2hlyEGi+QINqg2loKp6oKPg0TwrCdSE2xDcSYFXzYeadt/dA==@kvack.org X-Gm-Message-State: AOJu0Yzo9ZyYWWinhYIyjeiqgvPVA+AlDN48tajqGvNgqT9Wglzs8kVY OzUt+rPB4F6PAxxzDfTA30g3f96Qq4bjxdvYvfGqcTImnB5J7Ecx X-Gm-Gg: ASbGncvWpRaEDfrKrX0bXt2cuetCSq5UsVh8Ntu+XOIujd1gch//A1DuZ2PSFjVztWS Ejjs85Z5HjMYD1bGVuk3TkeYDzlCP6KjMwm3QuWnEfECm7MyOEWRPgZDBaGdcTBchj1tUcXRcYB icxpgiGaLakG9Lu72olm0bmR7GdbvwK6P+RYWYAsUXeYX7VAEhalXlzqPA/TM82I8XHJiwl5Ac3 grQl/nBw5SHpd9I4bgOOzab2NbYJPtVp8rRF3iZ7HQcWL+ljGVojK4pV/b+E+5hpkuWR+i7mrQl vJ75tKZr11aJID61o8/z X-Google-Smtp-Source: AGHT+IFBsaEesnJ759o+iAnExQGWf582lTLCGHFFJ6J8OJa0wx62JV85W/0tWwiEUMGiJrjCFrQlbA== X-Received: by 2002:a17:907:7fa8:b0:aaf:123a:e4f0 with SMTP id a640c23a62f3a-ab38b0b6886mr1505818266b.6.1737403498258; Mon, 20 Jan 2025 12:04:58 -0800 (PST) Received: from smtpclient.apple ([132.68.46.60]) by smtp.gmail.com with ESMTPSA id a640c23a62f3a-ab384ce11eesm660135766b.38.2025.01.20.12.04.56 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 20 Jan 2025 12:04:57 -0800 (PST) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3826.300.87.4.3\)) Subject: Re: [PATCH v6 09/12] x86/mm: enable broadcast TLB invalidation for multi-threaded processes From: Nadav Amit In-Reply-To: Date: Mon, 20 Jan 2025 22:04:45 +0200 Cc: the arch/x86 maintainers , Linux Kernel Mailing List , Borislav Petkov , peterz@infradead.org, Dave Hansen , zhengqi.arch@bytedance.com, thomas.lendacky@amd.com, kernel-team@meta.com, "open list:MEMORY MANAGEMENT" , Andrew Morton , jannh@google.com, mhklinux@outlook.com, andrew.cooper3@citrix.com Content-Transfer-Encoding: quoted-printable Message-Id: References: <20250120024104.1924753-1-riel@surriel.com> <20250120024104.1924753-10-riel@surriel.com> <84ba1c3e-d975-458f-89f5-a6f5d04a3d22@gmail.com> To: Rik van Riel X-Mailer: Apple Mail (2.3826.300.87.4.3) X-Rspam-User: X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 3C0871C0015 X-Stat-Signature: 1hddqg5ttthct4x7zqehnfeh411gitmj X-HE-Tag: 1737403499-507768 X-HE-Meta: U2FsdGVkX1+DaUgFJW2Ca0nIAs5+FJMSzgQDDMLdwry91isNs0oco6FlLdlEcrpzzVVo69f7DHajvcp7QiMkz6ZDxrIN03aw2jvgM2+Ux5Vdl+xpVta/bX/ZhjeDzHz/UvWwTK8rfg7R6QmAX7JzGhiA5IhsGEV/FNBDWBgyRZ/ztL1n8PA8HgMbibhmCTR784iC27gwrntlg2HPaI/t74HpQeaJo3byfThfk9I9wg33Kj2cejpXbJs1B1aipqkwNJPWs9nYHsMEeAf8FR3rCz4+n3PcZSaJXnfqsFE4ybN6sJ0fzIkoD4GrO95cp68pfc3J1NvkAYRDOpkfc6in9XJSy8BgciBY2S1d4s1anMpn9jaSGd7OwIbJVp2jAl3BDKIx2HqNRKJVHLzLyyXrC8L7RbJpreLqrFQmaYSk4YDWw/TSyqw4QA5jSxRRwMfdt0t94zzc+YY90ygPBLnh0pgZLf9YkPwlt5CiE+Bo5TPQkO+TNoZsyA4HBl5FZXGQAGvA7olLmw5kL6xGB84g0A50zF7wpD+H8ghFdaVk8FjnZ8s+YOFDahXDQISfnaX/GSCELB2FXzs3RpCX3N7YJD5Naetr03vERUdNCJ6juv6a39NSG3/j78uA+vkHc//KA8MBu6tvWDyOGVkrcUNtmNHTTVS8irtPnkoh/1PxbHvwmCznfyVIq+53HJCxjVSYsusIbTEtNEjaKzbkvHtrJtwZp267aHxA/y6em5OHPVy5uFwbwuK8WDacOWcd4TQIDqZLUDT12aLbRk0z6teYRqR7Mak73HY7VHF0UA2y2K68lLDnpqQ+9RVmALeyqSjLPNXYauyjLZabaIh8xw4KxVJF9h7L6dxSpmPRXjU6GzL5HQr5EY5NulEZdCl5SKzXFjf/FwvwFy7QtxmhPSJgGmYoXlKTGjPfrKmAr7MVKsT45C44+YylsH8rs8ilMAzBFmad3leDGsahOPmnG4K mm3e5FlA KoRc8hDImx8E5+tS+chCIhKOtGW9FstfUWX7Sioka9C00c5SC3ROfFMO37V0GJaxvwoCFtcpuEETXKfNNAdq9uzWT8lKGJNJuZ1RVfu25d3svPMa9Gef+4EdNZgr4wonpTN5EigU+tWr2yQBmih+xEPm7tqqyRIvpCW8fm/YQUIGe4Pnuak2pJQFrcj3qVwzQWObf1erFt5EfaUMnrNgqukdG/8nFOOFg4pe7DlvvPwTL5ot+BoOkV2hFQ/4qq002OZ+NdW8SzMNos54tMARBeFcXD1QwstjYboRKAySkeOLXrgo4ssXJnyKZC1O8pEW66/KJ1mMXvtuOalCPGAlFKvbyWlAqbs9Y5aFAoTZlRFLpE/9JcHGhqWz2iSz+fOtiBxshoS9+a6zYvEtIyM522KCGny/axnPzkq30pPMBUkkisS5OuzlbXJZKsCSoToiTD+JBhR+0vuq/OTU/fSaw7vBFkzLdEUdbt5Qp X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: > On 20 Jan 2025, at 18:09, Rik van Riel wrote: >=20 > On Mon, 2025-01-20 at 16:02 +0200, Nadav Amit wrote: >>=20 >>=20 >> On 20/01/2025 4:40, Rik van Riel wrote: >>>=20 >>> +static inline void broadcast_tlb_flush(struct flush_tlb_info >>> *info) >>> +{ >>> + VM_WARN_ON_ONCE(1); >>=20 >> Not sure why not the use VM_WARN_ONCE() instead with some more=20 >> informative message (anyhow, a string is allocated for it). >>=20 > VM_WARN_ON_ONCE only has a condition, not a message. Right, my bad. >=20 >>> + /* Slower check to make sure. */ >>> + for_each_cpu(cpu, mm_cpumask(mm)) { >>> + /* Skip the CPUs that aren't really running this >>> process. */ >>> + if (per_cpu(cpu_tlbstate.loaded_mm, cpu) !=3D mm) >>> + continue; >>=20 >> Then perhaps at least add a comment next to loaded_mm, that it's not=20= >> private per-se, but rarely accessed by other cores? >>=20 > I don't see any comment in struct tlb_state that > suggests it was ever private to begin with. >=20 > Which comment are you referring to that should > be edited? You can see there is a tlb_state_shared, so one assumes tlb_state is private... (at least that was my intention separating them). >=20 >>>=20 >>> + WRITE_ONCE(mm->context.asid_transition, true); >>> + WRITE_ONCE(mm->context.global_asid, get_global_asid()); >>=20 >> I know it is likely correct in practice (due to TSO memory model), >> but=20 >> it is not clear, at least for me, how those write order affects the >> rest=20 >> of the code. I managed to figure out how it relates to the reads in=20= >> flush_tlb_mm_range() and native_flush_tlb_multi(), but I wouldn't say >> it=20 >> is trivial and doesn't worth a comment (or smp_wmb/smp_rmb). >>=20 >=20 > What kind of wording should we add here to make it > easier to understand? >=20 > "The TLB invalidation code reads these variables in > the opposite order in which they are written" ? Usually in such cases, you make a reference to wherever there are = readers that rely on the ordering. This is how documenting smp_wmb()/smp_rmb() ordering is usually done. >=20 >=20 >>>=20 >>> + /* >>> + * If at least one CPU is not using the global ASID yet, >>> + * send a TLB flush IPI. The IPI should cause stragglers >>> + * to transition soon. >>> + * >>> + * This can race with the CPU switching to another task; >>> + * that results in a (harmless) extra IPI. >>> + */ >>> + if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm_asid, cpu)) = !=3D bc_asid) { >>> + flush_tlb_multi(mm_cpumask(info->mm), info); >>> + return; >>> + } >>=20 >> I am trying to figure out why we return here. The transition might >> not=20 >> be over? Why is it "soon"? Wouldn't flush_tlb_func() reload it=20 >> unconditionally? >=20 > The transition _should_ be over, but what if another > CPU got an NMI while in the middle of switch_mm_irqs_off, > and set its own bit in the mm_cpumask after we send this > IPI? >=20 > On the other hand, if it sets its mm_cpumask bit after > this point, it will also load the mm->context.global_asid > after this point, and should definitely get the new ASID. >=20 > I think we are probably fine to set asid_transition to > false here, but I've had to tweak this code so much over > the past months that I don't feel super confident any more :) I fully relate, but I am not sure it is that great. The problem is that nobody would have the guts to change that code later... >>=20 >> And I guess this is now a bug, if info->stride_shift > PMD_SHIFT... >>=20 > We set maxnr to 1 for larger stride shifts at the top of the function: >=20 You=E2=80=99re right, all safe. >=20 >>> + goto reload_tlb; >>=20 >> Not a fan of the goto's when they are not really needed, and I don't=20= >> think it is really needed here. Especially that the name of the tag=20= >> "reload_tlb" does not really convey that the page-tables are reloaded >> at=20 >> that point. >=20 > In this particular case, the CPU continues running with > the same page tables, but with a different PCID. I understand it is =E2=80=9Creload_tlb=E2=80=9D from your point of view, = or from the point of view of the code that does the =E2=80=9Cgoto=E2=80=9D, but if I = showed you the code that follows the =E2=80=9Creload_tlb=E2=80=9D, I=E2=80=99m not = sure you=E2=80=99d know it is so. [ snip, taking your valid points ] >>=20 >>> + loaded_mm_asid =3D >>> this_cpu_read(cpu_tlbstate.loaded_mm_asid); >>> + } >>> + >>> + /* Broadcast ASIDs are always kept up to date with >>> INVLPGB. */ >>> + if (is_global_asid(loaded_mm_asid)) >>> + return; >>=20 >> The comment does not clarify to me, and I don't manage to clearly=20 >> explain to myself, why it is guaranteed that all the IPI TLB flushes, >> which were potentially issued before the transition, are not needed. >>=20 > IPI TLB flushes that were issued before the transition went > to the CPUs when they were using dynamic ASIDs (numbers 1-5). >=20 > Reloading the TLB with a different PCID, even pointed at the > same page tables, means that the TLB should load the > translations fresh from the page tables, and not re-use any > that it had previously loaded under a different PCID. >=20 What about this scenario for instance? CPU0 CPU1 CPU2 ---- ---- ---- (1) use_global_asid(mm): =20 mm->context.asid_trans =3D T; mm->context.global_asid =3D G; (2) switch_mm(..., next=3Dmm): *Observes global_asid =3D G =3D> loads CR3 with PCID=3DG =3D> fills TLB under G. TLB caches PTE[G, V] =3D P (for some reason) (3) flush_tlb_mm_range(mm): *Sees global_asid =3D=3D = 0 (stale/old value) =3D> flush_tlb_multi() =3D> IPI flush for dyn. (4) IPI arrives on CPU1: flush_tlb_func(...):=20 is_global_asid(G)? yes, skip invalidate; broadcast flush assumed to cover it. (5) IPI completes on CPU2: Dyn. ASIDs are flushed,=20= but CPU1=E2=80=99s = global ASID was never invalidated! (6) CPU1 uses stale TLB entries under ASID G. TLB continues to use PTE[G, V] =3D P, as it was not invalidated.