From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 558A1E77197 for ; Thu, 9 Jan 2025 20:22:32 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D08EF6B00AA; Thu, 9 Jan 2025 15:22:31 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id CB4526B00AE; Thu, 9 Jan 2025 15:22:31 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B7C8D6B00B0; Thu, 9 Jan 2025 15:22:31 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 95F526B00AA for ; Thu, 9 Jan 2025 15:22:31 -0500 (EST) Received: from smtpin14.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 4CE50C04B8 for ; Thu, 9 Jan 2025 20:22:31 +0000 (UTC) X-FDA: 82989036102.14.C73BAB5 Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147]) by imf23.hostedemail.com (Postfix) with ESMTP id AA78E14001B for ; Thu, 9 Jan 2025 20:22:28 +0000 (UTC) Authentication-Results: imf23.hostedemail.com; dkim=none; dmarc=none; spf=pass (imf23.hostedemail.com: domain of riel@shelob.surriel.com designates 96.67.55.147 as permitted sender) smtp.mailfrom=riel@shelob.surriel.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1736454148; a=rsa-sha256; cv=none; b=wXSm5dUye6AK0qVn2riJP5521/RDPRJHIl4m5Z3nYF8dEf3Jk/YwD88Dt4mLoWcgiSht1T kfNP0Kxn35BdkfOsYxDYGfito4UNkhHJFJ8qIvN0eJFKihYhcIgkfPC+Ehp64m/9Jbze/1 6/0Wx5yOyiQardKvPnSB9jkQgj3gPX8= ARC-Authentication-Results: i=1; imf23.hostedemail.com; dkim=none; dmarc=none; spf=pass (imf23.hostedemail.com: domain of riel@shelob.surriel.com designates 96.67.55.147 as permitted sender) smtp.mailfrom=riel@shelob.surriel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1736454148; h=from:from:sender:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=dpq4vTpqbKlRDiifE27zVPmZFm6sTZTg00HNe1r9d5s=; b=mP4DxPJVQ5iWH8Yr5uKOtcTpjIXOs6d4f1PxQiLxq+pUSSuIkLy0bPF4kY/pKZai0KxQ28 ROy1x1Lwv1q+GeD4TrfPWnQpNS4/dy7Cs2OM6yWEeR4PDd0qSL45VaEDxhyx3E4hJcftUY lxjRUWmKVKCueL14AL6nfaXTM5hi/dw= Received: from fangorn.home.surriel.com ([10.0.13.7]) by shelob.surriel.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.97.1) (envelope-from ) id 1tVyxd-000000007eP-49xB; Thu, 09 Jan 2025 15:16:58 -0500 Message-ID: <855298e6e981378c3afeab93b8c3cb821a7a5b88.camel@surriel.com> Subject: Re: [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes From: Rik van Riel To: Dave Hansen , x86@kernel.org Cc: linux-kernel@vger.kernel.org, kernel-team@meta.com, dave.hansen@linux.intel.com, luto@kernel.org, peterz@infradead.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, hpa@zytor.com, akpm@linux-foundation.org, nadav.amit@gmail.com, zhengqi.arch@bytedance.com, linux-mm@kvack.org Date: Thu, 09 Jan 2025 15:16:57 -0500 In-Reply-To: References: <20241230175550.4046587-1-riel@surriel.com> <20241230175550.4046587-7-riel@surriel.com> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable User-Agent: Evolution 3.54.1 (3.54.1-1.fc41) MIME-Version: 1.0 X-Rspam-User: X-Rspamd-Server: rspam03 X-Rspamd-Queue-Id: AA78E14001B X-Stat-Signature: hd8gwkacikm6neyysi5h4fhjahhpsn7s X-HE-Tag: 1736454148-812867 X-HE-Meta: U2FsdGVkX18OuVRnHtOZxJaGtzQYDjD9clwzsvJ1fIxKmoEKMEKUHWYktCD7Y1+OrYSLGvPg9V7PFQ0ZvkdH71U9THGiDhPBchEzOoLCmCzwufuksKH0xghsWKwWAwyPyfw0I2NvrZZHHIWOWxIdK+noIXXZQ6NDyRobgw0jCh9POPVW184wsLF5e0HdCIxNmgqPpMQDKglvhkfBKu51YytLMyW8dfLcy62i1MOdRiV1Epvjj/1KzI9/sYRInxjm7CIHeW61fpocOVeKBRf0asc2tu+SdyX2kKuWrT22pvaGcK+EqY9QXZctNsWV/IcXTOGDwT4KPCgVuBDYOK9V2nipEYgWtkwp9U8LCrBpu5j/Y7SDbk84eB+bnlu3mf+/wdNl4KsHmjMUF2QR7H27dtN02i8PyOqt6PeHDF2JnITx72Lb8SzLtLk6/DUHYgYhF+ldPxKChplfDh4i6HzkR/vIo0DzisYHVrNLuyEC+4gPRlk2mIRXES6YGbyYepyDitysK6wj/DF9ubwvt0vDPhbkT2AQmo9cwlX3nn2I9fuI1cj8alPdN1162++J9zqqjcLWfyXa7qectJqwidMPly3zOjHXjlcwgWi3m8ZwvuysO6vizb5FvuXqe/qGPSr9qGqJ1qcqF+8biQLw2FpDsxS3y6S7Y7E7PsiXE+2U8qFAG9aVeD2diGA2hzownlWtDCSPoe0N9TWXP71TEIKMNg1DUeas+BfecwR/0saJbGjl+GDuE+rTajPbojO0E/2lEcl8YmTZB/dnK24iefimfPQP9txKK3oCvxitBVUZreuH19DqWHpJ8fMsuNXhVuoOd06h/e7IpOV3rYYBeQ+9sCQZftfIGAH/iI/5lLwQH+pf2jK8ZEhIU+OJOl1SmEl5JRLYXBpTxlpoVv9inSOxNydX28GFdZItu1z8kEDSUYV0xHwBXXkSRcmMQ0fCblBJg4S3u4a3uSH3b3sNOvG C194EGo4 0nhlyWH6YPeqXYBdz8cDKMyn3y7xMYtGiBwdR3iRqxsQoldKG+Yn06yrFF8akm6onQZwCKP7EGn7bVqZ0XI7yukzl9pT5zXIoF1QEASqYMmt7/9sGbbAcGE1Y2AtfAKPEpZy08g/dEPCvNcULahHyEERVJB8OUTbuy7UOyJjDMaDcTbSfUBBaw9MWiYwoACZCnd+PnwQERrgv8LCR3JKRrH8bDBwrnVbkjkwu2RJGMyfStM0SMxa3cp9+uFkkdoChUN+i5dWTWWHmsihawDNAOwoxvQRTGfOW71g3630DNnRupJHABNpOnH11Npdvge2aGhEIBGeOg60IMNFSUPNZ3JawEQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, 2025-01-06 at 09:21 -0800, Dave Hansen wrote: > On 12/30/24 09:53, Rik van Riel wrote: >=20 >=20 > > +static void broadcast_kernel_range_flush(unsigned long start, > > unsigned long end) > > +{ > > + unsigned long addr; > > + unsigned long maxnr =3D invlpgb_count_max; > > + unsigned long threshold =3D tlb_single_page_flush_ceiling * > > maxnr; >=20 > The 'tlb_single_page_flush_ceiling' value was determined by looking > at > _local_ invalidation cost. Could you talk a bit about why it's also a > good value to use for remote invalidations? Does it hold up for > INVLPGB > the same way it did for good ol' INVLPG? Has there been any explicit > testing here to find a good value? >=20 > I'm also confused by the multiplication here. Let's say > invlpgb_count_max=3D=3D20 and tlb_single_page_flush_ceiling=3D=3D30. >=20 > You would need to switch away from single-address invalidation when > the > number of addresses is >20 for INVLPGB functional reasons. But you'd > also need to switch away when >30 for performance reasons > (tlb_single_page_flush_ceiling). >=20 > But I don't understand how that would make the threshold 20*30=3D600 > invalidations. I have not done any measurement to see how flushing with INVLPGB stacks up versus local TLB flushes. What makes INVLPGB potentially slower: - These flushes are done globally What makes INVLPGB potentially faster: - Multiple flushes can be pending simultaneously, and executed in any convenient order by the CPUs. - Wait once on completion of all the queued flushes. Another thing that makes things interesting is the=C2=A0 TLB entry coalescing done by AMD CPUs. When multiple pages are both virtually and physically contiguous in memory (which is fairly common), the CPU can use a single TLB entry to map up to 8 of them. That means if we issue eg. 20 INVLPGB flushes for 8 4kB pages each, instead of the CPUs needing to remove 160 TLB entries, there might only be 50. I just guessed at the numbers used in my code, while trying to sort out the details elsewhere in the code. How should we go about measuring the tradeoffs between invalidation time, and the time spent in TLB misses from flushing unnecessary stuff? >=20 > > + /* > > + * TLBSYNC only waits for flushes originating on the same > > CPU. > > + * Disabling migration allows us to wait on all flushes. > > + */ >=20 > Imperative voice here too, please: >=20 > Disable migration to wait on all flushes. >=20 > > + guard(preempt)(); > > + > > + if (end =3D=3D TLB_FLUSH_ALL || > > + =C2=A0=C2=A0=C2=A0 (end - start) > threshold << PAGE_SHIFT) { >=20 > This is basically a copy-and-paste of the "range vs. global" flush > logic, but taking 'invlpgb_count_max' into account. >=20 > It would be ideal if those limit checks could be consolidated. I > suspect > that when the 'threshold' calculation above gets clarified that they > may > be easier to consolidate. Maybe? I implemented another suggestion in the code, and the start of flush_tlb_kernel_range() now looks like this: void flush_tlb_kernel_range(unsigned long start, unsigned long end) { if (broadcast_kernel_range_flush(start, end)) return; If we are to consolidate the limit check, it should probably be in a helper function somewhere, and not by spreading out the broadcast flush calls. >=20 > BTW, what is a typical value for 'invlpgb_count_max'? Is it more or > less > than the typical value for 'tlb_single_page_flush_ceiling'? >=20 > Maybe we should just lower 'tlb_single_page_flush_ceiling' if > 'invlpgb_count_max' falls below it so we only have _one_ runtime > value > to consider. The value for invlpgb_count_max on both Milan and Bergamo CPUs appears to be 8. That is, the CPU reports we can flush 7 additional pages beyond a single page. This matches the number of PTEs that can be=20 cached in one TLB entry if they are contiguous and aligned, and matches one cache line worth of PTE entries. >=20 >=20 > > + invlpgb_flush_all(); > > + } else { > > + unsigned long nr; > > + for (addr =3D start; addr < end; addr +=3D nr << > > PAGE_SHIFT) { > > + nr =3D min((end - addr) >> PAGE_SHIFT, > > maxnr); > > + invlpgb_flush_addr(addr, nr); > > + } > > + } > > + > > + tlbsync(); > > +} > > + > > =C2=A0static void do_kernel_range_flush(void *info) > > =C2=A0{ > > =C2=A0 struct flush_tlb_info *f =3D info; > > @@ -1089,6 +1115,11 @@ static void do_kernel_range_flush(void > > *info) > > =C2=A0 > > =C2=A0void flush_tlb_kernel_range(unsigned long start, unsigned long > > end) > > =C2=A0{ > > + if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) { > > + broadcast_kernel_range_flush(start, end); > > + return; > > + } > > + > > =C2=A0 /* Balance as user space task's flush, a bit conservative > > */ > > =C2=A0 if (end =3D=3D TLB_FLUSH_ALL || > > =C2=A0 =C2=A0=C2=A0=C2=A0 (end - start) > tlb_single_page_flush_ceiling= << > > PAGE_SHIFT) { >=20 > I also wonder if this would all get simpler if we give in and > *always* > call get_flush_tlb_info(). That would provide a nice single place to > consolidate the "all vs. ranged" flush logic. Possibly. That might be a good way to unify that threshold check? That should probably be a separate patch, though. --=20 All Rights Reversed.