Date: Wed, 12 Feb 2020 15:26:24 +0000
From: Catalin Marinas
To: "qi.fuli@fujitsu.com"
Cc: Andrea Arcangeli, Will Deacon, Jon Masters, Rafael Aquini, Mark Salter,
 "linux-mm@kvack.org", "linux-kernel@vger.kernel.org",
 "linux-arm-kernel@lists.infradead.org"
Subject: Re: [PATCH 2/2] arm64: tlb: skip tlbi broadcast for single threaded TLB flushes
Message-ID: <20200212152624.GA587247@arrakis.emea.arm.com>
References: <20200203201745.29986-1-aarcange@redhat.com>
 <20200203201745.29986-3-aarcange@redhat.com>
 <6e59905d-3e5b-bbd5-d192-9f18a0a152f5@jp.fujitsu.com>
In-Reply-To: <6e59905d-3e5b-bbd5-d192-9f18a0a152f5@jp.fujitsu.com>

On Wed, Feb 12, 2020 at 02:13:56PM +0000, qi.fuli@fujitsu.com wrote:
> On 2/4/20 5:17 AM, Andrea Arcangeli wrote:
> > With multiple NUMA nodes and multiple sockets, the tlbi broadcast
> > shall be delivered through the interconnects in turn increasing the
> > interconnect traffic and the latency of the tlbi broadcast instruction.
> >
> > Even within a single NUMA node the latency of the tlbi broadcast
> > instruction increases almost linearly with the number of CPUs trying to
> > send tlbi broadcasts at the same time.
> >
> > When the process is single threaded however we can achieve full SMP
> > scalability by skipping the tlbi broadcasting. Other arches already
> > deploy this optimization.
> >
> > After the local TLB flush this however means the ASID context goes out
> > of sync in all CPUs except the local one. This can be tracked in the
> > mm_cpumask(mm): if the bit is set it means the asid context is stale
> > for that CPU. This results in an extra local ASID TLB flush only if a
> > single threaded process is migrated to a different CPU and only after a
> > TLB flush. No extra local TLB flush is needed for the common case of
> > single threaded processes context scheduling within the same CPU and for
> > multithreaded processes.
> >
> > Skipping the tlbi instruction broadcasting is already implemented in
> > local_flush_tlb_all(), this patch only extends it to flush_tlb_mm(),
> > flush_tlb_range() and flush_tlb_page() too.
> >
> > Here's the result of 32 CPUs (ARMv8 Ampere) running mprotect at the same
> > time from 32 single threaded processes before the patch:
> >
> >  Performance counter stats for './loop' (3 runs):
> >
> >                  0      dummy
> >
> >           2.121353 +- 0.000387 seconds time elapsed  ( +-  0.02% )
> >
> > and with the patch applied:
> >
> >  Performance counter stats for './loop' (3 runs):
> >
> >                  0      dummy
> >
> >          0.1197750 +- 0.0000827 seconds time elapsed  ( +-  0.07% )
>
> I have tested this patch on thunderX2 with Himeno benchmark[1] with
> LARGE calculation size. Here are the results.
>
>   w/o patch:   MFLOPS : 1149.480174
>   w/ patch:    MFLOPS : 1110.653003
>
> In order to validate the effectiveness of the patch, I ran a
> single-threaded program, which calls mprotect() in a loop to issue the
> tlbi broadcast instruction on a CPU core. At the same time, I ran Himeno
> benchmark on another CPU core. The results are:
>
>   w/o patch:   MFLOPS :  860.238792
>   w/ patch:    MFLOPS : 1110.449666
>
> Though Himeno benchmark is a microbenchmark, I hope it helps.

It doesn't really help. What if you have a two-thread program calling
mprotect() in a loop? IOW, how is this relevant to real-world scenarios?

Thanks.

--
Catalin
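
For readers who want to reproduce the kind of load discussed in this
thread, the single-threaded mprotect() loop could look roughly like the
sketch below. This is not the './loop' program or the test program from
the thread (neither was posted here); the function name mprotect_loop()
and the single anonymous page are assumptions made for illustration.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from the thread: toggle the protection
 * of one anonymous page `iterations` times. Changing the permissions of
 * a populated mapping makes the kernel invalidate the stale TLB entry,
 * which is the flush path the patch under discussion touches.
 *
 * Returns the number of successful mprotect() calls, or -1 if the page
 * could not be mapped.
 */
long mprotect_loop(long iterations)
{
    long page = sysconf(_SC_PAGESIZE);
    char *buf;
    long ok = 0;

    if (page <= 0)
        return -1;

    buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    buf[0] = 1; /* fault the page in so there is a mapping to flush */

    for (long i = 0; i < iterations; i++) {
        /* Alternate between read-only and read-write. */
        int prot = (i & 1) ? (PROT_READ | PROT_WRITE) : PROT_READ;

        if (mprotect(buf, (size_t)page, prot) == 0)
            ok++;
    }

    munmap(buf, (size_t)page);
    return ok;
}
```

Calling mprotect_loop() with a large iteration count from a process
pinned to one core (for example with taskset), while a benchmark runs on
another core, approximates the single-threaded case described above. It
does not cover the two-thread case raised in the reply.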