From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yicong Yang <yangyicong@huawei.com>
Date: Fri, 8 Jul 2022 14:21:50 +0800
Subject: Re: [PATCH 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
To: Barry Song <21cnbao@gmail.com>
CC: Barry Song, Nadav Amit, Mel Gorman
X-Delivered-To: linux-mm@kvack.org
In-Reply-To: <20220707125242.425242-5-21cnbao@gmail.com>
References: <20220707125242.425242-1-21cnbao@gmail.com>
 <20220707125242.425242-5-21cnbao@gmail.com>

Hi Barry,

On 2022/7/7 20:52, Barry Song wrote:
> From: Barry Song
>
> On x86, batched and deferred tlb shootdown has led to a 90%
> performance increase on tlb shootdown. On arm64, HW can do
> tlb shootdown without software IPI. But sync tlbi is still
> quite expensive.
>
> Even running the simplest program that requires swapout
> demonstrates this:
>
> #include <sys/types.h>
> #include <fcntl.h>
> #include <sys/mman.h>
> #include <string.h>
>
> int main()
> {
> #define SIZE (1 * 1024 * 1024)
>         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>
>         memset(p, 0x88, SIZE);
>
>         for (int k = 0; k < 10000; k++) {
>                 /* swap in */
>                 for (int i = 0; i < SIZE; i += 4096) {
>                         (void)p[i];
>                 }
>
>                 /* swap out */
>                 madvise(p, SIZE, MADV_PAGEOUT);
>         }
> }
>
> Perf result on snapdragon 888 with 8 cores by using zRAM
> as the swap block device.
>
>  ~ # perf record taskset -c 4 ./a.out
>  [ perf record: Woken up 10 times to write data ]
>  [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
>  ~ # perf report
>  # To display the perf.data header info, please use --header/--header-only options.
>  #
>  #
>  # Total Lost Samples: 0
>  #
>  # Samples: 60K of event 'cycles'
>  # Event count (approx.): 35706225414
>  #
>  # Overhead  Command  Shared Object      Symbol
>  # ........  .......  .................  .............................................................................
>  #
>     21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
>      8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>      6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
>      6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
>      5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>      3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
>      3.49%  a.out    [kernel.kallsyms]  [k] memset64
>      1.63%  a.out    [kernel.kallsyms]  [k] clear_page
>      1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
>      1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
>      1.23%  a.out    [kernel.kallsyms]  [k] xas_load
>      1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock
>
> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
> swapping in/out a page mapped by only one process. If the
> page is mapped by multiple processes, typically more than
> 100 on a phone, the overhead would be much higher as we
> have to run tlb flush 100 times for one single page.
> Plus, tlb flush overhead will increase with the number
> of CPU cores due to the bad scalability of tlb shootdown
> in HW, so those ARM64 servers should expect much higher
> overhead.
>
> Further perf annotate shows 95% of the CPU time of
> ptep_clear_flush is actually used by the final dsb() to wait
> for the completion of tlb flush. This gives us a very good
> chance to leverage the existing batched tlb in kernel. The
> minimum modification is that we only send async tlbi in the
> first stage and we send dsb when we have to sync in the
> second stage.
>
> With the above micro-benchmark, the elapsed time to finish
> the program decreases by around 5%.
>
> Typical elapsed time w/o patch:
>  ~ # time taskset -c 4 ./a.out
>  0.21user 14.34system 0:14.69elapsed
> w/ patch:
>  ~ # time taskset -c 4 ./a.out
>  0.22user 13.45system 0:13.80elapsed
>

I tested with the benchmark from the commit message on a Kunpeng920
arm64 server and observed an improvement of around 12.5% in elapsed
(real) time with the command `time ./swap_bench`.
        w/o             w/
real    0m13.460s       0m11.771s
user    0m0.248s        0m0.279s
sys     0m12.039s       0m11.458s

Without the patch, perf showed ptep_clear_flush() accounting for 16.99%
of the overhead, which is eliminated by this patch:

[root@localhost yang]# perf record -- ./swap_bench && perf report
[...]
16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

Feel free to add:
Tested-by: Yicong Yang

> Cc: Jonathan Corbet
> Cc: Nadav Amit
> Cc: Mel Gorman
> Signed-off-by: Barry Song
> ---
>  Documentation/features/vm/TLB/arch-support.txt |  2 +-
>  arch/arm64/Kconfig                             |  1 +
>  arch/arm64/include/asm/tlbbatch.h              | 12 ++++++++++++
>  arch/arm64/include/asm/tlbflush.h              | 13 +++++++++++++
>  4 files changed, 27 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/include/asm/tlbbatch.h
>
> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
> index 1c009312b9c1..2caf815d7c6c 100644
> --- a/Documentation/features/vm/TLB/arch-support.txt
> +++ b/Documentation/features/vm/TLB/arch-support.txt
> @@ -9,7 +9,7 @@
>      |       alpha: | TODO |
>      |         arc: | TODO |
>      |         arm: | TODO |
> -    |       arm64: | TODO |
> +    |       arm64: |  ok  |
>      |        csky: | TODO |
>      |     hexagon: | TODO |
>      |        ia64: | TODO |
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 1652a9800ebe..e94913a0b040 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -93,6 +93,7 @@ config ARM64
>  	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>  	select ARCH_SUPPORTS_NUMA_BALANCING
>  	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
> +	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>  	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
>  	select ARCH_WANT_DEFAULT_BPF_JIT
>  	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> new file mode 100644
> index 000000000000..fedb0b87b8db
> --- /dev/null
> +++ b/arch/arm64/include/asm/tlbbatch.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ARCH_ARM64_TLBBATCH_H
> +#define _ARCH_ARM64_TLBBATCH_H
> +
> +struct arch_tlbflush_unmap_batch {
> +	/*
> +	 * For arm64, HW can do tlb shootdown, so we don't
> +	 * need to record cpumask for sending IPI
> +	 */
> +};
> +
> +#endif /* _ARCH_ARM64_TLBBATCH_H */
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 412a3b9a3c25..b3ed163267ca 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -272,6 +272,19 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
>  	dsb(ish);
>  }
>
> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> +					struct mm_struct *mm,
> +					struct vm_area_struct *vma,
> +					unsigned long uaddr)
> +{
> +	flush_tlb_page_nosync(vma, uaddr);
> +}
> +
> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> +{
> +	dsb(ish);
> +}
> +
>  /*
>   * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
>   * necessarily a performance improvement.
>