From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yicong Yang <yangyicong@huawei.com>
To: Punit Agrawal
CC: Barry Song <21cnbao@gmail.com>, Barry Song, Nadav Amit, Mel Gorman
Subject: Re: [PATCH v8 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
Date: Sat, 1 Apr 2023 20:12:30 +0800
Message-ID: <241c3a4c-642e-e871-72ae-e3098a967b69@huawei.com>
In-Reply-To: <2687a998-6dbe-de8f-2f62-1456d2de7940@huawei.com>
References: <20230329035512.57392-1-yangyicong@huawei.com> <20230329035512.57392-3-yangyicong@huawei.com> <87cz4qwfbt.fsf_-_@stealth> <2687a998-6dbe-de8f-2f62-1456d2de7940@huawei.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk

On 2023/3/30 21:45, Yicong Yang wrote:
> Hi Punit,
>
> On 2023/3/30 21:15, Punit Agrawal wrote:
>> Hi Yicong,
>>
>> Yicong Yang writes:
>>
>>> From: Barry Song
>>>
>>> On x86, batched and deferred tlb shootdown has led to a 90%
>>> performance increase on tlb shootdown.
>>> On arm64, HW can do tlb shootdown without software IPI, but
>>> the sync tlbi is still quite expensive.
>>>
>>> Even running the simplest program which requires swapout can
>>> prove this is true (note: the original header names were lost
>>> in transit; the includes below are reconstructed from what the
>>> code uses):
>>>
>>> #include <sys/mman.h>
>>> #include <string.h>
>>> #include <stdlib.h>
>>> #include <unistd.h>
>>>
>>> int main()
>>> {
>>> #define SIZE (1 * 1024 * 1024)
>>>         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>>>                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>>>
>>>         memset(p, 0x88, SIZE);
>>>
>>>         for (int k = 0; k < 10000; k++) {
>>>                 /* swap in */
>>>                 for (int i = 0; i < SIZE; i += 4096) {
>>>                         (void)p[i];
>>>                 }
>>>
>>>                 /* swap out */
>>>                 madvise(p, SIZE, MADV_PAGEOUT);
>>>         }
>>> }
>>>
>>> Perf result on snapdragon 888 with 8 cores, using zRAM as
>>> the swap block device:
>>>
>>> ~ # perf record taskset -c 4 ./a.out
>>> [ perf record: Woken up 10 times to write data ]
>>> [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
>>> ~ # perf report
>>> # To display the perf.data header info, please use --header/--header-only options.
>>> #
>>> # Total Lost Samples: 0
>>> #
>>> # Samples: 60K of event 'cycles'
>>> # Event count (approx.): 35706225414
>>> #
>>> # Overhead  Command  Shared Object      Symbol
>>> # ........  .......  .................  ..............................
>>> #
>>>   21.07%  a.out  [kernel.kallsyms]  [k] _raw_spin_unlock_irq
>>>    8.23%  a.out  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>>>    6.67%  a.out  [kernel.kallsyms]  [k] filemap_map_pages
>>>    6.16%  a.out  [kernel.kallsyms]  [k] __zram_bvec_write
>>>    5.36%  a.out  [kernel.kallsyms]  [k] ptep_clear_flush
>>>    3.71%  a.out  [kernel.kallsyms]  [k] _raw_spin_lock
>>>    3.49%  a.out  [kernel.kallsyms]  [k] memset64
>>>    1.63%  a.out  [kernel.kallsyms]  [k] clear_page
>>>    1.42%  a.out  [kernel.kallsyms]  [k] _raw_spin_unlock
>>>    1.26%  a.out  [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
>>>    1.23%  a.out  [kernel.kallsyms]  [k] xas_load
>>>    1.15%  a.out  [kernel.kallsyms]  [k] zram_slot_lock
>>>
>>> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
>>> swapping in/out a page mapped by only one process. If the
>>> page is mapped by multiple processes, typically more than
>>> 100 on a phone, the overhead would be much higher as we
>>> have to run the tlb flush 100 times for one single page.
>>> Plus, tlb flush overhead will increase with the number of
>>> CPU cores due to the bad scalability of tlb shootdown in
>>> HW, so ARM64 servers should expect much higher overhead.
>>>
>>> Further perf annotate shows 95% of the cpu time of
>>> ptep_clear_flush() is actually spent in the final dsb()
>>> waiting for the completion of the tlb flush. This gives us
>>> a very good chance to leverage the existing batched tlb
>>> infrastructure in the kernel. The minimal modification is
>>> to send only an async tlbi in the first stage, and to issue
>>> the dsb only when we have to sync in the second stage.
>>>
>>> With the above simplest micro benchmark, the collapsed time
>>> to finish the program decreases by around 5%.
>>>
>>> Typical collapsed time w/o patch:
>>> ~ # time taskset -c 4 ./a.out
>>> 0.21user 14.34system 0:14.69elapsed
>>> w/ patch:
>>> ~ # time taskset -c 4 ./a.out
>>> 0.22user 13.45system 0:13.80elapsed
>>>
>>> Also, Yicong Yang added the following observation.
>>> Tested with the benchmark in the commit on a Kunpeng920 arm64
>>> server, and observed an improvement around 12.5% with command
>>> `time ./swap_bench`.
>>>         w/o             w/
>>> real    0m13.460s       0m11.771s
>>> user    0m0.248s        0m0.279s
>>> sys     0m12.039s       0m11.458s
>>>
>>> Originally a 16.99% overhead of ptep_clear_flush() was
>>> noticed, which has been eliminated by this patch:
>>>
>>> [root@localhost yang]# perf record -- ./swap_bench && perf report
>>> [...]
>>> 16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush
>>>
>>> It has been tested on 4, 8 and 128 CPU platforms and shows a
>>> benefit on large systems but may show no improvement on small
>>> systems like a 4 CPU platform. So make
>>> ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend on CONFIG_EXPERT for
>>> this stage, and disable it on systems with fewer than 8 CPUs.
>>> Users can modify this threshold according to their own
>>> platforms via CONFIG_NR_CPUS_FOR_BATCHED_TLB.
>>
>> The commit log and the patch disagree on the name of the config option
>> (CONFIG_NR_CPUS_FOR_BATCHED_TLB vs CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB).
>>
>
> Ah yes, it's a typo and I'll fix it.
>
>> But more importantly, I was wondering why this posting doesn't address
>> Catalin's feedback [a] about using a runtime tunable. Maybe I missed the
>> follow-up discussion.
>>

So I used the below patch on top of this series to provide a knob
/proc/sys/vm/batched_tlb_enabled for turning the batched TLB flush
on/off at runtime. But I'm wondering whether flush.c is the best place
for it — any comments? Thanks.
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 41a763cf8c1b..2b2c69c23b47 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -280,6 +280,8 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 
+extern struct static_key_false batched_tlb_enabled;
+
 static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
 {
 	/*
@@ -289,7 +291,7 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
 	 * a threshold for enabling this to avoid potential side effects on
 	 * these platforms.
 	 */
-	if (num_online_cpus() < CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB)
+	if (!static_branch_unlikely(&batched_tlb_enabled))
 		return false;
 
 	/*
diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 5f9379b3c8c8..ce3bc32523f7 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -7,8 +7,10 @@
  */
 
 #include <linux/export.h>
+#include <linux/jump_label.h>
 #include <linux/mm.h>
 #include <linux/pagemap.h>
+#include <linux/sysctl.h>
 
 #include <asm/cacheflush.h>
 #include <asm/cache.h>
@@ -107,3 +109,53 @@ void arch_invalidate_pmem(void *addr, size_t size)
 }
 EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
 #endif
+
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+
+DEFINE_STATIC_KEY_FALSE(batched_tlb_enabled);
+
+int batched_tlb_enabled_handler(struct ctl_table *table, int write,
+				void *buffer, size_t *lenp, loff_t *ppos)
+{
+	unsigned int enabled = static_branch_unlikely(&batched_tlb_enabled);
+	struct ctl_table t;
+	int err;
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &enabled;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (!err && write) {
+		if (enabled)
+			static_branch_enable(&batched_tlb_enabled);
+		else
+			static_branch_disable(&batched_tlb_enabled);
+	}
+
+	return err;
+}
+
+static struct ctl_table batched_tlb_sysctls[] = {
+	{
+		.procname	= "batched_tlb_enabled",
+		.data		= NULL,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= batched_tlb_enabled_handler,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
+	{}
+};
+
+static int __init batched_tlb_sysctls_init(void)
+{
+	register_sysctl_init("vm", batched_tlb_sysctls);
+
+	return 0;
+}
+late_initcall(batched_tlb_sysctls_init);
+
+#endif
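
For reference, on a kernel carrying the patch above, the knob would be
exercised roughly as follows (a sketch, assuming the sysctl registers
under "vm" as in the patch and that you are running as root on such a
kernel):

```shell
# Read the current state of the knob (0 = batched TLB flush off, 1 = on)
cat /proc/sys/vm/batched_tlb_enabled

# Enable it at runtime; this flips the batched_tlb_enabled static key,
# so arch_tlbbatch_should_defer() starts deferring flushes
echo 1 > /proc/sys/vm/batched_tlb_enabled

# The same toggle via sysctl(8)
sysctl vm.batched_tlb_enabled=1
```

Because the check uses a static key rather than reading a variable,
the disabled case costs only a patched branch on the hot path.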