* [PATCH v8 0/2] arm64: support batched/deferred tlb shootdown during page reclamation
From: Yicong Yang @ 2023-03-29  3:55 UTC
To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will,
    anshuman.khandual, linux-doc
Cc: corbet, peterz, arnd, punit.agrawal, linux-kernel, darren, yangyicong,
    huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips,
    openrisc, linuxppc-dev, linux-riscv, linux-s390, Barry Song,
    wangkefeng.wang, xhao, prime.zeng, Jonathan.Cameron

From: Yicong Yang <yangyicong@hisilicon.com>

Though ARM64 has the hardware to do tlb shootdown, the hardware
broadcasting is not free. A simple micro benchmark shows that even on
snapdragon 888 with only 8 cores, the overhead of ptep_clear_flush is
huge even when paging out one page mapped by only one process:

     5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush

When pages are mapped by multiple processes or the HW has more CPUs, the
cost becomes even higher due to the bad scalability of tlb shootdown.
The same benchmark results in 16.99% CPU consumption on an ARM64 server
with around 100 cores, according to Yicong's test in patch 2/2.

This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
1. only sending tlbi instructions in the first stage -
   arch_tlbbatch_add_pending()
2. waiting for the completion of tlbi by dsb while doing tlbbatch
   sync in arch_tlbbatch_flush()

Testing on snapdragon shows the overhead of ptep_clear_flush is removed
by the patchset. The micro benchmark becomes 5% faster even for one page
mapped by a single process on snapdragon 888.

This support also speeds up page migration by more than 50% when
combined with batched TLB flushing [*].

[*] https://lore.kernel.org/linux-mm/20230213123444.155149-1-ying.huang@intel.com/

-v8:
1. Rebase on 6.3-rc4
2. Tested the optimization on page migration and mentioned it in the commit
3. Thanks to Anshuman for the review.
Link: https://lore.kernel.org/linux-mm/20221117082648.47526-1-yangyicong@huawei.com/

-v7:
1. rename arch_tlbbatch_add_mm() to arch_tlbbatch_add_pending() as suggested,
   since it takes an extra address for arm64, per Nadav and Anshuman. Also
   mentioned in the commit.
2. add tags from Xin Hao, thanks.
Link: https://lore.kernel.org/lkml/20221115031425.44640-1-yangyicong@huawei.com/

-v6:
1. comment that we don't defer TLB flush on platforms affected by
   ARM64_WORKAROUND_REPEAT_TLBI
2. use cpus_have_const_cap() instead of this_cpu_has_cap()
3. add tags from Punit, thanks.
4. enable the feature by default when cpus >= 8 rather than > 8, since the
   original improvement is observed on snapdragon 888 with 8 cores.
Link: https://lore.kernel.org/lkml/20221028081255.19157-1-yangyicong@huawei.com/

-v5:
1. Make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend on EXPERT for this stage
   on arm64.
2. Add a threshold of CPU numbers for enabling batched TLB flush on arm64
Link: https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyicong@huawei.com/T/

-v4:
1. Add tags from Kefeng and Anshuman, thanks.
2. Limit the TLB batch/defer to systems with >4 CPUs, per Anshuman
3. Merge previous Patch 1,2-3 into one, per Anshuman
Link: https://lore.kernel.org/linux-mm/20220822082120.8347-1-yangyicong@huawei.com/

-v3:
1. Declare arch's tlbbatch defer support by arch_tlbbatch_should_defer()
   instead of ARCH_HAS_MM_CPUMASK, per Barry and Kefeng
2. Add Tested-by from Xin Hao
Link: https://lore.kernel.org/linux-mm/20220711034615.482895-1-21cnbao@gmail.com/

-v2:
1. Collected Yicong's test result on kunpeng920 ARM64 server;
2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
   according to the comments of Peter Zijlstra and Dave Hansen
3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask
   is empty according to the comments of Nadav Amit

Thanks to Peter, Dave and Nadav for testing, reviewing and commenting.

-v1:
https://lore.kernel.org/lkml/20220707125242.425242-1-21cnbao@gmail.com/

Anshuman Khandual (1):
  mm/tlbbatch: Introduce arch_tlbbatch_should_defer()

Barry Song (1):
  arm64: support batched/deferred tlb shootdown during page reclamation

 .../features/vm/TLB/arch-support.txt |  2 +-
 arch/arm64/Kconfig                   |  6 +++
 arch/arm64/include/asm/tlbbatch.h    | 12 +++++
 arch/arm64/include/asm/tlbflush.h    | 52 ++++++++++++++++++-
 arch/x86/include/asm/tlbflush.h      | 17 +++++-
 include/linux/mm_types_task.h        |  4 +-
 mm/rmap.c                            | 21 +++-----
 7 files changed, 94 insertions(+), 20 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

-- 
2.24.0
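
The two-stage scheme described in the cover letter (issue every tlbi
first, then wait once with a single dsb) can be modelled in user space.
The following is a minimal sketch, not kernel code: struct tlb_model and
every function name in it are hypothetical, and the counters only stand
in for the hardware operations so that the number of synchronous waits
on each path can be compared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: counters standing in for the hardware operations. */
struct tlb_model {
	unsigned long tlbi_issued;	/* invalidations broadcast */
	unsigned long dsb_waits;	/* synchronous waits for completion */
};

/* Stage 1: issue an invalidation without waiting for it to complete. */
static void model_add_pending(struct tlb_model *m)
{
	assert(m != NULL);
	m->tlbi_issued++;
}

/* Stage 2: a single barrier waits for every invalidation issued so far. */
static void model_flush(struct tlb_model *m)
{
	assert(m != NULL);
	m->dsb_waits++;
}

/* Unbatched path: every page pays for its own tlbi and its own dsb. */
struct tlb_model unmap_sync(size_t nr_pages)
{
	struct tlb_model m = { 0, 0 };

	for (size_t i = 0; i < nr_pages; i++) {
		model_add_pending(&m);
		model_flush(&m);
	}
	return m;
}

/* Batched path: one tlbi per page, one dsb for the whole batch. */
struct tlb_model unmap_batched(size_t nr_pages)
{
	struct tlb_model m = { 0, 0 };

	for (size_t i = 0; i < nr_pages; i++)
		model_add_pending(&m);
	if (nr_pages > 0)
		model_flush(&m);
	return m;
}
```

In this model both paths issue the same number of invalidations; only
the number of waits differs, which is where the series expects the
saving to come from.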
* [PATCH v8 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()
From: Yicong Yang @ 2023-03-29  3:55 UTC
To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will,
    anshuman.khandual, linux-doc
Cc: corbet, peterz, arnd, punit.agrawal, linux-kernel, darren, yangyicong,
    huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips,
    openrisc, linuxppc-dev, linux-riscv, linux-s390, Barry Song,
    wangkefeng.wang, xhao, prime.zeng, Jonathan.Cameron, Anshuman Khandual,
    Barry Song

From: Anshuman Khandual <khandual@linux.vnet.ibm.com>

The entire scheme of deferred TLB flush in the reclaim path rests on the
fact that the cost to refill TLB entries is less than that of flushing
out individual entries by sending IPIs to remote CPUs. But architectures
can have different ways to evaluate that. Hence, apart from checking
TTU_BATCH_FLUSH in the TTU flags, the rest of the decision should be
architecture specific.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khandual@linux.vnet.ibm.com/]
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
[Rebase and fix incorrect return value type]
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
---
 arch/x86/include/asm/tlbflush.h | 12 ++++++++++++
 mm/rmap.c                       |  9 +--------
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cda3118f3b27..8a497d902c16 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -240,6 +240,18 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+	bool should_defer = false;
+
+	/* If remote CPUs need to be flushed then defer batch the flush */
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+		should_defer = true;
+	put_cpu();
+
+	return should_defer;
+}
+
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 {
 	/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 8632e02661ac..38ccb700748c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -686,17 +686,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
  */
 static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 {
-	bool should_defer = false;
-
 	if (!(flags & TTU_BATCH_FLUSH))
 		return false;
 
-	/* If remote CPUs need to be flushed then defer batch the flush */
-	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
-		should_defer = true;
-	put_cpu();
-
-	return should_defer;
+	return arch_tlbbatch_should_defer(mm);
 }
 
 /*
-- 
2.24.0
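
The split introduced by this patch (generic code checks only the TTU
flag, the architecture decides the rest) can be illustrated outside the
kernel. This is a minimal sketch under stated assumptions: the *_model
names are hypothetical, and a plain bitmask stands in for mm_cpumask(),
with bit n set when the mm may have TLB entries on CPU n. The arch hook
shown here follows the x86 policy from the diff above, deferring only
when some CPU other than the current one needs flushing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the TTU_BATCH_FLUSH flag. */
#define TTU_BATCH_FLUSH_MODEL	(1u << 0)

/*
 * Arch hook, x86-like policy: defer only if a CPU other than this_cpu
 * may hold TLB entries for the mm. mm_cpumask is a bitmask model of
 * mm_cpumask(mm).
 */
bool arch_should_defer_model(unsigned long mm_cpumask, unsigned int this_cpu)
{
	assert(this_cpu < 8 * sizeof(mm_cpumask));
	return (mm_cpumask & ~(1UL << this_cpu)) != 0;
}

/* Generic side: check the flag, then hand the decision to the arch hook. */
bool should_defer_flush_model(unsigned long mm_cpumask, unsigned int flags,
			      unsigned int this_cpu)
{
	if (!(flags & TTU_BATCH_FLUSH_MODEL))
		return false;

	return arch_should_defer_model(mm_cpumask, this_cpu);
}
```

An architecture with a different cost model only has to replace the hook;
the generic caller stays unchanged.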
* [PATCH v8 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
From: Yicong Yang @ 2023-03-29  3:55 UTC
To: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will,
    anshuman.khandual, linux-doc
Cc: corbet, peterz, arnd, punit.agrawal, linux-kernel, darren, yangyicong,
    huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips,
    openrisc, linuxppc-dev, linux-riscv, linux-s390, Barry Song,
    wangkefeng.wang, xhao, prime.zeng, Jonathan.Cameron, Barry Song,
    Nadav Amit, Mel Gorman

From: Barry Song <v-songbaohua@oppo.com>

On x86, batched and deferred tlb shootdown has led to a 90%
performance increase on tlb shootdown. On arm64, HW can do
tlb shootdown without software IPI, but sync tlbi is still
quite expensive.

Even running a simple program which requires swapout can
prove this is true,
 #include <sys/types.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <string.h>

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

         memset(p, 0x88, SIZE);

         for (int k = 0; k < 10000; k++) {
                 /* swap in */
                 for (int i = 0; i < SIZE; i += 4096) {
                         (void)p[i];
                 }

                 /* swap out */
                 madvise(p, SIZE, MADV_PAGEOUT);
         }
 }

Perf result on snapdragon 888 with 8 cores by using zRAM
as the swap block device.

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 #
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  .............................................................................
 #
    21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
     6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
     5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
     3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
     3.49%  a.out    [kernel.kallsyms]  [k] memset64
     1.63%  a.out    [kernel.kallsyms]  [k] clear_page
     1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
     1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
     1.23%  a.out    [kernel.kallsyms]  [k] xas_load
     1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
swapping in/out a page mapped by only one process. If the
page is mapped by multiple processes, typically more than
100 on a phone, the overhead would be much higher as we
have to run tlb flush 100 times for one single page.
Plus, tlb flush overhead will increase with the number
of CPU cores due to the bad scalability of tlb shootdown
in HW, so ARM64 servers should expect much higher
overhead.

Further perf annotate shows that 95% of the cpu time of
ptep_clear_flush is actually used by the final dsb() to wait for
the completion of the tlb flush. This gives us a very good chance
to leverage the existing batched tlb flush in the kernel. The
minimum modification is that we only send async tlbi in the first
stage and we send dsb when we have to sync in the second stage.

With the above micro benchmark, the elapsed time to
finish the program decreases by around 5%.

Typical elapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also, Yicong Yang added the following observation.
	Tested with the benchmark in the commit on a Kunpeng920 arm64
	server, observed an improvement of around 12.5% with the command
	`time ./swap_bench`.
		w/o		w/
	real	0m13.460s	0m11.771s
	user	0m0.248s	0m0.279s
	sys	0m12.039s	0m11.458s

	Originally a 16.99% overhead of ptep_clear_flush() was observed,
	which has been eliminated by this patch:

	[root@localhost yang]# perf record -- ./swap_bench && perf report
	[...]
	16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

It is tested on 4,8,128 CPU platforms and shows to be beneficial on
large systems but may not have improvement on small systems like on
a 4 CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depends
on CONFIG_EXPERT for this stage and make this disabled on systems
with less than 8 CPUs. User can modify this threshold according to
their own platforms by CONFIG_NR_CPUS_FOR_BATCHED_TLB.

This patch also improves the performance of page migration. When using
pmbench and migrating its pages between node 0 and node 1 20 times,
this patch decreases the time used by more than 50% and saves the
time spent in ptep_clear_flush().

This patch extends arch_tlbbatch_add_mm() to take the address of the
target page to support the feature on arm64. It also renames it to
arch_tlbbatch_add_pending() to better match its function, since we
don't need to handle the mm on arm64 and add_mm is not a proper name.
add_pending makes sense for both, as on x86 we're deferring the
TLB flush operations while on arm64 we're deferring the synchronize
operations.

Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 .../features/vm/TLB/arch-support.txt |  2 +-
 arch/arm64/Kconfig                   |  6 +++
 arch/arm64/include/asm/tlbbatch.h    | 12 +++++
 arch/arm64/include/asm/tlbflush.h    | 52 ++++++++++++++++++-
 arch/x86/include/asm/tlbflush.h      |  5 +-
 include/linux/mm_types_task.h        |  4 +-
 mm/rmap.c                            | 12 +++--
 7 files changed, 81 insertions(+), 12 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
index 7f049c251a79..76208db88f3b 100644
--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
     |       alpha: | TODO |
     |         arc: | TODO |
     |         arm: | TODO |
-    |       arm64: | N/A  |
+    |       arm64: |  ok  |
     |        csky: | TODO |
     |     hexagon: | TODO |
     |        ia64: | TODO |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1023e896d46b..93b5f5f989a1 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -95,6 +95,7 @@ config ARM64
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
+	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if EXPERT
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
 	select ARCH_WANT_DEFAULT_BPF_JIT
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
@@ -275,6 +276,11 @@ config ARM64_CONT_PMD_SHIFT
 	default 5 if ARM64_16K_PAGES
 	default 4
 
+config ARM64_NR_CPUS_FOR_BATCHED_TLB
+	int "Threshold to enable batched TLB flush"
+	default 8
+	depends on ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+
 config ARCH_MMAP_RND_BITS_MIN
 	default 14 if ARM64_64K_PAGES
 	default 16 if ARM64_16K_PAGES
diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
new file mode 100644
index 000000000000..fedb0b87b8db
--- /dev/null
+++ b/arch/arm64/include/asm/tlbbatch.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ARCH_ARM64_TLBBATCH_H
+#define _ARCH_ARM64_TLBBATCH_H
+
+struct arch_tlbflush_unmap_batch {
+	/*
+	 * For arm64, HW can do tlb shootdown, so we don't
+	 * need to record cpumask for sending IPI
+	 */
+};
+
+#endif /* _ARCH_ARM64_TLBBATCH_H */
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 412a3b9a3c25..41a763cf8c1b 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -254,17 +254,23 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	dsb(ish);
 }
 
-static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
 					 unsigned long uaddr)
 {
 	unsigned long addr;
 
 	dsb(ishst);
-	addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
+	addr = __TLBI_VADDR(uaddr, ASID(mm));
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
 }
 
+static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+					 unsigned long uaddr)
+{
+	return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
+}
+
 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {
@@ -272,6 +278,48 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 	dsb(ish);
 }
 
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+	/*
+	 * TLB batched flush is proved to be beneficial for systems with large
+	 * number of CPUs, especially system with more than 8 CPUs. TLB shutdown
+	 * is cheap on small systems which may not need this feature. So use
+	 * a threshold for enabling this to avoid potential side effects on
+	 * these platforms.
+	 */
+	if (num_online_cpus() < CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB)
+		return false;
+
+	/*
+	 * TLB flush deferral is not required on systems, which are affected with
+	 * ARM64_WORKAROUND_REPEAT_TLBI, as __tlbi()/__tlbi_user() implementation
+	 * will have two consecutive TLBI instructions with a dsb(ish) in between
+	 * defeating the purpose (i.e save overall 'dsb ish' cost).
+	 */
+#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
+	if (unlikely(cpus_have_const_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
+		return false;
+#endif
+
+	return true;
+}
+
+static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+					     struct mm_struct *mm,
+					     unsigned long uaddr)
+{
+	__flush_tlb_page_nosync(mm, uaddr);
+}
+
+static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+{
+	dsb(ish);
+}
+
+#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
+
 /*
  * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
  * necessarily a performance improvement.
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8a497d902c16..15cada9635c1 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -263,8 +263,9 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 	return atomic64_inc_return(&mm->context.tlb_gen);
 }
 
-static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
-					struct mm_struct *mm)
+static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+					     struct mm_struct *mm,
+					     unsigned long uaddr)
 {
 	inc_mm_tlb_gen(mm);
 	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index 5414b5c6a103..aa44fff8bb9d 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -52,8 +52,8 @@ struct tlbflush_unmap_batch {
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	/*
 	 * The arch code makes the following promise: generic code can modify a
-	 * PTE, then call arch_tlbbatch_add_mm() (which internally provides all
-	 * needed barriers), then call arch_tlbbatch_flush(), and the entries
+	 * PTE, then call arch_tlbbatch_add_pending() (which internally provides
+	 * all needed barriers), then call arch_tlbbatch_flush(), and the entries
 	 * will be flushed on all CPUs by the time that arch_tlbbatch_flush()
 	 * returns.
 	 */
diff --git a/mm/rmap.c b/mm/rmap.c
index 38ccb700748c..a4e2c16a1a72 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -641,12 +641,13 @@ void try_to_unmap_flush_dirty(void)
 #define TLB_FLUSH_BATCH_PENDING_LARGE			\
 	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
 
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+				      unsigned long uaddr)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 	int batch, nbatch;
 
-	arch_tlbbatch_add_mm(&tlb_ubc->arch, mm);
+	arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
 	tlb_ubc->flush_required = true;
 
 	/*
@@ -724,7 +725,8 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
 	}
 }
 #else
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable,
+				      unsigned long uaddr)
 {
 }
 
@@ -1575,7 +1577,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), address);
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
@@ -1956,7 +1958,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval), address);
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
-- 
2.24.0
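
The arm64 deferral policy added in arch_tlbbatch_should_defer() reduces
to two checks: a CPU-count threshold and the REPEAT_TLBI erratum. A
user-space sketch of just that decision, assuming the default threshold
of 8 from the Kconfig hunk above, could look as follows. The function
and macro names are hypothetical, and the two parameters stand in for
num_online_cpus() and cpus_have_const_cap(ARM64_WORKAROUND_REPEAT_TLBI).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: mirrors the Kconfig default of 8 in the patch above. */
#define MODEL_NR_CPUS_FOR_BATCHED_TLB	8

/*
 * Hypothetical model of the arm64 decision. online_cpus stands in for
 * num_online_cpus(); has_repeat_tlbi_erratum stands in for the
 * ARM64_WORKAROUND_REPEAT_TLBI capability check.
 */
bool arm64_should_defer_model(unsigned int online_cpus,
			      bool has_repeat_tlbi_erratum)
{
	assert(online_cpus > 0);

	/* Small systems: the patch keeps the synchronous flush. */
	if (online_cpus < MODEL_NR_CPUS_FOR_BATCHED_TLB)
		return false;

	/*
	 * With the erratum workaround each tlbi already carries its own
	 * dsb, so deferring the final dsb would save nothing.
	 */
	if (has_repeat_tlbi_erratum)
		return false;

	return true;
}
```

With these inputs, the 4-CPU platform mentioned in the commit message
stays on the synchronous path, while the 8-core and 128-CPU systems
defer unless the erratum workaround is active.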
* Re: [PATCH v8 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
From: Punit Agrawal @ 2023-03-30 13:15 UTC
To: Yicong Yang
Cc: akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will,
    anshuman.khandual, linux-doc, corbet, peterz, arnd, punit.agrawal,
    linux-kernel, darren, yangyicong, huzhanyuan, lipeifeng, zhangshiming,
    guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv,
    linux-s390, Barry Song, wangkefeng.wang, xhao, prime.zeng,
    Jonathan.Cameron, Barry Song, Nadav Amit, Mel Gorman

Hi Yicong,

Yicong Yang <yangyicong@huawei.com> writes:

> From: Barry Song <v-songbaohua@oppo.com>

[...]

> It is tested on 4,8,128 CPU platforms and shows to be beneficial on
> large systems but may not have improvement on small systems like on
> a 4 CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depends
> on CONFIG_EXPERT for this stage and make this disabled on systems
> with less than 8 CPUs. User can modify this threshold according to
> their own platforms by CONFIG_NR_CPUS_FOR_BATCHED_TLB.

The commit log and the patch disagree on the name of the config option
(CONFIG_NR_CPUS_FOR_BATCHED_TLB vs CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB).

But more importantly, I was wondering why this posting doesn't address
Catalin's feedback [a] about using a runtime tunable. Maybe I missed the
follow-up discussion.

Thanks,
Punit

[a] https://lore.kernel.org/linux-mm/Y7xMhPTAwcUT4O6b@arm.com/

[...]
* Re: [PATCH v8 2/2] arm64: support batched/deferred tlb shootdown during page reclamation 2023-03-30 13:15 ` Punit Agrawal @ 2023-03-30 13:45 ` Yicong Yang 2023-04-01 12:12 ` Yicong Yang 0 siblings, 1 reply; 6+ messages in thread From: Yicong Yang @ 2023-03-30 13:45 UTC (permalink / raw) To: Punit Agrawal Cc: yangyicong, akpm, linux-mm, linux-arm-kernel, x86, catalin.marinas, will, anshuman.khandual, linux-doc, corbet, peterz, arnd, linux-kernel, darren, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6, linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390, Barry Song, wangkefeng.wang, xhao, prime.zeng, Jonathan.Cameron, Barry Song, Nadav Amit, Mel Gorman Hi Punit, On 2023/3/30 21:15, Punit Agrawal wrote: > Hi Yicong, > > Yicong Yang <yangyicong@huawei.com> writes: > >> From: Barry Song <v-songbaohua@oppo.com> >> >> on x86, batched and deferred tlb shootdown has lead to 90% >> performance increase on tlb shootdown. on arm64, HW can do >> tlb shootdown without software IPI. But sync tlbi is still >> quite expensive. >> >> Even running a simplest program which requires swapout can >> prove this is true, >> #include <sys/types.h> >> #include <unistd.h> >> #include <sys/mman.h> >> #include <string.h> >> >> int main() >> { >> #define SIZE (1 * 1024 * 1024) >> volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, >> MAP_SHARED | MAP_ANONYMOUS, -1, 0); >> >> memset(p, 0x88, SIZE); >> >> for (int k = 0; k < 10000; k++) { >> /* swap in */ >> for (int i = 0; i < SIZE; i += 4096) { >> (void)p[i]; >> } >> >> /* swap out */ >> madvise(p, SIZE, MADV_PAGEOUT); >> } >> } >> >> Perf result on snapdragon 888 with 8 cores by using zRAM >> as the swap block device. >> >> ~ # perf record taskset -c 4 ./a.out >> [ perf record: Woken up 10 times to write data ] >> [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ] >> ~ # perf report >> # To display the perf.data header info, please use --header/--header-only options. 
>> # To display the perf.data header info, please use --header/--header-only options. >> # >> # >> # Total Lost Samples: 0 >> # >> # Samples: 60K of event 'cycles' >> # Event count (approx.): 35706225414 >> # >> # Overhead Command Shared Object Symbol >> # ........ ....... ................. ............................................................................. >> # >> 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq >> 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore >> 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages >> 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write >> 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush >> 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock >> 3.49% a.out [kernel.kallsyms] [k] memset64 >> 1.63% a.out [kernel.kallsyms] [k] clear_page >> 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock >> 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930 >> 1.23% a.out [kernel.kallsyms] [k] xas_load >> 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock >> >> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark >> swapping in/out a page mapped by only one process. If the >> page is mapped by multiple processes, typically, like more >> than 100 on a phone, the overhead would be much higher as >> we have to run tlb flush 100 times for one single page. >> Plus, tlb flush overhead will increase with the number >> of CPU cores due to the bad scalability of tlb shootdown >> in HW, so those ARM64 servers should expect much higher >> overhead. >> >> Further perf annonate shows 95% cpu time of ptep_clear_flush >> is actually used by the final dsb() to wait for the completion >> of tlb flush. This provides us a very good chance to leverage >> the existing batched tlb in kernel. The minimum modification >> is that we only send async tlbi in the first stage and we send >> dsb while we have to sync in the second stage. 
>>
>> With the above simplest micro benchmark, elapsed time to
>> finish the program decreases around 5%.
>>
>> Typical elapsed time w/o patch:
>> ~ # time taskset -c 4 ./a.out
>> 0.21user 14.34system 0:14.69elapsed
>> w/ patch:
>> ~ # time taskset -c 4 ./a.out
>> 0.22user 13.45system 0:13.80elapsed
>>
>> Also, Yicong Yang added the following observation.
>>     Tested with benchmark in the commit on Kunpeng920 arm64 server,
>>     observed an improvement around 12.5% with command
>>     `time ./swap_bench`.
>>             w/o             w/
>>     real    0m13.460s       0m11.771s
>>     user    0m0.248s        0m0.279s
>>     sys     0m12.039s       0m11.458s
>>
>>     Originally it's noticed a 16.99% overhead of ptep_clear_flush()
>>     which has been eliminated by this patch:
>>
>>     [root@localhost yang]# perf record -- ./swap_bench && perf report
>>     [...]
>>     16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush
>>
>> It is tested on 4,8,128 CPU platforms and shows to be beneficial on
>> large systems but may not have improvement on small systems like on
>> a 4 CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depends
>> on CONFIG_EXPERT for this stage and make this disabled on systems
>> with less than 8 CPUs. User can modify this threshold according to
>> their own platforms by CONFIG_NR_CPUS_FOR_BATCHED_TLB.
>
> The commit log and the patch disagree on the name of the config option
> (CONFIG_NR_CPUS_FOR_BATCHED_TLB vs CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB).
>

ah yes, it's a typo and I'll fix it.

> But more importantly, I was wondering why this posting doesn't address
> Catalin's feedback [a] about using a runtime tunable. Maybe I missed the
> follow-up discussion.
>

I must have missed that, terribly sorry for it... Thanks for pointing it out!
Let me try to implement a version using a runtime tunable and get back with
some test results.

Thanks,
Yicong

> Thanks,
> Punit
>
> [a] https://lore.kernel.org/linux-mm/Y7xMhPTAwcUT4O6b@arm.com/
>
>> Also this patch improves the performance of page migration.
>> Using pmbench and trying to migrate the pages of pmbench between node 0
>> and node 1 20 times, this patch decreases the time used by more than 50%
>> and saves the time used by ptep_clear_flush().
>>
>> This patch extends arch_tlbbatch_add_mm() to take an address of the
>> target page to support the feature on arm64. Also rename it to
>> arch_tlbbatch_add_pending() to better match its function since we
>> don't need to handle the mm on arm64 and add_mm is not proper.
>> add_pending will make sense to both as on x86 we're pending the
>> TLB flush operations while on arm64 we're pending the synchronize
>> operations.
>>
>> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Nadav Amit <namit@vmware.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Tested-by: Yicong Yang <yangyicong@hisilicon.com>
>> Tested-by: Xin Hao <xhao@linux.alibaba.com>
>> Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
>> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
>> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> ---
>>  .../features/vm/TLB/arch-support.txt |  2 +-
>>  arch/arm64/Kconfig                   |  6 +++
>>  arch/arm64/include/asm/tlbbatch.h    | 12 +++++
>>  arch/arm64/include/asm/tlbflush.h    | 52 ++++++++++++++++++-
>>  arch/x86/include/asm/tlbflush.h      |  5 +-
>>  include/linux/mm_types_task.h        |  4 +-
>>  mm/rmap.c                            | 12 +++--
>>  7 files changed, 81 insertions(+), 12 deletions(-)
>>  create mode 100644 arch/arm64/include/asm/tlbbatch.h
>
>
> [...]
>
> .
>
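As a quick cross-check of the percentages quoted in the commit message above, the relative savings can be recomputed from the `time` output. This is only a sketch: percent_saved() is a hypothetical helper name used for illustration and is not part of the patch.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the patch): percentage by which
 * `after` is smaller than `before`.
 */
double percent_saved(double before, double after)
{
	/* A non-positive baseline would make the ratio meaningless. */
	assert(before > 0.0);

	return (before - after) / before * 100.0;
}
```

With the figures from the commit message, percent_saved(14.69, 13.80) gives roughly 6.1 for the elapsed time on snapdragon 888 (in line with the "around 5%" stated there), and percent_saved(13.460, 11.771) gives roughly 12.5 for the real time on Kunpeng920.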
* Re: [PATCH v8 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
  2023-03-30 13:45     ` Yicong Yang
@ 2023-04-01 12:12       ` Yicong Yang
  0 siblings, 0 replies; 6+ messages in thread
From: Yicong Yang @ 2023-04-01 12:12 UTC (permalink / raw)
  To: Punit Agrawal, catalin.marinas
  Cc: yangyicong, akpm, linux-mm, linux-arm-kernel, x86, will,
	anshuman.khandual, linux-doc, corbet, peterz, arnd, linux-kernel,
	darren, huzhanyuan, lipeifeng, zhangshiming, guojian, realmz6,
	linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390,
	Barry Song, wangkefeng.wang, xhao, prime.zeng, Jonathan.Cameron,
	Barry Song, Nadav Amit, Mel Gorman

On 2023/3/30 21:45, Yicong Yang wrote:
> Hi Punit,
>
> On 2023/3/30 21:15, Punit Agrawal wrote:
>> Hi Yicong,
>>
>> Yicong Yang <yangyicong@huawei.com> writes:
>>
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> on x86, batched and deferred tlb shootdown has led to 90%
>>> performance increase on tlb shootdown. on arm64, HW can do
>>> tlb shootdown without software IPI. But sync tlbi is still
>>> quite expensive.
>>>
>>> Even running a simplest program which requires swapout can
>>> prove this is true,
>>> #include <sys/types.h>
>>> #include <unistd.h>
>>> #include <sys/mman.h>
>>> #include <string.h>
>>>
>>> int main()
>>> {
>>> #define SIZE (1 * 1024 * 1024)
>>>         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>>>                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>>>
>>>         memset(p, 0x88, SIZE);
>>>
>>>         for (int k = 0; k < 10000; k++) {
>>>                 /* swap in */
>>>                 for (int i = 0; i < SIZE; i += 4096) {
>>>                         (void)p[i];
>>>                 }
>>>
>>>                 /* swap out */
>>>                 madvise(p, SIZE, MADV_PAGEOUT);
>>>         }
>>> }
>>>
>>> Perf result on snapdragon 888 with 8 cores by using zRAM
>>> as the swap block device.
>>>
>>> ~ # perf record taskset -c 4 ./a.out
>>> [ perf record: Woken up 10 times to write data ]
>>> [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
>>> ~ # perf report
>>> # To display the perf.data header info, please use --header/--header-only options.
>>> # To display the perf.data header info, please use --header/--header-only options.
>>> #
>>> #
>>> # Total Lost Samples: 0
>>> #
>>> # Samples: 60K of event 'cycles'
>>> # Event count (approx.): 35706225414
>>> #
>>> # Overhead  Command  Shared Object      Symbol
>>> # ........  .......  .................  .............................................................................
>>> #
>>>     21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
>>>      8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>>>      6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
>>>      6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
>>>      5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
>>>      3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
>>>      3.49%  a.out    [kernel.kallsyms]  [k] memset64
>>>      1.63%  a.out    [kernel.kallsyms]  [k] clear_page
>>>      1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
>>>      1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
>>>      1.23%  a.out    [kernel.kallsyms]  [k] xas_load
>>>      1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock
>>>
>>> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
>>> swapping in/out a page mapped by only one process. If the
>>> page is mapped by multiple processes, typically, like more
>>> than 100 on a phone, the overhead would be much higher as
>>> we have to run tlb flush 100 times for one single page.
>>> Plus, tlb flush overhead will increase with the number
>>> of CPU cores due to the bad scalability of tlb shootdown
>>> in HW, so those ARM64 servers should expect much higher
>>> overhead.
>>>
>>> Further perf annotate shows 95% cpu time of ptep_clear_flush
>>> is actually used by the final dsb() to wait for the completion
>>> of tlb flush. This provides us a very good chance to leverage
>>> the existing batched tlb in kernel. The minimum modification
>>> is that we only send async tlbi in the first stage and we send
>>> dsb while we have to sync in the second stage.
>>>
>>> With the above simplest micro benchmark, elapsed time to
>>> finish the program decreases around 5%.
>>>
>>> Typical elapsed time w/o patch:
>>> ~ # time taskset -c 4 ./a.out
>>> 0.21user 14.34system 0:14.69elapsed
>>> w/ patch:
>>> ~ # time taskset -c 4 ./a.out
>>> 0.22user 13.45system 0:13.80elapsed
>>>
>>> Also, Yicong Yang added the following observation.
>>>     Tested with benchmark in the commit on Kunpeng920 arm64 server,
>>>     observed an improvement around 12.5% with command
>>>     `time ./swap_bench`.
>>>             w/o             w/
>>>     real    0m13.460s       0m11.771s
>>>     user    0m0.248s        0m0.279s
>>>     sys     0m12.039s       0m11.458s
>>>
>>>     Originally it's noticed a 16.99% overhead of ptep_clear_flush()
>>>     which has been eliminated by this patch:
>>>
>>>     [root@localhost yang]# perf record -- ./swap_bench && perf report
>>>     [...]
>>>     16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush
>>>
>>> It is tested on 4,8,128 CPU platforms and shows to be beneficial on
>>> large systems but may not have improvement on small systems like on
>>> a 4 CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depends
>>> on CONFIG_EXPERT for this stage and make this disabled on systems
>>> with less than 8 CPUs. User can modify this threshold according to
>>> their own platforms by CONFIG_NR_CPUS_FOR_BATCHED_TLB.
>>
>> The commit log and the patch disagree on the name of the config option
>> (CONFIG_NR_CPUS_FOR_BATCHED_TLB vs CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB).
>>
>
> ah yes, it's a typo and I'll fix it.
>
>> But more importantly, I was wondering why this posting doesn't address
>> Catalin's feedback [a] about using a runtime tunable. Maybe I missed the
>> follow-up discussion.
>>
>

So I used the patch below, on top of this one, to provide a knob,
/proc/sys/vm/batched_tlb_enabled, for turning the batched TLB flush on and off.
I'm wondering whether flush.c is the best place to put this. Any comments?

Thanks.

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 41a763cf8c1b..2b2c69c23b47 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -280,6 +280,8 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 
+extern struct static_key_false batched_tlb_enabled;
+
 static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
 {
 	/*
@@ -289,7 +291,7 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
 	 * a threshold for enabling this to avoid potential side effects on
 	 * these platforms.
 	 */
-	if (num_online_cpus() < CONFIG_ARM64_NR_CPUS_FOR_BATCHED_TLB)
+	if (!static_branch_unlikely(&batched_tlb_enabled))
 		return false;
 
 	/*
diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 5f9379b3c8c8..ce3bc32523f7 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -7,8 +7,10 @@
  */
 
 #include <linux/export.h>
+#include <linux/jump_label.h>
 #include <linux/mm.h>
 #include <linux/pagemap.h>
+#include <linux/sysctl.h>
 
 #include <asm/cacheflush.h>
 #include <asm/cache.h>
@@ -107,3 +109,53 @@ void arch_invalidate_pmem(void *addr, size_t size)
 }
 EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
 #endif
+
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+
+DEFINE_STATIC_KEY_FALSE(batched_tlb_enabled);
+
+int batched_tlb_enabled_handler(struct ctl_table *table, int write,
+				 void *buffer, size_t *lenp, loff_t *ppos)
+{
+	unsigned int enabled = static_branch_unlikely(&batched_tlb_enabled);
+	struct ctl_table t;
+	int err;
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &enabled;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (!err && write) {
+		if (enabled)
+			static_branch_enable(&batched_tlb_enabled);
+		else
+			static_branch_disable(&batched_tlb_enabled);
+	}
+
+	return err;
+}
+
+static struct ctl_table batched_tlb_sysctls[] = {
+	{
+		.procname	= "batched_tlb_enabled",
+		.data		= NULL,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= batched_tlb_enabled_handler,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
+	{}
+};
+
+static int __init batched_tlb_sysctls_init(void)
+{
+	register_sysctl_init("vm", batched_tlb_sysctls);
+
+	return 0;
+}
+late_initcall(batched_tlb_sysctls_init);
+
+#endif
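For trying the knob from userspace, something like the following might do. It is a minimal sketch only: it assumes the proposed (not merged) /proc/sys/vm/batched_tlb_enabled file from the diff above accepts "0" or "1", and the helper names set_batched_tlb() and get_batched_tlb() are made up for illustration. The path is passed in so the helpers can be exercised against an ordinary file as well.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): write 0 or 1 to a
 * sysctl-style file such as the proposed
 * /proc/sys/vm/batched_tlb_enabled.
 * Returns 0 on success, -1 on failure.
 */
int set_batched_tlb(const char *path, int enable)
{
	FILE *f;
	int ok;

	assert(path != NULL);

	f = fopen(path, "w");
	if (!f)
		return -1;

	/* Any non-zero value is normalised to 1. */
	ok = fprintf(f, "%d\n", enable ? 1 : 0) > 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}

/*
 * Hypothetical helper: read the current value back.
 * Returns the integer found in the file, or -1 on failure.
 */
int get_batched_tlb(const char *path)
{
	FILE *f;
	int val = -1;

	assert(path != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);

	return val;
}
```

On a kernel carrying the diff, set_batched_tlb("/proc/sys/vm/batched_tlb_enabled", 1) would be expected to need CAP_SYS_ADMIN, since the handler returns -EPERM for unprivileged writes.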