Date: Thu, 17 Aug 2017 11:21:50 +0200
From: Ingo Molnar
Subject: Re: [PATCHv4 14/14] x86/mm: Offset boot-time paging mode switching cost
Message-ID: <20170817092150.imggg3pjguagvudd@gmail.com>
References: <20170808125415.78842-1-kirill.shutemov@linux.intel.com>
 <20170808125415.78842-15-kirill.shutemov@linux.intel.com>
In-Reply-To: <20170808125415.78842-15-kirill.shutemov@linux.intel.com>
To: "Kirill A. Shutemov"
Cc: Linus Torvalds, Andrew Morton, x86@kernel.org, Thomas Gleixner,
 Ingo Molnar, "H. Peter Anvin", Andi Kleen, Dave Hansen, Andy Lutomirski,
 Michal Hocko, linux-mm@kvack.org, linux-kernel@vger.kernel.org

* Kirill A. Shutemov wrote:

> By this point we have functioning boot-time switching between 4- and
> 5-level paging modes. But the naive approach comes with a cost.
>
> The numbers below are for a kernel build, allmodconfig, 5 runs.
>
> CONFIG_X86_5LEVEL=n:
>
>  Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):
>
>     17308719.892691      task-clock:u (msec)       #   26.772 CPUs utilized            ( +- 0.11% )
>                   0      context-switches:u        #    0.000 K/sec
>                   0      cpu-migrations:u          #    0.000 K/sec
>         331,993,164      page-faults:u             #    0.019 M/sec                    ( +- 0.01% )
>  43,614,978,867,455      cycles:u                  #    2.520 GHz                      ( +- 0.01% )
>  39,371,534,575,126      stalled-cycles-frontend:u #   90.27% frontend cycles idle     ( +- 0.09% )
>  28,363,350,152,428      instructions:u            #    0.65  insn per cycle
>                                                    #    1.39  stalled cycles per insn  ( +- 0.00% )
>   6,316,784,066,413      branches:u                #  364.948 M/sec                    ( +- 0.00% )
>     250,808,144,781      branch-misses:u           #    3.97% of all branches          ( +- 0.01% )
>
>       646.531974142 seconds time elapsed                                               ( +- 1.15% )
>
> CONFIG_X86_5LEVEL=y:
>
>  Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):
>
>     17411536.780625      task-clock:u (msec)       #   26.426 CPUs utilized            ( +- 0.10% )
>                   0      context-switches:u        #    0.000 K/sec
>                   0      cpu-migrations:u          #    0.000 K/sec
>         331,868,663      page-faults:u             #    0.019 M/sec                    ( +- 0.01% )
>  43,865,909,056,301      cycles:u                  #    2.519 GHz                      ( +- 0.01% )
>  39,740,130,365,581      stalled-cycles-frontend:u #   90.59% frontend cycles idle     ( +- 0.05% )
>  28,363,358,997,959      instructions:u            #    0.65  insn per cycle
>                                                    #    1.40  stalled cycles per insn  ( +- 0.00% )
>   6,316,784,937,460      branches:u                #  362.793 M/sec                    ( +- 0.00% )
>     251,531,919,485      branch-misses:u           #    3.98% of all branches          ( +- 0.00% )
>
>       658.886307752 seconds time elapsed                                               ( +- 0.92% )
>
> The patch tries to fix the performance regression by using
> !cpu_feature_enabled(X86_FEATURE_LA57) instead of p4d_folded in all hot
> code paths. These checks will statically patch the target code for
> additional performance.
>
> Also, I had to rewrite a number of static inline helpers as macros; this
> was needed to break a header dependency loop between cpufeature.h and
> pgtable_types.h.
>
> CONFIG_X86_5LEVEL=y + the patch:
>
>  Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):
>
>     17381990.268506      task-clock:u (msec)       #   26.907 CPUs utilized            ( +- 0.19% )
>                   0      context-switches:u        #    0.000 K/sec
>                   0      cpu-migrations:u          #    0.000 K/sec
>         331,862,625      page-faults:u             #    0.019 M/sec                    ( +- 0.01% )
>  43,697,726,320,051      cycles:u                  #    2.514 GHz                      ( +- 0.03% )
>  39,480,408,690,401      stalled-cycles-frontend:u #   90.35% frontend cycles idle     ( +- 0.05% )
>  28,363,394,221,388      instructions:u            #    0.65  insn per cycle
>                                                    #    1.39  stalled cycles per insn  ( +- 0.00% )
>   6,316,794,985,573      branches:u                #  363.410 M/sec                    ( +- 0.00% )
>     251,013,232,547      branch-misses:u           #    3.97% of all branches          ( +- 0.01% )
>
>       645.991174661 seconds time elapsed                                               ( +- 1.19% )

Ok - these measurements are very nice and address many of my worries about
earlier parts of the series: the naive version's ~1.9% elapsed-time
regression is gone, and the patched kernel is back within measurement noise
of the 4-level baseline.

Anyway, please split this patch up some more (as any of the optimizations
could regress by themselves), and my earlier renaming suggestions still
stand as well.

> @@ -11,6 +11,11 @@
>  #undef CONFIG_PARAVIRT_SPINLOCKS
>  #undef CONFIG_KASAN
>
> +#ifdef CONFIG_X86_5LEVEL
> +/* cpu_feature_enabled() cannot be used that early */
> +#define p4d_folded __p4d_folded
> +#endif
> +
>  #include
>  #include
>  #include
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 077e8b45784c..702a1feb4991 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -274,15 +274,8 @@ return_from_SYSCALL_64:
>  	 * depending on paging mode) in the address.
>  	 */
>  #ifdef CONFIG_X86_5LEVEL
> -	testl	$1, p4d_folded(%rip)
> -	jnz	1f
> -	shl	$(64 - 57), %rcx
> -	sar	$(64 - 57), %rcx
> -	jmp	2f
> -1:
> -	shl	$(64 - 48), %rcx
> -	sar	$(64 - 48), %rcx
> -2:
> +	ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
> +		"shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57

Ignore my earlier suggestion to use alternatives - you already implemented
it! This is what I get for replying to a patch series in chronological
order. ;-)

I suspect the syscall overhead was the main reason for the performance
regression.
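As an aside, both shl/sar variants above compute the same thing: they
sign-extend the return address from the paging mode's top virtual-address
bit (bit 47 for 4-level, bit 56 for 5-level paging). A rough user-space C
sketch of the operation - the helper name is made up for illustration, and
it relies on gcc/clang's arithmetic right shift of signed values:

	/*
	 * Illustrative sketch only, not kernel code: canonicalize an
	 * address by sign-extending from the top VA bit. va_bits is
	 * 48 for 4-level paging and 57 for 5-level paging.
	 */
	static inline unsigned long canonical_address(unsigned long addr,
						      unsigned int va_bits)
	{
		unsigned int shift = 64 - va_bits;

		/* shl then sar: the arithmetic right shift sign-extends */
		return (unsigned long)(((long)(addr << shift)) >> shift);
	}

With va_bits == 48 this matches the first (LA57-disabled) alternative, with
va_bits == 57 the second one.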
Thanks,

	Ingo
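P.S.: At the C level, the statically patched test the changelog describes
presumably boils down to something like the sketch below. This is not the
actual patch - the pgd_none() body is just an example, in kernel context, of
how such a check folds away:

	/*
	 * Sketch of the idea only: cpu_feature_enabled() uses a static
	 * branch that is patched once during boot, so hot page-table
	 * walks get a straight-line jump instead of a memory load from
	 * a p4d_folded variable plus a conditional branch.
	 */
	#define p4d_folded	(!cpu_feature_enabled(X86_FEATURE_LA57))

	static inline int pgd_none(pgd_t pgd)
	{
		/* Patched to a plain jmp at boot; with 4-level paging
		 * the p4d level is folded, so a pgd is never 'none'. */
		if (p4d_folded)
			return 0;
		return !native_pgd_val(pgd);
	}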