From: Mark Hemment <markhemm@googlemail.com>
Date: Wed, 25 May 2022 13:11:17 +0100
Subject: Re: [PATCH] x86/clear_user: Make it faster
To: Borislav Petkov
Cc: Linus Torvalds, Andrew Morton, the arch/x86 maintainers, Peter Zijlstra,
 Patrice CHOTARD, Mikulas Patocka, Lukas Czerner, Christoph Hellwig,
 "Darrick J. Wong", Chuck Lever, Hugh Dickins, patches@lists.linux.dev,
 Linux-MM, mm-commits@vger.kernel.org, Mel Gorman

On Tue, 24 May 2022 at 13:32, Borislav Petkov wrote:
>
> Ok,
>
> finally a somewhat final version, lightly tested.
>
> I still need to run it on production Icelake and that is kinda being
> delayed due to server room cooling issues (don't ask ;-\).
>
> ---
> From: Borislav Petkov
>
> Based on a patch by Mark Hemment and
> incorporating very sane suggestions from Linus.
>
> The point here is to have the default case with FSRM - which is supposed
> to be the majority of x86 hw out there - if not now then soon - be
> directly inlined into the instruction stream so that no function call
> overhead is taking place.
>
> The benchmarks I ran would show very small improvements and a PF
> benchmark would even show weird things like slowdowns with higher core
> counts.
>
> So for a ~6m running the git test suite, the function gets called under
> 700K times, all from padzero():
>
>   <...>-2536  [006] .....  261.208801: padzero: to: 0x55b0663ed214, size: 3564, cycles: 21900
>   <...>-2536  [006] .....  261.208819: padzero: to: 0x7f061adca078, size: 3976, cycles: 17160
>   <...>-2537  [008] .....  261.211027: padzero: to: 0x5572d019e240, size: 3520, cycles: 23850
>   <...>-2537  [008] .....  261.211049: padzero: to: 0x7f1288dc9078, size: 3976, cycles: 15900
> ...
>
> which is around 1%-ish of the total time and which is consistent with
> the benchmark numbers.
>
> So Mel gave me the idea to simply measure how fast the function becomes.
> I.e.:
>
>         start = rdtsc_ordered();
>         ret = __clear_user(to, n);
>         end = rdtsc_ordered();
>
> Computing the mean average of all the samples collected during the test
> suite run then shows some improvement:
>
>   clear_user_original:
>   Amean: 9219.71 (Sum: 6340154910, samples: 687674)
>
>   fsrm:
>   Amean: 8030.63 (Sum: 5522277720, samples: 687652)
>
> That's on Zen3.
>
> Signed-off-by: Borislav Petkov
> Link: https://lore.kernel.org/r/CAHk-=wh=Mu_EYhtOmPn6AxoQZyEh-4fo2Zx3G7rBv1g7vwoKiw@mail.gmail.com
> ---
>  arch/x86/include/asm/uaccess.h    |   5 +-
>  arch/x86/include/asm/uaccess_64.h |  45 ++++++++++
>  arch/x86/lib/clear_page_64.S      | 142 ++++++++++++++++++++++++++++++
>  arch/x86/lib/usercopy_64.c        |  40 ---------
>  tools/objtool/check.c             |   3 +
>  5 files changed, 192 insertions(+), 43 deletions(-)

[...snip...]

> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> index fe59b8ac4fcc..6dd6c7fd1eb8 100644
...
> +/*
> + * Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
> + * present.
> + * Input:
> + * rdi destination
> + * rcx count
> + *
> + * Output:
> + * rcx: uncleared bytes or 0 if successful.
> + */
> +SYM_FUNC_START(clear_user_rep_good)
> +	# call the original thing for less than a cacheline
> +	cmp	$64, %rcx
> +	ja	.Lprep
> +	call	clear_user_original
> +	RET
> +
> +.Lprep:
> +	# copy lower 32-bits for rest bytes
> +	mov	%ecx, %edx
> +	shr	$3, %rcx
> +	jz	.Lrep_good_rest_bytes

A slight doubt here: the comment says "less than a cacheline", but the code
uses 'ja' (jump if above), so 'clear_user_original' is called for a 'len'
less than or equal to 64.  I am not sure of the intended behaviour for
exactly 64 bytes, but 'copy_user_enhanced_fast_string' uses the slow method
only for lengths below 64.  So, should this be coded as:

	cmp	$64, %rcx
	jb	clear_user_original

?  (A minimal C sketch of the two behaviours is in the P.S. at the end of
this mail.)

'clear_user_erms' has similar logic, which might also need reworking.

> +
> +.Lrep_good_qwords:
> +	rep stosq
> +
> +.Lrep_good_rest_bytes:
> +	and	$7, %edx
> +	jz	.Lrep_good_exit
> +
> +.Lrep_good_bytes:
> +	mov	%edx, %ecx
> +	rep stosb
> +
> +.Lrep_good_exit:
> +	# see .Lexit comment above
> +	xor	%eax, %eax
> +	RET
> +
> +.Lrep_good_qwords_exception:
> +	# convert remaining qwords back into bytes to return to caller
> +	shl	$3, %rcx
> +	and	$7, %edx
> +	add	%rdx, %rcx
> +	jmp	.Lrep_good_exit
> +
> +	_ASM_EXTABLE_UA(.Lrep_good_qwords, .Lrep_good_qwords_exception)
> +	_ASM_EXTABLE_UA(.Lrep_good_bytes, .Lrep_good_exit)
> +SYM_FUNC_END(clear_user_rep_good)
> +EXPORT_SYMBOL(clear_user_rep_good)
> +
> +/*
> + * Alternative clear user-space when CPU feature X86_FEATURE_ERMS is present.
> + * Input:
> + * rdi destination
> + * rcx count
> + *
> + * Output:
> + * rcx: uncleared bytes or 0 if successful.
> + *
> + */
> +SYM_FUNC_START(clear_user_erms)
> +	# call the original thing for less than a cacheline
> +	cmp	$64, %rcx
> +	ja	.Lerms_bytes
> +	call	clear_user_original
> +	RET
> +
> +.Lerms_bytes:
> +	rep stosb
> +
> +.Lerms_exit:
> +	xorl	%eax,%eax
> +	RET
> +
> +	_ASM_EXTABLE_UA(.Lerms_bytes, .Lerms_exit)
> +SYM_FUNC_END(clear_user_erms)
> +EXPORT_SYMBOL(clear_user_erms)

> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index 0ae6cf804197..6c1f8ac5e721 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -14,46 +14,6 @@
>   * Zero Userspace
>   */
>
> -unsigned long __clear_user(void __user *addr, unsigned long size)
> -{
> -	long __d0;
> -	might_fault();
> -	/* no memory constraint because it doesn't change any memory gcc knows
> -	   about */
> -	stac();
> -	asm volatile(
> -		"	testq  %[size8],%[size8]\n"
> -		"	jz     4f\n"
> -		"	.align 16\n"
> -		"0:	movq $0,(%[dst])\n"
> -		"	addq   $8,%[dst]\n"
> -		"	decl %%ecx ; jnz   0b\n"
> -		"4:	movq  %[size1],%%rcx\n"
> -		"	testl %%ecx,%%ecx\n"
> -		"	jz     2f\n"
> -		"1:	movb   $0,(%[dst])\n"
> -		"	incq   %[dst]\n"
> -		"	decl  %%ecx ; jnz  1b\n"
> -		"2:\n"
> -
> -		_ASM_EXTABLE_TYPE_REG(0b, 2b, EX_TYPE_UCOPY_LEN8, %[size1])
> -		_ASM_EXTABLE_UA(1b, 2b)
> -
> -		: [size8] "=&c"(size), [dst] "=&D" (__d0)
> -		: [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
> -	clac();
> -	return size;
> -}
> -EXPORT_SYMBOL(__clear_user);
> -
> -unsigned long clear_user(void __user *to, unsigned long n)
> -{
> -	if (access_ok(to, n))
> -		return __clear_user(to, n);
> -	return n;
> -}
> -EXPORT_SYMBOL(clear_user);
> -
>  #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
>  /**
>   * clean_cache_range - write back a cache range with CLWB
> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> index ca5b74603008..e460aa004b88 100644
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -1020,6 +1020,9 @@ static const char *uaccess_safe_builtin[] = {
>  	"copy_mc_fragile_handle_tail",
>  	"copy_mc_enhanced_fast_string",
>  	"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
> +	"clear_user_erms",
> +	"clear_user_rep_good",
> +	"clear_user_original",
>  	NULL
>  };
>
> --
> 2.35.1
>
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

Cheers,
Mark