From: Noah Goldstein
Date: Fri, 10 Jun 2022 15:15:54 -0700
Subject: Re: [PATCH v3 09/21] x86/asm: add clear_pages_movnt()
To: Ankur Arora
Cc: open list, linux-mm@kvack.org, X86 ML, torvalds@linux-foundation.org,
    akpm@linux-foundation.org, mike.kravetz@oracle.com, mingo@kernel.org,
    Andy Lutomirski, tglx@linutronix.de, Borislav Petkov,
    peterz@infradead.org, ak@linux.intel.com, arnd@arndb.de, jgg@nvidia.com,
    jon.grimm@amd.com, boris.ostrovsky@oracle.com, konrad.wilk@oracle.com,
    joao.m.martins@oracle.com
References: <20220606202109.1306034-1-ankur.a.arora@oracle.com>
    <20220606203725.1313715-5-ankur.a.arora@oracle.com>

On Fri, Jun 10, 2022 at 3:11 PM Noah Goldstein wrote:
>
> On Mon, Jun 6, 2022 at 11:39 PM Ankur Arora wrote:
> >
> > Add clear_pages_movnt(), which uses MOVNTI as the underlying primitive.
> > With this, page-clearing can skip the memory hierarchy, thus providing
> > a non cache-polluting implementation of clear_pages().
> >
> > MOVNTI, from the Intel SDM, Volume 2B, 4-101:
> > "The non-temporal hint is implemented by using a write combining (WC)
> > memory type protocol when writing the data to memory.
> > Using this protocol, the processor does not write the data into the cache
> > hierarchy, nor does it fetch the corresponding cache line from memory
> > into the cache hierarchy."
> >
> > The AMD Arch Manual has something similar to say as well.
> >
> > One use-case is to zero large extents without bringing in never-to-be-
> > accessed cachelines. Also, often clear_pages_movnt() based clearing is
> > faster once extent sizes are O(LLC-size).
> >
> > As the excerpt notes, MOVNTI is weakly ordered with respect to other
> > instructions operating on the memory hierarchy. This needs to be
> > handled by the caller by executing an SFENCE when done.
> >
> > The implementation is straight-forward: unroll the inner loop to keep
> > the code similar to memset_movnti(), so that we can gauge
> > clear_pages_movnt() performance via perf bench mem memset.
> >
> > # Intel Icelakex
> > # Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
> > # (X86_FEATURE_ERMS) and x86-64-movnt:
> >
> > System: Oracle X9-2 (2 nodes * 32 cores * 2 threads)
> > Processor: Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
> > Memory: 512 GB evenly split between nodes
> > LLC-size: 48MB for each node (32-cores * 2-threads)
> > no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance
> >
> >              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)    Delta(%)
> >              ----------------------    ---------------------    --------
> >      size            BW   (   stdev)          BW   (   stdev)
> >
> >       2MB      14.37 GB/s ( +- 1.55)    12.59 GB/s ( +- 1.20)    -12.38%
> >      16MB      16.93 GB/s ( +- 2.61)    15.91 GB/s ( +- 2.74)     -6.02%
> >     128MB      12.12 GB/s ( +- 1.06)    22.33 GB/s ( +- 1.84)    +84.24%
> >    1024MB      12.12 GB/s ( +- 0.02)    23.92 GB/s ( +- 0.14)    +97.35%
> >    4096MB      12.08 GB/s ( +- 0.02)    23.98 GB/s ( +- 0.18)    +98.50%
>
> For these sizes it may be worth it to save/rstor an xmm register to do
> the memset:
>
> Just on my Tigerlake laptop:
> model name : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
>
>              movntdq xmm (5 runs)       movnti GPR (5 runs)       Delta(%)
>              -----------------------    -----------------------
>      size    BW GB/s ( +- stdev)        BW GB/s ( +- stdev)       %
>      2 MB    35.71 GB/s ( +- 1.02)      34.62 GB/s ( +- 0.77)      -3.15%
>     16 MB    36.43 GB/s ( +- 0.35)      31.3 GB/s ( +- 0.1)       -16.39%
>    128 MB    35.6 GB/s ( +- 0.83)       30.82 GB/s ( +- 0.08)     -15.5%
>   1024 MB    36.85 GB/s ( +- 0.26)      30.71 GB/s ( +- 0.2)      -20.0%

Also (again, just from my Tigerlake laptop), I found that the trend favors
`rep stosb` more (as opposed to non-cacheable writes) when there are
multiple threads competing for BW:

https://docs.google.com/spreadsheets/d/1f6N9EVqHg71cDIR-RALLR76F_ovW5gzwIWr26yLCmS0/edit?usp=sharing

> >
> > Signed-off-by: Ankur Arora
> > ---
> >  arch/x86/include/asm/page_64.h |  1 +
> >  arch/x86/lib/clear_page_64.S   | 21 +++++++++++++++++++++
> >  2 files changed, 22 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
> > index a88a3508888a..3affc4ecb8da 100644
> > --- a/arch/x86/include/asm/page_64.h
> > +++ b/arch/x86/include/asm/page_64.h
> > @@ -55,6 +55,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
> >  void clear_pages_orig(void *page, unsigned long npages);
> >  void clear_pages_rep(void *page, unsigned long npages);
> >  void clear_pages_erms(void *page, unsigned long npages);
> > +void clear_pages_movnt(void *page, unsigned long npages);
> >
> >  #define __HAVE_ARCH_CLEAR_USER_PAGES
> >  static inline void clear_pages(void *page, unsigned int npages)
> > diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> > index 2cc3b681734a..83d14f1c9f57 100644
> > --- a/arch/x86/lib/clear_page_64.S
> > +++ b/arch/x86/lib/clear_page_64.S
> > @@ -58,3 +58,24 @@ SYM_FUNC_START(clear_pages_erms)
> >         RET
> >  SYM_FUNC_END(clear_pages_erms)
> >  EXPORT_SYMBOL_GPL(clear_pages_erms)
> > +
> > +SYM_FUNC_START(clear_pages_movnt)
> > +       xorl    %eax,%eax
> > +       movq    %rsi,%rcx
> > +       shlq    $PAGE_SHIFT, %rcx
> > +
> > +       .p2align 4
> > +.Lstart:
> > +       movnti  %rax, 0x00(%rdi)
> > +       movnti  %rax, 0x08(%rdi)
> > +       movnti  %rax, 0x10(%rdi)
> > +       movnti  %rax, 0x18(%rdi)
> > +       movnti  %rax, 0x20(%rdi)
> > +       movnti  %rax, 0x28(%rdi)
> > +       movnti  %rax, 0x30(%rdi)
> > +       movnti  %rax, 0x38(%rdi)
> > +       addq    $0x40, %rdi
> > +       subl    $0x40, %ecx
> > +       ja      .Lstart
> > +       RET
> > +SYM_FUNC_END(clear_pages_movnt)
> > --
> > 2.31.1
> >
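
For anyone who wants to compare the two store widths outside the kernel, here
is a rough userspace sketch of the movnti (GPR) and movntdq (xmm) variants
discussed above. It is not the patch's implementation and not the code behind
the numbers in this thread. Everything named here is hypothetical and chosen
for illustration only: clear_pages_movnti_sketch(),
clear_pages_movntdq_sketch(), all_zero() and SKETCH_PAGE_SIZE (which assumes
4 KiB pages). It uses compiler intrinsics rather than hand-written asm, does
not unroll the loop, falls back to memset() when not built for x86-64, and
issues the SFENCE inside the function, whereas the patch leaves that to the
caller.

```c
/* Userspace sketch only; not the kernel's clear_pages_movnt(). */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <emmintrin.h>	/* SSE2: _mm_stream_si64, _mm_stream_si128, _mm_sfence */
#endif

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages (PAGE_SHIFT == 12) */

/* Zero npages pages with 8-byte non-temporal stores (MOVNTI from a GPR). */
static void clear_pages_movnti_sketch(void *page, unsigned long npages)
{
	size_t len = npages * SKETCH_PAGE_SIZE;
#if defined(__x86_64__)
	long long *p = page;

	for (size_t i = 0; i < len / sizeof(*p); i++)
		_mm_stream_si64(&p[i], 0);
	/* Non-temporal stores are weakly ordered; fence before anyone reads. */
	_mm_sfence();
#else
	memset(page, 0, len);	/* portable fallback, no non-temporal hint */
#endif
}

/* Zero npages pages with 16-byte non-temporal stores (MOVNTDQ from xmm). */
static void clear_pages_movntdq_sketch(void *page, unsigned long npages)
{
	size_t len = npages * SKETCH_PAGE_SIZE;

	/* MOVNTDQ requires a 16-byte aligned destination. */
	assert(((uintptr_t)page & 15) == 0);
#if defined(__x86_64__)
	__m128i zero = _mm_setzero_si128();
	__m128i *p = page;

	for (size_t i = 0; i < len / sizeof(*p); i++)
		_mm_stream_si128(&p[i], zero);
	_mm_sfence();
#else
	memset(page, 0, len);	/* portable fallback, no non-temporal hint */
#endif
}

/* Return 1 if the first len bytes of buf are all zero, else 0. */
static int all_zero(const unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (buf[i] != 0)
			return 0;
	return 1;
}
```

In the kernel the xmm variant would additionally need the FPU state handled
around the stores (the save/restore mentioned above), which this userspace
sketch does not have to worry about.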