From: Yu-cheng Yu
To: x86@kernel.org, "H. Peter Anvin", Thomas Gleixner, Ingo Molnar,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-mm@kvack.org, linux-arch@vger.kernel.org,
 linux-api@vger.kernel.org, Arnd Bergmann, Andy Lutomirski,
 Balbir Singh, Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
 Eugene Syromiatnikov, Florian Weimer, "H.J. Lu", Jann Horn,
 Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit, Oleg Nesterov,
 Pavel Machek, Peter Zijlstra, Randy Dunlap, "Ravi V.
 Shankar", Vedvyas Shanbhogue, Dave Martin, Weijiang Yang, Pengfei Xu
Cc: Yu-cheng Yu
Subject: [PATCH v20 08/25] x86/mm: Introduce _PAGE_COW
Date: Wed, 10 Feb 2021 09:56:46 -0800
Message-Id: <20210210175703.12492-9-yu-cheng.yu@intel.com>
X-Mailer: git-send-email 2.21.0
In-Reply-To: <20210210175703.12492-1-yu-cheng.yu@intel.com>
References: <20210210175703.12492-1-yu-cheng.yu@intel.com>

There is essentially no room left in the x86 hardware PTEs on some OSes
(not Linux). That left the hardware architects looking for a way to
represent a new memory type (shadow stack) within the existing bits.
They chose to repurpose a lightly-used state: Write=0, Dirty=1.

The reason it's lightly used is that Dirty=1 is normally set by hardware
and cannot normally be set by hardware on a Write=0 PTE. Software must
normally be involved to create one of these PTEs, so software can simply
opt to not create them.

In places where Linux normally creates Write=0, Dirty=1, it can use the
software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other
words, whenever Linux needs to create Write=0, Dirty=1, it instead creates
Write=0, Cow=1, except for shadow stack, which is Write=0, Dirty=1. This
clearly separates shadow stack from other data, and results in the
following:

(a) A modified, copy-on-write (COW) page: (Write=0, Cow=1)
(b) A R/O page that has been COW'ed: (Write=0, Cow=1)
    The user page is in a R/O VMA, and get_user_pages() needs a writable
    copy. The page fault handler creates a copy of the page and sets
    the new copy's PTE as Write=0 and Cow=1.
(c) A shadow stack PTE: (Write=0, Dirty=1)
(d) A shared shadow stack PTE: (Write=0, Cow=1)
    When a shadow stack page is being shared among processes (this happens
    at fork()), its PTE is made Dirty=0, so the next shadow stack access
    causes a fault, and the page is duplicated and Dirty=1 is set again.
    This is the COW equivalent for shadow stack pages, even though it's
    copy-on-access rather than copy-on-write.
(e) A page where the processor observed a Write=1 PTE, started a write, set
    Dirty=1, but then observed a Write=0 PTE. That's possible today, but
    will not happen on processors that support shadow stack.

Define _PAGE_COW, update the pte_*() helpers, and apply the same changes
to pmd and pud.

After this, there are six free bits left in the 64-bit PTE, and no more
free bits in the 32-bit PTE (except for PAE). Shadow stack is not
implemented for the 32-bit kernel.

Signed-off-by: Yu-cheng Yu
---
 arch/x86/include/asm/pgtable.h       | 136 ++++++++++++++++++++++++---
 arch/x86/include/asm/pgtable_types.h |  42 ++++++++-
 2 files changed, 165 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..2309fa9d412e 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -121,9 +121,12 @@ extern pmdval_t early_pmd_flags;
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
  */
-static inline int pte_dirty(pte_t pte)
+static inline bool pte_dirty(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_DIRTY;
+	/*
+	 * A dirty PTE has Dirty=1 or Cow=1.
+	 */
+	return pte_flags(pte) & _PAGE_DIRTY_BITS;
 }
 
 
@@ -160,9 +163,12 @@ static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
-static inline int pmd_dirty(pmd_t pmd)
+static inline bool pmd_dirty(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_DIRTY;
+	/*
+	 * A dirty PMD has Dirty=1 or Cow=1.
+	 */
+	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
 }
 
 static inline int pmd_young(pmd_t pmd)
@@ -170,9 +176,12 @@ static inline int pmd_young(pmd_t pmd)
 	return pmd_flags(pmd) & _PAGE_ACCESSED;
 }
 
-static inline int pud_dirty(pud_t pud)
+static inline bool pud_dirty(pud_t pud)
 {
-	return pud_flags(pud) & _PAGE_DIRTY;
+	/*
+	 * A dirty PUD has Dirty=1 or Cow=1.
+	 */
+	return pud_flags(pud) & _PAGE_DIRTY_BITS;
 }
 
 static inline int pud_young(pud_t pud)
@@ -182,7 +191,15 @@ static inline int pud_young(pud_t pud)
 
 static inline int pte_write(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_RW;
+	/*
+	 * When shadow stack is enabled, a directly writable PTE has RW=1.
+	 * A shadow stack PTE is logically writable and has RW=0, Dirty=1.
+	 * See comments for _PAGE_COW.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY);
+	else
+		return pte_flags(pte) & _PAGE_RW;
 }
 
 static inline int pte_huge(pte_t pte)
@@ -333,7 +350,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
 }
 
 static inline pte_t pte_mkold(pte_t pte)
@@ -343,6 +360,18 @@ static inline pte_t pte_mkold(pte_t pte)
 
 static inline pte_t pte_wrprotect(pte_t pte)
 {
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PTE (RW=0, Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		if (pte_flags(pte) & _PAGE_DIRTY) {
+			pte = pte_clear_flags(pte, _PAGE_DIRTY);
+			pte = pte_set_flags(pte, _PAGE_COW);
+		}
+	}
+
 	return pte_clear_flags(pte, _PAGE_RW);
 }
 
@@ -353,6 +382,18 @@ static inline pte_t pte_mkexec(pte_t pte)
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
+	pteval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PTEs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte))
+		dirty = _PAGE_COW;
+
+	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+	pte = pte_clear_flags(pte, _PAGE_COW);
 	return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
 }
 
@@ -363,6 +404,13 @@ static inline pte_t pte_mkyoung(pte_t pte)
 
 static inline pte_t pte_mkwrite(pte_t pte)
 {
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		if (pte_flags(pte) & _PAGE_COW) {
+			pte = pte_clear_flags(pte, _PAGE_COW);
+			pte = pte_set_flags(pte, _PAGE_DIRTY);
+		}
+	}
+
 	return pte_set_flags(pte, _PAGE_RW);
 }
 
@@ -434,16 +482,40 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
 
 static inline pmd_t pmd_mkclean(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
 }
 
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
 {
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PMD (RW=0, Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		if (pmd_flags(pmd) & _PAGE_DIRTY) {
+			pmd = pmd_clear_flags(pmd, _PAGE_DIRTY);
+			pmd = pmd_set_flags(pmd, _PAGE_COW);
+		}
+	}
+
 	return pmd_clear_flags(pmd, _PAGE_RW);
 }
 
 static inline pmd_t pmd_mkdirty(pmd_t pmd)
 {
+	pmdval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PMDs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !(pmd_flags(pmd) & _PAGE_RW))
+		dirty = _PAGE_COW;
+
+	return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+	pmd = pmd_clear_flags(pmd, _PAGE_COW);
 	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
 }
 
@@ -464,6 +536,13 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
 
 static inline pmd_t pmd_mkwrite(pmd_t pmd)
 {
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		if (pmd_flags(pmd) & _PAGE_COW) {
+			pmd = pmd_clear_flags(pmd, _PAGE_COW);
+			pmd = pmd_set_flags(pmd, _PAGE_DIRTY);
+		}
+	}
+
 	return pmd_set_flags(pmd, _PAGE_RW);
 }
 
@@ -488,17 +567,35 @@ static inline pud_t pud_mkold(pud_t pud)
 
 static inline pud_t pud_mkclean(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
 }
 
 static inline pud_t pud_wrprotect(pud_t pud)
 {
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PUD (RW=0, Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		if (pud_flags(pud) & _PAGE_DIRTY) {
+			pud = pud_clear_flags(pud, _PAGE_DIRTY);
+			pud = pud_set_flags(pud, _PAGE_COW);
+		}
+	}
+
 	return pud_clear_flags(pud, _PAGE_RW);
 }
 
 static inline pud_t pud_mkdirty(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pudval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PUDs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !(pud_flags(pud) & _PAGE_RW))
+		dirty = _PAGE_COW;
+
+	return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
 }
 
 static inline pud_t pud_mkdevmap(pud_t pud)
@@ -518,6 +615,13 @@ static inline pud_t pud_mkyoung(pud_t pud)
 
 static inline pud_t pud_mkwrite(pud_t pud)
 {
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		if (pud_flags(pud) & _PAGE_COW) {
+			pud = pud_clear_flags(pud, _PAGE_COW);
+			pud = pud_set_flags(pud, _PAGE_DIRTY);
+		}
+	}
+
 	return pud_set_flags(pud, _PAGE_RW);
 }
 
@@ -1131,7 +1235,15 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
 #define pmd_write pmd_write
 static inline int pmd_write(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_RW;
+	/*
+	 * When shadow stack is enabled, a directly writable PMD has RW=1.
+	 * A shadow stack PMD is logically writable and has RW=0, Dirty=1.
+	 * See comments for _PAGE_COW.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY);
+	else
+		return pmd_flags(pmd) & _PAGE_RW;
 }
 
 #define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b8b79d618bbc..437d7ff0ae80 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -23,7 +23,8 @@
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
-#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
+#define _PAGE_BIT_SOFTW4	57	/* available for programmer */
+#define _PAGE_BIT_SOFTW5	58	/* available for programmer */
 #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
 #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
 #define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
@@ -36,6 +37,15 @@
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
 
+/*
+ * Indicates a copy-on-write page.
+ */
+#ifdef CONFIG_X86_CET
+#define _PAGE_BIT_COW		_PAGE_BIT_SOFTW5 /* copy-on-write */
+#else
+#define _PAGE_BIT_COW		0
+#endif
+
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
 #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
@@ -117,6 +127,36 @@
 #define _PAGE_DEVMAP	(_AT(pteval_t, 0))
 #endif
 
+/*
+ * The hardware requires shadow stack to be read-only and Dirty.
+ * _PAGE_COW is a software-only bit used to separate copy-on-write PTEs
+ * from shadow stack PTEs:
+ * (a) A modified, copy-on-write (COW) page: (Write=0, Cow=1)
+ * (b) A R/O page that has been COW'ed: (Write=0, Cow=1)
+ *     The user page is in a R/O VMA, and get_user_pages() needs a
+ *     writable copy. The page fault handler creates a copy of the page
+ *     and sets the new copy's PTE as Write=0, Cow=1.
+ * (c) A shadow stack PTE: (Write=0, Dirty=1)
+ * (d) A shared (copy-on-access) shadow stack PTE: (Write=0, Cow=1)
+ *     When a shadow stack page is being shared among processes (this
+ *     happens at fork()), its PTE is cleared of _PAGE_DIRTY, so the next
+ *     shadow stack access causes a fault, and the page is duplicated and
+ *     _PAGE_DIRTY is set again. This is the COW equivalent for shadow
+ *     stack pages, even though it's copy-on-access rather than
+ *     copy-on-write.
+ * (e) A page where the processor observed a Write=1 PTE, started a write,
+ *     set Dirty=1, but then observed a Write=0 PTE (changed by another
+ *     thread). That's possible today, but will not happen on processors
+ *     that support shadow stack.
+ */
+#ifdef CONFIG_X86_CET
+#define _PAGE_COW	(_AT(pteval_t, 1) << _PAGE_BIT_COW)
+#else
+#define _PAGE_COW	(_AT(pteval_t, 0))
+#endif
+
+#define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_COW)
+
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
 /*
-- 
2.21.0
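
Illustration only, not part of the patch: the Write/Dirty/Cow transitions
described in the changelog can be tried out in user space with the rough
sketch below. It mirrors the pte_*() helpers on a plain 64-bit integer.
Every sk_ and SK_ name is a hypothetical stand-in, sk_shstk_enabled
replaces cpu_feature_enabled(X86_FEATURE_SHSTK), and _PAGE_SOFT_DIRTY is
left out for brevity. Bit 58 for Cow follows _PAGE_BIT_SOFTW5 in this
patch; RW at bit 1 and Dirty at bit 6 are the usual x86 positions and are
assumed here, since the patch does not show them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sk_pte_t;	/* hypothetical stand-in for pte_t */

#define SK_PAGE_RW	(UINT64_C(1) << 1)	/* assumed: hardware Write bit */
#define SK_PAGE_DIRTY	(UINT64_C(1) << 6)	/* assumed: hardware Dirty bit */
#define SK_PAGE_COW	(UINT64_C(1) << 58)	/* software bit, per the patch */
#define SK_PAGE_DIRTY_BITS (SK_PAGE_DIRTY | SK_PAGE_COW)

/* Hypothetical replacement for cpu_feature_enabled(X86_FEATURE_SHSTK). */
static bool sk_shstk_enabled = true;

/* Dirty means hardware Dirty=1 or software Cow=1. */
static inline bool sk_pte_dirty(sk_pte_t pte)
{
	return (pte & SK_PAGE_DIRTY_BITS) != 0;
}

/* With shadow stack, Write=0,Dirty=1 counts as logically writable. */
static inline bool sk_pte_write(sk_pte_t pte)
{
	if (sk_shstk_enabled)
		return (pte & (SK_PAGE_RW | SK_PAGE_DIRTY)) != 0;
	return (pte & SK_PAGE_RW) != 0;
}

/* Move Dirty to Cow before clearing RW, so no shadow stack PTE appears. */
static inline sk_pte_t sk_pte_wrprotect(sk_pte_t pte)
{
	if (sk_shstk_enabled && (pte & SK_PAGE_DIRTY)) {
		pte &= ~SK_PAGE_DIRTY;
		pte |= SK_PAGE_COW;
	}
	return pte & ~SK_PAGE_RW;
}

/* A non-writable PTE is marked dirty through Cow instead of Dirty. */
static inline sk_pte_t sk_pte_mkdirty(sk_pte_t pte)
{
	sk_pte_t dirty = SK_PAGE_DIRTY;

	if (sk_shstk_enabled && !sk_pte_write(pte))
		dirty = SK_PAGE_COW;
	return pte | dirty;
}

/* Making the PTE writable moves Cow back to the hardware Dirty bit. */
static inline sk_pte_t sk_pte_mkwrite(sk_pte_t pte)
{
	if (sk_shstk_enabled && (pte & SK_PAGE_COW)) {
		pte &= ~SK_PAGE_COW;
		pte |= SK_PAGE_DIRTY;
	}
	return pte | SK_PAGE_RW;
}

/* Shadow stack PTE: Write=0, Dirty=1, Cow=0. */
static inline sk_pte_t sk_pte_mkwrite_shstk(sk_pte_t pte)
{
	pte &= ~SK_PAGE_COW;
	return pte | SK_PAGE_DIRTY;
}

/* Usage: one PTE goes through write-protect and back. */
static inline void sk_demo(void)
{
	sk_pte_t pte = SK_PAGE_RW | SK_PAGE_DIRTY;	/* Write=1, Dirty=1 */

	pte = sk_pte_wrprotect(pte);			/* Write=0, Cow=1 */
	assert(pte == SK_PAGE_COW);
	pte = sk_pte_mkwrite(pte);			/* Write=1, Dirty=1 */
	assert(pte == (SK_PAGE_RW | SK_PAGE_DIRTY));
}
```

The round trip in sk_demo() corresponds to case (a) in the changelog: a
dirty, writable page is write-protected into Write=0, Cow=1 and never
passes through the shadow stack encoding Write=0, Dirty=1.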