From: ira.weiny@intel.com
To: Dave Hansen, Dan Williams
Cc: Fenghua Yu, Sean Christopherson, Ira Weiny, Thomas Gleixner, Ingo Molnar,
 Borislav Petkov, Peter Zijlstra, Andy Lutomirski, "H. Peter Anvin",
 Rick Edgecombe, x86@kernel.org, linux-kernel@vger.kernel.org,
 nvdimm@lists.linux.dev, linux-mm@kvack.org
Subject: [PATCH V7 09/18] x86/pks: Add PKS kernel API
Date: Tue, 3 Aug 2021 21:32:22 -0700
Message-Id: <20210804043231.2655537-10-ira.weiny@intel.com>
In-Reply-To: <20210804043231.2655537-1-ira.weiny@intel.com>
References: <20210804043231.2655537-1-ira.weiny@intel.com>

From: Fenghua Yu

PKS allows kernel users to define domains of page mappings which have
additional protections beyond the paging protections. Violating those
protections creates a fault which by default will oops.

Each kernel user defines a PKS_KEY_* key value which identifies a PKS domain to
be used exclusively by that kernel user. This API is then used to control
which pages are part of that domain and the current thread's protection of
those pages.

Four new functions are added: pks_enabled(), pks_mk_noaccess(),
pks_mk_readonly(), and pks_mk_readwrite(). Two new macros are added:
PAGE_KERNEL_PKEY(key) and _PAGE_PKEY(pkey).

Update the protection key documentation to cover pkeys on supervisor pages.
This includes how to reserve a key and set the default permissions on that key.

Cc: Sean Christopherson
Cc: Dan Williams
Co-developed-by: Ira Weiny
Signed-off-by: Ira Weiny
Signed-off-by: Fenghua Yu

---
Change for V7
	Add pks_enabled() to allow users more dynamic choice on PKS use.
	Update documentation for key allocation
	Remove dynamic key allocation, keys will be allocated statically now.
	Add expected CPU generation support to documentation
---
 Documentation/core-api/protection-keys.rst | 121 ++++++++++++++++++---
 arch/x86/include/asm/pgtable_types.h       |  12 ++
 arch/x86/mm/pkeys.c                        |  66 +++++++++++
 include/linux/pgtable.h                    |   4 +
 include/linux/pkeys.h                      |  14 +++
 5 files changed, 199 insertions(+), 18 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index ec575e72d0b2..6420a60666fc 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -4,25 +4,30 @@
 Memory Protection Keys
 ======================
 
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
-which is found on Intel's Skylake (and later) "Scalable Processor"
-Server CPUs. It will be available in future non-server Intel parts
-and future AMD processors.
+Memory Protection Keys provide a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables
+when an application changes protection domains.
 
-For anyone wishing to test or use this feature, it is available in
-Amazon's EC2 C5 instances and is known to work there using an Ubuntu
-17.04 image.
+Protection Keys for Userspace (PKU) is found on Intel's Skylake "Scalable
+Processor" Server CPUs and later. It will also be available in future
+non-server Intel parts and future AMD processors.
 
-Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains. It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
+Protection Keys for Supervisor pages (PKS) has been documented in the SDM
+since May 2020.
+
+pkeys work by dedicating 4 previously Reserved bits in each page table entry to
+a "protection key", giving 16 possible keys. User and Supervisor pages are
+treated separately.
+
+Protections for each page are controlled with per-CPU registers for each type
+of page, User and Supervisor. Each of these 32-bit registers stores two
+separate bits (Access Disable and Write Disable) for each key.
 
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key. Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of protections from every other thread.
+For Userspace the register is user-accessible (rdpkru/wrpkru). For
+Supervisor, the register (MSR_IA32_PKRS) is accessible only to the kernel.
+
+Being a CPU register, pkeys are inherently thread-local, potentially giving
+each thread an independent set of protections from every other thread.
 
 There are two new instructions (RDPKRU/WRPKRU) for reading and writing
 to the new register. The feature is only available in 64-bit mode,
@@ -30,8 +35,11 @@ even though there is theoretically space in the PAE PTEs.
 These permissions are enforced on data access only and have no effect on
 instruction fetches.
 
-Syscalls
-========
+For kernel space rdmsr/wrmsr are used to access the kernel MSRs.
+
+
+Syscalls for user space keys
+============================
 
 There are 3 system calls which directly interact with pkeys::
 
@@ -98,3 +106,80 @@ with a read()::
 The kernel will send a SIGSEGV in both cases, but si_code will be set
 to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
 the plain mprotect() permissions are violated.
+
+
+Kernel API for PKS support
+==========================
+
+Similar to user space pkeys, supervisor pkeys allow additional protections to
+be defined for supervisor mappings. Unlike user space pkeys, violations of
+these protections result in a kernel oops.
+
+Supervisor Memory Protection Keys (PKS) is a feature which is found on Intel's
+Sapphire Rapids (and later) "Scalable Processor" Server CPUs. It will also be
+available in future non-server Intel parts.
+
+QEMU also has some support: https://www.qemu.org/2021/04/30/qemu-6-0-0/
+
+Kernel users intending to use PKS support should depend on
+ARCH_HAS_SUPERVISOR_PKEYS, and add their config to ARCH_ENABLE_SUPERVISOR_PKEYS
+to turn on this support within the core.
+
+Users reserve a key value by adding an entry to the enum pks_pkey_consumers and
+defining the initial protections in the consumer_defaults[] array.
+
+For example, to configure a key for 'MY_FEATURE' with a default of Write
+Disabled:
+
+::
+
+	enum pks_pkey_consumers
+	{
+		PKS_KEY_DEFAULT,
+		PKS_KEY_MY_FEATURE,
+		PKS_KEY_NR_CONSUMERS
+	};
+
+	...
+	consumer_defaults[PKS_KEY_DEFAULT] = 0;
+	consumer_defaults[PKS_KEY_MY_FEATURE] = PKR_DISABLE_WRITE;
+	...
+
+The following interface is used to manipulate the 'protection domain' defined
+by a pkey within the kernel.
+Setting a pkey value in a supervisor PTE adds
+this additional protection to the page.
+
+::
+
+	#define PAGE_KERNEL_PKEY(pkey)
+	#define _PAGE_PKEY(pkey)
+	bool pks_enabled(void);
+	void pks_mk_noaccess(int pkey);
+	void pks_mk_readonly(int pkey);
+	void pks_mk_readwrite(int pkey);
+
+pks_enabled() allows users to know if PKS is configured and available on the
+current running system.
+
+Kernel users must set the pkey in the page table entries for the mappings they
+want to protect. This can be done with PAGE_KERNEL_PKEY() or _PAGE_PKEY().
+
+The pks_mk*() family of calls allows individual threads to change the
+protections for the domain identified by the pkey parameter. Three states are
+available: pks_mk_noaccess(), pks_mk_readonly(), and pks_mk_readwrite() which
+set the access to none, read, and read/write respectively.
+
+The interface sets Access Disabled (AD=1) for all keys not in use.
+
+It should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not serializing
+but still maintains ordering properties similar to WRPKRU. Thus it is safe to
+immediately use a mapping when the pks_mk*() functions return.
+
+Older versions of the SDM on PKRS may be wrong with regard to this
+serialization. The text should be the same as that of WRPKRU. From the WRPKRU
+text:
+
+	WRPKRU will never execute transiently. Memory accesses
+	affected by PKRU register will not execute (even transiently)
+	until all prior executions of WRPKRU have completed execution
+	and updated the PKRU register.
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 40497a9020c6..3f866e730456 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -71,6 +71,12 @@
 			 _PAGE_PKEY_BIT2 | \
 			 _PAGE_PKEY_BIT3)
 
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+#define _PAGE_PKEY(pkey)	(_AT(pteval_t, pkey) << _PAGE_BIT_PKEY_BIT0)
+#else
+#define _PAGE_PKEY(pkey)	(_AT(pteval_t, 0))
+#endif
+
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED)
 #else
@@ -226,6 +232,12 @@ enum page_cache_mode {
 #define PAGE_KERNEL_IO		__pgprot_mask(__PAGE_KERNEL_IO)
 #define PAGE_KERNEL_IO_NOCACHE	__pgprot_mask(__PAGE_KERNEL_IO_NOCACHE)
 
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+#define PAGE_KERNEL_PKEY(pkey)	__pgprot_mask(__PAGE_KERNEL | _PAGE_PKEY(pkey))
+#else
+#define PAGE_KERNEL_PKEY(pkey)	PAGE_KERNEL
+#endif
+
 #endif	/* __ASSEMBLY__ */
 
 /* xwr */
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index eca01dc8d7ac..146a665d1bf3 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -3,6 +3,9 @@
  * Intel Memory Protection Keys management
  * Copyright (c) 2015, Intel Corporation.
  */
+#undef pr_fmt
+#define pr_fmt(fmt) "x86/pkeys: " fmt
+
 #include /* debugfs_create_u32() */
 #include /* mm_struct, vma, etc... */
 #include /* PKEY_* */
@@ -10,6 +13,7 @@
 
 #include /* boot_cpu_has, ... */
 #include /* vma_pkey() */
+#include
 
 int __execute_only_pkey(struct mm_struct *mm)
 {
@@ -301,4 +305,66 @@ void pks_init_task(struct task_struct *task)
 	task->thread.saved_pkrs = pkrs_init_value;
 }
 
+bool pks_enabled(void)
+{
+	return cpu_feature_enabled(X86_FEATURE_PKS);
+}
+
+/*
+ * Do not call this directly, see pks_mk*() below.
+ *
+ * @pkey: Key for the domain to change
+ * @protection: protection bits to be used
+ *
+ * Protection utilizes the same protection bits specified for User pkeys
+ *     PKEY_DISABLE_ACCESS
+ *     PKEY_DISABLE_WRITE
+ *
+ */
+static inline void pks_update_protection(int pkey, unsigned long protection)
+{
+	current->thread.saved_pkrs = update_pkey_val(current->thread.saved_pkrs,
+						     pkey, protection);
+	pkrs_write_current();
+}
+
+/**
+ * pks_mk_noaccess() - Disable all access to the domain
+ * @pkey: the pkey for which the access should change.
+ *
+ * Disable all access to the domain specified by pkey. This is not a global
+ * update and only affects the current running thread.
+ */
+void pks_mk_noaccess(int pkey)
+{
+	pks_update_protection(pkey, PKEY_DISABLE_ACCESS);
+}
+EXPORT_SYMBOL_GPL(pks_mk_noaccess);
+
+/**
+ * pks_mk_readonly() - Make the domain Read only
+ * @pkey: the pkey for which the access should change.
+ *
+ * Allow read access to the domain specified by pkey. This is not a global
+ * update and only affects the current running thread.
+ */
+void pks_mk_readonly(int pkey)
+{
+	pks_update_protection(pkey, PKEY_DISABLE_WRITE);
+}
+EXPORT_SYMBOL_GPL(pks_mk_readonly);
+
+/**
+ * pks_mk_readwrite() - Make the domain Read/Write
+ * @pkey: the pkey for which the access should change.
+ *
+ * Allow all access, read and write, to the domain specified by pkey. This is
+ * not a global update and only affects the current running thread.
+ */
+void pks_mk_readwrite(int pkey)
+{
+	pks_update_protection(pkey, 0);
+}
+EXPORT_SYMBOL_GPL(pks_mk_readwrite);
+
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index d147480cdefc..eba1a9f9d124 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1526,6 +1526,10 @@ static inline bool arch_has_pfn_modify_check(void)
 # define PAGE_KERNEL_EXEC PAGE_KERNEL
 #endif
 
+#ifndef PAGE_KERNEL_PKEY
+#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL
+#endif
+
 /*
  * Page Table Modification bits for pgtbl_mod_mask.
  *
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 76eb19a37942..b9919ed4d300 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -56,11 +56,25 @@ extern u32 pkrs_init_value;
 void pkrs_save_irq(struct pt_regs *regs);
 void pkrs_restore_irq(struct pt_regs *regs);
 
+bool pks_enabled(void);
+void pks_mk_noaccess(int pkey);
+void pks_mk_readonly(int pkey);
+void pks_mk_readwrite(int pkey);
+
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 static inline void pkrs_save_irq(struct pt_regs *regs) { }
 static inline void pkrs_restore_irq(struct pt_regs *regs) { }
 
+static inline bool pks_enabled(void)
+{
+	return false;
+}
+
+static inline void pks_mk_noaccess(int pkey) {}
+static inline void pks_mk_readonly(int pkey) {}
+static inline void pks_mk_readwrite(int pkey) {}
+
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 #endif /* _LINUX_PKEYS_H */
-- 
2.28.0.rc0.12.gb6a658bd00c9
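
Not part of the patch: the documentation hunk above says each 32-bit
protection-key register stores two bits per key (Access Disable and Write
Disable), and the pks_mk*() helpers pass PKEY_DISABLE_ACCESS,
PKEY_DISABLE_WRITE or 0 to update_pkey_val(). The user-space sketch below
models that encoding so the arithmetic can be checked by hand. It is not the
kernel's update_pkey_val(); model_update_pkey_val(), model_can_read() and
model_can_write() are hypothetical names, and the sketch assumes AD is the low
bit and WD the high bit of each key's pair, matching the user-space
PKEY_DISABLE_* values.

```c
#include <assert.h>
#include <stdint.h>

/* Same values as the user-space pkey ABI flags. */
#define PKEY_DISABLE_ACCESS	0x1u
#define PKEY_DISABLE_WRITE	0x2u

/* Assumed layout: 16 keys, 2 bits each, in one 32-bit register. */
#define MODEL_BITS_PER_PKEY	2
#define MODEL_NR_PKEYS		16

/*
 * Hypothetical model of a per-key update: clear the two bits that belong
 * to @pkey, then set the requested protection bits. Other keys are left
 * untouched.
 */
static inline uint32_t model_update_pkey_val(uint32_t pk_reg, int pkey,
					     uint32_t protection)
{
	unsigned int shift;
	uint32_t mask;

	assert(pkey >= 0 && pkey < MODEL_NR_PKEYS);

	shift = (unsigned int)pkey * MODEL_BITS_PER_PKEY;
	mask = (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE) << shift;

	pk_reg &= ~mask;
	pk_reg |= (protection << shift) & mask;
	return pk_reg;
}

/* Reads are allowed unless Access Disable is set for the key. */
static inline int model_can_read(uint32_t pk_reg, int pkey)
{
	uint32_t bits = pk_reg >> ((unsigned int)pkey * MODEL_BITS_PER_PKEY);

	return !(bits & PKEY_DISABLE_ACCESS);
}

/* Writes are allowed only if neither AD nor WD is set for the key. */
static inline int model_can_write(uint32_t pk_reg, int pkey)
{
	uint32_t bits = pk_reg >> ((unsigned int)pkey * MODEL_BITS_PER_PKEY);

	return !(bits & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
}
```

In this model, pks_mk_noaccess(pkey) corresponds to passing
PKEY_DISABLE_ACCESS, pks_mk_readonly(pkey) to PKEY_DISABLE_WRITE, and
pks_mk_readwrite(pkey) to 0, each applied to the current thread's saved
register value.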