* [RFC PATCH v3 00/15] pkeys-based page table hardening
@ 2025-02-03 10:18 Kevin Brodsky
  2025-02-03 10:18 ` [RFC PATCH v3 01/15] mm: Introduce kpkeys Kevin Brodsky
                   ` (16 more replies)
  0 siblings, 17 replies; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-03 10:18 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Linus Walleij, Andy Lutomirski, Marc Zyngier,
	Peter Zijlstra, Pierre Langlois, Quentin Perret,
	Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

This is a proposal to leverage protection keys (pkeys) to harden
critical kernel data, by making it mostly read-only. The series includes
a simple framework called "kpkeys" to manipulate pkeys for in-kernel use,
as well as a page table hardening feature based on that framework
(kpkeys_hardened_pgtables). Both are implemented on arm64 as a proof of
concept, but they are designed to be compatible with any architecture
implementing pkeys.

The proposed approach is a typical use of pkeys: the data to protect is
mapped with a given pkey P, and the pkey register is initially configured
to grant read-only access to P. Where the protected data needs to be
written to, the pkey register is temporarily switched to grant write
access to P on the current CPU.

The key fact this approach relies on is that the target data is
only written to via a limited and well-defined API. This makes it
possible to explicitly switch the pkey register where needed, without
introducing excessively invasive changes, and only for a small amount of
trusted code.
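
As an illustration, the write pattern boils down to the following
sketch, using the kpkeys API introduced later in this cover letter
(protected_var and KPKEYS_LVL_FEAT are made up for the example):

  /* Hypothetical variable residing in pages assigned a kpkeys pkey */
  static int protected_var;

  static void update_protected_var(int new_val)
  {
          u64 pkey_reg;

          /* Grant write access to the feature's pkey on this CPU */
          pkey_reg = kpkeys_set_level(KPKEYS_LVL_FEAT);
          protected_var = new_val;
          /* Return to read-only access */
          kpkeys_restore_pkey_reg(pkey_reg);
  }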

Page tables were chosen as they are a popular (and critical) target for
attacks, but there are of course many others - this is only a starting
point (see section "Further use-cases"). In vendor kernels, it has
become increasingly common for accesses to such target data to be
mediated by a hypervisor; the hope is that kpkeys can provide much of
that protection in a simpler manner. No benchmarking has been performed
at this stage, but the runtime overhead should also be lower (though
likely not negligible).

# kpkeys

The use of pkeys involves two separate mechanisms: assigning a pkey to
pages, and defining the pkeys -> permissions mapping via the pkey
register. This is implemented through the following interface:

- Pages in the linear mapping are assigned a pkey using set_memory_pkey().
  This is sufficient for this series, but of course higher-level
  interfaces can be introduced later to ask allocators to return pages
  marked with a given pkey. It should also be possible to extend this to
  vmalloc() if needed.

- The pkey register is configured based on a *kpkeys level*. kpkeys
  levels are simple integers that correspond to a given configuration,
  for instance:

  KPKEYS_LVL_DEFAULT:
        RW access to KPKEYS_PKEY_DEFAULT
        RO access to any other KPKEYS_PKEY_*

  KPKEYS_LVL_<FEAT>:
        RW access to KPKEYS_PKEY_DEFAULT
        RW access to KPKEYS_PKEY_<FEAT>
        RO access to any other KPKEYS_PKEY_*

  Only pkeys that are managed by the kpkeys framework are impacted;
  permissions for other pkeys are left unchanged (this allows for other
  schemes using pkeys to be used in parallel, and arch-specific use of
  certain pkeys).

  The kpkeys level is changed by calling kpkeys_set_level(), which sets
  the pkey register accordingly and returns its original value. A
  subsequent call to kpkeys_restore_pkey_reg() restores the kpkeys
  level. The numeric values of KPKEYS_LVL_* (kpkeys levels) are purely
  symbolic and thus generic; each architecture is however free to define
  KPKEYS_PKEY_* (pkey values).
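
  Putting the two mechanisms together, a feature would typically assign
  its pkey to the pages it protects and switch level around writes. A
  rough sketch, with KPKEYS_PKEY_FEAT/KPKEYS_LVL_FEAT as placeholder
  names (only the *_DEFAULT and *_PGTABLES definitions exist in this
  series):

        /* Make a linear-mapping page read-only at the default level */
        static int protect_feature_page(struct page *page)
        {
                unsigned long addr = (unsigned long)page_address(page);

                return set_memory_pkey(addr, 1, KPKEYS_PKEY_FEAT);
        }

        /* Write to a protected slot under a temporary level switch */
        static void write_feature_data(u32 *slot, u32 val)
        {
                u64 pkey_reg = kpkeys_set_level(KPKEYS_LVL_FEAT);

                *slot = val;
                kpkeys_restore_pkey_reg(pkey_reg);
        }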

# kpkeys_hardened_pgtables

The kpkeys_hardened_pgtables feature uses the interface above to make
the (kernel and user) page tables read-only by default, enabling write
access only in helpers such as set_pte(). One complication is that
those helpers, as well as page table allocators, are used very early,
before kpkeys become available. kpkeys_hardened_pgtables is therefore
enabled as follows, if and when kpkeys become available:

1. A static key is turned on. This enables a transition to
   KPKEYS_LVL_PGTABLES in all helpers writing to page tables, and also
   impacts page table allocators (see step 3).

2. All pages holding kernel page tables are set to KPKEYS_PKEY_PGTABLES.
   This ensures they can only be written when running at
   KPKEYS_LVL_PGTABLES.

3. Page table allocators set the returned pages to KPKEYS_PKEY_PGTABLES
   (and the pkey is reset upon freeing). This ensures that all page
   tables are mapped with that privileged pkey.
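
Once the static key is on, a page table setter therefore ends up
looking roughly like the arm64 helpers modified in patch 13 (exact
names and details differ; this is only a sketch):

  static inline void __set_pte(pte_t *ptep, pte_t pte)
  {
          /* Runs at KPKEYS_LVL_PGTABLES until the end of scope */
          guard(kpkeys_hardened_pgtables)();

          WRITE_ONCE(*ptep, pte);
  }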

# Threat model

The proposed scheme aims to mitigate data-only attacks (e.g.
use-after-free/cross-cache attacks). In other words, it is assumed that
control flow is not corrupted, and that the attacker does not achieve
arbitrary code execution. Nothing prevents the pkey register from being
set to its most permissive state; the assumption is that the register
is only modified on legitimate code paths.

A few related notes:

- Functions that set the pkey register are all implemented inline.
  Besides performance considerations, this is meant to avoid creating
  a function that can be used as a straightforward gadget to set the
  pkey register to an arbitrary value.

- kpkeys_set_level() only accepts a compile-time constant as argument,
  as a variable could be manipulated by an attacker. This could be
  relaxed but it seems unlikely that a variable kpkeys level would be
  needed in practice.

# Further use-cases

It should be possible to harden various targets using kpkeys, including:

- struct cred (enforcing a "mostly read-only" state once committed)

- fixmap (occasionally used even after early boot, e.g.
  set_swapper_pgd() in arch/arm64/mm/mmu.c)

- SELinux state (e.g. struct selinux_state::initialized)

... and many others.

kpkeys could also be used to strengthen the confidentiality of secret
data by making it completely inaccessible by default, and granting
read-only or read-write access as needed. This requires such data to be
rarely accessed (or via a limited interface only). One example on arm64
is the pointer authentication keys in thread_struct, whose leakage to
userspace would lead to pointer authentication being easily defeated.
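
A rough sketch of what this could look like (KPKEYS_PKEY_SECRET and
KPKEYS_LVL_SECRET_RO are made-up names; no such level is defined in
this series):

  /* Secret mapped with a pkey that is inaccessible by default */
  static u64 read_secret(const u64 *secret)
  {
          u64 pkey_reg, val;

          /* Temporarily grant read-only access to the secret's pkey */
          pkey_reg = kpkeys_set_level(KPKEYS_LVL_SECRET_RO);
          val = *secret;
          kpkeys_restore_pkey_reg(pkey_reg);

          return val;
  }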

# This series

The series is composed of two parts:

- The kpkeys framework (patch 1-7). The main API is introduced in
  <linux/kpkeys.h>, and it is implemented on arm64 using the POE
  (Permission Overlay Extension) feature.

- The kpkeys_hardened_pgtables feature (patch 8-15). <linux/kpkeys.h> is
  extended with an API to set page table pages to a given pkey and a
  guard object to switch kpkeys level accordingly, both gated on a
  static key. This is then used in generic and arm64 pgtable handling
  code as needed. Finally a simple KUnit-based test suite is added to
  demonstrate the page table protection.

The arm64 implementation should be considered a proof of concept only.
The enablement of POE for in-kernel use is incomplete; in particular
POR_EL1 (pkey register) should be reset on exception entry and restored
on exception return.

# Performance

No particular efforts were made to optimise the use of kpkeys at this
stage (and no benchmarking was performed either). There are two obvious
low-hanging fruits in the kpkeys_hardened_pgtables feature:

- Always switching kpkeys level in leaf helpers such as set_pte() can be
  very inefficient if many page table entries are updated in a row. Some
  sort of batching may be desirable (see the sketch after this list).

- On arm64 specifically, the page table helpers typically perform an
  expensive ISB (Instruction Synchronisation Barrier) after writing to
  page tables. Since most of the cost of switching the arm64 pkey
  register (POR_EL1) comes from the following ISB, the overhead incurred
  by kpkeys_restore_pkey_reg() would be significantly reduced by merging
  its ISB with the pgtable helper's. That would however require more
  invasive changes, beyond simply adding a guard object.
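
Regarding batching, one option might be to take the guard once around
a whole run of updates, loosely modelled on set_ptes()
(set_ptes_batched() below is illustrative, not an existing helper):

  static void set_ptes_batched(pte_t *ptep, pte_t pte, unsigned int nr)
  {
          unsigned int i;

          /* Single level switch (and, on arm64, a single ISB pair) */
          guard(kpkeys_hardened_pgtables)();

          for (i = 0; i < nr; i++, ptep++, pte = pte_next_pfn(pte))
                  WRITE_ONCE(*ptep, pte);
  }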

# Open questions

A few aspects in this RFC that are debatable and/or worth discussing:

- There is currently no restriction on how kpkeys levels map to pkeys
  permissions. A typical approach is to allocate one pkey per level and
  make it writable at that level only. As the number of levels
  increases, we may however run out of pkeys, especially on arm64 (just
  8 pkeys with POE). Depending on the use-cases, it may be acceptable to
  use the same pkey for the data associated with multiple levels.

  Another potential concern is that a given piece of code may require
  write access to multiple privileged pkeys. This could be addressed by
  introducing a notion of hierarchy in trust levels, where Tn is able to
  write to memory owned by Tm if n >= m, for instance (see the sketch at
  the end of this section).

- kpkeys_set_level() and kpkeys_restore_pkey_reg() are not symmetric:
  the former takes a kpkeys level and returns a pkey register value, to
  be consumed by the latter. It would be more intuitive to manipulate
  kpkeys levels only. However this assumes that there is a 1:1 mapping
  between kpkeys levels and pkey register values, while in principle
  the mapping is 1:n (certain pkeys may be used outside the kpkeys
  framework).

- An architecture that supports kpkeys is expected to select
  CONFIG_ARCH_HAS_KPKEYS and always enable them if available - there is
  no CONFIG_KPKEYS to control this behaviour. Since this creates no
  significant overhead (at least on arm64), it seemed better to keep it
  simple. Each hardening feature does have its own option and arch
  opt-in if needed (CONFIG_KPKEYS_HARDENED_PGTABLES,
  CONFIG_ARCH_HAS_KPKEYS_HARDENED_PGTABLES).
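
Back to the first point, the hierarchical scheme could translate into
something like the following on arm64 (kpkeys_level_to_pkey() is a
made-up mapping from kpkeys level to pkey):

  static u64 por_set_hierarchical_level(u64 por, int level)
  {
          int lvl;

          /* Write access to pkeys at or below @level, RO above */
          for (lvl = KPKEYS_LVL_MIN; lvl <= KPKEYS_LVL_MAX; lvl++) {
                  u8 pkey = kpkeys_level_to_pkey(lvl);
                  /* The default pkey also covers code, hence X */
                  u64 perms = pkey == KPKEYS_PKEY_DEFAULT ? POE_RXW :
                              lvl <= level ? POE_RW : POE_R;

                  por = por_set_pkey_perms(por, pkey, perms);
          }

          return por;
  }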


Any comment or feedback will be highly appreciated, be it on the
high-level approach or implementation choices!

- Kevin

---
Changelog

RFC v2..v3:

- Patch 1: kpkeys_set_level() may now return KPKEYS_PKEY_REG_INVAL to indicate
  that the pkey register wasn't written to, and as a result that
  kpkeys_restore_pkey_reg() should do nothing. This simplifies the conditional
  guard macro and also allows architectures to skip writes to the pkey
  register if the target value is the same as the current one.

- Patch 1: introduced additional KPKEYS_GUARD* macros to cover more use-cases
  and reduce duplication.

- Patch 6: reject pkey value above arch_max_pkey().

- Patch 13: added missing guard(kpkeys_hardened_pgtables) in
  __clear_young_dirty_pte().

- Rebased on v6.14-rc1.

RFC v2: https://lore.kernel.org/linux-hardening/20250108103250.3188419-1-kevin.brodsky@arm.com/

RFC v1..v2:

- A new approach is used to set the pkey of page table pages. Thanks to
  Qi Zheng's and my own series [1][2], pagetable_*_ctor is
  systematically called when a PTP is allocated at any level (PTE to
  PGD), and pagetable_*_dtor when it is freed, on all architectures.
  Patch 11 makes use of this to call kpkeys_{,un}protect_pgtable_memory
  from the common ctor/dtor helper. The arm64 patches from v1 (patch 12
  and 13) are dropped as they are no longer needed. Patch 10 is
  introduced to allow pagetable_*_ctor to fail at all levels, since
  kpkeys_protect_pgtable_memory may itself fail.
  [Original suggestion by Peter Zijlstra]

- Changed the prototype of kpkeys_{,un}protect_pgtable_memory in patch 9
  to take a struct folio * for more convenience, and implemented them
  out-of-line to avoid a circular dependency with <linux/mm.h>.

- Rebased on next-20250107, which includes [1] and [2].

- Added locking in patch 8. [Peter Zijlstra's suggestion]

RFC v1: https://lore.kernel.org/linux-hardening/20241206101110.1646108-1-kevin.brodsky@arm.com/

[1] https://lore.kernel.org/linux-mm/cover.1736317725.git.zhengqi.arch@bytedance.com/
[2] https://lore.kernel.org/linux-mm/20250103184415.2744423-1-kevin.brodsky@arm.com/
---
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pierre Langlois <pierre.langlois@arm.com>
Cc: Quentin Perret <qperret@google.com>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org

Kevin Brodsky (15):
  mm: Introduce kpkeys
  set_memory: Introduce set_memory_pkey() stub
  arm64: mm: Enable overlays for all EL1 indirect permissions
  arm64: Introduce por_set_pkey_perms() helper
  arm64: Implement asm/kpkeys.h using POE
  arm64: set_memory: Implement set_memory_pkey()
  arm64: Enable kpkeys
  mm: Introduce kernel_pgtables_set_pkey()
  mm: Introduce kpkeys_hardened_pgtables
  mm: Allow __pagetable_ctor() to fail
  mm: Map page tables with privileged pkey
  arm64: kpkeys: Support KPKEYS_LVL_PGTABLES
  arm64: mm: Guard page table writes with kpkeys
  arm64: Enable kpkeys_hardened_pgtables support
  mm: Add basic tests for kpkeys_hardened_pgtables

 arch/arm64/Kconfig                    |   2 +
 arch/arm64/include/asm/kpkeys.h       |  45 ++++++++
 arch/arm64/include/asm/pgtable-prot.h |  16 +--
 arch/arm64/include/asm/pgtable.h      |  20 +++-
 arch/arm64/include/asm/por.h          |   9 ++
 arch/arm64/include/asm/set_memory.h   |   4 +
 arch/arm64/kernel/cpufeature.c        |   5 +-
 arch/arm64/kernel/smp.c               |   2 +
 arch/arm64/mm/fault.c                 |   2 +
 arch/arm64/mm/mmu.c                   |  28 ++---
 arch/arm64/mm/pageattr.c              |  25 +++++
 include/asm-generic/kpkeys.h          |  21 ++++
 include/asm-generic/pgalloc.h         |  15 ++-
 include/linux/kpkeys.h                | 156 ++++++++++++++++++++++++++
 include/linux/mm.h                    |  27 +++--
 include/linux/set_memory.h            |   7 ++
 mm/Kconfig                            |   5 +
 mm/Makefile                           |   2 +
 mm/kpkeys_hardened_pgtables.c         |  44 ++++++++
 mm/kpkeys_hardened_pgtables_test.c    |  72 ++++++++++++
 mm/memory.c                           | 137 ++++++++++++++++++++++
 security/Kconfig.hardening            |  24 ++++
 22 files changed, 625 insertions(+), 43 deletions(-)
 create mode 100644 arch/arm64/include/asm/kpkeys.h
 create mode 100644 include/asm-generic/kpkeys.h
 create mode 100644 include/linux/kpkeys.h
 create mode 100644 mm/kpkeys_hardened_pgtables.c
 create mode 100644 mm/kpkeys_hardened_pgtables_test.c


base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b
-- 
2.47.0




* [RFC PATCH v3 01/15] mm: Introduce kpkeys
  2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
@ 2025-02-03 10:18 ` Kevin Brodsky
  2025-02-03 10:18 ` [RFC PATCH v3 02/15] set_memory: Introduce set_memory_pkey() stub Kevin Brodsky
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-03 10:18 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Linus Walleij, Andy Lutomirski, Marc Zyngier,
	Peter Zijlstra, Pierre Langlois, Quentin Perret,
	Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

kpkeys is a simple framework to enable the use of protection keys
(pkeys) to harden the kernel itself. This patch introduces the basic
API in <linux/kpkeys.h>: a couple of functions to set and restore
the pkey register and macros to define guard objects.

kpkeys introduces a new concept on top of pkeys: the kpkeys level.
Each level is associated with a set of permissions for the pkeys
managed by the kpkeys framework. kpkeys_set_level(lvl) sets those
permissions according to lvl, and returns the original pkey
register, to be later restored by kpkeys_restore_pkey_reg(). To
start with, only KPKEYS_LVL_DEFAULT is available, which is meant
to grant RW access to KPKEYS_PKEY_DEFAULT (i.e. all memory since
this is the only available pkey for now).

Because each architecture implementing pkeys uses a different
representation for the pkey register, and may reserve certain pkeys
for specific uses, support for kpkeys must be explicitly indicated
by selecting ARCH_HAS_KPKEYS and defining the following functions in
<asm/kpkeys.h>, in addition to the macros provided in
<asm-generic/kpkeys.h>:

- arch_kpkeys_set_level()
- arch_kpkeys_restore_pkey_reg()
- arch_kpkeys_enabled()

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 include/asm-generic/kpkeys.h |  17 ++++++
 include/linux/kpkeys.h       | 113 +++++++++++++++++++++++++++++++++++
 mm/Kconfig                   |   2 +
 3 files changed, 132 insertions(+)
 create mode 100644 include/asm-generic/kpkeys.h
 create mode 100644 include/linux/kpkeys.h

diff --git a/include/asm-generic/kpkeys.h b/include/asm-generic/kpkeys.h
new file mode 100644
index 000000000000..ab819f157d6a
--- /dev/null
+++ b/include/asm-generic/kpkeys.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ASM_GENERIC_KPKEYS_H
+#define __ASM_GENERIC_KPKEYS_H
+
+#ifndef KPKEYS_PKEY_DEFAULT
+#define KPKEYS_PKEY_DEFAULT	0
+#endif
+
+/*
+ * Represents a pkey register value that cannot be used, typically disabling
+ * access to all keys.
+ */
+#ifndef KPKEYS_PKEY_REG_INVAL
+#define KPKEYS_PKEY_REG_INVAL	0
+#endif
+
+#endif	/* __ASM_GENERIC_KPKEYS_H */
diff --git a/include/linux/kpkeys.h b/include/linux/kpkeys.h
new file mode 100644
index 000000000000..62f897c65658
--- /dev/null
+++ b/include/linux/kpkeys.h
@@ -0,0 +1,113 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_KPKEYS_H
+#define _LINUX_KPKEYS_H
+
+#include <linux/bug.h>
+#include <linux/cleanup.h>
+
+#define KPKEYS_LVL_DEFAULT	0
+
+#define KPKEYS_LVL_MIN		KPKEYS_LVL_DEFAULT
+#define KPKEYS_LVL_MAX		KPKEYS_LVL_DEFAULT
+
+#define __KPKEYS_GUARD(name, set_level, restore_pkey_reg, set_arg, ...)	\
+	__DEFINE_CLASS_IS_CONDITIONAL(name, false);			\
+	DEFINE_CLASS(name, u64,						\
+		     restore_pkey_reg, set_level, set_arg);		\
+	static inline void *class_##name##_lock_ptr(u64 *_T)		\
+	{ return _T; }
+
+/**
+ * KPKEYS_GUARD_NOOP() - define a guard type that does nothing
+ * @name: the name of the guard type
+ * @cond_arg: an argument specification (optional)
+ *
+ * Define a guard type that does nothing, useful to match a real guard type
+ * that is defined under an #ifdef. @cond_arg may optionally be passed to match
+ * a guard defined using KPKEYS_GUARD_COND().
+ */
+#define KPKEYS_GUARD_NOOP(name, ...)					\
+	__KPKEYS_GUARD(name, 0, (void)_T, ##__VA_ARGS__, void)
+
+#ifdef CONFIG_ARCH_HAS_KPKEYS
+
+#include <asm/kpkeys.h>
+
+/**
+ * KPKEYS_GUARD_COND() - define a guard type that conditionally switches to
+ *                       a given kpkeys level
+ * @name: the name of the guard type
+ * @level: the kpkeys level to switch to
+ * @cond: an expression that is evaluated as condition
+ * @cond_arg: an argument specification for the condition (optional)
+ *
+ * Define a guard type that switches to @level if @cond evaluates to true, and
+ * does nothing otherwise. @cond_arg may be specified to give access to a
+ * caller-defined argument to @cond.
+ */
+#define KPKEYS_GUARD_COND(name, level, cond, ...)			\
+	__KPKEYS_GUARD(name,						\
+		       cond ? kpkeys_set_level(level)			\
+			    : KPKEYS_PKEY_REG_INVAL,			\
+		       kpkeys_restore_pkey_reg(_T),			\
+		       ##__VA_ARGS__, void)
+
+/**
+ * KPKEYS_GUARD() - define a guard type that switches to a given kpkeys level
+ *                  if kpkeys are enabled
+ * @name: the name of the guard type
+ * @level: the kpkeys level to switch to
+ *
+ * Define a guard type that switches to @level if the system supports kpkeys.
+ */
+#define KPKEYS_GUARD(name, level)					\
+	KPKEYS_GUARD_COND(name, level, arch_kpkeys_enabled())
+
+/**
+ * kpkeys_set_level() - switch kpkeys level
+ * @level: the level to switch to
+ *
+ * Switches the kpkeys level to the specified value. @level must be a
+ * compile-time constant. The arch-specific pkey register will be updated
+ * accordingly, and the original value returned.
+ *
+ * Return: the original pkey register value if the register was written to, or
+ *         KPKEYS_PKEY_REG_INVAL otherwise (no write to the register was
+ *         required).
+ */
+static inline u64 kpkeys_set_level(int level)
+{
+	BUILD_BUG_ON_MSG(!__builtin_constant_p(level),
+			 "kpkeys_set_level() only takes constant levels");
+	BUILD_BUG_ON_MSG(level < KPKEYS_LVL_MIN || level > KPKEYS_LVL_MAX,
+			 "Invalid level passed to kpkeys_set_level()");
+
+	return arch_kpkeys_set_level(level);
+}
+
+/**
+ * kpkeys_restore_pkey_reg() - restores a pkey register value
+ * @pkey_reg: the pkey register value to restore
+ *
+ * This function is meant to be passed the value returned by kpkeys_set_level(),
+ * in order to restore the pkey register to its original value (thus restoring
+ * the original kpkeys level).
+ */
+static inline void kpkeys_restore_pkey_reg(u64 pkey_reg)
+{
+	if (pkey_reg != KPKEYS_PKEY_REG_INVAL)
+		arch_kpkeys_restore_pkey_reg(pkey_reg);
+}
+
+#else /* CONFIG_ARCH_HAS_KPKEYS */
+
+#include <asm-generic/kpkeys.h>
+
+static inline bool arch_kpkeys_enabled(void)
+{
+	return false;
+}
+
+#endif /* CONFIG_ARCH_HAS_KPKEYS */
+
+#endif /* _LINUX_KPKEYS_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b501db06417..71edc478f111 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1147,6 +1147,8 @@ config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
 	bool
+config ARCH_HAS_KPKEYS
+	bool
 
 config ARCH_USES_PG_ARCH_2
 	bool
-- 
2.47.0




* [RFC PATCH v3 02/15] set_memory: Introduce set_memory_pkey() stub
  2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
  2025-02-03 10:18 ` [RFC PATCH v3 01/15] mm: Introduce kpkeys Kevin Brodsky
@ 2025-02-03 10:18 ` Kevin Brodsky
  2025-02-03 10:18 ` [RFC PATCH v3 03/15] arm64: mm: Enable overlays for all EL1 indirect permissions Kevin Brodsky
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-03 10:18 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Linus Walleij, Andy Lutomirski, Marc Zyngier,
	Peter Zijlstra, Pierre Langlois, Quentin Perret,
	Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

Introduce a new function, set_memory_pkey(), which sets the
protection key (pkey) of pages in the specified linear mapping
range. Architectures implementing kernel pkeys (kpkeys) must
provide a suitable implementation; an empty stub is added as a
fallback.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 include/linux/set_memory.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 3030d9245f5a..7b3a8bfde3c6 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -84,4 +84,11 @@ static inline int set_memory_decrypted(unsigned long addr, int numpages)
 }
 #endif /* CONFIG_ARCH_HAS_MEM_ENCRYPT */
 
+#ifndef CONFIG_ARCH_HAS_KPKEYS
+static inline int set_memory_pkey(unsigned long addr, int numpages, int pkey)
+{
+	return 0;
+}
+#endif
+
 #endif /* _LINUX_SET_MEMORY_H_ */
-- 
2.47.0




* [RFC PATCH v3 03/15] arm64: mm: Enable overlays for all EL1 indirect permissions
  2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
  2025-02-03 10:18 ` [RFC PATCH v3 01/15] mm: Introduce kpkeys Kevin Brodsky
  2025-02-03 10:18 ` [RFC PATCH v3 02/15] set_memory: Introduce set_memory_pkey() stub Kevin Brodsky
@ 2025-02-03 10:18 ` Kevin Brodsky
  2025-02-03 10:18 ` [RFC PATCH v3 04/15] arm64: Introduce por_set_pkey_perms() helper Kevin Brodsky
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-03 10:18 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Linus Walleij, Andy Lutomirski, Marc Zyngier,
	Peter Zijlstra, Pierre Langlois, Quentin Perret,
	Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

In preparation for using POE inside the kernel, enable "Overlay
applied" for all stage 1 base permissions in PIR_EL1. This ensures
that the permissions set in POR_EL1 affect all kernel mappings.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/pgtable-prot.h | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index a95f1f77bb39..7c0c30460900 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -181,13 +181,13 @@ static inline bool __pure lpa2_is_enabled(void)
 	PIRx_ELx_PERM(pte_pi_index(_PAGE_GCS),           PIE_NONE_O) | \
 	PIRx_ELx_PERM(pte_pi_index(_PAGE_GCS_RO),        PIE_NONE_O) | \
 	PIRx_ELx_PERM(pte_pi_index(_PAGE_EXECONLY),      PIE_NONE_O) | \
-	PIRx_ELx_PERM(pte_pi_index(_PAGE_READONLY_EXEC), PIE_R)      | \
-	PIRx_ELx_PERM(pte_pi_index(_PAGE_SHARED_EXEC),   PIE_RW)     | \
-	PIRx_ELx_PERM(pte_pi_index(_PAGE_READONLY),      PIE_R)      | \
-	PIRx_ELx_PERM(pte_pi_index(_PAGE_SHARED),        PIE_RW)     | \
-	PIRx_ELx_PERM(pte_pi_index(_PAGE_KERNEL_ROX),    PIE_RX)     | \
-	PIRx_ELx_PERM(pte_pi_index(_PAGE_KERNEL_EXEC),   PIE_RWX)    | \
-	PIRx_ELx_PERM(pte_pi_index(_PAGE_KERNEL_RO),     PIE_R)      | \
-	PIRx_ELx_PERM(pte_pi_index(_PAGE_KERNEL),        PIE_RW))
+	PIRx_ELx_PERM(pte_pi_index(_PAGE_READONLY_EXEC), PIE_R_O)    | \
+	PIRx_ELx_PERM(pte_pi_index(_PAGE_SHARED_EXEC),   PIE_RW_O)   | \
+	PIRx_ELx_PERM(pte_pi_index(_PAGE_READONLY),      PIE_R_O)    | \
+	PIRx_ELx_PERM(pte_pi_index(_PAGE_SHARED),        PIE_RW_O)   | \
+	PIRx_ELx_PERM(pte_pi_index(_PAGE_KERNEL_ROX),    PIE_RX_O)   | \
+	PIRx_ELx_PERM(pte_pi_index(_PAGE_KERNEL_EXEC),   PIE_RWX_O)  | \
+	PIRx_ELx_PERM(pte_pi_index(_PAGE_KERNEL_RO),     PIE_R_O)    | \
+	PIRx_ELx_PERM(pte_pi_index(_PAGE_KERNEL),        PIE_RW_O))
 
 #endif /* __ASM_PGTABLE_PROT_H */
-- 
2.47.0




* [RFC PATCH v3 04/15] arm64: Introduce por_set_pkey_perms() helper
  2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
                   ` (2 preceding siblings ...)
  2025-02-03 10:18 ` [RFC PATCH v3 03/15] arm64: mm: Enable overlays for all EL1 indirect permissions Kevin Brodsky
@ 2025-02-03 10:18 ` Kevin Brodsky
  2025-02-03 10:18 ` [RFC PATCH v3 05/15] arm64: Implement asm/kpkeys.h using POE Kevin Brodsky
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-03 10:18 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Linus Walleij, Andy Lutomirski, Marc Zyngier,
	Peter Zijlstra, Pierre Langlois, Quentin Perret,
	Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

Introduce a helper that sets the permissions of a given pkey
(POIndex) in the POR_ELx format, and make use of it in
arch_set_user_pkey_access().

Also ensure that <asm/sysreg.h> is included in asm/por.h to provide
the POE_* definitions.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/por.h |  9 +++++++++
 arch/arm64/mm/mmu.c          | 28 ++++++++++------------------
 2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/include/asm/por.h b/arch/arm64/include/asm/por.h
index e06e9f473675..7f0d73980cce 100644
--- a/arch/arm64/include/asm/por.h
+++ b/arch/arm64/include/asm/por.h
@@ -6,6 +6,8 @@
 #ifndef _ASM_ARM64_POR_H
 #define _ASM_ARM64_POR_H
 
+#include <asm/sysreg.h>
+
 #define POR_BITS_PER_PKEY		4
 #define POR_ELx_IDX(por_elx, idx)	(((por_elx) >> ((idx) * POR_BITS_PER_PKEY)) & 0xf)
 
@@ -30,4 +32,11 @@ static inline bool por_elx_allows_exec(u64 por, u8 pkey)
 	return perm & POE_X;
 }
 
+static inline u64 por_set_pkey_perms(u64 por, u8 pkey, u64 perms)
+{
+	u64 shift = pkey * POR_BITS_PER_PKEY;
+
+	return (por & ~(POE_MASK << shift)) | (perms << shift);
+}
+
 #endif /* _ASM_ARM64_POR_H */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index b4df5bc5b1b8..9547183d86cf 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1555,9 +1555,8 @@ void __cpu_replace_ttbr1(pgd_t *pgdp, bool cnp)
 #ifdef CONFIG_ARCH_HAS_PKEYS
 int arch_set_user_pkey_access(struct task_struct *tsk, int pkey, unsigned long init_val)
 {
-	u64 new_por = POE_RXW;
-	u64 old_por;
-	u64 pkey_shift;
+	u64 new_perms;
+	u64 por;
 
 	if (!system_supports_poe())
 		return -ENOSPC;
@@ -1571,26 +1570,19 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey, unsigned long i
 		return -EINVAL;
 
 	/* Set the bits we need in POR:  */
-	new_por = POE_RXW;
+	new_perms = POE_RXW;
 	if (init_val & PKEY_DISABLE_WRITE)
-		new_por &= ~POE_W;
+		new_perms &= ~POE_W;
 	if (init_val & PKEY_DISABLE_ACCESS)
-		new_por &= ~POE_RW;
+		new_perms &= ~POE_RW;
 	if (init_val & PKEY_DISABLE_READ)
-		new_por &= ~POE_R;
+		new_perms &= ~POE_R;
 	if (init_val & PKEY_DISABLE_EXECUTE)
-		new_por &= ~POE_X;
+		new_perms &= ~POE_X;
 
-	/* Shift the bits in to the correct place in POR for pkey: */
-	pkey_shift = pkey * POR_BITS_PER_PKEY;
-	new_por <<= pkey_shift;
-
-	/* Get old POR and mask off any old bits in place: */
-	old_por = read_sysreg_s(SYS_POR_EL0);
-	old_por &= ~(POE_MASK << pkey_shift);
-
-	/* Write old part along with new part: */
-	write_sysreg_s(old_por | new_por, SYS_POR_EL0);
+	por = read_sysreg_s(SYS_POR_EL0);
+	por = por_set_pkey_perms(por, pkey, new_perms);
+	write_sysreg_s(por, SYS_POR_EL0);
 
 	return 0;
 }
-- 
2.47.0




* [RFC PATCH v3 05/15] arm64: Implement asm/kpkeys.h using POE
  2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
                   ` (3 preceding siblings ...)
  2025-02-03 10:18 ` [RFC PATCH v3 04/15] arm64: Introduce por_set_pkey_perms() helper Kevin Brodsky
@ 2025-02-03 10:18 ` Kevin Brodsky
  2025-02-03 10:18 ` [RFC PATCH v3 06/15] arm64: set_memory: Implement set_memory_pkey() Kevin Brodsky
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-03 10:18 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Linus Walleij, Andy Lutomirski, Marc Zyngier,
	Peter Zijlstra, Pierre Langlois, Quentin Perret,
	Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

Implement the kpkeys interface if CONFIG_ARM64_POE is enabled.
The permissions for KPKEYS_PKEY_DEFAULT (pkey 0) are set to RWX as
this pkey is also used for code mappings.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/kpkeys.h | 43 +++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)
 create mode 100644 arch/arm64/include/asm/kpkeys.h

diff --git a/arch/arm64/include/asm/kpkeys.h b/arch/arm64/include/asm/kpkeys.h
new file mode 100644
index 000000000000..e17f6df41873
--- /dev/null
+++ b/arch/arm64/include/asm/kpkeys.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ASM_KPKEYS_H
+#define __ASM_KPKEYS_H
+
+#include <asm/barrier.h>
+#include <asm/cpufeature.h>
+#include <asm/por.h>
+
+#include <asm-generic/kpkeys.h>
+
+static inline bool arch_kpkeys_enabled(void)
+{
+	return system_supports_poe();
+}
+
+#ifdef CONFIG_ARM64_POE
+
+static inline u64 por_set_kpkeys_level(u64 por, int level)
+{
+	por = por_set_pkey_perms(por, KPKEYS_PKEY_DEFAULT, POE_RXW);
+
+	return por;
+}
+
+static inline int arch_kpkeys_set_level(int level)
+{
+	u64 prev_por = read_sysreg_s(SYS_POR_EL1);
+
+	write_sysreg_s(por_set_kpkeys_level(prev_por, level), SYS_POR_EL1);
+	isb();
+
+	return prev_por;
+}
+
+static inline void arch_kpkeys_restore_pkey_reg(u64 pkey_reg)
+{
+	write_sysreg_s(pkey_reg, SYS_POR_EL1);
+	isb();
+}
+
+#endif /* CONFIG_ARM64_POE */
+
+#endif	/* __ASM_KPKEYS_H */
-- 
2.47.0




* [RFC PATCH v3 06/15] arm64: set_memory: Implement set_memory_pkey()
  2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
                   ` (4 preceding siblings ...)
  2025-02-03 10:18 ` [RFC PATCH v3 05/15] arm64: Implement asm/kpkeys.h using POE Kevin Brodsky
@ 2025-02-03 10:18 ` Kevin Brodsky
  2025-02-03 10:18 ` [RFC PATCH v3 07/15] arm64: Enable kpkeys Kevin Brodsky
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-03 10:18 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Linus Walleij, Andy Lutomirski, Marc Zyngier,
	Peter Zijlstra, Pierre Langlois, Quentin Perret,
	Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

Implement set_memory_pkey() using POE if supported.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/set_memory.h |  4 ++++
 arch/arm64/mm/pageattr.c            | 25 +++++++++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/arch/arm64/include/asm/set_memory.h b/arch/arm64/include/asm/set_memory.h
index 90f61b17275e..b6cd6de34abf 100644
--- a/arch/arm64/include/asm/set_memory.h
+++ b/arch/arm64/include/asm/set_memory.h
@@ -19,4 +19,8 @@ bool kernel_page_present(struct page *page);
 int set_memory_encrypted(unsigned long addr, int numpages);
 int set_memory_decrypted(unsigned long addr, int numpages);
 
+#ifdef CONFIG_ARCH_HAS_KPKEYS
+int set_memory_pkey(unsigned long addr, int numpages, int pkey);
+#endif
+
 #endif /* _ASM_ARM64_SET_MEMORY_H */
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 39fd1f7ff02a..9721a74adbe2 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -8,6 +8,7 @@
 #include <linux/mem_encrypt.h>
 #include <linux/sched.h>
 #include <linux/vmalloc.h>
+#include <linux/pkeys.h>
 
 #include <asm/cacheflush.h>
 #include <asm/pgtable-prot.h>
@@ -292,6 +293,30 @@ int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
 	return set_memory_valid(addr, nr, valid);
 }
 
+#ifdef CONFIG_ARCH_HAS_KPKEYS
+int set_memory_pkey(unsigned long addr, int numpages, int pkey)
+{
+	unsigned long set_prot = 0;
+
+	if (!system_supports_poe())
+		return 0;
+
+	if (!__is_lm_address(addr))
+		return -EINVAL;
+
+	if (pkey >= arch_max_pkey())
+		return -EINVAL;
+
+	set_prot |= pkey & BIT(0) ? PTE_PO_IDX_0 : 0;
+	set_prot |= pkey & BIT(1) ? PTE_PO_IDX_1 : 0;
+	set_prot |= pkey & BIT(2) ? PTE_PO_IDX_2 : 0;
+
+	return __change_memory_common(addr, PAGE_SIZE * numpages,
+				      __pgprot(set_prot),
+				      __pgprot(PTE_PO_IDX_MASK));
+}
+#endif
+
 #ifdef CONFIG_DEBUG_PAGEALLOC
 /*
  * This is - apart from the return value - doing the same
-- 
2.47.0




* [RFC PATCH v3 07/15] arm64: Enable kpkeys
  2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
                   ` (5 preceding siblings ...)
  2025-02-03 10:18 ` [RFC PATCH v3 06/15] arm64: set_memory: Implement set_memory_pkey() Kevin Brodsky
@ 2025-02-03 10:18 ` Kevin Brodsky
  2025-02-03 10:18 ` [RFC PATCH v3 08/15] mm: Introduce kernel_pgtables_set_pkey() Kevin Brodsky
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-03 10:18 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Linus Walleij, Andy Lutomirski, Marc Zyngier,
	Peter Zijlstra, Pierre Langlois, Quentin Perret,
	Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

This is the final step to enable kpkeys on arm64. We enable
POE at EL1 by setting TCR2_EL1.POE, and initialise POR_EL1 so that
it enables access to the default pkey/POIndex (default kpkeys
level). An ISB is added so that POE restrictions are enforced
immediately.

Having done this, we can now select ARCH_HAS_KPKEYS if ARM64_POE is
enabled.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/Kconfig             | 1 +
 arch/arm64/kernel/cpufeature.c | 5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fcdd0ed3eca8..34f15348ada8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2186,6 +2186,7 @@ config ARM64_POE
 	def_bool y
 	select ARCH_USES_HIGH_VMA_FLAGS
 	select ARCH_HAS_PKEYS
+	select ARCH_HAS_KPKEYS
 	help
 	  The Permission Overlay Extension is used to implement Memory
 	  Protection Keys. Memory Protection Keys provides a mechanism for
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 4eb7c6698ae4..28b8c93c60c7 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -76,6 +76,7 @@
 #include <linux/kasan.h>
 #include <linux/percpu.h>
 #include <linux/sched/isolation.h>
+#include <linux/kpkeys.h>
 
 #include <asm/cpu.h>
 #include <asm/cpufeature.h>
@@ -2385,8 +2386,10 @@ static void cpu_enable_mops(const struct arm64_cpu_capabilities *__unused)
 #ifdef CONFIG_ARM64_POE
 static void cpu_enable_poe(const struct arm64_cpu_capabilities *__unused)
 {
-	sysreg_clear_set(REG_TCR2_EL1, 0, TCR2_EL1_E0POE);
+	write_sysreg_s(por_set_kpkeys_level(0, KPKEYS_LVL_DEFAULT), SYS_POR_EL1);
+	sysreg_clear_set(REG_TCR2_EL1, 0, TCR2_EL1_E0POE | TCR2_EL1_POE);
 	sysreg_clear_set(CPACR_EL1, 0, CPACR_EL1_E0POE);
+	isb();
 }
 #endif
 
-- 
2.47.0




* [RFC PATCH v3 08/15] mm: Introduce kernel_pgtables_set_pkey()
  2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
                   ` (6 preceding siblings ...)
  2025-02-03 10:18 ` [RFC PATCH v3 07/15] arm64: Enable kpkeys Kevin Brodsky
@ 2025-02-03 10:18 ` Kevin Brodsky
  2025-02-06 19:01   ` Linus Walleij
  2025-02-03 10:18 ` [RFC PATCH v3 09/15] mm: Introduce kpkeys_hardened_pgtables Kevin Brodsky
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-03 10:18 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Linus Walleij, Andy Lutomirski, Marc Zyngier,
	Peter Zijlstra, Pierre Langlois, Quentin Perret,
	Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

kernel_pgtables_set_pkey() allows setting the pkey of all page table
pages in swapper_pg_dir, recursively. This will be needed by
kpkeys_hardened_pgtables, as it relies on all PTPs being mapped with
a non-default pkey. Those initial kernel page tables cannot
practically be assigned a non-default pkey right when they are
allocated, so mutating them during (early) boot is required.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 include/linux/mm.h |   2 +
 mm/memory.c        | 137 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 139 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7b1068ddcbb7..c3998b78f6a5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4155,4 +4155,6 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
 int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
 int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
 
+int kernel_pgtables_set_pkey(int pkey);
+
 #endif /* _LINUX_MM_H */
diff --git a/mm/memory.c b/mm/memory.c
index 539c0f7c6d54..1386b9cfb459 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -77,6 +77,8 @@
 #include <linux/vmalloc.h>
 #include <linux/sched/sysctl.h>
 #include <linux/fsnotify.h>
+#include <linux/kpkeys.h>
+#include <linux/set_memory.h>
 
 #include <trace/events/kmem.h>
 
@@ -7066,3 +7068,138 @@ void vma_pgtable_walk_end(struct vm_area_struct *vma)
 	if (is_vm_hugetlb_page(vma))
 		hugetlb_vma_unlock_read(vma);
 }
+
+static int set_page_pkey(void *p, int pkey)
+{
+	unsigned long addr = (unsigned long)p;
+
+	/*
+	 * swapper_pg_dir itself will be made read-only by mark_rodata_ro()
+	 * so there is no point in changing its pkey.
+	 */
+	if (p == swapper_pg_dir)
+		return 0;
+
+	return set_memory_pkey(addr, 1, pkey);
+}
+
+static int set_pkey_pte(pmd_t *pmd, int pkey)
+{
+	pte_t *pte;
+	int err;
+
+	pte = pte_offset_kernel(pmd, 0);
+	err = set_page_pkey(pte, pkey);
+
+	return err;
+}
+
+static int set_pkey_pmd(pud_t *pud, int pkey)
+{
+	pmd_t *pmd;
+	int i, err = 0;
+
+	pmd = pmd_offset(pud, 0);
+
+	err = set_page_pkey(pmd, pkey);
+	if (err)
+		return err;
+
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (pmd_none(pmd[i]) || pmd_bad(pmd[i]) || pmd_leaf(pmd[i]))
+			continue;
+		err = set_pkey_pte(&pmd[i], pkey);
+		if (err)
+			break;
+	}
+
+	return err;
+}
+
+static int set_pkey_pud(p4d_t *p4d, int pkey)
+{
+	pud_t *pud;
+	int i, err = 0;
+
+	if (mm_pmd_folded(&init_mm))
+		return set_pkey_pmd((pud_t *)p4d, pkey);
+
+	pud = pud_offset(p4d, 0);
+
+	err = set_page_pkey(pud, pkey);
+	if (err)
+		return err;
+
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		if (pud_none(pud[i]) || pud_bad(pud[i]) || pud_leaf(pud[i]))
+			continue;
+		err = set_pkey_pmd(&pud[i], pkey);
+		if (err)
+			break;
+	}
+
+	return err;
+}
+
+static int set_pkey_p4d(pgd_t *pgd, int pkey)
+{
+	p4d_t *p4d;
+	int i, err = 0;
+
+	if (mm_pud_folded(&init_mm))
+		return set_pkey_pud((p4d_t *)pgd, pkey);
+
+	p4d = p4d_offset(pgd, 0);
+
+	err = set_page_pkey(p4d, pkey);
+	if (err)
+		return err;
+
+	for (i = 0; i < PTRS_PER_P4D; i++) {
+		if (p4d_none(p4d[i]) || p4d_bad(p4d[i]) || p4d_leaf(p4d[i]))
+			continue;
+		err = set_pkey_pud(&p4d[i], pkey);
+		if (err)
+			break;
+	}
+
+	return err;
+}
+
+/**
+ * kernel_pgtables_set_pkey - set pkey for all kernel page table pages
+ * @pkey: pkey to set the page table pages to
+ *
+ * Walks swapper_pg_dir setting the protection key of every page table page (at
+ * all levels) to @pkey. swapper_pg_dir itself is left untouched as it is
+ * expected to be mapped read-only by mark_rodata_ro().
+ *
+ * No-op if the architecture does not support kpkeys.
+ */
+int kernel_pgtables_set_pkey(int pkey)
+{
+	pgd_t *pgd = swapper_pg_dir;
+	int i, err = 0;
+
+	if (!arch_kpkeys_enabled())
+		return 0;
+
+	spin_lock(&init_mm.page_table_lock);
+
+	if (mm_p4d_folded(&init_mm)) {
+		err = set_pkey_p4d(pgd, pkey);
+		goto out;
+	}
+
+	for (i = 0; i < PTRS_PER_PGD; i++) {
+		if (pgd_none(pgd[i]) || pgd_bad(pgd[i]) || pgd_leaf(pgd[i]))
+			continue;
+		err = set_pkey_p4d(&pgd[i], pkey);
+		if (err)
+			break;
+	}
+
+out:
+	spin_unlock(&init_mm.page_table_lock);
+	return err;
+}
-- 
2.47.0




* [RFC PATCH v3 09/15] mm: Introduce kpkeys_hardened_pgtables
  2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
                   ` (7 preceding siblings ...)
  2025-02-03 10:18 ` [RFC PATCH v3 08/15] mm: Introduce kernel_pgtables_set_pkey() Kevin Brodsky
@ 2025-02-03 10:18 ` Kevin Brodsky
  2025-02-03 10:18 ` [RFC PATCH v3 10/15] mm: Allow __pagetable_ctor() to fail Kevin Brodsky
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-03 10:18 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Linus Walleij, Andy Lutomirski, Marc Zyngier,
	Peter Zijlstra, Pierre Langlois, Quentin Perret,
	Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

kpkeys_hardened_pgtables is a hardening feature based on kpkeys. It
aims to prevent the corruption of page tables by: 1. mapping all
page table pages, both kernel and user, with a privileged pkey
(KPKEYS_PKEY_PGTABLES), and 2. granting write access to that pkey
only when running at a higher kpkeys level (KPKEYS_LVL_PGTABLES).

The feature is exposed as CONFIG_KPKEYS_HARDENED_PGTABLES; it
requires explicit architecture opt-in by selecting
ARCH_HAS_KPKEYS_HARDENED_PGTABLES, since much of the page table
handling is arch-specific.

This patch introduces an API to modify the PTPs' pkey and switch
kpkeys level using a guard object. Because this API is going to be
called from low-level pgtable helpers (setters, allocators), it must
be inactive on boot and explicitly switched on if and when kpkeys
become available. A static key is used for that purpose; it is the
responsibility of each architecture supporting
kpkeys_hardened_pgtables to call kpkeys_hardened_pgtables_enable()
as early as possible to switch on that static key. The initial
kernel page tables are also walked to set their pkey, since they
have already been allocated at that point.

The definition of the kpkeys_hardened_pgtables guard class does not
use the static key on the restore path to avoid mismatched
set/restore pairs. Indeed, enabling the static key itself involves
modifying page tables, and it is thus possible that the guard object
is created when the static key appears as false, and destroyed when it
appears as true. To avoid this situation, we reserve an invalid value
for the pkey register and use it to disable the restore path.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 include/asm-generic/kpkeys.h  |  4 ++++
 include/linux/kpkeys.h        | 45 ++++++++++++++++++++++++++++++++++-
 mm/Kconfig                    |  3 +++
 mm/Makefile                   |  1 +
 mm/kpkeys_hardened_pgtables.c | 44 ++++++++++++++++++++++++++++++++++
 security/Kconfig.hardening    | 12 ++++++++++
 6 files changed, 108 insertions(+), 1 deletion(-)
 create mode 100644 mm/kpkeys_hardened_pgtables.c

diff --git a/include/asm-generic/kpkeys.h b/include/asm-generic/kpkeys.h
index ab819f157d6a..cec92334a9f3 100644
--- a/include/asm-generic/kpkeys.h
+++ b/include/asm-generic/kpkeys.h
@@ -2,6 +2,10 @@
 #ifndef __ASM_GENERIC_KPKEYS_H
 #define __ASM_GENERIC_KPKEYS_H
 
+#ifndef KPKEYS_PKEY_PGTABLES
+#define KPKEYS_PKEY_PGTABLES	1
+#endif
+
 #ifndef KPKEYS_PKEY_DEFAULT
 #define KPKEYS_PKEY_DEFAULT	0
 #endif
diff --git a/include/linux/kpkeys.h b/include/linux/kpkeys.h
index 62f897c65658..645eaf00096c 100644
--- a/include/linux/kpkeys.h
+++ b/include/linux/kpkeys.h
@@ -4,11 +4,15 @@
 
 #include <linux/bug.h>
 #include <linux/cleanup.h>
+#include <linux/jump_label.h>
+
+struct folio;
 
 #define KPKEYS_LVL_DEFAULT	0
+#define KPKEYS_LVL_PGTABLES	1
 
 #define KPKEYS_LVL_MIN		KPKEYS_LVL_DEFAULT
-#define KPKEYS_LVL_MAX		KPKEYS_LVL_DEFAULT
+#define KPKEYS_LVL_MAX		KPKEYS_LVL_PGTABLES
 
 #define __KPKEYS_GUARD(name, set_level, restore_pkey_reg, set_arg, ...)	\
 	__DEFINE_CLASS_IS_CONDITIONAL(name, false);			\
@@ -110,4 +114,43 @@ static inline bool arch_kpkeys_enabled(void)
 
 #endif /* CONFIG_ARCH_HAS_KPKEYS */
 
+#ifdef CONFIG_KPKEYS_HARDENED_PGTABLES
+
+DECLARE_STATIC_KEY_FALSE(kpkeys_hardened_pgtables_enabled);
+
+/*
+ * Use guard(kpkeys_hardened_pgtables)() to temporarily grant write access
+ * to page tables.
+ */
+KPKEYS_GUARD_COND(kpkeys_hardened_pgtables, KPKEYS_LVL_PGTABLES,
+		  static_branch_unlikely(&kpkeys_hardened_pgtables_enabled))
+
+int kpkeys_protect_pgtable_memory(struct folio *folio);
+int kpkeys_unprotect_pgtable_memory(struct folio *folio);
+
+/*
+ * Enables kpkeys_hardened_pgtables and switches existing kernel page tables to
+ * a privileged pkey (KPKEYS_PKEY_PGTABLES).
+ *
+ * Should be called as early as possible by architecture code, after (k)pkeys
+ * are initialised and before any user task is spawned.
+ */
+void kpkeys_hardened_pgtables_enable(void);
+
+#else /* CONFIG_KPKEYS_HARDENED_PGTABLES */
+
+KPKEYS_GUARD_NOOP(kpkeys_hardened_pgtables)
+
+static inline int kpkeys_protect_pgtable_memory(struct folio *folio)
+{
+	return 0;
+}
+static inline int kpkeys_unprotect_pgtable_memory(struct folio *folio)
+{
+	return 0;
+}
+static inline void kpkeys_hardened_pgtables_enable(void) {}
+
+#endif /* CONFIG_KPKEYS_HARDENED_PGTABLES */
+
 #endif /* _LINUX_KPKEYS_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 71edc478f111..2a8ebe780e64 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1149,6 +1149,9 @@ config ARCH_HAS_PKEYS
 	bool
 config ARCH_HAS_KPKEYS
 	bool
+# ARCH_HAS_KPKEYS must be selected when selecting this option
+config ARCH_HAS_KPKEYS_HARDENED_PGTABLES
+	bool
 
 config ARCH_USES_PG_ARCH_2
 	bool
diff --git a/mm/Makefile b/mm/Makefile
index 850386a67b3e..130691364172 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -147,3 +147,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
+obj-$(CONFIG_KPKEYS_HARDENED_PGTABLES) += kpkeys_hardened_pgtables.o
diff --git a/mm/kpkeys_hardened_pgtables.c b/mm/kpkeys_hardened_pgtables.c
new file mode 100644
index 000000000000..c6eb7fb6ae56
--- /dev/null
+++ b/mm/kpkeys_hardened_pgtables.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/mm.h>
+#include <linux/kpkeys.h>
+#include <linux/set_memory.h>
+
+DEFINE_STATIC_KEY_FALSE(kpkeys_hardened_pgtables_enabled);
+
+int kpkeys_protect_pgtable_memory(struct folio *folio)
+{
+	unsigned long addr = (unsigned long)folio_address(folio);
+	unsigned int order = folio_order(folio);
+	int ret = 0;
+
+	if (static_branch_unlikely(&kpkeys_hardened_pgtables_enabled))
+		ret = set_memory_pkey(addr, 1 << order, KPKEYS_PKEY_PGTABLES);
+
+	WARN_ON(ret);
+	return ret;
+}
+
+int kpkeys_unprotect_pgtable_memory(struct folio *folio)
+{
+	unsigned long addr = (unsigned long)folio_address(folio);
+	unsigned int order = folio_order(folio);
+	int ret = 0;
+
+	if (static_branch_unlikely(&kpkeys_hardened_pgtables_enabled))
+		ret = set_memory_pkey(addr, 1 << order, KPKEYS_PKEY_DEFAULT);
+
+	WARN_ON(ret);
+	return ret;
+}
+
+void __init kpkeys_hardened_pgtables_enable(void)
+{
+	int ret;
+
+	if (!arch_kpkeys_enabled())
+		return;
+
+	static_branch_enable(&kpkeys_hardened_pgtables_enabled);
+	ret = kernel_pgtables_set_pkey(KPKEYS_PKEY_PGTABLES);
+	WARN_ON(ret);
+}
diff --git a/security/Kconfig.hardening b/security/Kconfig.hardening
index b56e001e0c6a..f729488eac56 100644
--- a/security/Kconfig.hardening
+++ b/security/Kconfig.hardening
@@ -301,6 +301,18 @@ config BUG_ON_DATA_CORRUPTION
 
 	  If unsure, say N.
 
+config KPKEYS_HARDENED_PGTABLES
+	bool "Harden page tables using kernel pkeys"
+	depends on ARCH_HAS_KPKEYS_HARDENED_PGTABLES
+	help
+	  This option makes all page tables mostly read-only by
+	  allocating them with a non-default protection key (pkey) and
+	  only enabling write access to that pkey in routines that are
+	  expected to write to page table entries.
+
+	  This option has no effect if the system does not support
+	  kernel pkeys.
+
 endmenu
 
 config CC_HAS_RANDSTRUCT
-- 
2.47.0




* [RFC PATCH v3 10/15] mm: Allow __pagetable_ctor() to fail
  2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
                   ` (8 preceding siblings ...)
  2025-02-03 10:18 ` [RFC PATCH v3 09/15] mm: Introduce kpkeys_hardened_pgtables Kevin Brodsky
@ 2025-02-03 10:18 ` Kevin Brodsky
  2025-02-03 10:18 ` [RFC PATCH v3 11/15] mm: Map page tables with privileged pkey Kevin Brodsky
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-03 10:18 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Linus Walleij, Andy Lutomirski, Marc Zyngier,
	Peter Zijlstra, Pierre Langlois, Quentin Perret,
	Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

In preparation for adding construction hooks (that may fail) to
__pagetable_ctor(), make __pagetable_ctor() return a bool,
propagate it to pagetable_*_ctor() and handle failure in
the generic {pud,p4d,pgd}_alloc.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 include/asm-generic/pgalloc.h | 15 ++++++++++++---
 include/linux/mm.h            | 21 ++++++++++-----------
 2 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 892ece4558a2..9962f7454d0c 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -173,7 +173,10 @@ static inline pud_t *__pud_alloc_one_noprof(struct mm_struct *mm, unsigned long
 	if (!ptdesc)
 		return NULL;
 
-	pagetable_pud_ctor(ptdesc);
+	if (!pagetable_pud_ctor(ptdesc)) {
+		pagetable_free(ptdesc);
+		return NULL;
+	}
 	return ptdesc_address(ptdesc);
 }
 #define __pud_alloc_one(...)	alloc_hooks(__pud_alloc_one_noprof(__VA_ARGS__))
@@ -227,7 +230,10 @@ static inline p4d_t *__p4d_alloc_one_noprof(struct mm_struct *mm, unsigned long
 	if (!ptdesc)
 		return NULL;
 
-	pagetable_p4d_ctor(ptdesc);
+	if (!pagetable_p4d_ctor(ptdesc)) {
+		pagetable_free(ptdesc);
+		return NULL;
+	}
 	return ptdesc_address(ptdesc);
 }
 #define __p4d_alloc_one(...)	alloc_hooks(__p4d_alloc_one_noprof(__VA_ARGS__))
@@ -271,7 +277,10 @@ static inline pgd_t *__pgd_alloc_noprof(struct mm_struct *mm, unsigned int order
 	if (!ptdesc)
 		return NULL;
 
-	pagetable_pgd_ctor(ptdesc);
+	if (!pagetable_pgd_ctor(ptdesc)) {
+		pagetable_free(ptdesc);
+		return NULL;
+	}
 	return ptdesc_address(ptdesc);
 }
 #define __pgd_alloc(...)	alloc_hooks(__pgd_alloc_noprof(__VA_ARGS__))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c3998b78f6a5..721e779647f3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2992,12 +2992,13 @@ static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
 static inline void ptlock_free(struct ptdesc *ptdesc) {}
 #endif /* defined(CONFIG_SPLIT_PTE_PTLOCKS) */
 
-static inline void __pagetable_ctor(struct ptdesc *ptdesc)
+static inline bool __pagetable_ctor(struct ptdesc *ptdesc)
 {
 	struct folio *folio = ptdesc_folio(ptdesc);
 
 	__folio_set_pgtable(folio);
 	lruvec_stat_add_folio(folio, NR_PAGETABLE);
+	return true;
 }
 
 static inline void pagetable_dtor(struct ptdesc *ptdesc)
@@ -3019,8 +3020,7 @@ static inline bool pagetable_pte_ctor(struct ptdesc *ptdesc)
 {
 	if (!ptlock_init(ptdesc))
 		return false;
-	__pagetable_ctor(ptdesc);
-	return true;
+	return __pagetable_ctor(ptdesc);
 }
 
 pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp);
@@ -3126,8 +3126,7 @@ static inline bool pagetable_pmd_ctor(struct ptdesc *ptdesc)
 	if (!pmd_ptlock_init(ptdesc))
 		return false;
 	ptdesc_pmd_pts_init(ptdesc);
-	__pagetable_ctor(ptdesc);
-	return true;
+	return __pagetable_ctor(ptdesc);
 }
 
 /*
@@ -3149,19 +3148,19 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
 	return ptl;
 }
 
-static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
+static inline bool pagetable_pud_ctor(struct ptdesc *ptdesc)
 {
-	__pagetable_ctor(ptdesc);
+	return __pagetable_ctor(ptdesc);
 }
 
-static inline void pagetable_p4d_ctor(struct ptdesc *ptdesc)
+static inline bool pagetable_p4d_ctor(struct ptdesc *ptdesc)
 {
-	__pagetable_ctor(ptdesc);
+	return __pagetable_ctor(ptdesc);
 }
 
-static inline void pagetable_pgd_ctor(struct ptdesc *ptdesc)
+static inline bool pagetable_pgd_ctor(struct ptdesc *ptdesc)
 {
-	__pagetable_ctor(ptdesc);
+	return __pagetable_ctor(ptdesc);
 }
 
 extern void __init pagecache_init(void);
-- 
2.47.0



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 11/15] mm: Map page tables with privileged pkey
  2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
                   ` (9 preceding siblings ...)
  2025-02-03 10:18 ` [RFC PATCH v3 10/15] mm: Allow __pagetable_ctor() to fail Kevin Brodsky
@ 2025-02-03 10:18 ` Kevin Brodsky
  2025-02-03 10:18 ` [RFC PATCH v3 12/15] arm64: kpkeys: Support KPKEYS_LVL_PGTABLES Kevin Brodsky
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-03 10:18 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Linus Walleij, Andy Lutomirski, Marc Zyngier,
	Peter Zijlstra, Pierre Langlois, Quentin Perret,
	Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

If CONFIG_KPKEYS_HARDENED_PGTABLES is enabled, map allocated page
table pages using a privileged pkey (KPKEYS_PKEY_PGTABLES), so that
page tables can only be written under guard(kpkeys_hardened_pgtables).

This patch is a no-op if CONFIG_KPKEYS_HARDENED_PGTABLES is disabled
(default).
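
For reference, here is a minimal sketch of what these helpers could
look like, assuming a set_memory_pkey(addr, nr_pages, pkey) interface
as described in the cover letter (the actual implementation is
introduced earlier in the series and may differ):

static int kpkeys_protect_pgtable_memory(struct folio *folio)
{
	unsigned long addr = (unsigned long)folio_address(folio);

	return set_memory_pkey(addr, folio_nr_pages(folio),
			       KPKEYS_PKEY_PGTABLES);
}

static void kpkeys_unprotect_pgtable_memory(struct folio *folio)
{
	unsigned long addr = (unsigned long)folio_address(folio);

	/* Restore the default pkey before the pages are freed */
	WARN_ON(set_memory_pkey(addr, folio_nr_pages(folio),
				KPKEYS_PKEY_DEFAULT));
}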

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 include/linux/mm.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 721e779647f3..aa01f51fdc6f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -32,6 +32,7 @@
 #include <linux/memremap.h>
 #include <linux/slab.h>
 #include <linux/cacheinfo.h>
+#include <linux/kpkeys.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -2998,6 +2999,8 @@ static inline bool __pagetable_ctor(struct ptdesc *ptdesc)
 
 	__folio_set_pgtable(folio);
 	lruvec_stat_add_folio(folio, NR_PAGETABLE);
+	if (kpkeys_protect_pgtable_memory(folio))
+		return false;
 	return true;
 }
 
@@ -3008,6 +3011,7 @@ static inline void pagetable_dtor(struct ptdesc *ptdesc)
 	ptlock_free(ptdesc);
 	__folio_clear_pgtable(folio);
 	lruvec_stat_sub_folio(folio, NR_PAGETABLE);
+	kpkeys_unprotect_pgtable_memory(folio);
 }
 
 static inline void pagetable_dtor_free(struct ptdesc *ptdesc)
-- 
2.47.0



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 12/15] arm64: kpkeys: Support KPKEYS_LVL_PGTABLES
  2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
                   ` (10 preceding siblings ...)
  2025-02-03 10:18 ` [RFC PATCH v3 11/15] mm: Map page tables with privileged pkey Kevin Brodsky
@ 2025-02-03 10:18 ` Kevin Brodsky
  2025-02-03 10:18 ` [RFC PATCH v3 13/15] arm64: mm: Guard page table writes with kpkeys Kevin Brodsky
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-03 10:18 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Linus Walleij, Andy Lutomirski, Marc Zyngier,
	Peter Zijlstra, Pierre Langlois, Quentin Perret,
	Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

Enable RW access to KPKEYS_PKEY_PGTABLES (used to map page table
pages) when switching to KPKEYS_LVL_PGTABLES; otherwise only grant
RO access.
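
For context, por_set_pkey_perms() is assumed to update the 4-bit
permission field that each pkey owns in POR_EL1. A rough sketch of
that assumption (not the actual helper from the series):

static inline u64 por_set_pkey_perms(u64 por, int pkey, u64 perms)
{
	unsigned int shift = pkey * 4;	/* one 4-bit field per pkey */

	por &= ~((u64)0xf << shift);

	return por | (perms << shift);
}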

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/kpkeys.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/kpkeys.h b/arch/arm64/include/asm/kpkeys.h
index e17f6df41873..4854e1f3babd 100644
--- a/arch/arm64/include/asm/kpkeys.h
+++ b/arch/arm64/include/asm/kpkeys.h
@@ -18,6 +18,8 @@ static inline bool arch_kpkeys_enabled(void)
 static inline u64 por_set_kpkeys_level(u64 por, int level)
 {
 	por = por_set_pkey_perms(por, KPKEYS_PKEY_DEFAULT, POE_RXW);
+	por = por_set_pkey_perms(por, KPKEYS_PKEY_PGTABLES,
+				 level == KPKEYS_LVL_PGTABLES ? POE_RW : POE_R);
 
 	return por;
 }
-- 
2.47.0



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 13/15] arm64: mm: Guard page table writes with kpkeys
  2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
                   ` (11 preceding siblings ...)
  2025-02-03 10:18 ` [RFC PATCH v3 12/15] arm64: kpkeys: Support KPKEYS_LVL_PGTABLES Kevin Brodsky
@ 2025-02-03 10:18 ` Kevin Brodsky
  2025-02-03 10:18 ` [RFC PATCH v3 14/15] arm64: Enable kpkeys_hardened_pgtables support Kevin Brodsky
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-03 10:18 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Linus Walleij, Andy Lutomirski, Marc Zyngier,
	Peter Zijlstra, Pierre Langlois, Quentin Perret,
	Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

When CONFIG_KPKEYS_HARDENED_PGTABLES is enabled, page tables (both
user and kernel) are mapped with a privileged pkey in the linear
mapping. As a result, they can only be written under the
kpkeys_hardened_pgtables guard, which sets POR_EL1 appropriately to
allow such writes.

Use this guard wherever page tables genuinely need to be written,
keeping its scope as small as possible (so that POR_EL1 is reset as
soon as possible). Where atomics are involved, the guard's scope
encompasses the whole retry loop to avoid switching POR_EL1
unnecessarily.

This patch is a no-op if CONFIG_KPKEYS_HARDENED_PGTABLES is disabled
(default).
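
For reference, the guard itself is assumed to be defined with the
<linux/cleanup.h> helpers along these lines (a sketch; the actual
definition is introduced earlier in the series):

DEFINE_LOCK_GUARD_0(kpkeys_hardened_pgtables,
		    _T->pkey_reg = kpkeys_set_level(KPKEYS_LVL_PGTABLES),
		    kpkeys_restore_pkey_reg(_T->pkey_reg),
		    u64 pkey_reg)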

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 20 ++++++++++++++++++--
 arch/arm64/mm/fault.c            |  2 ++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0b2a2ad1b9e8..37ce03a6ab70 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -39,6 +39,7 @@
 #include <linux/mm_types.h>
 #include <linux/sched.h>
 #include <linux/page_table_check.h>
+#include <linux/kpkeys.h>
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
@@ -314,6 +315,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
 
 static inline void __set_pte_nosync(pte_t *ptep, pte_t pte)
 {
+	guard(kpkeys_hardened_pgtables)();
 	WRITE_ONCE(*ptep, pte);
 }
 
@@ -758,6 +760,7 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 	}
 #endif /* __PAGETABLE_PMD_FOLDED */
 
+	guard(kpkeys_hardened_pgtables)();
 	WRITE_ONCE(*pmdp, pmd);
 
 	if (pmd_valid(pmd)) {
@@ -825,6 +828,7 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
 		return;
 	}
 
+	guard(kpkeys_hardened_pgtables)();
 	WRITE_ONCE(*pudp, pud);
 
 	if (pud_valid(pud)) {
@@ -906,6 +910,7 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
 		return;
 	}
 
+	guard(kpkeys_hardened_pgtables)();
 	WRITE_ONCE(*p4dp, p4d);
 	dsb(ishst);
 	isb();
@@ -1033,6 +1038,7 @@ static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
 		return;
 	}
 
+	guard(kpkeys_hardened_pgtables)();
 	WRITE_ONCE(*pgdp, pgd);
 	dsb(ishst);
 	isb();
@@ -1233,6 +1239,7 @@ static inline int __ptep_test_and_clear_young(struct vm_area_struct *vma,
 {
 	pte_t old_pte, pte;
 
+	guard(kpkeys_hardened_pgtables)();
 	pte = __ptep_get(ptep);
 	do {
 		old_pte = pte;
@@ -1279,7 +1286,10 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address, pte_t *ptep)
 {
-	pte_t pte = __pte(xchg_relaxed(&pte_val(*ptep), 0));
+	pte_t pte;
+
+	scoped_guard(kpkeys_hardened_pgtables)
+		pte = __pte(xchg_relaxed(&pte_val(*ptep), 0));
 
 	page_table_check_pte_clear(mm, pte);
 
@@ -1322,7 +1332,10 @@ static inline pte_t __get_and_clear_full_ptes(struct mm_struct *mm,
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
 					    unsigned long address, pmd_t *pmdp)
 {
-	pmd_t pmd = __pmd(xchg_relaxed(&pmd_val(*pmdp), 0));
+	pmd_t pmd;
+
+	scoped_guard(kpkeys_hardened_pgtables)
+		pmd = __pmd(xchg_relaxed(&pmd_val(*pmdp), 0));
 
 	page_table_check_pmd_clear(mm, pmd);
 
@@ -1336,6 +1349,7 @@ static inline void ___ptep_set_wrprotect(struct mm_struct *mm,
 {
 	pte_t old_pte;
 
+	guard(kpkeys_hardened_pgtables)();
 	do {
 		old_pte = pte;
 		pte = pte_wrprotect(pte);
@@ -1369,6 +1383,7 @@ static inline void __clear_young_dirty_pte(struct vm_area_struct *vma,
 {
 	pte_t old_pte;
 
+	guard(kpkeys_hardened_pgtables)();
 	do {
 		old_pte = pte;
 
@@ -1416,6 +1431,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmdp, pmd_t pmd)
 {
 	page_table_check_pmd_set(vma->vm_mm, pmdp, pmd);
+	guard(kpkeys_hardened_pgtables)();
 	return __pmd(xchg_relaxed(&pmd_val(*pmdp), pmd_val(pmd)));
 }
 #endif
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index ef63651099a9..ab45047155b9 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -220,6 +220,8 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
 	if (pte_same(pte, entry))
 		return 0;
 
+	guard(kpkeys_hardened_pgtables)();
+
 	/* only preserve the access flags and write permission */
 	pte_val(entry) &= PTE_RDONLY | PTE_AF | PTE_WRITE | PTE_DIRTY;
 
-- 
2.47.0



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 14/15] arm64: Enable kpkeys_hardened_pgtables support
  2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
                   ` (12 preceding siblings ...)
  2025-02-03 10:18 ` [RFC PATCH v3 13/15] arm64: mm: Guard page table writes with kpkeys Kevin Brodsky
@ 2025-02-03 10:18 ` Kevin Brodsky
  2025-02-03 10:18 ` [RFC PATCH v3 15/15] mm: Add basic tests for kpkeys_hardened_pgtables Kevin Brodsky
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-03 10:18 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Linus Walleij, Andy Lutomirski, Marc Zyngier,
	Peter Zijlstra, Pierre Langlois, Quentin Perret,
	Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

kpkeys_hardened_pgtables should be enabled as early as possible (if
selected). It does, however, require kpkeys to be available, which on
arm64 means POE being detected and enabled. POE is a boot feature, so
calling kpkeys_hardened_pgtables_enable() just after
setup_boot_cpu_features() in smp_prepare_boot_cpu() is the best we can
do.

With that done, all the bits are in place and we can advertise
support for kpkeys_hardened_pgtables by selecting
ARCH_HAS_KPKEYS_HARDENED_PGTABLES if ARM64_POE is enabled.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/Kconfig      | 1 +
 arch/arm64/kernel/smp.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 34f15348ada8..df26902d385c 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2187,6 +2187,7 @@ config ARM64_POE
 	select ARCH_USES_HIGH_VMA_FLAGS
 	select ARCH_HAS_PKEYS
 	select ARCH_HAS_KPKEYS
+	select ARCH_HAS_KPKEYS_HARDENED_PGTABLES
 	help
 	  The Permission Overlay Extension is used to implement Memory
 	  Protection Keys. Memory Protection Keys provides a mechanism for
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 3b3f6b56e733..074cab55f9db 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -35,6 +35,7 @@
 #include <linux/kgdb.h>
 #include <linux/kvm_host.h>
 #include <linux/nmi.h>
+#include <linux/kpkeys.h>
 
 #include <asm/alternative.h>
 #include <asm/atomic.h>
@@ -468,6 +469,7 @@ void __init smp_prepare_boot_cpu(void)
 	if (system_uses_irq_prio_masking())
 		init_gic_priority_masking();
 
+	kpkeys_hardened_pgtables_enable();
 	kasan_init_hw_tags();
 	/* Init percpu seeds for random tags after cpus are set up. */
 	kasan_init_sw_tags();
-- 
2.47.0



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 15/15] mm: Add basic tests for kpkeys_hardened_pgtables
  2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
                   ` (13 preceding siblings ...)
  2025-02-03 10:18 ` [RFC PATCH v3 14/15] arm64: Enable kpkeys_hardened_pgtables support Kevin Brodsky
@ 2025-02-03 10:18 ` Kevin Brodsky
  2025-02-06 22:41 ` [RFC PATCH v3 00/15] pkeys-based page table hardening Kees Cook
  2025-03-06 16:23 ` Maxwell Bland
  16 siblings, 0 replies; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-03 10:18 UTC (permalink / raw)
  To: linux-hardening
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Linus Walleij, Andy Lutomirski, Marc Zyngier,
	Peter Zijlstra, Pierre Langlois, Quentin Perret,
	Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

Add basic tests for the kpkeys_hardened_pgtables feature: try to
perform a direct write to a kernel and a user page table entry and
ensure it fails.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 mm/Makefile                        |  1 +
 mm/kpkeys_hardened_pgtables_test.c | 72 ++++++++++++++++++++++++++++++
 security/Kconfig.hardening         | 12 +++++
 3 files changed, 85 insertions(+)
 create mode 100644 mm/kpkeys_hardened_pgtables_test.c

diff --git a/mm/Makefile b/mm/Makefile
index 130691364172..f7263b7f45b8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -148,3 +148,4 @@ obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
 obj-$(CONFIG_KPKEYS_HARDENED_PGTABLES) += kpkeys_hardened_pgtables.o
+obj-$(CONFIG_KPKEYS_HARDENED_PGTABLES_TEST) += kpkeys_hardened_pgtables_test.o
diff --git a/mm/kpkeys_hardened_pgtables_test.c b/mm/kpkeys_hardened_pgtables_test.c
new file mode 100644
index 000000000000..86d862d43bea
--- /dev/null
+++ b/mm/kpkeys_hardened_pgtables_test.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <kunit/test.h>
+#include <linux/pgtable.h>
+#include <linux/mman.h>
+
+static void write_kernel_pte(struct kunit *test)
+{
+	pte_t *ptep;
+	pte_t pte;
+	int ret;
+
+	/*
+	 * The choice of address is mostly arbitrary - we just need a page
+	 * that is definitely mapped, such as the current function.
+	 */
+	ptep = virt_to_kpte((unsigned long)&write_kernel_pte);
+	KUNIT_ASSERT_NOT_NULL_MSG(test, ptep, "Failed to get PTE");
+
+	pte = ptep_get(ptep);
+	pte = set_pte_bit(pte, __pgprot(PTE_WRITE));
+	ret = copy_to_kernel_nofault(ptep, &pte, sizeof(pte));
+	KUNIT_EXPECT_EQ_MSG(test, ret, -EFAULT,
+			    "Direct PTE write wasn't prevented");
+}
+
+static void write_user_pmd(struct kunit *test)
+{
+	pmd_t *pmdp;
+	pmd_t pmd;
+	unsigned long uaddr;
+	int ret;
+
+	uaddr = kunit_vm_mmap(test, NULL, 0, PAGE_SIZE, PROT_READ,
+			      MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, 0);
+	KUNIT_ASSERT_NE_MSG(test, uaddr, 0, "Could not create userspace mm");
+
+	/* We passed MAP_POPULATE so a PMD should already be allocated */
+	pmdp = pmd_off(current->mm, uaddr);
+	KUNIT_ASSERT_NOT_NULL_MSG(test, pmdp, "Failed to get PMD");
+
+	pmd = pmdp_get(pmdp);
+	pmd = set_pmd_bit(pmd, __pgprot(PROT_SECT_NORMAL));
+	ret = copy_to_kernel_nofault(pmdp, &pmd, sizeof(pmd));
+	KUNIT_EXPECT_EQ_MSG(test, ret, -EFAULT,
+			    "Direct PMD write wasn't prevented");
+}
+
+static int kpkeys_hardened_pgtables_suite_init(struct kunit_suite *suite)
+{
+	if (!arch_kpkeys_enabled()) {
+		pr_err("Cannot run kpkeys_hardened_pgtables tests: kpkeys are not supported\n");
+		return 1;
+	}
+
+	return 0;
+}
+
+static struct kunit_case kpkeys_hardened_pgtables_test_cases[] = {
+	KUNIT_CASE(write_kernel_pte),
+	KUNIT_CASE(write_user_pmd),
+	{}
+};
+
+static struct kunit_suite kpkeys_hardened_pgtables_test_suite = {
+	.name = "Hardened pgtables using kpkeys",
+	.test_cases = kpkeys_hardened_pgtables_test_cases,
+	.suite_init = kpkeys_hardened_pgtables_suite_init,
+};
+kunit_test_suite(kpkeys_hardened_pgtables_test_suite);
+
+MODULE_DESCRIPTION("Tests for the kpkeys_hardened_pgtables feature");
+MODULE_LICENSE("GPL");
diff --git a/security/Kconfig.hardening b/security/Kconfig.hardening
index f729488eac56..649847535fc3 100644
--- a/security/Kconfig.hardening
+++ b/security/Kconfig.hardening
@@ -313,6 +313,18 @@ config KPKEYS_HARDENED_PGTABLES
 	  This option has no effect if the system does not support
 	  kernel pkeys.
 
+config KPKEYS_HARDENED_PGTABLES_TEST
+	tristate "KUnit tests for kpkeys_hardened_pgtables" if !KUNIT_ALL_TESTS
+	depends on KPKEYS_HARDENED_PGTABLES
+	depends on KUNIT
+	default KUNIT_ALL_TESTS
+	help
+	  Enable this option to check that the kpkeys_hardened_pgtables feature
+	  functions as intended, i.e. prevents arbitrary writes to user and
+	  kernel page tables.
+
+	  If unsure, say N.
+
 endmenu
 
 config CC_HAS_RANDSTRUCT
-- 
2.47.0



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 08/15] mm: Introduce kernel_pgtables_set_pkey()
  2025-02-03 10:18 ` [RFC PATCH v3 08/15] mm: Introduce kernel_pgtables_set_pkey() Kevin Brodsky
@ 2025-02-06 19:01   ` Linus Walleij
  2025-02-07 14:33     ` Kevin Brodsky
  0 siblings, 1 reply; 28+ messages in thread
From: Linus Walleij @ 2025-02-06 19:01 UTC (permalink / raw)
  To: Kevin Brodsky
  Cc: linux-hardening, linux-kernel, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Andy Lutomirski, Marc Zyngier, Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

Hi Kevin,

thanks for your patch!

On Mon, Feb 3, 2025 at 11:20 AM Kevin Brodsky <kevin.brodsky@arm.com> wrote:

> kernel_pgtables_set_pkey() allows setting the pkey of all page table
> pages in swapper_pg_dir, recursively. This will be needed by
> kpkeys_hardened_pgtables, as it relies on all PTPs being mapped with
> a non-default pkey. Those initial kernel page tables cannot
> practically be assigned a non-default pkey right when they are
> allocated, so mutating them during (early) boot is required.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>

(...)

> +static int set_page_pkey(void *p, int pkey)
> +static int set_pkey_pte(pmd_t *pmd, int pkey)
> +static int set_pkey_pmd(pud_t *pud, int pkey)
> +static int set_pkey_pud(p4d_t *p4d, int pkey)
> +static int set_pkey_p4d(pgd_t *pgd, int pkey)
> +int kernel_pgtables_set_pkey(int pkey)

Aren't these all discardable after boot, so the whole set should
be tagged with __init?

Other than that it LGTM.

Yours,
Linus Walleij


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 00/15] pkeys-based page table hardening
  2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
                   ` (14 preceding siblings ...)
  2025-02-03 10:18 ` [RFC PATCH v3 15/15] mm: Add basic tests for kpkeys_hardened_pgtables Kevin Brodsky
@ 2025-02-06 22:41 ` Kees Cook
  2025-02-10 14:23   ` Kevin Brodsky
  2025-03-06 16:23 ` Maxwell Bland
  16 siblings, 1 reply; 28+ messages in thread
From: Kees Cook @ 2025-02-06 22:41 UTC (permalink / raw)
  To: Kevin Brodsky
  Cc: linux-hardening, linux-kernel, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Linus Walleij, Andy Lutomirski, Marc Zyngier, Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

On Mon, Feb 03, 2025 at 10:18:24AM +0000, Kevin Brodsky wrote:
> This is a proposal to leverage protection keys (pkeys) to harden
> critical kernel data, by making it mostly read-only. The series includes
> a simple framework called "kpkeys" to manipulate pkeys for in-kernel use,
> as well as a page table hardening feature based on that framework
> (kpkeys_hardened_pgtables). Both are implemented on arm64 as a proof of
> concept, but they are designed to be compatible with any architecture
> implementing pkeys.

Does QEMU support POE? The only mention I could find is here:
https://mail.gnu.org/archive/html/qemu-arm/2024-03/msg00486.html
where the answer is, "no and it looks difficult". :P

> # Threat model
> 
> The proposed scheme aims at mitigating data-only attacks (e.g.
> use-after-free/cross-cache attacks). In other words, it is assumed that
> control flow is not corrupted, and that the attacker does not achieve
> arbitrary code execution. Nothing prevents the pkey register from being
> set to its most permissive state - the assumption is that the register
> is only modified on legitimate code paths.

Do you have any tests that could be added to drivers/misc/lkdtm that
explicitly exercise the protection? That is where many hardware security
features get tested. (i.e. a successful test will generally trigger a
BUG_ON or similar.)

> The arm64 implementation should be considered a proof of concept only.
> The enablement of POE for in-kernel use is incomplete; in particular
> POR_EL1 (pkey register) should be reset on exception entry and restored
> on exception return.

As in, make sure the loaded pkey isn't leaked into an exception handler?

> # Open questions
> 
> A few aspects in this RFC that are debatable and/or worth discussing:
> 
> - There is currently no restriction on how kpkeys levels map to pkeys
>   permissions. A typical approach is to allocate one pkey per level and
>   make it writable at that level only. As the number of levels
>   increases, we may however run out of pkeys, especially on arm64 (just
>   8 pkeys with POE). Depending on the use-cases, it may be acceptable to
>   use the same pkey for the data associated to multiple levels.
> 
>   Another potential concern is that a given piece of code may require
>   write access to multiple privileged pkeys. This could be addressed by
>   introducing a notion of hierarchy in trust levels, where Tn is able to
>   write to memory owned by Tm if n >= m, for instance.
> 
> - kpkeys_set_level() and kpkeys_restore_pkey_reg() are not symmetric:
>   the former takes a kpkeys level and returns a pkey register value, to
>   be consumed by the latter. It would be more intuitive to manipulate
>   kpkeys levels only. However this assumes that there is a 1:1 mapping
>   between kpkeys levels and pkey register values, while in principle
>   the mapping is 1:n (certain pkeys may be used outside the kpkeys
>   framework).

Is the "levels" nature of this related to how POE behaves? It sounds
like there can only be 1 pkey active at a time (a role), rather than
each pkey representing access to a specific set of pages (a key in a
keyring), where many pkeys could be active at the same time. Am I
understanding that correctly?

> Any comment or feedback will be highly appreciated, be it on the
> high-level approach or implementation choices!

As hinted earlier with my QEMU question... what's the best way I can
test this myself? :)

Thanks for working on this! Data-only attacks have been on the rise for
a while now, and I'm excited to see some viable mitigations appearing.
Yay!

-Kees

-- 
Kees Cook


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 08/15] mm: Introduce kernel_pgtables_set_pkey()
  2025-02-06 19:01   ` Linus Walleij
@ 2025-02-07 14:33     ` Kevin Brodsky
  0 siblings, 0 replies; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-07 14:33 UTC (permalink / raw)
  To: Linus Walleij
  Cc: linux-hardening, linux-kernel, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Andy Lutomirski, Marc Zyngier, Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

On 06/02/2025 20:01, Linus Walleij wrote:
>> +static int set_page_pkey(void *p, int pkey)
>> +static int set_pkey_pte(pmd_t *pmd, int pkey)
>> +static int set_pkey_pmd(pud_t *pud, int pkey)
>> +static int set_pkey_pud(p4d_t *p4d, int pkey)
>> +static int set_pkey_p4d(pgd_t *pgd, int pkey)
>> +int kernel_pgtables_set_pkey(int pkey)
> Aren't these all discardable after boot, so the whole set should
> be tagged with __init?

Good point, I was thinking of this function as a relatively generic one
but in reality there's probably no use for it after boot. In any case
it's only called from a function marked __init at the moment, so it's
safe to mark all those as __init too.

> Other than that it LGTM.

Thank you for the review!

- Kevin


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 00/15] pkeys-based page table hardening
  2025-02-06 22:41 ` [RFC PATCH v3 00/15] pkeys-based page table hardening Kees Cook
@ 2025-02-10 14:23   ` Kevin Brodsky
  2025-02-13 14:54     ` Kevin Brodsky
  0 siblings, 1 reply; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-10 14:23 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-hardening, linux-kernel, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Linus Walleij, Andy Lutomirski, Marc Zyngier, Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

On 06/02/2025 23:41, Kees Cook wrote:
> On Mon, Feb 03, 2025 at 10:18:24AM +0000, Kevin Brodsky wrote:
>> This is a proposal to leverage protection keys (pkeys) to harden
>> critical kernel data, by making it mostly read-only. The series includes
>> a simple framework called "kpkeys" to manipulate pkeys for in-kernel use,
>> as well as a page table hardening feature based on that framework
>> (kpkeys_hardened_pgtables). Both are implemented on arm64 as a proof of
>> concept, but they are designed to be compatible with any architecture
>> implementing pkeys.
> Does QEMU support POE? The only mention I could find is here:
> https://mail.gnu.org/archive/html/qemu-arm/2024-03/msg00486.html
> where the answer is, "no and it looks difficult". :P

Unfortunately it looks like the answer hasn't changed since last year. I
am testing this series on an Arm Fast Models platform (FVP) [1], which
does support POE. I've included instructions to get you started at the end.

>> # Threat model
>>
>> The proposed scheme aims at mitigating data-only attacks (e.g.
>> use-after-free/cross-cache attacks). In other words, it is assumed that
>> control flow is not corrupted, and that the attacker does not achieve
>> arbitrary code execution. Nothing prevents the pkey register from being
>> set to its most permissive state - the assumption is that the register
>> is only modified on legitimate code paths.
> Do you have any tests that could be added to drivers/misc/lkdtm that
> explicitly exercise the protection? That is where many hardware security
> features get tested. (i.e. a successful test will generally trigger a
> BUG_ON or similar.)

I could certainly add some tests there, but I wonder if such crash tests
provide much benefit compared to the KUnit tests (that rely on
copy_to_kernel_nofault()) in patch 15? Not crashing the kernel does mean
that many of those tests can be run in a row :)

>> The arm64 implementation should be considered a proof of concept only.
>> The enablement of POE for in-kernel use is incomplete; in particular
>> POR_EL1 (pkey register) should be reset on exception entry and restored
>> on exception return.
> As in, make sure the loaded pkey isn't leaked into an exception handler?

I wouldn't say "leaking" is the issue here, but yes conceptually
exception handlers should run with a fixed pkey configuration, not that
of the interrupted context. As Dave Hansen pointed out [2], what is even
more important is to context-switch the pkey register. A thread may be
interrupted and scheduled out while executing at a higher kpkeys level;
we want to ensure that this thread resumes execution at the same kpkeys
level, and that in the meantime we return to the standard level.
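
To illustrate the context-switching part (a sketch only - the
thread.por_el1 field and the hook point are hypothetical):

static inline void kpkeys_thread_switch(struct task_struct *prev,
					struct task_struct *next)
{
	/* Save the outgoing thread's kernel pkey register... */
	prev->thread.por_el1 = read_sysreg_s(SYS_POR_EL1);
	/* ...and install the incoming thread's, back at its own level */
	write_sysreg_s(next->thread.por_el1, SYS_POR_EL1);
	isb();
}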

>> # Open questions
>>
>> A few aspects in this RFC that are debatable and/or worth discussing:
>>
>> - There is currently no restriction on how kpkeys levels map to pkeys
>>   permissions. A typical approach is to allocate one pkey per level and
>>   make it writable at that level only. As the number of levels
>>   increases, we may however run out of pkeys, especially on arm64 (just
>>   8 pkeys with POE). Depending on the use-cases, it may be acceptable to
>>   use the same pkey for the data associated to multiple levels.
>>
>>   Another potential concern is that a given piece of code may require
>>   write access to multiple privileged pkeys. This could be addressed by
>>   introducing a notion of hierarchy in trust levels, where Tn is able to
>>   write to memory owned by Tm if n >= m, for instance.
>>
>> - kpkeys_set_level() and kpkeys_restore_pkey_reg() are not symmetric:
>>   the former takes a kpkeys level and returns a pkey register value, to
>>   be consumed by the latter. It would be more intuitive to manipulate
>>   kpkeys levels only. However this assumes that there is a 1:1 mapping
>>   between kpkeys levels and pkey register values, while in principle
>>   the mapping is 1:n (certain pkeys may be used outside the kpkeys
>>   framework).
> Is the "levels" nature of this related to how POE behaves? It sounds
> like there can only be 1 pkey active at a time (a role), rather than
> each pkey representing access to a specific set of pages (a key in a
> keyring), where many pkeys could be active at the same time. Am I
> understanding that correctly?

Only one key is used (besides the default key) in this initial RFC.
However, the idea behind the level abstraction is indeed that (RW)
access to multiple keys may be required at the same time. In the
follow-up RFC protecting credentials, this is illustrated by the
"unrestricted" level that grants RW access to all keys. I believe this
approach is the most flexible, in that any permission mapping can be
defined for each level.
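
To make this concrete, with hypothetical names borrowed from that
follow-up RFC (KPKEYS_LVL_UNRESTRICTED, KPKEYS_PKEY_CRED and
KPKEYS_LVL_CRED are not part of this series), the mapping could look
like:

static inline u64 por_set_kpkeys_level(u64 por, int level)
{
	bool all = (level == KPKEYS_LVL_UNRESTRICTED);

	por = por_set_pkey_perms(por, KPKEYS_PKEY_DEFAULT, POE_RXW);
	por = por_set_pkey_perms(por, KPKEYS_PKEY_PGTABLES,
				 all || level == KPKEYS_LVL_PGTABLES ?
				 POE_RW : POE_R);
	por = por_set_pkey_perms(por, KPKEYS_PKEY_CRED,
				 all || level == KPKEYS_LVL_CRED ?
				 POE_RW : POE_R);

	return por;
}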

>> Any comment or feedback will be highly appreciated, be it on the
>> high-level approach or implementation choices!
> As hinted earlier with my QEMU question... what's the best way I can
> test this myself? :)

As mentioned above I tested this series on Arm FVP. By far the easiest
way to run some custom kernel/rootfs on FVP is to use the Shrinkwrap
tool [3]. First install it following the quick start guide [4] (I would
recommend using the Docker backend if possible). Then build the firmware
stack using:

$ shrinkwrap build -o arch/v9.0.yaml ns-edk2.yaml

To make things easy, the runtime configuration can be stored in a file.
Create ~/.shrinkwrap/config/poe.yaml with the following contents:

----8<----

%YAML 1.2
---
layers:
  - arch/v9.0.yaml

run:
  rtvars:
    CMDLINE:
      type: string
      # nr_cpus=1 can be added to speed up the boot
      value: console=ttyAMA0 earlycon=pl011,0x1c090000 root=/dev/vda rw
  params:
    -C cluster0.has_permission_overlay_s1: 1
    -C cluster1.has_permission_overlay_s1: 1

----8<----

Finally start FVP using:

$ shrinkwrap run -o poe.yaml ns-edk2.yaml \
    -r KERNEL=<out>/arch/arm64/boot/Image -r ROOTFS=<rootfs.img>

(Use Ctrl-] to terminate the model if needed.)

<rootfs.img> is a file containing the root filesystem (in raw format,
e.g. ext4). The kernel itself is built as usual (defconfig works just
fine), just make sure to select CONFIG_KPKEYS_HARDENED_PGTABLES to
enable the feature. You can also select
CONFIG_KPKEYS_HARDENED_PGTABLES_TEST to run the tests in patch 15.

> Thanks for working on this! Data-only attacks have been on the rise for
> a while now, and I'm excited to see some viable mitigations appearing.
> Yay!

Thank you for your interest and support, very appreciated!

- Kevin
[1]
https://developer.arm.com/Tools%20and%20Software/Fixed%20Virtual%20Platforms/Arm%20Architecture%20FVPs
[2]
https://lore.kernel.org/linux-hardening/dcc1800c-cf0a-4d88-bc88-982f0709b382@intel.com/
[3] https://shrinkwrap.docs.arm.com/
[4] https://shrinkwrap.docs.arm.com/en/latest/userguide/quickstart.html


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 00/15] pkeys-based page table hardening
  2025-02-10 14:23   ` Kevin Brodsky
@ 2025-02-13 14:54     ` Kevin Brodsky
  0 siblings, 0 replies; 28+ messages in thread
From: Kevin Brodsky @ 2025-02-13 14:54 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-hardening, linux-kernel, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Linus Walleij, Andy Lutomirski, Marc Zyngier, Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

On 10/02/2025 15:23, Kevin Brodsky wrote:
> [...]
>
>>> Any comment or feedback will be highly appreciated, be it on the
>>> high-level approach or implementation choices!
>> As hinted earlier with my QEMU question... what's the best way I can
>> test this myself? :)
> As mentioned above I tested this series on Arm FVP. By far the easiest
> way to run some custom kernel/rootfs on FVP is to use the Shrinkwrap
> tool [3]. First install it following the quick start guide [4] (I would
> recommend using the Docker backend if possible). Then build the firmware
> stack using:
>
> $ shrinkwrap build -o arch/v9.0.yaml ns-edk2.yaml
>
> To make things easy, the runtime configuration can be stored in a file.
> Create ~/.shrinkwrap/config/poe.yaml with the following contents:
>
> ----8<----
>
> %YAML 1.2
> ---
> layers:
>   - arch/v9.0.yaml

Apologies, this is incorrect - it will not work with the most recent FVP
builds. POE is a v9.4 feature so this line should be replaced with:

> - arch/v9.4.yaml

(No need to change the shrinkwrap build line, it only matters for the
FVP runtime parameters.)

- Kevin

> run:
>   rtvars:
>     CMDLINE:
>       type: string
>       # nr_cpus=1 can be added to speed up the boot
>       value: console=ttyAMA0 earlycon=pl011,0x1c090000 root=/dev/vda rw
>   params:
>     -C cluster0.has_permission_overlay_s1: 1
>     -C cluster1.has_permission_overlay_s1: 1
>
> ----8<----
>
> Finally start FVP using:
>
> $ shrinkwrap run -o poe.yaml ns-edk2.yaml \
>     -r KERNEL=<out>/arch/arm64/boot/Image -r ROOTFS=<rootfs.img>
>
> (Use Ctrl-] to terminate the model if needed.)
>
> <rootfs.img> is a file containing the root filesystem (in raw format,
> e.g. ext4). The kernel itself is built as usual (defconfig works just
> fine), just make sure to select CONFIG_KPKEYS_HARDENED_PGTABLES to
> enable the feature. You can also select
> CONFIG_KPKEYS_HARDENED_PGTABLES_TEST to run the tests in patch 15.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 00/15] pkeys-based page table hardening
  2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
                   ` (15 preceding siblings ...)
  2025-02-06 22:41 ` [RFC PATCH v3 00/15] pkeys-based page table hardening Kees Cook
@ 2025-03-06 16:23 ` Maxwell Bland
  2025-03-13 12:32   ` Kevin Brodsky
  16 siblings, 1 reply; 28+ messages in thread
From: Maxwell Bland @ 2025-03-06 16:23 UTC (permalink / raw)
  To: Kevin Brodsky
  Cc: linux-kernel, Kevin Brodsky, Andrew Morton, Mark Brown,
	Catalin Marinas, Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly,
	Kees Cook, Linus Walleij, Andy Lutomirski, Marc Zyngier,
	Peter Zijlstra, Pierre Langlois, Quentin Perret,
	Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

On Mon, Feb 03, 2025 at 10:18:24AM +0000, Kevin Brodsky wrote:
> This is a proposal to leverage protection keys (pkeys) to harden
> critical kernel data, by making it mostly read-only. The series includes
> a simple framework called "kpkeys" to manipulate pkeys for in-kernel use,
> as well as a page table hardening feature based on that framework
> (kpkeys_hardened_pgtables). Both are implemented on arm64 as a proof of
> concept, but they are designed to be compatible with any architecture
> implementing pkeys.

Hi Kevin,

This is awesome. When working on some of these problems, I also thought
of leveraging the POE feature, but was not sure of a good way to make
it work.

> Page tables were chosen as they are a popular (and critical) target for
> attacks, but there are of course many others - this is only a starting
> point (see section "Further use-cases"). It has become more and more
> common for accesses to such target data to be mediated by a hypervisor
> in vendor kernels; the hope is that kpkeys can provide much of that
> protection in a simpler manner. No benchmarking has been performed at
> this stage, but the runtime overhead should also be lower (though likely
> not negligible).

Some notes here, having implemented similar page table protections,
albeit using stage-2 page table permissions and a fault handler.

https://lore.kernel.org/all/2wf4kmoqqmod6njviaq33lbxbx6gvdqbksljxykgltwnxo6ruy@7ueumwmxxh72/

I wanted to know your thoughts on associating specific policies with
page table updates in cases where an adversary is able to corrupt
other state associated with parameters to the page table infrastructure
code, e.g.

arch/arm64/net/bpf_jit_comp.c
2417:		if (set_memory_rw(PAGE_MASK & ((uintptr_t)&plt->target),

Is this something we would assume is handled via the security_* hook
infrastructure, a shadow stack CFI method, or changing the kernel code
to reverify the data non-modularly, some combination of the above?

> - Pages in the linear mapping are assigned a pkey using set_memory_pkey().
>   This is sufficient for this series, but of course higher-level
>   interfaces can be introduced later to ask allocators to return pages
>   marked with a given pkey. It should also be possible to extend this to
>   vmalloc() if needed.

One of the interesting points here, acknowledged below, was that this
relies on having guarantees around the PC value/CFI of the function.
Since this is the baseline assumption, it feels natural that the
locking/unlocking would be associated into the existing CFI
instrumentation, since (from experience) point-patching each mutable
data structure allocation/deallocation is difficult.

> # Further use-cases
> 
> It should be possible to harden various targets using kpkeys, including:
> 
> - struct cred (enforcing a "mostly read-only" state once committed)
> 
> - fixmap (occasionally used even after early boot, e.g.
>   set_swapper_pgd() in arch/arm64/mm/mmu.c)
> 
> - SELinux state (e.g. struct selinux_state::initialized)
> 
> ... and many others.

Be wary that it is not just struct cred but pointers to struct cred. We
quickly run into the problem of, for example with updates to f_op,
bookkeeping which f_op is OK to have in the file, e.g. in Android's
6.1.x:

drivers/gpu/drm/i810/i810_dma.c
138:    file_priv->filp->f_op = &i810_buffer_fops;
144:    file_priv->filp->f_op = old_fops;

drivers/staging/android/ashmem.c
436:            vmfile->f_op = &vmfile_fops;

Where overwriting f_op is a "classic" bypass of protection systems like
this one.

I think this problem may be totally solvable if POE was integrated into
something like CFI, since we can guarantee only the code that sets f_op
to "vmfile_fops" can unlock/relock the file's page.

Maybe another approach would work better, though?

> # Open questions
> 
> A few aspects in this RFC that are debatable and/or worth discussing:
> 
> - There is currently no restriction on how kpkeys levels map to pkeys
>   permissions. A typical approach is to allocate one pkey per level and
>   make it writable at that level only. As the number of levels
>   increases, we may however run out of pkeys, especially on arm64 (just
>   8 pkeys with POE). Depending on the use-cases, it may be acceptable to
>   use the same pkey for the data associated to multiple levels.

Honestly, I associate each protected virtual page in stage-2 with a
unique tag (manually, right now, but Kees Cook has some magic that
does the same via alloc_tag.h), and this works really well to track
specific resources and resource modification semantics "over" a generic
protection ring.

I think, though, that the code you provided could be used to bootstrap
such a system by using the overlay to protect a similar page tag lookup
table, which then can provide the fine-grained protection semantics.

I.e. use this baseline to isolate a secure monitor system.

Hopefully that makes sense! (-:

> 
> - kpkeys_set_level() and kpkeys_restore_pkey_reg() are not symmetric:
>   the former takes a kpkeys level and returns a pkey register value, to
>   be consumed by the latter. It would be more intuitive to manipulate
>   kpkeys levels only. However this assumes that there is a 1:1 mapping
>   between kpkeys levels and pkey register values, while in principle
>   the mapping is 1:n (certain pkeys may be used outside the kpkeys
>   framework).

Another issue I'm not confident in is the assumption of an adversary's
inability to manipulate system control registers. This is true in the
context of a Heki-like system (or any well-made HVCI), but not totally
true of a pure EL1 implementation?

> 
> - An architecture that supports kpkeys is expected to select
>   CONFIG_ARCH_HAS_KPKEYS and always enable them if available - there is
>   no CONFIG_KPKEYS to control this behaviour. Since this creates no
>   significant overhead (at least on arm64), it seemed better to keep it
>   simple. Each hardening feature does have its own option and arch
>   opt-in if needed (CONFIG_KPKEYS_HARDENED_PGTABLES,
>   CONFIG_ARCH_HAS_KPKEYS_HARDENED_PGTABLES).

There's so many pieces of data though/data structures! In this model,
you'd have a separate switch for thousands of types of data! But I do
think protecting PTs is the first step to a more complicated security
monitor, since it allows you to have integrity for specific physical
pages (or IPAs).
> 
> 
> Any comment or feedback will be highly appreciated, be it on the
> high-level approach or implementation choices!

Last note, I'd not totallllyyy trust the compiler to inline the
functions.... I've met cases where functions operating on memory
protections that I expected to be inlined were not. I think
__forceinline *may* work here, vs standard "static inline", but am not
confident/sure.

Hopefully the above is valuable at all. Thanks!

Maxwell Bland


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 00/15] pkeys-based page table hardening
  2025-03-06 16:23 ` Maxwell Bland
@ 2025-03-13 12:32   ` Kevin Brodsky
  2025-03-19 21:54     ` Maxwell Bland
  0 siblings, 1 reply; 28+ messages in thread
From: Kevin Brodsky @ 2025-03-13 12:32 UTC (permalink / raw)
  To: Maxwell Bland
  Cc: linux-kernel, Andrew Morton, Mark Brown, Catalin Marinas,
	Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly, Kees Cook,
	Linus Walleij, Andy Lutomirski, Marc Zyngier, Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

On 06/03/2025 17:23, Maxwell Bland wrote:
> On Mon, Feb 03, 2025 at 10:18:24AM +0000, Kevin Brodsky wrote:
>> This is a proposal to leverage protection keys (pkeys) to harden
>> critical kernel data, by making it mostly read-only. The series includes
>> a simple framework called "kpkeys" to manipulate pkeys for in-kernel use,
>> as well as a page table hardening feature based on that framework
>> (kpkeys_hardened_pgtables). Both are implemented on arm64 as a proof of
>> concept, but they are designed to be compatible with any architecture
>> implementing pkeys.
> Hi Kevin,
>
> This is awesome. When working on some of these problems, I also thought
> of leveraging the POE feature, but was not sure of a good way to make
> it work.

Hi Maxwell,

I'm glad to hear about your interest!

>> Page tables were chosen as they are a popular (and critical) target for
>> attacks, but there are of course many others - this is only a starting
>> point (see section "Further use-cases"). It has become more and more
>> common for accesses to such target data to be mediated by a hypervisor
>> in vendor kernels; the hope is that kpkeys can provide much of that
>> protection in a simpler manner. No benchmarking has been performed at
>> this stage, but the runtime overhead should also be lower (though likely
>> not negligible).
> Some notes here, having implemented similar page table protections,
> albeit using stage-2 page table permissions and a fault handler.
>
> https://lore.kernel.org/all/2wf4kmoqqmod6njviaq33lbxbx6gvdqbksljxykgltwnxo6ruy@7ueumwmxxh72/
>
> I wanted to know your thoughts on associating specific policies to
> page table updates in cases where an adversary is able to corrupt
> other state associated with parameters to the page table infrastructure
> code, e.g.
>
> arch/arm64/net/bpf_jit_comp.c
> 2417:		if (set_memory_rw(PAGE_MASK & ((uintptr_t)&plt->target),
>
> Is this something we would assume is handled via the security_* hook
> infrastructure, a shadow stack CFI method, or changing the kernel code
> to reverify the data non-modularly, some combination of the above?

This is a good question and the short answer is that I don't know what
the best approach would be. As far as kpkeys are concerned, I'm not sure
they can really help here - it seems to be about the integrity of a
generated BPF program, and surely that program is already made read-only
once generated?

>> - Pages in the linear mapping are assigned a pkey using set_memory_pkey().
>>   This is sufficient for this series, but of course higher-level
>>   interfaces can be introduced later to ask allocators to return pages
>>   marked with a given pkey. It should also be possible to extend this to
>>   vmalloc() if needed.
> One of the interesting points here, acknowledged below, was that this
> relies on having guarantees around the PC value/CFI of the function.
> Since this is the baseline assumption, it feels natural that the
> locking/unlocking would be associated into the existing CFI
> instrumentation, since (from experience) point-patching each mutable
> data structure allocation/deallocation is difficult.

Could you elaborate on this point? Are you referring to protecting
arbitrary data structures with kpkeys, rather than what this series
covers (i.e. pgtables)?

>> # Further use-cases
>>
>> It should be possible to harden various targets using kpkeys, including:
>>
>> - struct cred (enforcing a "mostly read-only" state once committed)
>>
>> - fixmap (occasionally used even after early boot, e.g.
>>   set_swapper_pgd() in arch/arm64/mm/mmu.c)
>>
>> - SELinux state (e.g. struct selinux_state::initialized)
>>
>> ... and many others.
> Be wary that it is not just struct cred but pointers to struct cred.  We
> quickly run into a problem of, for example, in updates to f_op,
> bookkeeping which f_op is OK to have in the file, e.g., in Android's
> 6.1.x:
>
> drivers/gpu/drm/i810/i810_dma.c
> 138:    file_priv->filp->f_op = &i810_buffer_fops;
> 144:    file_priv->filp->f_op = old_fops;
>
> drivers/staging/android/ashmem.c
> 436:            vmfile->f_op = &vmfile_fops;
>
> Where overwriting f_op is a "classic" bypass of protection systems like
> this one.

Indeed, this has been pointed out to me on a few occasions. On that note
I should reference the follow-up series that aims at protecting struct
cred [1].

Using kpkeys to ensure that the task->cred pointer itself is mostly
read-only is not straightforward, because there is just too much
existing code that directly writes to other fields in task_struct -
meaning we can't simply make the whole task_struct mostly read-only.
What could be done is to split out the cred fields in a separate
kpkeys-protected page, and link that page from task_struct. There is a
clear overhead associated with this approach though, and maybe more
importantly we have just displaced the issue as the pointer to that
protected page remains writeable... I'm not sure how much a page-based
technique like pkeys can help here.
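
For illustration only, the split could look like this (everything
below is hypothetical, not part of either series):

/* The cred fields live in their own page, mapped with a dedicated pkey */
struct task_cred_page {
	struct cred cred;
	/* other sensitive per-task state could join it */
};

struct task_struct {
	/* ... */
	struct task_cred_page *cred_page;	/* the pointer itself
						 * remains writeable -
						 * the displaced problem */
};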

[1]
https://lore.kernel.org/linux-hardening/20250203102809.1223255-1-kevin.brodsky@arm.com/

> I think this problem may be totally solvable if POE was integrated into
> something like CFI, since we can guarantee only the code that sets f_op
> to "vmfile_fops" can unlock/relock the file's page.

I'm interested to hear more about this - I don't think I'm aware of the
CFI instrumentation you're referring to.

> Maybe another approach would work better, though?
>
>> # Open questions
>>
>> A few aspects in this RFC that are debatable and/or worth discussing:
>>
>> - There is currently no restriction on how kpkeys levels map to pkeys
>>   permissions. A typical approach is to allocate one pkey per level and
>>   make it writable at that level only. As the number of levels
>>   increases, we may however run out of pkeys, especially on arm64 (just
>>   8 pkeys with POE). Depending on the use-cases, it may be acceptable to
>>   use the same pkey for the data associated to multiple levels.
> Honestly, I associate each protected virtual page in stage-2 with a
> unique tag (manually, right now, but Kees Cook has some magic that
> does the same via alloc_tag.h), and this works really well to track
> specific resources and resource modification semantics "over" a generic
> protection ring.
>
> I think, though, that the code you provided could be used to bootstrap
> such a system by using the overlay to protect a similar page tag lookup
> table, which then can provide the fine-grained protection semantics.

This sounds sensible. Considering that pkeys are between 3 and 5 bits
wide depending on the architecture, to keep the approach generic
we really don't have many keys to play with (7 excluding the default key
on arm64/POE). kpkeys approaches keys in a static way (no dynamic
allocation), so it's best to think of each key as corresponding to one
subsystem defined at compile time.
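
Concretely, the static allocation could be little more than an enum
(illustrative only; KPKEYS_PKEY_CRED and KPKEYS_PKEY_MAX are
hypothetical):

enum kpkeys_pkey {
	KPKEYS_PKEY_DEFAULT	= 0,
	KPKEYS_PKEY_PGTABLES	= 1,
	KPKEYS_PKEY_CRED	= 2,
	/* at most 7 non-default keys available on arm64/POE */
	KPKEYS_PKEY_MAX
};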

> I.e. use this baseline to isolate a secure monitor system.
>
> Hopefully that makes sense! (-:
>
>> - kpkeys_set_level() and kpkeys_restore_pkey_reg() are not symmetric:
>>   the former takes a kpkeys level and returns a pkey register value, to
>>   be consumed by the latter. It would be more intuitive to manipulate
>>   kpkeys levels only. However this assumes that there is a 1:1 mapping
>>   between kpkeys levels and pkey register values, while in principle
>>   the mapping is 1:n (certain pkeys may be used outside the kpkeys
>>   framework).
> Another issue I'm not confident in is the assumption of adversary's
> inability to manipulate system control registers. This is true in the
> context of a Heki-like system (or any well-made HVCI), but not totally
> true of a pure EL1 implementation?

This is entirely reliant on the threat model, i.e. only data-only
attacks are in scope. If the adversary successfully hijacks the control
flow to jump in the middle of a function, or achieves arbitrary code
execution, pkeys won't be of much help. Robust CFI is therefore a clear
prerequisite.

>> - An architecture that supports kpkeys is expected to select
>>   CONFIG_ARCH_HAS_KPKEYS and always enable them if available - there is
>>   no CONFIG_KPKEYS to control this behaviour. Since this creates no
>>   significant overhead (at least on arm64), it seemed better to keep it
>>   simple. Each hardening feature does have its own option and arch
>>   opt-in if needed (CONFIG_KPKEYS_HARDENED_PGTABLES,
>>   CONFIG_ARCH_HAS_KPKEYS_HARDENED_PGTABLES).
There are so many pieces of data/data structures though! In this model,
> you'd have a separate switch for thousands of types of data! But I do
> think protecting PTs is the first step to a more complicated security
> monitor, since it allows you to have integrity for specific physical
> pages (or IPAs).

Agreed. I think this is good enough for kpkeys where we use roughly one
key per feature (meaning that the total number of features is pretty
limited), but this wouldn't scale if we had a framework to protect many
data types. In that case we'd need a different approach, probably
runtime-based. The advantage of having a CONFIG option per feature
though is that it incurs no overhead at all unless selected.

>>
>> Any comment or feedback will be highly appreciated, be it on the
>> high-level approach or implementation choices!
> Last note, I'd not totallllyyy trust the compiler to inline the
> functions.... I've met cases where functions on memory protections
> I expected to be inlined were not. I think __forceinline *may* work
> here, vs standard "static inline", but am not confident/sure.

That's a fair point, __always_inline could be used to make sure
set_pkey/restore_pkey_reg (and the helpers they call) are actually
inlined. I'll change that in v4. The guard object constructor/destructor
are defined as regular inline functions in <linux/cleanup.h> though, so
the inlining might not be complete.

> Hopefully the above is valuable at all. Thanks!

It is! Thank you for your thoughts and suggestions.

- Kevin


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 00/15] pkeys-based page table hardening
  2025-03-13 12:32   ` Kevin Brodsky
@ 2025-03-19 21:54     ` Maxwell Bland
  2025-03-25 17:11       ` Kevin Brodsky
  0 siblings, 1 reply; 28+ messages in thread
From: Maxwell Bland @ 2025-03-19 21:54 UTC (permalink / raw)
  To: Kevin Brodsky
  Cc: linux-kernel, Andrew Morton, Mark Brown, Catalin Marinas,
	Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly, Kees Cook,
	Linus Walleij, Andy Lutomirski, Marc Zyngier, Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

On Thu, Mar 13, 2025 at 01:32:31PM +0100, Kevin Brodsky wrote:
> On 06/03/2025 17:23, Maxwell Bland wrote:
> > On Mon, Feb 03, 2025 at 10:18:24AM +0000, Kevin Brodsky wrote:
> > I wanted to know your thoughts on associating specific policies with
> > page table updates in cases where an adversary is able to corrupt
> > other state associated with parameters to the page table infrastructure
> > code, e.g.
> >
> > arch/arm64/net/bpf_jit_comp.c
> > 2417:		if (set_memory_rw(PAGE_MASK & ((uintptr_t)&plt->target),
> >
> > Is this something we would assume is handled via the security_* hook
> > infrastructure, a shadow stack CFI method, or changing the kernel code
> > to reverify the data non-modularly, some combination of the above?
> 
> This is a good question and the short answer is that I don't know what
> the best approach would be. As far as kpkeys are concerned, I'm not sure
> they can really help here - it seems to be about the integrity of a
> generated BPF program, and surely that program is already made read-only
> once generated?

Apologies for the slight (5ish) day delay in response here!

Regarding the above discussion of attacks on the context wrapping calls
to the PT infrastructure, I will share the approach I took below, though
it is not super fun, and I don't have all the code publicly available
(yet!).

Before that, another quick example to show it is not just BPF:

There are PT update calls inside EROFS, commonly used in mobile devices
to save on memory. One of the issues I've run into during pagetable
lockdown is

https://github.com/torvalds/linux/blob/81e4f8d68c66da301bb881862735bd74c6241a19/fs/erofs/internal.h#L414

MotoChopper-style attacks could be accomplished through corruption of
the z_erofs_pcluster allocated via SLAB, since erofs has that series of
pointers for compressed_bvecs, which is parsed out into the
compressed_pages used by the EROFS backend to map RAM.

Onto the solution. In a pure-immutability approach, I ended up creating
some definitions regarding what an "expected" virtual-to-physical
mapping of kernel memory or change in page permissions is (my solution
was to treat init_mm as a ground truth structure which is only RW under
a more refined set of constraints on mapping updates and permission set
updates), for example,

if (get_tag(cur_pte & EL1_PADDR_MASK) == SECCOMP_PAGE) {
  /* run BPF re-verifier to avoid the Qualys 2021 CVE PoC magic */
}

or forcing PXN on everything unless some other check says we
absolutely have to make this PX:

if (!cur_pte && new_pte) {
  if (!(new_pte & pxn_bit)) {
    new_pte |= pxn_bit;
  }
}

And then a bunch of additional infrastructure to define what a "safe
mapping of physical memory" looks like, which introduces its own set of
problems for defining how/where vmalloc allocates from, physical address
layout randomization, ...

Such an explication of the expected possible modifications to PTEs may
be needed? I don't understand whether just adding guard() around set_pte
helps this case.

> Could you elaborate on this point? Are you referring to protecting
> arbitrary data structures with kpkeys, rather than what this series
> covers (i.e. pgtables)?

Correct.

> > Where overwriting f_op is a "classic" bypass of protection systems like
> > this one.
> 
> Indeed, this has been pointed out to me on a few occasions. On that note
> I should reference the follow-up series that aims at protecting struct
> cred [1].
> 
> Using kpkeys to ensure that the task->cred pointer itself is mostly
> read-only is not straightforward, because there is just too much
> existing code that directly writes to other fields in task_struct -
> meaning we can't simply make the whole task_struct mostly read-only.
> What could be done is to split out the cred fields in a separate
> kpkeys-protected page, and link that page from task_struct. There is a
> clear overhead associated with this approach though, and maybe more
> importantly we have just displaced the issue as the pointer to that
> protected page remains writeable... I'm not sure how much a page-based
> technique like pkeys can help here.

__randomize_layout accomplished some of what is necessary here, while
accommodating contemporary hardware limitations, albeit to a finer
granularity and lesser degree. I am also not sure how well it works on
production systems (yet! It is on my TODO list in the short term), or
how much protection it really implies for an adversary with a "read
gadget" that can determine where to write (either from associated asm or
from the struct itself).

But...

(1) After thinking about this for a few hours, __randomize_layout implies
something valuable: under circumstances where this is supported, because
we assume the struct's layout itself is modifiable, it becomes possible
to introduce a canary value _not_ for the modifiable portions, but just
for the non-modifiable portions, store this canary value in a strictly
read-only region of memory that also tracks pointers to the structure,
and ensure uses of the non-modifiable fields must always leverage the
canary value stored in this secondary region.

Also...

(2) Maybe the supposition that we'd not be able to go through each of
the ~300 memcpy and all the other copy operations throughout the kernel
to add an additional step which prevents a stale reference to a pointer
to a modifiable part is not totally crazy, just a little crazy.

> 
> [1]
> https://lore.kernel.org/linux-hardening/20250203102809.1223255-1-kevin.brodsky@arm.com/
> 
> > I think this problem may be totally solvable if POE was integrated into
> > something like CFI, since we can guarantee only the code that sets f_op
> > to "vmfile_fops" can unlock/relock the file's page.
> 
> I'm interested to hear more about this - I don't think I'm aware of the
> CFI instrumentation you're referring to.

It may be possible to write a plugin modification to

https://github.com/gcc-mirror/gcc/blob/12b2c414b6d0e0d1b3d328b58d654c19c30bee8c/gcc/config/arm/aarch-bti-insert.cc

to also inject the POE/register-set instructions, conditioned upon an
expected privilege (kpkeys) level of the executing function.

Instead of adding code to set the privilege state via a specialized
callgate function, this function's call would be explicitly tied into
the GCC CFI instrumentation pass. The specific "key level", e.g.
KPKEYS_LVL_PGTABLES, could likely be determined by the file name/path
via the compiler, or many other clever mechanisms, potentially with
support for exceptions via function attribute tags.

Note this still implies correctly accounting for the key level during
allocation, which could be retrieved from the POE level register.
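
As a rough sketch of what the modified pass could emit - the attribute
below is hypothetical (a GCC plugin can register custom attributes),
and the injected sequence is only shown in the comment:

/* hypothetical plugin-registered attribute */
static void __attribute__((kpkeys_level("pgtables")))
write_pte_slot(pte_t *ptep, pte_t pte)
{
	/*
	 * The pass would inject at function entry, much like
	 * aarch-bti-insert.cc places BTI landing pads:
	 *
	 *	mrs	x16, POR_EL1		// save current POE state
	 *	msr	POR_EL1, <pgtables>	// raise to KPKEYS_LVL_PGTABLES
	 *
	 * and restore the saved value into POR_EL1 before each return.
	 */
	WRITE_ONCE(*ptep, pte);
}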

> >>
> >> Any comment or feedback will be highly appreciated, be it on the
> >> high-level approach or implementation choices!
> > Last note, I'd not totallllyyy trust the compiler to inline the
> > functions.... I've met cases where functions on memory protections
> > I expected to be inlined were not. I think __forceinline *may* work
> > here, vs standard "static inline", but am not confident/sure.
> 
> That's a fair point; __always_inline could be used to make sure
> set_pkey/restore_pkey_reg (and the helpers they call) are actually
> inlined. I'll change that in v4. The guard object constructor/destructor
> are defined as regular inline functions in <linux/cleanup.h> though, so
> the inlining might not be complete.

Noticed as well, just now, the reliance on the static_key for branch
unlikely ... the following data structure (at the end of this email) may
be of interest as an FYI ... it can be used to track whether the kernel
self patching API is only being used to patch expected locations.

I use this right now with additional checks that the instruction being
written for the indirect branch matches the expected offset.

Cheers,
Maxwell


/* SPDX-License-Identifier: GPL-2.0 */
/**
 * Copyright (C) 2023 Motorola Mobility, Inc.
 *
 * Author: Maxwell Bland
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * Bitwise trie data structure for storing jump label code addresses.
 * To perform a lookup, we iterate over the bits of the jump_label from MSB to
 * LSB, and use the bit as a tree traversal.
 *
 * Ideally we would just store this in the hypervisor, but we don't have enough
 * memory.  At the time of writing there are 56404 jump labels in the kernel,
 * and the hypervisor is restricted to a memory footprint of 32 KB.
 *
 * So instead what we do is allocate this data structure in the kernel module,
 * and then lock it down. Finally, we pass the pointer to the hypervisor, which
 * can then perform lookups whenever a DABT exception occurs.
 *
 * We choose not to use the kernel's radix tree for this because it is less
 * space efficient and much slower, and jump_label operations relate to
 * performance sensitive code (e.g. the networking stack).
 */

#ifndef JUMP_TABLE_LOOKUP_H
#define JUMP_TABLE_LOOKUP_H

#include <linux/jump_label.h>
#include <linux/mm.h>
#include <linux/of_address.h>
#include <linux/printk.h>
#include <linux/string.h>
#include <asm/memory.h>

/* Used to determine the branch direction */
#define JUMP_TREE_BRANCH(x) (((x) >> ((sizeof(x) * 8) - 1)) & 1)

union jump_tree_node {
	uint32_t je_index[2];
	uint64_t je_val;
};

static union jump_tree_node *jet; /* Start of jump table lookup */
static uint64_t jet_ind; /* Current index of jump table lookup */
static union jump_tree_node *jet_end; /* End of jump table lookup */

static void init_jtn(union jump_tree_node *jtn)
{
	jtn->je_val = 0;
}

/**
 * jet_extend - extends the jump entry lookup table
 *
 * Scales up the size of the jump entry lookup table by 2 just in case our
 * initial estimate was too small. Ensures that the new allocation is a
 * multiple of PAGE_SIZE bytes.
 *
 * Return: 0 on success, -ENOMEM if the larger table cannot be allocated
 */
static int jet_extend(void)
{
	uint64_t jet_sz = jet_end - jet;
	uint64_t new_bytes = 2 * jet_sz * sizeof(union jump_tree_node);
	union jump_tree_node *new_jet;

	if (new_bytes % PAGE_SIZE)
		new_bytes += PAGE_SIZE - (new_bytes % PAGE_SIZE);

	if (jet_ind + 1 >= jet_sz) {
		new_jet = kvmalloc(new_bytes, GFP_KERNEL);
		if (!new_jet)
			return -ENOMEM;
		memcpy(new_jet, jet, jet_sz * sizeof(union jump_tree_node));
		jet_end = new_jet + new_bytes / sizeof(union jump_tree_node);
		kvfree(jet);
		jet = new_jet;
	}

	init_jtn(jet + jet_ind);
	return 0;
}

/**
 * add_branch - adds a branch to the jump entry lookup tree
 *
 * Adds a branch to the current node corresponding to whether the MSB of
 * *cur_code_ptr is 1 or 0, then shifts that bit out. The branch target is
 * the index of the next node in the tree or a leaf (jump entry).
 */
static int add_branch(unsigned long *cur_code_ptr,
		      union jump_tree_node *cur_node, uint64_t branch_target)
{
	bool bit;

	if (!cur_code_ptr || !cur_node)
		return -EACCES;

	bit = JUMP_TREE_BRANCH(*cur_code_ptr);
	*cur_code_ptr <<= 1;

	if (bit)
		cur_node->je_index[0] = branch_target;
	else
		cur_node->je_index[1] = branch_target;

	return 0;
}

/**
 * add_trunk - adds a path of branches to the jump entry lookup tree
 *
 * Adds the series of branches to the jump entry lookup tree corresponding to
 * the bits of the cur_code_ptr value, MSB-first, starting at bit index i, and
 * adds the leaf node at the end.
 */
static int add_trunk(union jump_tree_node *cur_node, struct jump_entry *je,
		     unsigned long cur_code_ptr, int i)
{
	if (!cur_node || !je)
		return -EACCES;

	for (; i < sizeof(cur_code_ptr) * 8; i++) {
		if (add_branch(&cur_code_ptr, cur_node, jet_ind))
			return -EACCES;
		if (jet_extend())
			return -ENOMEM;
		cur_node = jet + jet_ind;
		jet_ind++;
	}

	/*
	 * GKI rips aarch64_insn_gen_branch_imm out for the illusion of security
	 * while leaving jump_labels as a whole intact; this is a partial
	 * duplicate of that function.
	 */
	cur_node->je_val = (jump_entry_target(je) - jump_entry_code(je)) >> 2;

	return 0;
}

/**
 * jet_alloc - allocates a new jump entry tree for lookups
 * @je_start: start of the kernel's jump entry table
 * @je_end: end of the kernel's jump entry table
 * @jet_size_out: output parameter for the size of the jump entry tree
 *                (in entries)
 *
 * Allocates the jump entry tree with 2x the number of jump entries in the
 * kernel. This is because we need to store both the jump entry and, in the
 * worst case, a pointer to the next index in the tree. Uses kvmalloc (the
 * memory may be physically non-contiguous).
 *
 * Return: 0 on failure, or the pointer to the allocated jump entry tree
 */
static uint64_t jet_alloc(struct jump_entry *je_start, struct jump_entry *je_end,
			  uint64_t *jet_size_out)
{
	union jump_tree_node *cur_node;
	unsigned long cur_code_ptr = -1;
	bool bit;
	int i = 0;

	if (!je_start || !je_end || !jet_size_out) {
		pr_err("%s fail: invalid parameters\n", __func__);
		return 0;
	}

	if (je_start >= je_end) {
		pr_err("%s fail: je_start >= je_end\n", __func__);
		return 0;
	}

	jet = kvmalloc(2 * (je_end - je_start) * sizeof(union jump_tree_node),
		       GFP_KERNEL);
	if (!jet) {
		pr_err("%s fail: allocation failed\n", __func__);
		return 0;
	}
	jet_ind = 1;
	jet_end = jet + 2 * (je_end - je_start);
	init_jtn(jet);
	init_jtn(jet + 1);

	while (je_start != je_end) {
		cur_code_ptr = __virt_to_phys(jump_entry_code(je_start));
		cur_node = jet;

		// iterate over the bits of the jump label
		for (i = 0; i < sizeof(cur_code_ptr) * 8; i++) {
			bit = JUMP_TREE_BRANCH(cur_code_ptr);

			if (bit) {
				if (cur_node->je_index[0]) {
					cur_node = jet + cur_node->je_index[0];
				} else {
					if (add_trunk(cur_node, je_start,
						      cur_code_ptr, i))
						return 0;
					break;
				}
			} else {
				if (cur_node->je_index[1]) {
					cur_node = jet + cur_node->je_index[1];
				} else {
					if (add_trunk(cur_node, je_start,
						      cur_code_ptr, i))
						return 0;
					break;
				}
			}

			cur_code_ptr <<= 1;
		}

		je_start++;
	}

	*jet_size_out = jet_end - jet;
	return (uint64_t)jet;
}

#endif


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 00/15] pkeys-based page table hardening
  2025-03-19 21:54     ` Maxwell Bland
@ 2025-03-25 17:11       ` Kevin Brodsky
  2025-03-28 16:15         ` Maxwell Bland
  0 siblings, 1 reply; 28+ messages in thread
From: Kevin Brodsky @ 2025-03-25 17:11 UTC (permalink / raw)
  To: Maxwell Bland
  Cc: linux-kernel, Andrew Morton, Mark Brown, Catalin Marinas,
	Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly, Kees Cook,
	Linus Walleij, Andy Lutomirski, Marc Zyngier, Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

On 19/03/2025 22:54, Maxwell Bland wrote:
> [...]
>
> Onto the solution. In a pure-immutability approach, I ended up creating
> some definitions regarding what an "expected" virtual-to-physical
> mapping of kernel memory or change in page permissions is (my solution
> was to treat init_mm as a ground truth structure which is only RW under
> a more refined set of constraints on mapping updates and permission set
> updates), for example,
>
> if (get_tag(cur_pte & EL1_PADDR_MASK) == SECCOMP_PAGE) {
>   /* run BPF re-verifier to avoid the Qualys 2021 CVE PoC magic */
> }
>
> or forcing PXN on everything unless some other check says we
> absolutely have to make this PX:
>
> if (!cur_pte && new_pte) {
>   if (!(new_pte & pxn_bit)) {
>     new_pte |= pxn_bit;
>   }
> }
>
> And then a bunch of additional infrastructure to define what a "safe
> mapping of physical memory" looks like, which introduces its own set of
> problems for defining how/where vmalloc allocates from, physical address
> layout randomization, ...
>
> Such an explication of the expected possible modifications to PTEs may
> be needed? I don't understand whether just adding guard() around set_pte
> helps this case.

If I understood correctly the case you're referring to, the simple
answer is that it doesn't. If some driver explicitly updates page
tables, and then some exploit hijacks those pgtable updates, then kpkeys
won't help at all. I see two separate problems here: ensuring that the
appropriate interface is used to write to pgtables (or other objects),
and then constraining which pgtables exactly a certain context can
manipulate via this interface. kpkeys addresses the first problem, which
you could say is a low-hanging fruit. The second problem clearly needs
addressing for a strong protection of page tables, but I don't think
kpkeys can directly help with that.
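
To make that concrete, the series essentially turns pgtable writes into
something like the below (simplified; the exact guard name in the
patches may differ):

static inline void set_pte(pte_t *ptep, pte_t pte)
{
	/*
	 * The guard constructor switches the pkey register to
	 * KPKEYS_LVL_PGTABLES, granting write access to the pgtables
	 * pkey on this CPU; the destructor restores the previous level
	 * when the function returns.
	 */
	guard(kpkeys_hardened_pgtables)();

	WRITE_ONCE(*ptep, pte);
}

Nothing here validates ptep or pte themselves, which is exactly the
second problem above.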

> [...]
>
>>> Where overwriting f_op is a "classic" bypass of protection systems like
>>> this one.
>> Indeed, this has been pointed out to me on a few occasions. On that note
>> I should reference the follow-up series that aims at protecting struct
>> cred [1].
>>
>> Using kpkeys to ensure that the task->cred pointer itself is mostly
>> read-only is not straightforward, because there is just too much
>> existing code that directly writes to other fields in task_struct -
>> meaning we can't simply make the whole task_struct mostly read-only.
>> What could be done is to split out the cred fields in a separate
>> kpkeys-protected page, and link that page from task_struct. There is a
>> clear overhead associated with this approach though, and maybe more
>> importantly we have just displaced the issue as the pointer to that
>> protected page remains writeable... I'm not sure how much a page-based
>> technique like pkeys can help here.
> __randomize_layout accomplished some of what is necessary here, while
> accommodating contemporary hardware limitations, albeit to a finer
> granularity and lesser degree. I am also not sure how well it works on
> production systems (yet! It is on my TODO list in the short term), or
> how much protection it really implies for an adversary with a "read
> gadget" that can determine where to write (either from associated asm or
> from the struct itself).
>
> But...
>
> (1) After thinking about this for a few hours, __randomize_layout implies
> something valuable: under circumstances where this is supported, because
> we assume the struct's layout itself is modifiable, it becomes possible
> to introduce a canary value _not_ for the modifiable portions, but just
> for the non-modifiable portions, store this canary value in a strictly
> read-only region of memory that also tracks pointers to the structure,
> and ensure uses of the non-modifiable fields must always leverage the
> canary value stored in this secondary region.

I don't think this can be enforced with pkeys, since the overall struct
would have to remain RW. But I'm not sure I completely understand how
that scheme would work overall, or how __randomize_layout is a
prerequisite for it.

> [...]
>
> It may be possible to write a plugin modification to
>
> https://github.com/gcc-mirror/gcc/blob/12b2c414b6d0e0d1b3d328b58d654c19c30bee8c/gcc/config/arm/aarch-bti-insert.cc
>
> to also inject the POE/register-set instructions, conditioned upon an
> expected privilege (kpkeys) level of the executing function.
>
> Instead of adding code to set the privilege state via a specialized
> callgate function, this function's call would be explicitly tied into
> the GCC CFI instrumentation pass. The specific "key level", e.g.
> KPKEYS_LVL_PGTABLES, could likely be determined by the file name/path
> via the compiler, or many other clever mechanisms, potentially with
> support for exceptions via function attribute tags.

Right, I'm with you now. This is actually something we considered, but
I'm not sure it really helps all that much. Adding a guard object to a
function is basically the same amount of churn as marking that function
with an attribute, except it doesn't require compiler support. It would
get more useful if you wanted all functions in a file to switch kpkeys
level (using a compiler flag), but for the purpose of making certain
data mostly read-only, I doubt you actually want that - the granularity
is just too coarse.

A different angle would be to use an attribute to mark struct members,
rather than functions. That would instruct the compiler to switch kpkeys
level whenever that member is written (and only then). You would
probably want to forbid taking the address of the member too, as the
compiler can't easily track that. That said this doesn't solve the
bigger issue of existing code expecting to be able to write to other
members in the struct. It most likely works only if the kpkeys switch is
done when writing to any member of the struct (which may get very
expensive).
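
To sketch it anyway - no compiler implements this today, and the marker
below expands to nothing:

#define __kpkeys_member(lvl)	/* hypothetical, compiler-recognised */

struct foo {
	void (*callback)(void) __kpkeys_member(KPKEYS_LVL_FOO);
	int counter;		/* freely writable */
};

/*
 * The compiler would rewrite a direct store to the marked member:
 *
 *	f->callback = fn;
 * into:
 *	prev = kpkeys_set_level(KPKEYS_LVL_FOO);
 *	f->callback = fn;
 *	kpkeys_restore_pkey_reg(prev);
 *
 * and reject &f->callback, since an escaped pointer cannot be tracked.
 */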

> [...]
> Noticed as well, just now, the reliance on the static_key for branch
> unlikely ... the following data structure (at the end of this email) may
> be of interest as an FYI ... it can be used to track whether the kernel
> self patching API is only being used to patch expected locations.
>
> I use this right now with additional checks that the instruction being
> written for the indirect branch matches the expected offset.

Are there exploits out there corrupting the address of a static branch?
This seems difficult to me in practice because the __jump_table section
where the addresses of instructions to be patched are stored is marked
__ro_after_init.

- Kevin


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 00/15] pkeys-based page table hardening
  2025-03-25 17:11       ` Kevin Brodsky
@ 2025-03-28 16:15         ` Maxwell Bland
  2025-04-04  7:57           ` Kevin Brodsky
  0 siblings, 1 reply; 28+ messages in thread
From: Maxwell Bland @ 2025-03-28 16:15 UTC (permalink / raw)
  To: Kevin Brodsky
  Cc: linux-kernel, Andrew Morton, Mark Brown, Catalin Marinas,
	Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly, Kees Cook,
	Linus Walleij, Andy Lutomirski, Marc Zyngier, Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

On Tue, Mar 25, 2025 at 06:11:24PM +0100, Kevin Brodsky wrote:
> >>> Where overwriting f_op is a "classic" bypass of protection systems like
> >>> this one.
> >> Indeed, this has been pointed out to me on a few occasions. On that note
> >> I should reference the follow-up series that aims at protecting struct
> >> cred [1].
> >>
> >> Using kpkeys to ensure that the task->cred pointer itself is mostly
> >> read-only is not straightforward, because there is just too much
> >> existing code that directly writes to other fields in task_struct -
> >> meaning we can't simply make the whole task_struct mostly read-only.
> >> What could be done is to split out the cred fields in a separate
> >> kpkeys-protected page, and link that page from task_struct. There is a
> >> clear overhead associated with this approach though, and maybe more
> >> importantly we have just displaced the issue as the pointer to that
> >> protected page remains writeable... I'm not sure how much a page-based
> >> technique like pkeys can help here.
> > __randomize_layout accomplished some of what is necessary here, while
> > accommodating contemporary hardware limitations, albeit to a finer
> > granularity and lesser degree. I am also not sure how well it works on
> > production systems (yet! It is on my TODO list in the short term), or
> > how much protection it really implies for an adversary with a "read
> > gadget" that can determine where to write (either from associated asm or
> > from the struct itself).
> >
> > But...
> >
> > (1) After thinking about this for a few hours, __randomize_layout implies
> > something valuable: under circumstances where this is supported, because
> > we assume the struct's layout itself is modifiable, it becomes possible
> > to introduce a canary value _not_ for the modifiable portions, but just
> > for the non-modifiable portions, store this canary value in a strictly
> > read-only region of memory that also tracks pointers to the structure,
> > and ensure uses of the non-modifiable fields must always leverage the
> > canary value stored in this secondary region.
> 
> I don't think this can be enforced with pkeys, since the overall struct
> would have to remain RW. But I'm not sure I completely understand how
> that scheme would work overall, or how __randomize_layout is a
> prerequisite for it.

Hi, so __randomize_layout is interesting only in that it demonstrates we
can modify the layout (and therefore associated code accessing fields)
for a given data structure.

When I wrote the previous email, I thought maybe we'd be able to use it
to do more, but not so much any more.

However, the "real" idea here is:

(1) "splitting up" each RW data structure ($X$) into mutable and non-mutable
parts would be difficult to implement and to maintain, so I was looking for
a solution outside of that.
(2) The primitive we do have, thanks to POE and this patch, is
the ability to make a page of memory writable only from a specific
context, and read only otherwise.
(3) The well-trodden solution is to allocate a field on a second,
independent page $p$ when allocating $X$, with some field/hash/magic to
guarantee the integrity of writes to the immutable fields of $X$
(sometimes called shadow memory).

Valid, CFI-guaranteed functions would have access to $p$, and would
be instrumented to update $p$ with a new hash of the fields of $X$
when updating $X$ (though in practice likely something more
performance-friendly than a full hash). When reading from $X$, the
data is pulled from $p$ to ensure the field being read from $X$ is
valid and has not been modified. It is not possible to modify $p$
outside of the valid contexts in which we can modify the read-mostly
or constant fields of $X$.

Importantly, this does not rely on the confidentiality of $p$, which I
think was an issue with trying an approach like this with ARM MTE
previously, or at least the version of ARM MTE that Juhee Kim et al. at
Georgia Tech broke with speculative execution, iirc.

I think performance is difficult here, but that's more of a development
concern, I hope, than a concern in theory.
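
In code, the shape I have in mind is roughly the following - the names
and the digest helper are made up, and a real implementation would use
something cheaper than a full hash:

struct shadow_entry {		/* lives on the kpkeys-protected page $p$ */
	void *obj;		/* which $X$ this entry guards */
	u64 digest;		/* integrity value over the immutable fields */
};

/* Writer: only valid contexts run at the elevated kpkeys level. */
static void update_fop(struct file *f, struct shadow_entry *se,
		       const struct file_operations *fop)
{
	u64 prev = kpkeys_set_level(KPKEYS_LVL_SHADOW);	/* hypothetical */

	f->f_op = fop;
	se->obj = f;
	se->digest = shadow_digest(fop);	/* made-up helper */
	kpkeys_restore_pkey_reg(prev);
}

/* Reader: instrumented check before any use of the critical field. */
static const struct file_operations *load_fop(struct file *f,
					      struct shadow_entry *se)
{
	const struct file_operations *fop = READ_ONCE(f->f_op);

	BUG_ON(se->obj != f || se->digest != shadow_digest(fop));
	return fop;
}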

> 
> > [...]
> >
> > It may be possible to write a plugin modification to
> >
> > https://github.com/gcc-mirror/gcc/blob/12b2c414b6d0e0d1b3d328b58d654c19c30bee8c/gcc/config/arm/aarch-bti-insert.cc
> >
> > to also inject the POE/register-set instructions, conditioned upon an
> > expected privilege (kpkeys) level of the executing function.
> >
> > Instead of adding code to set the privilege state via a specialized
> > callgate function, this function's call would be explicitly tied into
> > the GCC CFI instrumentation pass. The specific "key level", e.g.
> > KPKEYS_LVL_PGTABLES, could likely be determined by the file name/path
> > via the compiler, or many other clever mechanisms, potentially with
> > support for exceptions via function attribute tags.
> 
> Right, I'm with you now. This is actually something we considered, but
> I'm not sure it really helps all that much. Adding a guard object to a
> function is basically the same amount of churn as marking that function
> with an attribute, except it doesn't require compiler support. It would
> get more useful if you wanted all functions in a file to switch kpkeys
> level (using a compiler flag), but for the purpose of making certain
> data mostly read-only, I doubt you actually want that - the granularity
> is just too coarse.
> 
> A different angle would be to use an attribute to mark struct members,
> rather than functions. That would instruct the compiler to switch kpkeys
> level whenever that member is written (and only then). You would
> probably want to forbid taking the address of the member too, as the
> compiler can't easily track that. That said this doesn't solve the
> bigger issue of existing code expecting to be able to write to other
> members in the struct. It most likely works only if the kpkeys switch is
> done when writing to any member of the struct (which may get very
> expensive).

We agree. Doing something like this doesn't crash stuff, but it makes
the phone sluggish and terrible to use. (-: Hence, I may try the above:
keep the struct read-write, but when reading from "critical fields"
(pointers to function pointers), require the compiler to inject a check
for an integrity value stored on a mostly-read-only page. That integrity
value can only be updated by code that is responsible for writing said
critical fields.

Since the supposition is things like f_ops don't really need to change
much ($p$ does not need to be accessed much), and otherwise the data
structure is fully writable, the performance impact seems like it would
not be significant.

That said, and if I am not mistaken, the downside is it'd require
Clang/GCC support, same as CFI.

> 
> > [...]
> > Noticed as well, just now, the reliance on the static_key for branch
> > unlikely ... the following data structure (at the end of this email) may
> > be of interest as an FYI ... it can be used to track whether the kernel
> > self patching API is only being used to patch expected locations.
> >
> > I use this right now with additional checks that the instruction being
> > written for the indirect branch matches the expected offset.
> 
> Are there exploits out there corrupting the address of a static branch?
> This seems difficult to me in practice because the __jump_table section
> where the addresses of instructions to be patched are stored is marked
> __ro_after_init.

There are a couple of different ways. You can do the attack this patch
is intended to prevent: corrupt the page tables to give yourself
permission to write to the static key struct as part of a larger
chain. There is also the ability to inject code into a kernel module
to call the text-patching API and use it as a write gadget for the
rest of the kernel, etc.

There were a lot of claims back in the day that kernel code would be
marked strictly read-only by the EL2 secure monitor or kernel protection
systems, but there's self-modifying code built right into it under many
KConfigs. To have a guarantee of CFI you kind of have to ensure the
kernel can't patch itself.

Maxwell


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 00/15] pkeys-based page table hardening
  2025-03-28 16:15         ` Maxwell Bland
@ 2025-04-04  7:57           ` Kevin Brodsky
  2025-04-14 22:43             ` Maxwell Bland
  0 siblings, 1 reply; 28+ messages in thread
From: Kevin Brodsky @ 2025-04-04  7:57 UTC (permalink / raw)
  To: Maxwell Bland
  Cc: linux-kernel, Andrew Morton, Mark Brown, Catalin Marinas,
	Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly, Kees Cook,
	Linus Walleij, Andy Lutomirski, Marc Zyngier, Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

On 28/03/2025 17:15, Maxwell Bland wrote:
> [...]
>
> However, the "real" idea here is:
>
> (1) "splitting up" each RW data structure ($X$) into mutable and non-mutable
> parts would be difficult to implement and to maintain, so I was looking for
> a solution outside of that.
> (2) The primitive we do have, thanks to POE and this patch, is
> the ability to make a page of memory writable only from a specific
> context, and read only otherwise.
> (3) The well-trodden solution is to allocate a field on a second,
> independent page $p$ when allocating $X$, with some field/hash/magic to
> guarantee the integrity of writes to the immutable fields of $X$
> (sometimes called shadow memory).
>
> Valid, CFI-guaranteed functions would have access to $p$, and would
> be instrumented to update $p$ with a new hash of the fields of $X$
> when updating $X$ (though in practice likely something more
> performance-friendly than a full hash). When reading from $X$, the
> data is pulled from $p$ to ensure the field being read from $X$ is
> valid and has not been modified. It is not possible to modify $p$
> outside of the valid contexts in which we can modify the read-mostly
> or constant fields of $X$.
>
> Importantly, this does not rely on the confidentiality of $p$, which I
> think was an issue with trying an approach like this with ARM MTE
> previously, or at least the version of ARM MTE that Juhee Kim et al. at
> Georgia Tech broke with speculative execution, iirc.
>
> I think performance is difficult here, but that's more of a development
> concern, I hope, than a concern in theory.

Thank you for elaborating, this is much clearer now. I suppose this
requires that all read-mostly fields are accessed via helpers that will
check/update $p$. Painful when those fields are directly accessed today,
as in the case of task_struct, but the required changes are hopefully
easy to automate (using Coccinelle for instance). And as you point out
further down, this could be done via a compiler attribute instead. The
performance impact on reads is clearly a concern, but it is not an
all-or-nothing scheme - we can choose which members are protected,
meaning we can make trade-offs.

Overall this seems worth investigating. I wonder, have you considered
how accessors would find the shadow memory? It could of course be linked
directly from task_struct, but then nothing prevents that pointer from
being corrupted. I can't think of another cheap way to link $p$ though.
This is not a full-blown shadow memory approach, so I'm not sure we can
reserve a whole chunk of the address space for that purpose.

>> [...]
>>
>> A different angle would be to use an attribute to mark struct members,
>> rather than functions. That would instruct the compiler to switch kpkeys
>> level whenever that member is written (and only then). You would
>> probably want to forbid taking the address of the member too, as the
>> compiler can't easily track that. That said this doesn't solve the
>> bigger issue of existing code expecting to be able to write to other
>> members in the struct. It most likely works only if the kpkeys switch is
>> done when writing to any member of the struct (which may get very
>> expensive).
> We agree. Doing something like this doesn't crash stuff, but it makes
> the phone sluggish and terrible to use. (-: Hence, I may try the above:
> keep the struct read-write, but when reading from "critical fields"
> (pointers to function pointers), require the compiler to inject a check
> for an integrity value stored on a mostly-read-only page. That integrity
> value can only be updated by code that is responsible for writing said
> critical fields.
>
> Since the supposition is things like f_ops don't really need to change
> much ($p$ does not need to be accessed much), and otherwise the data
> structure is fully writable, the performance impact seems like it would
> not be significant.

Agreed, it doesn't have to be very expensive if used sparingly.

> That said, and if I am not mistaken, the downside is it'd require
> Clang/GCC support, same as CFI.

Indeed. For experimenting, a Coccinelle script to convert direct access
to certain members to a function call is probably easier :)
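
Something along these lines, for instance (untested, and
task_set_cred() is a made-up accessor):

@@
struct task_struct *t;
expression E;
@@
- t->cred = E;
+ task_set_cred(t, E);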

>
>>> [...]
>>> Noticed as well, just now, the reliance on the static_key for branch
>>> unlikely ... the following data structure (at the end of this email) may
>>> be of interest as an FYI ... it can be used to track whether the kernel
>>> self patching API is only being used to patch expected locations.
>>>
>>> I use this right now with additional checks that the instruction being
>>> written for the indirect branch matches the expected offset.
>> Are there exploits out there corrupting the address of a static branch?
>> This seems difficult to me in practice because the __jump_table section
>> where the addresses of instructions to be patched are stored is marked
>> __ro_after_init.
> There are a couple of different ways. You can do the attack this patch
> is intended to prevent: corrupt the page tables to give yourself
> permission to write to the static key struct as part of a larger
> chain. There is also the ability to inject code into a kernel module
> to call the text-patching API and use it as a write gadget for the
> rest of the kernel, etc.

Right. I'd argue that static keys are not your biggest worry if the
attacker can control page tables or call arbitrary functions - and in
fact kpkeys are easily defeated if the attacker is able to modify the
control flow (one could simply call kpkeys_set_level()).

> There were a lot of claims back in the day that kernel code would be
> marked strictly read-only by the EL2 secure monitor or kernel protection
> systems, but there's self-modifying code built right into it under many
> KConfigs. To have a guarantee of CFI you kind of have to ensure the
> kernel can't patch itself.

Self-patching is used extensively on boot (on arm64 at least) to support
optional extensions such as POE. I suppose this kind of patching isn't
really a concern as it occurs very early. Static keys can be used at any
point and are therefore more dangerous, but the performance uplift is
likely significant in many cases.

- Kevin


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 00/15] pkeys-based page table hardening
  2025-04-04  7:57           ` Kevin Brodsky
@ 2025-04-14 22:43             ` Maxwell Bland
  0 siblings, 0 replies; 28+ messages in thread
From: Maxwell Bland @ 2025-04-14 22:43 UTC (permalink / raw)
  To: Kevin Brodsky
  Cc: linux-kernel, Andrew Morton, Mark Brown, Catalin Marinas,
	Dave Hansen, Jann Horn, Jeff Xu, Joey Gouly, Kees Cook,
	Linus Walleij, Andy Lutomirski, Marc Zyngier, Peter Zijlstra,
	Pierre Langlois, Quentin Perret, Mike Rapoport (IBM),
	Ryan Roberts, Thomas Gleixner, Will Deacon, Matthew Wilcox,
	Qi Zheng, linux-arm-kernel, linux-mm, x86

On Fri, Apr 04, 2025 at 09:57:02AM +0200, Kevin Brodsky wrote:
> On 28/03/2025 17:15, Maxwell Bland wrote:
> Overall this seems worth investigating. I wonder, have you considered
> how accessors would find the shadow memory? It could of course be linked
> directly from task_struct, but then nothing prevents that pointer from
> being corrupted. I can't think of another cheap way to link $p$ though.
> This is not a full-blown shadow memory approach, so I'm not sure we can
> reserve a whole chunk of the address space for that purpose.

Hi, apologies for the delay again; I had many fires to put out last week.

I saw you posted a V4 for this, so I'll close out this chain.

W.r.t. the above, it may be possible to segment the RB tree in vmalloc.c
and designate an allocation region for this purpose. I did something
similar to enforce PXNTable across vmalloc a year or so ago, which ended
up being successful on a production device.
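
For reference, the rough shape of that - the VA window is made up for
illustration, and __vmalloc_node_range() is the existing allocator that
takes an explicit range:

/* hypothetical dedicated VA window for integrity/shadow pages */
#define SHADOW_VA_START	0xffff900000000000UL
#define SHADOW_VA_END	0xffff900040000000UL

static void *shadow_page_alloc(unsigned long size)
{
	return __vmalloc_node_range(size, PAGE_SIZE,
				    SHADOW_VA_START, SHADOW_VA_END,
				    GFP_KERNEL, PAGE_KERNEL, 0,
				    NUMA_NO_NODE,
				    __builtin_return_address(0));
}

Accessors can then validate a candidate $p$ with a simple range check
against compile-time constants.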

I plan to experiment a bit with different approaches and will probably
send the code to the mailing list once/if I get something together (also
if it isn't pre-empted by someone smarter and faster doing something
better).  (-:

> Indeed. For experimenting, a Coccinelle script to convert direct access
> to certain members to a function call is probably easier :)

This does keep it in-kernel, which is nice, and I will keep this in mind
as I write.

Thank you for the discussion and patch, as well as the newest one!

- Maxwell Bland


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2025-04-14 22:44 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-03 10:18 [RFC PATCH v3 00/15] pkeys-based page table hardening Kevin Brodsky
2025-02-03 10:18 ` [RFC PATCH v3 01/15] mm: Introduce kpkeys Kevin Brodsky
2025-02-03 10:18 ` [RFC PATCH v3 02/15] set_memory: Introduce set_memory_pkey() stub Kevin Brodsky
2025-02-03 10:18 ` [RFC PATCH v3 03/15] arm64: mm: Enable overlays for all EL1 indirect permissions Kevin Brodsky
2025-02-03 10:18 ` [RFC PATCH v3 04/15] arm64: Introduce por_set_pkey_perms() helper Kevin Brodsky
2025-02-03 10:18 ` [RFC PATCH v3 05/15] arm64: Implement asm/kpkeys.h using POE Kevin Brodsky
2025-02-03 10:18 ` [RFC PATCH v3 06/15] arm64: set_memory: Implement set_memory_pkey() Kevin Brodsky
2025-02-03 10:18 ` [RFC PATCH v3 07/15] arm64: Enable kpkeys Kevin Brodsky
2025-02-03 10:18 ` [RFC PATCH v3 08/15] mm: Introduce kernel_pgtables_set_pkey() Kevin Brodsky
2025-02-06 19:01   ` Linus Walleij
2025-02-07 14:33     ` Kevin Brodsky
2025-02-03 10:18 ` [RFC PATCH v3 09/15] mm: Introduce kpkeys_hardened_pgtables Kevin Brodsky
2025-02-03 10:18 ` [RFC PATCH v3 10/15] mm: Allow __pagetable_ctor() to fail Kevin Brodsky
2025-02-03 10:18 ` [RFC PATCH v3 11/15] mm: Map page tables with privileged pkey Kevin Brodsky
2025-02-03 10:18 ` [RFC PATCH v3 12/15] arm64: kpkeys: Support KPKEYS_LVL_PGTABLES Kevin Brodsky
2025-02-03 10:18 ` [RFC PATCH v3 13/15] arm64: mm: Guard page table writes with kpkeys Kevin Brodsky
2025-02-03 10:18 ` [RFC PATCH v3 14/15] arm64: Enable kpkeys_hardened_pgtables support Kevin Brodsky
2025-02-03 10:18 ` [RFC PATCH v3 15/15] mm: Add basic tests for kpkeys_hardened_pgtables Kevin Brodsky
2025-02-06 22:41 ` [RFC PATCH v3 00/15] pkeys-based page table hardening Kees Cook
2025-02-10 14:23   ` Kevin Brodsky
2025-02-13 14:54     ` Kevin Brodsky
2025-03-06 16:23 ` Maxwell Bland
2025-03-13 12:32   ` Kevin Brodsky
2025-03-19 21:54     ` Maxwell Bland
2025-03-25 17:11       ` Kevin Brodsky
2025-03-28 16:15         ` Maxwell Bland
2025-04-04  7:57           ` Kevin Brodsky
2025-04-14 22:43             ` Maxwell Bland

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox