* [PATCH v4 0/1] Seal system mappings
From: jeffxu @ 2024-11-25 20:20 UTC
To: akpm, keescook, jannh, torvalds, adhemerval.zanella, oleg
Cc: linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Jeff Xu
From: Jeff Xu <jeffxu@chromium.org>
Seal vdso, vvar, sigpage, uprobes and vsyscall.
Those mappings are read-only or execute-only; sealing protects them
from being changed or unmapped during the lifetime of the process.
For complete descriptions of memory sealing, please see mseal.rst [1].
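
To illustrate what sealing means for a mapping, here is a minimal userspace sketch. It seals an ordinary anonymous page (not a system mapping) and then tries to unmap it. The helper name seal_and_try_unmap() is hypothetical, and the fallback syscall number 462 is an assumption for libc headers that do not define SYS_mseal. On kernels without mseal(), or where the call is filtered, the helper reports sealing as unavailable rather than failing.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: mseal(2) is syscall 462 when the libc headers lack SYS_mseal. */
#ifndef SYS_mseal
#define SYS_mseal 462
#endif

/*
 * Hypothetical helper, for illustration only.
 * Returns  0 if the page was sealed and munmap() was refused with EPERM,
 *          1 if sealing is unavailable here (old kernel, 32-bit, sandbox),
 *         -1 on any unexpected result.
 */
int seal_and_try_unmap(void)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;

	assert(page > 0);

	p = mmap(NULL, (size_t)page, PROT_READ,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* The flags argument must be 0. */
	if (syscall(SYS_mseal, p, (size_t)page, 0UL) != 0) {
		munmap(p, (size_t)page);	/* not sealed, so this still works */
		return 1;
	}

	/* A sealed mapping cannot be unmapped; the kernel reports EPERM. */
	if (munmap(p, (size_t)page) == -1 && errno == EPERM)
		return 0;

	return -1;
}
```

With this series applied and enabled, the same refusal would apply to munmap(), mremap() or mprotect() calls aimed at the vdso, vvar, sigpage, uprobe and vsyscall mappings.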
System mappings such as vdso, vvar, and sigpage (for arm) are
generated by the kernel during program initialization, and are
sealed after creation.
Unlike the aforementioned mappings, the uprobe mapping is not
established during program startup. However, its lifetime is the same
as the process's lifetime [2]. It is sealed from creation.
The vdso, vvar, sigpage, and uprobe mappings all invoke the
_install_special_mapping() function. As no other mappings utilize this
function, it is logical to incorporate sealing logic within
_install_special_mapping(). This approach avoids the necessity of
modifying code across various architecture-specific implementations.
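
The design choice above can be sketched in plain userspace C. This is a simplified model, not the kernel code: the DEMO_VM_* values and every demo_* name are hypothetical, and the real VM_* bits differ. It only shows why OR-ing the seal bit in at one shared function reaches every caller without touching them.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for illustration only; the kernel's VM_* bits differ. */
#define DEMO_VM_READ	0x1UL
#define DEMO_VM_EXEC	0x4UL
#define DEMO_VM_SEALED	0x100UL

/* Stand-in for the boot-time setting (Kconfig default or command line). */
bool demo_seal_enabled = true;

/* Mirrors the shape of seal_system_mappings(): a flag to OR in, or 0. */
unsigned long demo_seal_system_mappings(void)
{
	return demo_seal_enabled ? DEMO_VM_SEALED : 0UL;
}

/* Single choke point: every caller gets the seal bit without code changes. */
unsigned long demo_install_special_mapping(unsigned long vm_flags)
{
	return vm_flags | demo_seal_system_mappings();
}

/* Two hypothetical callers (think vdso and sigpage) that never mention sealing. */
void demo_mapping_check(void)
{
	demo_seal_enabled = true;
	assert(demo_install_special_mapping(DEMO_VM_READ | DEMO_VM_EXEC) ==
	       (DEMO_VM_READ | DEMO_VM_EXEC | DEMO_VM_SEALED));

	demo_seal_enabled = false;
	assert(demo_install_special_mapping(DEMO_VM_READ) == DEMO_VM_READ);
}
```

The trade-off is that any future caller that must not be sealed would need the check moved back out to the call sites.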
The vsyscall mapping, which has its own initialization function, is
sealed in the XONLY case, which seems to be the most common and secure
way of using vsyscall.
It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
alter the mapping of vdso, vvar, and sigpage during restore
operations. Consequently, this feature cannot be universally enabled
across all systems.
Currently, memory sealing is only functional in a 64-bit kernel
configuration.
To enable this feature, the architecture needs to be tested to
confirm that it doesn't unmap/remap system mappings during the
lifetime of the process. After the architecture enables
ARCH_HAS_SEAL_SYSTEM_MAPPINGS, a distribution can set
CONFIG_SEAL_SYSTEM_MAPPINGS to manage access to the feature.
Alternatively, the kernel command line parameter
exec.seal_system_mappings can enable or disable this feature at boot.
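
The precedence between the build-time default and the boot parameter can be modelled with a small userspace sketch. It is not the kernel implementation (which uses lookup_constant() and early_param()); demo_seal_setting() and its check function are hypothetical names. The sketch assumes the documented behaviour: "yes" and "no" override the default, and a missing or unrecognized value leaves the default in place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of the override rule:
 *   arg            - value given to exec.seal_system_mappings=, or NULL
 *   config_default - what CONFIG_SEAL_SYSTEM_MAPPINGS selected at build time
 */
bool demo_seal_setting(const char *arg, bool config_default)
{
	if (arg == NULL)
		return config_default;	/* parameter not specified */
	if (strcmp(arg, "yes") == 0)
		return true;
	if (strcmp(arg, "no") == 0)
		return false;
	return config_default;		/* invalid value: keep the default */
}

/* Usage example covering each branch. */
void demo_override_check(void)
{
	assert(demo_seal_setting("yes", false) == true);
	assert(demo_seal_setting("no", true) == false);
	assert(demo_seal_setting("maybe", true) == true);
	assert(demo_seal_setting(NULL, false) == false);
}
```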
This feature has been tested with ChromeOS and Android on X86_64 and
ARM64; therefore ARCH_HAS_SEAL_SYSTEM_MAPPINGS is set for X86_64 and ARM64.
Other architectures can enable this after testing. No specific hardware
features from the CPU are needed.
This feature's security enhancements will benefit ChromeOS, Android,
and other secure-by-default systems.
[1] Documentation/userspace-api/mseal.rst
[2] https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/
History:
V4:
ARCH_HAS_SEAL_SYSTEM_MAPPINGS (Lorenzo Stoakes)
test info (Lorenzo Stoakes)
Update mseal.rst (Liam R. Howlett)
Update test_mremap_vdso.c (Liam R. Howlett)
Misc. style, comments, doc update (Liam R. Howlett)
V3:
https://lore.kernel.org/all/20241113191602.3541870-1-jeffxu@google.com/
Revert uprobe to v1 logic (Oleg Nesterov)
use CONFIG_SEAL_SYSTEM_MAPPINGS instead of _ALWAYS/_NEVER (Kees Cook)
Move kernel cmd line from fs/exec.c to mm/mseal.c and misc. refactor (Liam R. Howlett)
V2:
https://lore.kernel.org/all/20241014215022.68530-1-jeffxu@google.com/
Seal uprobe always (Oleg Nesterov)
Update comments and description (Randy Dunlap, Liam R.Howlett, Oleg Nesterov)
Rebase to linux_main
V1:
https://lore.kernel.org/all/20241004163155.3493183-1-jeffxu@google.com/
Jeff Xu (1):
exec: seal system mappings
.../admin-guide/kernel-parameters.txt | 11 ++++++
Documentation/userspace-api/mseal.rst | 4 ++
arch/arm64/Kconfig | 1 +
arch/x86/Kconfig | 1 +
arch/x86/entry/vsyscall/vsyscall_64.c | 8 +++-
include/linux/mm.h | 12 ++++++
init/Kconfig | 25 ++++++++++++
mm/mmap.c | 10 +++++
mm/mseal.c | 39 +++++++++++++++++++
security/Kconfig | 24 ++++++++++++
10 files changed, 133 insertions(+), 2 deletions(-)
--
2.47.0.338.g60cca15819-goog
* [PATCH v4 1/1] exec: seal system mappings
From: jeffxu @ 2024-11-25 20:20 UTC
To: akpm, keescook, jannh, torvalds, adhemerval.zanella, oleg
Cc: linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Jeff Xu
From: Jeff Xu <jeffxu@chromium.org>
Seal vdso, vvar, sigpage, uprobes and vsyscall.
Those mappings are read-only or execute-only; sealing protects them
from being changed or unmapped during the lifetime of the process.
For complete descriptions of memory sealing, please see mseal.rst [1].
System mappings such as vdso, vvar, and sigpage (for arm) are
generated by the kernel during program initialization, and are
sealed after creation.
Unlike the aforementioned mappings, the uprobe mapping is not
established during program startup. However, its lifetime is the same
as the process's lifetime [2]. It is sealed from creation.
The vdso, vvar, sigpage, and uprobe mappings all invoke the
_install_special_mapping() function. As no other mappings utilize this
function, it is logical to incorporate sealing logic within
_install_special_mapping(). This approach avoids the necessity of
modifying code across various architecture-specific implementations.
The vsyscall mapping, which has its own initialization function, is
sealed in the XONLY case, which seems to be the most common and secure
way of using vsyscall.
It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
alter the mapping of vdso, vvar, and sigpage during restore
operations. Consequently, this feature cannot be universally enabled
across all systems.
Currently, memory sealing is only functional in a 64-bit kernel
configuration.
To enable this feature, the architecture needs to be tested to
confirm that it doesn't unmap/remap system mappings during the
lifetime of the process. After the architecture enables
ARCH_HAS_SEAL_SYSTEM_MAPPINGS, a distribution can set
CONFIG_SEAL_SYSTEM_MAPPINGS to manage access to the feature.
Alternatively, the kernel command line parameter
exec.seal_system_mappings can enable or disable this feature at boot.
This feature has been tested with ChromeOS and Android on X86_64 and
ARM64; therefore ARCH_HAS_SEAL_SYSTEM_MAPPINGS is set for X86_64 and ARM64.
Other architectures can enable this after testing. No specific hardware
features from the CPU are needed.
This feature's security enhancements will benefit ChromeOS, Android,
and other secure-by-default systems.
[1] Documentation/userspace-api/mseal.rst
[2] https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/
Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
.../admin-guide/kernel-parameters.txt | 11 ++++++
Documentation/userspace-api/mseal.rst | 4 ++
arch/arm64/Kconfig | 1 +
arch/x86/Kconfig | 1 +
arch/x86/entry/vsyscall/vsyscall_64.c | 8 +++-
include/linux/mm.h | 12 ++++++
init/Kconfig | 25 ++++++++++++
mm/mmap.c | 10 +++++
mm/mseal.c | 39 +++++++++++++++++++
security/Kconfig | 24 ++++++++++++
10 files changed, 133 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index e7bfe1bde49e..f63268341739 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1538,6 +1538,17 @@
Permit 'security.evm' to be updated regardless of
current integrity status.
+ exec.seal_system_mappings = [KNL]
+ Format: { no | yes }
+ Seal system mappings: vdso, vvar, sigpage, vsyscall,
+ uprobe.
+ - 'no': do not seal system mappings.
+ - 'yes': seal system mappings.
+ This overrides CONFIG_SEAL_SYSTEM_MAPPINGS=(y/n)
+ If not specified or invalid, default is the value set by
+ CONFIG_SEAL_SYSTEM_MAPPINGS.
+ This option has no effect if CONFIG_64BIT=n
+
early_page_ext [KNL,EARLY] Enforces page_ext initialization to earlier
stages so cover more early boot allocations.
Please note that as side effect some optimizations
diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
index 41102f74c5e2..bec122318a59 100644
--- a/Documentation/userspace-api/mseal.rst
+++ b/Documentation/userspace-api/mseal.rst
@@ -130,6 +130,10 @@ Use cases
- Chrome browser: protect some security sensitive data structures.
+- seal system mappings:
+ kernel config CONFIG_SEAL_SYSTEM_MAPPINGS seals system mappings such
+ as vdso, vvar, sigpage, uprobes and vsyscall.
+
When not to use mseal
=====================
Applications can apply sealing to any virtual memory region from userspace,
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 63de71544d95..fc5da8f74342 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -44,6 +44,7 @@ config ARM64
select ARCH_HAS_SETUP_DMA_OPS
select ARCH_HAS_SET_DIRECT_MAP
select ARCH_HAS_SET_MEMORY
+ select ARCH_HAS_SEAL_SYSTEM_MAPPINGS
select ARCH_STACKWALK
select ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_HAS_STRICT_MODULE_RWX
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1ea18662942c..5f6bac99974c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -26,6 +26,7 @@ config X86_64
depends on 64BIT
# Options that are inherently 64-bit kernel only:
select ARCH_HAS_GIGANTIC_PAGE
+ select ARCH_HAS_SEAL_SYSTEM_MAPPINGS
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index 2fb7d53cf333..30e0958915ca 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -366,8 +366,12 @@ void __init map_vsyscall(void)
set_vsyscall_pgtable_user_bits(swapper_pg_dir);
}
- if (vsyscall_mode == XONLY)
- vm_flags_init(&gate_vma, VM_EXEC);
+ if (vsyscall_mode == XONLY) {
+ unsigned long vm_flags = VM_EXEC;
+
+ vm_flags |= seal_system_mappings();
+ vm_flags_init(&gate_vma, vm_flags);
+ }
BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
(unsigned long)VSYSCALL_ADDR);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index df0a5eac66b7..f787d6c85cbb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4238,4 +4238,16 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
+#ifdef CONFIG_64BIT
+/*
+ * return VM_SEALED if seal system mapping is enabled.
+ */
+unsigned long seal_system_mappings(void);
+#else
+static inline unsigned long seal_system_mappings(void)
+{
+ return 0;
+}
+#endif
+
#endif /* _LINUX_MM_H */
diff --git a/init/Kconfig b/init/Kconfig
index 1aa95a5dfff8..614719259aa0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1860,6 +1860,31 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
config ARCH_HAS_MEMBARRIER_SYNC_CORE
bool
+config ARCH_HAS_SEAL_SYSTEM_MAPPINGS
+ bool
+ help
+ Control SEAL_SYSTEM_MAPPINGS access based on architecture.
+
+ A 64-bit kernel is required for the memory sealing feature.
+ No specific hardware features from the CPU are needed.
+
+ To enable this feature, the architecture needs to be tested to
+ confirm that it doesn't unmap/remap system mappings during the
+ lifetime of the process. After the architecture enables this,
+ a distribution can set CONFIG_SEAL_SYSTEM_MAPPINGS to manage access
+ to the feature.
+
+ The CONFIG_SEAL_SYSTEM_MAPPINGS already checks the CHECKPOINT_RESTORE
+ feature, which is known to remap/unmap vdso. Thus, the presence of
+ CHECKPOINT_RESTORE is not considered a factor in enabling
+ ARCH_HAS_SEAL_SYSTEM_MAPPINGS for an architecture.
+
+ For complete list of system mappings, please see
+ CONFIG_SEAL_SYSTEM_MAPPINGS.
+
+ For complete descriptions of memory sealing, please see
+ Documentation/userspace-api/mseal.rst
+
config HAVE_PERF_EVENTS
bool
help
diff --git a/mm/mmap.c b/mm/mmap.c
index 57fd5ab2abe7..bc694c555805 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2133,6 +2133,16 @@ struct vm_area_struct *_install_special_mapping(
unsigned long addr, unsigned long len,
unsigned long vm_flags, const struct vm_special_mapping *spec)
{
+ /*
+ * At present, all mappings (vdso, vvar, sigpage, and uprobe) that
+ * invoke the _install_special_mapping function can be sealed.
+ * Therefore, it is logical to call the seal_system_mappings()
+ * function here. In the future, if this is not the case, i.e. if certain
+ * mappings cannot be sealed, then it would be necessary to move this
+ * check to the calling function.
+ */
+ vm_flags |= seal_system_mappings();
+
return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec,
&special_mapping_vmops);
}
diff --git a/mm/mseal.c b/mm/mseal.c
index ece977bd21e1..80126d6231bb 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -7,6 +7,7 @@
* Author: Jeff Xu <jeffxu@chromium.org>
*/
+#include <linux/fs_parser.h>
#include <linux/mempolicy.h>
#include <linux/mman.h>
#include <linux/mm.h>
@@ -266,3 +267,41 @@ SYSCALL_DEFINE3(mseal, unsigned long, start, size_t, len, unsigned long,
{
return do_mseal(start, len, flags);
}
+
+/*
+ * Kernel cmdline override for CONFIG_SEAL_SYSTEM_MAPPINGS
+ */
+enum seal_system_mappings_type {
+ SEAL_SYSTEM_MAPPINGS_DISABLED,
+ SEAL_SYSTEM_MAPPINGS_ENABLED
+};
+
+static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
+ IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
+ SEAL_SYSTEM_MAPPINGS_DISABLED;
+
+static const struct constant_table value_table_sys_mapping[] __initconst = {
+ { "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
+ { "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
+ { }
+};
+
+static int __init early_seal_system_mappings_override(char *buf)
+{
+ if (!buf)
+ return -EINVAL;
+
+ seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
+ buf, seal_system_mappings_v);
+ return 0;
+}
+
+early_param("exec.seal_system_mappings", early_seal_system_mappings_override);
+
+unsigned long seal_system_mappings(void)
+{
+ if (seal_system_mappings_v == SEAL_SYSTEM_MAPPINGS_ENABLED)
+ return VM_SEALED;
+
+ return 0;
+}
diff --git a/security/Kconfig b/security/Kconfig
index 28e685f53bd1..5bbb8d989d79 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -51,6 +51,30 @@ config PROC_MEM_NO_FORCE
endchoice
+config SEAL_SYSTEM_MAPPINGS
+ bool "seal system mappings"
+ default n
+ depends on 64BIT
+ depends on ARCH_HAS_SEAL_SYSTEM_MAPPINGS
+ depends on !CHECKPOINT_RESTORE
+ help
+ Seal system mappings such as vdso, vvar, sigpage, vsyscall, uprobes.
+
+ A 64-bit kernel is required for the memory sealing feature.
+ No specific hardware features from the CPU are needed.
+
+ Depends on the ARCH_HAS_SEAL_SYSTEM_MAPPINGS.
+
+ CHECKPOINT_RESTORE might relocate vdso mapping during restore,
+ and remap/unmap will fail when the mapping is sealed, therefore
+ !CHECKPOINT_RESTORE is added as dependency.
+
+ Kernel command line exec.seal_system_mappings=(no/yes) overrides
+ this.
+
+ For complete descriptions of memory sealing, please see
+ Documentation/userspace-api/mseal.rst
+
config SECURITY
bool "Enable different security models"
depends on SYSFS
--
2.47.0.338.g60cca15819-goog
* Re: [PATCH v4 1/1] exec: seal system mappings
From: Matthew Wilcox @ 2024-11-25 20:40 UTC
To: jeffxu
Cc: akpm, keescook, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe
On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@chromium.org wrote:
> +/*
> + * Kernel cmdline override for CONFIG_SEAL_SYSTEM_MAPPINGS
> + */
> +enum seal_system_mappings_type {
> + SEAL_SYSTEM_MAPPINGS_DISABLED,
> + SEAL_SYSTEM_MAPPINGS_ENABLED
> +};
> +
> +static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
> + IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
> + SEAL_SYSTEM_MAPPINGS_DISABLED;
> +
> +static const struct constant_table value_table_sys_mapping[] __initconst = {
> + { "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
> + { "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
> + { }
> +};
> +
> +static int __init early_seal_system_mappings_override(char *buf)
> +{
> + if (!buf)
> + return -EINVAL;
> +
> + seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
> + buf, seal_system_mappings_v);
> + return 0;
> +}
> +
> +early_param("exec.seal_system_mappings", early_seal_system_mappings_override);
Are you paid by the line? This all seems ridiculously overcomplicated.
Look at (first example I found) kgdbwait:
static int __init opt_kgdb_wait(char *str)
{
kgdb_break_asap = 1;
kdb_init(KDB_INIT_EARLY);
if (kgdb_io_module_registered &&
IS_ENABLED(CONFIG_ARCH_HAS_EARLY_DEBUG))
kgdb_initial_breakpoint();
return 0;
}
early_param("kgdbwait", opt_kgdb_wait);
I don't understand why you've created a new 'exec' namespace, and why
this feature fits in 'exec'. That seems like an implementation detail.
I'd lose the "exec." prefix.
* Re: [PATCH v4 0/1] Seal system mappings
From: Lorenzo Stoakes @ 2024-11-26 16:39 UTC
To: jeffxu
Cc: akpm, keescook, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Vlastimil Babka
+Vlastimil
Jeff... :)
Please review
https://www.kernel.org/doc/html/latest/process/submitting-patches.html
You didn't cc the maintainers of the code you are changing. And you reference my
name without cc'ing me here. I'm sure there's some relevant Taylor Swift
lyric...
* Re: [PATCH v4 1/1] exec: seal system mappings
From: Jeff Xu @ 2024-12-02 17:22 UTC
To: Matthew Wilcox
Cc: akpm, keescook, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe
On Mon, Nov 25, 2024 at 12:40 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@chromium.org wrote:
> > +/*
> > + * Kernel cmdline override for CONFIG_SEAL_SYSTEM_MAPPINGS
> > + */
> > +enum seal_system_mappings_type {
> > + SEAL_SYSTEM_MAPPINGS_DISABLED,
> > + SEAL_SYSTEM_MAPPINGS_ENABLED
> > +};
> > +
> > +static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
> > + IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
> > + SEAL_SYSTEM_MAPPINGS_DISABLED;
> > +
> > +static const struct constant_table value_table_sys_mapping[] __initconst = {
> > + { "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
> > + { "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
> > + { }
> > +};
> > +
> > +static int __init early_seal_system_mappings_override(char *buf)
> > +{
> > + if (!buf)
> > + return -EINVAL;
> > +
> > + seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
> > + buf, seal_system_mappings_v);
> > + return 0;
> > +}
> > +
> > +early_param("exec.seal_system_mappings", early_seal_system_mappings_override);
>
> Are you paid by the line?
> This all seems ridiculously overcomplicated.
> Look at (first example I found) kgdbwait:
>
The example you provided doesn't seem to support the kernel cmd-line?
> static int __init opt_kgdb_wait(char *str)
> {
> kgdb_break_asap = 1;
>
> kdb_init(KDB_INIT_EARLY);
> if (kgdb_io_module_registered &&
> IS_ENABLED(CONFIG_ARCH_HAS_EARLY_DEBUG))
> kgdb_initial_breakpoint();
>
> return 0;
> }
> early_param("kgdbwait", opt_kgdb_wait);
>
There is an existing pattern for supporting a kernel command line
parameter plus a Kconfig option, which I followed [1].
IMO, it fits this use case really well; if you have a better
example, I'm happy to look.
[1] https://lore.kernel.org/lkml/20240802080225.89408-1-adrian.ratiu@collabora.com/
> I don't understand why you've created a new 'exec' namespace, and why
> this feature fits in 'exec'. That seems like an implementation detail.
> I'd lose the "exec." prefix.
I would prefer some prefix to group these types of features.
vdso/vvar are sealed during the execve() call, so I chose "exec".
The next work I'm planning is sealing the NX stack; it would use
the same prefix.
If "exec" is not an intuitive prefix, I'm also happy with a "process." prefix.
Thanks for reviewing
-Jeff
* Re: [PATCH v4 0/1] Seal system mappings
From: Jeff Xu @ 2024-12-02 17:28 UTC
To: Lorenzo Stoakes
Cc: akpm, keescook, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Vlastimil Babka
On Tue, Nov 26, 2024 at 8:40 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> +Vlastimil
>
> Jeff... :)
>
> Please review
> https://www.kernel.org/doc/html/latest/process/submitting-patches.html
>
> You didn't cc the maintainers of the code you are changing. And you reference my
> name without cc'ing me here. I'm sure there's some relevant Taylor Swift
> lyric...
>
I apologize, this shouldn't happen again.
Thanks for reminding me
-Jeff
* Re: [PATCH v4 1/1] exec: seal system mappings
From: Lorenzo Stoakes @ 2024-12-02 17:57 UTC
To: Jeff Xu
Cc: Matthew Wilcox, akpm, keescook, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem, hch,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb,
Liam.Howlett, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
groeck, mpe
On Mon, Dec 02, 2024 at 09:22:33AM -0800, Jeff Xu wrote:
> On Mon, Nov 25, 2024 at 12:40 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@chromium.org wrote:
> > > +/*
> > > + * Kernel cmdline override for CONFIG_SEAL_SYSTEM_MAPPINGS
> > > + */
> > > +enum seal_system_mappings_type {
> > > + SEAL_SYSTEM_MAPPINGS_DISABLED,
> > > + SEAL_SYSTEM_MAPPINGS_ENABLED
> > > +};
> > > +
> > > +static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
> > > + IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
> > > + SEAL_SYSTEM_MAPPINGS_DISABLED;
> > > +
> > > +static const struct constant_table value_table_sys_mapping[] __initconst = {
> > > + { "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
> > > + { "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
> > > + { }
> > > +};
> > > +
> > > +static int __init early_seal_system_mappings_override(char *buf)
> > > +{
> > > + if (!buf)
> > > + return -EINVAL;
> > > +
> > > + seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
> > > + buf, seal_system_mappings_v);
> > > + return 0;
> > > +}
> > > +
> > > +early_param("exec.seal_system_mappings", early_seal_system_mappings_override);
> >
> > Are you paid by the line?
> > This all seems ridiculously overcomplicated.
> > Look at (first example I found) kgdbwait:
> >
> The example you provided doesn't seem to support the kernel cmd-line ?
>
> > static int __init opt_kgdb_wait(char *str)
> > {
> > kgdb_break_asap = 1;
> >
> > kdb_init(KDB_INIT_EARLY);
> > if (kgdb_io_module_registered &&
> > IS_ENABLED(CONFIG_ARCH_HAS_EARLY_DEBUG))
> > kgdb_initial_breakpoint();
> >
> > return 0;
> > }
> > early_param("kgdbwait", opt_kgdb_wait);
> >
> There is an existing pattern of supporting kernel cmd line + KCONFIG
> which I followed [1],
> IMO, this fits this user-case really well, if you have a better
> example, I'm happy to look.
>
> [1] https://lore.kernel.org/lkml/20240802080225.89408-1-adrian.ratiu@collabora.com/
>
> > I don't understand why you've created a new 'exec' namespace, and why
> > this feature fits in 'exec'. That seems like an implementation detail.
> > I'd lose the "exec." prefix.
>
> I would prefer some prefix to group these types of features.
> vdso/vvar are sealed during the execve() call, so I choose "exec".
> The next work I'm planning is sealing the NX stack, it would start
> with the same prefix.
>
> If exec is not an intuitive prefix, I'm also happy with "process." prefix.
If we HAVE to have a prefix, I'd prefer "mseal.". 'Seal' is horribly
overloaded and I'd prefer to group these operations together.
>
> Thanks for reviewing
>
> -Jeff
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-11-25 20:20 ` [PATCH v4 1/1] exec: seal " jeffxu
2024-11-25 20:40 ` Matthew Wilcox
@ 2024-12-02 18:29 ` Lorenzo Stoakes
2024-12-02 20:38 ` Jeff Xu
2024-12-04 14:04 ` Benjamin Berg
` (2 subsequent siblings)
4 siblings, 1 reply; 62+ messages in thread
From: Lorenzo Stoakes @ 2024-12-02 18:29 UTC (permalink / raw)
To: jeffxu
Cc: akpm, keescook, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe
On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
>
> Seal vdso, vvar, sigpage, uprobes and vsyscall.
>
> Those mappings are readonly or executable only, sealing can protect
> them from ever changing or unmapped during the life time of the process.
> For complete descriptions of memory sealing, please see mseal.rst [1].
>
> System mappings such as vdso, vvar, and sigpage (for arm) are
> generated by the kernel during program initialization, and are
> sealed after creation.
>
> Unlike the aforementioned mappings, the uprobe mapping is not
> established during program startup. However, its lifetime is the same
> as the process's lifetime [2]. It is sealed from creation.
>
> The vdso, vvar, sigpage, and uprobe mappings all invoke the
> _install_special_mapping() function. As no other mappings utilize this
> function, it is logical to incorporate sealing logic within
> _install_special_mapping(). This approach avoids the necessity of
> modifying code across various architecture-specific implementations.
>
> The vsyscall mapping, which has its own initialization function, is
> sealed in the XONLY case, it seems to be the most common and secure
> case of using vsyscall.
>
> It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
> alter the mapping of vdso, vvar, and sigpage during restore
> operations. Consequently, this feature cannot be universally enabled
> across all systems.
>
> Currently, memory sealing is only functional in a 64-bit kernel
> configuration.
>
> To enable this feature, the architecture needs to be tested to
> confirm that it doesn't unmap/remap system mappings during the
> the life time of the process. After the architecture enables
> ARCH_HAS_SEAL_SYSTEM_MAPPINGS, a distribution can set
> CONFIG_SEAL_SYSTEM_MAPPING to manage access to the feature.
> Alternatively, kernel command line (exec.seal_system_mappings)
> enables this feature also.
>
> This feature is tested using ChromeOS and Android on X86_64 and ARM64,
> therefore ARCH_HAS_SEAL_SYSTEM_MAPPINGS is set for X86_64 and ARM64.
> Other architectures can enable this after testing. No specific hardware
> features from the CPU are needed.
>
> This feature's security enhancements will benefit ChromeOS, Android,
> and other secure-by-default systems.
>
> [1] Documentation/userspace-api/mseal.rst
> [2] https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> ---
> .../admin-guide/kernel-parameters.txt | 11 ++++++
> Documentation/userspace-api/mseal.rst | 4 ++
> arch/arm64/Kconfig | 1 +
> arch/x86/Kconfig | 1 +
> arch/x86/entry/vsyscall/vsyscall_64.c | 8 +++-
> include/linux/mm.h | 12 ++++++
> init/Kconfig | 25 ++++++++++++
> mm/mmap.c | 10 +++++
> mm/mseal.c | 39 +++++++++++++++++++
> security/Kconfig | 24 ++++++++++++
> 10 files changed, 133 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index e7bfe1bde49e..f63268341739 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1538,6 +1538,17 @@
> Permit 'security.evm' to be updated regardless of
> current integrity status.
>
> + exec.seal_system_mappings = [KNL]
> + Format: { no | yes }
> + Seal system mappings: vdso, vvar, sigpage, vsyscall,
> + uprobe.
> + - 'no': do not seal system mappings.
> + - 'yes': seal system mappings.
> + This overrides CONFIG_SEAL_SYSTEM_MAPPINGS=(y/n)
> + If not specified or invalid, default is the value set by
> + CONFIG_SEAL_SYSTEM_MAPPINGS.
> + This option has no effect if CONFIG_64BIT=n
> +
> early_page_ext [KNL,EARLY] Enforces page_ext initialization to earlier
> stages so cover more early boot allocations.
> Please note that as side effect some optimizations
> diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
> index 41102f74c5e2..bec122318a59 100644
> --- a/Documentation/userspace-api/mseal.rst
> +++ b/Documentation/userspace-api/mseal.rst
> @@ -130,6 +130,10 @@ Use cases
>
> - Chrome browser: protect some security sensitive data structures.
>
> +- seal system mappings:
> + kernel config CONFIG_SEAL_SYSTEM_MAPPINGS seals system mappings such
> + as vdso, vvar, sigpage, uprobes and vsyscall.
> +
> When not to use mseal
> =====================
> Applications can apply sealing to any virtual memory region from userspace,
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 63de71544d95..fc5da8f74342 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -44,6 +44,7 @@ config ARM64
> select ARCH_HAS_SETUP_DMA_OPS
> select ARCH_HAS_SET_DIRECT_MAP
> select ARCH_HAS_SET_MEMORY
> + select ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> select ARCH_STACKWALK
> select ARCH_HAS_STRICT_KERNEL_RWX
> select ARCH_HAS_STRICT_MODULE_RWX
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 1ea18662942c..5f6bac99974c 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -26,6 +26,7 @@ config X86_64
> depends on 64BIT
> # Options that are inherently 64-bit kernel only:
> select ARCH_HAS_GIGANTIC_PAGE
> + select ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> select ARCH_SUPPORTS_PER_VMA_LOCK
> select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
> diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
> index 2fb7d53cf333..30e0958915ca 100644
> --- a/arch/x86/entry/vsyscall/vsyscall_64.c
> +++ b/arch/x86/entry/vsyscall/vsyscall_64.c
> @@ -366,8 +366,12 @@ void __init map_vsyscall(void)
> set_vsyscall_pgtable_user_bits(swapper_pg_dir);
> }
>
> - if (vsyscall_mode == XONLY)
> - vm_flags_init(&gate_vma, VM_EXEC);
> + if (vsyscall_mode == XONLY) {
> + unsigned long vm_flags = VM_EXEC;
> +
> + vm_flags |= seal_system_mappings();
> + vm_flags_init(&gate_vma, vm_flags);
> + }
>
> BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
> (unsigned long)VSYSCALL_ADDR);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index df0a5eac66b7..f787d6c85cbb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4238,4 +4238,16 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
> int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
> int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
>
> +#ifdef CONFIG_64BIT
> +/*
> + * return VM_SEALED if seal system mapping is enabled.
> + */
> +unsigned long seal_system_mappings(void);
> +#else
> +static inline unsigned long seal_system_mappings(void)
> +{
> + return 0;
> +}
OK so we can set seal system mappings on a 32-bit system and
silently... just not do it?...
> +#endif
> +
> #endif /* _LINUX_MM_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index 1aa95a5dfff8..614719259aa0 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1860,6 +1860,31 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
> config ARCH_HAS_MEMBARRIER_SYNC_CORE
> bool
>
> +config ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> + bool
> + help
> + Control SEAL_SYSTEM_MAPPINGS access based on architecture.
> +
> + A 64-bit kernel is required for the memory sealing feature.
> + No specific hardware features from the CPU are needed.
> +
> + To enable this feature, the architecture needs to be tested to
> + confirm that it doesn't unmap/remap system mappings during the
> + the life time of the process. After the architecture enables this,
> + a distribution can set CONFIG_SEAL_SYSTEM_MAPPING to manage access
> + to the feature.
> +
> + The CONFIG_SEAL_SYSTEM_MAPPINGS already checks the CHECKPOINT_RESTORE
> + feature, which is known to remap/unmap vdso. Thus, the presence of
> + CHECKPOINT_RESTORE is not considered a factor in enabling
> + ARCH_HAS_SEAL_SYSTEM_MAPPINGS for a architecture.
> +
> + For complete list of system mappings, please see
> + CONFIG_SEAL_SYSTEM_MAPPINGS.
> +
> + For complete descriptions of memory sealing, please see
> + Documentation/userspace-api/mseal.rst
> +
> config HAVE_PERF_EVENTS
> bool
> help
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 57fd5ab2abe7..bc694c555805 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2133,6 +2133,16 @@ struct vm_area_struct *_install_special_mapping(
> unsigned long addr, unsigned long len,
> unsigned long vm_flags, const struct vm_special_mapping *spec)
> {
> + /*
> + * At present, all mappings (vdso, vvar, sigpage, and uprobe) that
> + * invoke the _install_special_mapping function can be sealed.
> + * Therefore, it is logical to call the seal_system_mappings_enabled()
> + * function here. In the future, if this is not the case, i.e. if certain
> + * mappings cannot be sealed, then it would be necessary to move this
> + * check to the calling function.
> + */
> + vm_flags |= seal_system_mappings();
> +
> return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec,
> &special_mapping_vmops);
> }
> diff --git a/mm/mseal.c b/mm/mseal.c
> index ece977bd21e1..80126d6231bb 100644
> --- a/mm/mseal.c
> +++ b/mm/mseal.c
> @@ -7,6 +7,7 @@
> * Author: Jeff Xu <jeffxu@chromium.org>
> */
>
> +#include <linux/fs_parser.h>
> #include <linux/mempolicy.h>
> #include <linux/mman.h>
> #include <linux/mm.h>
> @@ -266,3 +267,41 @@ SYSCALL_DEFINE3(mseal, unsigned long, start, size_t, len, unsigned long,
> {
> return do_mseal(start, len, flags);
> }
> +
> +/*
> + * Kernel cmdline override for CONFIG_SEAL_SYSTEM_MAPPINGS
> + */
> +enum seal_system_mappings_type {
> + SEAL_SYSTEM_MAPPINGS_DISABLED,
> + SEAL_SYSTEM_MAPPINGS_ENABLED
> +};
> +
> +static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
> + IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
> + SEAL_SYSTEM_MAPPINGS_DISABLED;
> +
> +static const struct constant_table value_table_sys_mapping[] __initconst = {
> + { "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
> + { "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
> + { }
> +};
> +
> +static int __init early_seal_system_mappings_override(char *buf)
> +{
> + if (!buf)
> + return -EINVAL;
> +
> + seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
> + buf, seal_system_mappings_v);
> + return 0;
> +}
> +
> +early_param("exec.seal_system_mappings", early_seal_system_mappings_override);
> +
> +unsigned long seal_system_mappings(void)
> +{
> + if (seal_system_mappings_v == SEAL_SYSTEM_MAPPINGS_ENABLED)
> + return VM_SEALED;
> +
> + return 0;
> +}
> diff --git a/security/Kconfig b/security/Kconfig
> index 28e685f53bd1..5bbb8d989d79 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -51,6 +51,30 @@ config PROC_MEM_NO_FORCE
>
> endchoice
>
> +config SEAL_SYSTEM_MAPPINGS
> + bool "seal system mappings"
I'd prefer an 'mseal' here please, it's becoming hard to grep for this
stuff. We overload 'seal' too much and I want to be able to identify what
is a memfd seal and what is an mseal or whatever else...
> + default n
> + depends on 64BIT
> + depends on ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> + depends on !CHECKPOINT_RESTORE
I don't know why we bother setting restrictions on this but allow them to
be overridden with a boot flag?
This means somebody with CRIU enabled could enable this and have a broken
kernel right? We can't allow that.
I'd much prefer we either:
1. Just have a CONFIG_MSEAL_SYSTEM_MAPPINGS flag. _or_
2. Have CONFIG_MSEAL_SYSTEM_MAPPINGS enable, allow kernel flag to disable.
In both cases you #ifdef on CONFIG_MSEAL_SYSTEM_MAPPINGS, and the
restrictions apply correctly.
If in the future we decide this feature is stable and ready and good to
enable globally we can just change the default on this to y at some later
date?
Otherwise it just seems like, in effect, the kernel command line flag is a
debug flag to experiment on arbitrary kernels?
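To make option 2 concrete, here is a rough userspace model of the
intended semantics. It is not kernel code: model_seal_flags() and
MODEL_VM_SEALED are hypothetical names made up for illustration, and
config_enabled merely stands in for the Kconfig check. The point is only
that the command line can switch sealing off, never on.

```c
#include <stdbool.h>
#include <string.h>

/* Stand-in for VM_SEALED; the real bit value is defined by the kernel. */
#define MODEL_VM_SEALED 0x1UL

/*
 * Hypothetical model of "Kconfig enables, command line may only disable".
 * config_enabled: stands in for the CONFIG_MSEAL_SYSTEM_MAPPINGS check.
 * cmdline_value:  value given on the command line, or NULL if absent.
 */
static unsigned long model_seal_flags(bool config_enabled,
				      const char *cmdline_value)
{
	/* Without the Kconfig option, the Kconfig restrictions win. */
	if (!config_enabled)
		return 0;

	/* The command line can switch sealing off, never on. */
	if (cmdline_value && strcmp(cmdline_value, "no") == 0)
		return 0;

	return MODEL_VM_SEALED;
}
```

With this shape, a kernel built with CHECKPOINT_RESTORE (and so without
the sealing option) cannot end up with sealed mappings no matter what is
passed at boot.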
> + help
> + Seal system mappings such as vdso, vvar, sigpage, vsyscall, uprobes.
> +
> + A 64-bit kernel is required for the memory sealing feature.
> + No specific hardware features from the CPU are needed.
> +
> + Depends on the ARCH_HAS_SEAL_SYSTEM_MAPPINGS.
> +
> + CHECKPOINT_RESTORE might relocate vdso mapping during restore,
> + and remap/unmap will fail when the mapping is sealed, therefore
> + !CHECKPOINT_RESTORE is added as dependency.
> +
> + Kernel command line exec.seal_system_mappings=(no/yes) overrides
> + this.
> +
> + For complete descriptions of memory sealing, please see
> + Documentation/userspace-api/mseal.rst
> +
> config SECURITY
> bool "Enable different security models"
> depends on SYSFS
> --
> 2.47.0.338.g60cca15819-goog
>
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-12-02 17:22 ` Jeff Xu
2024-12-02 17:57 ` Lorenzo Stoakes
@ 2024-12-02 19:57 ` Jeff Xu
1 sibling, 0 replies; 62+ messages in thread
From: Jeff Xu @ 2024-12-02 19:57 UTC (permalink / raw)
To: Matthew Wilcox
Cc: akpm, keescook, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Vlastimil Babka,
Lorenzo Stoakes
On Mon, Dec 2, 2024 at 9:22 AM Jeff Xu <jeffxu@chromium.org> wrote:
>
> On Mon, Nov 25, 2024 at 12:40 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@chromium.org wrote:
> > > +/*
> > > + * Kernel cmdline override for CONFIG_SEAL_SYSTEM_MAPPINGS
> > > + */
> > > +enum seal_system_mappings_type {
> > > + SEAL_SYSTEM_MAPPINGS_DISABLED,
> > > + SEAL_SYSTEM_MAPPINGS_ENABLED
> > > +};
> > > +
> > > +static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
> > > + IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
> > > + SEAL_SYSTEM_MAPPINGS_DISABLED;
> > > +
> > > +static const struct constant_table value_table_sys_mapping[] __initconst = {
> > > + { "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
> > > + { "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
> > > + { }
> > > +};
> > > +
> > > +static int __init early_seal_system_mappings_override(char *buf)
> > > +{
> > > + if (!buf)
> > > + return -EINVAL;
> > > +
> > > + seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
> > > + buf, seal_system_mappings_v);
> > > + return 0;
> > > +}
> > > +
> > > +early_param("exec.seal_system_mappings", early_seal_system_mappings_override);
> >
> > Are you paid by the line?
> > This all seems ridiculously overcomplicated.
> > Look at (first example I found) kgdbwait:
> >
> The example you provided doesn't seem to support the kernel cmd-line ?
>
> > static int __init opt_kgdb_wait(char *str)
> > {
> > kgdb_break_asap = 1;
> >
> > kdb_init(KDB_INIT_EARLY);
> > if (kgdb_io_module_registered &&
> > IS_ENABLED(CONFIG_ARCH_HAS_EARLY_DEBUG))
> > kgdb_initial_breakpoint();
> >
> > return 0;
> > }
> > early_param("kgdbwait", opt_kgdb_wait);
> >
> There is an existing pattern of supporting kernel cmd line + KCONFIG
> which I followed [1],
> IMO, this fits this user-case really well, if you have a better
> example, I'm happy to look.
>
> [1] https://lore.kernel.org/lkml/20240802080225.89408-1-adrian.ratiu@collabora.com/
>
Sorry, I misunderstood the code. This code also uses the kernel cmd
line; it just doesn't use the keyword=yes/no pattern, and instead
checks for the presence of "keyword" in the kernel cmd line.
The current pattern allows values beyond "yes"/"no", so if we ever need
an extension (e.g. a new system mapping type, or pre-process control), we
have the flexibility to add one.
On second thought, that might be overthinking it. I will switch to this
(simpler) pattern in the next version.
Thanks
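As a rough illustration of the "presence of keyword" style, here is a
small userspace sketch. It is not how the kernel parses its command
line; model_cmdline_has() is a hypothetical helper, and the keyword in
the usage note below is only a placeholder.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if `keyword` appears as a whole
 * space-separated token in `cmdline`. No "=value" part is parsed;
 * the mere presence of the token enables the feature.
 */
static bool model_cmdline_has(const char *cmdline, const char *keyword)
{
	size_t klen = strlen(keyword);
	const char *p = cmdline;

	while (*p) {
		size_t tlen;

		/* Skip separators between tokens. */
		while (*p == ' ')
			p++;

		/* Measure the current token and compare it as a whole. */
		tlen = strcspn(p, " ");
		if (tlen == klen && strncmp(p, keyword, klen) == 0)
			return true;

		p += tlen;
	}
	return false;
}
```

For example, model_cmdline_has("quiet seal_system_mappings ro",
"seal_system_mappings") is true, while a command line without that
token leaves the compiled-in default untouched.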
> > I don't understand why you've created a new 'exec' namespace, and why
> > this feature fits in 'exec'. That seems like an implementation detail.
> > I'd lose the "exec." prefix.
>
> I would prefer some prefix to group these types of features.
> vdso/vvar are sealed during the execve() call, so I choose "exec".
> The next work I'm planning is sealing the NX stack, it would start
> with the same prefix.
>
> If exec is not an intuitive prefix, I'm also happy with "process." prefix.
>
> Thanks for reviewing
>
> -Jeff
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-12-02 17:57 ` Lorenzo Stoakes
@ 2024-12-02 20:05 ` Jeff Xu
0 siblings, 0 replies; 62+ messages in thread
From: Jeff Xu @ 2024-12-02 20:05 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Matthew Wilcox, akpm, keescook, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem, hch,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb,
Liam.Howlett, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
groeck, mpe, Vlastimil Babka
On Mon, Dec 2, 2024 at 9:57 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Dec 02, 2024 at 09:22:33AM -0800, Jeff Xu wrote:
> > On Mon, Nov 25, 2024 at 12:40 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@chromium.org wrote:
> > > > +/*
> > > > + * Kernel cmdline override for CONFIG_SEAL_SYSTEM_MAPPINGS
> > > > + */
> > > > +enum seal_system_mappings_type {
> > > > + SEAL_SYSTEM_MAPPINGS_DISABLED,
> > > > + SEAL_SYSTEM_MAPPINGS_ENABLED
> > > > +};
> > > > +
> > > > +static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
> > > > + IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
> > > > + SEAL_SYSTEM_MAPPINGS_DISABLED;
> > > > +
> > > > +static const struct constant_table value_table_sys_mapping[] __initconst = {
> > > > + { "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
> > > > + { "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
> > > > + { }
> > > > +};
> > > > +
> > > > +static int __init early_seal_system_mappings_override(char *buf)
> > > > +{
> > > > + if (!buf)
> > > > + return -EINVAL;
> > > > +
> > > > + seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
> > > > + buf, seal_system_mappings_v);
> > > > + return 0;
> > > > +}
> > > > +
> > > > +early_param("exec.seal_system_mappings", early_seal_system_mappings_override);
> > >
> > > Are you paid by the line?
> > > This all seems ridiculously overcomplicated.
> > > Look at (first example I found) kgdbwait:
> > >
> > The example you provided doesn't seem to support the kernel cmd-line ?
> >
> > > static int __init opt_kgdb_wait(char *str)
> > > {
> > > kgdb_break_asap = 1;
> > >
> > > kdb_init(KDB_INIT_EARLY);
> > > if (kgdb_io_module_registered &&
> > > IS_ENABLED(CONFIG_ARCH_HAS_EARLY_DEBUG))
> > > kgdb_initial_breakpoint();
> > >
> > > return 0;
> > > }
> > > early_param("kgdbwait", opt_kgdb_wait);
> > >
> > There is an existing pattern of supporting kernel cmd line + KCONFIG
> > which I followed [1],
> > IMO, this fits this user-case really well, if you have a better
> > example, I'm happy to look.
> >
> > [1] https://lore.kernel.org/lkml/20240802080225.89408-1-adrian.ratiu@collabora.com/
> >
> > > I don't understand why you've created a new 'exec' namespace, and why
> > > this feature fits in 'exec'. That seems like an implementation detail.
> > > I'd lose the "exec." prefix.
> >
> > I would prefer some prefix to group these types of features.
> > vdso/vvar are sealed during the execve() call, so I choose "exec".
> > The next work I'm planning is sealing the NX stack, it would start
> > with the same prefix.
> >
> > If exec is not an intuitive prefix, I'm also happy with "process." prefix.
>
> If we HAVE to have a prefix, I'd prefer "mseal.". 'Seal' is horribly
> overloaded and I'd prefer to group these operations together.
>
mseal.seal_system_mappings seems to contain duplicate info.
If the norm is against prefixes in the kernel cmd line, I will drop the
prefix and use mseal_system_mappings.
> >
> > Thanks for reviewing
> >
> > -Jeff
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-12-02 18:29 ` Lorenzo Stoakes
@ 2024-12-02 20:38 ` Jeff Xu
2024-12-03 7:35 ` Lorenzo Stoakes
0 siblings, 1 reply; 62+ messages in thread
From: Jeff Xu @ 2024-12-02 20:38 UTC (permalink / raw)
To: Lorenzo Stoakes, Vlastimil Babka
Cc: akpm, keescook, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe
On Mon, Dec 2, 2024 at 10:29 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@chromium.org wrote:
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > Seal vdso, vvar, sigpage, uprobes and vsyscall.
> >
> > Those mappings are readonly or executable only, sealing can protect
> > them from ever changing or unmapped during the life time of the process.
> > For complete descriptions of memory sealing, please see mseal.rst [1].
> >
> > System mappings such as vdso, vvar, and sigpage (for arm) are
> > generated by the kernel during program initialization, and are
> > sealed after creation.
> >
> > Unlike the aforementioned mappings, the uprobe mapping is not
> > established during program startup. However, its lifetime is the same
> > as the process's lifetime [2]. It is sealed from creation.
> >
> > The vdso, vvar, sigpage, and uprobe mappings all invoke the
> > _install_special_mapping() function. As no other mappings utilize this
> > function, it is logical to incorporate sealing logic within
> > _install_special_mapping(). This approach avoids the necessity of
> > modifying code across various architecture-specific implementations.
> >
> > The vsyscall mapping, which has its own initialization function, is
> > sealed in the XONLY case, it seems to be the most common and secure
> > case of using vsyscall.
> >
> > It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
> > alter the mapping of vdso, vvar, and sigpage during restore
> > operations. Consequently, this feature cannot be universally enabled
> > across all systems.
> >
> > Currently, memory sealing is only functional in a 64-bit kernel
> > configuration.
> >
> > To enable this feature, the architecture needs to be tested to
> > confirm that it doesn't unmap/remap system mappings during the
> > the life time of the process. After the architecture enables
> > ARCH_HAS_SEAL_SYSTEM_MAPPINGS, a distribution can set
> > CONFIG_SEAL_SYSTEM_MAPPING to manage access to the feature.
> > Alternatively, kernel command line (exec.seal_system_mappings)
> > enables this feature also.
> >
> > This feature is tested using ChromeOS and Android on X86_64 and ARM64,
> > therefore ARCH_HAS_SEAL_SYSTEM_MAPPINGS is set for X86_64 and ARM64.
> > Other architectures can enable this after testing. No specific hardware
> > features from the CPU are needed.
> >
> > This feature's security enhancements will benefit ChromeOS, Android,
> > and other secure-by-default systems.
> >
> > [1] Documentation/userspace-api/mseal.rst
> > [2] https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/
> > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > ---
> > .../admin-guide/kernel-parameters.txt | 11 ++++++
> > Documentation/userspace-api/mseal.rst | 4 ++
> > arch/arm64/Kconfig | 1 +
> > arch/x86/Kconfig | 1 +
> > arch/x86/entry/vsyscall/vsyscall_64.c | 8 +++-
> > include/linux/mm.h | 12 ++++++
> > init/Kconfig | 25 ++++++++++++
> > mm/mmap.c | 10 +++++
> > mm/mseal.c | 39 +++++++++++++++++++
> > security/Kconfig | 24 ++++++++++++
> > 10 files changed, 133 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index e7bfe1bde49e..f63268341739 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -1538,6 +1538,17 @@
> > Permit 'security.evm' to be updated regardless of
> > current integrity status.
> >
> > + exec.seal_system_mappings = [KNL]
> > + Format: { no | yes }
> > + Seal system mappings: vdso, vvar, sigpage, vsyscall,
> > + uprobe.
> > + - 'no': do not seal system mappings.
> > + - 'yes': seal system mappings.
> > + This overrides CONFIG_SEAL_SYSTEM_MAPPINGS=(y/n)
> > + If not specified or invalid, default is the value set by
> > + CONFIG_SEAL_SYSTEM_MAPPINGS.
> > + This option has no effect if CONFIG_64BIT=n
> > +
> > early_page_ext [KNL,EARLY] Enforces page_ext initialization to earlier
> > stages so cover more early boot allocations.
> > Please note that as side effect some optimizations
> > diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
> > index 41102f74c5e2..bec122318a59 100644
> > --- a/Documentation/userspace-api/mseal.rst
> > +++ b/Documentation/userspace-api/mseal.rst
> > @@ -130,6 +130,10 @@ Use cases
> >
> > - Chrome browser: protect some security sensitive data structures.
> >
> > +- seal system mappings:
> > + kernel config CONFIG_SEAL_SYSTEM_MAPPINGS seals system mappings such
> > + as vdso, vvar, sigpage, uprobes and vsyscall.
> > +
> > When not to use mseal
> > =====================
> > Applications can apply sealing to any virtual memory region from userspace,
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index 63de71544d95..fc5da8f74342 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -44,6 +44,7 @@ config ARM64
> > select ARCH_HAS_SETUP_DMA_OPS
> > select ARCH_HAS_SET_DIRECT_MAP
> > select ARCH_HAS_SET_MEMORY
> > + select ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > select ARCH_STACKWALK
> > select ARCH_HAS_STRICT_KERNEL_RWX
> > select ARCH_HAS_STRICT_MODULE_RWX
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 1ea18662942c..5f6bac99974c 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -26,6 +26,7 @@ config X86_64
> > depends on 64BIT
> > # Options that are inherently 64-bit kernel only:
> > select ARCH_HAS_GIGANTIC_PAGE
> > + select ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> > select ARCH_SUPPORTS_PER_VMA_LOCK
> > select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
> > diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
> > index 2fb7d53cf333..30e0958915ca 100644
> > --- a/arch/x86/entry/vsyscall/vsyscall_64.c
> > +++ b/arch/x86/entry/vsyscall/vsyscall_64.c
> > @@ -366,8 +366,12 @@ void __init map_vsyscall(void)
> > set_vsyscall_pgtable_user_bits(swapper_pg_dir);
> > }
> >
> > - if (vsyscall_mode == XONLY)
> > - vm_flags_init(&gate_vma, VM_EXEC);
> > + if (vsyscall_mode == XONLY) {
> > + unsigned long vm_flags = VM_EXEC;
> > +
> > + vm_flags |= seal_system_mappings();
> > + vm_flags_init(&gate_vma, vm_flags);
> > + }
> >
> > BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
> > (unsigned long)VSYSCALL_ADDR);
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index df0a5eac66b7..f787d6c85cbb 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4238,4 +4238,16 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
> > int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
> > int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
> >
> > +#ifdef CONFIG_64BIT
> > +/*
> > + * return VM_SEALED if seal system mapping is enabled.
> > + */
> > +unsigned long seal_system_mappings(void);
> > +#else
> > +static inline unsigned long seal_system_mappings(void)
> > +{
> > + return 0;
> > +}
>
> OK so we can set seal system mappings on a 32-bit system and
> silently... just not do it?...
>
I don't understand what you meant.
The function returns the vm_flags needed to seal system mappings.
On 32-bit, it returns 0.
The caller (in mmap.c) does the following:
vm_flags |= seal_system_mappings();
(This pattern was recommended by Liam.)
Is it that the function name is misleading? I can change it to
seal_flags_system_mappings() if there is no objection to the long
name.
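For illustration only, a userspace sketch of that pattern. All MODEL_*
names and the model_* functions are hypothetical, and the flag values
are arbitrary rather than the kernel's. The helper returns either the
seal bit or 0, and the caller ORs the result in unconditionally:

```c
/* Illustrative flag values only; the kernel defines the real ones. */
#define MODEL_VM_EXEC   0x4UL
#define MODEL_VM_SEALED 0x8UL

/* Set to 0 to model a build where sealing is unavailable (e.g. 32-bit). */
#ifndef MODEL_SEALING_AVAILABLE
#define MODEL_SEALING_AVAILABLE 1
#endif

#if MODEL_SEALING_AVAILABLE
/* Stands in for the boot-time setting. */
static int model_sealing_enabled = 1;

/* Return the seal bit when sealing is enabled, otherwise 0. */
static unsigned long model_seal_system_mappings(void)
{
	return model_sealing_enabled ? MODEL_VM_SEALED : 0;
}
#else
/* Stub: contributes no flag bits, so callers need no #ifdef. */
static inline unsigned long model_seal_system_mappings(void)
{
	return 0;
}
#endif

/* Caller side: OR in whatever the helper returns. */
static unsigned long model_install_flags(unsigned long vm_flags)
{
	return vm_flags | model_seal_system_mappings();
}
```

The caller is identical in both builds; in the stub build the OR is a
no-op, which is also why the option has no visible effect there.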
> > +#endif
> > +
> > #endif /* _LINUX_MM_H */
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 1aa95a5dfff8..614719259aa0 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -1860,6 +1860,31 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
> > config ARCH_HAS_MEMBARRIER_SYNC_CORE
> > bool
> >
> > +config ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > + bool
> > + help
> > + Control SEAL_SYSTEM_MAPPINGS access based on architecture.
> > +
> > + A 64-bit kernel is required for the memory sealing feature.
> > + No specific hardware features from the CPU are needed.
> > +
> > + To enable this feature, the architecture needs to be tested to
> > + confirm that it doesn't unmap/remap system mappings during the
> > + the life time of the process. After the architecture enables this,
> > + a distribution can set CONFIG_SEAL_SYSTEM_MAPPING to manage access
> > + to the feature.
> > +
> > + The CONFIG_SEAL_SYSTEM_MAPPINGS already checks the CHECKPOINT_RESTORE
> > + feature, which is known to remap/unmap vdso. Thus, the presence of
> > + CHECKPOINT_RESTORE is not considered a factor in enabling
> > + ARCH_HAS_SEAL_SYSTEM_MAPPINGS for a architecture.
> > +
> > + For complete list of system mappings, please see
> > + CONFIG_SEAL_SYSTEM_MAPPINGS.
> > +
> > + For complete descriptions of memory sealing, please see
> > + Documentation/userspace-api/mseal.rst
> > +
> > config HAVE_PERF_EVENTS
> > bool
> > help
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 57fd5ab2abe7..bc694c555805 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2133,6 +2133,16 @@ struct vm_area_struct *_install_special_mapping(
> > unsigned long addr, unsigned long len,
> > unsigned long vm_flags, const struct vm_special_mapping *spec)
> > {
> > + /*
> > + * At present, all mappings (vdso, vvar, sigpage, and uprobe) that
> > + * invoke the _install_special_mapping function can be sealed.
> > + * Therefore, it is logical to call the seal_system_mappings_enabled()
> > + * function here. In the future, if this is not the case, i.e. if certain
> > + * mappings cannot be sealed, then it would be necessary to move this
> > + * check to the calling function.
> > + */
> > + vm_flags |= seal_system_mappings();
> > +
> > return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec,
> > &special_mapping_vmops);
> > }
> > diff --git a/mm/mseal.c b/mm/mseal.c
> > index ece977bd21e1..80126d6231bb 100644
> > --- a/mm/mseal.c
> > +++ b/mm/mseal.c
> > @@ -7,6 +7,7 @@
> > * Author: Jeff Xu <jeffxu@chromium.org>
> > */
> >
> > +#include <linux/fs_parser.h>
> > #include <linux/mempolicy.h>
> > #include <linux/mman.h>
> > #include <linux/mm.h>
> > @@ -266,3 +267,41 @@ SYSCALL_DEFINE3(mseal, unsigned long, start, size_t, len, unsigned long,
> > {
> > return do_mseal(start, len, flags);
> > }
> > +
> > +/*
> > + * Kernel cmdline override for CONFIG_SEAL_SYSTEM_MAPPINGS
> > + */
> > +enum seal_system_mappings_type {
> > + SEAL_SYSTEM_MAPPINGS_DISABLED,
> > + SEAL_SYSTEM_MAPPINGS_ENABLED
> > +};
> > +
> > +static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
> > + IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
> > + SEAL_SYSTEM_MAPPINGS_DISABLED;
> > +
> > +static const struct constant_table value_table_sys_mapping[] __initconst = {
> > + { "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
> > + { "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
> > + { }
> > +};
> > +
> > +static int __init early_seal_system_mappings_override(char *buf)
> > +{
> > + if (!buf)
> > + return -EINVAL;
> > +
> > + seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
> > + buf, seal_system_mappings_v);
> > + return 0;
> > +}
> > +
> > +early_param("exec.seal_system_mappings", early_seal_system_mappings_override);
> > +
> > +unsigned long seal_system_mappings(void)
> > +{
> > + if (seal_system_mappings_v == SEAL_SYSTEM_MAPPINGS_ENABLED)
> > + return VM_SEALED;
> > +
> > + return 0;
> > +}
> > diff --git a/security/Kconfig b/security/Kconfig
> > index 28e685f53bd1..5bbb8d989d79 100644
> > --- a/security/Kconfig
> > +++ b/security/Kconfig
> > @@ -51,6 +51,30 @@ config PROC_MEM_NO_FORCE
> >
> > endchoice
> >
> > +config SEAL_SYSTEM_MAPPINGS
> > + bool "seal system mappings"
>
> I'd prefer an 'mseal' here please, it's becoming hard to grep for this
> stuff. We overload 'seal' too much and I want to be able to identify what
> is a memfd seal and what is an mseal or whatever else...
>
I'm OK with MSEAL_
> > + default n
> > + depends on 64BIT
> > + depends on ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > + depends on !CHECKPOINT_RESTORE
>
> I don't know why we bother setting restrictions on this but allow them to
> be overridden with a boot flag?
>
The idea is that a distribution might not enable kernel security features
by default, and the kernel cmdline gives users the flexibility to
enable them.
This is the same approach as the proc_mem.force_override kernel cmdline
parameter, which Kees recommended [1]. I would prefer to keep this as is.
[1] https://lore.kernel.org/all/202402261110.B8129C002@keescook/
> This means somebody with CRIU enabled could enable this and have a broken
> kernel right? We can't allow that.
>
> I'd much prefer we either:
>
> 1. Just have a CONFIG_MSEAL_SYSTEM_MAPPINGS flag. _or_
> 2. Have CONFIG_MSEAL_SYSTEM_MAPPINGS enable, allow kernel flag to disable.
>
> In both cases you #ifdef on CONFIG_MSEAL_SYSTEM_MAPPINGS, and the
> restrictions apply correctly.
>
> If in the future we decide this feature is stable and ready and good to
> enable globally we can just change the default on this to y at some later
> date?
>
> Otherwise it just seems like, in effect, the kernel command line flag is a
> debug flag to experiment on arbitrary kernels?
>
> > + help
> > + Seal system mappings such as vdso, vvar, sigpage, vsyscall, uprobes.
> > +
> > + A 64-bit kernel is required for the memory sealing feature.
> > + No specific hardware features from the CPU are needed.
> > +
> > + Depends on the ARCH_HAS_SEAL_SYSTEM_MAPPINGS.
> > +
> > + CHECKPOINT_RESTORE might relocate vdso mapping during restore,
> > + and remap/unmap will fail when the mapping is sealed, therefore
> > + !CHECKPOINT_RESTORE is added as dependency.
> > +
> > + Kernel command line exec.seal_system_mappings=(no/yes) overrides
> > + this.
> > +
> > + For complete descriptions of memory sealing, please see
> > + Documentation/userspace-api/mseal.rst
> > +
> > config SECURITY
> > bool "Enable different security models"
> > depends on SYSFS
> > --
> > 2.47.0.338.g60cca15819-goog
> >
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-12-02 20:38 ` Jeff Xu
@ 2024-12-03 7:35 ` Lorenzo Stoakes
2024-12-03 18:19 ` Jeff Xu
0 siblings, 1 reply; 62+ messages in thread
From: Lorenzo Stoakes @ 2024-12-03 7:35 UTC (permalink / raw)
To: Jeff Xu
Cc: Vlastimil Babka, akpm, keescook, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem, hch,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb,
Liam.Howlett, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
groeck, mpe
On Mon, Dec 02, 2024 at 12:38:27PM -0800, Jeff Xu wrote:
> On Mon, Dec 2, 2024 at 10:29 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@chromium.org wrote:
> > > From: Jeff Xu <jeffxu@chromium.org>
> > >
> > > Seal vdso, vvar, sigpage, uprobes and vsyscall.
> > >
> > > Those mappings are readonly or executable only, sealing can protect
> > > them from ever changing or unmapped during the life time of the process.
> > > For complete descriptions of memory sealing, please see mseal.rst [1].
> > >
> > > System mappings such as vdso, vvar, and sigpage (for arm) are
> > > generated by the kernel during program initialization, and are
> > > sealed after creation.
> > >
> > > Unlike the aforementioned mappings, the uprobe mapping is not
> > > established during program startup. However, its lifetime is the same
> > > as the process's lifetime [2]. It is sealed from creation.
> > >
> > > The vdso, vvar, sigpage, and uprobe mappings all invoke the
> > > _install_special_mapping() function. As no other mappings utilize this
> > > function, it is logical to incorporate sealing logic within
> > > _install_special_mapping(). This approach avoids the necessity of
> > > modifying code across various architecture-specific implementations.
> > >
> > > The vsyscall mapping, which has its own initialization function, is
> > > sealed in the XONLY case, it seems to be the most common and secure
> > > case of using vsyscall.
> > >
> > > It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
> > > alter the mapping of vdso, vvar, and sigpage during restore
> > > operations. Consequently, this feature cannot be universally enabled
> > > across all systems.
> > >
> > > Currently, memory sealing is only functional in a 64-bit kernel
> > > configuration.
> > >
> > > To enable this feature, the architecture needs to be tested to
> > > confirm that it doesn't unmap/remap system mappings during the
> > > the life time of the process. After the architecture enables
> > > ARCH_HAS_SEAL_SYSTEM_MAPPINGS, a distribution can set
> > > CONFIG_SEAL_SYSTEM_MAPPING to manage access to the feature.
> > > Alternatively, kernel command line (exec.seal_system_mappings)
> > > enables this feature also.
> > >
> > > This feature is tested using ChromeOS and Android on X86_64 and ARM64,
> > > therefore ARCH_HAS_SEAL_SYSTEM_MAPPINGS is set for X86_64 and ARM64.
> > > Other architectures can enable this after testing. No specific hardware
> > > features from the CPU are needed.
> > >
> > > This feature's security enhancements will benefit ChromeOS, Android,
> > > and other secure-by-default systems.
> > >
> > > [1] Documentation/userspace-api/mseal.rst
> > > [2] https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/
> > > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > > ---
> > > .../admin-guide/kernel-parameters.txt | 11 ++++++
> > > Documentation/userspace-api/mseal.rst | 4 ++
> > > arch/arm64/Kconfig | 1 +
> > > arch/x86/Kconfig | 1 +
> > > arch/x86/entry/vsyscall/vsyscall_64.c | 8 +++-
> > > include/linux/mm.h | 12 ++++++
> > > init/Kconfig | 25 ++++++++++++
> > > mm/mmap.c | 10 +++++
> > > mm/mseal.c | 39 +++++++++++++++++++
> > > security/Kconfig | 24 ++++++++++++
> > > 10 files changed, 133 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > index e7bfe1bde49e..f63268341739 100644
> > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > @@ -1538,6 +1538,17 @@
> > > Permit 'security.evm' to be updated regardless of
> > > current integrity status.
> > >
> > > + exec.seal_system_mappings = [KNL]
> > > + Format: { no | yes }
> > > + Seal system mappings: vdso, vvar, sigpage, vsyscall,
> > > + uprobe.
> > > + - 'no': do not seal system mappings.
> > > + - 'yes': seal system mappings.
> > > + This overrides CONFIG_SEAL_SYSTEM_MAPPINGS=(y/n)
> > > + If not specified or invalid, default is the value set by
> > > + CONFIG_SEAL_SYSTEM_MAPPINGS.
> > > + This option has no effect if CONFIG_64BIT=n
> > > +
> > > early_page_ext [KNL,EARLY] Enforces page_ext initialization to earlier
> > > stages so cover more early boot allocations.
> > > Please note that as side effect some optimizations
> > > diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
> > > index 41102f74c5e2..bec122318a59 100644
> > > --- a/Documentation/userspace-api/mseal.rst
> > > +++ b/Documentation/userspace-api/mseal.rst
> > > @@ -130,6 +130,10 @@ Use cases
> > >
> > > - Chrome browser: protect some security sensitive data structures.
> > >
> > > +- seal system mappings:
> > > + kernel config CONFIG_SEAL_SYSTEM_MAPPINGS seals system mappings such
> > > + as vdso, vvar, sigpage, uprobes and vsyscall.
> > > +
> > > When not to use mseal
> > > =====================
> > > Applications can apply sealing to any virtual memory region from userspace,
> > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > index 63de71544d95..fc5da8f74342 100644
> > > --- a/arch/arm64/Kconfig
> > > +++ b/arch/arm64/Kconfig
> > > @@ -44,6 +44,7 @@ config ARM64
> > > select ARCH_HAS_SETUP_DMA_OPS
> > > select ARCH_HAS_SET_DIRECT_MAP
> > > select ARCH_HAS_SET_MEMORY
> > > + select ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > > select ARCH_STACKWALK
> > > select ARCH_HAS_STRICT_KERNEL_RWX
> > > select ARCH_HAS_STRICT_MODULE_RWX
> > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > index 1ea18662942c..5f6bac99974c 100644
> > > --- a/arch/x86/Kconfig
> > > +++ b/arch/x86/Kconfig
> > > @@ -26,6 +26,7 @@ config X86_64
> > > depends on 64BIT
> > > # Options that are inherently 64-bit kernel only:
> > > select ARCH_HAS_GIGANTIC_PAGE
> > > + select ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > > select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> > > select ARCH_SUPPORTS_PER_VMA_LOCK
> > > select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
> > > diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
> > > index 2fb7d53cf333..30e0958915ca 100644
> > > --- a/arch/x86/entry/vsyscall/vsyscall_64.c
> > > +++ b/arch/x86/entry/vsyscall/vsyscall_64.c
> > > @@ -366,8 +366,12 @@ void __init map_vsyscall(void)
> > > set_vsyscall_pgtable_user_bits(swapper_pg_dir);
> > > }
> > >
> > > - if (vsyscall_mode == XONLY)
> > > - vm_flags_init(&gate_vma, VM_EXEC);
> > > + if (vsyscall_mode == XONLY) {
> > > + unsigned long vm_flags = VM_EXEC;
> > > +
> > > + vm_flags |= seal_system_mappings();
> > > + vm_flags_init(&gate_vma, vm_flags);
> > > + }
> > >
> > > BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
> > > (unsigned long)VSYSCALL_ADDR);
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index df0a5eac66b7..f787d6c85cbb 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -4238,4 +4238,16 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
> > > int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
> > > int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
> > >
> > > +#ifdef CONFIG_64BIT
> > > +/*
> > > + * return VM_SEALED if seal system mapping is enabled.
> > > + */
> > > +unsigned long seal_system_mappings(void);
> > > +#else
> > > +static inline unsigned long seal_system_mappings(void)
> > > +{
> > > + return 0;
> > > +}
> >
> > OK so we can set seal system mappings on a 32-bit system and
> > silently... just not do it?...
> >
> I don't understand what you mean.
>
> The function returns the vm_flags used to seal system mappings.
> On 32-bit, it returns 0.
>
> The caller (in mmap.c) does the following:
> vm_flags |= seal_system_mappings();
>
> (This pattern was recommended by Liam.)
>
> Is it because the function name is misleading? I can change it to
> seal_flags_system_mappings() if there is no objection to the long
> name.
No, I'm saying that you're making it possible for somebody to enable this
feature on a 32-bit system, and to think it's enabled and that they're
protected when in fact they're not.
Which is, security-wise, I think rather unwise.
Again it's an argument against a cmdline parameter. See below.
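To make the concern concrete, here is a small userspace sketch (not
kernel code). It models the two variants of the helper in the quoted
mm.h hunk with a runtime flag instead of CONFIG_64BIT; the demo_* names
and the DEMO_VM_SEALED value are made up for the example.

```c
#include <stdbool.h>
#include <string.h>

#define DEMO_VM_SEALED 0x8UL	/* illustrative value, not the kernel's */

/* Models "exec.seal_system_mappings=yes" being passed at boot. */
bool demo_requested(const char *arg)
{
	return arg && strcmp(arg, "yes") == 0;
}

/*
 * Models the helper: a real implementation on 64-bit, and a stub that
 * returns 0 everywhere else.
 */
unsigned long demo_effective_flags(bool kernel_is_64bit, bool requested)
{
	if (!kernel_is_64bit)
		return 0;	/* stub: the request is dropped, no diagnostic */

	return requested ? DEMO_VM_SEALED : 0;
}
```

In this model, a 32-bit boot with "yes" produces exactly the same flags
as a boot with the option left off, and nothing reports the difference
to the user.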
>
> > > +#endif
> > > +
> > > #endif /* _LINUX_MM_H */
> > > diff --git a/init/Kconfig b/init/Kconfig
> > > index 1aa95a5dfff8..614719259aa0 100644
> > > --- a/init/Kconfig
> > > +++ b/init/Kconfig
> > > @@ -1860,6 +1860,31 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
> > > config ARCH_HAS_MEMBARRIER_SYNC_CORE
> > > bool
> > >
> > > +config ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > > + bool
> > > + help
> > > + Control SEAL_SYSTEM_MAPPINGS access based on architecture.
> > > +
> > > + A 64-bit kernel is required for the memory sealing feature.
> > > + No specific hardware features from the CPU are needed.
> > > +
> > > + To enable this feature, the architecture needs to be tested to
> > > + confirm that it doesn't unmap/remap system mappings during the
> > > + the life time of the process. After the architecture enables this,
> > > + a distribution can set CONFIG_SEAL_SYSTEM_MAPPING to manage access
> > > + to the feature.
> > > +
> > > + The CONFIG_SEAL_SYSTEM_MAPPINGS already checks the CHECKPOINT_RESTORE
> > > + feature, which is known to remap/unmap vdso. Thus, the presence of
> > > + CHECKPOINT_RESTORE is not considered a factor in enabling
> > > + ARCH_HAS_SEAL_SYSTEM_MAPPINGS for a architecture.
> > > +
> > > + For complete list of system mappings, please see
> > > + CONFIG_SEAL_SYSTEM_MAPPINGS.
> > > +
> > > + For complete descriptions of memory sealing, please see
> > > + Documentation/userspace-api/mseal.rst
> > > +
> > > config HAVE_PERF_EVENTS
> > > bool
> > > help
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 57fd5ab2abe7..bc694c555805 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -2133,6 +2133,16 @@ struct vm_area_struct *_install_special_mapping(
> > > unsigned long addr, unsigned long len,
> > > unsigned long vm_flags, const struct vm_special_mapping *spec)
> > > {
> > > + /*
> > > + * At present, all mappings (vdso, vvar, sigpage, and uprobe) that
> > > + * invoke the _install_special_mapping function can be sealed.
> > > + * Therefore, it is logical to call the seal_system_mappings_enabled()
> > > + * function here. In the future, if this is not the case, i.e. if certain
> > > + * mappings cannot be sealed, then it would be necessary to move this
> > > + * check to the calling function.
> > > + */
> > > + vm_flags |= seal_system_mappings();
> > > +
> > > return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec,
> > > &special_mapping_vmops);
> > > }
> > > diff --git a/mm/mseal.c b/mm/mseal.c
> > > index ece977bd21e1..80126d6231bb 100644
> > > --- a/mm/mseal.c
> > > +++ b/mm/mseal.c
> > > @@ -7,6 +7,7 @@
> > > * Author: Jeff Xu <jeffxu@chromium.org>
> > > */
> > >
> > > +#include <linux/fs_parser.h>
> > > #include <linux/mempolicy.h>
> > > #include <linux/mman.h>
> > > #include <linux/mm.h>
> > > @@ -266,3 +267,41 @@ SYSCALL_DEFINE3(mseal, unsigned long, start, size_t, len, unsigned long,
> > > {
> > > return do_mseal(start, len, flags);
> > > }
> > > +
> > > +/*
> > > + * Kernel cmdline override for CONFIG_SEAL_SYSTEM_MAPPINGS
> > > + */
> > > +enum seal_system_mappings_type {
> > > + SEAL_SYSTEM_MAPPINGS_DISABLED,
> > > + SEAL_SYSTEM_MAPPINGS_ENABLED
> > > +};
> > > +
> > > +static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
> > > + IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
> > > + SEAL_SYSTEM_MAPPINGS_DISABLED;
> > > +
> > > +static const struct constant_table value_table_sys_mapping[] __initconst = {
> > > + { "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
> > > + { "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
> > > + { }
> > > +};
> > > +
> > > +static int __init early_seal_system_mappings_override(char *buf)
> > > +{
> > > + if (!buf)
> > > + return -EINVAL;
> > > +
> > > + seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
> > > + buf, seal_system_mappings_v);
> > > + return 0;
> > > +}
> > > +
> > > +early_param("exec.seal_system_mappings", early_seal_system_mappings_override);
> > > +
> > > +unsigned long seal_system_mappings(void)
> > > +{
> > > + if (seal_system_mappings_v == SEAL_SYSTEM_MAPPINGS_ENABLED)
> > > + return VM_SEALED;
> > > +
> > > + return 0;
> > > +}
> > > diff --git a/security/Kconfig b/security/Kconfig
> > > index 28e685f53bd1..5bbb8d989d79 100644
> > > --- a/security/Kconfig
> > > +++ b/security/Kconfig
> > > @@ -51,6 +51,30 @@ config PROC_MEM_NO_FORCE
> > >
> > > endchoice
> > >
> > > +config SEAL_SYSTEM_MAPPINGS
> > > + bool "seal system mappings"
> >
> > I'd prefer an 'mseal' here please, it's becoming hard to grep for this
> > stuff. We overload 'seal' too much and I want to be able to identify what
> > is a memfd seal and what is an mseal or whatever else...
> >
> I'm OK with MSEAL_
Thanks.
>
> > > + default n
> > > + depends on 64BIT
> > > + depends on ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > > + depends on !CHECKPOINT_RESTORE
> >
> > I don't know why we bother setting restrictions on this but allow them to
> > be overridden with a boot flag?
> >
> The idea is that a distribution might not enable kernel security features
> by default, and the kernel cmdline gives users the flexibility to
> enable them.
>
> This is the same approach as the proc_mem.force_override kernel cmdline
> parameter, which Kees recommended [1]. I would prefer to keep this as is.
>
> [1] https://lore.kernel.org/all/202402261110.B8129C002@keescook/
>
This is flawed on multiple levels. Firstly, from the linked change:
+config SECURITY_PROC_MEM_RESTRICT_WRITES
+ bool "Restrict /proc/<pid>/mem write access"
+ default n
+ help
There are no 'depends on'. Yours has 'depends on' which you've just
rendered totally irrelevant including _allowing the enabling of this
feature in broken situations_ like CRIU, as I mentioned below.
For another, the linked feature changes behaviour and a user may or may not
want to allow the ability to write to /proc/<pid>/mem which is ENTIRELY
DIFFERENT from this proposed feature.
Under what circumstances could a user possibly want to write VVAR, VDSO,
etc. etc.? It just makes absolutely no sense for this to be a boot switch.
So the arguments presented there have zero bearing on this series.
> > This means somebody with CRIU enabled could enable this and have a broken
> > kernel right? We can't allow that.
Please do not ignore review comments like this.
> >
> > I'd much prefer we either:
> >
> > 1. Just have a CONFIG_MSEAL_SYSTEM_MAPPINGS flag. _or_
> > 2. Have CONFIG_MSEAL_SYSTEM_MAPPINGS enable, allow kernel flag to disable.
> >
> > In both cases you #ifdef on CONFIG_MSEAL_SYSTEM_MAPPINGS, and the
> > restrictions apply correctly.
> >
> > If in the future we decide this feature is stable and ready and good to
> > enable globally we can just change the default on this to y at some later
> > date?
> >
> > Otherwise it just seems like, in effect, the kernel command line flag is a
> > debug flag to experiment on arbitrary kernels?
> >
> > > + help
> > > + Seal system mappings such as vdso, vvar, sigpage, vsyscall, uprobes.
> > > +
> > > + A 64-bit kernel is required for the memory sealing feature.
> > > + No specific hardware features from the CPU are needed.
> > > +
> > > + Depends on the ARCH_HAS_SEAL_SYSTEM_MAPPINGS.
> > > +
> > > + CHECKPOINT_RESTORE might relocate vdso mapping during restore,
> > > + and remap/unmap will fail when the mapping is sealed, therefore
> > > + !CHECKPOINT_RESTORE is added as dependency.
> > > +
> > > + Kernel command line exec.seal_system_mappings=(no/yes) overrides
> > > + this.
> > > +
> > > + For complete descriptions of memory sealing, please see
> > > + Documentation/userspace-api/mseal.rst
> > > +
> > > config SECURITY
> > > bool "Enable different security models"
> > > depends on SYSFS
> > > --
> > > 2.47.0.338.g60cca15819-goog
> > >
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-12-03 7:35 ` Lorenzo Stoakes
@ 2024-12-03 18:19 ` Jeff Xu
2024-12-03 20:16 ` Lorenzo Stoakes
0 siblings, 1 reply; 62+ messages in thread
From: Jeff Xu @ 2024-12-03 18:19 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Vlastimil Babka, akpm, keescook, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem, hch,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb,
Liam.Howlett, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
groeck, mpe
On Mon, Dec 2, 2024 at 11:35 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Dec 02, 2024 at 12:38:27PM -0800, Jeff Xu wrote:
> > On Mon, Dec 2, 2024 at 10:29 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@chromium.org wrote:
> > > > From: Jeff Xu <jeffxu@chromium.org>
> > > >
> > > > Seal vdso, vvar, sigpage, uprobes and vsyscall.
> > > >
> > > > Those mappings are readonly or executable only, sealing can protect
> > > > them from ever changing or unmapped during the life time of the process.
> > > > For complete descriptions of memory sealing, please see mseal.rst [1].
> > > >
> > > > System mappings such as vdso, vvar, and sigpage (for arm) are
> > > > generated by the kernel during program initialization, and are
> > > > sealed after creation.
> > > >
> > > > Unlike the aforementioned mappings, the uprobe mapping is not
> > > > established during program startup. However, its lifetime is the same
> > > > as the process's lifetime [2]. It is sealed from creation.
> > > >
> > > > The vdso, vvar, sigpage, and uprobe mappings all invoke the
> > > > _install_special_mapping() function. As no other mappings utilize this
> > > > function, it is logical to incorporate sealing logic within
> > > > _install_special_mapping(). This approach avoids the necessity of
> > > > modifying code across various architecture-specific implementations.
> > > >
> > > > The vsyscall mapping, which has its own initialization function, is
> > > > sealed in the XONLY case, it seems to be the most common and secure
> > > > case of using vsyscall.
> > > >
> > > > It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
> > > > alter the mapping of vdso, vvar, and sigpage during restore
> > > > operations. Consequently, this feature cannot be universally enabled
> > > > across all systems.
> > > >
> > > > Currently, memory sealing is only functional in a 64-bit kernel
> > > > configuration.
> > > >
> > > > To enable this feature, the architecture needs to be tested to
> > > > confirm that it doesn't unmap/remap system mappings during the
> > > > the life time of the process. After the architecture enables
> > > > ARCH_HAS_SEAL_SYSTEM_MAPPINGS, a distribution can set
> > > > CONFIG_SEAL_SYSTEM_MAPPING to manage access to the feature.
> > > > Alternatively, kernel command line (exec.seal_system_mappings)
> > > > enables this feature also.
> > > >
> > > > This feature is tested using ChromeOS and Android on X86_64 and ARM64,
> > > > therefore ARCH_HAS_SEAL_SYSTEM_MAPPINGS is set for X86_64 and ARM64.
> > > > Other architectures can enable this after testing. No specific hardware
> > > > features from the CPU are needed.
> > > >
> > > > This feature's security enhancements will benefit ChromeOS, Android,
> > > > and other secure-by-default systems.
> > > >
> > > > [1] Documentation/userspace-api/mseal.rst
> > > > [2] https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/
> > > > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > > > ---
> > > > .../admin-guide/kernel-parameters.txt | 11 ++++++
> > > > Documentation/userspace-api/mseal.rst | 4 ++
> > > > arch/arm64/Kconfig | 1 +
> > > > arch/x86/Kconfig | 1 +
> > > > arch/x86/entry/vsyscall/vsyscall_64.c | 8 +++-
> > > > include/linux/mm.h | 12 ++++++
> > > > init/Kconfig | 25 ++++++++++++
> > > > mm/mmap.c | 10 +++++
> > > > mm/mseal.c | 39 +++++++++++++++++++
> > > > security/Kconfig | 24 ++++++++++++
> > > > 10 files changed, 133 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > > index e7bfe1bde49e..f63268341739 100644
> > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > @@ -1538,6 +1538,17 @@
> > > > Permit 'security.evm' to be updated regardless of
> > > > current integrity status.
> > > >
> > > > + exec.seal_system_mappings = [KNL]
> > > > + Format: { no | yes }
> > > > + Seal system mappings: vdso, vvar, sigpage, vsyscall,
> > > > + uprobe.
> > > > + - 'no': do not seal system mappings.
> > > > + - 'yes': seal system mappings.
> > > > + This overrides CONFIG_SEAL_SYSTEM_MAPPINGS=(y/n)
> > > > + If not specified or invalid, default is the value set by
> > > > + CONFIG_SEAL_SYSTEM_MAPPINGS.
> > > > + This option has no effect if CONFIG_64BIT=n
> > > > +
> > > > early_page_ext [KNL,EARLY] Enforces page_ext initialization to earlier
> > > > stages so cover more early boot allocations.
> > > > Please note that as side effect some optimizations
> > > > diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
> > > > index 41102f74c5e2..bec122318a59 100644
> > > > --- a/Documentation/userspace-api/mseal.rst
> > > > +++ b/Documentation/userspace-api/mseal.rst
> > > > @@ -130,6 +130,10 @@ Use cases
> > > >
> > > > - Chrome browser: protect some security sensitive data structures.
> > > >
> > > > +- seal system mappings:
> > > > + kernel config CONFIG_SEAL_SYSTEM_MAPPINGS seals system mappings such
> > > > + as vdso, vvar, sigpage, uprobes and vsyscall.
> > > > +
> > > > When not to use mseal
> > > > =====================
> > > > Applications can apply sealing to any virtual memory region from userspace,
> > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > > index 63de71544d95..fc5da8f74342 100644
> > > > --- a/arch/arm64/Kconfig
> > > > +++ b/arch/arm64/Kconfig
> > > > @@ -44,6 +44,7 @@ config ARM64
> > > > select ARCH_HAS_SETUP_DMA_OPS
> > > > select ARCH_HAS_SET_DIRECT_MAP
> > > > select ARCH_HAS_SET_MEMORY
> > > > + select ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > > > select ARCH_STACKWALK
> > > > select ARCH_HAS_STRICT_KERNEL_RWX
> > > > select ARCH_HAS_STRICT_MODULE_RWX
> > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > > index 1ea18662942c..5f6bac99974c 100644
> > > > --- a/arch/x86/Kconfig
> > > > +++ b/arch/x86/Kconfig
> > > > @@ -26,6 +26,7 @@ config X86_64
> > > > depends on 64BIT
> > > > # Options that are inherently 64-bit kernel only:
> > > > select ARCH_HAS_GIGANTIC_PAGE
> > > > + select ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > > > select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> > > > select ARCH_SUPPORTS_PER_VMA_LOCK
> > > > select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
> > > > diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
> > > > index 2fb7d53cf333..30e0958915ca 100644
> > > > --- a/arch/x86/entry/vsyscall/vsyscall_64.c
> > > > +++ b/arch/x86/entry/vsyscall/vsyscall_64.c
> > > > @@ -366,8 +366,12 @@ void __init map_vsyscall(void)
> > > > set_vsyscall_pgtable_user_bits(swapper_pg_dir);
> > > > }
> > > >
> > > > - if (vsyscall_mode == XONLY)
> > > > - vm_flags_init(&gate_vma, VM_EXEC);
> > > > + if (vsyscall_mode == XONLY) {
> > > > + unsigned long vm_flags = VM_EXEC;
> > > > +
> > > > + vm_flags |= seal_system_mappings();
> > > > + vm_flags_init(&gate_vma, vm_flags);
> > > > + }
> > > >
> > > > BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
> > > > (unsigned long)VSYSCALL_ADDR);
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index df0a5eac66b7..f787d6c85cbb 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -4238,4 +4238,16 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
> > > > int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
> > > > int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
> > > >
> > > > +#ifdef CONFIG_64BIT
> > > > +/*
> > > > + * return VM_SEALED if seal system mapping is enabled.
> > > > + */
> > > > +unsigned long seal_system_mappings(void);
> > > > +#else
> > > > +static inline unsigned long seal_system_mappings(void)
> > > > +{
> > > > + return 0;
> > > > +}
> > >
> > > OK so we can set seal system mappings on a 32-bit system and
> > > silently... just not do it?...
> > >
> > I don't understand what you mean.
> >
> > The function returns the vm_flags used to seal system mappings.
> > On 32-bit, it returns 0.
> >
> > The caller (in mmap.c) does the following:
> > vm_flags |= seal_system_mappings();
> >
> > (This pattern was recommended by Liam.)
> >
> > Is it because the function name is misleading? I can change it to
> > seal_flags_system_mappings() if there is no objection to the long
> > name.
>
> No, I'm saying that you're making it possible for somebody to enable this
> feature on a 32-bit system, and to think it's enabled and that they're
> protected when in fact they're not.
>
The kernel cmdline change already documents the 32-bit case; see
kernel-parameters.txt:
"This option has no effect if CONFIG_64BIT=n"
> Which is, security-wise, I think rather unwise.
>
> Again it's an argument against a cmdline parameter. See below.
>
> >
> > > > +#endif
> > > > +
> > > > #endif /* _LINUX_MM_H */
> > > > diff --git a/init/Kconfig b/init/Kconfig
> > > > index 1aa95a5dfff8..614719259aa0 100644
> > > > --- a/init/Kconfig
> > > > +++ b/init/Kconfig
> > > > @@ -1860,6 +1860,31 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
> > > > config ARCH_HAS_MEMBARRIER_SYNC_CORE
> > > > bool
> > > >
> > > > +config ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > > > + bool
> > > > + help
> > > > + Control SEAL_SYSTEM_MAPPINGS access based on architecture.
> > > > +
> > > > + A 64-bit kernel is required for the memory sealing feature.
> > > > + No specific hardware features from the CPU are needed.
> > > > +
> > > > + To enable this feature, the architecture needs to be tested to
> > > > + confirm that it doesn't unmap/remap system mappings during the
> > > > + the life time of the process. After the architecture enables this,
> > > > + a distribution can set CONFIG_SEAL_SYSTEM_MAPPING to manage access
> > > > + to the feature.
> > > > +
> > > > + The CONFIG_SEAL_SYSTEM_MAPPINGS already checks the CHECKPOINT_RESTORE
> > > > + feature, which is known to remap/unmap vdso. Thus, the presence of
> > > > + CHECKPOINT_RESTORE is not considered a factor in enabling
> > > > + ARCH_HAS_SEAL_SYSTEM_MAPPINGS for a architecture.
> > > > +
> > > > + For complete list of system mappings, please see
> > > > + CONFIG_SEAL_SYSTEM_MAPPINGS.
> > > > +
> > > > + For complete descriptions of memory sealing, please see
> > > > + Documentation/userspace-api/mseal.rst
> > > > +
> > > > config HAVE_PERF_EVENTS
> > > > bool
> > > > help
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index 57fd5ab2abe7..bc694c555805 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -2133,6 +2133,16 @@ struct vm_area_struct *_install_special_mapping(
> > > > unsigned long addr, unsigned long len,
> > > > unsigned long vm_flags, const struct vm_special_mapping *spec)
> > > > {
> > > > + /*
> > > > + * At present, all mappings (vdso, vvar, sigpage, and uprobe) that
> > > > + * invoke the _install_special_mapping function can be sealed.
> > > > + * Therefore, it is logical to call the seal_system_mappings_enabled()
> > > > + * function here. In the future, if this is not the case, i.e. if certain
> > > > + * mappings cannot be sealed, then it would be necessary to move this
> > > > + * check to the calling function.
> > > > + */
> > > > + vm_flags |= seal_system_mappings();
> > > > +
> > > > return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec,
> > > > &special_mapping_vmops);
> > > > }
> > > > diff --git a/mm/mseal.c b/mm/mseal.c
> > > > index ece977bd21e1..80126d6231bb 100644
> > > > --- a/mm/mseal.c
> > > > +++ b/mm/mseal.c
> > > > @@ -7,6 +7,7 @@
> > > > * Author: Jeff Xu <jeffxu@chromium.org>
> > > > */
> > > >
> > > > +#include <linux/fs_parser.h>
> > > > #include <linux/mempolicy.h>
> > > > #include <linux/mman.h>
> > > > #include <linux/mm.h>
> > > > @@ -266,3 +267,41 @@ SYSCALL_DEFINE3(mseal, unsigned long, start, size_t, len, unsigned long,
> > > > {
> > > > return do_mseal(start, len, flags);
> > > > }
> > > > +
> > > > +/*
> > > > + * Kernel cmdline override for CONFIG_SEAL_SYSTEM_MAPPINGS
> > > > + */
> > > > +enum seal_system_mappings_type {
> > > > + SEAL_SYSTEM_MAPPINGS_DISABLED,
> > > > + SEAL_SYSTEM_MAPPINGS_ENABLED
> > > > +};
> > > > +
> > > > +static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
> > > > + IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
> > > > + SEAL_SYSTEM_MAPPINGS_DISABLED;
> > > > +
> > > > +static const struct constant_table value_table_sys_mapping[] __initconst = {
> > > > + { "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
> > > > + { "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
> > > > + { }
> > > > +};
> > > > +
> > > > +static int __init early_seal_system_mappings_override(char *buf)
> > > > +{
> > > > + if (!buf)
> > > > + return -EINVAL;
> > > > +
> > > > + seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
> > > > + buf, seal_system_mappings_v);
> > > > + return 0;
> > > > +}
> > > > +
> > > > +early_param("exec.seal_system_mappings", early_seal_system_mappings_override);
> > > > +
> > > > +unsigned long seal_system_mappings(void)
> > > > +{
> > > > + if (seal_system_mappings_v == SEAL_SYSTEM_MAPPINGS_ENABLED)
> > > > + return VM_SEALED;
> > > > +
> > > > + return 0;
> > > > +}
> > > > diff --git a/security/Kconfig b/security/Kconfig
> > > > index 28e685f53bd1..5bbb8d989d79 100644
> > > > --- a/security/Kconfig
> > > > +++ b/security/Kconfig
> > > > @@ -51,6 +51,30 @@ config PROC_MEM_NO_FORCE
> > > >
> > > > endchoice
> > > >
> > > > +config SEAL_SYSTEM_MAPPINGS
> > > > + bool "seal system mappings"
> > >
> > > I'd prefer an 'mseal' here please, it's becoming hard to grep for this
> > > stuff. We overload 'seal' too much and I want to be able to identify what
> > > is a memfd seal and what is an mseal or whatever else...
> > >
> > I m OK with MSEAL_
>
> Thanks.
>
> >
> > > > + default n
> > > > + depends on 64BIT
> > > > + depends on ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > > > + depends on !CHECKPOINT_RESTORE
> > >
> > > I don't know why we bother setting restrictions on this but allow them to
> > > be overriden with a boot flag?
> > >
> > The idea is a distribution might not enable kernel security features
> > by default, and kernel cmdline provides flexibility to let users
> > enable it.
> >
> > This is the same approach as proc_mem.force_override kernel cmd line
> > where Kees recommended [1], I would prefer to keep this as is.
> >
> > [1] https://lore.kernel.org/all/202402261110.B8129C002@keescook/
> >
>
> This is flawed on multiple levels. Firstly, from the linked change:
>
> +config SECURITY_PROC_MEM_RESTRICT_WRITES
> + bool "Restrict /proc/<pid>/mem write access"
> + default n
> + help
>
> There are no 'depends on'. Yours has 'depends on' which you've just
> rendered totally irrelevant including _allowing the enabling of this
> feature in broken situations_ like CRIU, as I mentioned below.
>
> For another, the linked feature changes behaviour and a user may or may not
> want to allow the ability to write to /proc/<pid>/mem which is ENTIRELY
> DIFFERENT from this proposed feature.
>
> Under what circumstances could a user possibly want to write VVAR, VDSO,
> etc. etc.? It just makes absolutely no sense for this to be a boot switch.
>
> So the arguments presented there have zero bearing on this series.
>
> > > This means somebody with CRIU enabled could enable this and have a broken
> > > kernel right? We can't allow that.
>
> Please do not ignore review comments like this.
>
The kernel cmdline is a valid use case. As explained in the previous
response, it lets users enable security features without having to
rebuild the distribution's kernel.

For the concern that a CRIU user mistakenly enables this feature
through the kernel cmdline: there are already instructions and comments
throughout the code about the impact on CRIU of sealing system
mappings. In my view, users who enable this feature through the kernel
cmdline should already have enough context, i.e. it is an educated
decision rather than an experiment.

That said, if we want to protect CRIU users from mistakenly enabling
memory sealing, we could add a check that CRIU is not enabled, i.e. if
CRIU is enabled, the kernel cmdline has no effect. In my view this is
extra complexity without meaningful benefit, though: if it is the
user's educated choice, I prefer to honor it.
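
The check mentioned above could look roughly like the following. This is
a userspace model for illustration only, not part of the patch: struct
seal_config, effective_seal_flags() and VM_SEALED_MODEL are hypothetical
names, and the unknown-value fallback is modelled on how the patch uses
lookup_constant().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the kernel's VM_SEALED bit (hypothetical value). */
#define VM_SEALED_MODEL 0x1UL

/* Build-time facts the decision depends on (hypothetical model). */
struct seal_config {
	bool is_64bit;           /* models CONFIG_64BIT */
	bool checkpoint_restore; /* models CONFIG_CHECKPOINT_RESTORE */
	bool config_default;     /* models CONFIG_SEAL_SYSTEM_MAPPINGS */
};

/*
 * Return the flag bits a caller would OR into vm_flags.
 * cmdline_value is the string given to exec.seal_system_mappings=,
 * or NULL when the parameter is absent.
 */
unsigned long effective_seal_flags(const struct seal_config *cfg,
				   const char *cmdline_value)
{
	bool enabled = cfg->config_default;

	/* Absent or unrecognised values keep the build-time default. */
	if (cmdline_value) {
		if (strcmp(cmdline_value, "yes") == 0)
			enabled = true;
		else if (strcmp(cmdline_value, "no") == 0)
			enabled = false;
	}

	/* 32-bit kernels never seal; with CRIU built in, the cmdline is ignored. */
	if (!cfg->is_64bit || cfg->checkpoint_restore)
		return 0;

	return enabled ? VM_SEALED_MODEL : 0;
}
```

With this shape, "yes" on a CHECKPOINT_RESTORE kernel yields 0, so the
caller's vm_flags |= ... pattern stays unchanged.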
> > >
> > > I'd much prefer we either:
> > >
> > > 1. Just have a CONFIG_MSEAL_SYSTEM_MAPPINGS flag. _or_
> > > 2. Have CONFIG_MSEAL_SYSTEM_MAPPINGS enable, allow kernel flag to disable.
> > >
> > > In both cases you #ifdef on CONFIG_MSEAL_SYSTEM_MAPPINGS, and the
> > > restrictions appply correctly.
> > >
> > > If in the future we decide this feature is stable and ready and good to
> > > enable globally we can just change the default on this to y at some later
> > > date?
> > >
> > > Otherwise it just seems like in a effect the kernel command line flag is a
> > > debug flag to experiment on arbitrary kernels?
> > >
> > > > + help
> > > > + Seal system mappings such as vdso, vvar, sigpage, vsyscall, uprobes.
> > > > +
> > > > + A 64-bit kernel is required for the memory sealing feature.
> > > > + No specific hardware features from the CPU are needed.
> > > > +
> > > > + Depends on the ARCH_HAS_SEAL_SYSTEM_MAPPINGS.
> > > > +
> > > > + CHECKPOINT_RESTORE might relocate vdso mapping during restore,
> > > > + and remap/unmap will fail when the mapping is sealed, therefore
> > > > + !CHECKPOINT_RESTORE is added as dependency.
> > > > +
> > > > + Kernel command line exec.seal_system_mappings=(no/yes) overrides
> > > > + this.
> > > > +
> > > > + For complete descriptions of memory sealing, please see
> > > > + Documentation/userspace-api/mseal.rst
> > > > +
> > > > config SECURITY
> > > > bool "Enable different security models"
> > > > depends on SYSFS
> > > > --
> > > > 2.47.0.338.g60cca15819-goog
> > > >
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-12-03 18:19 ` Jeff Xu
@ 2024-12-03 20:16 ` Lorenzo Stoakes
0 siblings, 0 replies; 62+ messages in thread
From: Lorenzo Stoakes @ 2024-12-03 20:16 UTC (permalink / raw)
To: Jeff Xu
Cc: Vlastimil Babka, akpm, keescook, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem, hch,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb,
Liam.Howlett, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
groeck, mpe
NACK.

Unfortunately past experience indicates that engaging in a back-and-forth
is not productive.

Please re-read what I've said and properly address the very serious
concerns raised.

This series is unacceptable in its current form as it allows untested
architectures and known broken configurations to enable this feature.

On Tue, Dec 03, 2024 at 10:19:31AM -0800, Jeff Xu wrote:
> On Mon, Dec 2, 2024 at 11:35 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Mon, Dec 02, 2024 at 12:38:27PM -0800, Jeff Xu wrote:
> > > On Mon, Dec 2, 2024 at 10:29 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@chromium.org wrote:
> > > > > From: Jeff Xu <jeffxu@chromium.org>
> > > > >
> > > > > Seal vdso, vvar, sigpage, uprobes and vsyscall.
> > > > >
> > > > > Those mappings are readonly or executable only, sealing can protect
> > > > > them from ever changing or unmapped during the life time of the process.
> > > > > For complete descriptions of memory sealing, please see mseal.rst [1].
> > > > >
> > > > > System mappings such as vdso, vvar, and sigpage (for arm) are
> > > > > generated by the kernel during program initialization, and are
> > > > > sealed after creation.
> > > > >
> > > > > Unlike the aforementioned mappings, the uprobe mapping is not
> > > > > established during program startup. However, its lifetime is the same
> > > > > as the process's lifetime [2]. It is sealed from creation.
> > > > >
> > > > > The vdso, vvar, sigpage, and uprobe mappings all invoke the
> > > > > _install_special_mapping() function. As no other mappings utilize this
> > > > > function, it is logical to incorporate sealing logic within
> > > > > _install_special_mapping(). This approach avoids the necessity of
> > > > > modifying code across various architecture-specific implementations.
> > > > >
> > > > > The vsyscall mapping, which has its own initialization function, is
> > > > > sealed in the XONLY case, it seems to be the most common and secure
> > > > > case of using vsyscall.
> > > > >
> > > > > It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
> > > > > alter the mapping of vdso, vvar, and sigpage during restore
> > > > > operations. Consequently, this feature cannot be universally enabled
> > > > > across all systems.
> > > > >
> > > > > Currently, memory sealing is only functional in a 64-bit kernel
> > > > > configuration.
> > > > >
> > > > > To enable this feature, the architecture needs to be tested to
> > > > > confirm that it doesn't unmap/remap system mappings during the
> > > > > the life time of the process. After the architecture enables
> > > > > ARCH_HAS_SEAL_SYSTEM_MAPPINGS, a distribution can set
> > > > > CONFIG_SEAL_SYSTEM_MAPPING to manage access to the feature.
> > > > > Alternatively, kernel command line (exec.seal_system_mappings)
> > > > > enables this feature also.
> > > > >
> > > > > This feature is tested using ChromeOS and Android on X86_64 and ARM64,
> > > > > therefore ARCH_HAS_SEAL_SYSTEM_MAPPINGS is set for X86_64 and ARM64.
> > > > > Other architectures can enable this after testing. No specific hardware
> > > > > features from the CPU are needed.
> > > > >
> > > > > This feature's security enhancements will benefit ChromeOS, Android,
> > > > > and other secure-by-default systems.
> > > > >
> > > > > [1] Documentation/userspace-api/mseal.rst
> > > > > [2] https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/
> > > > > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > > > > ---
> > > > > .../admin-guide/kernel-parameters.txt | 11 ++++++
> > > > > Documentation/userspace-api/mseal.rst | 4 ++
> > > > > arch/arm64/Kconfig | 1 +
> > > > > arch/x86/Kconfig | 1 +
> > > > > arch/x86/entry/vsyscall/vsyscall_64.c | 8 +++-
> > > > > include/linux/mm.h | 12 ++++++
> > > > > init/Kconfig | 25 ++++++++++++
> > > > > mm/mmap.c | 10 +++++
> > > > > mm/mseal.c | 39 +++++++++++++++++++
> > > > > security/Kconfig | 24 ++++++++++++
> > > > > 10 files changed, 133 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > > > index e7bfe1bde49e..f63268341739 100644
> > > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > > @@ -1538,6 +1538,17 @@
> > > > > Permit 'security.evm' to be updated regardless of
> > > > > current integrity status.
> > > > >
> > > > > + exec.seal_system_mappings = [KNL]
> > > > > + Format: { no | yes }
> > > > > + Seal system mappings: vdso, vvar, sigpage, vsyscall,
> > > > > + uprobe.
> > > > > + - 'no': do not seal system mappings.
> > > > > + - 'yes': seal system mappings.
> > > > > + This overrides CONFIG_SEAL_SYSTEM_MAPPINGS=(y/n)
> > > > > + If not specified or invalid, default is the value set by
> > > > > + CONFIG_SEAL_SYSTEM_MAPPINGS.
> > > > > + This option has no effect if CONFIG_64BIT=n
> > > > > +
> > > > > early_page_ext [KNL,EARLY] Enforces page_ext initialization to earlier
> > > > > stages so cover more early boot allocations.
> > > > > Please note that as side effect some optimizations
> > > > > diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
> > > > > index 41102f74c5e2..bec122318a59 100644
> > > > > --- a/Documentation/userspace-api/mseal.rst
> > > > > +++ b/Documentation/userspace-api/mseal.rst
> > > > > @@ -130,6 +130,10 @@ Use cases
> > > > >
> > > > > - Chrome browser: protect some security sensitive data structures.
> > > > >
> > > > > +- seal system mappings:
> > > > > + kernel config CONFIG_SEAL_SYSTEM_MAPPINGS seals system mappings such
> > > > > + as vdso, vvar, sigpage, uprobes and vsyscall.
> > > > > +
> > > > > When not to use mseal
> > > > > =====================
> > > > > Applications can apply sealing to any virtual memory region from userspace,
> > > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > > > index 63de71544d95..fc5da8f74342 100644
> > > > > --- a/arch/arm64/Kconfig
> > > > > +++ b/arch/arm64/Kconfig
> > > > > @@ -44,6 +44,7 @@ config ARM64
> > > > > select ARCH_HAS_SETUP_DMA_OPS
> > > > > select ARCH_HAS_SET_DIRECT_MAP
> > > > > select ARCH_HAS_SET_MEMORY
> > > > > + select ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > > > > select ARCH_STACKWALK
> > > > > select ARCH_HAS_STRICT_KERNEL_RWX
> > > > > select ARCH_HAS_STRICT_MODULE_RWX
> > > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > > > index 1ea18662942c..5f6bac99974c 100644
> > > > > --- a/arch/x86/Kconfig
> > > > > +++ b/arch/x86/Kconfig
> > > > > @@ -26,6 +26,7 @@ config X86_64
> > > > > depends on 64BIT
> > > > > # Options that are inherently 64-bit kernel only:
> > > > > select ARCH_HAS_GIGANTIC_PAGE
> > > > > + select ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > > > > select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> > > > > select ARCH_SUPPORTS_PER_VMA_LOCK
> > > > > select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
> > > > > diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
> > > > > index 2fb7d53cf333..30e0958915ca 100644
> > > > > --- a/arch/x86/entry/vsyscall/vsyscall_64.c
> > > > > +++ b/arch/x86/entry/vsyscall/vsyscall_64.c
> > > > > @@ -366,8 +366,12 @@ void __init map_vsyscall(void)
> > > > > set_vsyscall_pgtable_user_bits(swapper_pg_dir);
> > > > > }
> > > > >
> > > > > - if (vsyscall_mode == XONLY)
> > > > > - vm_flags_init(&gate_vma, VM_EXEC);
> > > > > + if (vsyscall_mode == XONLY) {
> > > > > + unsigned long vm_flags = VM_EXEC;
> > > > > +
> > > > > + vm_flags |= seal_system_mappings();
> > > > > + vm_flags_init(&gate_vma, vm_flags);
> > > > > + }
> > > > >
> > > > > BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
> > > > > (unsigned long)VSYSCALL_ADDR);
> > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > index df0a5eac66b7..f787d6c85cbb 100644
> > > > > --- a/include/linux/mm.h
> > > > > +++ b/include/linux/mm.h
> > > > > @@ -4238,4 +4238,16 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
> > > > > int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
> > > > > int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
> > > > >
> > > > > +#ifdef CONFIG_64BIT
> > > > > +/*
> > > > > + * return VM_SEALED if seal system mapping is enabled.
> > > > > + */
> > > > > +unsigned long seal_system_mappings(void);
> > > > > +#else
> > > > > +static inline unsigned long seal_system_mappings(void)
> > > > > +{
> > > > > + return 0;
> > > > > +}
> > > >
> > > > OK so we can set seal system mappings on a 32-bit system and
> > > > silently... just not do it?...
> > > >
> > > I don't understand what you meant.
> > >
> > > The function returns the vm_flags for seal system mappings.
> > > In 32 bit, it returns 0.
> > >
> > > the caller (in mmap.c) does below:
> > > vm_flags |= seal_system_mappings();
> > >
> > > (The pattern is recommended by Liam. )
> > >
> > > Is that because the function name is misleading ? I can change it to
> > > seal_flags_system_mappings() if there is no objection to the long
> > > name.
> >
> > No, I'm saying that you're making it possible for somebody to enable this
> > feature on a 32-bit system, and to think it's enabled and that they're
> > protected when in fact they're not.
> >
> The kernel cmdline change already has comments about 32-bit: see:
> kernel-parameters.txt
> "This option has no effect if CONFIG_64BIT=n"
>
> > Which is, security-wise, I think rather unwise.
> >
> > Again it's an argument against a cmdline parameter. See below.
> >
> > >
> > > > > +#endif
> > > > > +
> > > > > #endif /* _LINUX_MM_H */
> > > > > diff --git a/init/Kconfig b/init/Kconfig
> > > > > index 1aa95a5dfff8..614719259aa0 100644
> > > > > --- a/init/Kconfig
> > > > > +++ b/init/Kconfig
> > > > > @@ -1860,6 +1860,31 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
> > > > > config ARCH_HAS_MEMBARRIER_SYNC_CORE
> > > > > bool
> > > > >
> > > > > +config ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > > > > + bool
> > > > > + help
> > > > > + Control SEAL_SYSTEM_MAPPINGS access based on architecture.
> > > > > +
> > > > > + A 64-bit kernel is required for the memory sealing feature.
> > > > > + No specific hardware features from the CPU are needed.
> > > > > +
> > > > > + To enable this feature, the architecture needs to be tested to
> > > > > + confirm that it doesn't unmap/remap system mappings during the
> > > > > + the life time of the process. After the architecture enables this,
> > > > > + a distribution can set CONFIG_SEAL_SYSTEM_MAPPING to manage access
> > > > > + to the feature.
> > > > > +
> > > > > + The CONFIG_SEAL_SYSTEM_MAPPINGS already checks the CHECKPOINT_RESTORE
> > > > > + feature, which is known to remap/unmap vdso. Thus, the presence of
> > > > > + CHECKPOINT_RESTORE is not considered a factor in enabling
> > > > > + ARCH_HAS_SEAL_SYSTEM_MAPPINGS for a architecture.
> > > > > +
> > > > > + For complete list of system mappings, please see
> > > > > + CONFIG_SEAL_SYSTEM_MAPPINGS.
> > > > > +
> > > > > + For complete descriptions of memory sealing, please see
> > > > > + Documentation/userspace-api/mseal.rst
> > > > > +
> > > > > config HAVE_PERF_EVENTS
> > > > > bool
> > > > > help
> > > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > > index 57fd5ab2abe7..bc694c555805 100644
> > > > > --- a/mm/mmap.c
> > > > > +++ b/mm/mmap.c
> > > > > @@ -2133,6 +2133,16 @@ struct vm_area_struct *_install_special_mapping(
> > > > > unsigned long addr, unsigned long len,
> > > > > unsigned long vm_flags, const struct vm_special_mapping *spec)
> > > > > {
> > > > > + /*
> > > > > + * At present, all mappings (vdso, vvar, sigpage, and uprobe) that
> > > > > + * invoke the _install_special_mapping function can be sealed.
> > > > > + * Therefore, it is logical to call the seal_system_mappings_enabled()
> > > > > + * function here. In the future, if this is not the case, i.e. if certain
> > > > > + * mappings cannot be sealed, then it would be necessary to move this
> > > > > + * check to the calling function.
> > > > > + */
> > > > > + vm_flags |= seal_system_mappings();
> > > > > +
> > > > > return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec,
> > > > > &special_mapping_vmops);
> > > > > }
> > > > > diff --git a/mm/mseal.c b/mm/mseal.c
> > > > > index ece977bd21e1..80126d6231bb 100644
> > > > > --- a/mm/mseal.c
> > > > > +++ b/mm/mseal.c
> > > > > @@ -7,6 +7,7 @@
> > > > > * Author: Jeff Xu <jeffxu@chromium.org>
> > > > > */
> > > > >
> > > > > +#include <linux/fs_parser.h>
> > > > > #include <linux/mempolicy.h>
> > > > > #include <linux/mman.h>
> > > > > #include <linux/mm.h>
> > > > > @@ -266,3 +267,41 @@ SYSCALL_DEFINE3(mseal, unsigned long, start, size_t, len, unsigned long,
> > > > > {
> > > > > return do_mseal(start, len, flags);
> > > > > }
> > > > > +
> > > > > +/*
> > > > > + * Kernel cmdline override for CONFIG_SEAL_SYSTEM_MAPPINGS
> > > > > + */
> > > > > +enum seal_system_mappings_type {
> > > > > + SEAL_SYSTEM_MAPPINGS_DISABLED,
> > > > > + SEAL_SYSTEM_MAPPINGS_ENABLED
> > > > > +};
> > > > > +
> > > > > +static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
> > > > > + IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
> > > > > + SEAL_SYSTEM_MAPPINGS_DISABLED;
> > > > > +
> > > > > +static const struct constant_table value_table_sys_mapping[] __initconst = {
> > > > > + { "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
> > > > > + { "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
> > > > > + { }
> > > > > +};
> > > > > +
> > > > > +static int __init early_seal_system_mappings_override(char *buf)
> > > > > +{
> > > > > + if (!buf)
> > > > > + return -EINVAL;
> > > > > +
> > > > > + seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
> > > > > + buf, seal_system_mappings_v);
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > +early_param("exec.seal_system_mappings", early_seal_system_mappings_override);
> > > > > +
> > > > > +unsigned long seal_system_mappings(void)
> > > > > +{
> > > > > + if (seal_system_mappings_v == SEAL_SYSTEM_MAPPINGS_ENABLED)
> > > > > + return VM_SEALED;
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > > > diff --git a/security/Kconfig b/security/Kconfig
> > > > > index 28e685f53bd1..5bbb8d989d79 100644
> > > > > --- a/security/Kconfig
> > > > > +++ b/security/Kconfig
> > > > > @@ -51,6 +51,30 @@ config PROC_MEM_NO_FORCE
> > > > >
> > > > > endchoice
> > > > >
> > > > > +config SEAL_SYSTEM_MAPPINGS
> > > > > + bool "seal system mappings"
> > > >
> > > > I'd prefer an 'mseal' here please, it's becoming hard to grep for this
> > > > stuff. We overload 'seal' too much and I want to be able to identify what
> > > > is a memfd seal and what is an mseal or whatever else...
> > > >
> > > I m OK with MSEAL_
> >
> > Thanks.
> >
> > >
> > > > > + default n
> > > > > + depends on 64BIT
> > > > > + depends on ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > > > > + depends on !CHECKPOINT_RESTORE
> > > >
> > > > I don't know why we bother setting restrictions on this but allow them to
> > > > be overriden with a boot flag?
> > > >
> > > The idea is a distribution might not enable kernel security features
> > > by default, and kernel cmdline provides flexibility to let users
> > > enable it.
> > >
> > > This is the same approach as proc_mem.force_override kernel cmd line
> > > where Kees recommended [1], I would prefer to keep this as is.
> > >
> > > [1] https://lore.kernel.org/all/202402261110.B8129C002@keescook/
> > >
> >
> > This is flawed on multiple levels. Firstly, from the linked change:
> >
> > +config SECURITY_PROC_MEM_RESTRICT_WRITES
> > + bool "Restrict /proc/<pid>/mem write access"
> > + default n
> > + help
> >
> > There are no 'depends on'. Yours has 'depends on' which you've just
> > rendered totally irrelevant including _allowing the enabling of this
> > feature in broken situations_ like CRIU, as I mentioned below.
> >
> > For another, the linked feature changes behaviour and a user may or may not
> > want to allow the ability to write to /proc/<pid>/mem which is ENTIRELY
> > DIFFERENT from this proposed feature.
> >
> > Under what circumstances could a user possibly want to write VVAR, VDSO,
> > etc. etc.? It just makes absolutely no sense for this to be a boot switch.
> >
> > So the arguments presented there have zero bearing on this series.
> >
> > > > This means somebody with CRIU enabled could enable this and have a broken
> > > > kernel right? We can't allow that.
> >
> > Please do not ignore review comments like this.
> >
> kernel cmdline is a valid user case. The reasoning is explained in
> the previous response, it allows users to enable security features
> without having to rebuild the distribution's kernel.
>
> For the concern that CRIU user enables this feature through kernel
> cmdline mistakenly, there is already instruction and comments
> throughout code to make aware of the impact to CRIU after enabling
> sealing for system-mapping. In my view: users who want to enable this
> feature through kernel cmdline should already have enough context
> about this, i.e. it is an educated decision rather than experimenting.
>
> That said, if we want to protect those CRIU users from mistakenly
> enabling memory sealing, we could add a check to verify CRIU is not
> enabled, i.e. if CRIU is enabled, the kernel cmd line has no effect.
> Although in my view this is an extra complicity without meaningful
> benefit - If it is the user's educated choice, I prefer to honor it.
>
>
>
> > > >
> > > > I'd much prefer we either:
> > > >
> > > > 1. Just have a CONFIG_MSEAL_SYSTEM_MAPPINGS flag. _or_
> > > > 2. Have CONFIG_MSEAL_SYSTEM_MAPPINGS enable, allow kernel flag to disable.
> > > >
> > > > In both cases you #ifdef on CONFIG_MSEAL_SYSTEM_MAPPINGS, and the
> > > > restrictions appply correctly.
> > > >
> > > > If in the future we decide this feature is stable and ready and good to
> > > > enable globally we can just change the default on this to y at some later
> > > > date?
> > > >
> > > > Otherwise it just seems like in a effect the kernel command line flag is a
> > > > debug flag to experiment on arbitrary kernels?
> > > >
> > > > > + help
> > > > > + Seal system mappings such as vdso, vvar, sigpage, vsyscall, uprobes.
> > > > > +
> > > > > + A 64-bit kernel is required for the memory sealing feature.
> > > > > + No specific hardware features from the CPU are needed.
> > > > > +
> > > > > + Depends on the ARCH_HAS_SEAL_SYSTEM_MAPPINGS.
> > > > > +
> > > > > + CHECKPOINT_RESTORE might relocate vdso mapping during restore,
> > > > > + and remap/unmap will fail when the mapping is sealed, therefore
> > > > > + !CHECKPOINT_RESTORE is added as dependency.
> > > > > +
> > > > > + Kernel command line exec.seal_system_mappings=(no/yes) overrides
> > > > > + this.
> > > > > +
> > > > > + For complete descriptions of memory sealing, please see
> > > > > + Documentation/userspace-api/mseal.rst
> > > > > +
> > > > > config SECURITY
> > > > > bool "Enable different security models"
> > > > > depends on SYSFS
> > > > > --
> > > > > 2.47.0.338.g60cca15819-goog
> > > > >
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-11-25 20:20 ` [PATCH v4 1/1] exec: seal " jeffxu
2024-11-25 20:40 ` Matthew Wilcox
2024-12-02 18:29 ` Lorenzo Stoakes
@ 2024-12-04 14:04 ` Benjamin Berg
2024-12-04 17:43 ` Jeff Xu
2024-12-10 4:12 ` Andrei Vagin
2024-12-17 22:18 ` Kees Cook
4 siblings, 1 reply; 62+ messages in thread
From: Benjamin Berg @ 2024-12-04 14:04 UTC (permalink / raw)
To: jeffxu, akpm, keescook, jannh, torvalds, adhemerval.zanella,
oleg, linux-um
Cc: linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe
Hi,
On Mon, 2024-11-25 at 20:20 +0000, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
>
> Seal vdso, vvar, sigpage, uprobes and vsyscall.
>
> Those mappings are readonly or executable only, sealing can protect
> them from ever changing or unmapped during the life time of the process.
> For complete descriptions of memory sealing, please see mseal.rst [1].
>
> System mappings such as vdso, vvar, and sigpage (for arm) are
> generated by the kernel during program initialization, and are
> sealed after creation.
>
> Unlike the aforementioned mappings, the uprobe mapping is not
> established during program startup. However, its lifetime is the same
> as the process's lifetime [2]. It is sealed from creation.
>
> The vdso, vvar, sigpage, and uprobe mappings all invoke the
> _install_special_mapping() function. As no other mappings utilize this
> function, it is logical to incorporate sealing logic within
> _install_special_mapping(). This approach avoids the necessity of
> modifying code across various architecture-specific implementations.
>
> The vsyscall mapping, which has its own initialization function, is
> sealed in the XONLY case, it seems to be the most common and secure
> case of using vsyscall.
>
> It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
> alter the mapping of vdso, vvar, and sigpage during restore
> operations. Consequently, this feature cannot be universally enabled
> across all systems.
I think that enabling this feature would break User Mode Linux (UML).
It uses a tiny static helper executable to create userspace MMs. This
executable just maps some "stub" data/code pages[1] for management and
after that all other memory has to be unmapped as it is managed by the
UML kernel.
This unmapping will not work if the vdso/vvar mappings are sealed.
Maybe nobody who enables the feature cares about UML. But wanted to
raise it as a potential issue in case you are not aware yet.
Benjamin
[1] Hmm, we should mseal() those stub pages.
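To illustrate why the unmapping step would fail: once a range is sealed,
munmap() on it is refused with EPERM. The following is a minimal userspace
sketch, not UML code. It seals an ordinary anonymous page with mseal()
instead of touching a vdso mapping. The helper name
probe_unmap_of_sealed_page() and the enum are hypothetical, and the
fallback syscall number 462 is an assumption for systems whose headers
predate mseal().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_mseal
#define __NR_mseal 462 /* assumption: generic syscall number of mseal() */
#endif

/* Hypothetical result codes for this sketch. */
enum seal_probe {
	SEAL_PROBE_ERROR = -1,      /* setup failed or unexpected kernel answer */
	SEAL_PROBE_UNAVAILABLE = 0, /* mseal() is not usable on this system */
	SEAL_PROBE_REFUSED = 1,     /* munmap() of the sealed page gave EPERM */
};

static enum seal_probe probe_unmap_of_sealed_page(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	void *page;

	assert(page_size > 0);

	page = mmap(NULL, (size_t)page_size, PROT_READ,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return SEAL_PROBE_ERROR;

	/* Old kernels, 32-bit kernels and some sandboxes reject the call. */
	if (syscall(__NR_mseal, page, (size_t)page_size, 0UL) != 0) {
		munmap(page, (size_t)page_size);
		return SEAL_PROBE_UNAVAILABLE;
	}

	/*
	 * The page is sealed now and stays mapped until the process exits.
	 * This is the same situation a helper would face when it tries to
	 * unmap a sealed vdso/vvar mapping.
	 */
	if (munmap(page, (size_t)page_size) == -1 && errno == EPERM)
		return SEAL_PROBE_REFUSED;

	return SEAL_PROBE_ERROR;
}
```

On a kernel with mseal() support the probe is expected to return
SEAL_PROBE_REFUSED; elsewhere it reports SEAL_PROBE_UNAVAILABLE.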
>
> Currently, memory sealing is only functional in a 64-bit kernel
> configuration.
>
> To enable this feature, the architecture needs to be tested to
> confirm that it doesn't unmap/remap system mappings during the
> lifetime of the process. After the architecture enables
> ARCH_HAS_SEAL_SYSTEM_MAPPINGS, a distribution can set
> CONFIG_SEAL_SYSTEM_MAPPINGS to manage access to the feature.
> Alternatively, kernel command line (exec.seal_system_mappings)
> enables this feature also.
>
> This feature is tested using ChromeOS and Android on X86_64 and ARM64,
> therefore ARCH_HAS_SEAL_SYSTEM_MAPPINGS is set for X86_64 and ARM64.
> Other architectures can enable this after testing. No specific hardware
> features from the CPU are needed.
>
> This feature's security enhancements will benefit ChromeOS, Android,
> and other secure-by-default systems.
>
> [1] Documentation/userspace-api/mseal.rst
> [2] https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> ---
> .../admin-guide/kernel-parameters.txt | 11 ++++++
> Documentation/userspace-api/mseal.rst | 4 ++
> arch/arm64/Kconfig | 1 +
> arch/x86/Kconfig | 1 +
> arch/x86/entry/vsyscall/vsyscall_64.c | 8 +++-
> include/linux/mm.h | 12 ++++++
> init/Kconfig | 25 ++++++++++++
> mm/mmap.c | 10 +++++
> mm/mseal.c | 39 +++++++++++++++++++
> security/Kconfig | 24 ++++++++++++
> 10 files changed, 133 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index e7bfe1bde49e..f63268341739 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1538,6 +1538,17 @@
> Permit 'security.evm' to be updated regardless of
> current integrity status.
>
> + exec.seal_system_mappings = [KNL]
> + Format: { no | yes }
> + Seal system mappings: vdso, vvar, sigpage, vsyscall,
> + uprobe.
> + - 'no': do not seal system mappings.
> + - 'yes': seal system mappings.
> + This overrides CONFIG_SEAL_SYSTEM_MAPPINGS=(y/n)
> + If not specified or invalid, default is the value set by
> + CONFIG_SEAL_SYSTEM_MAPPINGS.
> + This option has no effect if CONFIG_64BIT=n
> +
> early_page_ext [KNL,EARLY] Enforces page_ext initialization to earlier
> stages so cover more early boot allocations.
> Please note that as side effect some optimizations
> diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
> index 41102f74c5e2..bec122318a59 100644
> --- a/Documentation/userspace-api/mseal.rst
> +++ b/Documentation/userspace-api/mseal.rst
> @@ -130,6 +130,10 @@ Use cases
>
> - Chrome browser: protect some security sensitive data structures.
>
> +- seal system mappings:
> + kernel config CONFIG_SEAL_SYSTEM_MAPPINGS seals system mappings such
> + as vdso, vvar, sigpage, uprobes and vsyscall.
> +
> When not to use mseal
> =====================
> Applications can apply sealing to any virtual memory region from userspace,
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 63de71544d95..fc5da8f74342 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -44,6 +44,7 @@ config ARM64
> select ARCH_HAS_SETUP_DMA_OPS
> select ARCH_HAS_SET_DIRECT_MAP
> select ARCH_HAS_SET_MEMORY
> + select ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> select ARCH_STACKWALK
> select ARCH_HAS_STRICT_KERNEL_RWX
> select ARCH_HAS_STRICT_MODULE_RWX
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 1ea18662942c..5f6bac99974c 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -26,6 +26,7 @@ config X86_64
> depends on 64BIT
> # Options that are inherently 64-bit kernel only:
> select ARCH_HAS_GIGANTIC_PAGE
> + select ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> select ARCH_SUPPORTS_PER_VMA_LOCK
> select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
> diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
> index 2fb7d53cf333..30e0958915ca 100644
> --- a/arch/x86/entry/vsyscall/vsyscall_64.c
> +++ b/arch/x86/entry/vsyscall/vsyscall_64.c
> @@ -366,8 +366,12 @@ void __init map_vsyscall(void)
> set_vsyscall_pgtable_user_bits(swapper_pg_dir);
> }
>
> - if (vsyscall_mode == XONLY)
> - vm_flags_init(&gate_vma, VM_EXEC);
> + if (vsyscall_mode == XONLY) {
> + unsigned long vm_flags = VM_EXEC;
> +
> + vm_flags |= seal_system_mappings();
> + vm_flags_init(&gate_vma, vm_flags);
> + }
>
> BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
> (unsigned long)VSYSCALL_ADDR);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index df0a5eac66b7..f787d6c85cbb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4238,4 +4238,16 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
> int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
> int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
>
> +#ifdef CONFIG_64BIT
> +/*
> + * return VM_SEALED if seal system mapping is enabled.
> + */
> +unsigned long seal_system_mappings(void);
> +#else
> +static inline unsigned long seal_system_mappings(void)
> +{
> + return 0;
> +}
> +#endif
> +
> #endif /* _LINUX_MM_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index 1aa95a5dfff8..614719259aa0 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1860,6 +1860,31 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
> config ARCH_HAS_MEMBARRIER_SYNC_CORE
> bool
>
> +config ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> + bool
> + help
> + Control SEAL_SYSTEM_MAPPINGS access based on architecture.
> +
> + A 64-bit kernel is required for the memory sealing feature.
> + No specific hardware features from the CPU are needed.
> +
> + To enable this feature, the architecture needs to be tested to
> + confirm that it doesn't unmap/remap system mappings during the
> + lifetime of the process. After the architecture enables this,
> + a distribution can set CONFIG_SEAL_SYSTEM_MAPPINGS to manage access
> + to the feature.
> +
> + The CONFIG_SEAL_SYSTEM_MAPPINGS already checks the CHECKPOINT_RESTORE
> + feature, which is known to remap/unmap vdso. Thus, the presence of
> + CHECKPOINT_RESTORE is not considered a factor in enabling
> + ARCH_HAS_SEAL_SYSTEM_MAPPINGS for an architecture.
> +
> + For complete list of system mappings, please see
> + CONFIG_SEAL_SYSTEM_MAPPINGS.
> +
> + For complete descriptions of memory sealing, please see
> + Documentation/userspace-api/mseal.rst
> +
> config HAVE_PERF_EVENTS
> bool
> help
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 57fd5ab2abe7..bc694c555805 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2133,6 +2133,16 @@ struct vm_area_struct *_install_special_mapping(
> unsigned long addr, unsigned long len,
> unsigned long vm_flags, const struct vm_special_mapping *spec)
> {
> + /*
> + * At present, all mappings (vdso, vvar, sigpage, and uprobe) that
> + * invoke the _install_special_mapping function can be sealed.
> + * Therefore, it is logical to call the seal_system_mappings()
> + * function here. In the future, if this is not the case, i.e. if certain
> + * mappings cannot be sealed, then it would be necessary to move this
> + * check to the calling function.
> + */
> + vm_flags |= seal_system_mappings();
> +
> return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec,
> &special_mapping_vmops);
> }
> diff --git a/mm/mseal.c b/mm/mseal.c
> index ece977bd21e1..80126d6231bb 100644
> --- a/mm/mseal.c
> +++ b/mm/mseal.c
> @@ -7,6 +7,7 @@
> * Author: Jeff Xu <jeffxu@chromium.org>
> */
>
> +#include <linux/fs_parser.h>
> #include <linux/mempolicy.h>
> #include <linux/mman.h>
> #include <linux/mm.h>
> @@ -266,3 +267,41 @@ SYSCALL_DEFINE3(mseal, unsigned long, start, size_t, len, unsigned long,
> {
> return do_mseal(start, len, flags);
> }
> +
> +/*
> + * Kernel cmdline override for CONFIG_SEAL_SYSTEM_MAPPINGS
> + */
> +enum seal_system_mappings_type {
> + SEAL_SYSTEM_MAPPINGS_DISABLED,
> + SEAL_SYSTEM_MAPPINGS_ENABLED
> +};
> +
> +static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
> + IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
> + SEAL_SYSTEM_MAPPINGS_DISABLED;
> +
> +static const struct constant_table value_table_sys_mapping[] __initconst = {
> + { "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
> + { "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
> + { }
> +};
> +
> +static int __init early_seal_system_mappings_override(char *buf)
> +{
> + if (!buf)
> + return -EINVAL;
> +
> + seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
> + buf, seal_system_mappings_v);
> + return 0;
> +}
> +
> +early_param("exec.seal_system_mappings", early_seal_system_mappings_override);
> +
> +unsigned long seal_system_mappings(void)
> +{
> + if (seal_system_mappings_v == SEAL_SYSTEM_MAPPINGS_ENABLED)
> + return VM_SEALED;
> +
> + return 0;
> +}
> diff --git a/security/Kconfig b/security/Kconfig
> index 28e685f53bd1..5bbb8d989d79 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -51,6 +51,30 @@ config PROC_MEM_NO_FORCE
>
> endchoice
>
> +config SEAL_SYSTEM_MAPPINGS
> + bool "seal system mappings"
> + default n
> + depends on 64BIT
> + depends on ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> + depends on !CHECKPOINT_RESTORE
> + help
> + Seal system mappings such as vdso, vvar, sigpage, vsyscall, uprobes.
> +
> + A 64-bit kernel is required for the memory sealing feature.
> + No specific hardware features from the CPU are needed.
> +
> + Depends on the ARCH_HAS_SEAL_SYSTEM_MAPPINGS.
> +
> + CHECKPOINT_RESTORE might relocate vdso mapping during restore,
> + and remap/unmap will fail when the mapping is sealed, therefore
> + !CHECKPOINT_RESTORE is added as dependency.
> +
> + Kernel command line exec.seal_system_mappings=(no/yes) overrides
> + this.
> +
> + For complete descriptions of memory sealing, please see
> + Documentation/userspace-api/mseal.rst
> +
> config SECURITY
> bool "Enable different security models"
> depends on SYSFS
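The interaction between CONFIG_SEAL_SYSTEM_MAPPINGS and the
exec.seal_system_mappings= parameter in the quoted mm/mseal.c hunk can be
modelled in plain userspace C. This is only a sketch of the decision
logic, assuming (as the quoted documentation states) that an unknown or
missing value falls back to the build-time default. MODEL_VM_SEALED,
lookup_or_default() and model_seal_flags() are hypothetical names and not
kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's VM_SEALED flag bit. */
#define MODEL_VM_SEALED 0x1UL

enum seal_mode { SEAL_DISABLED, SEAL_ENABLED };

struct name_value {
	const char *name;
	int value;
};

/* Same two spellings the patch accepts on the command line. */
static const struct name_value seal_values[] = {
	{ "no",  SEAL_DISABLED },
	{ "yes", SEAL_ENABLED },
};

/* Return the value for `arg`, or `not_found` if `arg` is not in the table. */
static int lookup_or_default(const char *arg, int not_found)
{
	assert(arg != NULL);

	for (size_t i = 0; i < sizeof(seal_values) / sizeof(seal_values[0]); i++) {
		if (strcmp(seal_values[i].name, arg) == 0)
			return seal_values[i].value;
	}
	return not_found;
}

/*
 * config_default models CONFIG_SEAL_SYSTEM_MAPPINGS, cmdline_arg models the
 * string after "exec.seal_system_mappings=" (NULL if the parameter is absent).
 * The result is the flag that would be OR-ed into vm_flags.
 */
static unsigned long model_seal_flags(int config_default, const char *cmdline_arg)
{
	int mode = config_default;

	if (cmdline_arg)
		mode = lookup_or_default(cmdline_arg, config_default);

	return mode == SEAL_ENABLED ? MODEL_VM_SEALED : 0;
}
```

For example, model_seal_flags(SEAL_DISABLED, "yes") yields the sealed flag,
while an unrecognised string such as "maybe" leaves the build-time default
in effect.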
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-12-04 14:04 ` Benjamin Berg
@ 2024-12-04 17:43 ` Jeff Xu
2024-12-04 18:24 ` Benjamin Berg
0 siblings, 1 reply; 62+ messages in thread
From: Jeff Xu @ 2024-12-04 17:43 UTC (permalink / raw)
To: Benjamin Berg
Cc: akpm, keescook, jannh, torvalds, adhemerval.zanella, oleg,
linux-um, linux-kernel, linux-hardening, linux-mm, jorgelo,
sroettger, ojeda, adobriyan, anna-maria, mark.rutland,
linus.walleij, Jason, deller, rdunlap, davem, hch, peterx, hca,
f.fainelli, gerg, dave.hansen, mingo, ardb, Liam.Howlett, mhocko,
42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe
On Wed, Dec 4, 2024 at 6:04 AM Benjamin Berg <benjamin@sipsolutions.net> wrote:
>
> Hi,
>
> On Mon, 2024-11-25 at 20:20 +0000, jeffxu@chromium.org wrote:
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > Seal vdso, vvar, sigpage, uprobes and vsyscall.
> >
> > Those mappings are readonly or executable only, sealing can protect
> > them from ever changing or unmapped during the life time of the process.
> > For complete descriptions of memory sealing, please see mseal.rst [1].
> >
> > System mappings such as vdso, vvar, and sigpage (for arm) are
> > generated by the kernel during program initialization, and are
> > sealed after creation.
> >
> > Unlike the aforementioned mappings, the uprobe mapping is not
> > established during program startup. However, its lifetime is the same
> > as the process's lifetime [2]. It is sealed from creation.
> >
> > The vdso, vvar, sigpage, and uprobe mappings all invoke the
> > _install_special_mapping() function. As no other mappings utilize this
> > function, it is logical to incorporate sealing logic within
> > _install_special_mapping(). This approach avoids the necessity of
> > modifying code across various architecture-specific implementations.
> >
> > The vsyscall mapping, which has its own initialization function, is
> > sealed in the XONLY case, it seems to be the most common and secure
> > case of using vsyscall.
> >
> > It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
> > alter the mapping of vdso, vvar, and sigpage during restore
> > operations. Consequently, this feature cannot be universally enabled
> > across all systems.
>
> I think that enabling this feature would break User Mode Linux (UML).
> It uses a tiny static helper executable to create userspace MMs. This
> executable just maps some "stub" data/code pages[1] for management and
> after that all other memory has to be unmapped as it is managed by the
> UML kernel.
> This unmapping will not work if the vdso/vvar mappings are sealed.
>
> Maybe nobody who enables the feature cares about UML. But wanted to
> raise it as a potential issue in case you are not aware yet.
>
Thank you for bringing this to my attention. I will add this information
to the documentation/comments.
Do you think we need to add a Kconfig check similar to
!CHECKPOINT_RESTORE? Or is this something purely in userspace that
the kernel has no control over?
> Benjamin
>
> [1] Hmm, we should mseal() those stub pages.
>
Is this reference [1] correct?
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-12-04 17:43 ` Jeff Xu
@ 2024-12-04 18:24 ` Benjamin Berg
0 siblings, 0 replies; 62+ messages in thread
From: Benjamin Berg @ 2024-12-04 18:24 UTC (permalink / raw)
To: Jeff Xu
Cc: akpm, keescook, jannh, torvalds, adhemerval.zanella, oleg,
linux-um, linux-kernel, linux-hardening, linux-mm, jorgelo,
sroettger, ojeda, adobriyan, anna-maria, mark.rutland,
linus.walleij, Jason, deller, rdunlap, davem, hch, peterx, hca,
f.fainelli, gerg, dave.hansen, mingo, ardb, Liam.Howlett, mhocko,
42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe
Hi,
On Wed, 2024-12-04 at 09:43 -0800, Jeff Xu wrote:
> On Wed, Dec 4, 2024 at 6:04 AM Benjamin Berg <benjamin@sipsolutions.net> wrote:
> > On Mon, 2024-11-25 at 20:20 +0000, jeffxu@chromium.org wrote:
> > > From: Jeff Xu <jeffxu@chromium.org>
> > >
> > > Seal vdso, vvar, sigpage, uprobes and vsyscall.
> > >
> > > Those mappings are readonly or executable only, sealing can protect
> > > them from ever changing or unmapped during the life time of the process.
> > > For complete descriptions of memory sealing, please see mseal.rst [1].
> > >
> > > System mappings such as vdso, vvar, and sigpage (for arm) are
> > > generated by the kernel during program initialization, and are
> > > sealed after creation.
> > >
> > > Unlike the aforementioned mappings, the uprobe mapping is not
> > > established during program startup. However, its lifetime is the same
> > > as the process's lifetime [2]. It is sealed from creation.
> > >
> > > The vdso, vvar, sigpage, and uprobe mappings all invoke the
> > > _install_special_mapping() function. As no other mappings utilize this
> > > function, it is logical to incorporate sealing logic within
> > > _install_special_mapping(). This approach avoids the necessity of
> > > modifying code across various architecture-specific implementations.
> > >
> > > The vsyscall mapping, which has its own initialization function, is
> > > sealed in the XONLY case, it seems to be the most common and secure
> > > case of using vsyscall.
> > >
> > > It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
> > > alter the mapping of vdso, vvar, and sigpage during restore
> > > operations. Consequently, this feature cannot be universally enabled
> > > across all systems.
> >
> > I think that enabling this feature would break User Mode Linux (UML).
> > It uses a tiny static helper executable to create userspace MMs. This
> > executable just maps some "stub" data/code pages[1] for management and
> > after that all other memory has to be unmapped as it is managed by the
> > UML kernel.
> > This unmapping will not work if the vdso/vvar mappings are sealed.
> >
> > Maybe nobody who enables the feature cares about UML. But wanted to
> > raise it as a potential issue in case you are not aware yet.
> >
> Thank you for bringing this to attention, I will add this information
> to documentation/comments.
>
> Do you think we need to add a KCONFIG check similar to
> !CHECKPOINT_RESTORE ? or this is something purely in userspace and
> the kernel doesn't have a control.
UML is purely in userspace, so there is no need for any checks.
> > [1] Hmm, we should mseal() those stub pages.
> >
> is this reference [1] correct ?
I think so, but it was off-topic for this thread. I just realized that
this is a possible improvement to the UML code.
Benjamin
> > >
> > > Currently, memory sealing is only functional in a 64-bit kernel
> > > configuration.
> > >
> > > To enable this feature, the architecture needs to be tested to
> > > confirm that it doesn't unmap/remap system mappings during the
> > > lifetime of the process. After the architecture enables
> > > ARCH_HAS_SEAL_SYSTEM_MAPPINGS, a distribution can set
> > > CONFIG_SEAL_SYSTEM_MAPPINGS to manage access to the feature.
> > > Alternatively, kernel command line (exec.seal_system_mappings)
> > > enables this feature also.
> > >
> > > This feature is tested using ChromeOS and Android on X86_64 and ARM64,
> > > therefore ARCH_HAS_SEAL_SYSTEM_MAPPINGS is set for X86_64 and ARM64.
> > > Other architectures can enable this after testing. No specific hardware
> > > features from the CPU are needed.
> > >
> > > This feature's security enhancements will benefit ChromeOS, Android,
> > > and other secure-by-default systems.
> > >
> > > [1] Documentation/userspace-api/mseal.rst
> > > [2] https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/
> > > Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> > > ---
> > > .../admin-guide/kernel-parameters.txt | 11 ++++++
> > > Documentation/userspace-api/mseal.rst | 4 ++
> > > arch/arm64/Kconfig | 1 +
> > > arch/x86/Kconfig | 1 +
> > > arch/x86/entry/vsyscall/vsyscall_64.c | 8 +++-
> > > include/linux/mm.h | 12 ++++++
> > > init/Kconfig | 25 ++++++++++++
> > > mm/mmap.c | 10 +++++
> > > mm/mseal.c | 39 +++++++++++++++++++
> > > security/Kconfig | 24 ++++++++++++
> > > 10 files changed, 133 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > index e7bfe1bde49e..f63268341739 100644
> > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > @@ -1538,6 +1538,17 @@
> > > Permit 'security.evm' to be updated regardless of
> > > current integrity status.
> > >
> > > + exec.seal_system_mappings = [KNL]
> > > + Format: { no | yes }
> > > + Seal system mappings: vdso, vvar, sigpage, vsyscall,
> > > + uprobe.
> > > + - 'no': do not seal system mappings.
> > > + - 'yes': seal system mappings.
> > > + This overrides CONFIG_SEAL_SYSTEM_MAPPINGS=(y/n)
> > > + If not specified or invalid, default is the value set by
> > > + CONFIG_SEAL_SYSTEM_MAPPINGS.
> > > + This option has no effect if CONFIG_64BIT=n
> > > +
> > > early_page_ext [KNL,EARLY] Enforces page_ext initialization to earlier
> > > stages so cover more early boot allocations.
> > > Please note that as side effect some optimizations
> > > diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
> > > index 41102f74c5e2..bec122318a59 100644
> > > --- a/Documentation/userspace-api/mseal.rst
> > > +++ b/Documentation/userspace-api/mseal.rst
> > > @@ -130,6 +130,10 @@ Use cases
> > >
> > > - Chrome browser: protect some security sensitive data structures.
> > >
> > > +- seal system mappings:
> > > + kernel config CONFIG_SEAL_SYSTEM_MAPPINGS seals system mappings such
> > > + as vdso, vvar, sigpage, uprobes and vsyscall.
> > > +
> > > When not to use mseal
> > > =====================
> > > Applications can apply sealing to any virtual memory region from userspace,
> > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > index 63de71544d95..fc5da8f74342 100644
> > > --- a/arch/arm64/Kconfig
> > > +++ b/arch/arm64/Kconfig
> > > @@ -44,6 +44,7 @@ config ARM64
> > > select ARCH_HAS_SETUP_DMA_OPS
> > > select ARCH_HAS_SET_DIRECT_MAP
> > > select ARCH_HAS_SET_MEMORY
> > > + select ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > > select ARCH_STACKWALK
> > > select ARCH_HAS_STRICT_KERNEL_RWX
> > > select ARCH_HAS_STRICT_MODULE_RWX
> > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > index 1ea18662942c..5f6bac99974c 100644
> > > --- a/arch/x86/Kconfig
> > > +++ b/arch/x86/Kconfig
> > > @@ -26,6 +26,7 @@ config X86_64
> > > depends on 64BIT
> > > # Options that are inherently 64-bit kernel only:
> > > select ARCH_HAS_GIGANTIC_PAGE
> > > + select ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > > select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> > > select ARCH_SUPPORTS_PER_VMA_LOCK
> > > select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
> > > diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
> > > index 2fb7d53cf333..30e0958915ca 100644
> > > --- a/arch/x86/entry/vsyscall/vsyscall_64.c
> > > +++ b/arch/x86/entry/vsyscall/vsyscall_64.c
> > > @@ -366,8 +366,12 @@ void __init map_vsyscall(void)
> > > set_vsyscall_pgtable_user_bits(swapper_pg_dir);
> > > }
> > >
> > > - if (vsyscall_mode == XONLY)
> > > - vm_flags_init(&gate_vma, VM_EXEC);
> > > + if (vsyscall_mode == XONLY) {
> > > + unsigned long vm_flags = VM_EXEC;
> > > +
> > > + vm_flags |= seal_system_mappings();
> > > + vm_flags_init(&gate_vma, vm_flags);
> > > + }
> > >
> > > BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
> > > (unsigned long)VSYSCALL_ADDR);
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index df0a5eac66b7..f787d6c85cbb 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -4238,4 +4238,16 @@ int arch_get_shadow_stack_status(struct task_struct *t, unsigned long __user *st
> > > int arch_set_shadow_stack_status(struct task_struct *t, unsigned long status);
> > > int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
> > >
> > > +#ifdef CONFIG_64BIT
> > > +/*
> > > + * return VM_SEALED if seal system mapping is enabled.
> > > + */
> > > +unsigned long seal_system_mappings(void);
> > > +#else
> > > +static inline unsigned long seal_system_mappings(void)
> > > +{
> > > + return 0;
> > > +}
> > > +#endif
> > > +
> > > #endif /* _LINUX_MM_H */
> > > diff --git a/init/Kconfig b/init/Kconfig
> > > index 1aa95a5dfff8..614719259aa0 100644
> > > --- a/init/Kconfig
> > > +++ b/init/Kconfig
> > > @@ -1860,6 +1860,31 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
> > > config ARCH_HAS_MEMBARRIER_SYNC_CORE
> > > bool
> > >
> > > +config ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > > + bool
> > > + help
> > > + Control SEAL_SYSTEM_MAPPINGS access based on architecture.
> > > +
> > > + A 64-bit kernel is required for the memory sealing feature.
> > > + No specific hardware features from the CPU are needed.
> > > +
> > > + To enable this feature, the architecture needs to be tested to
> > > + confirm that it doesn't unmap/remap system mappings during the
> > > + lifetime of the process. After the architecture enables this,
> > > + a distribution can set CONFIG_SEAL_SYSTEM_MAPPINGS to manage access
> > > + to the feature.
> > > +
> > > + The CONFIG_SEAL_SYSTEM_MAPPINGS already checks the CHECKPOINT_RESTORE
> > > + feature, which is known to remap/unmap vdso. Thus, the presence of
> > > + CHECKPOINT_RESTORE is not considered a factor in enabling
> > > + ARCH_HAS_SEAL_SYSTEM_MAPPINGS for an architecture.
> > > +
> > > + For complete list of system mappings, please see
> > > + CONFIG_SEAL_SYSTEM_MAPPINGS.
> > > +
> > > + For complete descriptions of memory sealing, please see
> > > + Documentation/userspace-api/mseal.rst
> > > +
> > > config HAVE_PERF_EVENTS
> > > bool
> > > help
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 57fd5ab2abe7..bc694c555805 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -2133,6 +2133,16 @@ struct vm_area_struct *_install_special_mapping(
> > > unsigned long addr, unsigned long len,
> > > unsigned long vm_flags, const struct vm_special_mapping *spec)
> > > {
> > > + /*
> > > + * At present, all mappings (vdso, vvar, sigpage, and uprobe) that
> > > + * invoke the _install_special_mapping function can be sealed.
> > > + * Therefore, it is logical to call the seal_system_mappings()
> > > + * function here. In the future, if this is not the case, i.e. if certain
> > > + * mappings cannot be sealed, then it would be necessary to move this
> > > + * check to the calling function.
> > > + */
> > > + vm_flags |= seal_system_mappings();
> > > +
> > > return __install_special_mapping(mm, addr, len, vm_flags, (void *)spec,
> > > &special_mapping_vmops);
> > > }
> > > diff --git a/mm/mseal.c b/mm/mseal.c
> > > index ece977bd21e1..80126d6231bb 100644
> > > --- a/mm/mseal.c
> > > +++ b/mm/mseal.c
> > > @@ -7,6 +7,7 @@
> > > * Author: Jeff Xu <jeffxu@chromium.org>
> > > */
> > >
> > > +#include <linux/fs_parser.h>
> > > #include <linux/mempolicy.h>
> > > #include <linux/mman.h>
> > > #include <linux/mm.h>
> > > @@ -266,3 +267,41 @@ SYSCALL_DEFINE3(mseal, unsigned long, start, size_t, len, unsigned long,
> > > {
> > > return do_mseal(start, len, flags);
> > > }
> > > +
> > > +/*
> > > + * Kernel cmdline override for CONFIG_SEAL_SYSTEM_MAPPINGS
> > > + */
> > > +enum seal_system_mappings_type {
> > > + SEAL_SYSTEM_MAPPINGS_DISABLED,
> > > + SEAL_SYSTEM_MAPPINGS_ENABLED
> > > +};
> > > +
> > > +static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
> > > + IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
> > > + SEAL_SYSTEM_MAPPINGS_DISABLED;
> > > +
> > > +static const struct constant_table value_table_sys_mapping[] __initconst = {
> > > + { "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
> > > + { "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
> > > + { }
> > > +};
> > > +
> > > +static int __init early_seal_system_mappings_override(char *buf)
> > > +{
> > > + if (!buf)
> > > + return -EINVAL;
> > > +
> > > + seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
> > > + buf, seal_system_mappings_v);
> > > + return 0;
> > > +}
> > > +
> > > +early_param("exec.seal_system_mappings", early_seal_system_mappings_override);
> > > +
> > > +unsigned long seal_system_mappings(void)
> > > +{
> > > + if (seal_system_mappings_v == SEAL_SYSTEM_MAPPINGS_ENABLED)
> > > + return VM_SEALED;
> > > +
> > > + return 0;
> > > +}
> > > diff --git a/security/Kconfig b/security/Kconfig
> > > index 28e685f53bd1..5bbb8d989d79 100644
> > > --- a/security/Kconfig
> > > +++ b/security/Kconfig
> > > @@ -51,6 +51,30 @@ config PROC_MEM_NO_FORCE
> > >
> > > endchoice
> > >
> > > +config SEAL_SYSTEM_MAPPINGS
> > > + bool "seal system mappings"
> > > + default n
> > > + depends on 64BIT
> > > + depends on ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > > + depends on !CHECKPOINT_RESTORE
> > > + help
> > > + Seal system mappings such as vdso, vvar, sigpage, vsyscall, uprobes.
> > > +
> > > + A 64-bit kernel is required for the memory sealing feature.
> > > + No specific hardware features from the CPU are needed.
> > > +
> > > + Depends on the ARCH_HAS_SEAL_SYSTEM_MAPPINGS.
> > > +
> > > + CHECKPOINT_RESTORE might relocate vdso mapping during restore,
> > > + and remap/unmap will fail when the mapping is sealed, therefore
> > > + !CHECKPOINT_RESTORE is added as dependency.
> > > +
> > > + Kernel command line exec.seal_system_mappings=(no/yes) overrides
> > > + this.
> > > +
> > > + For complete descriptions of memory sealing, please see
> > > + Documentation/userspace-api/mseal.rst
> > > +
> > > config SECURITY
> > > bool "Enable different security models"
> > > depends on SYSFS
> >
> >
>
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-11-25 20:20 ` [PATCH v4 1/1] exec: seal " jeffxu
` (2 preceding siblings ...)
2024-12-04 14:04 ` Benjamin Berg
@ 2024-12-10 4:12 ` Andrei Vagin
2024-12-11 22:46 ` Jeff Xu
2024-12-17 22:18 ` Kees Cook
4 siblings, 1 reply; 62+ messages in thread
From: Andrei Vagin @ 2024-12-10 4:12 UTC (permalink / raw)
To: jeffxu
Cc: akpm, keescook, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Dmitry Safonov,
Mike Rapoport, Alexander Mikhalitsyn, Andrei Vagin
On Mon, Nov 25, 2024 at 12:49 PM <jeffxu@chromium.org> wrote:
>
> From: Jeff Xu <jeffxu@chromium.org>
>
> Seal vdso, vvar, sigpage, uprobes and vsyscall.
>
> Those mappings are readonly or executable only, sealing can protect
> them from ever changing or unmapped during the life time of the process.
> For complete descriptions of memory sealing, please see mseal.rst [1].
>
> System mappings such as vdso, vvar, and sigpage (for arm) are
> generated by the kernel during program initialization, and are
> sealed after creation.
>
> Unlike the aforementioned mappings, the uprobe mapping is not
> established during program startup. However, its lifetime is the same
> as the process's lifetime [2]. It is sealed from creation.
>
> The vdso, vvar, sigpage, and uprobe mappings all invoke the
> _install_special_mapping() function. As no other mappings utilize this
> function, it is logical to incorporate sealing logic within
> _install_special_mapping(). This approach avoids the necessity of
> modifying code across various architecture-specific implementations.
>
> The vsyscall mapping, which has its own initialization function, is
> sealed in the XONLY case, it seems to be the most common and secure
> case of using vsyscall.
>
> It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
> alter the mapping of vdso, vvar, and sigpage during restore
> operations. Consequently, this feature cannot be universally enabled
> across all systems.
>
...
>
> +config SEAL_SYSTEM_MAPPINGS
> + bool "seal system mappings"
> + default n
> + depends on 64BIT
> + depends on ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> + depends on !CHECKPOINT_RESTORE
Hi Jeff,
I like the idea of this patchset, but I don’t like the idea of
forcing users to choose between this security feature and
checkpoint/restore functionality. We need to explore ways to make this
feature work with checkpoint/restore. Relying on CAP_CHECKPOINT_RESTORE
is the obvious approach.
CRIU just needs to move these mappings, and it doesn't need to change
their properties or modify their contents. With that in mind, here are
two options:
* Allow moving sealed mappings for processes with CAP_CHECKPOINT_RESTORE.
* Allow temporarily "unsealing" mappings for processes with
CAP_CHECKPOINT_RESTORE. CRIU could unseal mappings, move them, and
then seal them back.
Another approach might be to make this feature configurable on a
per-process basis (e.g., via prctl). Once enabled for a process, it
would be inherited by all its children. It can't be disabled unless a
process has CAP_CHECKPOINT_RESTORE.
I've added Mike, Dima, and Alex to the thread. They might have
other ideas.
Thanks,
Andrei
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-12-10 4:12 ` Andrei Vagin
@ 2024-12-11 22:46 ` Jeff Xu
2024-12-13 6:33 ` Andrei Vagin
0 siblings, 1 reply; 62+ messages in thread
From: Jeff Xu @ 2024-12-11 22:46 UTC (permalink / raw)
To: Andrei Vagin
Cc: akpm, keescook, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Dmitry Safonov,
Mike Rapoport, Alexander Mikhalitsyn, Andrei Vagin
Hi Andrei
Thanks for your email.
I was hoping to get some feedback from CRIU devs, and I'm happy to see
you reaching out.
On Mon, Dec 9, 2024 at 8:12 PM Andrei Vagin <avagin@gmail.com> wrote:
>
> On Mon, Nov 25, 2024 at 12:49 PM <jeffxu@chromium.org> wrote:
> >
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > Seal vdso, vvar, sigpage, uprobes and vsyscall.
> >
> > Those mappings are readonly or executable only, sealing can protect
> > them from ever changing or unmapped during the life time of the process.
> > For complete descriptions of memory sealing, please see mseal.rst [1].
> >
> > System mappings such as vdso, vvar, and sigpage (for arm) are
> > generated by the kernel during program initialization, and are
> > sealed after creation.
> >
> > Unlike the aforementioned mappings, the uprobe mapping is not
> > established during program startup. However, its lifetime is the same
> > as the process's lifetime [2]. It is sealed from creation.
> >
> > The vdso, vvar, sigpage, and uprobe mappings all invoke the
> > _install_special_mapping() function. As no other mappings utilize this
> > function, it is logical to incorporate sealing logic within
> > _install_special_mapping(). This approach avoids the necessity of
> > modifying code across various architecture-specific implementations.
> >
> > The vsyscall mapping, which has its own initialization function, is
> > sealed in the XONLY case, it seems to be the most common and secure
> > case of using vsyscall.
> >
> > It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
> > alter the mapping of vdso, vvar, and sigpage during restore
> > operations. Consequently, this feature cannot be universally enabled
> > across all systems.
> >
> ...
> >
> > +config SEAL_SYSTEM_MAPPINGS
> > + bool "seal system mappings"
> > + default n
> > + depends on 64BIT
> > + depends on ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> > + depends on !CHECKPOINT_RESTORE
>
> Hi Jeff,
>
> I like the idea of this patchset, but I don’t like the idea of
> forcing users to choose between this security feature and
> checkpoint/restore functionality. We need to explore ways to make this
> feature work with checkpoint/restore. Relying on CAP_CHECKPOINT_RESTORE
> is the obvious approach.
>
I agree that forcing users to choose isn't ideal. I'd prefer a
solution where both approaches can be used in some way, depending on
the situation and distributions. Hopefully, with input from CRIU
developers, this can be achieved.
However, it makes sense to unconditionally seal vdso/vvar for systems
like ChromeOS and Android that don't currently use CRIU, so we will
need a KCONFIG for that.
> CRIU just needs to move these mappings, and it doesn't need to change
> their properties or modify their contents. With that in mind, here are
That is an important detail to know, thanks for bringing it up.
> two options:
> * Allow moving sealed mappings for processes with CAP_CHECKPOINT_RESTORE.
We could try to propose this under a new KCONFIG, e.g.
CONFIG_SEAL_SYSTEM_MAPPING_WITH_CAP_CHECK
IIUC, you propose allowing userspace to mremap the vdso if the process
has CAP_CHECKPOINT_RESTORE. However, I believe this approach raises
security concerns. During the RFC for mseal, I initially suggested
sealing mmap, mremap, and munmap individually, but Linus rejected this
proposal for a good reason: mremap could leave an empty hole in the
address space, allowing an attacker to fill it with attacker-controlled
content.
Furthermore, CAP_SYS_ADMIN allows setting any capability, so this
would become a bypass for sealing.
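For reference, the property that a capability-gated mremap or a temporary
unseal would weaken can be observed from userspace. The following is a
minimal sketch, assuming a 64-bit Linux 6.10+ kernel. The helper name
sealed_mapping_resists_changes() is made up for illustration, and the
fallback syscall number 462 is an assumption taken from the unified syscall
table for toolchains whose headers predate mseal.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_mseal
#define __NR_mseal 462	/* assumed: unified syscall number for mseal */
#endif

/*
 * Seal one anonymous read-only page, then try to unmap and mprotect it.
 * Returns 1 if both attempts were refused with EPERM,
 *         0 if mseal() is unavailable or refused on this system,
 *        -1 on any other outcome.
 * A sealed page cannot be unmapped again, so it stays mapped until exit.
 */
int sealed_mapping_resists_changes(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, len, PROT_READ,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return -1;

	if (syscall(__NR_mseal, p, len, 0UL) != 0) {
		/* Older kernel, 32-bit kernel, or a sandbox filtering the call. */
		munmap(p, len);
		return 0;
	}

	/* Both operations must now fail with EPERM. */
	if (munmap(p, len) == 0 || errno != EPERM)
		return -1;
	if (mprotect(p, len, PROT_READ | PROT_WRITE) == 0 || errno != EPERM)
		return -1;

	return 1;
}
```

Because there is no way back once a range is sealed, any "unseal" path,
however privileged, changes what the feature guarantees.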
> * Allow temporarily "unsealing" mappings for processes with
> CAP_CHECKPOINT_RESTORE. CRIU could unseal mappings, move them, and
> then seal them back.
>
We could also try to propose this under CONFIG_SEAL_SYSTEM_MAPPING_FOR_CRIU.
It's important to note that temporarily unsealing a mapping from
userspace is not permitted. If a mapping has the capability to be
unsealed, it fundamentally does not provide the sealing property.
Perhaps the intention was for these steps to be carried out within the
kernel? For example, userspace could instruct the kernel to relocate the
vdso mapping. Since the kernel can ensure the vdso contents are not
manipulated by an attacker, this approach could offer a viable
solution.
I have been thinking of other alternatives, but those would require
a better understanding of CRIU use cases.
One of my questions is: would CRIU target an individual process or
entire systems?
If it is an individual process, we could use prctl to opt-in/opt-out
certain processes. There could be two alternatives.
1> Opt-in solution: process must set prctl.seal_criu_mapping, this
needs to be set before execve() because sealing is applied at execve()
call.
2> opt-out solution: The system will by default seal all of the system
mappings, but individual processes can opt-out by setting
prctl.not_seal_criu_mappings. This also needs to be set before
execve() call.
For both cases, we will want to identify which types of mapping CRIU
cares about, i.e. maybe CRIU doesn't care about uprobe and vsyscall
and only cares about vdso/vvar/sigpage?
> Another approach might be to make this feature configurable on a
> per-process basis (e.g., via prctl). Once enabled for a process, it
> would be inherited by all its children. It can't be disabled unless a
> process has CAP_CHECKPOINT_RESTORE.
>
> I've added Mike, Dima, and Alex to the thread. They might have
> other ideas.
>
Thanks. Please feel free to chime in. I will also add Mike, Dima, and
Alex to the next version of this series.
Thanks!
-Jeff
> Thanks,
> Andrei
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-12-11 22:46 ` Jeff Xu
@ 2024-12-13 6:33 ` Andrei Vagin
2024-12-16 18:35 ` Jeff Xu
0 siblings, 1 reply; 62+ messages in thread
From: Andrei Vagin @ 2024-12-13 6:33 UTC (permalink / raw)
To: Jeff Xu
Cc: akpm, keescook, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Dmitry Safonov,
Mike Rapoport, Alexander Mikhalitsyn, Andrei Vagin
On Wed, Dec 11, 2024 at 2:47 PM Jeff Xu <jeffxu@chromium.org> wrote:
>
> Hi Andrei
>
> Thanks for your email.
> I was hoping to get some feedback from CRIU devs, and happy to see you
> reaching out..
>
...
> I have been thinking of other alternatives, but those would require
> more understanding on CRIU use cases.
> One of my questions is: Would CRIU target an individual process? or
> entire systems?
It targets individual processes that have been forked from the main
CRIU process.
>
> If it is an individual process, we could use prctl to opt-in/opt-out
> certain processes. There could be two alternatives.
> 1> Opt-in solution: process must set prctl.seal_criu_mapping, this
> needs to be set before execve() because sealing is applied at execve()
> call.
> 2> opt-out solution: The system will by default seal all of the system
> mappings, but individual processes can opt-out by setting
> prctl.not_seal_criu_mappings. This also needs to be set before
> execve() call.
I like the idea and I think the opt-out solution should work for CRIU.
CRIU will be able to call this prctl and re-execute itself.
Let me give you a bit of context on how CRIU works. When CRIU restores
processes, it recreates a process tree by forking itself. Afterwards, it
restores all mappings in each process but doesn't yet place them at their
proper addresses. After that, each process unmaps the CRIU mappings from its
address space and remaps its restored mappings to the proper addresses. So
CRIU should be able to move system mappings, and seal them if they were
sealed before the dump.
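The "remap to the proper address" step can be illustrated with plain
mremap(). This is only a sketch under stated assumptions: it moves an
ordinary anonymous page rather than a real vdso/vvar mapping, and the names
move_mapping() and move_mapping_demo() are hypothetical helpers, not CRIU
functions.

```c
#define _GNU_SOURCE	/* for mremap() and MREMAP_* in <sys/mman.h> */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Move an existing mapping to a caller-chosen address, the way a restorer
 * relocates mappings once the final layout is known.
 * Returns the new address, or MAP_FAILED on error.
 */
void *move_mapping(void *old_addr, size_t len, void *new_addr)
{
	/* MREMAP_FIXED replaces whatever is currently mapped at new_addr. */
	return mremap(old_addr, len, len,
		      MREMAP_MAYMOVE | MREMAP_FIXED, new_addr);
}

/* Self-check on an anonymous page; returns 0 on success, -1 on failure. */
int move_mapping_demo(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	char *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* Reserve a destination range so the target holds nothing else. */
	char *dst = mmap(NULL, len, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *moved;
	int ok;

	if (src == MAP_FAILED || dst == MAP_FAILED)
		return -1;

	strcpy(src, "restored");
	moved = move_mapping(src, len, dst);
	if (moved != dst)
		return -1;

	/* Contents travel with the mapping; the old range is now unmapped. */
	ok = strcmp(moved, "restored") == 0;
	munmap(moved, len);
	return ok ? 0 : -1;
}
```

On a sealed mapping the same mremap() call is refused with EPERM, which is
why the restorer needs its own system mappings left unsealed until they
have been moved.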
BTW, it isn't just about CRIU. gVisor and maybe some other sandbox solutions
will be affected by this change too. gVisor uses stub processes to represent
guest address spaces. In a stub process, it unmaps all system mappings.
>
> For both cases, we will want to identify what type of mapping CRIU
> cares about, i.e. maybe CRIU doesn't care about uprobe and vsyscall ?
> and only care about vdso/vvar/sigpage ?
For now, it handles only vdso/vvar/sigpage mappings. It doesn't care
about vsyscall because it is always mapped at a fixed address.
gVisor should be able to unmap all system mappings from a process
address space.
Thanks,
Andrei
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-12-13 6:33 ` Andrei Vagin
@ 2024-12-16 18:35 ` Jeff Xu
2024-12-16 18:56 ` Liam R. Howlett
0 siblings, 1 reply; 62+ messages in thread
From: Jeff Xu @ 2024-12-16 18:35 UTC (permalink / raw)
To: Andrei Vagin
Cc: akpm, keescook, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Dmitry Safonov,
Mike Rapoport, Alexander Mikhalitsyn, Andrei Vagin
Hi Andrei
On Thu, Dec 12, 2024 at 10:33 PM Andrei Vagin <avagin@gmail.com> wrote:
>
> On Wed, Dec 11, 2024 at 2:47 PM Jeff Xu <jeffxu@chromium.org> wrote:
> >
> > Hi Andrei
> >
> > Thanks for your email.
> > I was hoping to get some feedback from CRIU devs, and happy to see you
> > reaching out..
> >
> ...
> > I have been thinking of other alternatives, but those would require
> > more understanding on CRIU use cases.
> > One of my questions is: Would CRIU target an individual process? or
> > entire systems?
>
> It targets individual processes that have been forked from the main
> CRIU process.
>
> >
> > If it is an individual process, we could use prctl to opt-in/opt-out
> > certain processes. There could be two alternatives.
> > 1> Opt-in solution: process must set prctl.seal_criu_mapping, this
> > needs to be set before execve() because sealing is applied at execve()
> > call.
> > 2> opt-out solution: The system will by default seal all of the system
> > mappings, but individual processes can opt-out by setting
> > prctl.not_seal_criu_mappings. This also needs to be set before
> > execve() call.
>
> I like the idea and I think the opt-out solution should work for CRIU.
> CRIU will be able to call this prctl and re-execute itself.
>
Great! Let's iterate on the opt-out solution then.
> Let me give you a bit of context on how CRIU works. When CRIU restores
> processes, it recreates a process tree by forking itself. Afterwards, it
> restores all mappings in each process but doesn't put them to proper
> addresses. After that, each process unmaps CRIU mappings from its address
> space and remaps its restored mappings to the proper addresses. So CRIU should
> be able to move system mappings and seal them if they have been sealed before
> dump.
Thanks for the context.
> BTW, It isn't just about CRIU. gVisor and maybe some other sandbox solutions
> will be affected by this change too. gVisor uses stub-processes to represent
> guest address spaces. In a stub process, it unmaps all system mappings.
>
> >
> > For both cases, we will want to identify what type of mapping CRIU
> > cares about, i.e. maybe CRIU doesn't care about uprobe and vsyscall ?
> > and only care about vdso/vvar/sigpage ?
>
> As for now, it handles only vdso/vvar/sigpage mappings. It doesn't care
> about vsyscall because it is always mapped to the fixed address.
>
Given the understanding that CRIU intends to replace the current
process's vdso/vvar with that of the restored process, and therefore
doesn't want the parent CRIU process to seal the vdso/vvar, a prctl
opt-out for vdso/vvar is a reasonable path forward.
The sigpage mapping should also be included in this opt-out for the
same reason as vdso/vvar: it is created by the
arch_setup_additional_pages() call during execve().
However, the uprobe mapping shouldn't be included in this opt-out, as
it is not created by arch_setup_additional_pages() during execve().
CRIU should simply restore it from the restored process, if present.
vsyscall, which is created when the system boots and maps to a fixed
virtual address and page, shouldn't be included in this opt-out either.
So I'm proposing to opt-out vdso/vvar/sigpage with a new prctl:
disable_mseal_criu_system_mappings = true/false
What do you think?
> gVisor should be able to unmap all system mappings from a process
> address space.
>
Do you think this opt-out solution will work for gVisor too?
Thanks
-Jeff
> Thanks,
> Andrei
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-12-16 18:35 ` Jeff Xu
@ 2024-12-16 18:56 ` Liam R. Howlett
2024-12-16 20:20 ` Jeff Xu
0 siblings, 1 reply; 62+ messages in thread
From: Liam R. Howlett @ 2024-12-16 18:56 UTC (permalink / raw)
To: Jeff Xu
Cc: Andrei Vagin, akpm, keescook, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem, hch,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb, mhocko,
42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe,
Dmitry Safonov, Mike Rapoport, Alexander Mikhalitsyn,
Andrei Vagin
* Jeff Xu <jeffxu@chromium.org> [241216 13:35]:
...
> >
> > I like the idea and I think the opt-out solution should work for CRIU.
> > CRIU will be able to call this prctl and re-execute itself.
> >
> Great! Let's iterate on the opt-out solution then.
>
This patch set has been NACK'ed.
Please rework your solution and address all the concerns raised. It
will not be accepted in the current form.
...
Thanks,
Liam
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-12-16 18:56 ` Liam R. Howlett
@ 2024-12-16 20:20 ` Jeff Xu
0 siblings, 0 replies; 62+ messages in thread
From: Jeff Xu @ 2024-12-16 20:20 UTC (permalink / raw)
To: Liam R. Howlett, Jeff Xu, Andrei Vagin, akpm, keescook, jannh,
torvalds, adhemerval.zanella, oleg, linux-kernel,
linux-hardening, linux-mm, jorgelo, sroettger, ojeda, adobriyan,
anna-maria, mark.rutland, linus.walleij, Jason, deller, rdunlap,
davem, hch, peterx, hca, f.fainelli, gerg, dave.hansen, mingo,
ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
mpe, Dmitry Safonov, Mike Rapoport, Alexander Mikhalitsyn,
Andrei Vagin
Hi Liam
On Mon, Dec 16, 2024 at 10:56 AM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
> * Jeff Xu <jeffxu@chromium.org> [241216 13:35]:
>
> ...
>
> > >
> > > I like the idea and I think the opt-out solution should work for CRIU.
> > > CRIU will be able to call this prctl and re-execute itself.
> > >
> > Great! Let's iterate on the opt-out solution then.
> >
>
> This patch set has been NACK'ed.
>
> Please rework your solution and address all the concerns raised. It
> will not be accepted in the current form.
>
Thanks for reminding me. I'm still considering Lorenzo's feedback
about the kernel command line [1], if that is what you are referring to.
This thread was initiated by Andrei and is a separate topic about
CRIU, for which I'm gathering input toward a solution.
I would like to gather all feedback and consider it before the next
version of this series.
[1] https://lore.kernel.org/all/4e7088eb-b017-4d8b-8e0f-5cb409b112cb@lucifer.local/
Thanks
-Jeff
> ...
>
> Thanks,
> Liam
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-11-25 20:20 ` [PATCH v4 1/1] exec: seal " jeffxu
` (3 preceding siblings ...)
2024-12-10 4:12 ` Andrei Vagin
@ 2024-12-17 22:18 ` Kees Cook
2025-01-02 19:15 ` Andrei Vagin
` (2 more replies)
4 siblings, 3 replies; 62+ messages in thread
From: Kees Cook @ 2024-12-17 22:18 UTC (permalink / raw)
To: jeffxu
Cc: akpm, keescook, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Vlastimil Babka,
Lorenzo Stoakes, Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn
On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@chromium.org wrote:
> Seal vdso, vvar, sigpage, uprobes and vsyscall.
>
> Those mappings are readonly or executable only, sealing can protect
> them from ever changing or unmapped during the life time of the process.
> For complete descriptions of memory sealing, please see mseal.rst [1].
>
> System mappings such as vdso, vvar, and sigpage (for arm) are
> generated by the kernel during program initialization, and are
> sealed after creation.
> [...]
>
> + exec.seal_system_mappings = [KNL]
> + Format: { no | yes }
> + Seal system mappings: vdso, vvar, sigpage, vsyscall,
> + uprobe.
> + - 'no': do not seal system mappings.
> + - 'yes': seal system mappings.
> + This overrides CONFIG_SEAL_SYSTEM_MAPPINGS=(y/n)
> + If not specified or invalid, default is the value set by
> + CONFIG_SEAL_SYSTEM_MAPPINGS.
> + This option has no effect if CONFIG_64BIT=n
I know there is a v5 coming, but I wanted to give my thoughts to help
shape it based on the current discussion threads.
The callers of _install_special_mapping() cover what is mentioned here.
The vdso is very common (arm, arm64, csky, hexagon, loongarch, mips,
parisc, powerpc, riscv, s390, sh, sparc, x86, um). For those with vdso,
some also have vvar (arm, arm64, loongarch, mips, powerpc, riscv, s390,
sparc, x86). After that, I see a few extra things, in addition to
sigpage and uprobes as mentioned already in the patch:
    arm     sigpage
    arm64   compat vectors (what is this for arm?)
    arm64   compat sigreturn (what is this for arm?)
    nios2   kuser helpers
            uprobes
As mentioned in the patch, there is also the x86_64 vsyscall mapping which
eludes a regular grep since it's not using _install_special_mapping() :)
So I guess the question is: can we mseal all of these universally under
a common knob? Do the different uses mean we need finer granularity of
knob, and do different architectures need flexibility here too? The
patch handles the arch question with CONFIG_ARCH_HAS_SEAL_SYSTEM_MAPPINGS
(which I think will be renamed with s/SEAL/MSEAL/ if I am following the
threads). This seems a good solution to me. My question is
whether sigpage, vectors, and sigreturn can also be included. (It seems
like the answer is "yes", but I didn't see mention of the arm64 compat
mappings.)
Linus has expressed the desire that security features be available by
default if they don't break existing userspace and that they be compiled
in if possible (rather than be behind a CONFIG) so that code paths are
being exercised to gain the most exposure to finding bugs. To that end,
it's best to have a kernel command line to control it if it isn't safe
to have always enabled. This is how we've handled _many_ features: the
code is built into the kernel, but end users (e.g. distro users)
can enable/disable a feature without rebuilding the entire kernel.
For a "built into the kernel but default disabled unless enabled at boot
time" example see:
    config RANDOMIZE_KSTACK_OFFSET
        bool "Support for randomizing kernel stack offset on syscall entry" if EXPERT
        default y
        depends on HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
    ...
    config RANDOMIZE_KSTACK_OFFSET_DEFAULT
        bool "Default state of kernel stack offset randomization"
        depends on RANDOMIZE_KSTACK_OFFSET
    ...
    #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
    DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
                               randomize_kstack_offset);
    ...
    early_param("randomize_kstack_offset", early_randomize_kstack_offset);
For an example of the older "not built into the kernel but when built in
can be turned off at boot time" that predated Linus's recommendation see:
    config HARDENED_USERCOPY
        bool "Harden memory copies between kernel and userspace"
    ...
    static DEFINE_STATIC_KEY_FALSE_RO(bypass_usercopy_checks);
    ...
    __setup("hardened_usercopy=", parse_hardened_usercopy);
(This should arguably be "default y" in the kernel these days, but
whatever.)
So, if we want to have a CONFIG_MSEAL_SYSTEM_MAPPINGS at all, it should
be "default y" since we have the ...ARCH_HAS... config already, and then
add a CONFIG_MSEAL_SYSTEM_MAPPINGS_DEFAULT that is off by default (since
we expect there may be userspace impact) and tie _that_ to the kernel
command-line so that end users can use it, or system builders can enable
CONFIG_MSEAL_SYSTEM_MAPPINGS_DEFAULT.
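The precedence described here (a build-time default that a boot parameter
can override, with an unrecognised value falling back to the default) can
be modelled in userspace. This is a sketch only: parse_seal_param() and the
table below are hypothetical stand-ins for the kernel-side lookup, not
kernel code.

```c
#include <stddef.h>
#include <string.h>

enum seal_mode { SEAL_DISABLED = 0, SEAL_ENABLED = 1 };

struct constant {
	const char *name;
	int value;
};

/* Accepted spellings of the boot parameter value. */
static const struct constant seal_values[] = {
	{ "no",  SEAL_DISABLED },
	{ "yes", SEAL_ENABLED },
	{ NULL,  0 },
};

/*
 * Return the value matching buf, or dflt when buf is NULL or not in the
 * table. dflt plays the role of the build-time default.
 */
int parse_seal_param(const char *buf, int dflt)
{
	const struct constant *c;

	if (!buf)
		return dflt;

	for (c = seal_values; c->name; c++)
		if (strcmp(c->name, buf) == 0)
			return c->value;

	return dflt;
}
```

With a default of SEAL_DISABLED, parse_seal_param("yes", SEAL_DISABLED)
enables sealing at boot, while a typo such as "ye" leaves the build-time
choice untouched.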
For the command line name, if a namespace is desired, I'd agree that
naming this "mseal.special_mappings" is reasonable. It does change process
behavior, so I'm also not opposed to "process.mseal_special_mappings", and
it happens at exec, so "exec.mseal_special_mappings" is fine by me too.
I think the main question would be: will there be other things under the
proposed "mseal", "process", or "exec" namespace? I'd like to encourage
things being logically grouped since we have SO MANY already. :)
Also from discussions it sounds like there may need to be even finer-grained
control, likely via prctl, for dealing with the CRIU case. The proposal
is to provide an opt-out prctl with CAP_CHECKPOINT_RESTORE? I think this
is reasonable and lets this all work without a new CONFIG. I imagine it
would look like:
criu process (which has CAP_CHECKPOINT_RESTORE):
- prctl(GET_MSEAL_SYSTEM_MAPPINGS)
- if set:
- remember we need to mseal mappings
- prctl(SET_MSEAL_SYSTEM_MAPPINGS, 0)
- re-exec with --mseal-system-mappings (or something)
- perform the "fork a tree to restore" work
- in each child, move around all the mappings
- if we need to mseal mappings:
- prctl(SET_MSEAL_SYSTEM_MAPPINGS, 1)
- mseal each system mapping
- eventually drop CAP_CHECKPOINT_RESTORE
- become the restored process
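The sequence above can be walked through with a toy model in C. Nothing here
talks to a real kernel: the struct, the helper names, and the rule that only
the opt-out direction needs CAP_CHECKPOINT_RESTORE are assumptions made for
illustration, the prctl options are the proposed (not existing) ones, and the
model follows a single process rather than a forked tree.

```c
#include <stdbool.h>

/* Toy per-process state; a stand-in for what the kernel would track. */
struct task_model {
	bool seal_flag;        /* value behind the proposed GET/SET_MSEAL_SYSTEM_MAPPINGS */
	bool has_cap;          /* holds CAP_CHECKPOINT_RESTORE */
	bool mappings_sealed;  /* vdso/vvar/... are currently sealed */
};

/* Models prctl(SET_MSEAL_SYSTEM_MAPPINGS, val); opting out needs the capability. */
static int model_prctl_set(struct task_model *t, bool val)
{
	if (!val && !t->has_cap)
		return -1;
	t->seal_flag = val;
	return 0;
}

/* Models exec: system mappings are recreated and sealed only if the flag is set. */
static void model_exec(struct task_model *t)
{
	t->mappings_sealed = t->seal_flag;
}

/* Models moving or unmapping the system mappings; refused while they are sealed. */
static int model_move_mappings(const struct task_model *t)
{
	return t->mappings_sealed ? -1 : 0;
}

/* Walks the restore steps in order; returns 0 on success, -1 on the first failure. */
static int model_criu_restore(struct task_model *t)
{
	bool need_seal = t->seal_flag;          /* the GET step */

	if (need_seal) {
		if (model_prctl_set(t, false) != 0)
			return -1;
		model_exec(t);                  /* re-exec to get unsealed mappings */
	}
	if (model_move_mappings(t) != 0)        /* restore work in the child */
		return -1;
	if (need_seal) {
		if (model_prctl_set(t, true) != 0)
			return -1;
		t->mappings_sealed = true;      /* mseal each system mapping */
	}
	t->has_cap = false;                     /* drop CAP_CHECKPOINT_RESTORE */
	return 0;
}
```

In this model a restorer that starts with the flag set and the capability held
ends up sealed again with the capability dropped, while one without the
capability fails at the opt-out step.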
Does that all sound right? If so I think Jeff has all the details needed
to spin a v5.
-Kees
--
Kees Cook
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-12-17 22:18 ` Kees Cook
@ 2025-01-02 19:15 ` Andrei Vagin
2025-01-03 20:48 ` Liam R. Howlett
2025-01-03 21:38 ` Lorenzo Stoakes
2 siblings, 0 replies; 62+ messages in thread
From: Andrei Vagin @ 2025-01-02 19:15 UTC (permalink / raw)
To: Kees Cook
Cc: jeffxu, akpm, keescook, jannh, torvalds, adhemerval.zanella,
oleg, linux-kernel, linux-hardening, linux-mm, jorgelo,
sroettger, ojeda, adobriyan, anna-maria, mark.rutland,
linus.walleij, Jason, deller, rdunlap, davem, hch, peterx, hca,
f.fainelli, gerg, dave.hansen, mingo, ardb, Liam.Howlett, mhocko,
42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe,
Vlastimil Babka, Lorenzo Stoakes, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn
On Tue, Dec 17, 2024 at 2:18 PM Kees Cook <kees@kernel.org> wrote:
>
....
> Also from discussions it sounds like there may need to be even finer-grained
> control, likely via prctl, for dealing with the CRIU case. The proposal
> is to provide an opt-out prctl with CAP_CHECKPOINT_RESTORE? I think this
> is reasonable and lets this all work without a new CONFIG. I imagine it
> would look like:
>
> criu process (which has CAP_CHECKPOINT_RESTORE):
Hi Kees,
Sorry for the delay, I've been out of network for the last two weeks.
Overall, this approach looks good to me. However, I think the opt-out prctl
shouldn't depend on CAP_CHECKPOINT_RESTORE. There are other use cases
besides CRIU where we need to unmap or move system mappings.
For example, gVisor uses stub processes to represent guest address spaces.
gVisor unmaps all system mappings (like vdso, vvar, etc.) from stub processes.
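For context on why sealing breaks this kind of user: once a mapping is
sealed, munmap() on it is refused. The sketch below seals an ordinary
anonymous page (not a real system mapping) with the mseal() system call and
checks that a later munmap() fails with EPERM. It assumes Linux 6.10 or later
on a 64-bit kernel; the fallback syscall number and the helper name are
assumptions for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_mseal
#define __NR_mseal 462 /* value on x86-64 and arm64; an assumption elsewhere */
#endif

/*
 * Returns 1 if munmap() of a sealed mapping was refused with EPERM,
 * 0 if mseal() is not usable here (old kernel, 32-bit, syscall filter),
 * and -1 on any unexpected result.
 */
static int sealed_munmap_is_refused(void)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p;

	if (page <= 0)
		return -1;

	p = mmap(NULL, (size_t)page, PROT_READ,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* mseal(addr, len, flags); flags must currently be 0. */
	if (syscall(__NR_mseal, p, (size_t)page, 0UL) != 0) {
		munmap(p, (size_t)page);
		return 0;
	}

	/* The page is sealed now, so it stays mapped until the process exits. */
	if (munmap(p, (size_t)page) == -1 && errno == EPERM)
		return 1;
	return -1;
}
```

A stub process that tried the same munmap() on a sealed vdso or vvar would get
the same EPERM, which is the breakage described above.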
> - prctl(GET_MSEAL_SYSTEM_MAPPINGS)
> - if set:
> - remember we need to mseal mappings
> - prctl(SET_MSEAL_SYSTEM_MAPPINGS, 0)
> - re-exec with --mseal-system-mappings (or something)
> - perform the "fork a tree to restore" work
> - in each child, move around all the mappings
> - if we need to mseal mappings:
> - prctl(SET_MSEAL_SYSTEM_MAPPINGS, 1)
> - mseal each system mapping
> - eventually drop CAP_CHECKPOINT_RESTORE
> - become the restored process
>
Thanks,
Andrei
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-12-17 22:18 ` Kees Cook
2025-01-02 19:15 ` Andrei Vagin
@ 2025-01-03 20:48 ` Liam R. Howlett
2025-01-07 1:17 ` Kees Cook
2025-02-04 18:17 ` Johannes Berg
2025-01-03 21:38 ` Lorenzo Stoakes
2 siblings, 2 replies; 62+ messages in thread
From: Liam R. Howlett @ 2025-01-03 20:48 UTC (permalink / raw)
To: Kees Cook
Cc: jeffxu, akpm, keescook, jannh, torvalds, adhemerval.zanella,
oleg, linux-kernel, linux-hardening, linux-mm, jorgelo,
sroettger, ojeda, adobriyan, anna-maria, mark.rutland,
linus.walleij, Jason, deller, rdunlap, davem, hch, peterx, hca,
f.fainelli, gerg, dave.hansen, mingo, ardb, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Vlastimil Babka,
Lorenzo Stoakes, Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn
* Kees Cook <kees@kernel.org> [241217 17:19]:
> On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@chromium.org wrote:
> > Seal vdso, vvar, sigpage, uprobes and vsyscall.
> >
> > Those mappings are readonly or executable only, sealing can protect
> > them from ever changing or unmapped during the life time of the process.
> > For complete descriptions of memory sealing, please see mseal.rst [1].
> >
> > System mappings such as vdso, vvar, and sigpage (for arm) are
> > generated by the kernel during program initialization, and are
> > sealed after creation.
> > [...]
> >
> > + exec.seal_system_mappings = [KNL]
> > + Format: { no | yes }
> > + Seal system mappings: vdso, vvar, sigpage, vsyscall,
> > + uprobe.
> > + - 'no': do not seal system mappings.
> > + - 'yes': seal system mappings.
> > + This overrides CONFIG_SEAL_SYSTEM_MAPPINGS=(y/n)
> > + If not specified or invalid, default is the value set by
> > + CONFIG_SEAL_SYSTEM_MAPPINGS.
> > + This option has no effect if CONFIG_64BIT=n
>
> I know there is a v5 coming, but I wanted to give my thoughts to help
> shape it based on the current discussion threads.
>
> The callers of _install_special_mapping() cover what is mentioned here.
> The vdso is very common (arm, arm64, csky, hexagon, loongarch, mips,
> parisc, powerpc, riscv, s390, sh, sparc, x86, um). For those with vdso,
> some also have vvar (arm, arm64, loongarch, mips, powerpc, riscv, s390,
> sparc, x86). After that, I see a few extra things, in addition to
> sigpage and uprobes as mentioned already in the patch:
>
> arm sigpage
> arm64 compat vectors (what is this for arm?)
> arm64 compat sigreturn (what is this for arm?)
> nios2 kuser helpers
> uprobes
>
> As mentioned in the patch, there is also the x86_64 vsyscall mapping which
> eludes a regular grep since it's not using _install_special_mapping() :)
>
> So I guess the question is: can we mseal all of these universally under
> a common knob? Do the different uses mean we need finer granularity of
> knob, and do different architectures need flexibility here too? The
> patch handles the arch question with CONFIG_ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> (which I think will be renamed with s/SEAL/MSEAL/ if I am following the
> threads). This seems a good solution to me. My question is
> about if sigpage, vectors, and sigreturn can also be included? (It seems
> like the answer is "yes", but I didn't see mention of the arm64 compat
> mappings.)
>
> Linus has expressed the desire that security features be available by
> default if they don't break existing userspace and that they be compiled
> in if possible (rather than be behind a CONFIG) so that code paths are
> being exercised to gain the most exposure to finding bugs. To that end,
> it's best to have a kernel command line to control it if it isn't safe
> to have always enabled. This is how we've handled _many_ features so that
> the code is built into the kernel, but that end users (e.g. distro users)
> can enable/disable a feature without rebuilding the entire kernel.
>
> For a "built into the kernel but default disabled unless enabled at boot
> time" example see:
>
> config RANDOMIZE_KSTACK_OFFSET
> bool "Support for randomizing kernel stack offset on syscall entry" if EXPERT
> default y
> depends on HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
> ...
> config RANDOMIZE_KSTACK_OFFSET_DEFAULT
> bool "Default state of kernel stack offset randomization"
> depends on RANDOMIZE_KSTACK_OFFSET
> ...
> #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
> randomize_kstack_offset);
> ...
> early_param("randomize_kstack_offset", early_randomize_kstack_offset);
>
>
> For an example of the older "not built into the kernel but when built in
> can be turned off at boot time" that predated Linus's recommendation see:
>
> config HARDENED_USERCOPY
> bool "Harden memory copies between kernel and userspace"
> ...
> static DEFINE_STATIC_KEY_FALSE_RO(bypass_usercopy_checks);
> ...
> __setup("hardened_usercopy=", parse_hardened_usercopy);
>
> (This should arguably be "default y" in the kernel these days, but
> whatever.)
>
> So, if we want to have a CONFIG_MSEAL_SYSTEM_MAPPINGS at all, it should
> be "default y" since we have the ...ARCH_HAS... config already, and then
> add a CONFIG_MSEAL_SYSTEM_MAPPINGS_DEFAULT that is off by default (since
> we expect there may be userspace impact) and tie _that_ to the kernel
> command-line so that end users can use it, or system builders can enable
> CONFIG_MSEAL_SYSTEM_MAPPINGS_DEFAULT.

So we have at least two userspace uses that this will break: checkpoint
restore and now gVisor, but who knows what else? How many config
options before we decide this can't be just on by default?

We're also veering off from what mimmutable does in BSD, for no good reason.
At what point do we decide that it's not worth pushing this?

I agree that security (and everything else) should be turned on (or
default to on) for everyone, when it doesn't break things for users. I
think this isn't one of those at this point - it's breaking things by
changing the behaviour.

And we really don't understand what it will impact fully - considering
v4 is still resulting in new things that will be broken.

Thanks,
Liam
* Re: [PATCH v4 1/1] exec: seal system mappings
2024-12-17 22:18 ` Kees Cook
2025-01-02 19:15 ` Andrei Vagin
2025-01-03 20:48 ` Liam R. Howlett
@ 2025-01-03 21:38 ` Lorenzo Stoakes
2025-01-07 1:12 ` Kees Cook
2 siblings, 1 reply; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-01-03 21:38 UTC (permalink / raw)
To: Kees Cook
Cc: jeffxu, akpm, keescook, jannh, torvalds, adhemerval.zanella,
oleg, linux-kernel, linux-hardening, linux-mm, jorgelo,
sroettger, ojeda, adobriyan, anna-maria, mark.rutland,
linus.walleij, Jason, deller, rdunlap, davem, hch, peterx, hca,
f.fainelli, gerg, dave.hansen, mingo, ardb, Liam.Howlett, mhocko,
42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe,
Vlastimil Babka, Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn
On Tue, Dec 17, 2024 at 02:18:53PM -0800, Kees Cook wrote:
> On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@chromium.org wrote:
> > Seal vdso, vvar, sigpage, uprobes and vsyscall.
> >
> > Those mappings are readonly or executable only, sealing can protect
> > them from ever changing or unmapped during the life time of the process.
> > For complete descriptions of memory sealing, please see mseal.rst [1].
> >
> > System mappings such as vdso, vvar, and sigpage (for arm) are
> > generated by the kernel during program initialization, and are
> > sealed after creation.
> > [...]
> >
> > + exec.seal_system_mappings = [KNL]
> > + Format: { no | yes }
> > + Seal system mappings: vdso, vvar, sigpage, vsyscall,
> > + uprobe.
> > + - 'no': do not seal system mappings.
> > + - 'yes': seal system mappings.
> > + This overrides CONFIG_SEAL_SYSTEM_MAPPINGS=(y/n)
> > + If not specified or invalid, default is the value set by
> > + CONFIG_SEAL_SYSTEM_MAPPINGS.
> > + This option has no effect if CONFIG_64BIT=n
>
> I know there is a v5 coming, but I wanted to give my thoughts to help
> shape it based on the current discussion threads.
>
> The callers of _install_special_mapping() cover what is mentioned here.
> The vdso is very common (arm, arm64, csky, hexagon, loongarch, mips,
> parisc, powerpc, riscv, s390, sh, sparc, x86, um). For those with vdso,
> some also have vvar (arm, arm64, loongarch, mips, powerpc, riscv, s390,
> sparc, x86). After that, I see a few extra things, in addition to
> sigpage and uprobes as mentioned already in the patch:
>
> arm sigpage
> arm64 compat vectors (what is this for arm?)
> arm64 compat sigreturn (what is this for arm?)
> nios2 kuser helpers
> uprobes

OK let's not get ahead of ourselves :)

VDSOs/gate VMAs are treated quite differently by different arches. So we
have to tread _very_ carefully here.

I believe PPC does some 'tricky' things and may actually want to unmap, for
instance.

The problem with this kind of change is we're doing something fundamental
that impacts _every possible combinatorial combination of configs, arches,
and use cases_ for each of these, which we seemingly just assume will have
no issue with this.

This is insufficient, deeply. We need:

1. Strong justification (hand waving won't suffice).
2. Very extensive testing and checking, and _proof_ of this testing being
   performed.
3. Buy-in from arch maintainers.

So far this series has provided none of those. This is why I am cautious
and pushing back here.

And I absolutely will not accept a user being able to turn on a switch in a
known-broken configuration. This is absolutely unacceptable.

It's equally unacceptable for a user to enable a feature that is
untested/unconfirmed on an architecture.

So let's be careful about Linus's edict here - the operative part being 'if
it doesn't break things'.

>
> As mentioned in the patch, there is also the x86_64 vsyscall mapping which
> eludes a regular grep since it's not using _install_special_mapping() :)

I mean I think we need to do better than grepping :)

This requires really careful testing and checking. So far only two
architectures have been checked, and no real evidence of the checking has
been provided.

I am ok, generally, with the concept of providing this as an opt-in
experimental feature.

But if we're going to look at default-on (which I'd actually back, if we
have confidence in this), we need a LOT more confidence that this is OK.

>
> So I guess the question is: can we mseal all of these universally under
> a common knob? Do the different uses mean we need finer granularity of
> knob, and do different architectures need flexibility here too? The
> patch handles the arch question with CONFIG_ARCH_HAS_SEAL_SYSTEM_MAPPINGS
> (which I think will be renamed with s/SEAL/MSEAL/ if I am following the
> threads). This seems a good solution to me. My question is
But then immediately undoes this by enabling the feature to be turned on
regardless. This is emphatically _not_ good.
> about if sigpage, vectors, and sigreturn can also be included? (It seems
> like the answer is "yes", but I didn't see mention of the arm64 compat
> mappings.)
I oppose adding a bunch more things. Let's take this slowly, please.
And evidence of this being checked/tested really must be required.
>
> Linus has expressed the desire that security features be available by
> default if they don't break existing userspace and that they be compiled
> in if possible (rather than be behind a CONFIG) so that code paths are
> being exercised to gain the most exposure to finding bugs. To that end,
> it's best to have a kernel command line to control it if it isn't safe
> to have always enabled. This is how we've handled _many_ features so that
> the code is built into the kernel, but that end users (e.g. distro users)
> can enable/disable a feature without rebuilding the entire kernel.
See above - the operative part is 'if they don't break existing userspace',
and I would be so bold as to suggest he also means the kernel too :P
This isn't a green light to just enable a feature with little to no
testing/checking other than a verbal assurance because it involves
security.
>
> For a "built into the kernel but default disabled unless enabled at boot
> time" example see:
>
> config RANDOMIZE_KSTACK_OFFSET
> bool "Support for randomizing kernel stack offset on syscall entry" if EXPERT
> default y
> depends on HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
> ...
> config RANDOMIZE_KSTACK_OFFSET_DEFAULT
> bool "Default state of kernel stack offset randomization"
> depends on RANDOMIZE_KSTACK_OFFSET
> ...
> #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
> DEFINE_STATIC_KEY_MAYBE_RO(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
> randomize_kstack_offset);
> ...
> early_param("randomize_kstack_offset", early_randomize_kstack_offset);
>
>
> For an example of the older "not built into the kernel but when built in
> can be turned off at boot time" that predated Linus's recommendation see:
>
> config HARDENED_USERCOPY
> bool "Harden memory copies between kernel and userspace"
> ...
> static DEFINE_STATIC_KEY_FALSE_RO(bypass_usercopy_checks);
> ...
> __setup("hardened_usercopy=", parse_hardened_usercopy);
>
> (This should arguably be "default y" in the kernel these days, but
> whatever.)
>
> So, if we want to have a CONFIG_MSEAL_SYSTEM_MAPPINGS at all, it should
> be "default y" since we have the ...ARCH_HAS... config already, and then
> add a CONFIG_MSEAL_SYSTEM_MAPPINGS_DEFAULT that is off by default (since
> we expect there may be userspace impact) and tie _that_ to the kernel
> command-line so that end users can use it, or system builders can enable
> CONFIG_MSEAL_SYSTEM_MAPPINGS_DEFAULT.
Again, I hate to push on this, but I am simply not going to allow users to
enable features we know break things.
Users might not be aware this feature is broken for CRIU, and X, and Y and
whatever else we've not thought about and enable it thinking it helps
security, and end up with a broken system.
>
> For the command line name, if a namespace is desired, I'd agree that
> naming this "mseal.special_mappings" is reasonable. It does change process
> behavior, so I'm also not opposed to "process.mseal_special_mappings", and
> it happens at exec, so "exec.mseal_special_mappings" is fine by me too.
> I think the main question would be: will there be other things under the
> proposed "mseal", "process", or "exec" namespace? I'd like to encourage
> things being logically grouped since we have SO MANY already. :)
>
> Also from discussions it sounds like there may need to be even finer-grained
> control, likely via prctl, for dealing with the CRIU case. The proposal
> is to provide an opt-out prctl with CAP_CHECKPOINT_RESTORE? I think this
> is reasonable and lets this all work without a new CONFIG. I imagine it
> would look like:
>
> criu process (which has CAP_CHECKPOINT_RESTORE):
> - prctl(GET_MSEAL_SYSTEM_MAPPINGS)
> - if set:
> - remember we need to mseal mappings
> - prctl(SET_MSEAL_SYSTEM_MAPPINGS, 0)
> - re-exec with --mseal-system-mappings (or something)
> - perform the "fork a tree to restore" work
> - in each child, move around all the mappings
> - if we need to mseal mappings:
> - prctl(SET_MSEAL_SYSTEM_MAPPINGS, 1)
> - mseal each system mapping
> - eventually drop CAP_CHECKPOINT_RESTORE
> - become the restored process
>

This seems like putting the onus on CRIU users to deal with a known-broken
thing? That seems really unreasonable? And people would just have to have
the right userland code to work in the kernel with mseal?

Yeah, I entirely oppose this unless I'm missing something?

> Does that all sound right? If so I think Jeff has all the details needed
> to spin a v5.

I think a v5 would have to fulfill the 3 points I raised earlier on, and
(to risk repeating myself) I absolutely will not accept a version of this
that allows a user to break their kernel or userspace, even if they ask for
it, sorry.

Also let's not add anything extra, let's focus on the minimum to start
with. I'd prefer a v5 to be opt-in, or to provide truly copious evidence in
favour of default-yes.

The issue with this is we're changing something totally fundamental and
then assuming every combination of arch, config options, userland code,
etc. will work.

It might even be worthwhile just trying to default-y this in the minimal
case, perhaps x86-64, non-CRIU - with evidence provided to support that
this is fine (i.e. showing that all code interfacing with the VDSO at no
point requires operations that mseal would block, running a bunch of
configs on test boxes, etc. etc.)?

Let's tread carefully here.

>
> -Kees
>
> --
> Kees Cook
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-03 21:38 ` Lorenzo Stoakes
@ 2025-01-07 1:12 ` Kees Cook
2025-01-13 21:26 ` Jeff Xu
0 siblings, 1 reply; 62+ messages in thread
From: Kees Cook @ 2025-01-07 1:12 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: jeffxu, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn
On Fri, Jan 03, 2025 at 09:38:10PM +0000, Lorenzo Stoakes wrote:
> On Tue, Dec 17, 2024 at 02:18:53PM -0800, Kees Cook wrote:
> > On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@chromium.org wrote:
> > > Seal vdso, vvar, sigpage, uprobes and vsyscall.
> > >
> > > Those mappings are readonly or executable only, sealing can protect
> > > them from ever changing or unmapped during the life time of the process.
> > > For complete descriptions of memory sealing, please see mseal.rst [1].
> > >
> > > System mappings such as vdso, vvar, and sigpage (for arm) are
> > > generated by the kernel during program initialization, and are
> > > sealed after creation.
> > > [...]
> > >
> > > + exec.seal_system_mappings = [KNL]
> > > + Format: { no | yes }
> > > + Seal system mappings: vdso, vvar, sigpage, vsyscall,
> > > + uprobe.
> > > + - 'no': do not seal system mappings.
> > > + - 'yes': seal system mappings.
> > > + This overrides CONFIG_SEAL_SYSTEM_MAPPINGS=(y/n)
> > > + If not specified or invalid, default is the value set by
> > > + CONFIG_SEAL_SYSTEM_MAPPINGS.
> > > + This option has no effect if CONFIG_64BIT=n
> >
> > I know there is a v5 coming, but I wanted to give my thoughts to help
> > shape it based on the current discussion threads.
> >
> > The callers of _install_special_mapping() cover what is mentioned here.
> > The vdso is very common (arm, arm64, csky, hexagon, loongarch, mips,
> > parisc, powerpc, riscv, s390, sh, sparc, x86, um). For those with vdso,
> > some also have vvar (arm, arm64, loongarch, mips, powerpc, riscv, s390,
> > sparc, x86). After that, I see a few extra things, in addition to
> > sigpage and uprobes as mentioned already in the patch:
> >
> > arm sigpage
> > arm64 compat vectors (what is this for arm?)
> > arm64 compat sigreturn (what is this for arm?)
> > nios2 kuser helpers
> > uprobes
>
> OK let's not get ahead of ourselves :)
>
> VDSOs/gate VMAs are treated quite differently by different arches. So we
> have to tread _very_ carefully here.
>
> I believe PPC does some 'tricky' things and may actually want to unmap, for
> instance.
>
> The problem with this kind of change is we're doing something fundamental
> that impacts _every possible combinatorial combination of configs, arches,
> and use cases_ for each of these, which we seemingly just assume will have
> no issue with this.
>
> This is insufficient, deeply. We need:
>
> 1. Strong justification (hand waving won't suffice).
> 2. Very extensive testing and checking, and _proof_ of this testing being
> performed.
> 3. Buy-in from arch maintainers.
>
> So far this series has provided none of those. This is why I am cautious
> and pushing back here.

Sure, I agree. This is why I suggested the ...ARCH_HAS... Kconfig.
That will provide the way for 3) to happen. 1) just needs a little more
detail in the commit log, I guess? The goal is attack surface reduction
in userspace, and remapping shenanigans have become a recent avenue of
attack.

For 2) there are limits. As you say we may have "every possible
combinatorial combination", which may not be feasible to test. But
making it available for the common cases (and of course testing those)
makes sense.

> And I absolutely will not accept a user being able to turn on a switch in a
> known-broken configuration. This is absolutely unacceptable.
Sure, of course.
> It's equally unacceptable for a user to enable a feature that is
> untested/unconfirmed on an architecture.
Agreed.
> So let's be careful about Linus's edict here - the operative part being 'if
> it doesn't break things'.
Right -- I should clarify: I don't mean to say "it should be enabled by
default", I meant to say that we have a common pattern for making these
kinds of features available without hiding them behind a build-time
Kconfig that would have put the features out of reach for system owners
that only use distro kernels, etc. I was pushing back on an earlier
comment that I interpreted as rejecting boot params. A boot param (when
other aspects of the system are sane) is needed for this kind of thing,
and is the pattern we use for providing optional features that distros
can make available without enabling them by default.
> > So, if we want to have a CONFIG_MSEAL_SYSTEM_MAPPINGS at all, it should
> > be "default y" since we have the ...ARCH_HAS... config already, and then
> > add a CONFIG_MSEAL_SYSTEM_MAPPINGS_DEFAULT that is off by default (since
> > we expect there may be userspace impact) and tie _that_ to the kernel
> > command-line so that end users can use it, or system builders can enable
> > CONFIG_MSEAL_SYSTEM_MAPPINGS_DEFAULT.
>
> Again, I hate to push on this, but I am simply not going to allow users to
> enable features we know break things.
>
> Users might not be aware this feature is broken for CRIU, and X, and Y and
> whatever else we've not thought about and enable it thinking it helps
> security, and end up with a broken system.
This will never be a bright line, and I think choice is more important.
For example, Ubuntu builds with CRIU, but only a tiny set of tools
actually use it. (I've actually been considering adding a boot param to
disable CRIU features since they undermine some aspects of userspace
security.)
Regardless, yes, if we can make this work with CRIU (which I thought
there seemed to be consensus on), let's do it.
> This seems like putting the onus on CRIU users to deal with a known-broken
> thing? That seems really unreasonable? And people would just have to have
> the right userland code to work in the kernel with mseal?
>
> Yeah, I entirely oppose this unless I'm missing something?
Hm, well, the primary goal is for Chrome OS and Android to use this. If
there is honestly no path forward with CRIU, then hard Kconfig conflict
it is. I'd much rather have it available for anyone who wants it, just
like we do with lots of other features. Why force people who want this
and not CRIU to build their own kernels? We have all kinds of boot params
that, if you set them, give you a broken system.
-Kees
--
Kees Cook
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-03 20:48 ` Liam R. Howlett
@ 2025-01-07 1:17 ` Kees Cook
2025-02-04 18:17 ` Johannes Berg
1 sibling, 0 replies; 62+ messages in thread
From: Kees Cook @ 2025-01-07 1:17 UTC (permalink / raw)
To: Liam R. Howlett, jeffxu, akpm, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem, hch,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb, mhocko,
42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe,
Vlastimil Babka, Lorenzo Stoakes, Andrei Vagin, Dmitry Safonov,
Mike Rapoport, Alexander Mikhalitsyn
On Fri, Jan 03, 2025 at 03:48:23PM -0500, Liam R. Howlett wrote:
> So we have at least two userspace uses that this will break: checkpoint
> restore and now gVisor, but who knows what else? How many config
> options before we decide this can't be just on by default?
See my reply to Lorenzo, but I'm not arguing for it to be enabled by
default. I was trying to show how we traditionally handle these kinds
of features: putting their enablement behind a Kconfig and boot param
that work together. That way distro kernels have it _available_ without
making it _enabled_, and specialty kernels can have it enabled by default
(and can disable it at boot if needed too).
--
Kees Cook
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-07 1:12 ` Kees Cook
@ 2025-01-13 21:26 ` Jeff Xu
2025-01-14 4:19 ` Matthew Wilcox
2025-01-15 19:02 ` Jeff Xu
0 siblings, 2 replies; 62+ messages in thread
From: Jeff Xu @ 2025-01-13 21:26 UTC (permalink / raw)
To: Kees Cook
Cc: Lorenzo Stoakes, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn
On Mon, Jan 6, 2025 at 5:12 PM Kees Cook <kees@kernel.org> wrote:
>
> On Fri, Jan 03, 2025 at 09:38:10PM +0000, Lorenzo Stoakes wrote:
> > On Tue, Dec 17, 2024 at 02:18:53PM -0800, Kees Cook wrote:
> > > On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@chromium.org wrote:
> > > > Seal vdso, vvar, sigpage, uprobes and vsyscall.
> > > >
> > > > Those mappings are readonly or executable only, sealing can protect
> > > > them from ever changing or unmapped during the life time of the process.
> > > > For complete descriptions of memory sealing, please see mseal.rst [1].
> > > >
> > > > System mappings such as vdso, vvar, and sigpage (for arm) are
> > > > generated by the kernel during program initialization, and are
> > > > sealed after creation.
> > > > [...]
> > > >
> > > > + exec.seal_system_mappings = [KNL]
> > > > + Format: { no | yes }
> > > > + Seal system mappings: vdso, vvar, sigpage, vsyscall,
> > > > + uprobe.
> > > > + - 'no': do not seal system mappings.
> > > > + - 'yes': seal system mappings.
> > > > + This overrides CONFIG_SEAL_SYSTEM_MAPPINGS=(y/n)
> > > > + If not specified or invalid, default is the value set by
> > > > + CONFIG_SEAL_SYSTEM_MAPPINGS.
> > > > + This option has no effect if CONFIG_64BIT=n
> > >
> > > I know there is a v5 coming, but I wanted to give my thoughts to help
> > > shape it based on the current discussion threads.
> > >
> > > The callers of _install_special_mapping() cover what is mentioned here.
> > > The vdso is very common (arm, arm64, csky, hexagon, loongarch, mips,
> > > parisc, powerpc, riscv, s390, sh, sparc, x86, um). For those with vdso,
> > > some also have vvar (arm, arm64, loongarch, mips, powerpc, riscv, s390,
> > > sparc, x86). After that, I see a few extra things, in addition to
> > > sigpage and uprobes as mentioned already in the patch:
> > >
> > > arm sigpage
> > > arm64 compat vectors (what is this for arm?)
> > > arm64 compat sigreturn (what is this for arm?)
> > > nios2 kuser helpers
> > > uprobes
> >
> > OK let's not get ahead of ourselves :)
> >
> > VDSOs/gate VMAs are treated quite differently by different arches. So we
> > have to tread _very_ carefully here.
> >
> > I believe PPC does some 'tricky' things and may actually want to unmap, for
> > instance.
> >
> > The problem with this kind of change is we're doing something fundamental
> > that impacts _every possible combinatorial combination of configs, arches,
> > and use cases_ for each of these, which we seemingly just assume will have
> > no issue with this.
> >
> > This is insufficient, deeply. We need:
> >
> > 1. Strong justification (hand waving won't suffice).
> > 2. Very extensive testing and checking, and _proof_ of this testing being
> > performed.
> > 3. Buy-in from arch maintainers.
> >
> > So far this series has provided none of those. This is why I am cautious
> > and pushing back here.
>
> Sure, I agree. This is why I suggested the ...ARCH_HAS... Kconfig.
> That will provide the way for 3) to happen. 1) just needs a little more
> detail in the commit log, I guess? The goal is attack surface reduction
> in userspace, and remapping shenanigans have become a recent avenue of
> attack.
>
> For 2) there are limits. As you say we may have "every possible
> combinatorial combination", which may not be feasible to test. But
> making it available for the common cases (and of course testing those)
> makes sense.
>
> > And I absolutely will not accept a user being able to turn on a switch in a
> > known-broken configuration. This is absolutely unacceptable.
>
> Sure, of course.
>
> > It's equally unacceptable for a user to enable a feature that is
> > untested/confirmed on an architecture.
>
> Agreed.
>
> > So let's be careful about Linus's edict here - the operative part being 'if
> > it doesn't break things'.
>
> Right -- I should clarify: I don't mean to say "it should be enabled by
> default", I meant to say that we have a common pattern for making these
> kinds of features available without hiding them behind a build-time
> Kconfig that would have put the features out of reach for system owners
> that only use distro kernels, etc. I was pushing back on an earlier
> comment that I interpreted as rejecting boot params. A boot param (when
> other aspects of the system are sane) is needed for this kind of thing,
> and is the pattern we use for providing optional features that distros
> can make available without enabling them by default.
>
> > > So, if we want to have a CONFIG_MSEAL_SYSTEM_MAPPINGS at all, it should
> > > be "default y" since we have the ...ARCH_HAS... config already, and then
> > > add a CONFIG_MSEAL_SYSTEM_MAPPINGS_DEFAULT that is off by default (since
> > > we expect there may be userspace impact) and tie _that_ to the kernel
> > > command-line so that end users can use it, or system builders can enable
> > > CONFIG_MSEAL_SYSTEM_MAPPINGS_DEFAULT.
> >
> > Again, I hate to push on this, but I am simply not going to allow users to
> > enable features we know break things.
> >
> > Users might not be aware this feature is broken for CRIU, and X, and Y and
> > whatever else we've not thought about and enable it thinking it helps
> > security, and end up with a broken system.
>
> This will never be a bright line, and I think choice is more important.
> For example, Ubuntu builds with CRIU, but only a tiny set of tools
> actually use it. (I've actually been considering adding a boot param to
> disable CRIU features since they undermine some aspects of userspace
> security.)
>
> Regardless, yes, if we can make this work with CRIU (which I thought
> there seem to be consensus on), let's do it.
>
> > This seems like putting the onus on CRIU users to deal with a known-broken
> > thing? That seems really unreasonable? And people would just have to have
> > the right userland code to work in the kernel with mseal?
> >
> > Yeah I oppose entirely this unless I'm missing something?
>
> Hm, well, the primary goal is for Chrome OS and Android to use this. If
> there is honestly no path forward with CRIU, then hard Kconfig conflict
> it is. I'd much rather have it available for anyone who wants it, just
> like we do with lots of other features. Why force people who want this
> and not CRIU to build their own kernels? We have all kinds of boot params
> that if you set you get a broken system.
>
This patch is intended for ChromeOS and Android and is
feature-complete from their perspective.
To simplify v5, I propose removing kernel-cmd-line and avoiding the
complexities of CRIU/UML and gVisor. The KCONFIG is disabled by
default and will only apply to ARM and Intel architectures.
Later when a generic distribution wants to enable this feature, we can
work out a solution to handle those complexities.
Is this a reasonable path to move forward ?
Thanks
-Jeff
> -Kees
>
> --
> Kees Cook
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-13 21:26 ` Jeff Xu
@ 2025-01-14 4:19 ` Matthew Wilcox
2025-01-15 19:02 ` Jeff Xu
1 sibling, 0 replies; 62+ messages in thread
From: Matthew Wilcox @ 2025-01-14 4:19 UTC (permalink / raw)
To: Jeff Xu
Cc: Kees Cook, Lorenzo Stoakes, akpm, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem, hch,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb,
Liam.Howlett, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
groeck, mpe, Vlastimil Babka, Andrei Vagin, Dmitry Safonov,
Mike Rapoport, Alexander Mikhalitsyn
On Mon, Jan 13, 2025 at 01:26:59PM -0800, Jeff Xu wrote:
> This patch is intended for ChromeOS and Android and is
> feature-complete from their perspective.
"I have everything I need from the Google point of view, so I will push
this feature into upstream".
No, thanks.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-13 21:26 ` Jeff Xu
2025-01-14 4:19 ` Matthew Wilcox
@ 2025-01-15 19:02 ` Jeff Xu
2025-01-15 19:46 ` Lorenzo Stoakes
1 sibling, 1 reply; 62+ messages in thread
From: Jeff Xu @ 2025-01-15 19:02 UTC (permalink / raw)
To: Kees Cook
Cc: Lorenzo Stoakes, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn, Benjamin Berg
On Mon, Jan 13, 2025 at 1:26 PM Jeff Xu <jeffxu@chromium.org> wrote:
>
> On Mon, Jan 6, 2025 at 5:12 PM Kees Cook <kees@kernel.org> wrote:
> >
> > On Fri, Jan 03, 2025 at 09:38:10PM +0000, Lorenzo Stoakes wrote:
> > > On Tue, Dec 17, 2024 at 02:18:53PM -0800, Kees Cook wrote:
> > > > On Mon, Nov 25, 2024 at 08:20:21PM +0000, jeffxu@chromium.org wrote:
> > > > > Seal vdso, vvar, sigpage, uprobes and vsyscall.
> > > > >
> > > > > Those mappings are readonly or executable only, sealing can protect
> > > > > them from ever changing or unmapped during the life time of the process.
> > > > > For complete descriptions of memory sealing, please see mseal.rst [1].
> > > > >
> > > > > System mappings such as vdso, vvar, and sigpage (for arm) are
> > > > > generated by the kernel during program initialization, and are
> > > > > sealed after creation.
> > > > > [...]
> > > > >
> > > > > + exec.seal_system_mappings = [KNL]
> > > > > + Format: { no | yes }
> > > > > + Seal system mappings: vdso, vvar, sigpage, vsyscall,
> > > > > + uprobe.
> > > > > + - 'no': do not seal system mappings.
> > > > > + - 'yes': seal system mappings.
> > > > > + This overrides CONFIG_SEAL_SYSTEM_MAPPINGS=(y/n)
> > > > > + If not specified or invalid, default is the value set by
> > > > > + CONFIG_SEAL_SYSTEM_MAPPINGS.
> > > > > + This option has no effect if CONFIG_64BIT=n
> > > >
> > > > I know there is a v5 coming, but I wanted to give my thoughts to help
> > > > shape it based on the current discussion threads.
> > > >
> > > > The callers of _install_special_mapping() cover what is mentioned here.
> > > > The vdso is very common (arm, arm64, csky, hexagon, loongarch, mips,
> > > > parisc, powerpc, riscv, s390, sh, sparc, x86, um). For those with vdso,
> > > > some also have vvar (arm, arm64, loongarch, mips, powerpc, riscv, s390,
> > > > sparc, x86). After that, I see a few extra things, in addition to
> > > > sigpage and uprobes as mentioned already in the patch:
> > > >
> > > > arm sigpage
> > > > arm64 compat vectors (what is this for arm?)
> > > > arm64 compat sigreturn (what is this for arm?)
> > > > nios2 kuser helpers
> > > > uprobes
> > >
> > > OK let's not get ahead of ourselves :)
> > >
> > > VDSOs/gate VMAs are treated quite differently by different arches. So we
> > > have to tread _very_ carefully here.
> > >
> > > I believe PPC doe some 'tricky' things and may actually want to unmap, for
> > > instance.
> > >
> > > The problem with this kind of change is we're doing something fundamental
> > > that impacts _every possible combinatorial combination of configs, arches,
> > > and use cases_ for each of these which we seeming - just assume - will have
> > > no issue with this.
> > >
> > > This is insufficient, deeply. We need:
> > >
> > > 1. Strong justification (hand waving won't suffice).
> > > 2. Very extensive testing and checking, and _proof_ of this testing being
> > > performed.
> > > 3. Buy-in from arch maintainers.
> > >
> > > So far this series has provided none of those. This is why I am cautious
> > > and pushing back here.
> >
> > Sure, I agree. This is why I was suggested the ...ARCH_HAS... Kconfig.
> > That will provide the way for 3) to happen. 1) just needs a little more
> > details in the commit log, I guess? The goal is attack surface reduction
> > in userspace, and remapping shenanigans have become a recent avenue of
> > attack.
> >
> > For 2) there are limits. As you say we may have "every possible
> > combinatorial combination", which may not be feasible to test. But
> > making it available for the common cases (and of course testing those)
> > makes sense.
> >
> > > And I absolutely will not accept a user being able to turn on a switch in a
> > > known-broken configuration. This is absolutely unacceptable.
> >
> > Sure, of course.
> >
> > > It's equally unacceptable for a user to enable a feature that is
> > > untested/confirmed on an architecture.
> >
> > Agreed.
> >
> > > So let's be careful about Linus's edict here - the operative part being 'if
> > > it doesn't break things'.
> >
> > Right -- I should clarify: I don't mean to say "it should be enabled by
> > default", I meant to say that we have a common pattern for making these
> > kinds of features available without hiding them behind a build-time
> > Kconfig that would have put the features out of reach for system owners
> > that only use distro kernels, etc. I was pushing back on an earlier
> > comment that I interpreted as rejecting boot params. A boot param (when
> > other aspects of the system are sane) is needed for this kind of thing,
> > and is the pattern we use for providing optional features that distros
> > can make available without enabling them by default.
> >
> > > > So, if we want to have a CONFIG_MSEAL_SYSTEM_MAPPINGS at all, it should
> > > > be "default y" since we have the ...ARCH_HAS... config already, and then
> > > > add a CONFIG_MSEAL_SYSTEM_MAPPINGS_DEFAULT that is off by default (since
> > > > we expect there may be userspace impact) and tie _that_ to the kernel
> > > > command-line so that end users can use it, or system builders can enable
> > > > CONFIG_MSEAL_SYSTEM_MAPPINGS_DEFAULT.
> > >
> > > Again, I hate to push on this, but I am simply not going to allow users to
> > > enable features we know break things.
> > >
> > > Users might not be aware this feature is broken for CRIU, and X, and Y and
> > > whatever else we've not thought about and enable it thinking it helps
> > > security, and end up with a broken system.
> >
> > This will never be a bright line, and I think choice is more important.
> > For example, Ubuntu builds with CRIU, but only a tiny set of tools
> > actually use it. (I've actually been considering adding a boot param to
> > disable CRIU features since they undermine some aspects of userspace
> > security.)
> >
> > Regardless, yes, if we can make this work with CRIU (which I thought
> > there seem to be consensus on), let's do it.
> >
> > > This seems like putting the onus on CRIU users to deal with a known-broken
> > > thing? That seems really unreasonable? And people would just have to have
> > > the right userland code to work in the kernel with mseal?
> > >
> > > Yeah I oppose entirely this unless I'm missing something?
> >
> > Hm, well, the primary goal is for Chrome OS and Android to use this. If
> > there is honestly no path forward with CRIU, then hard Kconfig conflict
> > it is. I'd much rather have it available for anyone who wants it, just
> > like we do with lots of other features. Why force people who want this
> > and not CRIU to build their own kernels? We have all kinds of boot params
> > that if you set you get a broken system.
> >
> This patch is intended for ChromeOS and Android and is
> feature-complete from their perspective.
>
> To simplify v5, I propose removing kernel-cmd-line and avoiding the
> complexities of CRIU/UML and gVisor. The KCONFIG is disabled by
> default and will only apply to ARM and Intel architectures.
>
> Later when a generic distribution wants to enable this feature, we can
> work out a solution to handle those complexities.
>
> Is this a reasonable path to move forward ?
>
If a complete solution is desired, we can continue to discuss the
open/unresolved items.

This summarizes code logic requirements/comments.

1> Enable the feature per architecture (Lorenze)
Current state: Already in current patch (v4)

2> CONFIG_SEAL_SYSTEM_MAPPINGS
Current state: In v4, this depends on !CHECKPOINT_RESTORE. This
dependency is unhelpful for gVisor and UML (which don't depend on
CRIU) and should be removed.
(next step): remove !CHECKPOINT_RESTORE dependency.

3> Per process prctl (Andrei Vagin, Benjamin Berg)
Current state: Not in v4. Andrei suggested using prctl for opt-out.
This shouldn't depend on CRIU/CAP_CHECKPOINT_RESTORE (since gVisor is
a different feature).
(next step): Add this opt-out prctl under the new KCONFIG:
CONFIG_SEAL_SYSTEM_MAPPING_WITH_PRCTL_OPTOUT

4> kernel cmd line:
Current state: This is in v4. Lorenze commented that the new kernel
cmd line has no effect on a 32-bit build, so users might think
it is enabled when it is not.
This is because the 32-bit kernel doesn't support
mseal(), i.e. mseal.c is not even built for 32-bit.
Standard practice when processing kernel cmd lines with incorrect
input is to do nothing (no logs or crashes). Addressing this is
difficult unless mseal is supported for all architectures.

These are all the code logic related comments. Did I miss anything?
Other non-code related comments (threat-model, tests, etc) will be added later.

To simplify the next version (if no agreement is reached), I can focus
on items 1 and 2. This would satisfy ChromeOS and Android, as well as
distributions and users who don't use CRIU/UML/gVisor (likely the
majority). Items 3 and 4 could be addressed later.

Comments, suggestions, and alternative approaches are welcome.
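
For reference, the behaviour described under item 4 can be modelled in
a few lines of userspace C. This is only a sketch of the documented
semantics of exec.seal_system_mappings in v4 ('yes'/'no', fall back to
the CONFIG_SEAL_SYSTEM_MAPPINGS value when missing or invalid, no
effect when CONFIG_64BIT=n). The function names are hypothetical and
this is not the code from the patch.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical userspace model of the v4 parameter semantics.
 * value:          text after "exec.seal_system_mappings=", or NULL if absent
 * config_default: stands in for CONFIG_SEAL_SYSTEM_MAPPINGS=(y/n)
 */
bool parse_seal_system_mappings(const char *value, bool config_default)
{
	if (value == NULL)
		return config_default;	/* parameter not specified */
	if (strcmp(value, "yes") == 0)
		return true;
	if (strcmp(value, "no") == 0)
		return false;
	return config_default;		/* invalid input: silently keep default */
}

/*
 * Models the 32-bit concern: without mseal() support the parameter is
 * accepted but changes nothing, and nothing tells the user so.
 */
bool effective_sealing(const char *value, bool config_default, bool is_64bit)
{
	if (!is_64bit)
		return false;
	return parse_seal_system_mappings(value, config_default);
}
```

In this model, effective_sealing("yes", true, false) is false: a user
who passes 'yes' on a 32-bit build gets no sealing and no diagnostic,
which is the disconnect raised in review.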
Thanks
-Jeff
> Thanks
> -Jeff
>
> > -Kees
> >
> > --
> > Kees Cook
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-15 19:02 ` Jeff Xu
@ 2025-01-15 19:46 ` Lorenzo Stoakes
2025-01-15 20:20 ` Jeff Xu
2025-01-15 23:52 ` Kees Cook
0 siblings, 2 replies; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-01-15 19:46 UTC (permalink / raw)
To: Jeff Xu
Cc: Kees Cook, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn, Benjamin Berg
Jeff,
My name is Lorenzo, not Lorenze.
I've made it abundantly clear that this (NACKed) series cannot allow the
kernel to be in a broken state even if a user sets flags to do so.
This is because users might lack context to make this decision and
incorrectly do so, and now we ship a known-broken kernel.
You are now suggesting disabling the !CRIU requirement. Which violates my
_requirements_ (not optional features).
You seem to be saying you're pushing an internal feature on upstream and
only care about internal use cases, this is not how upstream works, as
Matthew alludes to.
I have told you that my requirements are:
1. You cannot allow a user to set config or boot options to have a
broken kernel configuration.
2. You must provide evidence that the arches you claim work with this,
actually do.
You seem to have eliminated that from your summary as if the very thing
that makes this series NACKed were not pertinent.
If you do not address these correctly, I will simply have to reject your v5
too and it'll waste everybody's time. I _genuinely_ don't want to have to
do this.
Any solution MUST fulfil these requirements. I also want to see v5 as an
RFC honestly at this stage, since it seems we are VERY MUCH in a discussion
phase rather than a patch phase at this time.
I really want to help you improve mseal and get things upstream, but I
can't ignore my duty to ensure that the kernel remains stable and we don't
hand kernel users (overly huge) footguns. I hate to be negative, but this
is why I am pushing back so much here.
Thanks!
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-15 19:46 ` Lorenzo Stoakes
@ 2025-01-15 20:20 ` Jeff Xu
2025-01-16 15:48 ` Lorenzo Stoakes
2025-01-15 23:52 ` Kees Cook
1 sibling, 1 reply; 62+ messages in thread
From: Jeff Xu @ 2025-01-15 20:20 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Kees Cook, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn, Benjamin Berg
Hi Lorenzo
On Wed, Jan 15, 2025 at 11:46 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Jeff,
>
> My name is Lorenzo, not Lorenze.
>
I apologize.
> I've made it abundantly clear that this (NACKed) series cannot allow the
> kernel to be in a broken state even if a user sets flags to do so.
>
> This is because users might lack context to make this decision and
> incorrectly do so, and now we ship a known-broken kernel.
>
> You are now suggesting disabling the !CRIU requirement. Which violates my
> _requirements_ (not optional features).
>
Sure, I can add CRIU back.
Are you fine with UML and gViso not working under this CONFIG ?
UML/gViso doesn't use any KCONFIG like CRIU does.
> You seem to be saying you're pushing an internal feature on upstream and
> only care about internal use cases, this is not how upstream works, as
> Matthew alludes to.
>
> I have told you that my requirements are:
>
> 1. You cannot allow a user to set config or boot options to have a
> broken kernel configuration.
>
Can you clarify the definition of "broken kernel configuration"?
Do you consider "setting the mseal kernel cmd line on a 32-bit build" as broken?
If so, this problem is not solvable and I might just not try to solve
it for the next version.
If you are just referring to a need to detect CRIU, in KCONFIG and/or kernel
cmd line, this is solvable.
> 2. You must provide evidence that the arches you claim work with this,
> actually do.
>
Sure
> You seem to have eliminated that from your summary as if the very thing
> that makes this series NACKed were not pertinent.
>
In my last email, I tried to cover all code-logic related comments,
which are what is blocking me.
I also mentioned that I will address non-code related comments
(threat-model/test etc) later.
> if you do not address these correctly, I will simply have to reject your v5
> too and it'll waste everybody's time. I _genuinely_ don't want to have to
> do this.
>
> Any solution MUST fulfil these requirements. I also want to see v5 as an
> RFC honestly at this stage, since it seems we are VERY MUCH in a discussion
> phase rather than a patch phase at this time.
>
Sure.
> I really want to help you improve mseal and get things upstream, but I
> can't ignore my duty to ensure that the kernel remains stable and we don't
> hand kernel users (overly huge) footguns. I hate to be negative, but this
> is why I am pushing back so much here.
>
Thanks. You can help me by answering my questions and clarifying your
requirements. I appreciate your time in making this feature useful.
Please note that a security feature often takes away
capabilities. Sometimes it is impossible to meet security, usability,
and performance goals simultaneously. I'm trying my best to satisfy
all of these aspects.
-Jeff
> Thanks!
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-15 19:46 ` Lorenzo Stoakes
2025-01-15 20:20 ` Jeff Xu
@ 2025-01-15 23:52 ` Kees Cook
2025-01-16 5:26 ` Christoph Hellwig
2025-01-16 15:34 ` Lorenzo Stoakes
1 sibling, 2 replies; 62+ messages in thread
From: Kees Cook @ 2025-01-15 23:52 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Jeff Xu, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn, Benjamin Berg
On Wed, Jan 15, 2025 at 07:46:00PM +0000, Lorenzo Stoakes wrote:
> You are now suggesting disabling the !CRIU requirement. Which violates my
> _requirements_ (not optional features).
Why not make this simply incremental? The feature isn't intended to work
with CRIU. Why don't we get the feature in first, with a !CRIU depends?
This lets the users of this feature actually use it.
> You seem to be saying you're pushing an internal feature on upstream and
> only care about internal use cases, this is not how upstream works, as
> Matthew alludes to.
Internal? No. Chrome OS and Android. Linux runs more Android devices
than everything else in the world combined -- this is not some random
experiment.
I really don't like the feature creep nature of the system mapping
sealing reviews. There's nothing special here -- we have plenty of
features that conflict with other features. And we have a long history
in the kernel of landing the core changes with lots of conflict depends
that we then resolve as we move forward.
Why not just make system map sealing conflict with CRIU? Who is asking
to use both at the same time?
> I have told you that my requirements are:
>
> 1. You cannot allow a user to set config or boot options to have a
> broken kernel configuration.
What do you define as a "broken kernel configuration"?
> 2. You must provide evidence that the arches you claim work with this,
> actually do.
What evidence would you find sufficient? I'm concerned this is turning
into a rock fetching quest.
-Kees
--
Kees Cook
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-15 23:52 ` Kees Cook
@ 2025-01-16 5:26 ` Christoph Hellwig
2025-01-16 19:40 ` Kees Cook
2025-01-16 15:34 ` Lorenzo Stoakes
1 sibling, 1 reply; 62+ messages in thread
From: Christoph Hellwig @ 2025-01-16 5:26 UTC (permalink / raw)
To: Kees Cook
Cc: Lorenzo Stoakes, Jeff Xu, akpm, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem, hch,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb,
Liam.Howlett, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
groeck, mpe, Vlastimil Babka, Andrei Vagin, Dmitry Safonov,
Mike Rapoport, Alexander Mikhalitsyn, Benjamin Berg
On Wed, Jan 15, 2025 at 03:52:23PM -0800, Kees Cook wrote:
> > You seem to be saying you're pushing an internal feature on upstream and
> > only care about internal use cases, this is not how upstream works, as
> > Matthew alludes to.
>
> Internal? No. Chrome OS and Android. Linux runs more Android devices
> than everything else in the world combined -- this is not some random
> experiment.
All of which are tightly controlled by Google and not actually open
to users. Which is not to say they don't matter, but they matter a
lot less than features widely useful to the open, not locked-down
userbase of classic Linux.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-15 23:52 ` Kees Cook
2025-01-16 5:26 ` Christoph Hellwig
@ 2025-01-16 15:34 ` Lorenzo Stoakes
2025-01-16 19:44 ` Kees Cook
1 sibling, 1 reply; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-01-16 15:34 UTC (permalink / raw)
To: Kees Cook
Cc: Jeff Xu, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn, Benjamin Berg
Kees,
I reply inline below but the TL;DR is - I'm fine with an incremental
approach, my requirements for arch support are sensible and doable and I'll
_give a R-b tag_ if such a version is submitted. :)
This isn't a discreet means of me trying to reject the whole idea, if I
felt that way I'd just say!
I actually firmly want to _help_ mseal features land in the kernel.
Cheers, Lorenzo
On Wed, Jan 15, 2025 at 03:52:23PM -0800, Kees Cook wrote:
> On Wed, Jan 15, 2025 at 07:46:00PM +0000, Lorenzo Stoakes wrote:
> > You are now suggesting disabling the !CRIU requirement. Which violates my
> > _requirements_ (not optional features).
>
> Why not make this simply incremental? The feature isn't intended to work
> with CRIU. Why don't we get the feature in first, with a !CRIU depends?
> This lets the users of this feature actually use it.
Sure, I'm ok with this.
The analysis at the end of the series suggested the consensus was otherwise,
which is why I highlighted this.
>
> > You seem to be saying you're pushing an internal feature on upstream and
> > only care about internal use cases, this is not how upstream works, as
> > Matthew alludes to.
>
> Internal? No. Chrome OS and Android. Linux runs more Android devices
> than everything else in the world combined -- this is not some random
> experiment.
This is unfair, I'm not claiming otherwise, and I would suggest you look
into other work I've done which has directly benefitted Android if you
believe I'm not aware of how widespread a user it is (you're welcome ;).
I also own and very much enjoy my Pixel Pro 9 Fold 2...
I'm saying we can't _only_ consider this. This is upstream kernel, we must
consider all architectures and use cases.
This seems a reasonable position to me.
>
> I really don't like the feature creep nature of the system mapping
> sealing reviews. There's nothing special here -- we have plenty of
> features that conflict with other features. And we have a long history
> in the kernel of landing the core changes with lots of conflict depends
> that we then resolve as we move forward.
There's been no feature creep. I explicitly said very early on what the
problem was and what needed to be done to fix it.
Then a bunch of discussion happened and an analysis was presented that
seemed to neglect this.
I also don't agree we have a long history of landing changes that quietly
break things in the kernel.
As I said in the mail you are responding to - my concern is that a kernel
user will enable this feature, not realising that it breaks X, Y or Z, and
there's no easy or clear way for them to know.
This was originally addressed with config flags, but then boot options were
provided which completely overrode this.
My concern is there is a disconnect between a kernel user seeing a security
feature - and them knowing or realising that it is broken, for instance, if
you try to use CRIU.
Then what seems a reasonable feature to enable is suddenly a
landmine for somebody to step on and break their system.
Again, I have no objection to a version of this series which explicitly
disallows known-broken scenarios.
>
> Why not just make system map sealing conflict with CRIU? Who is asking
> to use both at the same time?
Again, the analysis Jeff presented appeared to rule out doing this. Hence
my objection.
If we explicitly disallow this (with no boot override) I'm fine with it!
Sorry if I wasn't clear about that.
>
> > I have told you that my requirements are:
> >
> > 1. You cannot allow a user to set config or boot options to have a
> > broken kernel configuration.
>
> What do you define as a "broken kernel configuration"?
One which results in a kernel which cannot function correctly in some
fundamental respect. For instance, breaking userspace for certain programs
when a user enables a feature which might non-obviously do so.
>
> > 2. You must provide evidence that the arches you claim work with this,
> > actually do.
>
> What evidence would you find sufficient? I'm concerned this is turning
> into a rock fetching quest.
That no code relies on the VDSO being non-sealed in supported arches. You
can do this by pointing out that the code that interacts with the VDSO in
these arches does not, in any instance, encounter a problem when it is sealed.
I _believe_ this is a reasonable request, and sort of a fundamental thing
you'd want for a change like this for such a weird beast as the VDSO.
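
A runtime probe is not a substitute for the code audit described above,
but a small userspace check could complement it on each supported
arch. The sketch below locates a named mapping such as [vdso] in
/proc/self/maps and attempts an mprotect() to PROT_READ|PROT_EXEC
(assumed here to match the vdso's existing permissions). Per mseal.rst,
a sealed mapping is expected to refuse the call with EPERM. The helper
names are hypothetical, and the exact behaviour on a given kernel
should be confirmed rather than assumed.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Parse one /proc/<pid>/maps line. Returns 1 and fills start/end if
 * the line mentions `label` (e.g. "[vdso]"), 0 otherwise.
 */
int parse_maps_line(const char *line, const char *label,
		    unsigned long *start, unsigned long *end)
{
	unsigned long s, e;

	if (strstr(line, label) == NULL)
		return 0;
	if (sscanf(line, "%lx-%lx", &s, &e) != 2)
		return 0;
	*start = s;
	*end = e;
	return 1;
}

/* Scan /proc/self/maps for the first mapping that mentions `label`. */
int find_self_mapping(const char *label, unsigned long *start,
		      unsigned long *end)
{
	char line[512];
	int found = 0;
	FILE *f = fopen("/proc/self/maps", "r");

	if (f == NULL)
		return 0;
	while (!found && fgets(line, sizeof(line), f) != NULL)
		found = parse_maps_line(line, label, start, end);
	fclose(f);
	return found;
}

/*
 * Attempt an mprotect() on the range. Returns 0 if the kernel allowed
 * it, or the errno value if it refused (EPERM is expected when sealed).
 */
int probe_mprotect(unsigned long start, unsigned long end)
{
	if (mprotect((void *)start, end - start, PROT_READ | PROT_EXEC) == 0)
		return 0;
	return errno;
}
```

A caller would run find_self_mapping("[vdso]", &start, &end) and then
probe_mprotect(start, end), and compare the result with and without
the sealing config enabled. This only shows that the seal is applied.
It says nothing about whether arch code or userland depends on
modifying the mapping, which is the question that needs answering.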
>
> -Kees
>
> --
> Kees Cook
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-15 20:20 ` Jeff Xu
@ 2025-01-16 15:48 ` Lorenzo Stoakes
2025-01-16 17:01 ` Benjamin Berg
2025-01-17 18:08 ` Jeff Xu
0 siblings, 2 replies; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-01-16 15:48 UTC (permalink / raw)
To: Jeff Xu
Cc: Kees Cook, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn, Benjamin Berg
On Wed, Jan 15, 2025 at 12:20:59PM -0800, Jeff Xu wrote:
> Hi Lorenzo
>
> On Wed, Jan 15, 2025 at 11:46 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > Jeff,
> >
> > My name is Lorenzo, not Lorenze.
> >
> I apologize.
No worries, sorry I realise it was probably a typo! But just in case you
didn't realise :P
>
> > I've made it abundantly clear that this (NACKed) series cannot allow the
> > kernel to be in a broken state even if a user sets flags to do so.
> >
> > This is because users might lack context to make this decision and
> > incorrectly do so, and now we ship a known-broken kernel.
> >
> > You are now suggesting disabling the !CRIU requirement. Which violates my
> > _requirements_ (not optional features).
> >
> Sure, I can add CRIU back.
>
> Are you fine with UML and gViso not working under this CONFIG ?
> UML/gViso doesn't use any KCONFIG like CRIU does.
Yeah this is a concern, wouldn't we be able to catch UML with a flag?
Apologies, my fault for maybe not being totally up to date with this, but what
exactly was the gViso issue (is it gVisor, actually?)
>
> > You seem to be saying you're pushing an internal feature on upstream and
> > only care about internal use cases, this is not how upstream works, as
> > Matthew alludes to.
> >
> > I have told you that my requirements are:
> >
> > 1. You cannot allow a user to set config or boot options to have a
> > broken kernel configuration.
> >
> Can you clarify on the definition of "broken kernel configuration":
Anything that'd break userland in a way that would be entirely
unexpected.
Especially so if there is a real disconnect between the person who is
enabling the feature and the program.
For instance if a distro wants to be big on security, is (as is entirely
reasonable) concerned about an unsealed VDSO/VVAR/etc. being exploited, so
turns on the flag, but _doesn't realise_ or doesn't communicate (such a big
problem and difficult actually for many distros/vendors) that this will
break certain programs - and then users do a kernel update, and *bang*
their whole system is broken.
It's really this kind of scenario I'm worried about.
This is the crux of it really.
>
> Do you consider "setting mseal kernel cmd line under 32 bit build" as broken ?
> If so, this problem is not solvable and I might just not try to solve
> it for the next version.
Yeah, I really don't like the kernel cmd line thing, because of this risk
of disconnect - your justification for it is prima facie reasonable - the
distro didn't want to enable the thing by default but you want more
security - but then we have this issue with the possible disconnect between
'hey here is security feature X' vs. 'security feature X breaks Y, Z +
alpha'.
>
> If you just refer to a need to detect CRIU, in KCONFIG or/and kernel
> cmd line, this is solvable.
>
> > 2. You must provide evidence that the arches you claim work with this,
> > actually do.
> >
> Sure
See my reply to Kees as to what this comprises, sorry if I was not clear
previously.
>
> > You seem to have eliminated that from your summary as if the very thing
> > that makes this series NACKed were not pertinent.
> >
> In my last email, I tried to cover all code-logic related comments,
> which is blocking me.
> I also mentioned I will address non-code related comments
> (threat-model/test etc), later.
Ack.
I felt that you hadn't hit on my fundamental objections and that this was,
in effect, a final analysis of how you would be moving forward with v5,
but apologies if you did intend to discuss them separately.
>
> > if you do not address these correctly, I will simply have to reject your v5
> > too and it'll waste everybody's time. I _genuinely_ don't want to have to
> > do this.
> >
> > Any solution MUST fulfil these requirements. I also want to see v5 as an
> > RFC honestly at this stage, since it seems we are VERY MUCH in a discussion
> > phase rather than a patch phase at this time.
> >
> Sure.
To be clear - if the series is viable, I want to see it merged. And to
further clarify - a simpler, smaller version of this that explicitly
disallows breakage in config options suffices (though we must clarify the
gVisor + UML things).
If I just wanted to reject this outright, I'd tell you :) (I don't).
I just need to feel vaguely less anxious about breaking things! :)
>
> > I really want to help you improve mseal and get things upstream, but I
> > can't ignore my duty to ensure that the kernel remains stable and we don't
> > hand kernel users (overly huge) footguns. I hate to be negative, but this
> > is why I am pushing back so much here.
> >
> Thanks. You can help me by answering my questions, and clarify your
> requirements. I appreciate your time to make this feature useful.
Sure, hopefully I have done so, do follow up if anything was unclear.
>
> Please take note that the security feature often takes away
> capabilities. Sometimes it is impossible to meet security, usability
> or performance goals simultaneously. I'm trying my best to get all
> aspects satisfied.
Ack, and I realise it's often a difficult trade-off. I just worry about
compounding complexity in consequences of kernel configuration vs. userland
stuff + the disconnect between the two.
>
> -Jeff
>
> > Thanks!
Cheers, Lorenzo
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-16 15:48 ` Lorenzo Stoakes
@ 2025-01-16 17:01 ` Benjamin Berg
2025-01-16 17:16 ` Lorenzo Stoakes
2025-01-16 17:18 ` Pedro Falcato
2025-01-17 18:08 ` Jeff Xu
1 sibling, 2 replies; 62+ messages in thread
From: Benjamin Berg @ 2025-01-16 17:01 UTC (permalink / raw)
To: Lorenzo Stoakes, Jeff Xu
Cc: Kees Cook, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn
Hi Lorenzo,
On Thu, 2025-01-16 at 15:48 +0000, Lorenzo Stoakes wrote:
> On Wed, Jan 15, 2025 at 12:20:59PM -0800, Jeff Xu wrote:
> > On Wed, Jan 15, 2025 at 11:46 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
>
> [SNIP]
> >
> > > I've made it abundantly clear that this (NACKed) series cannot allow the
> > > kernel to be in a broken state even if a user sets flags to do so.
> > >
> > > This is because users might lack context to make this decision and
> > > incorrectly do so, and now we ship a known-broken kernel.
> > >
> > > You are now suggesting disabling the !CRIU requirement. Which violates my
> > > _requirements_ (not optional features).
> > >
> > Sure, I can add CRIU back.
> >
> > Are you fine with UML and gViso not working under this CONFIG ?
> > UML/gViso doesn't use any KCONFIG like CRIU does.
>
> Yeah this is a concern, wouldn't we be able to catch UML with a flag?
>
> Apologies my fault for maybe not being totally up to date with this, but what
> exactly was the gViso (is it gVisor actually?)
UML is a separate architecture. It is a Linux kernel running as a
userspace application on top of an unmodified host kernel.
So really, for the purpose of this discussion, UML is mostly a weird
userspace program. And a pretty buggy one too--it got broken by rseq
already.
What UML now does is:
* Execute a tiny static binary
* map special "stub" code/data pages at the topmost userspace address
(replacing its stack)
* continue execution inside the "stub" pages
* unmap everything below the "stub" pages
* use the unmap'ed area for userspace application mappings
I believe that the "unmap everything" step will fail with this feature.
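To make the expected failure concrete, here is a minimal sketch (not UML
code) of what a process sees when it tries to unmap a sealed range. It
assumes a 64-bit kernel that provides mseal() (Linux 6.10 or later) and the
generic syscall number 462; try_unmap_sealed_page() is a hypothetical helper
name, and the sealed page is an ordinary anonymous mapping standing in for
the vDSO.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_mseal
#define SYS_mseal 462 /* assumption: generic number, as on x86-64 and arm64 */
#endif

/*
 * Returns  1 if the page was sealed and munmap() was refused with EPERM,
 *          0 if the page was sealed but munmap() did not fail with EPERM,
 *         -1 if the demo could not run (mmap failed or mseal() unavailable).
 */
int try_unmap_sealed_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p;

    if (page <= 0)
        return -1;

    p = mmap(NULL, (size_t)page, PROT_READ,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* mseal(addr, len, flags); flags must currently be 0. */
    if (syscall(SYS_mseal, p, (size_t)page, 0UL) != 0) {
        munmap(p, (size_t)page); /* not sealed, so cleanup is allowed */
        return -1;
    }

    /* Sealed: the mapping now stays for the lifetime of the process. */
    if (munmap(p, (size_t)page) == -1 && errno == EPERM)
        return 1;
    return 0;
}
```

On a kernel with mseal() the helper should return 1. As I read mseal.rst,
a munmap() whose range includes a sealed mapping is refused the same way,
which is why the "unmap everything below the stub" step is the one at risk.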
Now, I am sure one can come up with solutions, e.g.:
1. Simply print an explanation if the munmap() fails
2. Find an address that is guaranteed to be below the VDSO and use a
smaller address space for the UML userspace.
3. Somehow tell the host kernel to not install the VDSO mappings
4. Add the host VDSO pages as a sealed VMA within UML to guard them
UML is a bit of a niche and I am not sure it is worth worrying about it
too much.
Benjamin
>
> >
> > > You seem to be saying you're pushing an internal feature on upstream and
> > > only care about internal use cases, this is not how upstream works, as
> > > Matthew alludes to.
> > >
> > > I have told you that my requirements are:
> > >
> > > 1. You cannot allow a user to set config or boot options to have a
> > > broken kernel configuration.
> > >
> > Can you clarify on the definition of "broken kernel configuration":
>
> Anything that'd break userland in a way that would be entirely
> unexpected.
>
> Especially so if there is a real disconnect between the person who is
> enabling the feature and the program.
>
> For instance if a distro wants to be big on security, is (as is entirely
> reasonable) concerned about an unsealed VDSO/VVAR/etc. being exploited, so
> turns on the flag, but _doesn't realise_ or doesn't communicate (such a big
> problem and difficult actually for many distros/vendors) that this will
> break certain programs - and then users do a kernel update, and *bang*
> their whole system is broken.
>
> It's really this kind of scenario I'm worried about.
>
> This is the crux of it really.
>
> >
> > Do you consider "setting mseal kernel cmd line under 32 bit build" as broken ?
> > If so, this problem is not solvable and I might just not try to solve
> > it for the next version.
>
> Yeah, I really don't like the kernel cmd line thing, because of this risk
> of disconnect - your justification for it is prima facie reasonable - the
> distro didn't want to enable the thing by default but you want more
> security - but then we have this issue with the possible disconnect between
> 'hey here is security feature X' vs. 'security feature X breaks Y, Z +
> alpha'.
>
> >
> > If you just refer to a need to detect CRIU, in KCONFIG or/and kernel
> > cmd line, this is solvable.
> >
> > > 2. You must provide evidence that the arches you claim work with this,
> > > actually do.
> > >
> > Sure
>
> See my reply to Kees as to what this comprises, sorry if I was not clear
> previously.
>
>
> >
> > > You seem to have eliminated that from your summary as if the very thing
> > > that makes this series NACKed were not pertinent.
> > >
> > In my last email, I tried to cover all code-logic related comments,
> > which is blocking me.
> > I also mentioned I will address non-code related comments
> > (threat-model/test etc), later.
>
> Ack.
>
> I felt that you hadn't hit on my fundamental objections and this was in
> effect - a final analysis as to how you would be moving forward with v5 -
> but apologies if you did intend to separately discuss them.
>
> >
> > > if you do not address these correctly, I will simply have to reject your v5
> > > too and it'll waste everybody's time. I _genuinely_ don't want to have to
> > > do this.
> > >
> > > Any solution MUST fulfil these requirements. I also want to see v5 as an
> > > RFC honestly at this stage, since it seems we are VERY MUCH in a discussion
> > > phase rather than a patch phase at this time.
> > >
> > Sure.
>
> To be clear - if the series is viable, I want to see it merged. And to
> further clarify - a simpler, smaller version of this that explicitly
> disallows breakage in config options suffices (though we must clarify the
> gVisor + UML things).
>
> If I just wanted to reject this outright, I'd tell you :) (I don't).
>
> I just need to feel vaguely less anxious about breaking things! :)
>
> >
> > > I really want to help you improve mseal and get things upstream, but I
> > > can't ignore my duty to ensure that the kernel remains stable and we don't
> > > hand kernel users (overly huge) footguns. I hate to be negative, but this
> > > is why I am pushing back so much here.
> > >
> > Thanks. You can help me by answering my questions, and clarify your
> > requirements. I appreciate your time to make this feature useful.
>
> Sure, hopefully I have done so, do follow up if anything was unclear.
>
> >
> > Please take note that the security feature often takes away
> > capabilities. Sometimes it is impossible to meet security, usability
> > or performance goals simultaneously. I'm trying my best to get all
> > aspects satisfied.
>
> Ack, and I realise it's often a difficult trade-off. I just worry about
> compounding complexity in consequences of kernel configuration vs. userland
> stuff + the disconnect between the two.
>
> >
> > -Jeff
> >
> > > Thanks!
>
> Cheers, Lorenzo
>
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-16 17:01 ` Benjamin Berg
@ 2025-01-16 17:16 ` Lorenzo Stoakes
2025-01-16 17:18 ` Pedro Falcato
1 sibling, 0 replies; 62+ messages in thread
From: Lorenzo Stoakes @ 2025-01-16 17:16 UTC (permalink / raw)
To: Benjamin Berg
Cc: Jeff Xu, Kees Cook, akpm, jannh, torvalds, adhemerval.zanella,
oleg, linux-kernel, linux-hardening, linux-mm, jorgelo,
sroettger, ojeda, adobriyan, anna-maria, mark.rutland,
linus.walleij, Jason, deller, rdunlap, davem, hch, peterx, hca,
f.fainelli, gerg, dave.hansen, mingo, ardb, Liam.Howlett, mhocko,
42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe,
Vlastimil Babka, Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn
On Thu, Jan 16, 2025 at 06:01:47PM +0100, Benjamin Berg wrote:
> Hi Lorenzo,
>
> On Thu, 2025-01-16 at 15:48 +0000, Lorenzo Stoakes wrote:
> > On Wed, Jan 15, 2025 at 12:20:59PM -0800, Jeff Xu wrote:
> > > On Wed, Jan 15, 2025 at 11:46 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> >
> > [SNIP]
> > >
> > > > I've made it abundantly clear that this (NACKed) series cannot allow the
> > > > kernel to be in a broken state even if a user sets flags to do so.
> > > >
> > > > This is because users might lack context to make this decision and
> > > > incorrectly do so, and now we ship a known-broken kernel.
> > > >
> > > > You are now suggesting disabling the !CRIU requirement. Which violates my
> > > > _requirements_ (not optional features).
> > > >
> > > Sure, I can add CRIU back.
> > >
> > > Are you fine with UML and gViso not working under this CONFIG ?
> > > UML/gViso doesn't use any KCONFIG like CRIU does.
> >
> > Yeah this is a concern, wouldn't we be able to catch UML with a flag?
> >
> > Apologies my fault for maybe not being totally up to date with this, but what
> > exactly was the gViso (is it gVisor actually?)
>
> UML is a separate architecture. It is a Linux kernel running as a
> userspace application on top of an unmodified host kernel.
>
> So really, UML is a mostly weird userspace program for the purpose of
> this discussion. And a pretty buggy one too--it got broken by rseq
> already.
>
> What UML now does is:
> * Execute a tiny static binary
> * map special "stub" code/data pages at the topmost userspace address
> (replacing its stack)
> * continue execution inside the "stub" pages
> * unmap everything below the "stub" pages
> * use the unmap'ed area for userspace application mappings
>
> I believe that the "unmap everything" step will fail with this feature.
>
Ahhh interesting.
>
> Now, I am sure one can come up with solutions, e.g.:
> 1. Simply print an explanation if the unmap() fails
> 2. Find an address that is guaranteed to be below the VDSO and use a
> smaller address space for the UML userspace.
> 3. Somehow tell the host kernel to not install the VDSO mappings
> 4. Add the host VDSO pages as a sealed VMA within UML to guard them
>
Right.
> UML is a bit of a niche and I am not sure it is worth worrying about it
> too much.
>
> Benjamin
Well in that case then it's number_of_things_to_worry_about--; here :)
Cheers Benjamin!
[snip]
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-16 17:01 ` Benjamin Berg
2025-01-16 17:16 ` Lorenzo Stoakes
@ 2025-01-16 17:18 ` Pedro Falcato
2025-01-17 18:20 ` Jeff Xu
1 sibling, 1 reply; 62+ messages in thread
From: Pedro Falcato @ 2025-01-16 17:18 UTC (permalink / raw)
To: Benjamin Berg
Cc: Lorenzo Stoakes, Jeff Xu, Kees Cook, akpm, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem, hch,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb,
Liam.Howlett, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
groeck, mpe, Vlastimil Babka, Andrei Vagin, Dmitry Safonov,
Mike Rapoport, Alexander Mikhalitsyn
On Thu, Jan 16, 2025 at 5:02 PM Benjamin Berg <benjamin@sipsolutions.net> wrote:
>
> Hi Lorenzo,
>
> On Thu, 2025-01-16 at 15:48 +0000, Lorenzo Stoakes wrote:
> > On Wed, Jan 15, 2025 at 12:20:59PM -0800, Jeff Xu wrote:
> > > On Wed, Jan 15, 2025 at 11:46 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> >
> > [SNIP]
> > >
> > > > I've made it abundantly clear that this (NACKed) series cannot allow the
> > > > kernel to be in a broken state even if a user sets flags to do so.
> > > >
> > > > This is because users might lack context to make this decision and
> > > > incorrectly do so, and now we ship a known-broken kernel.
> > > >
> > > > You are now suggesting disabling the !CRIU requirement. Which violates my
> > > > _requirements_ (not optional features).
> > > >
> > > Sure, I can add CRIU back.
> > >
> > > Are you fine with UML and gViso not working under this CONFIG ?
> > > UML/gViso doesn't use any KCONFIG like CRIU does.
> >
> > Yeah this is a concern, wouldn't we be able to catch UML with a flag?
> >
> > Apologies my fault for maybe not being totally up to date with this, but what
> > exactly was the gViso (is it gVisor actually?)
>
> UML is a separate architecture. It is a Linux kernel running as a
> userspace application on top of an unmodified host kernel.
>
> So really, UML is a mostly weird userspace program for the purpose of
> this discussion. And a pretty buggy one too--it got broken by rseq
> already.
>
> What UML now does is:
> * Execute a tiny static binary
> * map special "stub" code/data pages at the topmost userspace address
> (replacing its stack)
> * continue execution inside the "stub" pages
> * unmap everything below the "stub" pages
> * use the unmap'ed area for userspace application mappings
>
> I believe that the "unmap everything" step will fail with this feature.
>
>
> Now, I am sure one can come up with solutions, e.g.:
> 1. Simply print an explanation if the unmap() fails
> 2. Find an address that is guaranteed to be below the VDSO and use a
> smaller address space for the UML userspace.
> 3. Somehow tell the host kernel to not install the VDSO mappings
> 4. Add the host VDSO pages as a sealed VMA within UML to guard them
>
> UML is a bit of a niche and I am not sure it is worth worrying about it
> too much.
I've been absent from this patch series in general, but this gave me
an idea: what if we let userspace seal these mappings itself? Since
glibc is already sealing things, it might as well seal these?
And then systems that _do_ care about this would set the glibc tunable
and deal with the breakage.
Is there something seriously wrong with this approach? Besides maybe
not having a super easy way to discover these mappings atm, I feel
like it would solve all of the policy issues people have been talking
about in these threads.
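For what it's worth, one way userspace could discover these mappings today
is to scan /proc/self/maps for the bracketed names. The sketch below is an
illustration only, not something glibc does; parse_special_mapping() and
find_special_mappings() are hypothetical helpers, and the set of names
([vdso], [vvar], ...) is an assumption that varies by architecture and
kernel version.

```c
#include <stdio.h>
#include <string.h>

struct special_range {
    unsigned long start;
    unsigned long end;
    char name[32];
};

/*
 * Parse one line of /proc/<pid>/maps. Returns 1 and fills *out if the
 * mapping is named "[vdso]" or has a name starting with "[vvar",
 * 0 otherwise.
 */
int parse_special_mapping(const char *line, struct special_range *out)
{
    unsigned long start, end;
    char name[32] = "";

    /* Line format: start-end perms offset dev inode [pathname] */
    if (sscanf(line, "%lx-%lx %*4s %*x %*s %*u %31s",
               &start, &end, name) < 3)
        return 0; /* anonymous mapping: no name field */

    if (strcmp(name, "[vdso]") != 0 && strncmp(name, "[vvar", 5) != 0)
        return 0;

    out->start = start;
    out->end = end;
    strcpy(out->name, name); /* both buffers are 32 bytes */
    return 1;
}

/* Collect up to max special mappings of the calling process; -1 on error. */
int find_special_mappings(struct special_range *out, int max)
{
    char line[4352]; /* room for PATH_MAX plus the fixed fields */
    FILE *f = fopen("/proc/self/maps", "r");
    int count = 0;

    if (!f)
        return -1;
    while (count < max && fgets(line, sizeof(line), f))
        if (parse_special_mapping(line, &out[count]))
            count++;
    fclose(f);
    return count;
}
```

Each range found this way could then be passed to mseal() as (start,
end - start). This does nothing for mappings created later, and it relies
on procfs being mounted.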
--
Pedro
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-16 5:26 ` Christoph Hellwig
@ 2025-01-16 19:40 ` Kees Cook
2025-01-17 10:14 ` Heiko Carstens
0 siblings, 1 reply; 62+ messages in thread
From: Kees Cook @ 2025-01-16 19:40 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Lorenzo Stoakes, Jeff Xu, akpm, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb,
Liam.Howlett, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
groeck, mpe, Vlastimil Babka, Andrei Vagin, Dmitry Safonov,
Mike Rapoport, Alexander Mikhalitsyn, Benjamin Berg
On Thu, Jan 16, 2025 at 06:26:55AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 15, 2025 at 03:52:23PM -0800, Kees Cook wrote:
> > > You seem to be saying you're pushing an internal feature on upstream and
> > > only care about internal use cases, this is not how upstream works, as
> > > Matthew alludes to.
> >
> > Internal? No. Chrome OS and Android. Linux runs more Android devices
> > than everything else in the world combined -- this is not some random
> > experiment.
>
> All of which are tightly controlled by Google and not actually open
> to users. Which doesn't say they don't matter, but they matter a
> lot less than features widely useful to the open not locked down
> userbase of classic Linux.
I get your point. Though in my proposal it would be available to anyone
without CRIU too, which is, for example, defconfig builds (excepting
s390 and riscv).
--
Kees Cook
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-16 15:34 ` Lorenzo Stoakes
@ 2025-01-16 19:44 ` Kees Cook
0 siblings, 0 replies; 62+ messages in thread
From: Kees Cook @ 2025-01-16 19:44 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Jeff Xu, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn, Benjamin Berg
On Thu, Jan 16, 2025 at 03:34:40PM +0000, Lorenzo Stoakes wrote:
> This was originally addressed with config flags, but then boot options were
> provided which completely overrode this.
> [...]
> Again, I have no objection to a version of this series which explicitly
> disallows known-broken scenarios.
Okay, thanks. Honestly, it will motivate me to finally make CRIU a boot
param too. I'd like to run distro kernels but keep CRIU fully disabled
(it provides some "extra" introspection of seccomp filters that feels
wrong to me, but is needed for CRIU -- but I don't use CRIU...)
--
Kees Cook
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-16 19:40 ` Kees Cook
@ 2025-01-17 10:14 ` Heiko Carstens
0 siblings, 0 replies; 62+ messages in thread
From: Heiko Carstens @ 2025-01-17 10:14 UTC (permalink / raw)
To: Kees Cook
Cc: Christoph Hellwig, Lorenzo Stoakes, Jeff Xu, akpm, jannh,
torvalds, adhemerval.zanella, oleg, linux-kernel,
linux-hardening, linux-mm, jorgelo, sroettger, ojeda, adobriyan,
anna-maria, mark.rutland, linus.walleij, Jason, deller, rdunlap,
davem, peterx, f.fainelli, gerg, dave.hansen, mingo, ardb,
Liam.Howlett, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
groeck, mpe, Vlastimil Babka, Andrei Vagin, Dmitry Safonov,
Mike Rapoport, Alexander Mikhalitsyn, Benjamin Berg
Hi Kees,
On Thu, Jan 16, 2025 at 11:40:37AM -0800, Kees Cook wrote:
> On Thu, Jan 16, 2025 at 06:26:55AM +0100, Christoph Hellwig wrote:
> > On Wed, Jan 15, 2025 at 03:52:23PM -0800, Kees Cook wrote:
> > > > You seem to be saying you're pushing an internal feature on upstream and
> > > > only care about internal use cases, this is not how upstream works, as
> > > > Matthew alludes to.
> > >
> > > Internal? No. Chrome OS and Android. Linux runs more Android devices
> > > than everything else in the world combined -- this is not some random
> > > experiment.
> >
> > All of which are tightly controlled by Google and not actually open
> > to users. Which doesn't say they don't matter, but they matter a
> > lot less than features widely useful to the open not locked down
> > userbase of classic Linux.
>
> I get your point. Though in my proposal it would be available to anyone
> without CRIU too, which is, for example, defconfig builds (excepting
> s390 and riscv).
I only look into this discussion from time to time, so I haven't
followed everything. What makes s390 and riscv special here?
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-16 15:48 ` Lorenzo Stoakes
2025-01-16 17:01 ` Benjamin Berg
@ 2025-01-17 18:08 ` Jeff Xu
1 sibling, 0 replies; 62+ messages in thread
From: Jeff Xu @ 2025-01-17 18:08 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Kees Cook, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, enh, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn, Benjamin Berg
On Thu, Jan 16, 2025 at 7:49 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Jan 15, 2025 at 12:20:59PM -0800, Jeff Xu wrote:
> > Hi Lorenzo
> >
> > On Wed, Jan 15, 2025 at 11:46 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > Jeff,
> > >
> > > My name is Lorenzo, not Lorenze.
> > >
> > I apologize.
>
> No worries, sorry I realise it was probably a typo! But just in case you
> didn't realise :P
>
> >
> > > I've made it abundantly clear that this (NACKed) series cannot allow the
> > > kernel to be in a broken state even if a user sets flags to do so.
> > >
> > > This is because users might lack context to make this decision and
> > > incorrectly do so, and now we ship a known-broken kernel.
> > >
> > > You are now suggesting disabling the !CRIU requirement. Which violates my
> > > _requirements_ (not optional features).
> > >
> > Sure, I can add CRIU back.
> >
> > Are you fine with UML and gViso not working under this CONFIG ?
> > UML/gViso doesn't use any KCONFIG like CRIU does.
>
> Yeah this is a concern, wouldn't we be able to catch UML with a flag?
>
> Apologies my fault for maybe not being totally up to date with this, but what
> exactly was the gViso (is it gVisor actually?)
>
It is a typo, should be gVisor.
> >
> > > You seem to be saying you're pushing an internal feature on upstream and
> > > only care about internal use cases, this is not how upstream works, as
> > > Matthew alludes to.
> > >
> > > I have told you that my requirements are:
> > >
> > > 1. You cannot allow a user to set config or boot options to have a
> > > broken kernel configuration.
> > >
> > Can you clarify on the definition of "broken kernel configuration":
>
> Anything that'd break userland in a way that would be entirely
> unexpected.
>
> Especially so if there is a real disconnect between the person who is
> enabling the feature and the program.
>
> For instance if a distro wants to be big on security, is (as is entirely
> reasonable) concerned about an unsealed VDSO/VVAR/etc. being exploited, so
> turns on the flag, but _doesn't realise_ or doesn't communicate (such a big
> problem and difficult actually for many distros/vendors) that this will
> break certain programs - and then users do a kernel update, and *bang*
> their whole system is broken.
>
> It's really this kind of scenario I'm worried about.
>
> This is the crux of it really.
>
Ok, thank you for clarifying.
> >
> > Do you consider "setting mseal kernel cmd line under 32 bit build" as broken ?
> > If so, this problem is not solvable and I might just not try to solve
> > it for the next version.
>
> Yeah, I really don't like the kernel cmd line thing, because of this risk
> of disconnect - your justification for it is prima facie reasonable - the
> distro didn't want to enable the thing by default but you want more
> security - but then we have this issue with the possible disconnect between
> 'hey here is security feature X' vs. 'security feature X breaks Y, Z +
> alpha'.
>
Ok, the kernel cmd line won't be in the next version.
The kernel cmd line feature exists as a supplement to KCONFIG, and
has its own use cases. However, that discussion can happen later, when
a kernel cmd line option is added for this feature.
> >
> > If you just refer to a need to detect CRIU, in KCONFIG or/and kernel
> > cmd line, this is solvable.
> >
> > > 2. You must provide evidence that the arches you claim work with this,
> > > actually do.
> > >
> > Sure
>
> See my reply to Kees as to what this comprises, sorry if I was not clear
> previously.
>
>
> >
> > > You seem to have eliminated that from your summary as if the very thing
> > > that makes this series NACKed were not pertinent.
> > >
> > In my last email, I tried to cover all code-logic related comments,
> > which is blocking me.
> > I also mentioned I will address non-code related comments
> > (threat-model/test etc), later.
>
> Ack.
>
> I felt that you hadn't hit on my fundamental objections and this was in
> effect - a final analysis as to how you would be moving forward with v5 -
> but apologies if you did intend to separately discuss them.
>
> >
> > > if you do not address these correctly, I will simply have to reject your v5
> > > too and it'll waste everybody's time. I _genuinely_ don't want to have to
> > > do this.
> > >
> > > Any solution MUST fulfil these requirements. I also want to see v5 as an
> > > RFC honestly at this stage, since it seems we are VERY MUCH in a discussion
> > > phase rather than a patch phase at this time.
> > >
> > Sure.
>
> To be clear - if the series is viable, I want to see it merged. And to
> further clarify - a simpler, smaller version of this that explicitly
> disallows breakage in config options suffices (though we must clarify the
> gVisor + UML things).
>
> If I just wanted to reject this outright, I'd tell you :) (I don't).
>
> I just need to feel vaguely less anxious about breaking things! :)
>
> >
> > > I really want to help you improve mseal and get things upstream, but I
> > > can't ignore my duty to ensure that the kernel remains stable and we don't
> > > hand kernel users (overly huge) footguns. I hate to be negative, but this
> > > is why I am pushing back so much here.
> > >
> > Thanks. You can help me by answering my questions, and clarify your
> > requirements. I appreciate your time to make this feature useful.
>
> Sure, hopefully I have done so, do follow up if anything was unclear.
>
> >
> > Please take note that the security feature often takes away
> > capabilities. Sometimes it is impossible to meet security, usability
> > or performance goals simultaneously. I'm trying my best to get all
> > aspects satisfied.
>
> Ack, and I realise it's often a difficult trade-off. I just worry about
> compounding complexity in consequences of kernel configuration vs. userland
> stuff + the disconnect between the two.
>
Understood, thanks for explaining. I value feedback to make the
feature useful/robust.
-Jeff
> >
> > -Jeff
> >
> > > Thanks!
>
> Cheers, Lorenzo
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-16 17:18 ` Pedro Falcato
@ 2025-01-17 18:20 ` Jeff Xu
2025-01-17 19:35 ` enh
0 siblings, 1 reply; 62+ messages in thread
From: Jeff Xu @ 2025-01-17 18:20 UTC (permalink / raw)
To: Pedro Falcato
Cc: Benjamin Berg, Lorenzo Stoakes, Kees Cook, akpm, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem, hch,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb,
Liam.Howlett, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes,
groeck, mpe, Vlastimil Babka, Andrei Vagin, Dmitry Safonov,
Mike Rapoport, Alexander Mikhalitsyn
On Thu, Jan 16, 2025 at 9:18 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> On Thu, Jan 16, 2025 at 5:02 PM Benjamin Berg <benjamin@sipsolutions.net> wrote:
> >
> > Hi Lorenzo,
> >
> > On Thu, 2025-01-16 at 15:48 +0000, Lorenzo Stoakes wrote:
> > > On Wed, Jan 15, 2025 at 12:20:59PM -0800, Jeff Xu wrote:
> > > > On Wed, Jan 15, 2025 at 11:46 AM Lorenzo Stoakes
> > > > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > [SNIP]
> > > >
> > > > > I've made it abundantly clear that this (NACKed) series cannot allow the
> > > > > kernel to be in a broken state even if a user sets flags to do so.
> > > > >
> > > > > This is because users might lack context to make this decision and
> > > > > incorrectly do so, and now we ship a known-broken kernel.
> > > > >
> > > > > You are now suggesting disabling the !CRIU requirement. Which violates my
> > > > > _requirements_ (not optional features).
> > > > >
> > > > Sure, I can add CRIU back.
> > > >
> > > > Are you fine with UML and gViso not working under this CONFIG ?
> > > > UML/gViso doesn't use any KCONFIG like CRIU does.
> > >
> > > Yeah this is a concern, wouldn't we be able to catch UML with a flag?
> > >
> > > Apologies my fault for maybe not being totally up to date with this, but what
> > > exactly was the gViso (is it gVisor actually?)
> >
> > UML is a separate architecture. It is a Linux kernel running as a
> > userspace application on top of an unmodified host kernel.
> >
> > So really, UML is a mostly weird userspace program for the purpose of
> > this discussion. And a pretty buggy one too--it got broken by rseq
> > already.
> >
> > What UML now does is:
> > * Execute a tiny static binary
> > * map special "stub" code/data pages at the topmost userspace address
> > (replacing its stack)
> > * continue execution inside the "stub" pages
> > * unmap everything below the "stub" pages
> > * use the unmap'ed area for userspace application mappings
> >
> > I believe that the "unmap everything" step will fail with this feature.
> >
> >
> > Now, I am sure one can come up with solutions, e.g.:
> > 1. Simply print an explanation if the unmap() fails
> > 2. Find an address that is guaranteed to be below the VDSO and use a
> > smaller address space for the UML userspace.
> > 3. Somehow tell the host kernel to not install the VDSO mappings
> > 4. Add the host VDSO pages as a sealed VMA within UML to guard them
> >
> > UML is a bit of a niche and I am not sure it is worth worrying about it
> > too much.
>
> I've been absent from this patch series in general, but this gave me
> an idea: what if we let userspace seal these mappings itself? Since
> glibc is already sealing things, it might as well seal these?
> And then systems that _do_ care about this would set the glibc tunable
> and deal with the breakage.
>
> Is there something seriously wrong with this approach? Besides maybe
> not having a super easy way to discover these mappings atm, I feel
> like it would solve all of the policy issues people have been talking
> about in these threads.
>
There are technical difficulties in sealing vdso/vvar from the glibc
side. The dynamic linker lacks vdso/vvar mapping size information, and
architectural variations for vdso/vvar also mean that sealing from the
kernel side is a simpler solution. Adhemerval has more details in case
clarification is needed from the glibc side.
Additionally, the uprobe mapping can't be sealed by the dynamic linker:
the dynamic linker can only apply sealing during execve() and dlopen(),
and the uprobe mapping isn't created during those two calls.
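For readers who have not used sealing from userspace, here is a minimal sketch (not from the thread) of why a later munmap() on a sealed range fails, which is the failure mode discussed for UML above. Assumptions: the fallback syscall number 462 is only valid on architectures such as x86-64 and arm64 with Linux 6.10 or later, and the helper names seal_range() and demo_sealed_unmap() are invented for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: older headers lack __NR_mseal; 462 is the number used on
 * x86-64 and arm64. Other architectures may differ. */
#ifndef __NR_mseal
#define __NR_mseal 462
#endif

/* Thin wrapper: returns 0 on success, -1 with errno set otherwise. */
static int seal_range(void *addr, size_t len)
{
    return (int)syscall(__NR_mseal, addr, len, 0UL);
}

/*
 * Map one anonymous page, seal it, then try to unmap it.
 * Returns  1 if the seal held (munmap was refused with EPERM),
 *          0 if mseal is unavailable here (old kernel, 32-bit, filtered),
 *         -1 on unexpected behaviour.
 */
static int demo_sealed_unmap(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return -1;
    if (seal_range(p, len) != 0) {
        munmap(p, len);     /* not sealed, so this cleanup is allowed */
        return 0;
    }
    if (munmap(p, len) == -1 && errno == EPERM)
        return 1;           /* sealed page stays mapped until exit */
    return -1;
}
```

A caller would treat a return value of 0 as "sealing not supported on this system" rather than as an error.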
-Jeff
> --
> Pedro
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-17 18:20 ` Jeff Xu
@ 2025-01-17 19:35 ` enh
2025-01-17 20:15 ` Jeff Xu
` (2 more replies)
0 siblings, 3 replies; 62+ messages in thread
From: enh @ 2025-01-17 19:35 UTC (permalink / raw)
To: Jeff Xu
Cc: Pedro Falcato, Benjamin Berg, Lorenzo Stoakes, Kees Cook, akpm,
jannh, torvalds, adhemerval.zanella, oleg, linux-kernel,
linux-hardening, linux-mm, jorgelo, sroettger, ojeda, adobriyan,
anna-maria, mark.rutland, linus.walleij, Jason, deller, rdunlap,
davem, hch, peterx, hca, f.fainelli, gerg, dave.hansen, mingo,
ardb, Liam.Howlett, mhocko, 42.hyeyoo, peterz, ardb, rientjes,
groeck, mpe, Vlastimil Babka, Andrei Vagin, Dmitry Safonov,
Mike Rapoport, Alexander Mikhalitsyn
On Fri, Jan 17, 2025 at 1:20 PM Jeff Xu <jeffxu@chromium.org> wrote:
>
> On Thu, Jan 16, 2025 at 9:18 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> >
> > On Thu, Jan 16, 2025 at 5:02 PM Benjamin Berg <benjamin@sipsolutions.net> wrote:
> > >
> > > Hi Lorenzo,
> > >
> > > On Thu, 2025-01-16 at 15:48 +0000, Lorenzo Stoakes wrote:
> > > > On Wed, Jan 15, 2025 at 12:20:59PM -0800, Jeff Xu wrote:
> > > > > On Wed, Jan 15, 2025 at 11:46 AM Lorenzo Stoakes
> > > > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > [SNIP]
> > > > >
> > > > > > I've made it abundantly clear that this (NACKed) series cannot allow the
> > > > > > kernel to be in a broken state even if a user sets flags to do so.
> > > > > >
> > > > > > This is because users might lack context to make this decision and
> > > > > > incorrectly do so, and now we ship a known-broken kernel.
> > > > > >
> > > > > > You are now suggesting disabling the !CRIU requirement. Which violates my
> > > > > > _requirements_ (not optional features).
> > > > > >
> > > > > Sure, I can add CRIU back.
> > > > >
> > > > > Are you fine with UML and gViso not working under this CONFIG ?
> > > > > UML/gViso doesn't use any KCONFIG like CRIU does.
> > > >
> > > > Yeah this is a concern, wouldn't we be able to catch UML with a flag?
> > > >
> > > > Apologies my fault for maybe not being totally up to date with this, but what
> > > > exactly was the gViso (is it gVisor actually?)
> > >
> > > UML is a separate architecture. It is a Linux kernel running as a
> > > userspace application on top of an unmodified host kernel.
> > >
> > > So really, UML is a mostly weird userspace program for the purpose of
> > > this discussion. And a pretty buggy one too--it got broken by rseq
> > > already.
> > >
> > > What UML now does is:
> > > * Execute a tiny static binary
> > > * map special "stub" code/data pages at the topmost userspace address
> > > (replacing its stack)
> > > * continue execution inside the "stub" pages
> > > * unmap everything below the "stub" pages
> > > * use the unmap'ed area for userspace application mappings
> > >
> > > I believe that the "unmap everything" step will fail with this feature.
> > >
> > >
> > > Now, I am sure one can come up with solutions, e.g.:
> > > 1. Simply print an explanation if the unmap() fails
> > > 2. Find an address that is guaranteed to be below the VDSO and use a
> > > smaller address space for the UML userspace.
> > > 3. Somehow tell the host kernel to not install the VDSO mappings
> > > 4. Add the host VDSO pages as a sealed VMA within UML to guard them
> > >
> > > UML is a bit of a niche and I am not sure it is worth worrying about it
> > > too much.
> >
> > I've been absent from this patch series in general, but this gave me
> > an idea: what if we let userspace seal these mappings itself? Since
> > glibc is already sealing things, it might as well seal these?
> > And then systems that _do_ care about this would set the glibc tunable
> > and deal with the breakage.
> >
> > Is there something seriously wrong with this approach? Besides maybe
> > not having a super easy way to discover these mappings atm, I feel
> > like it would solve all of the policy issues people have been talking
> > about in these threads.
> >
> There are technical difficulties to seal vdso/vvar from the glibc
> side. The dynamic linker lacks vdso/vvar mapping size information, and
> architectural variations for vdso/vvar also means sealing from the
> kernel side is a simpler solution. Adhemerval has more details in case
> clarification is needed from the glibc side.
as a maintainer of a different linux libc, i've long wanted a "tell me
everything there is to know about this vma" syscall rather than having
to parse /proc/maps...
...but in this special case, is the vdso/vvar size ever anything other
than "one page" in practice?
> Additionally, uprobe mapping can't be sealed by the dynamic linker,
> dynamic linker can only apply sealing during execve() and dlopen(),
> uprobe mapping isn't created during those two calls.
>
> -Jeff
>
>
> > --
> > Pedro
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-17 19:35 ` enh
@ 2025-01-17 20:15 ` Jeff Xu
2025-01-17 22:08 ` Liam R. Howlett
2025-02-06 13:20 ` Thomas Weißschuh
2 siblings, 0 replies; 62+ messages in thread
From: Jeff Xu @ 2025-01-17 20:15 UTC (permalink / raw)
To: enh
Cc: Pedro Falcato, Benjamin Berg, Lorenzo Stoakes, Kees Cook, akpm,
jannh, torvalds, adhemerval.zanella, oleg, linux-kernel,
linux-hardening, linux-mm, jorgelo, sroettger, ojeda, adobriyan,
anna-maria, mark.rutland, linus.walleij, Jason, deller, rdunlap,
davem, hch, peterx, hca, f.fainelli, gerg, dave.hansen, mingo,
ardb, Liam.Howlett, mhocko, 42.hyeyoo, peterz, ardb, rientjes,
groeck, mpe, Vlastimil Babka, Andrei Vagin, Dmitry Safonov,
Mike Rapoport, Alexander Mikhalitsyn
On Fri, Jan 17, 2025 at 11:35 AM enh <enh@google.com> wrote:
>
> On Fri, Jan 17, 2025 at 1:20 PM Jeff Xu <jeffxu@chromium.org> wrote:
> >
> > On Thu, Jan 16, 2025 at 9:18 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> > >
> > > On Thu, Jan 16, 2025 at 5:02 PM Benjamin Berg <benjamin@sipsolutions.net> wrote:
> > > >
> > > > Hi Lorenzo,
> > > >
> > > > On Thu, 2025-01-16 at 15:48 +0000, Lorenzo Stoakes wrote:
> > > > > On Wed, Jan 15, 2025 at 12:20:59PM -0800, Jeff Xu wrote:
> > > > > > On Wed, Jan 15, 2025 at 11:46 AM Lorenzo Stoakes
> > > > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > >
> > > > > [SNIP]
> > > > > >
> > > > > > > I've made it abundantly clear that this (NACKed) series cannot allow the
> > > > > > > kernel to be in a broken state even if a user sets flags to do so.
> > > > > > >
> > > > > > > This is because users might lack context to make this decision and
> > > > > > > incorrectly do so, and now we ship a known-broken kernel.
> > > > > > >
> > > > > > > You are now suggesting disabling the !CRIU requirement. Which violates my
> > > > > > > _requirements_ (not optional features).
> > > > > > >
> > > > > > Sure, I can add CRIU back.
> > > > > >
> > > > > > Are you fine with UML and gViso not working under this CONFIG ?
> > > > > > UML/gViso doesn't use any KCONFIG like CRIU does.
> > > > >
> > > > > Yeah this is a concern, wouldn't we be able to catch UML with a flag?
> > > > >
> > > > > Apologies my fault for maybe not being totally up to date with this, but what
> > > > > exactly was the gViso (is it gVisor actually?)
> > > >
> > > > UML is a separate architecture. It is a Linux kernel running as a
> > > > userspace application on top of an unmodified host kernel.
> > > >
> > > > So really, UML is a mostly weird userspace program for the purpose of
> > > > this discussion. And a pretty buggy one too--it got broken by rseq
> > > > already.
> > > >
> > > > What UML now does is:
> > > > * Execute a tiny static binary
> > > > * map special "stub" code/data pages at the topmost userspace address
> > > > (replacing its stack)
> > > > * continue execution inside the "stub" pages
> > > > * unmap everything below the "stub" pages
> > > > * use the unmap'ed area for userspace application mappings
> > > >
> > > > I believe that the "unmap everything" step will fail with this feature.
> > > >
> > > >
> > > > Now, I am sure one can come up with solutions, e.g.:
> > > > 1. Simply print an explanation if the unmap() fails
> > > > 2. Find an address that is guaranteed to be below the VDSO and use a
> > > > smaller address space for the UML userspace.
> > > > 3. Somehow tell the host kernel to not install the VDSO mappings
> > > > 4. Add the host VDSO pages as a sealed VMA within UML to guard them
> > > >
> > > > UML is a bit of a niche and I am not sure it is worth worrying about it
> > > > too much.
> > >
> > > I've been absent from this patch series in general, but this gave me
> > > an idea: what if we let userspace seal these mappings itself? Since
> > > glibc is already sealing things, it might as well seal these?
> > > And then systems that _do_ care about this would set the glibc tunable
> > > and deal with the breakage.
> > >
> > > Is there something seriously wrong with this approach? Besides maybe
> > > not having a super easy way to discover these mappings atm, I feel
> > > like it would solve all of the policy issues people have been talking
> > > about in these threads.
> > >
> > There are technical difficulties to seal vdso/vvar from the glibc
> > side. The dynamic linker lacks vdso/vvar mapping size information, and
> > architectural variations for vdso/vvar also means sealing from the
> > kernel side is a simpler solution. Adhemerval has more details in case
> > clarification is needed from the glibc side.
>
> as a maintainer of a different linux libc, i've long wanted a "tell me
> everything there is to know about this vma" syscall rather than having
> to parse /proc/maps...
>
That would be an interesting mm feature, i.e. querying the VMA information
for a given address. ASLR might be something to consider: there are
sandbox solutions, such as Landlock, that block reads of
/proc/pid/maps.
glibc's dynamic linker gets the mapping size from the ELF
header of the .so during the execve() call. In a previous attempt at
sealing the vdso from glibc, the size of vdso.so (in PT_LOAD) was found to
be inaccurate. To make things more difficult, the vvar size might
not be present, iiuc.
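To make the PT_LOAD point concrete, here is a minimal sketch (not from the thread, and not glibc's actual code) of computing the page-rounded extent of the PT_LOAD segments of an in-memory ELF image. The function name pt_load_span() is invented for illustration. Note that this only describes the vdso code image; it says nothing about vvar.

```c
#include <elf.h>
#include <link.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Page-rounded span of all PT_LOAD segments in an in-memory ELF image.
 * page_size must be a power of two.
 * Returns 0 if the image has no PT_LOAD segment.
 */
static size_t pt_load_span(const void *image, size_t page_size)
{
    const ElfW(Ehdr) *eh = image;
    const ElfW(Phdr) *ph =
        (const ElfW(Phdr) *)((const char *)image + eh->e_phoff);
    uintptr_t lo = UINTPTR_MAX, hi = 0;

    for (unsigned int i = 0; i < eh->e_phnum; i++) {
        if (ph[i].p_type != PT_LOAD)
            continue;
        if (ph[i].p_vaddr < lo)
            lo = ph[i].p_vaddr;
        if (ph[i].p_vaddr + ph[i].p_memsz > hi)
            hi = ph[i].p_vaddr + ph[i].p_memsz;
    }
    if (hi <= lo)
        return 0;
    lo &= ~(uintptr_t)(page_size - 1);                       /* round down */
    hi = (hi + page_size - 1) & ~(uintptr_t)(page_size - 1); /* round up */
    return hi - lo;
}
```

On a running system the image pointer would typically come from getauxval(AT_SYSINFO_EHDR) in <sys/auxv.h>, which may return 0 if no vdso is mapped.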
> ...but in this special case, is the vdso/vvar size ever anything other
> than "one page" in practice?
>
Yes. On x86, the vdso can be two pages long.
> > Additionally, uprobe mapping can't be sealed by the dynamic linker,
> > dynamic linker can only apply sealing during execve() and dlopen(),
> > uprobe mapping isn't created during those two calls.
> >
> > -Jeff
> >
> >
> > > --
> > > Pedro
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-17 19:35 ` enh
2025-01-17 20:15 ` Jeff Xu
@ 2025-01-17 22:08 ` Liam R. Howlett
2025-01-21 15:38 ` enh
2025-02-06 13:20 ` Thomas Weißschuh
2 siblings, 1 reply; 62+ messages in thread
From: Liam R. Howlett @ 2025-01-17 22:08 UTC (permalink / raw)
To: enh
Cc: Jeff Xu, Pedro Falcato, Benjamin Berg, Lorenzo Stoakes,
Kees Cook, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, mhocko, 42.hyeyoo, peterz, ardb,
rientjes, groeck, mpe, Vlastimil Babka, Andrei Vagin,
Dmitry Safonov, Mike Rapoport, Alexander Mikhalitsyn
* enh <enh@google.com> [250117 14:35]:
...
>
> as a maintainer of a different linux libc, i've long wanted a "tell me
> everything there is to know about this vma" syscall rather than having
> to parse /proc/maps...
>
You mean an ioctl()-based API to query VMAs from /proc/<pid>/maps?
Andrii had something like that [1], check out ed5d583a88a92 ("fs/procfs:
implement efficient VMA querying API for /proc/<pid>/maps")
Regards,
Liam
[1]. https://lore.kernel.org/linux-mm/20240627170900.1672542-1-andrii@kernel.org/
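A minimal sketch of what a lookup through that ioctl might look like from userspace (not from the thread). It assumes the installed <linux/fs.h> provides PROCMAP_QUERY and struct procmap_query (Linux 6.11 or later headers); when they are missing, the function compiles to a stub that reports failure. The names query_vma() and struct vma_range are invented for illustration.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

struct vma_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Look up the VMA covering addr in the calling process.
 * Returns 0 and fills *out on success, -1 otherwise (including when the
 * installed headers or the running kernel lack PROCMAP_QUERY).
 */
static int query_vma(uint64_t addr, struct vma_range *out)
{
#ifdef PROCMAP_QUERY
    struct procmap_query q;
    int fd, rc;

    fd = open("/proc/self/maps", O_RDONLY);
    if (fd < 0)
        return -1;
    memset(&q, 0, sizeof(q));
    q.size = sizeof(q);
    q.query_addr = addr;    /* query_flags == 0: covering VMA only */
    rc = ioctl(fd, PROCMAP_QUERY, &q);
    close(fd);
    if (rc < 0)
        return -1;
    out->start = q.vma_start;
    out->end = q.vma_end;
    return 0;
#else
    (void)addr;
    (void)out;
    return -1;
#endif
}
```

Unlike text parsing, this needs no line buffer and no dynamic allocation, which matches the motivation given earlier in the thread.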
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-17 22:08 ` Liam R. Howlett
@ 2025-01-21 15:38 ` enh
2025-01-22 17:23 ` Liam R. Howlett
0 siblings, 1 reply; 62+ messages in thread
From: enh @ 2025-01-21 15:38 UTC (permalink / raw)
To: Liam R. Howlett, enh, Jeff Xu, Pedro Falcato, Benjamin Berg,
Lorenzo Stoakes, Kees Cook, akpm, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem, hch,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb, mhocko,
42.hyeyoo, peterz, ardb, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn
On Fri, Jan 17, 2025 at 5:08 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * enh <enh@google.com> [250117 14:35]:
> ...
>
> >
> > as a maintainer of a different linux libc, i've long wanted a "tell me
> > everything there is to know about this vma" syscall rather than having
> > to parse /proc/maps...
> >
>
> You mean an ioctl()-based API to query VMAs from /proc/<pid>/maps?
i wasn't imagining an ioctl(), no, just a regular syscall, but that
would work too.
> Andrii had something like that [1], check out ed5d583a88a92 ("fs/procfs:
> implement efficient VMA querying API for /proc/<pid>/maps")
yeah, that would work for the use cases i've seen too (some of which
are similar to the ones mentioned in the patch description, but other
ones too).
the other motivation we've had that i didn't notice mentioned there is
avoiding the awkward /proc/<pid>/maps behavior when you have too many
vmas to fit all the output into a page.
i'd definitely use this in Android's libc, and several of our
profiling/unwinding libraries.
> Regards,
> Liam
>
> [1]. https://lore.kernel.org/linux-mm/20240627170900.1672542-1-andrii@kernel.org/
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-21 15:38 ` enh
@ 2025-01-22 17:23 ` Liam R. Howlett
2025-01-22 22:29 ` enh
0 siblings, 1 reply; 62+ messages in thread
From: Liam R. Howlett @ 2025-01-22 17:23 UTC (permalink / raw)
To: enh
Cc: Jeff Xu, Pedro Falcato, Benjamin Berg, Lorenzo Stoakes,
Kees Cook, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, mhocko, 42.hyeyoo, peterz, ardb,
rientjes, groeck, mpe, Vlastimil Babka, Andrei Vagin,
Dmitry Safonov, Mike Rapoport, Alexander Mikhalitsyn
* enh <enh@google.com> [250121 10:38]:
> On Fri, Jan 17, 2025 at 5:08 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * enh <enh@google.com> [250117 14:35]:
> > ...
> >
> > >
> > > as a maintainer of a different linux libc, i've long wanted a "tell me
> > > everything there is to know about this vma" syscall rather than having
> > > to parse /proc/maps...
> > >
> >
> > You mean an ioctl()-based API to query VMAs from /proc/<pid>/maps?
>
> i wasn't imagining an ioctl(), no, just a regular syscall, but that
> would work too.
>
> > Andrii had something like that [1], check out ed5d583a88a92 ("fs/procfs:
> > implement efficient VMA querying API for /proc/<pid>/maps")
>
> yeah, that would work for the use cases i've seen too (some of which
> are similar to the ones mentioned in the patch description, but other
> ones too).
>
> the other motivation we've had that i didn't notice mentioned there is
> avoiding the awkward /proc/<pid>/maps behavior when you have too many
> vmas to fit all the output into a page.
We are making changes in this area. Can you elaborate on the 'awkward'
part? The locking or the tearing of data?
Since you mention the output not fitting into a page, I take it the
tearing is the issue? I think the ioctl would be worse for the possible
tearing of data.
Thanks,
Liam
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-22 17:23 ` Liam R. Howlett
@ 2025-01-22 22:29 ` enh
2025-01-23 8:40 ` Vlastimil Babka
0 siblings, 1 reply; 62+ messages in thread
From: enh @ 2025-01-22 22:29 UTC (permalink / raw)
To: Liam R. Howlett, enh, Jeff Xu, Pedro Falcato, Benjamin Berg,
Lorenzo Stoakes, Kees Cook, akpm, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem, hch,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb, mhocko,
42.hyeyoo, peterz, ardb, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn
On Wed, Jan 22, 2025 at 12:24 PM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
> * enh <enh@google.com> [250121 10:38]:
> > On Fri, Jan 17, 2025 at 5:08 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * enh <enh@google.com> [250117 14:35]:
> > > ...
> > >
> > > >
> > > > as a maintainer of a different linux libc, i've long wanted a "tell me
> > > > everything there is to know about this vma" syscall rather than having
> > > > to parse /proc/maps...
> > > >
> > >
> > > You mean an ioctl()-based API to query VMAs from /proc/<pid>/maps?
> >
> > i wasn't imagining an ioctl(), no, just a regular syscall, but that
> > would work too.
> >
> > > Andrii had something like that [1], check out ed5d583a88a92 ("fs/procfs:
> > > implement efficient VMA querying API for /proc/<pid>/maps")
> >
> > yeah, that would work for the use cases i've seen too (some of which
> > are similar to the ones mentioned in the patch description, but other
> > ones too).
> >
> > the other motivation we've had that i didn't notice mentioned there is
> > avoiding the awkward /proc/<pid>/maps behavior when you have too many
> > vmas to fit all the output into a page.
>
> We are making changes in this area. Can you elaborate on the 'awkward'
> part? The locking or the tearing of data?
>
> The way you state the page, it makes me think it's the tearing that is
> the issue? I think the ioctl would be worse for the possible tearing of
> data.
there are two main styles of use of the /proc/<pid>/maps data i've
seen in userspace:
1. i need to know about _all_ the vmas, so i'm reading the whole file
and stitching it together.
2. i need to know about the vma containing a specific address, so i'm
reading line-by-line looking for a match.
the former is typically just for humans anyway (to be added to a crash
report) -- so the ability to sendfile() or whatever would be something
we could use there (though i've no idea whether that actually solves
the tearing issue) -- so those use cases mostly tend to ignore (or not
notice) the problem.
for the latter, tearing is a bigger problem because what does "not
found" mean? do we try again? how many times do we try? so i think the
ioctl() would solve those problems. plus it would be a lot cheaper,
and in particular not require [or at least "strongly benefit from"]
dynamic memory allocation like parsing a text file does.
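A minimal sketch of the second style (line-by-line search for the vma containing an address), not taken from any of the libraries mentioned. The function names are invented for illustration. It uses a fixed stack buffer to avoid allocation, on the assumption that a truncated tail of an overlong line will normally fail to parse as a range.

```c
#include <stdio.h>

/*
 * Parse the "start-end" prefix of one /proc/<pid>/maps line.
 * Returns 1 if the line parsed and addr lies in [start, end), else 0.
 */
static int maps_line_covers(const char *line, unsigned long addr,
                            unsigned long *start, unsigned long *end)
{
    if (sscanf(line, "%lx-%lx", start, end) != 2)
        return 0;
    return *start <= addr && addr < *end;
}

/*
 * Scan a maps file line by line for the VMA containing addr.
 * Returns 1 if found (*start and *end are then valid), 0 if not found,
 * -1 if the file cannot be opened.
 * A 0 result is ambiguous while another thread is changing mappings.
 */
static int find_vma(const char *path, unsigned long addr,
                    unsigned long *start, unsigned long *end)
{
    char line[4096];
    FILE *f = fopen(path, "r");
    int found = 0;

    if (!f)
        return -1;
    while (!found && fgets(line, sizeof(line), f))
        found = maps_line_covers(line, addr, start, end);
    fclose(f);
    return found;
}
```

A typical call would be find_vma("/proc/self/maps", (unsigned long)ptr, &s, &e). The "not found" case is exactly where the retry question raised above comes up.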
> Thanks,
> Liam
>
>
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-22 22:29 ` enh
@ 2025-01-23 8:40 ` Vlastimil Babka
2025-01-23 21:50 ` enh
0 siblings, 1 reply; 62+ messages in thread
From: Vlastimil Babka @ 2025-01-23 8:40 UTC (permalink / raw)
To: enh, Liam R. Howlett, Jeff Xu, Pedro Falcato, Benjamin Berg,
Lorenzo Stoakes, Kees Cook, akpm, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem, hch,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb, mhocko,
42.hyeyoo, peterz, ardb, rientjes, groeck, mpe, Andrei Vagin,
Dmitry Safonov, Mike Rapoport, Alexander Mikhalitsyn
On 1/22/25 23:29, enh wrote:
> On Wed, Jan 22, 2025 at 12:24 PM Liam R. Howlett
> <Liam.Howlett@oracle.com> wrote:
>>
>>
>> We are making changes in this area. Can you elaborate on the 'awkward'
>> part? The locking or the tearing of data?
>>
>> The way you state the page, it makes me think it's the tearing that is
>> the issue? I think the ioctl would be worse for the possible tearing of
>> data.
>
> there are two main styles of use of the /proc/<pid>/maps data i've
> seen in userspace:
>
> 1. i need to know about _all_ the vmas, so i'm reading the whole file
> and stitching it together.
> 2. i need to know about the vma containing a specific address, so i'm
> reading line-by-line looking for a match.
>
> the former is typically just for humans anyway (to be added to a crash
> report) -- so the ability to sendfile() or whatever would be something
> we could use there (though i've no idea whether that actually solves
> the tearing issue) -- so those use cases mostly tend to ignore (or not
> notice) the problem.
>
> for the latter, tearing is a bigger problem because what does "not
> found" mean? do we try again? how many times do we try? so i think the
> ioctl() would solve those problems. plus it would be a lot cheaper,
> and in particular not require [or at least "strongly benefit from"]
> dynamic memory allocation like parsing a text file does.
IIUC tearing can only happen in case of parallel changes (mmap/munmap etc)
by another thread, but if a change does not affect the VMA in question it
shouldn't lead to skipping it. If it does affect it, the query would be
inherently racy anyway.
One corner case I can think of, however, is when a modification of a previous
vma leads to merging with the vma we are querying. Even then, IIRC, the races
could mean the previous vma is still reported as separate, but then
followed by the merged result, so the virtual address range corresponding to
the previous vma would appear twice in the report, rather than a range being
skipped.
But if you have examples of a different experience, i.e. where a vma would
indeed be missing, we would be eager to hear that!
I think the ioctl interface lets you "seek" to the address of interest
quickly, but I believe it also can't eliminate these racing issues completely.
>> Thanks,
>> Liam
>>
>>
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-23 8:40 ` Vlastimil Babka
@ 2025-01-23 21:50 ` enh
2025-01-23 22:38 ` Matthew Wilcox
0 siblings, 1 reply; 62+ messages in thread
From: enh @ 2025-01-23 21:50 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Liam R. Howlett, Jeff Xu, Pedro Falcato, Benjamin Berg,
Lorenzo Stoakes, Kees Cook, akpm, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem, hch,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb, mhocko,
42.hyeyoo, peterz, ardb, rientjes, groeck, mpe, Andrei Vagin,
Dmitry Safonov, Mike Rapoport, Alexander Mikhalitsyn,
Christopher Ferris
On Thu, Jan 23, 2025 at 3:40 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/22/25 23:29, enh wrote:
> > On Wed, Jan 22, 2025 at 12:24 PM Liam R. Howlett
> > <Liam.Howlett@oracle.com> wrote:
> >>
> >>
> >> We are making changes in this area. Can you elaborate on the 'awkward'
> >> part? The locking or the tearing of data?
> >>
> >> The way you state the page, it makes me think it's the tearing that is
> >> the issue? I think the ioctl would be worse for the possible tearing of
> >> data.
> >
> > there are two main styles of use of the /proc/<pid>/maps data i've
> > seen in userspace:
> >
> > 1. i need to know about _all_ the vmas, so i'm reading the whole file
> > and stitching it together.
> > 2. i need to know about the vma containing a specific address, so i'm
> > reading line-by-line looking for a match.
> >
> > the former is typically just for humans anyway (to be added to a crash
> > report) -- so the ability to sendfile() or whatever would be something
> > we could use there (though i've no idea whether that actually solves
> > the tearing issue) -- so those use cases mostly tend to ignore (or not
> > notice) the problem.
> >
> > for the latter, tearing is a bigger problem because what does "not
> > found" mean? do we try again? how many times do we try? so i think the
> > ioctl() would solve those problems. plus it would be a lot cheaper,
> > and in particular not require [or at least "strongly benefit from"]
> > dynamic memory allocation like parsing a text file does.
>
> IIUC tearing can only happen in place of parallel changes (mmap/munmap etc)
> by another thread, but if it does not affect the VMA in question it
> shouldn't lead to skipping it. If it does affect it, the query would be
> inherently racy anyway.
yeah, at this point i should (a) drag in +cferris who may have actual
experience of this and (b) admit that iirc i've never personally seen
_evidence_ of this, just claims. most famously in the chrome source...
if you `grep -r /proc/.*/maps` you'll find lots of examples, but
something like https://chromium.googlesource.com/chromium/src/+/main/base/debug/proc_maps_linux.h#61
is quite representative of the "folklore" in this area.
> One corner case I can however think of is a modification of a previous vma
> leads to merging with the vma we are querying but even then IIRC the races
> could mean the previous vma could be reported still as separate, but then
> followed by the merged result, so the virtual address range corresponding of
> the previous vma would appear twice in the report, not that a range would be
> skipped.
>
> But if you have examples of a different experience, i.e. where a vma would
> indeed be missing, we would be eager to hear that!
>
> I think the ioctl interface lets you "seek" to the address of interest
> quickly, but I believe it also can't eliminate these racing issues completely.
i wasn't thinking of changing the "i inherently need to read the whole
thing" uses out for the ioctl(), just the "i need to know more about
the vma containing this specific address" cases.
> >> Thanks,
> >> Liam
> >>
> >>
>
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-23 21:50 ` enh
@ 2025-01-23 22:38 ` Matthew Wilcox
2025-02-06 14:19 ` enh
0 siblings, 1 reply; 62+ messages in thread
From: Matthew Wilcox @ 2025-01-23 22:38 UTC (permalink / raw)
To: enh
Cc: Vlastimil Babka, Liam R. Howlett, Jeff Xu, Pedro Falcato,
Benjamin Berg, Lorenzo Stoakes, Kees Cook, akpm, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem, hch,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb, mhocko,
42.hyeyoo, peterz, ardb, rientjes, groeck, mpe, Andrei Vagin,
Dmitry Safonov, Mike Rapoport, Alexander Mikhalitsyn,
Christopher Ferris
On Thu, Jan 23, 2025 at 04:50:46PM -0500, enh wrote:
> yeah, at this point i should (a) drag in +cferris who may have actual
> experience of this and (b) admit that iirc i've never personally seen
> _evidence_ of this, just claims. most famously in the chrome source...
> if you `grep -r /proc/.*/maps` you'll find lots of examples, but
> something like https://chromium.googlesource.com/chromium/src/+/main/base/debug/proc_maps_linux.h#61
> is quite representative of the "folklore" in this area.
That folklore is 100% based on a true story! I'm not sure that all of
the details are precisely correct, but it's true enough that I wouldn't
quibble with it.
In fact, we want to make it worse. Because the mmap_lock is such a
huge point of contention, we want to read /proc/PID/maps protected
only by RCU. That will relax the guarantees to:
a. If a VMA existed and was not modified during the duration of the
   read, it will definitely be returned.
b. If a VMA was added during the call, it might be returned.
c. If a VMA was removed during the call, it might be returned.
d. If an address was covered by a VMA before the call and that
   VMA was modified during the call, you might get the prior or
   posterior state of the VMA. And you might get both!
What might be confusing:
e. If VMA A is added, then VMA B is added, your call might show you VMA
   B and not VMA A.
f. Similarly for deleted.
g. If you have, say, a VMA from (4000-9000) and you mprotect the region
   (5000-6000), you might see:
     4000-9000 oldA
   or
     4000-5000 newA
     4000-9000 oldA
   or
     4000-5000 newA
     5000-6000 newB
     4000-9000 oldA
   or
     4000-5000 newA
     5000-6000 newB
     6000-9000 newC
(it's possible other combinations might be visible; i'm not working on
the details of this right now)
We shouldn't be able to _skip_ a VMA. That seems far worse than
returning duplicates; if your maps parser sees duplicates it can either
try to figure it out itself, or retry the whole read.
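A minimal sketch of the "detect duplicates, then retry" option for a maps parser (not from the thread). It assumes the parser has already collected the ranges in the order they were read; struct map_entry and maps_snapshot_consistent() are invented names. It can only notice overlapping or out-of-order entries, not VMAs that were added or removed during the read.

```c
#include <stddef.h>

struct map_entry {
    unsigned long start;
    unsigned long end;
};

/*
 * Returns 1 if the entries, in the order they were read, are ascending
 * and non-overlapping; 0 if any entry starts before the previous one
 * ended, which suggests the snapshot mixes old and new VMA state.
 */
static int maps_snapshot_consistent(const struct map_entry *v, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (v[i].start < v[i - 1].end)
            return 0;
    return 1;
}
```

With the mprotect example above, the sequence 4000-5000, 4000-9000 is flagged as inconsistent, while 4000-5000, 5000-6000, 6000-9000 passes. A caller could re-read the file a bounded number of times when the check fails.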
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-03 20:48 ` Liam R. Howlett
2025-01-07 1:17 ` Kees Cook
@ 2025-02-04 18:17 ` Johannes Berg
1 sibling, 0 replies; 62+ messages in thread
From: Johannes Berg @ 2025-02-04 18:17 UTC (permalink / raw)
To: Liam R. Howlett, Kees Cook, jeffxu, akpm, keescook, jannh,
torvalds, adhemerval.zanella, oleg, linux-kernel,
linux-hardening, linux-mm, jorgelo, sroettger, ojeda, adobriyan,
anna-maria, mark.rutland, linus.walleij, Jason, deller, rdunlap,
davem, hch, peterx, hca, f.fainelli, gerg, dave.hansen, mingo,
ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck,
mpe, Vlastimil Babka, Lorenzo Stoakes, Andrei Vagin,
Dmitry Safonov, Mike Rapoport, Alexander Mikhalitsyn
On Fri, 2025-01-03 at 15:48 -0500, Liam R. Howlett wrote:
>
> So we have at least two userspace uses that this will breaks: checkpoint
> restore and now gVisor, but who knows what else?
I believe we previously pointed out it might also break running the
ARCH=um kernel:
https://lore.kernel.org/all/2e5de601da34342d8eb0d8319dcf81ff213c7ef0.camel@sipsolutions.net/
johannes
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-17 19:35 ` enh
2025-01-17 20:15 ` Jeff Xu
2025-01-17 22:08 ` Liam R. Howlett
@ 2025-02-06 13:20 ` Thomas Weißschuh
2025-02-06 14:38 ` enh
2 siblings, 1 reply; 62+ messages in thread
From: Thomas Weißschuh @ 2025-02-06 13:20 UTC (permalink / raw)
To: enh
Cc: Jeff Xu, Pedro Falcato, Benjamin Berg, Lorenzo Stoakes,
Kees Cook, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn
On Fri, Jan 17, 2025 at 02:35:18PM -0500, enh wrote:
> On Fri, Jan 17, 2025 at 1:20 PM Jeff Xu <jeffxu@chromium.org> wrote:
<snip>
> > There are technical difficulties to seal vdso/vvar from the glibc
> > side. The dynamic linker lacks vdso/vvar mapping size information, and
> > architectural variations for vdso/vvar also mean sealing from the
> > kernel side is a simpler solution. Adhemerval has more details in case
> > clarification is needed from the glibc side.
>
> as a maintainer of a different linux libc, i've long wanted a "tell me
> everything there is to know about this vma" syscall rather than having
> to parse /proc/maps...
>
> ...but in this special case, is the vdso/vvar size ever anything other
> than "one page" in practice?
x86 has two additional vvar pages for virtual clocks.
(Since v6.13 even split into their own mapping)
Loongarch has per-cpu vvar data which is larger than one page.
The vdso mapping is however many pages the code ends up being compiled as,
for example on my current x86_64 distro kernel it's two pages.
In the near future, probably v6.14, vvars will be split over multiple
pages in general [0].
Figuring out the start and size from /proc/maps, or the new
PROCMAP_QUERY ioctl, is not trivial, due to architectural variations.
Trying to construct the size from the ELF header is also problematic as
that only contains information about the vdso code.
The vvars are mapped before the code in memory independently.
A dedicated interface like a prctl() would be actually reliable.
Or theoretically a function from the vdso itself.
<snip>
[0] https://lore.kernel.org/lkml/20250204-vdso-store-rng-v3-0-13a4669dfc8c@linutronix.de/
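To make the point about sizes concrete, here is a minimal sketch of the text-parsing approach, assuming the usual "start-end perms offset dev inode name" layout of the maps file. The helper name special_mappings, the sample addresses, and the rule "any bracketed name starting with [vvar" are illustrative assumptions, not a kernel-defined interface. It shows why a parser has to accept more than one vvar-like entry and more than one page per entry.

```python
import re

# One maps line: "start-end perms offset dev inode [name]".
MAPS_LINE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S+)\s+[0-9a-f]+\s+\S+\s+\d+\s*(.*)$"
)


def special_mappings(maps_text, page_size=4096):
    """Collect [vdso] and [vvar...] entries as (start, end, pages) tuples.

    page_size defaults to 4 KiB, which is an assumption; on a live system
    os.sysconf("SC_PAGE_SIZE") reports the real value.
    """
    found = {}
    for line in maps_text.splitlines():
        m = MAPS_LINE.match(line.strip())
        if not m:
            continue
        start, end = int(m.group(1), 16), int(m.group(2), 16)
        name = m.group(4).strip()
        # Assumption: vvar-like mappings all carry a name starting with "[vvar".
        if name == "[vdso]" or name.startswith("[vvar"):
            pages = (end - start) // page_size
            found.setdefault(name, []).append((start, end, pages))
    return found


# Made-up sample lines, not captured from a real system.
SAMPLE = """\
7ffc1a5f2000-7ffc1a5f6000 r--p 00000000 00:00 0                          [vvar]
7ffc1a5f6000-7ffc1a5f8000 r-xp 00000000 00:00 0                          [vdso]
"""

result = special_mappings(SAMPLE)
```

On a live system the same function could be fed the contents of /proc/self/maps. Even then it only reports what the file says at read time; it does not know which entries the vDSO code actually depends on, which is the gap a dedicated interface would close.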
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-01-23 22:38 ` Matthew Wilcox
@ 2025-02-06 14:19 ` enh
0 siblings, 0 replies; 62+ messages in thread
From: enh @ 2025-02-06 14:19 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Vlastimil Babka, Liam R. Howlett, Jeff Xu, Pedro Falcato,
Benjamin Berg, Lorenzo Stoakes, Kees Cook, akpm, jannh, torvalds,
adhemerval.zanella, oleg, linux-kernel, linux-hardening,
linux-mm, jorgelo, sroettger, ojeda, adobriyan, anna-maria,
mark.rutland, linus.walleij, Jason, deller, rdunlap, davem, hch,
peterx, hca, f.fainelli, gerg, dave.hansen, mingo, ardb, mhocko,
42.hyeyoo, peterz, ardb, rientjes, groeck, mpe, Andrei Vagin,
Dmitry Safonov, Mike Rapoport, Alexander Mikhalitsyn,
Christopher Ferris
On Thu, Jan 23, 2025 at 5:38 PM Matthew Wilcox <willy@infradead.org> wrote:
[heh, long time no see! haven't been on an email thread with you in a
while :-) ]
> On Thu, Jan 23, 2025 at 04:50:46PM -0500, enh wrote:
> > yeah, at this point i should (a) drag in +cferris who may have actual
> > experience of this and (b) admit that iirc i've never personally seen
> > _evidence_ of this, just claims. most famously in the chrome source...
> > if you `grep -r /proc/.*/maps` you'll find lots of examples, but
> > something like https://chromium.googlesource.com/chromium/src/+/main/base/debug/proc_maps_linux.h#61
> > is quite representative of the "folklore" in this area.
>
> That folklore is 100% based on a true story! I'm not sure that all of
> the details are precisely correct, but it's true enough that I wouldn't
> quibble with it.
>
> In fact, we want to make it worse. Because the mmap_lock is such a
> huge point of contention, we want to read /proc/PID/maps protected
> only by RCU. That will relax the guarantees to:
>
> a. If a VMA existed and was not modified during the duration of the
> read, it will definitely be returned.
> b. If a VMA was added during the call, it might be returned.
> c. If a VMA was removed during the call, it might be returned.
> d. If an address was covered by a VMA before the call and that
> VMA was modified during the call, you might get the prior or
> posterior state of the VMA. And you might get both!
>
> What might be confusing:
>
> e. If VMA A is added, then VMA B is added, your call might show you VMA
> B and not VMA A.
> f. Similarly for deleted.
> g. If you have, say, a VMA from (4000-9000) and you mprotect the region
> (5000-6000), you might see:
> 4000-9000 oldA
> or
> 4000-5000 newA
> 4000-9000 oldA
> or
> 4000-5000 newA
> 5000-6000 newB
> 4000-9000 oldA
> or
> 4000-5000 newA
> 5000-6000 newB
> 6000-9000 newC
>
> (it's possible other combinations might be visible; i'm not working on
> the details of this right now)
>
> We shouldn't be able to _skip_ a VMA. That seems far worse than
> returning duplicates; if your maps parser sees duplicates it can either
> try to figure it out itself, or retry the whole read.
yeah, fwiw i don't think i've seen a case where a duplicate would
matter --- half the code i've seen ["tell me more about the VMA
containing this address"] would just stop at the first match anyway
(though that's exactly the case where i'd rather have "direct access"
than have to search), and the other half ["give me a snapshot of all
the VMAs for offline debugging purposes"] doesn't really bother with
interpretation and leaves that up to humans.
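As a rough illustration of the two usage patterns above under the relaxed guarantees, here is a sketch of a parser that stops at the first VMA containing an address and can also flag overlapping entries as a hint to retry the read. The function names and the sample lines (modelled on the 4000-9000 mprotect example quoted above) are hypothetical.

```python
def parse_maps(maps_text):
    """Parse maps-style text into (start, end, perms, name) tuples, in read order."""
    vmas = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start_s, _, end_s = fields[0].partition("-")
        if not end_s:
            continue
        name = fields[5].strip() if len(fields) > 5 else ""
        vmas.append((int(start_s, 16), int(end_s, 16), fields[1], name))
    return vmas


def first_vma_containing(vmas, addr):
    """Return the first entry covering addr, ignoring any later duplicates."""
    for vma in vmas:
        if vma[0] <= addr < vma[1]:
            return vma
    return None


def has_overlaps(vmas):
    """True if any two entries overlap, hinting that the read raced with a change."""
    prev_end = 0
    for start, end, _, _ in sorted(vmas):
        if start < prev_end:
            return True
        prev_end = max(prev_end, end)
    return False


# Made-up read showing both the new and the old state of a split VMA.
RACY = """\
4000-5000 r--p 00000000 00:00 0
5000-6000 rw-p 00000000 00:00 0
4000-9000 r--p 00000000 00:00 0
"""

racy_vmas = parse_maps(RACY)
hit = first_vma_containing(racy_vmas, 0x5800)
```

A caller that only wants "the VMA for this address" can use the first match directly. A caller that wants a consistent snapshot could re-read the file while has_overlaps() returns True, ideally with a retry limit.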
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-02-06 13:20 ` Thomas Weißschuh
@ 2025-02-06 14:38 ` enh
2025-02-06 15:28 ` Thomas Weißschuh
0 siblings, 1 reply; 62+ messages in thread
From: enh @ 2025-02-06 14:38 UTC (permalink / raw)
To: Thomas Weißschuh
Cc: Jeff Xu, Pedro Falcato, Benjamin Berg, Lorenzo Stoakes,
Kees Cook, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn
On Thu, Feb 6, 2025 at 8:20 AM Thomas Weißschuh
<thomas.weissschuh@linutronix.de> wrote:
>
> On Fri, Jan 17, 2025 at 02:35:18PM -0500, enh wrote:
> > On Fri, Jan 17, 2025 at 1:20 PM Jeff Xu <jeffxu@chromium.org> wrote:
>
> <snip>
>
> > > There are technical difficulties to seal vdso/vvar from the glibc
> > > side. The dynamic linker lacks vdso/vvar mapping size information, and
> > > architectural variations for vdso/vvar also mean sealing from the
> > > kernel side is a simpler solution. Adhemerval has more details in case
> > > clarification is needed from the glibc side.
> >
> > as a maintainer of a different linux libc, i've long wanted a "tell me
> > everything there is to know about this vma" syscall rather than having
> > to parse /proc/maps...
> >
> > ...but in this special case, is the vdso/vvar size ever anything other
> > than "one page" in practice?
>
> x86 has two additional vvar pages for virtual clocks.
> (Since v6.13 even split into their own mapping)
> Loongarch has per-cpu vvar data which is larger than one page.
> The vdso mapping is however many pages the code ends up being compiled as,
> for example on my current x86_64 distro kernel it's two pages.
> In the near future, probably v6.14, vvars will be split over multiple
> pages in general [0].
/me checks the nearest arm64 phone ... yeah, vdso is still only one
page there but vvars is already more than one.
is there a TL;DR (or RTFM link) for why this is so big? a quick look
at the x86 suggests there should only be 640 bytes of various things
plus a handful of bytes for the rng, and while arm64 looks very
different, that looks like it's explicitly asking for a page (with the
vdso_data_store stuff)? (i've never had any reason to look at vvars
before, only vdso.)
> Figuring out the start and size from /proc/maps, or the new
> PROCMAP_QUERY ioctl, is not trivial, due to architectural variations.
(obviously it's unsatisfying as a general interface, but in practice
the VMAs i see asked about directly -- rather than just rounded
up in a diagnostic dump -- are either stacks ["what are the bounds of
this stack, and does it have guard pages already?"] or code ["what
file was the code at this pc mapped in from?"]. so while the vdso
would come up, we'd never notice if vvars didn't work. if your sp/pc
point there, we were already just going to bail anyway :-) )
> Trying to construct the size from the ELF header is also problematic as
> that only contains information about the vdso code.
> The vvars are mapped before the code in memory independently.
>
> A dedicated interface like a prctl() would be actually reliable.
> Or theoretically a function from the vdso itself.
>
> <snip>
>
> [0] https://lore.kernel.org/lkml/20250204-vdso-store-rng-v3-0-13a4669dfc8c@linutronix.de/
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-02-06 14:38 ` enh
@ 2025-02-06 15:28 ` Thomas Weißschuh
2025-02-06 15:51 ` enh
0 siblings, 1 reply; 62+ messages in thread
From: Thomas Weißschuh @ 2025-02-06 15:28 UTC (permalink / raw)
To: enh
Cc: Jeff Xu, Pedro Falcato, Benjamin Berg, Lorenzo Stoakes,
Kees Cook, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn
On Thu, Feb 06, 2025 at 09:38:59AM -0500, enh wrote:
> On Thu, Feb 6, 2025 at 8:20 AM Thomas Weißschuh
> <thomas.weissschuh@linutronix.de> wrote:
> >
> > On Fri, Jan 17, 2025 at 02:35:18PM -0500, enh wrote:
> > > On Fri, Jan 17, 2025 at 1:20 PM Jeff Xu <jeffxu@chromium.org> wrote:
> >
> > <snip>
> >
> > > > There are technical difficulties to seal vdso/vvar from the glibc
> > > > side. The dynamic linker lacks vdso/vvar mapping size information, and
> > > > architectural variations for vdso/vvar also mean sealing from the
> > > > kernel side is a simpler solution. Adhemerval has more details in case
> > > > clarification is needed from the glibc side.
> > >
> > > as a maintainer of a different linux libc, i've long wanted a "tell me
> > > everything there is to know about this vma" syscall rather than having
> > > to parse /proc/maps...
> > >
> > > ...but in this special case, is the vdso/vvar size ever anything other
> > > than "one page" in practice?
> >
> > x86 has two additional vvar pages for virtual clocks.
> > (Since v6.13 even split into their own mapping)
> > Loongarch has per-cpu vvar data which is larger than one page.
> > The vdso mapping is however many pages the code ends up being compiled as,
> > for example on my current x86_64 distro kernel it's two pages.
> > In the near future, probably v6.14, vvars will be split over multiple
> > pages in general [0].
>
> /me checks the nearest arm64 phone ... yeah, vdso is still only one
> page there but vvars is already more than one.
Probably due to CONFIG_TIME_NS, see below.
> is there a TL;DR (or RTFM link) for why this is so big? a quick look
> at the x86 suggests there should only be 640 bytes of various things
> plus a handful of bytes for the rng, and while arm64 looks very
> different, that looks like it's explicitly asking for a page (with the
> vdso_data_store stuff)? (i've never had any reason to look at vvars
> before, only vdso.)
I don't think there is any real manual.
The vvar data is *shared* between the kernel and userspace.
This is done by mapping the *same* physical memory into the kernel
("vdso_data_store") and (read-only) into all userspace processes.
As PTEs always cover a full page and the kernel can not expose random
other internal kernel data into userspace, the vvars need to be in their
own dedicated page.
(The same is true for the vDSO code, uprobe trampoline, etc... mappings)
The vDSO functions also need to be aware of time namespaces. This is
implemented by allocating one page per namespace and mapping this
in place of the regular vvar page. But the vDSO still needs to access
the regular vvar page for some information, so both are mapped.
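The arithmetic behind "a few hundred bytes still costs whole pages" can be sketched as follows. The 640-byte figure is the one quoted earlier in this thread; the 4 KiB page size, the region names, and the size of the namespace data are assumptions for illustration only.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages; other page sizes exist


def pages_needed(nbytes, page_size=PAGE_SIZE):
    """Round a byte count up to whole pages (ceiling division)."""
    return (nbytes + page_size - 1) // page_size


# Hypothetical regions. Each gets its own page because a PTE maps a whole
# page and nothing unrelated may share it.
regions = {
    "time data": 640,
    "time namespace data": 640,
}

total_pages = sum(pages_needed(size) for size in regions.values())
```

With these numbers two pages are mapped even though well under one page of data is in use, which matches the observation that vvars already span more than one page.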
Then on top come the rng state and some architecture-specific data.
These are currently part of the time page. So they also have to dance
around the time namespace mapping shenanigans. In addition they have to
coexist with the actual time data, which is currently done by manually
calculating byte offsets for them in the time page and hardcoding those.
The linked series cleans this up by moving things into dedicated pages,
which makes the code easier to understand and makes it possible to
add new data to the time page without running out of space or
introducing conflicts which need to be detected manually.
While this needs to allocate more pages, these are shared between the
whole system, so effectively it's cheap. It also requires more virtual
memory space in each process, but that shouldn't matter.
As for arm64 looking very different from x86: Hopefully not for long :-)
> > Figuring out the start and size from /proc/maps, or the new
> > PROCMAP_QUERY ioctl, is not trivial, due to architectural variations.
>
> (obviously it's unsatisfying as a general interface, but in practice
> the VMAs i see asked about directly -- rather than just rounded
> up in a diagnostic dump -- are either stacks ["what are the bounds of
> this stack, and does it have guard pages already?"] or code ["what
> file was the code at this pc mapped in from?"]. so while the vdso
> would come up, we'd never notice if vvars didn't work. if your sp/pc
> point there, we were already just going to bail anyway :-) )
Fair enough.
This information was also a response to Jeff's parent mail,
as it would be relevant when sealing the mappings from ld.so.
<snip>
> > [0] https://lore.kernel.org/lkml/20250204-vdso-store-rng-v3-0-13a4669dfc8c@linutronix.de/
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-02-06 15:28 ` Thomas Weißschuh
@ 2025-02-06 15:51 ` enh
2025-02-06 16:37 ` Thomas Weißschuh
0 siblings, 1 reply; 62+ messages in thread
From: enh @ 2025-02-06 15:51 UTC (permalink / raw)
To: Thomas Weißschuh
Cc: Jeff Xu, Pedro Falcato, Benjamin Berg, Lorenzo Stoakes,
Kees Cook, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn
On Thu, Feb 6, 2025 at 10:28 AM Thomas Weißschuh
<thomas.weissschuh@linutronix.de> wrote:
>
> On Thu, Feb 06, 2025 at 09:38:59AM -0500, enh wrote:
> > On Thu, Feb 6, 2025 at 8:20 AM Thomas Weißschuh
> > <thomas.weissschuh@linutronix.de> wrote:
> > >
> > > On Fri, Jan 17, 2025 at 02:35:18PM -0500, enh wrote:
> > > > On Fri, Jan 17, 2025 at 1:20 PM Jeff Xu <jeffxu@chromium.org> wrote:
> > >
> > > <snip>
> > >
> > > > > There are technical difficulties to seal vdso/vvar from the glibc
> > > > > side. The dynamic linker lacks vdso/vvar mapping size information, and
> > > > > architectural variations for vdso/vvar also mean sealing from the
> > > > > kernel side is a simpler solution. Adhemerval has more details in case
> > > > > clarification is needed from the glibc side.
> > > >
> > > > as a maintainer of a different linux libc, i've long wanted a "tell me
> > > > everything there is to know about this vma" syscall rather than having
> > > > to parse /proc/maps...
> > > >
> > > > ...but in this special case, is the vdso/vvar size ever anything other
> > > > than "one page" in practice?
> > >
> > > x86 has two additional vvar pages for virtual clocks.
> > > (Since v6.13 even split into their own mapping)
> > > Loongarch has per-cpu vvar data which is larger than one page.
> > > The vdso mapping is however many pages the code ends up being compiled as,
> > > for example on my current x86_64 distro kernel it's two pages.
> > > In the near future, probably v6.14, vvars will be split over multiple
> > > pages in general [0].
> >
> > /me checks the nearest arm64 phone ... yeah, vdso is still only one
> > page there but vvars is already more than one.
>
> Probably due to CONFIG_TIME_NS, see below.
>
> > is there a TL;DR (or RTFM link) for why this is so big? a quick look
> > at the x86 suggests there should only be 640 bytes of various things
> > plus a handful of bytes for the rng, and while arm64 looks very
> > different, that looks like it's explicitly asking for a page (with the
> > vdso_data_store stuff)? (i've never had any reason to look at vvars
> > before, only vdso.)
>
> I don't think there is any real manual.
>
> The vvar data is *shared* between the kernel and userspace.
> This is done by mapping the *same* physical memory into the kernel
> ("vdso_data_store") and (read-only) into all userspace processes.
> As PTEs always cover a full page and the kernel can not expose random
> other internal kernel data into userspace, the vvars need to be in their
> own dedicated page.
> (The same is true for the vDSO code, uprobe trampoline, etc... mappings)
>
> The vDSO functions also need to be aware of time namespaces. This is
> implemented by allocating one page per namespace and mapping this
> in place of the regular vvar page. But the vDSO still needs to access
> the regular vvar page for some information, so both are mapped.
ah, i see. yeah, that makes sense. (amusingly, i almost quipped "it's
not like there are _that_ many clocks to go in there" in my previous
mail, forgetting that there are effectively an unbounded number of
clocks thanks to this feature!)
> Then on top come the rng state and some architecture-specific data.
> These are currently part of the time page. So they also have to dance
> around the time namespace mapping shenanigans. In addition they have to
> coexist with the actual time data, which is currently done by manually
> calculating byte offsets for them in the time page and hardcoding those.
>
> The linked series cleans this up by moving things into dedicated pages.
> To make the code easier to understand and to make it possible to
> add new data to the time page without running out of space or
> introducing conflicts which need to be detected manually.
> While this needs to allocate more pages, these are shared between the
> whole system, so effectively it's cheap. It also requires more virtual
> memory space in each process, but that shouldn't matter.
>
>
> As for arm64 looking very different from x86: Hopefully not for long :-)
(even as someone who doesn't work on the kernel, things like this are
always helpful --- just having one thing to understand/your first grep
being relevant is much nicer than "oh, wait ... which architecture was
that?".)
> > > Figuring out the start and size from /proc/maps, or the new
> > > PROCMAP_QUERY ioctl, is not trivial, due to architectural variations.
> >
> > (obviously it's unsatisfying as a general interface, but in practice
> > the VMAs i see asked about directly -- rather than just rounded
> > up in a diagnostic dump -- are either stacks ["what are the bounds of
> > this stack, and does it have guard pages already?"] or code ["what
> > file was the code at this pc mapped in from?"]. so while the vdso
> > would come up, we'd never notice if vvars didn't work. if your sp/pc
> > point there, we were already just going to bail anyway :-) )
>
> Fair enough.
>
> This information was also a response to Jeff's parent mail,
> as it would be relevant when sealing the mappings from ld.so.
>
> <snip>
>
> > > [0] https://lore.kernel.org/lkml/20250204-vdso-store-rng-v3-0-13a4669dfc8c@linutronix.de/
* Re: [PATCH v4 1/1] exec: seal system mappings
2025-02-06 15:51 ` enh
@ 2025-02-06 16:37 ` Thomas Weißschuh
0 siblings, 0 replies; 62+ messages in thread
From: Thomas Weißschuh @ 2025-02-06 16:37 UTC (permalink / raw)
To: enh
Cc: Jeff Xu, Pedro Falcato, Benjamin Berg, Lorenzo Stoakes,
Kees Cook, akpm, jannh, torvalds, adhemerval.zanella, oleg,
linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger,
ojeda, adobriyan, anna-maria, mark.rutland, linus.walleij, Jason,
deller, rdunlap, davem, hch, peterx, hca, f.fainelli, gerg,
dave.hansen, mingo, ardb, Liam.Howlett, mhocko, 42.hyeyoo,
peterz, ardb, rientjes, groeck, mpe, Vlastimil Babka,
Andrei Vagin, Dmitry Safonov, Mike Rapoport,
Alexander Mikhalitsyn
On Thu, Feb 06, 2025 at 10:51:54AM -0500, enh wrote:
> On Thu, Feb 6, 2025 at 10:28 AM Thomas Weißschuh
> <thomas.weissschuh@linutronix.de> wrote:
> >
> > On Thu, Feb 06, 2025 at 09:38:59AM -0500, enh wrote:
> > > On Thu, Feb 6, 2025 at 8:20 AM Thomas Weißschuh
> > > <thomas.weissschuh@linutronix.de> wrote:
> > > >
> > > > On Fri, Jan 17, 2025 at 02:35:18PM -0500, enh wrote:
> > > > > On Fri, Jan 17, 2025 at 1:20 PM Jeff Xu <jeffxu@chromium.org> wrote:
<snip>
> > > > x86 has two additional vvar pages for virtual clocks.
> > > > (Since v6.13 even split into their own mapping)
> > > > Loongarch has per-cpu vvar data which is larger than one page.
> > > > The vdso mapping is however many pages the code ends up being compiled as,
> > > > for example on my current x86_64 distro kernel it's two pages.
> > > > In the near future, probably v6.14, vvars will be split over multiple
> > > > pages in general [0].
> > >
> > > /me checks the nearest arm64 phone ... yeah, vdso is still only one
> > > page there but vvars is already more than one.
> >
> > Probably due to CONFIG_TIME_NS, see below.
> >
> > > is there a TL;DR (or RTFM link) for why this is so big? a quick look
> > > at the x86 suggests there should only be 640 bytes of various things
> > > plus a handful of bytes for the rng, and while arm64 looks very
> > > different, that looks like it's explicitly asking for a page (with the
> > > vdso_data_store stuff)? (i've never had any reason to look at vvars
> > > before, only vdso.)
> >
> > I don't think there is any real manual.
> >
> > The vvar data is *shared* between the kernel and userspace.
> > This is done by mapping the *same* physical memory into the kernel
> > ("vdso_data_store") and (read-only) into all userspace processes.
> > As PTEs always cover a full page and the kernel can not expose random
> > other internal kernel data into userspace, the vvars need to be in their
> > own dedicated page.
> > (The same is true for the vDSO code, uprobe trampoline, etc... mappings)
> >
> > The vDSO functions also need to be aware of time namespaces. This is
> > implemented by allocating one page per namespace and mapping this
> > in place of the regular vvar page. But the vDSO still needs to access
> > the regular vvar page for some information, so both are mapped.
>
> ah, i see. yeah, that makes sense. (amusingly, i almost quipped "it's
> not like there are _that_ many clocks to go in there" in my previous
> mail, forgetting that there are effectively an unbounded number of
> clocks thanks to this feature!)
Tiny clarification:
The additional, namespaced clocks do not use additional space in the
global time vvar page. They live in a dedicated, dynamically allocated,
per-namespace page. So the used space within a vvar page does not change
at runtime and can never run out. The number of vvar mappings per
process is also constant.
The namespaced time vvar pages have the same structure layout as the
global one, but not all fields are used and some are used differently.
Specifically the namespace pages only contain the offsets to the base
clock and the dynamic clock data is read from the global page.
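A toy model of that split, purely for illustration: the class and field names below are invented and do not reflect the kernel's structure layout. It only shows the idea that the per-namespace page holds constant offsets while the changing clock value always comes from the global page.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalTimePage:
    # Clock name -> current value in nanoseconds, updated by the "kernel".
    base_ns: dict


@dataclass
class NamespaceTimePage:
    # Clock name -> fixed offset for this namespace; no dynamic data here.
    offset_ns: dict = field(default_factory=dict)


def read_clock(global_page, clock, ns_page=None):
    """Read a clock, applying the namespace offset when a namespace page is given."""
    now = global_page.base_ns[clock]
    if ns_page is not None:
        now += ns_page.offset_ns.get(clock, 0)
    return now


global_page = GlobalTimePage(base_ns={"monotonic": 1_000})
ns_page = NamespaceTimePage(offset_ns={"monotonic": 500})

host_view = read_clock(global_page, "monotonic")
ns_view = read_clock(global_page, "monotonic", ns_page)

# Only the global page changes as time advances; the namespace page stays as is.
global_page.base_ns["monotonic"] = 2_000
ns_view_later = read_clock(global_page, "monotonic", ns_page)
```

Adding another namespace in this model means allocating another NamespaceTimePage, not growing the global one, which is why the space used inside a vvar page stays constant.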
<snip>
Thread overview: 62+ messages
2024-11-25 20:20 [PATCH v4 0/1] Seal system mappings jeffxu
2024-11-25 20:20 ` [PATCH v4 1/1] exec: seal " jeffxu
2024-11-25 20:40 ` Matthew Wilcox
2024-12-02 17:22 ` Jeff Xu
2024-12-02 17:57 ` Lorenzo Stoakes
2024-12-02 20:05 ` Jeff Xu
2024-12-02 19:57 ` Jeff Xu
2024-12-02 18:29 ` Lorenzo Stoakes
2024-12-02 20:38 ` Jeff Xu
2024-12-03 7:35 ` Lorenzo Stoakes
2024-12-03 18:19 ` Jeff Xu
2024-12-03 20:16 ` Lorenzo Stoakes
2024-12-04 14:04 ` Benjamin Berg
2024-12-04 17:43 ` Jeff Xu
2024-12-04 18:24 ` Benjamin Berg
2024-12-10 4:12 ` Andrei Vagin
2024-12-11 22:46 ` Jeff Xu
2024-12-13 6:33 ` Andrei Vagin
2024-12-16 18:35 ` Jeff Xu
2024-12-16 18:56 ` Liam R. Howlett
2024-12-16 20:20 ` Jeff Xu
2024-12-17 22:18 ` Kees Cook
2025-01-02 19:15 ` Andrei Vagin
2025-01-03 20:48 ` Liam R. Howlett
2025-01-07 1:17 ` Kees Cook
2025-02-04 18:17 ` Johannes Berg
2025-01-03 21:38 ` Lorenzo Stoakes
2025-01-07 1:12 ` Kees Cook
2025-01-13 21:26 ` Jeff Xu
2025-01-14 4:19 ` Matthew Wilcox
2025-01-15 19:02 ` Jeff Xu
2025-01-15 19:46 ` Lorenzo Stoakes
2025-01-15 20:20 ` Jeff Xu
2025-01-16 15:48 ` Lorenzo Stoakes
2025-01-16 17:01 ` Benjamin Berg
2025-01-16 17:16 ` Lorenzo Stoakes
2025-01-16 17:18 ` Pedro Falcato
2025-01-17 18:20 ` Jeff Xu
2025-01-17 19:35 ` enh
2025-01-17 20:15 ` Jeff Xu
2025-01-17 22:08 ` Liam R. Howlett
2025-01-21 15:38 ` enh
2025-01-22 17:23 ` Liam R. Howlett
2025-01-22 22:29 ` enh
2025-01-23 8:40 ` Vlastimil Babka
2025-01-23 21:50 ` enh
2025-01-23 22:38 ` Matthew Wilcox
2025-02-06 14:19 ` enh
2025-02-06 13:20 ` Thomas Weißschuh
2025-02-06 14:38 ` enh
2025-02-06 15:28 ` Thomas Weißschuh
2025-02-06 15:51 ` enh
2025-02-06 16:37 ` Thomas Weißschuh
2025-01-17 18:08 ` Jeff Xu
2025-01-15 23:52 ` Kees Cook
2025-01-16 5:26 ` Christoph Hellwig
2025-01-16 19:40 ` Kees Cook
2025-01-17 10:14 ` Heiko Carstens
2025-01-16 15:34 ` Lorenzo Stoakes
2025-01-16 19:44 ` Kees Cook
2024-11-26 16:39 ` [PATCH v4 0/1] Seal " Lorenzo Stoakes
2024-12-02 17:28 ` Jeff Xu