* [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec
@ 2023-07-13 14:33 Aleksa Sarai
2023-07-13 14:33 ` [RFC PATCH 1/3] memfd: cleanups for vm.memfd_noexec handling Aleksa Sarai
` (3 more replies)
0 siblings, 4 replies; 20+ messages in thread
From: Aleksa Sarai @ 2023-07-13 14:33 UTC (permalink / raw)
To: Andrew Morton, Shuah Khan, Jeff Xu, Kees Cook, Aleksa Sarai,
Daniel Verkamp, Luis Chamberlain, YueHaibing
Cc: linux-mm, linux-kernel, linux-kselftest
It seems that the most critical issue with vm.memfd_noexec=2 (the fact
that passing MFD_EXEC would bypass it entirely[1]) has been fixed in
Andrew's tree[2], but there are still some outstanding issues that need
to be addressed:
* The dmesg warnings are pr_warn_once, which on most systems means that
they will be used up by systemd or some other boot process and
userspace developers will never see it. The original patch posted to
the ML used pr_warn_ratelimited but the merged patch had it changed
(with a comment about it being "per review"), but given that the
current warnings are useless, pr_warn_ratelimited makes far more
sense.
* vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls
because it will make it far to difficult to ever migrate. Instead it
should imply MFD_EXEC.
* The racheting mechanism for vm.memfd_noexec doesn't make sense as a
security mechanism because a CAP_SYS_ADMIN capable user can create
executable binaries in a hidden tmpfs very easily, not to mention the
many other things they can do.
* The memfd selftests would not exit with a non-zero error code when
certain tests that ran in a forked process (specifically the ones
related to MFD_EXEC and MFD_NOEXEC_SEAL) failed.
(This patchset is based on top of Jeff Xu's patches[2] fixing the
MFD_EXEC bug in vm.memfd_noexec=2.)
[1]: https://lore.kernel.org/all/ZJwcsU0vI-nzgOB_@codewreck.org/
[2]: https://lore.kernel.org/all/20230705063315.3680666-1-jeffxu@google.com/
Aleksa Sarai (3):
memfd: cleanups for vm.memfd_noexec handling
memfd: remove racheting feature from vm.memfd_noexec
selftests: memfd: error out test process when child test fails
include/linux/pid_namespace.h | 16 +++------
kernel/pid_sysctl.h | 7 ----
mm/memfd.c | 32 +++++++----------
tools/testing/selftests/memfd/memfd_test.c | 41 ++++++++++++++++++----
4 files changed, 51 insertions(+), 45 deletions(-)
--
2.41.0
^ permalink raw reply [flat|nested] 20+ messages in thread* [RFC PATCH 1/3] memfd: cleanups for vm.memfd_noexec handling 2023-07-13 14:33 [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec Aleksa Sarai @ 2023-07-13 14:33 ` Aleksa Sarai 2023-07-13 14:33 ` [RFC PATCH 2/3] memfd: remove racheting feature from vm.memfd_noexec Aleksa Sarai ` (2 subsequent siblings) 3 siblings, 0 replies; 20+ messages in thread From: Aleksa Sarai @ 2023-07-13 14:33 UTC (permalink / raw) To: Andrew Morton, Shuah Khan, Kees Cook, Aleksa Sarai, Jeff Xu, Daniel Verkamp Cc: linux-mm, Dominique Martinet, Christian Brauner, stable, linux-kernel, linux-kselftest The previous implementation of vm.memfd_noexec=2 did not actually enforce the usage of MFD_NOEXEC_SEAL, as any program that set MFD_EXEC would not be affected by the "enforcement" mechanism. This was fixed in in Andrew's tree recently, but there were still some things that could be cleaned up. On the topic of older programs, it seems far less disruptive to have vm.memfd_noexec=2 have the same behaviour as vm.memfd_noexec=1 (default to MFD_NOEXEC_SEAL if unspecified) to avoid breaking older programs that didn't actually care about the exec bits -- which includes the vast majority of programs that use memfd_create(2), thus allowing users to be able to enable this sysctl without all older programs needlessly breaking. Otherwise vm.memfd_noexec=2 would be unusable on most general-purpose systems as it would require an audit of all of userspace. While we're at it, fix the warnings emitted by memfd_create() to use pr_warn_ratelimited(). If the intention of the warning is to get developers to switch to explicitly specifying if they want exec bits or not, you need to warn them whenever they use it. The systemd version on my box doesn't pass MFD_EXEC, making the warning useless for most userspace developers because it was already emitted during boot. Commit 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC") mentions that this was switched to pr_warn_once "as per review" but I couldn't find the discussion anywhere, and the obvious issue (the ability for unprivileged userspace to spam the kernel log) should be handled by pr_warn_ratelimited. If the issue is that this is too spammy, we could tie it to using vm.memfd_noexec=1 or higher. This is a user-visible API change, but as it allows programs to do something that would be blocked before, and the sysctl itself was broken and recently released, it seems unlikely this will cause any issues. Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Christian Brauner <brauner@kernel.org> Cc: stable@vger.kernel.org # v6.3+ Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> --- include/linux/pid_namespace.h | 16 +++-------- mm/memfd.c | 32 ++++++++-------------- tools/testing/selftests/memfd/memfd_test.c | 22 +++++++++++---- 3 files changed, 33 insertions(+), 37 deletions(-) diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h index c758809d5bcf..53974d79d98e 100644 --- a/include/linux/pid_namespace.h +++ b/include/linux/pid_namespace.h @@ -17,18 +17,10 @@ struct fs_pin; #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) -/* - * sysctl for vm.memfd_noexec - * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL - * acts like MFD_EXEC was set. - * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL - * acts like MFD_NOEXEC_SEAL was set. - * 2: memfd_create() without MFD_NOEXEC_SEAL will be - * rejected. - */ -#define MEMFD_NOEXEC_SCOPE_EXEC 0 -#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1 -#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2 +/* modes for vm.memfd_noexec sysctl */ +#define MEMFD_NOEXEC_SCOPE_EXEC 0 /* MFD_EXEC implied if unset */ +#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1 /* MFD_NOEXEC_SEAL implied if unset */ +#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2 /* same as 1, except MFD_EXEC rejected */ #endif struct pid_namespace { diff --git a/mm/memfd.c b/mm/memfd.c index 0bdbd2335af7..4f1f841ae39d 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -271,30 +271,22 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg) static int check_sysctl_memfd_noexec(unsigned int *flags) { #ifdef CONFIG_SYSCTL - char comm[TASK_COMM_LEN]; - int sysctl = MEMFD_NOEXEC_SCOPE_EXEC; - struct pid_namespace *ns; - - ns = task_active_pid_ns(current); - if (ns) - sysctl = ns->memfd_noexec_scope; + int sysctl = task_active_pid_ns(current)->memfd_noexec_scope; if (!(*flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) { - if (sysctl == MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL) + if (sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL) *flags |= MFD_NOEXEC_SEAL; else *flags |= MFD_EXEC; } - if (*flags & MFD_EXEC && sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED) { - pr_warn_once( - "memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d '%s'\n", - task_pid_nr(current), get_task_comm(comm, current)); - + if (!(*flags & MFD_NOEXEC_SEAL) && sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED) { + pr_warn_ratelimited( + "%s[%d]: memfd_create() requires MFD_NOEXEC_SEAL with vm.memfd_noexec=%d\n", + current->comm, task_pid_nr(current), sysctl); return -EACCES; } #endif - return 0; } @@ -302,7 +294,6 @@ SYSCALL_DEFINE2(memfd_create, const char __user *, uname, unsigned int, flags) { - char comm[TASK_COMM_LEN]; unsigned int *file_seals; struct file *file; int fd, error; @@ -324,13 +315,14 @@ SYSCALL_DEFINE2(memfd_create, return -EINVAL; if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) { - pr_warn_once( - "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d '%s'\n", - task_pid_nr(current), get_task_comm(comm, current)); + pr_warn_ratelimited( + "%s[%d]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set\n", + current->comm, task_pid_nr(current)); } - if (check_sysctl_memfd_noexec(&flags) < 0) - return -EACCES; + error = check_sysctl_memfd_noexec(&flags); + if (error < 0) + return error; /* length includes terminating zero */ len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1); diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index dbdd9ec5e397..d8342989c547 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -1145,11 +1145,23 @@ static void test_sysctl_child(void) printf("%s sysctl 2\n", memfd_str); sysctl_assert_write("2"); - mfd_fail_new("kern_memfd_sysctl_2", - MFD_CLOEXEC | MFD_ALLOW_SEALING); - mfd_fail_new("kern_memfd_sysctl_2_MFD_EXEC", - MFD_CLOEXEC | MFD_EXEC); - fd = mfd_assert_new("", 0, MFD_NOEXEC_SEAL); + mfd_fail_new("kern_memfd_sysctl_2_exec", + MFD_EXEC | MFD_CLOEXEC | MFD_ALLOW_SEALING); + + fd = mfd_assert_new("kern_memfd_sysctl_2_dfl", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING); + mfd_assert_mode(fd, 0666); + mfd_assert_has_seals(fd, F_SEAL_EXEC); + mfd_fail_chmod(fd, 0777); + close(fd); + + fd = mfd_assert_new("kern_memfd_sysctl_2_noexec_seal", + mfd_def_size, + MFD_NOEXEC_SEAL | MFD_CLOEXEC | MFD_ALLOW_SEALING); + mfd_assert_mode(fd, 0666); + mfd_assert_has_seals(fd, F_SEAL_EXEC); + mfd_fail_chmod(fd, 0777); close(fd); sysctl_fail_write("0"); -- 2.41.0 ^ permalink raw reply [flat|nested] 20+ messages in thread
* [RFC PATCH 2/3] memfd: remove racheting feature from vm.memfd_noexec 2023-07-13 14:33 [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec Aleksa Sarai 2023-07-13 14:33 ` [RFC PATCH 1/3] memfd: cleanups for vm.memfd_noexec handling Aleksa Sarai @ 2023-07-13 14:33 ` Aleksa Sarai 2023-07-14 0:07 ` Aleksa Sarai 2023-07-13 14:33 ` [RFC PATCH 3/3] selftests: memfd: error out test process when child test fails Aleksa Sarai 2023-07-18 0:47 ` [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec Jeff Xu 3 siblings, 1 reply; 20+ messages in thread From: Aleksa Sarai @ 2023-07-13 14:33 UTC (permalink / raw) To: Andrew Morton, Jeff Xu, Aleksa Sarai, YueHaibing, Luis Chamberlain, Kees Cook, Daniel Verkamp Cc: linux-mm, Dominique Martinet, Christian Brauner, stable, linux-kernel This sysctl has the very unusal behaviour of not allowing any user (even CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were to set this sysctl to a more restrictive option in the host pidns you would need to reboot your machine in order to reset it. The justification given in [1] is that this is a security feature and thus it should not be possible to disable. Aside from the fact that we have plenty of security-related sysctls that can be disabled after being enabled (fs.protected_symlinks for instance), the protection provided by the sysctl is to stop users from being able to create a binary and then execute it. A user with CAP_SYS_ADMIN can trivially do this without memfd_create(2): % cat mount-memfd.c #include <fcntl.h> #include <string.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <linux/mount.h> #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:" int main(void) { int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC); assert(fsfd >= 0); assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2)); int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0); assert(dfd >= 0); int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0782); assert(execfd >= 0); assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE)); assert(!close(execfd)); char *execpath = NULL; char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL }; execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC); assert(execfd >= 0); assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0); assert(!execve(execpath, argv, envp)); } % ./mount-memfd this file was executed from this totally private tmpfs: /proc/self/fd/5 % Given that it is possible for CAP_SYS_ADMIN users to create executable binaries without memfd_create(2) and without touching the host filesystem (not to mention the many other things a CAP_SYS_ADMIN process would be able to do that would be equivalent or worse), it seems strange to cause a fair amount of headache to admins when there doesn't appear to be an actual security benefit to blocking this. It should be noted that with this change, programs that can do an unprivileged unshare(CLONE_NEWUSER) would be able to create an executable memfd even if their current pidns didn't allow it. However, the same sample program above can also be used in this scenario, meaning that even with this consideration, blocking CAP_SYS_ADMIN makes little sense: % unshare -rm ./mount-memfd this file was executed from this totally private tmpfs: /proc/self/fd/5 This simply further reinforces that locked-down environments need to disallow CLONE_NEWUSER for unprivileged users (as is already the case in most container environments). [1]: https://lore.kernel.org/all/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/ Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Christian Brauner <brauner@kernel.org> Cc: stable@vger.kernel.org # v6.3+ Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> --- kernel/pid_sysctl.h | 7 ------- 1 file changed, 7 deletions(-) diff --git a/kernel/pid_sysctl.h b/kernel/pid_sysctl.h index b26e027fc9cd..8a22bc29ebb4 100644 --- a/kernel/pid_sysctl.h +++ b/kernel/pid_sysctl.h @@ -24,13 +24,6 @@ static int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, if (ns != &init_pid_ns) table_copy.data = &ns->memfd_noexec_scope; - /* - * set minimum to current value, the effect is only bigger - * value is accepted. - */ - if (*(int *)table_copy.data > *(int *)table_copy.extra1) - table_copy.extra1 = table_copy.data; - return proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos); } -- 2.41.0 ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 2/3] memfd: remove racheting feature from vm.memfd_noexec 2023-07-13 14:33 ` [RFC PATCH 2/3] memfd: remove racheting feature from vm.memfd_noexec Aleksa Sarai @ 2023-07-14 0:07 ` Aleksa Sarai 0 siblings, 0 replies; 20+ messages in thread From: Aleksa Sarai @ 2023-07-14 0:07 UTC (permalink / raw) To: Andrew Morton, Jeff Xu, YueHaibing, Luis Chamberlain, Kees Cook, Daniel Verkamp Cc: linux-mm, Dominique Martinet, Christian Brauner, stable, linux-kernel [-- Attachment #1: Type: text/plain, Size: 4901 bytes --] On 2023-07-14, Aleksa Sarai <cyphar@cyphar.com> wrote: > This sysctl has the very unusal behaviour of not allowing any user (even > CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you > were to set this sysctl to a more restrictive option in the host pidns > you would need to reboot your machine in order to reset it. > > The justification given in [1] is that this is a security feature and > thus it should not be possible to disable. Aside from the fact that we > have plenty of security-related sysctls that can be disabled after being > enabled (fs.protected_symlinks for instance), the protection provided by > the sysctl is to stop users from being able to create a binary and then > execute it. A user with CAP_SYS_ADMIN can trivially do this without > memfd_create(2): > > % cat mount-memfd.c > #include <fcntl.h> > #include <string.h> > #include <stdio.h> > #include <stdlib.h> > #include <unistd.h> > #include <linux/mount.h> > > #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:" > > int main(void) > { > int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC); > assert(fsfd >= 0); > assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2)); > > int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0); > assert(dfd >= 0); > > int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0782); 0777 Oops. I must've garbled something when copying from my test program. > assert(execfd >= 0); > assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE)); > assert(!close(execfd)); > > char *execpath = NULL; > char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL }; > execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC); > assert(execfd >= 0); > assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0); > assert(!execve(execpath, argv, envp)); > } > % ./mount-memfd > this file was executed from this totally private tmpfs: /proc/self/fd/5 > % > > Given that it is possible for CAP_SYS_ADMIN users to create executable > binaries without memfd_create(2) and without touching the host > filesystem (not to mention the many other things a CAP_SYS_ADMIN process > would be able to do that would be equivalent or worse), it seems strange > to cause a fair amount of headache to admins when there doesn't appear > to be an actual security benefit to blocking this. > > It should be noted that with this change, programs that can do an > unprivileged unshare(CLONE_NEWUSER) would be able to create an > executable memfd even if their current pidns didn't allow it. However, > the same sample program above can also be used in this scenario, meaning > that even with this consideration, blocking CAP_SYS_ADMIN makes little > sense: > > % unshare -rm ./mount-memfd > this file was executed from this totally private tmpfs: /proc/self/fd/5 > > This simply further reinforces that locked-down environments need to > disallow CLONE_NEWUSER for unprivileged users (as is already the case in > most container environments). > > [1]: https://lore.kernel.org/all/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/ > > Cc: Dominique Martinet <asmadeus@codewreck.org> > Cc: Christian Brauner <brauner@kernel.org> > Cc: stable@vger.kernel.org # v6.3+ > Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC") > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> > --- > kernel/pid_sysctl.h | 7 ------- > 1 file changed, 7 deletions(-) > > diff --git a/kernel/pid_sysctl.h b/kernel/pid_sysctl.h > index b26e027fc9cd..8a22bc29ebb4 100644 > --- a/kernel/pid_sysctl.h > +++ b/kernel/pid_sysctl.h > @@ -24,13 +24,6 @@ static int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, > if (ns != &init_pid_ns) > table_copy.data = &ns->memfd_noexec_scope; > > - /* > - * set minimum to current value, the effect is only bigger > - * value is accepted. > - */ > - if (*(int *)table_copy.data > *(int *)table_copy.extra1) > - table_copy.extra1 = table_copy.data; > - > return proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos); > } I also have a patch to properly tie the sysctl to the pid namespace rather that having a global sysctl that magically has its value changed in this pid_mfd_noexec_dointvec_minmax() and another to do the same for the other pidns-tied sysctl (kernel.ns_last_pid) but I'm not sure whether it's needed. It does make vm.memfd_noexec a bit cleaner but because the two sysctls are in different tables you can't register them together AFAICS which means a bunch of needless duplication. > > -- > 2.41.0 > -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* [RFC PATCH 3/3] selftests: memfd: error out test process when child test fails 2023-07-13 14:33 [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec Aleksa Sarai 2023-07-13 14:33 ` [RFC PATCH 1/3] memfd: cleanups for vm.memfd_noexec handling Aleksa Sarai 2023-07-13 14:33 ` [RFC PATCH 2/3] memfd: remove racheting feature from vm.memfd_noexec Aleksa Sarai @ 2023-07-13 14:33 ` Aleksa Sarai 2023-07-19 23:38 ` Jeff Xu 2023-07-18 0:47 ` [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec Jeff Xu 3 siblings, 1 reply; 20+ messages in thread From: Aleksa Sarai @ 2023-07-13 14:33 UTC (permalink / raw) To: Shuah Khan, Jeff Xu, Andrew Morton, Daniel Verkamp, Aleksa Sarai, Kees Cook Cc: linux-mm, linux-kselftest, linux-kernel Before this change, a test runner using this self test would see a return code of 0 when the tests using a child process (namely the MFD_NOEXEC_SEAL and MFD_EXEC tests) failed, masking test failures. Fixes: 11f75a01448f ("selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> --- tools/testing/selftests/memfd/memfd_test.c | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index d8342989c547..8b7390ad81d1 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -1219,7 +1219,24 @@ static pid_t spawn_newpid_thread(unsigned int flags, int (*fn)(void *)) static void join_newpid_thread(pid_t pid) { - waitpid(pid, NULL, 0); + int wstatus; + + if (waitpid(pid, &wstatus, 0) < 0) { + printf("newpid thread: waitpid() failed: %m\n"); + abort(); + } + + if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) != 0) { + printf("newpid thread: exited with non-zero error code %d\n", + WEXITSTATUS(wstatus)); + abort(); + } + + if (WIFSIGNALED(wstatus)) { + printf("newpid thread: killed by signal %d\n", + WTERMSIG(wstatus)); + abort(); + } } /* -- 2.41.0 ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 3/3] selftests: memfd: error out test process when child test fails 2023-07-13 14:33 ` [RFC PATCH 3/3] selftests: memfd: error out test process when child test fails Aleksa Sarai @ 2023-07-19 23:38 ` Jeff Xu 2023-08-03 2:25 ` Aleksa Sarai 0 siblings, 1 reply; 20+ messages in thread From: Jeff Xu @ 2023-07-19 23:38 UTC (permalink / raw) To: Aleksa Sarai Cc: Shuah Khan, Andrew Morton, Daniel Verkamp, Kees Cook, linux-mm, linux-kselftest, linux-kernel On Thu, Jul 13, 2023 at 7:34 AM Aleksa Sarai <cyphar@cyphar.com> wrote: > > Before this change, a test runner using this self test would see a > return code of 0 when the tests using a child process (namely the > MFD_NOEXEC_SEAL and MFD_EXEC tests) failed, masking test failures. > > Fixes: 11f75a01448f ("selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC") > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> > --- > tools/testing/selftests/memfd/memfd_test.c | 19 ++++++++++++++++++- > 1 file changed, 18 insertions(+), 1 deletion(-) > > diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c > index d8342989c547..8b7390ad81d1 100644 > --- a/tools/testing/selftests/memfd/memfd_test.c > +++ b/tools/testing/selftests/memfd/memfd_test.c > @@ -1219,7 +1219,24 @@ static pid_t spawn_newpid_thread(unsigned int flags, int (*fn)(void *)) > > static void join_newpid_thread(pid_t pid) > { > - waitpid(pid, NULL, 0); > + int wstatus; > + > + if (waitpid(pid, &wstatus, 0) < 0) { > + printf("newpid thread: waitpid() failed: %m\n"); > + abort(); > + } > + > + if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) != 0) { > + printf("newpid thread: exited with non-zero error code %d\n", > + WEXITSTATUS(wstatus)); > + abort(); > + } > + > + if (WIFSIGNALED(wstatus)) { > + printf("newpid thread: killed by signal %d\n", > + WTERMSIG(wstatus)); > + abort(); > + } > } > Signed-off-by: Jeff Xu <jeffxu@google.com> -Jeff Xu > /* > -- > 2.41.0 > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 3/3] selftests: memfd: error out test process when child test fails 2023-07-19 23:38 ` Jeff Xu @ 2023-08-03 2:25 ` Aleksa Sarai 2023-08-03 2:55 ` Jeff Xu 0 siblings, 1 reply; 20+ messages in thread From: Aleksa Sarai @ 2023-08-03 2:25 UTC (permalink / raw) To: Jeff Xu Cc: Shuah Khan, Andrew Morton, Daniel Verkamp, Kees Cook, linux-mm, linux-kselftest, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1956 bytes --] On 2023-07-19, Jeff Xu <jeffxu@google.com> wrote: > On Thu, Jul 13, 2023 at 7:34 AM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > > Before this change, a test runner using this self test would see a > > return code of 0 when the tests using a child process (namely the > > MFD_NOEXEC_SEAL and MFD_EXEC tests) failed, masking test failures. > > > > Fixes: 11f75a01448f ("selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC") > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> > > --- > > tools/testing/selftests/memfd/memfd_test.c | 19 ++++++++++++++++++- > > 1 file changed, 18 insertions(+), 1 deletion(-) > > > > diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c > > index d8342989c547..8b7390ad81d1 100644 > > --- a/tools/testing/selftests/memfd/memfd_test.c > > +++ b/tools/testing/selftests/memfd/memfd_test.c > > @@ -1219,7 +1219,24 @@ static pid_t spawn_newpid_thread(unsigned int flags, int (*fn)(void *)) > > > > static void join_newpid_thread(pid_t pid) > > { > > - waitpid(pid, NULL, 0); > > + int wstatus; > > + > > + if (waitpid(pid, &wstatus, 0) < 0) { > > + printf("newpid thread: waitpid() failed: %m\n"); > > + abort(); > > + } > > + > > + if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) != 0) { > > + printf("newpid thread: exited with non-zero error code %d\n", > > + WEXITSTATUS(wstatus)); > > + abort(); > > + } > > + > > + if (WIFSIGNALED(wstatus)) { > > + printf("newpid thread: killed by signal %d\n", > > + WTERMSIG(wstatus)); > > + abort(); > > + } > > } > > > Signed-off-by: Jeff Xu <jeffxu@google.com> Did you mean for this to a Reviewed-by? -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 3/3] selftests: memfd: error out test process when child test fails 2023-08-03 2:25 ` Aleksa Sarai @ 2023-08-03 2:55 ` Jeff Xu 0 siblings, 0 replies; 20+ messages in thread From: Jeff Xu @ 2023-08-03 2:55 UTC (permalink / raw) To: Aleksa Sarai Cc: Shuah Khan, Andrew Morton, Daniel Verkamp, Kees Cook, linux-mm, linux-kselftest, linux-kernel On Wed, Aug 2, 2023 at 7:25 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > On 2023-07-19, Jeff Xu <jeffxu@google.com> wrote: > > On Thu, Jul 13, 2023 at 7:34 AM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > > > > Before this change, a test runner using this self test would see a > > > return code of 0 when the tests using a child process (namely the > > > MFD_NOEXEC_SEAL and MFD_EXEC tests) failed, masking test failures. > > > > > > Fixes: 11f75a01448f ("selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC") > > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> > > > --- > > > tools/testing/selftests/memfd/memfd_test.c | 19 ++++++++++++++++++- > > > 1 file changed, 18 insertions(+), 1 deletion(-) > > > > > > diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c > > > index d8342989c547..8b7390ad81d1 100644 > > > --- a/tools/testing/selftests/memfd/memfd_test.c > > > +++ b/tools/testing/selftests/memfd/memfd_test.c > > > @@ -1219,7 +1219,24 @@ static pid_t spawn_newpid_thread(unsigned int flags, int (*fn)(void *)) > > > > > > static void join_newpid_thread(pid_t pid) > > > { > > > - waitpid(pid, NULL, 0); > > > + int wstatus; > > > + > > > + if (waitpid(pid, &wstatus, 0) < 0) { > > > + printf("newpid thread: waitpid() failed: %m\n"); > > > + abort(); > > > + } > > > + > > > + if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) != 0) { > > > + printf("newpid thread: exited with non-zero error code %d\n", > > > + WEXITSTATUS(wstatus)); > > > + abort(); > > > + } > > > + > > > + if (WIFSIGNALED(wstatus)) { > > > + printf("newpid thread: killed by signal %d\n", > > > + WTERMSIG(wstatus)); > > > + abort(); > > > + } > > > } > > > > > Signed-off-by: Jeff Xu <jeffxu@google.com> > > Did you mean for this to a Reviewed-by? > Yes! Thanks for asking. Reviewed-by: Jeff Xu <jeffxu@google.com> > -- > Aleksa Sarai > Senior Software Engineer (Containers) > SUSE Linux GmbH > <https://www.cyphar.com/> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec 2023-07-13 14:33 [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec Aleksa Sarai ` (2 preceding siblings ...) 2023-07-13 14:33 ` [RFC PATCH 3/3] selftests: memfd: error out test process when child test fails Aleksa Sarai @ 2023-07-18 0:47 ` Jeff Xu 2023-07-19 3:10 ` Aleksa Sarai 3 siblings, 1 reply; 20+ messages in thread From: Jeff Xu @ 2023-07-18 0:47 UTC (permalink / raw) To: Aleksa Sarai Cc: Andrew Morton, Shuah Khan, Jeff Xu, Kees Cook, Daniel Verkamp, Luis Chamberlain, YueHaibing, linux-mm, linux-kernel, linux-kselftest, linux-hardening Hello Aleksa, Thanks for your email and patches for discussion. On Thu, Jul 13, 2023 at 7:34 AM Aleksa Sarai <cyphar@cyphar.com> wrote: > > It seems that the most critical issue with vm.memfd_noexec=2 (the fact > that passing MFD_EXEC would bypass it entirely[1]) has been fixed in > Andrew's tree[2], but there are still some outstanding issues that need > to be addressed: > > * The dmesg warnings are pr_warn_once, which on most systems means that > they will be used up by systemd or some other boot process and > userspace developers will never see it. The original patch posted to > the ML used pr_warn_ratelimited but the merged patch had it changed > (with a comment about it being "per review"), but given that the > current warnings are useless, pr_warn_ratelimited makes far more > sense. > Ya, This was discussed in [1] Replacing pr_warn_once with pr_warn_ratelimited won't address Peter Xu's observation that "ratelimited" will fill syslog [2], I'm not sure it is acceptable to ones who is not interested in memfd, I will defer this to maintainers. [1] https://lore.kernel.org/lkml/202212161233.85C9783FB@keescook/ [2] https://lwn.net/ml/linux-kernel/Y5yS8wCnuYGLHMj4@x1n/ > * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls > because it will make it far to difficult to ever migrate. Instead it > should imply MFD_EXEC. > Though the purpose of memfd_noexec=2 is not to help with migration - but to disable creation of executable memfd for the current system/pid namespace. During the migration, vm.memfd_noexe = 1 helps overwriting for unmigrated user code as a temporary measure. Additional functionality/features should be implemented through security hook and LSM, not sysctl, I think. > * The ratcheting mechanism for vm.memfd_noexec doesn't make sense as a > security mechanism because a CAP_SYS_ADMIN capable user can create > executable binaries in a hidden tmpfs very easily, not to mention the > many other things they can do. > By further limiting CAP_SYS_ADMIN, an attacker can't modify this sysctl even after compromising some system service with high privilege, YAMA has the same approach for ptrace_scope=3 In addition, this sysctl is pid_name spaced, this means child pid_namespace will alway have the same or stricter security setting than its parent, this allows admin to maintain a tree like view. If we allow the child pid namespace to elevate its setting, then the system-wide setting is no longer meaningful. The code sample shared in this patch set indicates that the attacker already has the ability of creating tmpfs and executing complex steps, at that point, it doesn't matter if the code execution is from memfd or not. For a safe by default system such as ChromeOS, attackers won't easily run arbitrary code, memfd is one of the open doors for that, so we are disabling executable memfd in ChromeOS. In other words: if an attacker can already execute the arbitrary code as sample given in ChromeOS, without using executable memfd, then memfd is no longer the thing we need to worry about, the arbitrary code execution is already achieved by the attacker. Even though I use ChromeOS as an example, I think the same type of threat model applies to any system that wants to disable executable memfd entirely. > * The memfd selftests would not exit with a non-zero error code when > certain tests that ran in a forked process (specifically the ones > related to MFD_EXEC and MFD_NOEXEC_SEAL) failed. > I will test this code and follow up. Thanks! -Jeff Xu > (This patchset is based on top of Jeff Xu's patches[2] fixing the > MFD_EXEC bug in vm.memfd_noexec=2.) > > [1]: https://lore.kernel.org/all/ZJwcsU0vI-nzgOB_@codewreck.org/ > [2]: https://lore.kernel.org/all/20230705063315.3680666-1-jeffxu@google.com/ > > Aleksa Sarai (3): > memfd: cleanups for vm.memfd_noexec handling > memfd: remove racheting feature from vm.memfd_noexec > selftests: memfd: error out test process when child test fails > > include/linux/pid_namespace.h | 16 +++------ > kernel/pid_sysctl.h | 7 ---- > mm/memfd.c | 32 +++++++---------- > tools/testing/selftests/memfd/memfd_test.c | 41 ++++++++++++++++++---- > 4 files changed, 51 insertions(+), 45 deletions(-) > > -- > 2.41.0 > > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec 2023-07-18 0:47 ` [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec Jeff Xu @ 2023-07-19 3:10 ` Aleksa Sarai 2023-07-19 7:17 ` Jeff Xu 0 siblings, 1 reply; 20+ messages in thread From: Aleksa Sarai @ 2023-07-19 3:10 UTC (permalink / raw) To: Jeff Xu Cc: Andrew Morton, Shuah Khan, Jeff Xu, Kees Cook, Daniel Verkamp, Luis Chamberlain, YueHaibing, linux-mm, linux-kernel, linux-kselftest, linux-hardening [-- Attachment #1: Type: text/plain, Size: 8054 bytes --] On 2023-07-17, Jeff Xu <jeffxu@chromium.org> wrote: > Hello Aleksa, > > Thanks for your email and patches for discussion. > > On Thu, Jul 13, 2023 at 7:34 AM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > > It seems that the most critical issue with vm.memfd_noexec=2 (the fact > > that passing MFD_EXEC would bypass it entirely[1]) has been fixed in > > Andrew's tree[2], but there are still some outstanding issues that need > > to be addressed: > > > > * The dmesg warnings are pr_warn_once, which on most systems means that > > they will be used up by systemd or some other boot process and > > userspace developers will never see it. The original patch posted to > > the ML used pr_warn_ratelimited but the merged patch had it changed > > (with a comment about it being "per review"), but given that the > > current warnings are useless, pr_warn_ratelimited makes far more > > sense. > > > Ya, This was discussed in [1] > Replacing pr_warn_once with pr_warn_ratelimited won't address Peter > Xu's observation that "ratelimited" will fill syslog [2], I'm not > sure it is acceptable to ones who is not interested in memfd, I will > defer this to maintainers. > > [1] https://lore.kernel.org/lkml/202212161233.85C9783FB@keescook/ > [2] https://lwn.net/ml/linux-kernel/Y5yS8wCnuYGLHMj4@x1n/ I see Kees's point, but in that case the logging should be tied to the sysctl being the non-default value (I can post this version next if you prefer). The current logging setup doesn't make sense. > > * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls > > because it will make it far to difficult to ever migrate. Instead it > > should imply MFD_EXEC. > > > Though the purpose of memfd_noexec=2 is not to help with migration - > but to disable creation of executable memfd for the current system/pid > namespace. > During the migration, vm.memfd_noexe = 1 helps overwriting for > unmigrated user code as a temporary measure. My point is that the current behaviour for =2 means that nobody other than *maybe* ChromeOS will ever be able to use it because it requires auditing every program on the system. In fact, it's possible even ChromeOS will run into issues given that one of the arguments made for the nosymfollow mount option was that auditing all of ChromeOS to replace every open with RESOLVE_NO_SYMLINKS would be too much effort[1] (which I agreed with). Maybe this is less of an issue with memfd_create(2) (which is much newer than open(2)) but it still seems like a lot of busy work when the =1 behaviour is entirely sane even in the strict threat model that =2 is trying to protect against. To me, using =1 as a migration path (and in fact, calling =1 a migration path further argues that the warning for not setting _EXEC or _NOEXEC_SEAL should be tied to =1) would mean finding every program that uses executable memfds and changing it to stop doing that. Not that you use =1 to go and rewrite every userspace program that uses memfd_create(2) at all, without using executable memfds (rebooting each time to test the behaviour because we use pr_warn_once). If you want to block syscalls that don't explicitly pass NOEXEC_SEAL, there are several tools for doing this (both seccomp and LSM hooks). [1]: https://lore.kernel.org/linux-fsdevel/20200131212021.GA108613@google.com/ > Additional functionality/features should be implemented through > security hook and LSM, not sysctl, I think. This issue with =2 cannot be fixed in an LSM. (On the other hand, you could implement either =2 behaviour with an LSM using =1, and the current strict =2 behaviour could be implemented purely with seccomp.) > > * The ratcheting mechanism for vm.memfd_noexec doesn't make sense as a > > security mechanism because a CAP_SYS_ADMIN capable user can create > > executable binaries in a hidden tmpfs very easily, not to mention the > > many other things they can do. > > > By further limiting CAP_SYS_ADMIN, an attacker can't modify this > sysctl even after compromising some system service with high > privilege, YAMA has the same approach for ptrace_scope=3 Personally, I also think this behaviour from YAMA is a little goofy too, but given that it only locks the most extreme setting and there is no way to get around the most extreme setting, I guess it makes some sense (not to mention it's an LSM and so there is an argument that it should be possible to lock out privileged users from modifying it). There are many other security sysctls, and very few have this behaviour because it doesn't make much sense in most cases. > In addition, this sysctl is pid_name spaced, this means child > pid_namespace will alway have the same or stricter security setting > than its parent, this allows admin to maintain a tree like view. If we > allow the child pid namespace to elevate its setting, then the > system-wide setting is no longer meaningful. "no longer meaningful" is too strong of a statement imho. It is still useful for constraining non-root processes and presumably ChromeOS disallows random processes to do CLONE_NEWUSER (otherwise the protection of this sysctl is pointless) so in practice for ChromeOS there is no change in the attack surface. (FWIW, I think tying this to the user namespace would've made more sense since this is about privilege restrictions, but that ship has sailed.) > The code sample shared in this patch set indicates that the attacker > already has the ability of creating tmpfs and executing complex steps, > at that point, it doesn't matter if the code execution is from memfd > or not. For a safe by default system such as ChromeOS, attackers won't > easily run arbitrary code, memfd is one of the open doors for that, so > we are disabling executable memfd in ChromeOS. In other words: if an > attacker can already execute the arbitrary code as sample given in > ChromeOS, without using executable memfd, then memfd is no longer the > thing we need to worry about, the arbitrary code execution is already > achieved by the attacker. Even though I use ChromeOS as an example, I > think the same type of threat model applies to any system that wants > to disable executable memfd entirely. I understand the threat model this sysctl is blocking, my point is that blocking CAP_SYS_ADMIN from modifying the setting doesn't make sense from that threat model. An attacker that manages to trick some process into creating a memfd with an executable payload is not going to be able to change the sysctl setting (unless there's a confused deputy with CAP_SYS_ADMIN, in which case you have much bigger issues). If a CAP_SYS_ADMIN-capable user wants to change the sysctl, blocking it doesn't add any security because that process could create a memfd-like fd to execute without issues. What practical attack does this ratcheting mechanism protect against? (This is a question you can answer with the YAMA sysctl, but not this one AFAICS.) But even if you feel that allowing this in child user namespaces is unsafe or undesirable, it's absolutely necessary that capable(CAP_SYS_ADMIN) should be able to un-brick the running system by changing the sysctl. The alternative is that you need to reboot your server in order to un-set a sysctl that broke some application you run. Also, by the same token, this ratcheting mechanism doesn't make sense with =1 *at all* because it could break programs in a way that would require a reboot but it's not a "security setting" (and the YAMA sysctl mentioned only locks the sysctl at the highest setting). > > * The memfd selftests would not exit with a non-zero error code when > > certain tests that ran in a forked process (specifically the ones > > related to MFD_EXEC and MFD_NOEXEC_SEAL) failed. > > > I will test this code and follow up. Thanks! -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec 2023-07-19 3:10 ` Aleksa Sarai @ 2023-07-19 7:17 ` Jeff Xu 2023-08-02 1:04 ` Aleksa Sarai 0 siblings, 1 reply; 20+ messages in thread From: Jeff Xu @ 2023-07-19 7:17 UTC (permalink / raw) To: Aleksa Sarai Cc: Jeff Xu, Andrew Morton, Shuah Khan, Kees Cook, Daniel Verkamp, Luis Chamberlain, YueHaibing, linux-mm, linux-kernel, linux-kselftest, linux-hardening On Tue, Jul 18, 2023 at 8:10 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > On 2023-07-17, Jeff Xu <jeffxu@chromium.org> wrote: > > Hello Aleksa, > > > > Thanks for your email and patches for discussion. > > > > On Thu, Jul 13, 2023 at 7:34 AM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > > > > It seems that the most critical issue with vm.memfd_noexec=2 (the fact > > > that passing MFD_EXEC would bypass it entirely[1]) has been fixed in > > > Andrew's tree[2], but there are still some outstanding issues that need > > > to be addressed: > > > > > > * The dmesg warnings are pr_warn_once, which on most systems means that > > > they will be used up by systemd or some other boot process and > > > userspace developers will never see it. The original patch posted to > > > the ML used pr_warn_ratelimited but the merged patch had it changed > > > (with a comment about it being "per review"), but given that the > > > current warnings are useless, pr_warn_ratelimited makes far more > > > sense. > > > > > Ya, This was discussed in [1] > > Replacing pr_warn_once with pr_warn_ratelimited won't address Peter > > Xu's observation that "ratelimited" will fill syslog [2], I'm not > > sure it is acceptable to ones who is not interested in memfd, I will > > defer this to maintainers. > > > > [1] https://lore.kernel.org/lkml/202212161233.85C9783FB@keescook/ > > [2] https://lwn.net/ml/linux-kernel/Y5yS8wCnuYGLHMj4@x1n/ > > I see Kees's point, but in that case the logging should be tied to the > sysctl being the non-default value (I can post this version next if you > prefer). The current logging setup doesn't make sense. > Is there a best practice in kernel for this problem: too much log vs too little log In other products, usually the log level or compiler flag (ifdef) are for such a situation. > > > * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls > > > because it will make it far to difficult to ever migrate. Instead it > > > should imply MFD_EXEC. > > > > > Though the purpose of memfd_noexec=2 is not to help with migration - > > but to disable creation of executable memfd for the current system/pid > > namespace. > > During the migration, vm.memfd_noexe = 1 helps overwriting for > > unmigrated user code as a temporary measure. > > My point is that the current behaviour for =2 means that nobody other > than *maybe* ChromeOS will ever be able to use it because it requires > auditing every program on the system. In fact, it's possible even > ChromeOS will run into issues given that one of the arguments made for > the nosymfollow mount option was that auditing all of ChromeOS to > replace every open with RESOLVE_NO_SYMLINKS would be too much effort[1] > (which I agreed with). Maybe this is less of an issue with > memfd_create(2) (which is much newer than open(2)) but it still seems > like a lot of busy work when the =1 behaviour is entirely sane even in > the strict threat model that =2 is trying to protect against. > It can also be a container (that have all memfd_create migrated to new API) One option I considered previously was "=2" would do overwrite+block , and "=3" just block. But then I worry that applications won't have motivation to ever change their existing code, the setting will forever stay at "=2", making "=3" even more impossible to ever be used system side. > To me, using =1 as a migration path (and in fact, calling =1 a migration > path further argues that the warning for not setting _EXEC or > _NOEXEC_SEAL should be tied to =1) would mean finding every program that > uses executable memfds and changing it to stop doing that. Not that you > use =1 to go and rewrite every userspace program that uses > memfd_create(2) at all, without using executable memfds (rebooting each > time to test the behaviour because we use pr_warn_once). > I tend to think logging and sysctl are orthogonal, tie them together making it more complex than necessary. If we need more logging, we should find what is the best practice in the kernel for that. > If you want to block syscalls that don't explicitly pass NOEXEC_SEAL, > there are several tools for doing this (both seccomp and LSM hooks). > > [1]: https://lore.kernel.org/linux-fsdevel/20200131212021.GA108613@google.com/ > > > Additional functionality/features should be implemented through > > security hook and LSM, not sysctl, I think. > > This issue with =2 cannot be fixed in an LSM. (On the other hand, you > could implement either =2 behaviour with an LSM using =1, and the > current strict =2 behaviour could be implemented purely with seccomp.) > By migration, I mean a system that is not fully migrated, such a system should just use "=0" or "=1". Additional features can be implemented in SELinux/Landlock/other LSM by a motivated dev. e.g. if a system wants to limit executable memfd to specific programs or fully disable it. "=2" is for a system/container that is fully migrated, in that case, SELinux/Landlock/LSM can do the same, but sysctl provides a convenient alternative. Yes, seccomp provides a similar mechanism. Indeed, combining "=1" and seccomp (block MFD_EXEC), it will overwrite + block X mfd, which is essentially what you want, iiuc.However, I do not wish to have this implemented in kernel, due to the thinking that I want kernel to get out of business of "overwriting" eventually. > > > * The ratcheting mechanism for vm.memfd_noexec doesn't make sense as a > > > security mechanism because a CAP_SYS_ADMIN capable user can create > > > executable binaries in a hidden tmpfs very easily, not to mention the > > > many other things they can do. > > > > > By further limiting CAP_SYS_ADMIN, an attacker can't modify this > > sysctl even after compromising some system service with high > > privilege, YAMA has the same approach for ptrace_scope=3 > > Personally, I also think this behaviour from YAMA is a little goofy too, > but given that it only locks the most extreme setting and there is no > way to get around the most extreme setting, I guess it makes some sense > (not to mention it's an LSM and so there is an argument that it should > be possible to lock out privileged users from modifying it). > There are many other security sysctls, and very few have this behaviour > because it doesn't make much sense in most cases. > > > In addition, this sysctl is pid_name spaced, this means child > > pid_namespace will alway have the same or stricter security setting > > than its parent, this allows admin to maintain a tree like view. If we > > allow the child pid namespace to elevate its setting, then the > > system-wide setting is no longer meaningful. > > "no longer meaningful" is too strong of a statement imho. It is still > useful for constraining non-root processes and presumably ChromeOS > disallows random processes to do CLONE_NEWUSER (otherwise the protection > of this sysctl is pointless) so in practice for ChromeOS there is no > change in the attack surface. > > (FWIW, I think tying this to the user namespace would've made more sense > since this is about privilege restrictions, but that ship has sailed.) > The reason that this sysctl is a PID namespace is that I hope a container and host can have different sysctl values, e.g. host will allow runc's use of X mfd, while a container doesn't want X mfd. . To clarify what you meant, do you mean this: when a container is in its own pid_namespace, and has "=2", the programs inside the container can still use CLONE_NEWUSER to break out "=2" ? And what makes the user namespace a better choice than pid namespace for this sysctl ? > > The code sample shared in this patch set indicates that the attacker > > already has the ability of creating tmpfs and executing complex steps, > > at that point, it doesn't matter if the code execution is from memfd > > or not. For a safe by default system such as ChromeOS, attackers won't > > easily run arbitrary code, memfd is one of the open doors for that, so > > we are disabling executable memfd in ChromeOS. In other words: if an > > attacker can already execute the arbitrary code as sample given in > > ChromeOS, without using executable memfd, then memfd is no longer the > > thing we need to worry about, the arbitrary code execution is already > > achieved by the attacker. Even though I use ChromeOS as an example, I > > think the same type of threat model applies to any system that wants > > to disable executable memfd entirely. > > I understand the threat model this sysctl is blocking, my point is that > blocking CAP_SYS_ADMIN from modifying the setting doesn't make sense > from that threat model. An attacker that manages to trick some process > into creating a memfd with an executable payload is not going to be able > to change the sysctl setting (unless there's a confused deputy with > CAP_SYS_ADMIN, in which case you have much bigger issues). > It is the reverse. An attacker that manages to trick some CAP_SYSADMIN processes into changing this sysctl value (i.e. lower the setting to 0 if no ratcheting), will be able to continue to use mfd as part of the attack chain. In chromeOS, an attacker that can change sysctl might not necessarily gain full arbitrary code execution already. As I mentioned previously, the main threat model here is to prevent arbitrary code execution through mfd. If an attacker already gains arbitrary code execution, at that point, we no longer worry about mfd. > If a CAP_SYS_ADMIN-capable user wants to change the sysctl, blocking it > doesn't add any security because that process could create a memfd-like > fd to execute without issues. >What practical attack does this ratcheting > mechanism protect against? (This is a question you can answer with the > YAMA sysctl, but not this one AFAICS.) > > But even if you feel that allowing this in child user namespaces is > unsafe or undesirable, it's absolutely necessary that > capable(CAP_SYS_ADMIN) should be able to un-brick the running system by > changing the sysctl. The alternative is that you need to reboot your > server in order to un-set a sysctl that broke some application you run. > > Also, by the same token, this ratcheting mechanism doesn't make sense > with =1 *at all* because it could break programs in a way that would > require a reboot but it's not a "security setting" (and the YAMA sysctl > mentioned only locks the sysctl at the highest setting). > I think a system should use "=0" when it is unsure about its program's need or not need executable memfd. Technically, it is not that this sysctl breaks the user, but the admin made the mistake to set the wrong sysctl value, and an admin should know what they are doing for a sysctl. Yes. rebooting increases the steps to undo the mistake, but that could be an incentive for the admin to fully test its programs before turning on this sysctl - and avoid unexpected runtime errors. Thanks! -Jeff > > > * The memfd selftests would not exit with a non-zero error code when > > > certain tests that ran in a forked process (specifically the ones > > > related to MFD_EXEC and MFD_NOEXEC_SEAL) failed. > > > > > I will test this code and follow up. > > Thanks! > > -- > Aleksa Sarai > Senior Software Engineer (Containers) > SUSE Linux GmbH > <https://www.cyphar.com/> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec 2023-07-19 7:17 ` Jeff Xu @ 2023-08-02 1:04 ` Aleksa Sarai 2023-08-02 19:47 ` Jeff Xu 2023-08-02 20:45 ` Jeff Xu 0 siblings, 2 replies; 20+ messages in thread From: Aleksa Sarai @ 2023-08-02 1:04 UTC (permalink / raw) To: Jeff Xu Cc: Jeff Xu, Andrew Morton, Shuah Khan, Kees Cook, Daniel Verkamp, Luis Chamberlain, YueHaibing, linux-mm, linux-kernel, linux-kselftest, linux-hardening [-- Attachment #1: Type: text/plain, Size: 15745 bytes --] On 2023-07-19, Jeff Xu <jeffxu@google.com> wrote: > On Tue, Jul 18, 2023 at 8:10 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > On 2023-07-17, Jeff Xu <jeffxu@chromium.org> wrote: > > > Hello Aleksa, > > > > > > Thanks for your email and patches for discussion. > > > > > > On Thu, Jul 13, 2023 at 7:34 AM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > > > > > > It seems that the most critical issue with vm.memfd_noexec=2 (the fact > > > > that passing MFD_EXEC would bypass it entirely[1]) has been fixed in > > > > Andrew's tree[2], but there are still some outstanding issues that need > > > > to be addressed: > > > > > > > > * The dmesg warnings are pr_warn_once, which on most systems means that > > > > they will be used up by systemd or some other boot process and > > > > userspace developers will never see it. The original patch posted to > > > > the ML used pr_warn_ratelimited but the merged patch had it changed > > > > (with a comment about it being "per review"), but given that the > > > > current warnings are useless, pr_warn_ratelimited makes far more > > > > sense. > > > > > > > Ya, This was discussed in [1] > > > Replacing pr_warn_once with pr_warn_ratelimited won't address Peter > > > Xu's observation that "ratelimited" will fill syslog [2], I'm not > > > sure it is acceptable to ones who is not interested in memfd, I will > > > defer this to maintainers. > > > > > > [1] https://lore.kernel.org/lkml/202212161233.85C9783FB@keescook/ > > > [2] https://lwn.net/ml/linux-kernel/Y5yS8wCnuYGLHMj4@x1n/ > > > > I see Kees's point, but in that case the logging should be tied to the > > sysctl being the non-default value (I can post this version next if you > > prefer). The current logging setup doesn't make sense. > > > Is there a best practice in kernel for this problem: too much log vs > too little log > In other products, usually the log level or compiler flag (ifdef) are > for such a situation. It depends on what the purpose of the warning is -- if the purpose of the warning is to tell developers to migrate their programs, then a rate-limited warning is the only reasonable behaviour. pr_warn_ratelimited() is the most straight-forward way of doing the warning. There are alternatives (the pids cgroup has a per-cgroup warning that is only triggered the first time a fork() fails in that cgroup) -- but the key point is that the log will appear in dmesg in a way that a developer will be able to notice. Having a single warning at the very beginning of the boot sequence is useless. On my system, systemd does memfd_create() without the new flags, so all you see is the following message: [ 1.531305] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=1 'systemd' in the middle of all of the other stuff that gets spit out during boot. This obviously is going to be missed by basically all users. > > > > * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls > > > > because it will make it far to difficult to ever migrate. Instead it > > > > should imply MFD_EXEC. > > > > > > > Though the purpose of memfd_noexec=2 is not to help with migration - > > > but to disable creation of executable memfd for the current system/pid > > > namespace. > > > During the migration, vm.memfd_noexe = 1 helps overwriting for > > > unmigrated user code as a temporary measure. > > > > My point is that the current behaviour for =2 means that nobody other > > than *maybe* ChromeOS will ever be able to use it because it requires > > auditing every program on the system. In fact, it's possible even > > ChromeOS will run into issues given that one of the arguments made for > > the nosymfollow mount option was that auditing all of ChromeOS to > > replace every open with RESOLVE_NO_SYMLINKS would be too much effort[1] > > (which I agreed with). Maybe this is less of an issue with > > memfd_create(2) (which is much newer than open(2)) but it still seems > > like a lot of busy work when the =1 behaviour is entirely sane even in > > the strict threat model that =2 is trying to protect against. > > > It can also be a container (that have all memfd_create migrated to new API) If ChromeOS would struggle to rewrite all of the libraries they use, containers are in even worse shape -- most container users don't have a complete list of every package installed in a container, let alone the ability to audit whether they pass a (no-op) flag to memfd_create(2) in every codepath. > One option I considered previously was "=2" would do overwrite+block , > and "=3" just block. But then I worry that applications won't have > motivation to ever change their existing code, the setting will > forever stay at "=2", making "=3" even more impossible to ever be used > system side. What is the downside of overwriting? Backwards-compatibility is a very important part of Linux -- being able to use old programs without having to modify them is incredibly important. Yes, this behaviour is opt-in -- but I don't see the point of making opting in more difficult than necessary. Surely overwite+block provides the security guarantee you need from the threat model -- othewise nobody will be able to use block because you never know if one library will call memfd_create() "incorrectly" without the new flags. > > To me, using =1 as a migration path (and in fact, calling =1 a migration > > path further argues that the warning for not setting _EXEC or > > _NOEXEC_SEAL should be tied to =1) would mean finding every program that > > uses executable memfds and changing it to stop doing that. Not that you > > use =1 to go and rewrite every userspace program that uses > > memfd_create(2) at all, without using executable memfds (rebooting each > > time to test the behaviour because we use pr_warn_once). > > > I tend to think logging and sysctl are orthogonal, tie them together > making it more complex than necessary. If we need more logging, we > should find what is the best practice in the kernel for that. See my above comment -- tying it to >= 1 would alleviate the log spam concerns. pr_warn_once() is useless for this purpose. > > If you want to block syscalls that don't explicitly pass NOEXEC_SEAL, > > there are several tools for doing this (both seccomp and LSM hooks). > > > > [1]: https://lore.kernel.org/linux-fsdevel/20200131212021.GA108613@google.com/ > > > > > Additional functionality/features should be implemented through > > > security hook and LSM, not sysctl, I think. > > > > This issue with =2 cannot be fixed in an LSM. (On the other hand, you > > could implement either =2 behaviour with an LSM using =1, and the > > current strict =2 behaviour could be implemented purely with seccomp.) > > > By migration, I mean a system that is not fully migrated, such a > system should just use "=0" or "=1". Additional features can be > implemented in SELinux/Landlock/other LSM by a motivated dev. e.g. if > a system wants to limit executable memfd to specific programs or fully > disable it. > "=2" is for a system/container that is fully migrated, in that case, > SELinux/Landlock/LSM can do the same, but sysctl provides a convenient > alternative. > Yes, seccomp provides a similar mechanism. Indeed, combining "=1" and > seccomp (block MFD_EXEC), it will overwrite + block X mfd, which is > essentially what you want, iiuc.However, I do not wish to have this > implemented in kernel, due to the thinking that I want kernel to get > out of business of "overwriting" eventually. See my above comments -- "overwriting" is perfectly acceptable to me. There's also no way to "get out of the business of overwriting" -- Linux has strict backwards compatibility requirements. > > > > * The ratcheting mechanism for vm.memfd_noexec doesn't make sense as a > > > > security mechanism because a CAP_SYS_ADMIN capable user can create > > > > executable binaries in a hidden tmpfs very easily, not to mention the > > > > many other things they can do. > > > > > > > By further limiting CAP_SYS_ADMIN, an attacker can't modify this > > > sysctl even after compromising some system service with high > > > privilege, YAMA has the same approach for ptrace_scope=3 > > > > Personally, I also think this behaviour from YAMA is a little goofy too, > > but given that it only locks the most extreme setting and there is no > > way to get around the most extreme setting, I guess it makes some sense > > (not to mention it's an LSM and so there is an argument that it should > > be possible to lock out privileged users from modifying it). > > There are many other security sysctls, and very few have this behaviour > > because it doesn't make much sense in most cases. > > > > > In addition, this sysctl is pid_name spaced, this means child > > > pid_namespace will alway have the same or stricter security setting > > > than its parent, this allows admin to maintain a tree like view. If we > > > allow the child pid namespace to elevate its setting, then the > > > system-wide setting is no longer meaningful. > > > > "no longer meaningful" is too strong of a statement imho. It is still > > useful for constraining non-root processes and presumably ChromeOS > > disallows random processes to do CLONE_NEWUSER (otherwise the protection > > of this sysctl is pointless) so in practice for ChromeOS there is no > > change in the attack surface. > > > > (FWIW, I think tying this to the user namespace would've made more sense > > since this is about privilege restrictions, but that ship has sailed.) > > > The reason that this sysctl is a PID namespace is that I hope a > container and host can have different sysctl values, e.g. host will > allow runc's use of X mfd, while a container doesn't want X mfd. . > To clarify what you meant, do you mean this: when a container is in > its own pid_namespace, and has "=2", the programs inside the container > can still use CLONE_NEWUSER to break out "=2" ? With the current implementation, this is not possible. My point was that even if it were possible to lower the sysctl, ChromeOS presumably already blocks the operations that a user would be able to use to create a memfd (an unprivileged user cannot CLONE_NEWPID to modify the sysctl without CLONE_NEWUSER, which is presumably blocked on ChromeOS due to the other security concerns). > And what makes the user namespace a better choice than pid namespace > for this sysctl ? It's too late for this discussion, the sysctl has already shipped and you can't change this behaviour anymore. However, the simple reason is that the kernel ties permission-related things to the user namespace because credentials are tied to the userns -- the only other pidns-related sysctl is kernel.ns_last_pid, which clearly needs to be tied to the pidns. > > > The code sample shared in this patch set indicates that the attacker > > > already has the ability of creating tmpfs and executing complex steps, > > > at that point, it doesn't matter if the code execution is from memfd > > > or not. For a safe by default system such as ChromeOS, attackers won't > > > easily run arbitrary code, memfd is one of the open doors for that, so > > > we are disabling executable memfd in ChromeOS. In other words: if an > > > attacker can already execute the arbitrary code as sample given in > > > ChromeOS, without using executable memfd, then memfd is no longer the > > > thing we need to worry about, the arbitrary code execution is already > > > achieved by the attacker. Even though I use ChromeOS as an example, I > > > think the same type of threat model applies to any system that wants > > > to disable executable memfd entirely. > > > > I understand the threat model this sysctl is blocking, my point is that > > blocking CAP_SYS_ADMIN from modifying the setting doesn't make sense > > from that threat model. An attacker that manages to trick some process > > into creating a memfd with an executable payload is not going to be able > > to change the sysctl setting (unless there's a confused deputy with > > CAP_SYS_ADMIN, in which case you have much bigger issues). > > > It is the reverse. An attacker that manages to trick some > CAP_SYSADMIN processes into changing this sysctl value (i.e. lower the > setting to 0 if no ratcheting), will be able to continue to use mfd as > part of the attack chain. > In chromeOS, an attacker that can change sysctl might not necessarily > gain full arbitrary code execution already. As I mentioned previously, > the main threat model here is to prevent arbitrary code execution > through mfd. If an attacker already gains arbitrary code execution, > at that point, we no longer worry about mfd. If an attacker can trick a privileged process into writing to arbitrary sysctls, the system has much bigger issues than arbitrary (presumably unprivileged) code execution. On the other hand, requiring you to reboot a server due to a misconfigured sysctl *is* broken. Again, at the very least, not even allowing capable(CAP_SYS_ADMIN) to change the setting is actually broken. > > If a CAP_SYS_ADMIN-capable user wants to change the sysctl, blocking it > > doesn't add any security because that process could create a memfd-like > > fd to execute without issues. > >What practical attack does this ratcheting > > mechanism protect against? (This is a question you can answer with the > > YAMA sysctl, but not this one AFAICS.) > > > > But even if you feel that allowing this in child user namespaces is > > unsafe or undesirable, it's absolutely necessary that > > capable(CAP_SYS_ADMIN) should be able to un-brick the running system by > > changing the sysctl. The alternative is that you need to reboot your > > server in order to un-set a sysctl that broke some application you run. > > > > > Also, by the same token, this ratcheting mechanism doesn't make sense > > with =1 *at all* because it could break programs in a way that would > > require a reboot but it's not a "security setting" (and the YAMA sysctl > > mentioned only locks the sysctl at the highest setting). > > > I think a system should use "=0" when it is unsure about its program's > need or not need executable memfd. Technically, it is not that this > sysctl breaks the user, but the admin made the mistake to set the > wrong sysctl value, and an admin should know what they are doing for a > sysctl. Yes. rebooting increases the steps to undo the mistake, but > that could be an incentive for the admin to fully test its programs > before turning on this sysctl - and avoid unexpected runtime errors. I don't think this stance is really acceptable -- if an admin that has privileges to load kernel modules is not able to disable a sysctl that can break working programs without rebooting there is When this sysctl was first proposed a few years ago (when kernel folks found out that runc was using executable memfds), my understanding is that the long-term goal was to switch programs to have non-executable-memfds by default on most distributions. Making it impossible for an admin to lower the sysctl value flies in the face of this goal. At the very least, being unable to lower the sysctl from =1 to =0 is just broken (even if you use the yama example -- yama only locks the sysctl at highest possible setting, not on lower settings). But in my view, having this sysctl ratchet at all doesn't make sense. > Thanks! > -Jeff -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec 2023-08-02 1:04 ` Aleksa Sarai @ 2023-08-02 19:47 ` Jeff Xu 2023-08-02 19:54 ` Jeff Xu 2023-08-02 21:38 ` Aleksa Sarai 2023-08-02 20:45 ` Jeff Xu 1 sibling, 2 replies; 20+ messages in thread From: Jeff Xu @ 2023-08-02 19:47 UTC (permalink / raw) To: Aleksa Sarai Cc: Jeff Xu, Andrew Morton, Shuah Khan, Kees Cook, Daniel Verkamp, Luis Chamberlain, YueHaibing, linux-mm, linux-kernel, linux-kselftest, linux-hardening On Tue, Aug 1, 2023 at 6:05 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > This thread is getting longer with different topics, I will try to respond with trimmed interleaved replies [1] There are 3 topics (logging/'migration/ratcheting), this response will be regarding ratcheting. [1] https://www.kernel.org/doc/html/latest/process/submitting-patches.html?highlight=signed%20off#use-trimmed-interleaved-replies-in-email-discussions > > > > > > * The ratcheting mechanism for vm.memfd_noexec doesn't make sense as a > > > > > security mechanism because a CAP_SYS_ADMIN capable user can create > > > > > executable binaries in a hidden tmpfs very easily, not to mention the > > > > > many other things they can do. > > > > > > > > > By further limiting CAP_SYS_ADMIN, an attacker can't modify this > > > > sysctl even after compromising some system service with high > > > > privilege, YAMA has the same approach for ptrace_scope=3 > > > > > > Personally, I also think this behaviour from YAMA is a little goofy too, > > > but given that it only locks the most extreme setting and there is no > > > way to get around the most extreme setting, I guess it makes some sense > > > (not to mention it's an LSM and so there is an argument that it should > > > be possible to lock out privileged users from modifying it). > > > There are many other security sysctls, and very few have this behaviour > > > because it doesn't make much sense in most cases. > > > > > > > In addition, this sysctl is pid_name spaced, this means child > > > > pid_namespace will alway have the same or stricter security setting > > > > than its parent, this allows admin to maintain a tree like view. If we > > > > allow the child pid namespace to elevate its setting, then the > > > > system-wide setting is no longer meaningful. > > > > > > "no longer meaningful" is too strong of a statement imho. It is still > > > useful for constraining non-root processes and presumably ChromeOS > > > disallows random processes to do CLONE_NEWUSER (otherwise the protection > > > of this sysctl is pointless) so in practice for ChromeOS there is no > > > change in the attack surface. > > > > > > (FWIW, I think tying this to the user namespace would've made more sense > > > since this is about privilege restrictions, but that ship has sailed.) > > > > > The reason that this sysctl is a PID namespace is that I hope a > > container and host can have different sysctl values, e.g. host will > > allow runc's use of X mfd, while a container doesn't want X mfd. . > > To clarify what you meant, do you mean this: when a container is in > > its own pid_namespace, and has "=2", the programs inside the container > > can still use CLONE_NEWUSER to break out "=2" ? > > With the current implementation, this is not possible. My point was that > even if it were possible to lower the sysctl, ChromeOS presumably > already blocks the operations that a user would be able to use to create > a memfd (an unprivileged user cannot CLONE_NEWPID to modify the sysctl > without CLONE_NEWUSER, which is presumably blocked on ChromeOS due to > the other security concerns). > > > > > > The code sample shared in this patch set indicates that the attacker > > > > already has the ability of creating tmpfs and executing complex steps, > > > > at that point, it doesn't matter if the code execution is from memfd > > > > or not. For a safe by default system such as ChromeOS, attackers won't > > > > easily run arbitrary code, memfd is one of the open doors for that, so > > > > we are disabling executable memfd in ChromeOS. In other words: if an > > > > attacker can already execute the arbitrary code as sample given in > > > > ChromeOS, without using executable memfd, then memfd is no longer the > > > > thing we need to worry about, the arbitrary code execution is already > > > > achieved by the attacker. Even though I use ChromeOS as an example, I > > > > think the same type of threat model applies to any system that wants > > > > to disable executable memfd entirely. > > > > > > I understand the threat model this sysctl is blocking, my point is that > > > blocking CAP_SYS_ADMIN from modifying the setting doesn't make sense > > > from that threat model. An attacker that manages to trick some process > > > into creating a memfd with an executable payload is not going to be able > > > to change the sysctl setting (unless there's a confused deputy with > > > CAP_SYS_ADMIN, in which case you have much bigger issues). > > > > > It is the reverse. An attacker that manages to trick some > > CAP_SYSADMIN processes into changing this sysctl value (i.e. lower the > > setting to 0 if no ratcheting), will be able to continue to use mfd as > > part of the attack chain. > > In chromeOS, an attacker that can change sysctl might not necessarily > > gain full arbitrary code execution already. As I mentioned previously, > > the main threat model here is to prevent arbitrary code execution > > through mfd. If an attacker already gains arbitrary code execution, > > at that point, we no longer worry about mfd. > > If an attacker can trick a privileged process into writing to arbitrary > sysctls, the system has much bigger issues than arbitrary (presumably > unprivileged) code execution. On the other hand, requiring you to reboot > a server due to a misconfigured sysctl *is* broken. > > Again, at the very least, not even allowing capable(CAP_SYS_ADMIN) to > change the setting is actually broken. > > > > If a CAP_SYS_ADMIN-capable user wants to change the sysctl, blocking it > > > doesn't add any security because that process could create a memfd-like > > > fd to execute without issues. > > >What practical attack does this ratcheting > > > mechanism protect against? (This is a question you can answer with the > > > YAMA sysctl, but not this one AFAICS.) > > > > > > But even if you feel that allowing this in child user namespaces is > > > unsafe or undesirable, it's absolutely necessary that > > > capable(CAP_SYS_ADMIN) should be able to un-brick the running system by > > > changing the sysctl. The alternative is that you need to reboot your > > > server in order to un-set a sysctl that broke some application you run. > > > > > > > > Also, by the same token, this ratcheting mechanism doesn't make sense > > > with =1 *at all* because it could break programs in a way that would > > > require a reboot but it's not a "security setting" (and the YAMA sysctl > > > mentioned only locks the sysctl at the highest setting). > > > > > I think a system should use "=0" when it is unsure about its program's > > need or not need executable memfd. Technically, it is not that this > > sysctl breaks the user, but the admin made the mistake to set the > > wrong sysctl value, and an admin should know what they are doing for a > > sysctl. Yes. rebooting increases the steps to undo the mistake, but > > that could be an incentive for the admin to fully test its programs > > before turning on this sysctl - and avoid unexpected runtime errors. > > I don't think this stance is really acceptable -- if an admin that has > privileges to load kernel modules is not able to disable a sysctl that > can break working programs without rebooting there is > > When this sysctl was first proposed a few years ago (when kernel folks > found out that runc was using executable memfds), my understanding is > that the long-term goal was to switch programs to have > non-executable-memfds by default on most distributions. Making it > impossible for an admin to lower the sysctl value flies in the face of > this goal. > > At the very least, being unable to lower the sysctl from =1 to =0 is > just broken (even if you use the yama example -- yama only locks the > sysctl at highest possible setting, not on lower settings). But in my > view, having this sysctl ratchet at all doesn't make sense. > To reiterate/summarize the current mechanism for vm.memfd_noexec 1> It is a pid namespace sysctl, init ns and child pid ns can have different setting values. 2> child pid ns inherits parent's pid ns's sysctl at the time of fork. 3> There are 3 values for the sysctl, each higher value is more restrictive than the lower one. Once set, doesn't allow downgrading. It can be used as following: 1> init ns: vm.memfd_noexec = 2 (at boot time) Not allow executable memfd for the entire system, including its containers. 2> init ns: vm.memfd_noexec = 0 or 1 container (child init namespace) vm.memfd_noexec = 2. The host allows runc's usage of executable memfd during container creation. Inside the container, executable memfd is not allowed. The inherence + not allow downgrading is to reason with how vm.memfd_noexec is applied in the process tree. Without it, essentially we are losing the hierarchy view across the process tree and a process can evaluate its capability by modifying the setting. I think that is a less secure approach I would not prefer. Thanks -Jeff ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec 2023-08-02 19:47 ` Jeff Xu @ 2023-08-02 19:54 ` Jeff Xu 2023-08-02 21:38 ` Aleksa Sarai 1 sibling, 0 replies; 20+ messages in thread From: Jeff Xu @ 2023-08-02 19:54 UTC (permalink / raw) To: Aleksa Sarai Cc: Jeff Xu, Andrew Morton, Shuah Khan, Kees Cook, Daniel Verkamp, Luis Chamberlain, YueHaibing, linux-mm, linux-kernel, linux-kselftest, linux-hardening On Wed, Aug 2, 2023 at 12:47 PM Jeff Xu <jeffxu@chromium.org> wrote: > > On Tue, Aug 1, 2023 at 6:05 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > This thread is getting longer with different topics, I will try to > respond with trimmed interleaved replies [1] > There are 3 topics (logging/'migration/ratcheting), this response will > be regarding ratcheting. > > [1] https://www.kernel.org/doc/html/latest/process/submitting-patches.html?highlight=signed%20off#use-trimmed-interleaved-replies-in-email-discussions > > > > > > > > > * The ratcheting mechanism for vm.memfd_noexec doesn't make sense as a > > > > > > security mechanism because a CAP_SYS_ADMIN capable user can create > > > > > > executable binaries in a hidden tmpfs very easily, not to mention the > > > > > > many other things they can do. > > > > > > > > > > > By further limiting CAP_SYS_ADMIN, an attacker can't modify this > > > > > sysctl even after compromising some system service with high > > > > > privilege, YAMA has the same approach for ptrace_scope=3 > > > > > > > > Personally, I also think this behaviour from YAMA is a little goofy too, > > > > but given that it only locks the most extreme setting and there is no > > > > way to get around the most extreme setting, I guess it makes some sense > > > > (not to mention it's an LSM and so there is an argument that it should > > > > be possible to lock out privileged users from modifying it). > > > > There are many other security sysctls, and very few have this behaviour > > > > because it doesn't make much sense in most cases. > > > > > > > > > In addition, this sysctl is pid_name spaced, this means child > > > > > pid_namespace will alway have the same or stricter security setting > > > > > than its parent, this allows admin to maintain a tree like view. If we > > > > > allow the child pid namespace to elevate its setting, then the > > > > > system-wide setting is no longer meaningful. > > > > > > > > "no longer meaningful" is too strong of a statement imho. It is still > > > > useful for constraining non-root processes and presumably ChromeOS > > > > disallows random processes to do CLONE_NEWUSER (otherwise the protection > > > > of this sysctl is pointless) so in practice for ChromeOS there is no > > > > change in the attack surface. > > > > > > > > (FWIW, I think tying this to the user namespace would've made more sense > > > > since this is about privilege restrictions, but that ship has sailed.) > > > > > > > The reason that this sysctl is a PID namespace is that I hope a > > > container and host can have different sysctl values, e.g. host will > > > allow runc's use of X mfd, while a container doesn't want X mfd. . > > > To clarify what you meant, do you mean this: when a container is in > > > its own pid_namespace, and has "=2", the programs inside the container > > > can still use CLONE_NEWUSER to break out "=2" ? > > > > With the current implementation, this is not possible. My point was that > > even if it were possible to lower the sysctl, ChromeOS presumably > > already blocks the operations that a user would be able to use to create > > a memfd (an unprivileged user cannot CLONE_NEWPID to modify the sysctl > > without CLONE_NEWUSER, which is presumably blocked on ChromeOS due to > > the other security concerns). > > > > > > > > > The code sample shared in this patch set indicates that the attacker > > > > > already has the ability of creating tmpfs and executing complex steps, > > > > > at that point, it doesn't matter if the code execution is from memfd > > > > > or not. For a safe by default system such as ChromeOS, attackers won't > > > > > easily run arbitrary code, memfd is one of the open doors for that, so > > > > > we are disabling executable memfd in ChromeOS. In other words: if an > > > > > attacker can already execute the arbitrary code as sample given in > > > > > ChromeOS, without using executable memfd, then memfd is no longer the > > > > > thing we need to worry about, the arbitrary code execution is already > > > > > achieved by the attacker. Even though I use ChromeOS as an example, I > > > > > think the same type of threat model applies to any system that wants > > > > > to disable executable memfd entirely. > > > > > > > > I understand the threat model this sysctl is blocking, my point is that > > > > blocking CAP_SYS_ADMIN from modifying the setting doesn't make sense > > > > from that threat model. An attacker that manages to trick some process > > > > into creating a memfd with an executable payload is not going to be able > > > > to change the sysctl setting (unless there's a confused deputy with > > > > CAP_SYS_ADMIN, in which case you have much bigger issues). > > > > > > > It is the reverse. An attacker that manages to trick some > > > CAP_SYSADMIN processes into changing this sysctl value (i.e. lower the > > > setting to 0 if no ratcheting), will be able to continue to use mfd as > > > part of the attack chain. > > > In chromeOS, an attacker that can change sysctl might not necessarily > > > gain full arbitrary code execution already. As I mentioned previously, > > > the main threat model here is to prevent arbitrary code execution > > > through mfd. If an attacker already gains arbitrary code execution, > > > at that point, we no longer worry about mfd. > > > > If an attacker can trick a privileged process into writing to arbitrary > > sysctls, the system has much bigger issues than arbitrary (presumably > > unprivileged) code execution. On the other hand, requiring you to reboot > > a server due to a misconfigured sysctl *is* broken. > > > > Again, at the very least, not even allowing capable(CAP_SYS_ADMIN) to > > change the setting is actually broken. > > > > > > If a CAP_SYS_ADMIN-capable user wants to change the sysctl, blocking it > > > > doesn't add any security because that process could create a memfd-like > > > > fd to execute without issues. > > > >What practical attack does this ratcheting > > > > mechanism protect against? (This is a question you can answer with the > > > > YAMA sysctl, but not this one AFAICS.) > > > > > > > > But even if you feel that allowing this in child user namespaces is > > > > unsafe or undesirable, it's absolutely necessary that > > > > capable(CAP_SYS_ADMIN) should be able to un-brick the running system by > > > > changing the sysctl. The alternative is that you need to reboot your > > > > server in order to un-set a sysctl that broke some application you run. > > > > > > > > > > > Also, by the same token, this ratcheting mechanism doesn't make sense > > > > with =1 *at all* because it could break programs in a way that would > > > > require a reboot but it's not a "security setting" (and the YAMA sysctl > > > > mentioned only locks the sysctl at the highest setting). > > > > > > > I think a system should use "=0" when it is unsure about its program's > > > need or not need executable memfd. Technically, it is not that this > > > sysctl breaks the user, but the admin made the mistake to set the > > > wrong sysctl value, and an admin should know what they are doing for a > > > sysctl. Yes. rebooting increases the steps to undo the mistake, but > > > that could be an incentive for the admin to fully test its programs > > > before turning on this sysctl - and avoid unexpected runtime errors. > > > > I don't think this stance is really acceptable -- if an admin that has > > privileges to load kernel modules is not able to disable a sysctl that > > can break working programs without rebooting there is > > > > When this sysctl was first proposed a few years ago (when kernel folks > > found out that runc was using executable memfds), my understanding is > > that the long-term goal was to switch programs to have > > non-executable-memfds by default on most distributions. Making it > > impossible for an admin to lower the sysctl value flies in the face of > > this goal. > > > > At the very least, being unable to lower the sysctl from =1 to =0 is > > just broken (even if you use the yama example -- yama only locks the > > sysctl at highest possible setting, not on lower settings). But in my > > view, having this sysctl ratchet at all doesn't make sense. > > > To reiterate/summarize the current mechanism for vm.memfd_noexec > > 1> It is a pid namespace sysctl, init ns and child pid ns can have > different setting values. > 2> child pid ns inherits parent's pid ns's sysctl at the time of fork. > 3> There are 3 values for the sysctl, each higher value is more > restrictive than the lower one. Once set, doesn't allow downgrading. > > It can be used as following: > 1> > init ns: vm.memfd_noexec = 2 (at boot time) > Not allow executable memfd for the entire system, including its containers. > > 2> > init ns: vm.memfd_noexec = 0 or 1 > container (child init namespace) vm.memfd_noexec = 2. > The host allows runc's usage of executable memfd during container > creation. Inside the container, executable memfd is not allowed. > > The inherence + not allow downgrading is to reason with how > vm.memfd_noexec is applied in the process tree. > Without it, essentially we are losing the hierarchy view across the > process tree and a process can evaluate its capability by modifying *elevate* > the setting. I think that is a less secure approach I would not > prefer. > > Thanks > > -Jeff ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec 2023-08-02 19:47 ` Jeff Xu 2023-08-02 19:54 ` Jeff Xu @ 2023-08-02 21:38 ` Aleksa Sarai 2023-08-02 23:11 ` Jeff Xu 1 sibling, 1 reply; 20+ messages in thread From: Aleksa Sarai @ 2023-08-02 21:38 UTC (permalink / raw) To: Jeff Xu Cc: Jeff Xu, Andrew Morton, Shuah Khan, Kees Cook, Daniel Verkamp, Luis Chamberlain, YueHaibing, Christian Brauner, linux-mm, linux-kernel, linux-kselftest, linux-hardening [-- Attachment #1: Type: text/plain, Size: 10996 bytes --] On 2023-08-02, Jeff Xu <jeffxu@chromium.org> wrote: > On Tue, Aug 1, 2023 at 6:05 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > This thread is getting longer with different topics, I will try to > respond with trimmed interleaved replies [1] > There are 3 topics (logging/'migration/ratcheting), this response will > be regarding ratcheting. The migration and ratcheting topics are interconnected because the migration issue makes ratcheting an even more severe issue. But I'll respond to each thread separately. > [1] https://www.kernel.org/doc/html/latest/process/submitting-patches.html?highlight=signed%20off#use-trimmed-interleaved-replies-in-email-discussions > > > > > > > > > * The ratcheting mechanism for vm.memfd_noexec doesn't make sense as a > > > > > > security mechanism because a CAP_SYS_ADMIN capable user can create > > > > > > executable binaries in a hidden tmpfs very easily, not to mention the > > > > > > many other things they can do. > > > > > > > > > > > By further limiting CAP_SYS_ADMIN, an attacker can't modify this > > > > > sysctl even after compromising some system service with high > > > > > privilege, YAMA has the same approach for ptrace_scope=3 > > > > > > > > Personally, I also think this behaviour from YAMA is a little goofy too, > > > > but given that it only locks the most extreme setting and there is no > > > > way to get around the most extreme setting, I guess it makes some sense > > > > (not to mention it's an LSM and so there is an argument that it should > > > > be possible to lock out privileged users from modifying it). > > > > There are many other security sysctls, and very few have this behaviour > > > > because it doesn't make much sense in most cases. > > > > > > > > > In addition, this sysctl is pid_name spaced, this means child > > > > > pid_namespace will alway have the same or stricter security setting > > > > > than its parent, this allows admin to maintain a tree like view. If we > > > > > allow the child pid namespace to elevate its setting, then the > > > > > system-wide setting is no longer meaningful. > > > > > > > > "no longer meaningful" is too strong of a statement imho. It is still > > > > useful for constraining non-root processes and presumably ChromeOS > > > > disallows random processes to do CLONE_NEWUSER (otherwise the protection > > > > of this sysctl is pointless) so in practice for ChromeOS there is no > > > > change in the attack surface. > > > > > > > > (FWIW, I think tying this to the user namespace would've made more sense > > > > since this is about privilege restrictions, but that ship has sailed.) > > > > > > > The reason that this sysctl is a PID namespace is that I hope a > > > container and host can have different sysctl values, e.g. host will > > > allow runc's use of X mfd, while a container doesn't want X mfd. . > > > To clarify what you meant, do you mean this: when a container is in > > > its own pid_namespace, and has "=2", the programs inside the container > > > can still use CLONE_NEWUSER to break out "=2" ? > > > > With the current implementation, this is not possible. My point was that > > even if it were possible to lower the sysctl, ChromeOS presumably > > already blocks the operations that a user would be able to use to create > > a memfd (an unprivileged user cannot CLONE_NEWPID to modify the sysctl > > without CLONE_NEWUSER, which is presumably blocked on ChromeOS due to > > the other security concerns). > > > > > > > > > The code sample shared in this patch set indicates that the attacker > > > > > already has the ability of creating tmpfs and executing complex steps, > > > > > at that point, it doesn't matter if the code execution is from memfd > > > > > or not. For a safe by default system such as ChromeOS, attackers won't > > > > > easily run arbitrary code, memfd is one of the open doors for that, so > > > > > we are disabling executable memfd in ChromeOS. In other words: if an > > > > > attacker can already execute the arbitrary code as sample given in > > > > > ChromeOS, without using executable memfd, then memfd is no longer the > > > > > thing we need to worry about, the arbitrary code execution is already > > > > > achieved by the attacker. Even though I use ChromeOS as an example, I > > > > > think the same type of threat model applies to any system that wants > > > > > to disable executable memfd entirely. > > > > > > > > I understand the threat model this sysctl is blocking, my point is that > > > > blocking CAP_SYS_ADMIN from modifying the setting doesn't make sense > > > > from that threat model. An attacker that manages to trick some process > > > > into creating a memfd with an executable payload is not going to be able > > > > to change the sysctl setting (unless there's a confused deputy with > > > > CAP_SYS_ADMIN, in which case you have much bigger issues). > > > > > > > It is the reverse. An attacker that manages to trick some > > > CAP_SYSADMIN processes into changing this sysctl value (i.e. lower the > > > setting to 0 if no ratcheting), will be able to continue to use mfd as > > > part of the attack chain. > > > In chromeOS, an attacker that can change sysctl might not necessarily > > > gain full arbitrary code execution already. As I mentioned previously, > > > the main threat model here is to prevent arbitrary code execution > > > through mfd. If an attacker already gains arbitrary code execution, > > > at that point, we no longer worry about mfd. > > > > If an attacker can trick a privileged process into writing to arbitrary > > sysctls, the system has much bigger issues than arbitrary (presumably > > unprivileged) code execution. On the other hand, requiring you to reboot > > a server due to a misconfigured sysctl *is* broken. > > > > Again, at the very least, not even allowing capable(CAP_SYS_ADMIN) to > > change the setting is actually broken. > > > > > > If a CAP_SYS_ADMIN-capable user wants to change the sysctl, blocking it > > > > doesn't add any security because that process could create a memfd-like > > > > fd to execute without issues. > > > >What practical attack does this ratcheting > > > > mechanism protect against? (This is a question you can answer with the > > > > YAMA sysctl, but not this one AFAICS.) > > > > > > > > But even if you feel that allowing this in child user namespaces is > > > > unsafe or undesirable, it's absolutely necessary that > > > > capable(CAP_SYS_ADMIN) should be able to un-brick the running system by > > > > changing the sysctl. The alternative is that you need to reboot your > > > > server in order to un-set a sysctl that broke some application you run. > > > > > > > > > > > Also, by the same token, this ratcheting mechanism doesn't make sense > > > > with =1 *at all* because it could break programs in a way that would > > > > require a reboot but it's not a "security setting" (and the YAMA sysctl > > > > mentioned only locks the sysctl at the highest setting). > > > > > > > I think a system should use "=0" when it is unsure about its program's > > > need or not need executable memfd. Technically, it is not that this > > > sysctl breaks the user, but the admin made the mistake to set the > > > wrong sysctl value, and an admin should know what they are doing for a > > > sysctl. Yes. rebooting increases the steps to undo the mistake, but > > > that could be an incentive for the admin to fully test its programs > > > before turning on this sysctl - and avoid unexpected runtime errors. > > > > I don't think this stance is really acceptable -- if an admin that has > > privileges to load kernel modules is not able to disable a sysctl that > > can break working programs without rebooting there is > > > > When this sysctl was first proposed a few years ago (when kernel folks > > found out that runc was using executable memfds), my understanding is > > that the long-term goal was to switch programs to have > > non-executable-memfds by default on most distributions. Making it > > impossible for an admin to lower the sysctl value flies in the face of > > this goal. > > > > At the very least, being unable to lower the sysctl from =1 to =0 is > > just broken (even if you use the yama example -- yama only locks the > > sysctl at highest possible setting, not on lower settings). But in my > > view, having this sysctl ratchet at all doesn't make sense. > > > To reiterate/summarize the current mechanism for vm.memfd_noexec > > 1> It is a pid namespace sysctl, init ns and child pid ns can have > different setting values. > 2> child pid ns inherits parent's pid ns's sysctl at the time of fork. > 3> There are 3 values for the sysctl, each higher value is more > restrictive than the lower one. Once set, doesn't allow downgrading. > > It can be used as following: > 1> > init ns: vm.memfd_noexec = 2 (at boot time) > Not allow executable memfd for the entire system, including its containers. > > 2> > init ns: vm.memfd_noexec = 0 or 1 > container (child init namespace) vm.memfd_noexec = 2. > The host allows runc's usage of executable memfd during container > creation. Inside the container, executable memfd is not allowed. > > The inherence + not allow downgrading is to reason with how > vm.memfd_noexec is applied in the process tree. > Without it, essentially we are losing the hierarchy view across the > process tree and a process can evaluate its capability by modifying > the setting. I think that is a less secure approach I would not > prefer. If you really want the hierarchical aspect, we can implement it so that it's _actually_ hierarchical like so: * By default, your setting is the same as your parent (this is checked by going up the pidns tree -- a-la cgroups). This is less efficient but you want a hierarchy, so we can do it this way instead. * If you set a specific setting, that takes precedence but only if it's a greater or equal setting to your parent. * Trying to set a lower setting than your parent fails regardless of privileges. This will allow *privileged users* to lower the setting, but only if the parent pidns also has a lower setting. This allows a system admin to enforce the setting. It seems to me that this fulfils all the requirements you have. Most importantly, this would allow for a hierarchical view without having a sysctl that will break systems and nobody will use. I need to re-iterate this point -- nobody is going to use this sysctl as it currently works because it ratchets in a way that admins cannot undo. In practice this would mean you would need to reboot your whole datacenter if you didn't catch that an update to one of you dependencies didn't pass a required *noop* flag to memfd_create(). -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec 2023-08-02 21:38 ` Aleksa Sarai @ 2023-08-02 23:11 ` Jeff Xu 2023-08-03 4:39 ` Aleksa Sarai 0 siblings, 1 reply; 20+ messages in thread From: Jeff Xu @ 2023-08-02 23:11 UTC (permalink / raw) To: Aleksa Sarai Cc: Jeff Xu, Andrew Morton, Shuah Khan, Kees Cook, Daniel Verkamp, Luis Chamberlain, YueHaibing, Christian Brauner, linux-mm, linux-kernel, linux-kselftest, linux-hardening On Wed, Aug 2, 2023 at 2:39 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > On 2023-08-02, Jeff Xu <jeffxu@chromium.org> wrote: > > On Tue, Aug 1, 2023 at 6:05 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > > > This thread is getting longer with different topics, I will try to > > respond with trimmed interleaved replies [1] > > There are 3 topics (logging/'migration/ratcheting), this response will > > be regarding ratcheting. > > The migration and ratcheting topics are interconnected because the > migration issue makes ratcheting an even more severe issue. But I'll > respond to each thread separately. > > > [1] https://www.kernel.org/doc/html/latest/process/submitting-patches.html?highlight=signed%20off#use-trimmed-interleaved-replies-in-email-discussions > > > > > > > > > > > > * The ratcheting mechanism for vm.memfd_noexec doesn't make sense as a > > > > > > > security mechanism because a CAP_SYS_ADMIN capable user can create > > > > > > > executable binaries in a hidden tmpfs very easily, not to mention the > > > > > > > many other things they can do. > > > > > > > > > > > > > By further limiting CAP_SYS_ADMIN, an attacker can't modify this > > > > > > sysctl even after compromising some system service with high > > > > > > privilege, YAMA has the same approach for ptrace_scope=3 > > > > > > > > > > Personally, I also think this behaviour from YAMA is a little goofy too, > > > > > but given that it only locks the most extreme setting and there is no > > > > > way to get around the most extreme setting, I guess it makes some sense > > > > > (not to mention it's an LSM and so there is an argument that it should > > > > > be possible to lock out privileged users from modifying it). > > > > > There are many other security sysctls, and very few have this behaviour > > > > > because it doesn't make much sense in most cases. > > > > > > > > > > > In addition, this sysctl is pid_name spaced, this means child > > > > > > pid_namespace will alway have the same or stricter security setting > > > > > > than its parent, this allows admin to maintain a tree like view. If we > > > > > > allow the child pid namespace to elevate its setting, then the > > > > > > system-wide setting is no longer meaningful. > > > > > > > > > > "no longer meaningful" is too strong of a statement imho. It is still > > > > > useful for constraining non-root processes and presumably ChromeOS > > > > > disallows random processes to do CLONE_NEWUSER (otherwise the protection > > > > > of this sysctl is pointless) so in practice for ChromeOS there is no > > > > > change in the attack surface. > > > > > > > > > > (FWIW, I think tying this to the user namespace would've made more sense > > > > > since this is about privilege restrictions, but that ship has sailed.) > > > > > > > > > The reason that this sysctl is a PID namespace is that I hope a > > > > container and host can have different sysctl values, e.g. host will > > > > allow runc's use of X mfd, while a container doesn't want X mfd. . > > > > To clarify what you meant, do you mean this: when a container is in > > > > its own pid_namespace, and has "=2", the programs inside the container > > > > can still use CLONE_NEWUSER to break out "=2" ? > > > > > > With the current implementation, this is not possible. My point was that > > > even if it were possible to lower the sysctl, ChromeOS presumably > > > already blocks the operations that a user would be able to use to create > > > a memfd (an unprivileged user cannot CLONE_NEWPID to modify the sysctl > > > without CLONE_NEWUSER, which is presumably blocked on ChromeOS due to > > > the other security concerns). > > > > > > > > > > > > The code sample shared in this patch set indicates that the attacker > > > > > > already has the ability of creating tmpfs and executing complex steps, > > > > > > at that point, it doesn't matter if the code execution is from memfd > > > > > > or not. For a safe by default system such as ChromeOS, attackers won't > > > > > > easily run arbitrary code, memfd is one of the open doors for that, so > > > > > > we are disabling executable memfd in ChromeOS. In other words: if an > > > > > > attacker can already execute the arbitrary code as sample given in > > > > > > ChromeOS, without using executable memfd, then memfd is no longer the > > > > > > thing we need to worry about, the arbitrary code execution is already > > > > > > achieved by the attacker. Even though I use ChromeOS as an example, I > > > > > > think the same type of threat model applies to any system that wants > > > > > > to disable executable memfd entirely. > > > > > > > > > > I understand the threat model this sysctl is blocking, my point is that > > > > > blocking CAP_SYS_ADMIN from modifying the setting doesn't make sense > > > > > from that threat model. An attacker that manages to trick some process > > > > > into creating a memfd with an executable payload is not going to be able > > > > > to change the sysctl setting (unless there's a confused deputy with > > > > > CAP_SYS_ADMIN, in which case you have much bigger issues). > > > > > > > > > It is the reverse. An attacker that manages to trick some > > > > CAP_SYSADMIN processes into changing this sysctl value (i.e. lower the > > > > setting to 0 if no ratcheting), will be able to continue to use mfd as > > > > part of the attack chain. > > > > In chromeOS, an attacker that can change sysctl might not necessarily > > > > gain full arbitrary code execution already. As I mentioned previously, > > > > the main threat model here is to prevent arbitrary code execution > > > > through mfd. If an attacker already gains arbitrary code execution, > > > > at that point, we no longer worry about mfd. > > > > > > If an attacker can trick a privileged process into writing to arbitrary > > > sysctls, the system has much bigger issues than arbitrary (presumably > > > unprivileged) code execution. On the other hand, requiring you to reboot > > > a server due to a misconfigured sysctl *is* broken. > > > > > > Again, at the very least, not even allowing capable(CAP_SYS_ADMIN) to > > > change the setting is actually broken. > > > > > > > > If a CAP_SYS_ADMIN-capable user wants to change the sysctl, blocking it > > > > > doesn't add any security because that process could create a memfd-like > > > > > fd to execute without issues. > > > > >What practical attack does this ratcheting > > > > > mechanism protect against? (This is a question you can answer with the > > > > > YAMA sysctl, but not this one AFAICS.) > > > > > > > > > > But even if you feel that allowing this in child user namespaces is > > > > > unsafe or undesirable, it's absolutely necessary that > > > > > capable(CAP_SYS_ADMIN) should be able to un-brick the running system by > > > > > changing the sysctl. The alternative is that you need to reboot your > > > > > server in order to un-set a sysctl that broke some application you run. > > > > > > > > > > > > > > Also, by the same token, this ratcheting mechanism doesn't make sense > > > > > with =1 *at all* because it could break programs in a way that would > > > > > require a reboot but it's not a "security setting" (and the YAMA sysctl > > > > > mentioned only locks the sysctl at the highest setting). > > > > > > > > > I think a system should use "=0" when it is unsure about its program's > > > > need or not need executable memfd. Technically, it is not that this > > > > sysctl breaks the user, but the admin made the mistake to set the > > > > wrong sysctl value, and an admin should know what they are doing for a > > > > sysctl. Yes. rebooting increases the steps to undo the mistake, but > > > > that could be an incentive for the admin to fully test its programs > > > > before turning on this sysctl - and avoid unexpected runtime errors. > > > > > > I don't think this stance is really acceptable -- if an admin that has > > > privileges to load kernel modules is not able to disable a sysctl that > > > can break working programs without rebooting there is > > > > > > When this sysctl was first proposed a few years ago (when kernel folks > > > found out that runc was using executable memfds), my understanding is > > > that the long-term goal was to switch programs to have > > > non-executable-memfds by default on most distributions. Making it > > > impossible for an admin to lower the sysctl value flies in the face of > > > this goal. > > > > > > At the very least, being unable to lower the sysctl from =1 to =0 is > > > just broken (even if you use the yama example -- yama only locks the > > > sysctl at highest possible setting, not on lower settings). But in my > > > view, having this sysctl ratchet at all doesn't make sense. > > > > > To reiterate/summarize the current mechanism for vm.memfd_noexec > > > > 1> It is a pid namespace sysctl, init ns and child pid ns can have > > different setting values. > > 2> child pid ns inherits parent's pid ns's sysctl at the time of fork. > > 3> There are 3 values for the sysctl, each higher value is more > > restrictive than the lower one. Once set, doesn't allow downgrading. > > > > It can be used as following: > > 1> > > init ns: vm.memfd_noexec = 2 (at boot time) > > Not allow executable memfd for the entire system, including its containers. > > > > 2> > > init ns: vm.memfd_noexec = 0 or 1 > > container (child init namespace) vm.memfd_noexec = 2. > > The host allows runc's usage of executable memfd during container > > creation. Inside the container, executable memfd is not allowed. > > > > The inherence + not allow downgrading is to reason with how > > vm.memfd_noexec is applied in the process tree. > > Without it, essentially we are losing the hierarchy view across the > > process tree and a process can evaluate its capability by modifying > > the setting. I think that is a less secure approach I would not > > prefer. > > If you really want the hierarchical aspect, we can implement it so that > it's _actually_ hierarchical like so: > > * By default, your setting is the same as your parent (this is checked > by going up the pidns tree -- a-la cgroups). This is less efficient > but you want a hierarchy, so we can do it this way instead. > * If you set a specific setting, that takes precedence but only if it's > a greater or equal setting to your parent. > * Trying to set a lower setting than your parent fails regardless of > privileges. > > This will allow *privileged users* to lower the setting, but only if > the parent pidns also has a lower setting. This allows a system admin to > enforce the setting. It seems to me that this fulfils all the > requirements you have. > > Most importantly, this would allow for a hierarchical view without > having a sysctl that will break systems and nobody will use. I need to > re-iterate this point -- nobody is going to use this sysctl as it > currently works because it ratchets in a way that admins cannot undo. In > practice this would mean you would need to reboot your whole datacenter > if you didn't catch that an update to one of you dependencies didn't > pass a required *noop* flag to memfd_create(). > Yes. I agree this is another way to implement a hierarchical view, which is a little more costly, because it goes up the process tree. I respectfully disagree that nobody will use the current sysctl though, I can still see that a container might want this, e.g. a small container that doesn't require a lot of refactoring to add NX, and restarting container usually isn't a problem, and container might like the fact that downgrade is denied at run time. Thanks Best regards, -Jeff ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec 2023-08-02 23:11 ` Jeff Xu @ 2023-08-03 4:39 ` Aleksa Sarai 2023-08-03 14:40 ` Jeff Xu 0 siblings, 1 reply; 20+ messages in thread From: Aleksa Sarai @ 2023-08-03 4:39 UTC (permalink / raw) To: Jeff Xu Cc: Jeff Xu, Andrew Morton, Shuah Khan, Kees Cook, Daniel Verkamp, Luis Chamberlain, YueHaibing, Christian Brauner, linux-mm, linux-kernel, linux-kselftest, linux-hardening [-- Attachment #1: Type: text/plain, Size: 13108 bytes --] On 2023-08-02, Jeff Xu <jeffxu@google.com> wrote: > On Wed, Aug 2, 2023 at 2:39 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > > On 2023-08-02, Jeff Xu <jeffxu@chromium.org> wrote: > > > On Tue, Aug 1, 2023 at 6:05 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > > > > > This thread is getting longer with different topics, I will try to > > > respond with trimmed interleaved replies [1] > > > There are 3 topics (logging/'migration/ratcheting), this response will > > > be regarding ratcheting. > > > > The migration and ratcheting topics are interconnected because the > > migration issue makes ratcheting an even more severe issue. But I'll > > respond to each thread separately. > > > > > [1] https://www.kernel.org/doc/html/latest/process/submitting-patches.html?highlight=signed%20off#use-trimmed-interleaved-replies-in-email-discussions > > > > > > > > > > > > > > > * The ratcheting mechanism for vm.memfd_noexec doesn't make sense as a > > > > > > > > security mechanism because a CAP_SYS_ADMIN capable user can create > > > > > > > > executable binaries in a hidden tmpfs very easily, not to mention the > > > > > > > > many other things they can do. > > > > > > > > > > > > > > > By further limiting CAP_SYS_ADMIN, an attacker can't modify this > > > > > > > sysctl even after compromising some system service with high > > > > > > > privilege, YAMA has the same approach for ptrace_scope=3 > > > > > > > > > > > > Personally, I also think this behaviour from YAMA is a little goofy too, > > > > > > but given that it only locks the most extreme setting and there is no > > > > > > way to get around the most extreme setting, I guess it makes some sense > > > > > > (not to mention it's an LSM and so there is an argument that it should > > > > > > be possible to lock out privileged users from modifying it). > > > > > > There are many other security sysctls, and very few have this behaviour > > > > > > because it doesn't make much sense in most cases. > > > > > > > > > > > > > In addition, this sysctl is pid_name spaced, this means child > > > > > > > pid_namespace will alway have the same or stricter security setting > > > > > > > than its parent, this allows admin to maintain a tree like view. If we > > > > > > > allow the child pid namespace to elevate its setting, then the > > > > > > > system-wide setting is no longer meaningful. > > > > > > > > > > > > "no longer meaningful" is too strong of a statement imho. It is still > > > > > > useful for constraining non-root processes and presumably ChromeOS > > > > > > disallows random processes to do CLONE_NEWUSER (otherwise the protection > > > > > > of this sysctl is pointless) so in practice for ChromeOS there is no > > > > > > change in the attack surface. > > > > > > > > > > > > (FWIW, I think tying this to the user namespace would've made more sense > > > > > > since this is about privilege restrictions, but that ship has sailed.) > > > > > > > > > > > The reason that this sysctl is a PID namespace is that I hope a > > > > > container and host can have different sysctl values, e.g. host will > > > > > allow runc's use of X mfd, while a container doesn't want X mfd. . > > > > > To clarify what you meant, do you mean this: when a container is in > > > > > its own pid_namespace, and has "=2", the programs inside the container > > > > > can still use CLONE_NEWUSER to break out "=2" ? > > > > > > > > With the current implementation, this is not possible. My point was that > > > > even if it were possible to lower the sysctl, ChromeOS presumably > > > > already blocks the operations that a user would be able to use to create > > > > a memfd (an unprivileged user cannot CLONE_NEWPID to modify the sysctl > > > > without CLONE_NEWUSER, which is presumably blocked on ChromeOS due to > > > > the other security concerns). > > > > > > > > > > > > > > > The code sample shared in this patch set indicates that the attacker > > > > > > > already has the ability of creating tmpfs and executing complex steps, > > > > > > > at that point, it doesn't matter if the code execution is from memfd > > > > > > > or not. For a safe by default system such as ChromeOS, attackers won't > > > > > > > easily run arbitrary code, memfd is one of the open doors for that, so > > > > > > > we are disabling executable memfd in ChromeOS. In other words: if an > > > > > > > attacker can already execute the arbitrary code as sample given in > > > > > > > ChromeOS, without using executable memfd, then memfd is no longer the > > > > > > > thing we need to worry about, the arbitrary code execution is already > > > > > > > achieved by the attacker. Even though I use ChromeOS as an example, I > > > > > > > think the same type of threat model applies to any system that wants > > > > > > > to disable executable memfd entirely. > > > > > > > > > > > > I understand the threat model this sysctl is blocking, my point is that > > > > > > blocking CAP_SYS_ADMIN from modifying the setting doesn't make sense > > > > > > from that threat model. An attacker that manages to trick some process > > > > > > into creating a memfd with an executable payload is not going to be able > > > > > > to change the sysctl setting (unless there's a confused deputy with > > > > > > CAP_SYS_ADMIN, in which case you have much bigger issues). > > > > > > > > > > > It is the reverse. An attacker that manages to trick some > > > > > CAP_SYSADMIN processes into changing this sysctl value (i.e. lower the > > > > > setting to 0 if no ratcheting), will be able to continue to use mfd as > > > > > part of the attack chain. > > > > > In chromeOS, an attacker that can change sysctl might not necessarily > > > > > gain full arbitrary code execution already. As I mentioned previously, > > > > > the main threat model here is to prevent arbitrary code execution > > > > > through mfd. If an attacker already gains arbitrary code execution, > > > > > at that point, we no longer worry about mfd. > > > > > > > > If an attacker can trick a privileged process into writing to arbitrary > > > > sysctls, the system has much bigger issues than arbitrary (presumably > > > > unprivileged) code execution. On the other hand, requiring you to reboot > > > > a server due to a misconfigured sysctl *is* broken. > > > > > > > > Again, at the very least, not even allowing capable(CAP_SYS_ADMIN) to > > > > change the setting is actually broken. > > > > > > > > > > If a CAP_SYS_ADMIN-capable user wants to change the sysctl, blocking it > > > > > > doesn't add any security because that process could create a memfd-like > > > > > > fd to execute without issues. > > > > > >What practical attack does this ratcheting > > > > > > mechanism protect against? (This is a question you can answer with the > > > > > > YAMA sysctl, but not this one AFAICS.) > > > > > > > > > > > > But even if you feel that allowing this in child user namespaces is > > > > > > unsafe or undesirable, it's absolutely necessary that > > > > > > capable(CAP_SYS_ADMIN) should be able to un-brick the running system by > > > > > > changing the sysctl. The alternative is that you need to reboot your > > > > > > server in order to un-set a sysctl that broke some application you run. > > > > > > > > > > > > > > > > > Also, by the same token, this ratcheting mechanism doesn't make sense > > > > > > with =1 *at all* because it could break programs in a way that would > > > > > > require a reboot but it's not a "security setting" (and the YAMA sysctl > > > > > > mentioned only locks the sysctl at the highest setting). > > > > > > > > > > > I think a system should use "=0" when it is unsure about its program's > > > > > need or not need executable memfd. Technically, it is not that this > > > > > sysctl breaks the user, but the admin made the mistake to set the > > > > > wrong sysctl value, and an admin should know what they are doing for a > > > > > sysctl. Yes. rebooting increases the steps to undo the mistake, but > > > > > that could be an incentive for the admin to fully test its programs > > > > > before turning on this sysctl - and avoid unexpected runtime errors. > > > > > > > > I don't think this stance is really acceptable -- if an admin that has > > > > privileges to load kernel modules is not able to disable a sysctl that > > > > can break working programs without rebooting there is > > > > > > > > When this sysctl was first proposed a few years ago (when kernel folks > > > > found out that runc was using executable memfds), my understanding is > > > > that the long-term goal was to switch programs to have > > > > non-executable-memfds by default on most distributions. Making it > > > > impossible for an admin to lower the sysctl value flies in the face of > > > > this goal. > > > > > > > > At the very least, being unable to lower the sysctl from =1 to =0 is > > > > just broken (even if you use the yama example -- yama only locks the > > > > sysctl at highest possible setting, not on lower settings). But in my > > > > view, having this sysctl ratchet at all doesn't make sense. > > > > > > > To reiterate/summarize the current mechanism for vm.memfd_noexec > > > > > > 1> It is a pid namespace sysctl, init ns and child pid ns can have > > > different setting values. > > > 2> child pid ns inherits parent's pid ns's sysctl at the time of fork. > > > 3> There are 3 values for the sysctl, each higher value is more > > > restrictive than the lower one. Once set, doesn't allow downgrading. > > > > > > It can be used as following: > > > 1> > > > init ns: vm.memfd_noexec = 2 (at boot time) > > > Not allow executable memfd for the entire system, including its containers. > > > > > > 2> > > > init ns: vm.memfd_noexec = 0 or 1 > > > container (child init namespace) vm.memfd_noexec = 2. > > > The host allows runc's usage of executable memfd during container > > > creation. Inside the container, executable memfd is not allowed. > > > > > > The inherence + not allow downgrading is to reason with how > > > vm.memfd_noexec is applied in the process tree. > > > Without it, essentially we are losing the hierarchy view across the > > > process tree and a process can evaluate its capability by modifying > > > the setting. I think that is a less secure approach I would not > > > prefer. > > > > If you really want the hierarchical aspect, we can implement it so that > > it's _actually_ hierarchical like so: > > > > * By default, your setting is the same as your parent (this is checked > > by going up the pidns tree -- a-la cgroups). This is less efficient > > but you want a hierarchy, so we can do it this way instead. > > * If you set a specific setting, that takes precedence but only if it's > > a greater or equal setting to your parent. > > * Trying to set a lower setting than your parent fails regardless of > > privileges. > > > > This will allow *privileged users* to lower the setting, but only if > > the parent pidns also has a lower setting. This allows a system admin to > > enforce the setting. It seems to me that this fulfils all the > > requirements you have. > > > > Most importantly, this would allow for a hierarchical view without > > having a sysctl that will break systems and nobody will use. I need to > > re-iterate this point -- nobody is going to use this sysctl as it > > currently works because it ratchets in a way that admins cannot undo. In > > practice this would mean you would need to reboot your whole datacenter > > if you didn't catch that an update to one of you dependencies didn't > > pass a required *noop* flag to memfd_create(). > > > Yes. I agree this is another way to implement a hierarchical view, > which is a little more costly, because it goes up the process tree. The maximum height is 32 namespaces, it's not that bad. I'm working on a v2 that implements it this way instead. I also just figured out that there is a flaw with the current implementation's "hierarchy" -- if you create a pid namespace and then change vm.memfd_noexec, the limit doesn't apply to the child pidns. This makes sense given the implementation, but it means that the "tree view" you talked about doesn't actually apply to your implementation. > I respectfully disagree that nobody will use the current sysctl > though, I can still see that a container might want this, e.g. a > small container that doesn't require a lot of refactoring to add NX, > and restarting container usually isn't a problem, and container might > like the fact that downgrade is denied at run time. I should've specified that I was talking about using the setting on the host (which is presumably the goal -- at the very least I presume the goal is to get distributions to use =1 at some point). Also, funnily enough, container runtimes use executable memfds. :P -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec 2023-08-03 4:39 ` Aleksa Sarai @ 2023-08-03 14:40 ` Jeff Xu 0 siblings, 0 replies; 20+ messages in thread From: Jeff Xu @ 2023-08-03 14:40 UTC (permalink / raw) To: Aleksa Sarai Cc: Jeff Xu, Andrew Morton, Shuah Khan, Kees Cook, Daniel Verkamp, Luis Chamberlain, YueHaibing, Christian Brauner, linux-mm, linux-kernel, linux-kselftest, linux-hardening On Wed, Aug 2, 2023 at 9:40 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > On 2023-08-02, Jeff Xu <jeffxu@google.com> wrote: > > On Wed, Aug 2, 2023 at 2:39 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > > > > On 2023-08-02, Jeff Xu <jeffxu@chromium.org> wrote: > > > > On Tue, Aug 1, 2023 at 6:05 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > > > > > > > > This thread is getting longer with different topics, I will try to > > > > respond with trimmed interleaved replies [1] > > > > There are 3 topics (logging/'migration/ratcheting), this response will > > > > be regarding ratcheting. > > > > > > The migration and ratcheting topics are interconnected because the > > > migration issue makes ratcheting an even more severe issue. But I'll > > > respond to each thread separately. > > > > > > > [1] https://www.kernel.org/doc/html/latest/process/submitting-patches.html?highlight=signed%20off#use-trimmed-interleaved-replies-in-email-discussions > > > > > > > > > > > > > > > > > > * The ratcheting mechanism for vm.memfd_noexec doesn't make sense as a > > > > > > > > > security mechanism because a CAP_SYS_ADMIN capable user can create > > > > > > > > > executable binaries in a hidden tmpfs very easily, not to mention the > > > > > > > > > many other things they can do. > > > > > > > > > > > > > > > > > By further limiting CAP_SYS_ADMIN, an attacker can't modify this > > > > > > > > sysctl even after compromising some system service with high > > > > > > > > privilege, YAMA has the same approach for ptrace_scope=3 > > > > > > > > > > > > > > Personally, I also think this behaviour from YAMA is a little goofy too, > > > > > > > but given that it only locks the most extreme setting and there is no > > > > > > > way to get around the most extreme setting, I guess it makes some sense > > > > > > > (not to mention it's an LSM and so there is an argument that it should > > > > > > > be possible to lock out privileged users from modifying it). > > > > > > > There are many other security sysctls, and very few have this behaviour > > > > > > > because it doesn't make much sense in most cases. > > > > > > > > > > > > > > > In addition, this sysctl is pid_name spaced, this means child > > > > > > > > pid_namespace will alway have the same or stricter security setting > > > > > > > > than its parent, this allows admin to maintain a tree like view. If we > > > > > > > > allow the child pid namespace to elevate its setting, then the > > > > > > > > system-wide setting is no longer meaningful. > > > > > > > > > > > > > > "no longer meaningful" is too strong of a statement imho. It is still > > > > > > > useful for constraining non-root processes and presumably ChromeOS > > > > > > > disallows random processes to do CLONE_NEWUSER (otherwise the protection > > > > > > > of this sysctl is pointless) so in practice for ChromeOS there is no > > > > > > > change in the attack surface. > > > > > > > > > > > > > > (FWIW, I think tying this to the user namespace would've made more sense > > > > > > > since this is about privilege restrictions, but that ship has sailed.) > > > > > > > > > > > > > The reason that this sysctl is a PID namespace is that I hope a > > > > > > container and host can have different sysctl values, e.g. host will > > > > > > allow runc's use of X mfd, while a container doesn't want X mfd. . > > > > > > To clarify what you meant, do you mean this: when a container is in > > > > > > its own pid_namespace, and has "=2", the programs inside the container > > > > > > can still use CLONE_NEWUSER to break out "=2" ? > > > > > > > > > > With the current implementation, this is not possible. My point was that > > > > > even if it were possible to lower the sysctl, ChromeOS presumably > > > > > already blocks the operations that a user would be able to use to create > > > > > a memfd (an unprivileged user cannot CLONE_NEWPID to modify the sysctl > > > > > without CLONE_NEWUSER, which is presumably blocked on ChromeOS due to > > > > > the other security concerns). > > > > > > > > > > > > > > > > > > The code sample shared in this patch set indicates that the attacker > > > > > > > > already has the ability of creating tmpfs and executing complex steps, > > > > > > > > at that point, it doesn't matter if the code execution is from memfd > > > > > > > > or not. For a safe by default system such as ChromeOS, attackers won't > > > > > > > > easily run arbitrary code, memfd is one of the open doors for that, so > > > > > > > > we are disabling executable memfd in ChromeOS. In other words: if an > > > > > > > > attacker can already execute the arbitrary code as sample given in > > > > > > > > ChromeOS, without using executable memfd, then memfd is no longer the > > > > > > > > thing we need to worry about, the arbitrary code execution is already > > > > > > > > achieved by the attacker. Even though I use ChromeOS as an example, I > > > > > > > > think the same type of threat model applies to any system that wants > > > > > > > > to disable executable memfd entirely. > > > > > > > > > > > > > > I understand the threat model this sysctl is blocking, my point is that > > > > > > > blocking CAP_SYS_ADMIN from modifying the setting doesn't make sense > > > > > > > from that threat model. An attacker that manages to trick some process > > > > > > > into creating a memfd with an executable payload is not going to be able > > > > > > > to change the sysctl setting (unless there's a confused deputy with > > > > > > > CAP_SYS_ADMIN, in which case you have much bigger issues). > > > > > > > > > > > > > It is the reverse. An attacker that manages to trick some > > > > > > CAP_SYSADMIN processes into changing this sysctl value (i.e. lower the > > > > > > setting to 0 if no ratcheting), will be able to continue to use mfd as > > > > > > part of the attack chain. > > > > > > In chromeOS, an attacker that can change sysctl might not necessarily > > > > > > gain full arbitrary code execution already. As I mentioned previously, > > > > > > the main threat model here is to prevent arbitrary code execution > > > > > > through mfd. If an attacker already gains arbitrary code execution, > > > > > > at that point, we no longer worry about mfd. > > > > > > > > > > If an attacker can trick a privileged process into writing to arbitrary > > > > > sysctls, the system has much bigger issues than arbitrary (presumably > > > > > unprivileged) code execution. On the other hand, requiring you to reboot > > > > > a server due to a misconfigured sysctl *is* broken. > > > > > > > > > > Again, at the very least, not even allowing capable(CAP_SYS_ADMIN) to > > > > > change the setting is actually broken. > > > > > > > > > > > > If a CAP_SYS_ADMIN-capable user wants to change the sysctl, blocking it > > > > > > > doesn't add any security because that process could create a memfd-like > > > > > > > fd to execute without issues. > > > > > > >What practical attack does this ratcheting > > > > > > > mechanism protect against? (This is a question you can answer with the > > > > > > > YAMA sysctl, but not this one AFAICS.) > > > > > > > > > > > > > > But even if you feel that allowing this in child user namespaces is > > > > > > > unsafe or undesirable, it's absolutely necessary that > > > > > > > capable(CAP_SYS_ADMIN) should be able to un-brick the running system by > > > > > > > changing the sysctl. The alternative is that you need to reboot your > > > > > > > server in order to un-set a sysctl that broke some application you run. > > > > > > > > > > > > > > > > > > > > Also, by the same token, this ratcheting mechanism doesn't make sense > > > > > > > with =1 *at all* because it could break programs in a way that would > > > > > > > require a reboot but it's not a "security setting" (and the YAMA sysctl > > > > > > > mentioned only locks the sysctl at the highest setting). > > > > > > > > > > > > > I think a system should use "=0" when it is unsure about its program's > > > > > > need or not need executable memfd. Technically, it is not that this > > > > > > sysctl breaks the user, but the admin made the mistake to set the > > > > > > wrong sysctl value, and an admin should know what they are doing for a > > > > > > sysctl. Yes. rebooting increases the steps to undo the mistake, but > > > > > > that could be an incentive for the admin to fully test its programs > > > > > > before turning on this sysctl - and avoid unexpected runtime errors. > > > > > > > > > > I don't think this stance is really acceptable -- if an admin that has > > > > > privileges to load kernel modules is not able to disable a sysctl that > > > > > can break working programs without rebooting there is > > > > > > > > > > When this sysctl was first proposed a few years ago (when kernel folks > > > > > found out that runc was using executable memfds), my understanding is > > > > > that the long-term goal was to switch programs to have > > > > > non-executable-memfds by default on most distributions. Making it > > > > > impossible for an admin to lower the sysctl value flies in the face of > > > > > this goal. > > > > > > > > > > At the very least, being unable to lower the sysctl from =1 to =0 is > > > > > just broken (even if you use the yama example -- yama only locks the > > > > > sysctl at highest possible setting, not on lower settings). But in my > > > > > view, having this sysctl ratchet at all doesn't make sense. > > > > > > > > > To reiterate/summarize the current mechanism for vm.memfd_noexec > > > > > > > > 1> It is a pid namespace sysctl, init ns and child pid ns can have > > > > different setting values. > > > > 2> child pid ns inherits parent's pid ns's sysctl at the time of fork. > > > > 3> There are 3 values for the sysctl, each higher value is more > > > > restrictive than the lower one. Once set, doesn't allow downgrading. > > > > > > > > It can be used as following: > > > > 1> > > > > init ns: vm.memfd_noexec = 2 (at boot time) > > > > Not allow executable memfd for the entire system, including its containers. > > > > > > > > 2> > > > > init ns: vm.memfd_noexec = 0 or 1 > > > > container (child init namespace) vm.memfd_noexec = 2. > > > > The host allows runc's usage of executable memfd during container > > > > creation. Inside the container, executable memfd is not allowed. > > > > > > > > The inherence + not allow downgrading is to reason with how > > > > vm.memfd_noexec is applied in the process tree. > > > > Without it, essentially we are losing the hierarchy view across the > > > > process tree and a process can evaluate its capability by modifying > > > > the setting. I think that is a less secure approach I would not > > > > prefer. > > > > > > If you really want the hierarchical aspect, we can implement it so that > > > it's _actually_ hierarchical like so: > > > > > > * By default, your setting is the same as your parent (this is checked > > > by going up the pidns tree -- a-la cgroups). This is less efficient > > > but you want a hierarchy, so we can do it this way instead. > > > * If you set a specific setting, that takes precedence but only if it's > > > a greater or equal setting to your parent. > > > * Trying to set a lower setting than your parent fails regardless of > > > privileges. > > > > > > This will allow *privileged users* to lower the setting, but only if > > > the parent pidns also has a lower setting. This allows a system admin to > > > enforce the setting. It seems to me that this fulfils all the > > > requirements you have. > > > > > > Most importantly, this would allow for a hierarchical view without > > > having a sysctl that will break systems and nobody will use. I need to > > > re-iterate this point -- nobody is going to use this sysctl as it > > > currently works because it ratchets in a way that admins cannot undo. In > > > practice this would mean you would need to reboot your whole datacenter > > > if you didn't catch that an update to one of you dependencies didn't > > > pass a required *noop* flag to memfd_create(). > > > > > Yes. I agree this is another way to implement a hierarchical view, > > which is a little more costly, because it goes up the process tree. > > The maximum height is 32 namespaces, it's not that bad. I'm working on a > v2 that implements it this way instead. > > I also just figured out that there is a flaw with the current > implementation's "hierarchy" -- if you create a pid namespace and then > change vm.memfd_noexec, the limit doesn't apply to the child pidns. This > makes sense given the implementation, but it means that the "tree view" > you talked about doesn't actually apply to your implementation. > That is by design. The return on investment of implementing dynamic propagation isn't high, i.e. vm.memfd_noexec is something that needs to be fully tested before set and changing it at runtime is not recommended in production, downgrade requires reboot/restart. On the host side, the kernel cmd line now supports setting sysctl, and it can be applied from boot time. These combined factors make me choose the current implementation. Simplicity is something I value greatly. feature richness can be achieved, but sometimes with the cost of complexity. > > I respectfully disagree that nobody will use the current sysctl > > though, I can still see that a container might want this, e.g. a > > small container that doesn't require a lot of refactoring to add NX, > > and restarting container usually isn't a problem, and container might > > like the fact that downgrade is denied at run time. > > I should've specified that I was talking about using the setting on the > host (which is presumably the goal -- at the very least I presume the > goal is to get distributions to use =1 at some point). > > Also, funnily enough, container runtimes use executable memfds. :P > Yes. runc uses executable memfd. Current implementation considered runc's case at host. As I said, containers that don't require executable memfd can migrate their code and set "vm.memfd_noexec = 2", which I hope is easier to do than on host. > -- > Aleksa Sarai > Senior Software Engineer (Containers) > SUSE Linux GmbH > <https://www.cyphar.com/> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec 2023-08-02 1:04 ` Aleksa Sarai 2023-08-02 19:47 ` Jeff Xu @ 2023-08-02 20:45 ` Jeff Xu 2023-08-02 21:48 ` Aleksa Sarai 1 sibling, 1 reply; 20+ messages in thread From: Jeff Xu @ 2023-08-02 20:45 UTC (permalink / raw) To: Aleksa Sarai Cc: Jeff Xu, Andrew Morton, Shuah Khan, Kees Cook, Daniel Verkamp, Luis Chamberlain, YueHaibing, linux-mm, linux-kernel, linux-kselftest, linux-hardening > > > > > * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls > > > > > because it will make it far to difficult to ever migrate. Instead it > > > > > should imply MFD_EXEC. > > > > > > > > > Though the purpose of memfd_noexec=2 is not to help with migration - > > > > but to disable creation of executable memfd for the current system/pid > > > > namespace. > > > > During the migration, vm.memfd_noexe = 1 helps overwriting for > > > > unmigrated user code as a temporary measure. > > > > > > My point is that the current behaviour for =2 means that nobody other > > > than *maybe* ChromeOS will ever be able to use it because it requires > > > auditing every program on the system. In fact, it's possible even > > > ChromeOS will run into issues given that one of the arguments made for > > > the nosymfollow mount option was that auditing all of ChromeOS to > > > replace every open with RESOLVE_NO_SYMLINKS would be too much effort[1] > > > (which I agreed with). Maybe this is less of an issue with > > > memfd_create(2) (which is much newer than open(2)) but it still seems > > > like a lot of busy work when the =1 behaviour is entirely sane even in > > > the strict threat model that =2 is trying to protect against. > > > > > It can also be a container (that have all memfd_create migrated to new API) > > If ChromeOS would struggle to rewrite all of the libraries they use, > containers are in even worse shape -- most container users don't have a > complete list of every package installed in a container, let alone the > ability to audit whether they pass a (no-op) flag to memfd_create(2) in > every codepath. > > > One option I considered previously was "=2" would do overwrite+block , > > and "=3" just block. But then I worry that applications won't have > > motivation to ever change their existing code, the setting will > > forever stay at "=2", making "=3" even more impossible to ever be used > > system side. > > What is the downside of overwriting? Backwards-compatibility is a very > important part of Linux -- being able to use old programs without having > to modify them is incredibly important. Yes, this behaviour is opt-in -- > but I don't see the point of making opting in more difficult than > necessary. Surely overwite+block provides the security guarantee you > need from the threat model -- othewise nobody will be able to use block > because you never know if one library will call memfd_create() > "incorrectly" without the new flags. > > > > > If you want to block syscalls that don't explicitly pass NOEXEC_SEAL, > > > there are several tools for doing this (both seccomp and LSM hooks). > > > > > > [1]: https://lore.kernel.org/linux-fsdevel/20200131212021.GA108613@google.com/ > > > > > > > Additional functionality/features should be implemented through > > > > security hook and LSM, not sysctl, I think. > > > > > > This issue with =2 cannot be fixed in an LSM. (On the other hand, you > > > could implement either =2 behaviour with an LSM using =1, and the > > > current strict =2 behaviour could be implemented purely with seccomp.) > > > > > By migration, I mean a system that is not fully migrated, such a > > system should just use "=0" or "=1". Additional features can be > > implemented in SELinux/Landlock/other LSM by a motivated dev. e.g. if > > a system wants to limit executable memfd to specific programs or fully > > disable it. > > "=2" is for a system/container that is fully migrated, in that case, > > SELinux/Landlock/LSM can do the same, but sysctl provides a convenient > > alternative. > > Yes, seccomp provides a similar mechanism. Indeed, combining "=1" and > > seccomp (block MFD_EXEC), it will overwrite + block X mfd, which is > > essentially what you want, iiuc.However, I do not wish to have this > > implemented in kernel, due to the thinking that I want kernel to get > > out of business of "overwriting" eventually. > > See my above comments -- "overwriting" is perfectly acceptable to me. > There's also no way to "get out of the business of overwriting" -- Linux > has strict backwards compatibility requirements. > I agree, if we weigh on the short term goal of letting the user space applications to do minimum, then having 4 state sysctl (or 2 sysctl, one controls overwrite, one disable/enable executable memfd) will do. But with that approach, I'm afraid a version of the future (say in 20 years), most applications stays with memfd_create with the old API style, not setting the NX bit. With the current approach, it might seem to be less convenient, but I hope it offers a bit of incentive to make applications migrating their code towards the new API, explicitly setting the NX bit. I understand this hope is questionable, we might still end up the same in 20 years, but at least I tried :-). I will leave this decision to maintainers when you supply patches for that, and I wouldn't feel bad either way, there is a valid reason on both sides. To supplement, there are two other ways for what you want: 1> seccomp to block MFD_EXEC, and leaving the setting to 1. 2> implement the blocking using a security hook and LSM, imo, which is probably the most common way to deal with this type of request (block something). I admit those two ways will be less convenient than just having sysctl do all the things, from the user space's perspective. Thanks -Jeff ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec 2023-08-02 20:45 ` Jeff Xu @ 2023-08-02 21:48 ` Aleksa Sarai 0 siblings, 0 replies; 20+ messages in thread From: Aleksa Sarai @ 2023-08-02 21:48 UTC (permalink / raw) To: Jeff Xu Cc: Jeff Xu, Andrew Morton, Shuah Khan, Kees Cook, Daniel Verkamp, Luis Chamberlain, YueHaibing, linux-mm, linux-kernel, linux-kselftest, linux-hardening [-- Attachment #1: Type: text/plain, Size: 7074 bytes --] On 2023-08-02, Jeff Xu <jeffxu@chromium.org> wrote: > > > > > > * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls > > > > > > because it will make it far to difficult to ever migrate. Instead it > > > > > > should imply MFD_EXEC. > > > > > > > > > > > Though the purpose of memfd_noexec=2 is not to help with migration - > > > > > but to disable creation of executable memfd for the current system/pid > > > > > namespace. > > > > > During the migration, vm.memfd_noexe = 1 helps overwriting for > > > > > unmigrated user code as a temporary measure. > > > > > > > > My point is that the current behaviour for =2 means that nobody other > > > > than *maybe* ChromeOS will ever be able to use it because it requires > > > > auditing every program on the system. In fact, it's possible even > > > > ChromeOS will run into issues given that one of the arguments made for > > > > the nosymfollow mount option was that auditing all of ChromeOS to > > > > replace every open with RESOLVE_NO_SYMLINKS would be too much effort[1] > > > > (which I agreed with). Maybe this is less of an issue with > > > > memfd_create(2) (which is much newer than open(2)) but it still seems > > > > like a lot of busy work when the =1 behaviour is entirely sane even in > > > > the strict threat model that =2 is trying to protect against. > > > > > > > It can also be a container (that have all memfd_create migrated to new API) > > > > If ChromeOS would struggle to rewrite all of the libraries they use, > > containers are in even worse shape -- most container users don't have a > > complete list of every package installed in a container, let alone the > > ability to audit whether they pass a (no-op) flag to memfd_create(2) in > > every codepath. > > > > > One option I considered previously was "=2" would do overwrite+block , > > > and "=3" just block. But then I worry that applications won't have > > > motivation to ever change their existing code, the setting will > > > forever stay at "=2", making "=3" even more impossible to ever be used > > > system side. > > > > What is the downside of overwriting? Backwards-compatibility is a very > > important part of Linux -- being able to use old programs without having > > to modify them is incredibly important. Yes, this behaviour is opt-in -- > > but I don't see the point of making opting in more difficult than > > necessary. Surely overwite+block provides the security guarantee you > > need from the threat model -- othewise nobody will be able to use block > > because you never know if one library will call memfd_create() > > "incorrectly" without the new flags. > > > > > > > > If you want to block syscalls that don't explicitly pass NOEXEC_SEAL, > > > > there are several tools for doing this (both seccomp and LSM hooks). > > > > > > > > [1]: https://lore.kernel.org/linux-fsdevel/20200131212021.GA108613@google.com/ > > > > > > > > > Additional functionality/features should be implemented through > > > > > security hook and LSM, not sysctl, I think. > > > > > > > > This issue with =2 cannot be fixed in an LSM. (On the other hand, you > > > > could implement either =2 behaviour with an LSM using =1, and the > > > > current strict =2 behaviour could be implemented purely with seccomp.) > > > > > > > By migration, I mean a system that is not fully migrated, such a > > > system should just use "=0" or "=1". Additional features can be > > > implemented in SELinux/Landlock/other LSM by a motivated dev. e.g. if > > > a system wants to limit executable memfd to specific programs or fully > > > disable it. > > > "=2" is for a system/container that is fully migrated, in that case, > > > SELinux/Landlock/LSM can do the same, but sysctl provides a convenient > > > alternative. > > > Yes, seccomp provides a similar mechanism. Indeed, combining "=1" and > > > seccomp (block MFD_EXEC), it will overwrite + block X mfd, which is > > > essentially what you want, iiuc.However, I do not wish to have this > > > implemented in kernel, due to the thinking that I want kernel to get > > > out of business of "overwriting" eventually. > > > > See my above comments -- "overwriting" is perfectly acceptable to me. > > There's also no way to "get out of the business of overwriting" -- Linux > > has strict backwards compatibility requirements. > > > > I agree, if we weigh on the short term goal of letting the user space > applications to do minimum, then having 4 state sysctl (or 2 sysctl, > one controls overwrite, one disable/enable executable memfd) will do. > But with that approach, I'm afraid a version of the future (say in 20 > years), most applications stays with memfd_create with the old API > style, not setting the NX bit. With the current approach, it might seem > to be less convenient, but I hope it offers a bit of incentive to make > applications migrating their code towards the new API, explicitly > setting the NX bit. I understand this hope is questionable, we might > still end up the same in 20 years, but at least I tried :-). I will > leave this decision to maintainers when you supply patches for that, > and I wouldn't feel bad either way, there is a valid reason on both sides. People will not switch =2 on if it has the possibility of breaking existing programs that are doing nothing wrong by not passing a noop flag. In 20 years at best you would have =1 in widespread use because the rewriting behaviour is what users expect of kernel uAPIs. They expect old programs to work without modifying them if they aren't doing anything wrong. A uAPI knob that requires every userspace program to change before you can safely enable it (especially because it ratchets in a way that makes it dangerous to enable on production machines) will simply not be used. If the goal is to get programs to update (which it seems it is), having a knob that nobody will turn on doesn't help. Doing proper warning logging is the way to get userspace to switch -- userspace usually notices when their programs trigger warnings in dmesg. > To supplement, there are two other ways for what you want: > 1> seccomp to block MFD_EXEC, and leaving the setting to 1. I made this point in an earlier mail. However my point is that =2 is not an acceptable uAPI and if you want something that looks like =2 you can also implement that with seccomp too! In fact, the key difference is that you cannot implement the rewriting easily in seccomp -- you would need to install a seccomp_notify monitor that does nothing but rewrite syscall arguments. This would be equivalent to running the entire system under GDB to work around a uAPI flaw. > 2> implement the blocking using a security hook and LSM, imo, which is > probably the most common way to deal with this type of request (block > something). The issue is not the blocking, it's the rewriting. -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2023-08-03 14:41 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-07-13 14:33 [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec Aleksa Sarai 2023-07-13 14:33 ` [RFC PATCH 1/3] memfd: cleanups for vm.memfd_noexec handling Aleksa Sarai 2023-07-13 14:33 ` [RFC PATCH 2/3] memfd: remove racheting feature from vm.memfd_noexec Aleksa Sarai 2023-07-14 0:07 ` Aleksa Sarai 2023-07-13 14:33 ` [RFC PATCH 3/3] selftests: memfd: error out test process when child test fails Aleksa Sarai 2023-07-19 23:38 ` Jeff Xu 2023-08-03 2:25 ` Aleksa Sarai 2023-08-03 2:55 ` Jeff Xu 2023-07-18 0:47 ` [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec Jeff Xu 2023-07-19 3:10 ` Aleksa Sarai 2023-07-19 7:17 ` Jeff Xu 2023-08-02 1:04 ` Aleksa Sarai 2023-08-02 19:47 ` Jeff Xu 2023-08-02 19:54 ` Jeff Xu 2023-08-02 21:38 ` Aleksa Sarai 2023-08-02 23:11 ` Jeff Xu 2023-08-03 4:39 ` Aleksa Sarai 2023-08-03 14:40 ` Jeff Xu 2023-08-02 20:45 ` Jeff Xu 2023-08-02 21:48 ` Aleksa Sarai
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox