Re: [RFC PATCH 2/3] memfd: remove racheting feature from vm.memfd_noexec

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Aleksa Sarai <cyphar@cyphar.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	Jeff Xu <jeffxu@google.com>,  YueHaibing <yuehaibing@huawei.com>,
	Luis Chamberlain <mcgrof@kernel.org>,
	 Kees Cook <keescook@chromium.org>,
	Daniel Verkamp <dverkamp@chromium.org>
Cc: linux-mm@kvack.org, Dominique Martinet <asmadeus@codewreck.org>,
	 Christian Brauner <brauner@kernel.org>,
	stable@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 2/3] memfd: remove racheting feature from vm.memfd_noexec
Date: Fri, 14 Jul 2023 10:07:21 +1000	[thread overview]
Message-ID: <pyylsri7uzypafzv7ar2w4j2lr6puh6bfowedalr226rswdzoo@dg54fdtx5nwd> (raw)
In-Reply-To: <20230713143406.14342-3-cyphar@cyphar.com>

[-- Attachment #1: Type: text/plain, Size: 4901 bytes --]

On 2023-07-14, Aleksa Sarai <cyphar@cyphar.com> wrote:
> This sysctl has the very unusal behaviour of not allowing any user (even
> CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you
> were to set this sysctl to a more restrictive option in the host pidns
> you would need to reboot your machine in order to reset it.
> 
> The justification given in [1] is that this is a security feature and
> thus it should not be possible to disable. Aside from the fact that we
> have plenty of security-related sysctls that can be disabled after being
> enabled (fs.protected_symlinks for instance), the protection provided by
> the sysctl is to stop users from being able to create a binary and then
> execute it. A user with CAP_SYS_ADMIN can trivially do this without
> memfd_create(2):
> 
>   % cat mount-memfd.c
>   #include <fcntl.h>
>   #include <string.h>
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <unistd.h>
>   #include <linux/mount.h>
> 
>   #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"
> 
>   int main(void)
>   {
>   	int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
>   	assert(fsfd >= 0);
>   	assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2));
> 
>   	int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
>   	assert(dfd >= 0);
> 
>   	int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0782);
                                                                      0777

Oops. I must've garbled something when copying from my test program.

>   	assert(execfd >= 0);
>   	assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
>   	assert(!close(execfd));
> 
>   	char *execpath = NULL;
>   	char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
>   	execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
>   	assert(execfd >= 0);
>   	assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
>   	assert(!execve(execpath, argv, envp));
>   }
>   % ./mount-memfd
>   this file was executed from this totally private tmpfs: /proc/self/fd/5
>   %
> 
> Given that it is possible for CAP_SYS_ADMIN users to create executable
> binaries without memfd_create(2) and without touching the host
> filesystem (not to mention the many other things a CAP_SYS_ADMIN process
> would be able to do that would be equivalent or worse), it seems strange
> to cause a fair amount of headache to admins when there doesn't appear
> to be an actual security benefit to blocking this.
> 
> It should be noted that with this change, programs that can do an
> unprivileged unshare(CLONE_NEWUSER) would be able to create an
> executable memfd even if their current pidns didn't allow it. However,
> the same sample program above can also be used in this scenario, meaning
> that even with this consideration, blocking CAP_SYS_ADMIN makes little
> sense:
> 
>   % unshare -rm ./mount-memfd
>   this file was executed from this totally private tmpfs: /proc/self/fd/5
> 
> This simply further reinforces that locked-down environments need to
> disallow CLONE_NEWUSER for unprivileged users (as is already the case in
> most container environments).
> 
> [1]: https://lore.kernel.org/all/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/
> 
> Cc: Dominique Martinet <asmadeus@codewreck.org>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: stable@vger.kernel.org # v6.3+
> Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  kernel/pid_sysctl.h | 7 -------
>  1 file changed, 7 deletions(-)
> 
> diff --git a/kernel/pid_sysctl.h b/kernel/pid_sysctl.h
> index b26e027fc9cd..8a22bc29ebb4 100644
> --- a/kernel/pid_sysctl.h
> +++ b/kernel/pid_sysctl.h
> @@ -24,13 +24,6 @@ static int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table,
>  	if (ns != &init_pid_ns)
>  		table_copy.data = &ns->memfd_noexec_scope;
>  
> -	/*
> -	 * set minimum to current value, the effect is only bigger
> -	 * value is accepted.
> -	 */
> -	if (*(int *)table_copy.data > *(int *)table_copy.extra1)
> -		table_copy.extra1 = table_copy.data;
> -
>  	return proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos);
>  }

I also have a patch to properly tie the sysctl to the pid namespace
rather that having a global sysctl that magically has its value changed
in this pid_mfd_noexec_dointvec_minmax() and another to do the same for
the other pidns-tied sysctl (kernel.ns_last_pid) but I'm not sure
whether it's needed. It does make vm.memfd_noexec a bit cleaner but
because the two sysctls are in different tables you can't register them
together AFAICS which means a bunch of needless duplication.

>  
> -- 
> 2.41.0
> 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

next prev parent reply	other threads:[~2023-07-14  0:07 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-13 14:33 [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec Aleksa Sarai
2023-07-13 14:33 ` [RFC PATCH 1/3] memfd: cleanups for vm.memfd_noexec handling Aleksa Sarai
2023-07-13 14:33 ` [RFC PATCH 2/3] memfd: remove racheting feature from vm.memfd_noexec Aleksa Sarai
2023-07-14  0:07   ` Aleksa Sarai [this message]
2023-07-13 14:33 ` [RFC PATCH 3/3] selftests: memfd: error out test process when child test fails Aleksa Sarai
2023-07-19 23:38   ` Jeff Xu
2023-08-03  2:25     ` Aleksa Sarai
2023-08-03  2:55       ` Jeff Xu
2023-07-18  0:47 ` [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec Jeff Xu
2023-07-19  3:10   ` Aleksa Sarai
2023-07-19  7:17     ` Jeff Xu
2023-08-02  1:04       ` Aleksa Sarai
2023-08-02 19:47         ` Jeff Xu
2023-08-02 19:54           ` Jeff Xu
2023-08-02 21:38           ` Aleksa Sarai
2023-08-02 23:11             ` Jeff Xu
2023-08-03  4:39               ` Aleksa Sarai
2023-08-03 14:40                 ` Jeff Xu
2023-08-02 20:45         ` Jeff Xu
2023-08-02 21:48           ` Aleksa Sarai

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=pyylsri7uzypafzv7ar2w4j2lr6puh6bfowedalr226rswdzoo@dg54fdtx5nwd \
    --to=cyphar@cyphar.com \
    --cc=akpm@linux-foundation.org \
    --cc=asmadeus@codewreck.org \
    --cc=brauner@kernel.org \
    --cc=dverkamp@chromium.org \
    --cc=jeffxu@google.com \
    --cc=keescook@chromium.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mcgrof@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=yuehaibing@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox