linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Jeff Xu <jeffxu@chromium.org>
To: Aleksa Sarai <cyphar@cyphar.com>
Cc: Jeff Xu <jeffxu@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Shuah Khan <shuah@kernel.org>, Kees Cook <keescook@chromium.org>,
	 Daniel Verkamp <dverkamp@chromium.org>,
	Luis Chamberlain <mcgrof@kernel.org>,
	 YueHaibing <yuehaibing@huawei.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	 linux-kselftest@vger.kernel.org,
	linux-hardening@vger.kernel.org
Subject: Re: [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec
Date: Wed, 2 Aug 2023 12:47:34 -0700	[thread overview]
Message-ID: <CABi2SkXWfup2_UeKqm7C-xkjF5gnhKuxOP7TsRVa5MLbxabFQg@mail.gmail.com> (raw)
In-Reply-To: <20230801.032503-medium.noises.extinct.omen-CStYZUqcNLCS@cyphar.com>

On Tue, Aug 1, 2023 at 6:05 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
This thread is getting longer with different topics, I will try to
respond with trimmed interleaved replies [1]
There are 3 topics (logging/'migration/ratcheting), this response will
be regarding ratcheting.

[1] https://www.kernel.org/doc/html/latest/process/submitting-patches.html?highlight=signed%20off#use-trimmed-interleaved-replies-in-email-discussions

>
> > > > >  * The ratcheting mechanism for vm.memfd_noexec doesn't make sense as a
> > > > >    security mechanism because a CAP_SYS_ADMIN capable user can create
> > > > >    executable binaries in a hidden tmpfs very easily, not to mention the
> > > > >    many other things they can do.
> > > > >
> > > > By further limiting CAP_SYS_ADMIN, an attacker can't modify this
> > > > sysctl even after compromising some system service with high
> > > > privilege, YAMA has the same approach for ptrace_scope=3
> > >
> > > Personally, I also think this behaviour from YAMA is a little goofy too,
> > > but given that it only locks the most extreme setting and there is no
> > > way to get around the most extreme setting, I guess it makes some sense
> > > (not to mention it's an LSM and so there is an argument that it should
> > > be possible to lock out privileged users from modifying it).
> > > There are many other security sysctls, and very few have this behaviour
> > > because it doesn't make much sense in most cases.
> > >
> > > > In addition, this sysctl is pid_name spaced, this means child
> > > > pid_namespace will alway have the same or stricter security setting
> > > > than its parent, this allows admin to maintain a tree like view. If we
> > > > allow the child pid namespace to elevate its setting, then the
> > > > system-wide setting is no longer meaningful.
> > >
> > > "no longer meaningful" is too strong of a statement imho. It is still
> > > useful for constraining non-root processes and presumably ChromeOS
> > > disallows random processes to do CLONE_NEWUSER (otherwise the protection
> > > of this sysctl is pointless) so in practice for ChromeOS there is no
> > > change in the attack surface.
> > >
> > > (FWIW, I think tying this to the user namespace would've made more sense
> > > since this is about privilege restrictions, but that ship has sailed.)
> > >
> > The reason that this sysctl is a PID namespace is that I hope a
> > container and host can have different sysctl values, e.g. host will
> > allow runc's use of X mfd, while a container  doesn't want X mfd. .
> > To clarify what you meant, do you mean this: when a container is in
> > its own pid_namespace, and has "=2", the programs inside the container
> > can still use CLONE_NEWUSER to break out "=2" ?
>
> With the current implementation, this is not possible. My point was that
> even if it were possible to lower the sysctl, ChromeOS presumably
> already blocks the operations that a user would be able to use to create
> a memfd (an unprivileged user cannot CLONE_NEWPID to modify the sysctl
> without CLONE_NEWUSER, which is presumably blocked on ChromeOS due to
> the other security concerns).
>
>
> > > > The code sample shared in this patch set indicates that the attacker
> > > > already has the ability of creating tmpfs and executing complex steps,
> > > > at that point, it doesn't matter if the code execution is from memfd
> > > > or not. For a safe by default system such as ChromeOS, attackers won't
> > > > easily run arbitrary code, memfd is one of the open doors for that, so
> > > > we are disabling executable memfd in ChromeOS. In other words:  if an
> > > > attacker can already execute the arbitrary code as sample given in
> > > > ChromeOS, without using executable memfd,  then memfd is no longer the
> > > > thing we need to worry about, the arbitrary code execution is already
> > > > achieved by the attacker. Even though I use ChromeOS as an example, I
> > > > think the same type of threat model applies to any system that wants
> > > > to disable executable memfd entirely.
> > >
> > > I understand the threat model this sysctl is blocking, my point is that
> > > blocking CAP_SYS_ADMIN from modifying the setting doesn't make sense
> > > from that threat model. An attacker that manages to trick some process
> > > into creating a memfd with an executable payload is not going to be able
> > > to change the sysctl setting (unless there's a confused deputy with
> > > CAP_SYS_ADMIN, in which case you have much bigger issues).
> > >
> > It is the reverse.  An attacker that manages to trick some
> > CAP_SYSADMIN processes into changing this sysctl value (i.e. lower the
> > setting to 0 if no ratcheting), will be able to continue to use mfd as
> > part of the attack chain.
> >  In chromeOS, an attacker that can change sysctl might not necessarily
> > gain full arbitrary code execution already. As I mentioned previously,
> > the main threat model here is to prevent  arbitrary code execution
> > through mfd.  If an attacker already gains arbitrary code execution,
> > at that point, we no longer worry about mfd.
>
> If an attacker can trick a privileged process into writing to arbitrary
> sysctls, the system has much bigger issues than arbitrary (presumably
> unprivileged) code execution. On the other hand, requiring you to reboot
> a server due to a misconfigured sysctl *is* broken.
>
> Again, at the very least, not even allowing capable(CAP_SYS_ADMIN) to
> change the setting is actually broken.
>
> > > If a CAP_SYS_ADMIN-capable user wants to change the sysctl, blocking it
> > > doesn't add any security because that process could create a memfd-like
> > > fd to execute without issues.
> > >What practical attack does this ratcheting
> > > mechanism protect against? (This is a question you can answer with the
> > > YAMA sysctl, but not this one AFAICS.)
> > >
> > > But even if you feel that allowing this in child user namespaces is
> > > unsafe or undesirable, it's absolutely necessary that
> > > capable(CAP_SYS_ADMIN) should be able to un-brick the running system by
> > > changing the sysctl. The alternative is that you need to reboot your
> > > server in order to un-set a sysctl that broke some application you run.
> > >
> >
> > > Also, by the same token, this ratcheting mechanism doesn't make sense
> > > with =1 *at all* because it could break programs in a way that would
> > > require a reboot but it's not a "security setting" (and the YAMA sysctl
> > > mentioned only locks the sysctl at the highest setting).
> > >
> > I think a system should use "=0" when it is unsure about its program's
> > need or not need executable memfd. Technically, it is not that this
> > sysctl breaks the user, but the admin  made the mistake to set the
> > wrong sysctl value, and an admin should know what they are doing for a
> > sysctl. Yes. rebooting increases the steps to undo the mistake, but
> > that could be an incentive for the admin to fully test its programs
> > before turning on this sysctl - and avoid unexpected runtime errors.
>
> I don't think this stance is really acceptable -- if an admin that has
> privileges to load kernel modules is not able to disable a sysctl that
> can break working programs without rebooting there is
>
> When this sysctl was first proposed a few years ago (when kernel folks
> found out that runc was using executable memfds), my understanding is
> that the long-term goal was to switch programs to have
> non-executable-memfds by default on most distributions. Making it
> impossible for an admin to lower the sysctl value flies in the face of
> this goal.
>
> At the very least, being unable to lower the sysctl from =1 to =0 is
> just broken (even if you use the yama example -- yama only locks the
> sysctl at highest possible setting, not on lower settings). But in my
> view, having this sysctl ratchet at all doesn't make sense.
>
To reiterate/summarize the current mechanism for vm.memfd_noexec

1> It is a pid namespace sysctl,  init ns and child pid ns can have
different setting values.
2> child pid ns inherits parent's pid ns's sysctl at the time of fork.
3> There are  3 values for the sysctl, each higher value is more
restrictive than the lower one. Once set, doesn't allow downgrading.

It can be used as  following:
1>
init ns: vm.memfd_noexec = 2 (at boot time)
Not allow executable memfd for the entire system, including its containers.

2>
init ns: vm.memfd_noexec = 0 or 1
container (child init namespace) vm.memfd_noexec = 2.
The host allows runc's usage of executable memfd during container
creation. Inside the container, executable memfd is not allowed.

The inherence + not allow downgrading is to reason with how
vm.memfd_noexec is applied in the process tree.
Without it, essentially we are losing the hierarchy view across the
process tree and  a process can evaluate its capability by modifying
the setting. I think that is a less secure approach I would not
prefer.

Thanks

-Jeff


  reply	other threads:[~2023-08-02 19:47 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-13 14:33 Aleksa Sarai
2023-07-13 14:33 ` [RFC PATCH 1/3] memfd: cleanups for vm.memfd_noexec handling Aleksa Sarai
2023-07-13 14:33 ` [RFC PATCH 2/3] memfd: remove racheting feature from vm.memfd_noexec Aleksa Sarai
2023-07-14  0:07   ` Aleksa Sarai
2023-07-13 14:33 ` [RFC PATCH 3/3] selftests: memfd: error out test process when child test fails Aleksa Sarai
2023-07-19 23:38   ` Jeff Xu
2023-08-03  2:25     ` Aleksa Sarai
2023-08-03  2:55       ` Jeff Xu
2023-07-18  0:47 ` [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec Jeff Xu
2023-07-19  3:10   ` Aleksa Sarai
2023-07-19  7:17     ` Jeff Xu
2023-08-02  1:04       ` Aleksa Sarai
2023-08-02 19:47         ` Jeff Xu [this message]
2023-08-02 19:54           ` Jeff Xu
2023-08-02 21:38           ` Aleksa Sarai
2023-08-02 23:11             ` Jeff Xu
2023-08-03  4:39               ` Aleksa Sarai
2023-08-03 14:40                 ` Jeff Xu
2023-08-02 20:45         ` Jeff Xu
2023-08-02 21:48           ` Aleksa Sarai

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABi2SkXWfup2_UeKqm7C-xkjF5gnhKuxOP7TsRVa5MLbxabFQg@mail.gmail.com \
    --to=jeffxu@chromium.org \
    --cc=akpm@linux-foundation.org \
    --cc=cyphar@cyphar.com \
    --cc=dverkamp@chromium.org \
    --cc=jeffxu@google.com \
    --cc=keescook@chromium.org \
    --cc=linux-hardening@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mcgrof@kernel.org \
    --cc=shuah@kernel.org \
    --cc=yuehaibing@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox