From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 190E2EB64DD for ; Tue, 18 Jul 2023 19:04:38 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 6BD068D001A; Tue, 18 Jul 2023 15:04:38 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 66B038D0012; Tue, 18 Jul 2023 15:04:38 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 532B18D001A; Tue, 18 Jul 2023 15:04:38 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 440308D0012 for ; Tue, 18 Jul 2023 15:04:38 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id A8D4A140411 for ; Tue, 18 Jul 2023 19:04:37 +0000 (UTC) X-FDA: 81025658994.05.27D9981 Received: from mail-qt1-f180.google.com (mail-qt1-f180.google.com [209.85.160.180]) by imf29.hostedemail.com (Postfix) with ESMTP id 811A4120024 for ; Tue, 18 Jul 2023 19:04:34 +0000 (UTC) Authentication-Results: imf29.hostedemail.com; dkim=pass header.d=google.com header.s=20221208 header.b=21v6dXrU; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf29.hostedemail.com: domain of jeffxu@google.com designates 209.85.160.180 as permitted sender) smtp.mailfrom=jeffxu@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1689707074; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=UP4E+krXJUqcP/XL8AvesdP7FQu4JyJQ3RJDPmgOZF8=; b=I9q0ldkr/946UIEuTOfGKOxbTpn5Ft6Zz9Fkwngt1lWj0JDA4beR9Q/Lu779R24PxBf4zw rZaDUpAT9Udf0qeDjgoOwSzRmJEClfr53jg8WIRZDOvzJghANSauGyzXNvD4/+UZhHeAsF TZsFZX10A0ilaW+8yf10lZe/EeNoxQE= ARC-Authentication-Results: i=1; imf29.hostedemail.com; dkim=pass header.d=google.com header.s=20221208 header.b=21v6dXrU; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf29.hostedemail.com: domain of jeffxu@google.com designates 209.85.160.180 as permitted sender) smtp.mailfrom=jeffxu@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1689707074; a=rsa-sha256; cv=none; b=dvTcgzTLCAfHcJqtVLTguVOjc/iHYljB3IQhY6iTuHD7W1XoqTjDHKmEsWeYVfgsHjI5Mh 3o7TkAGSkLclc6bxYcDBGUYPFY0IkMs7BR28p5EWRxffOWy2gtC/lsR5Zej8WK/WcXAQMH ppeE2IYzQ7d0jmZDUBk0MN0F4lYdFnY= Received: by mail-qt1-f180.google.com with SMTP id d75a77b69052e-4036bd4fff1so434121cf.0 for ; Tue, 18 Jul 2023 12:04:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1689707073; x=1692299073; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=UP4E+krXJUqcP/XL8AvesdP7FQu4JyJQ3RJDPmgOZF8=; b=21v6dXrULsCH8KDeCheN7Pz5gu95bHxPXaU2t1wV8WDWMJA77Iz6q0xb8xHeB8eBNK zBajR7tCwfn41oLjzTQRqjX4Hxyfa3QqUZKBAGEuS24b0VN+RcHrf+aNZjohziD/rmKw 9vJBch+0p3qJNDmsj+Twj5JT7Y/zjc5mpzoknWLforM903oigoaouaveKT4ZgDFuBtGl MsyxmPg4xZberTUdczm/quU9iQabGH6LJ4h5gtlztS2pkC/5JC3xJmGf79W1tVmrpxJG qNt+YDelDSvg7wVBL2a4tJ6l6o3+BAKSN8yEWd7GUIcw/UQtBVFPNsS53HB5hg/1z/u2 qomQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1689707073; x=1692299073; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=UP4E+krXJUqcP/XL8AvesdP7FQu4JyJQ3RJDPmgOZF8=; b=DAqUs/xUOaPAwYXqMU/uq+Ex6AOwAmszZj8h57jH99RoN2bL8kC2AmlM7MBiRgFUQi Yylz4fkOpqkEc0amXS3QzXNF8Tgk/3kt1rth18JomoFj1v6ZeE2SOmLodCVV6ctb5QAy jT2ySRjHlfk3imraMxZirruxsQ8WdkWh0k9Gn0RKjgweaadKSNvwkKviu+jbtqTnlSv1 9m706QB2+LVkH3sDhKZRTSJvt5xEocOrl57jIwaMgLIwbWEbnuCUWr306rDeaZaHeZFm 2QFWphR2n4k/pp+1D0iX3mBJFRdxofrYUaNX7oZXjYvLXqtxkzb6gqyYuZRY9ekEPEhT xgww== X-Gm-Message-State: ABy/qLaxzMT6JYZnpMLpq5TY60G2L0tqnVn2b6P8Ai2tsYoDJaDHsF1y KaZWIWtX96Rl+/YVmgjin748+GUtvnlN8etx3HugPw== X-Google-Smtp-Source: APBJJlEqTCm929lAwWSFDX8sdYEaCp3/sAVGWtRDWQbwohxt/YNaKGxUn5vTLiOLRpmy5hemXBjhYQqtkuV9q1y42Xc= X-Received: by 2002:a05:622a:1882:b0:3f2:2c89:f1ef with SMTP id v2-20020a05622a188200b003f22c89f1efmr288088qtc.5.1689707073502; Tue, 18 Jul 2023 12:04:33 -0700 (PDT) MIME-Version: 1.0 References: <20230714114753.170814-1-david@readahead.eu> In-Reply-To: <20230714114753.170814-1-david@readahead.eu> From: Jeff Xu Date: Tue, 18 Jul 2023 12:03:57 -0700 Message-ID: Subject: Re: [PATCH] memfd: support MFD_NOEXEC alongside MFD_EXEC To: David Rheinsberg Cc: linux-kernel@vger.kernel.org, Andrew Morton , Kees Cook , Daniel Verkamp , linux-mm@kvack.org, Peter Xu , linux-hardening@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: 811A4120024 X-Stat-Signature: d6az5r4suj1uzwm1o7ptb7uxaepghp9b X-HE-Tag: 1689707074-43407 X-HE-Meta: U2FsdGVkX1+gvJIsPAPFy1J4qfF6+0mdlXkTEwlUMvlGYQNM3YZQ5MRGud9pnWt9+DTZZ+tlydStwpaavd3N4Y2FRJ+RdFmvF3Ri10r/qgyley43CM9cYwdBxXjVC869iZNXGXRV4s/xV2TpScg+ip09Dt105JwbOBhe9OIBFYn9p2LxYKaRDqgCvOBv4+W526fbJddB4d4BmwIIUGgWVsS0a1pOOrUzT7vPl9m0hRRQcztPkkoRR225AXO5u6fOgdqX7aXEQzReRo/8l/amWahrkYHCp4Dlu0f8Tjdu5rVt2OpB9OCNNYBXtkyjVdOB4TpbwglPzX/wR6LI7tAzf+0XoPpCKkHGI4jp/aOxJueBIj4CXPJezP6xsADmBDpyO6XxJQ18a742/pJajalyzkS8juSmkaxNFQxyf8LBYD2Tv91Vhq2Qm4GpI4780OX0opP08Wk9pDA/yzaJ/aAaYwQFP1hr5LMnVYh56AXd6ApXe/+f53jk5OT6KUIWIJS9EHt0spAz1um31ndt/EsHrc3hVjGmdj4PtK4Rk7Q89yYRGOmvXBeLbdqaekh4E6lDXpk514KmvDlk1SOncttgkmDXChmjZKrMAmUw0963zuR7Arj/Hshf1atg97jMA4Bf9KIx7hAvDQeIbF2GMi1fdzTscSMsfTDI0V3e8t0t5MSUtlB0gWS00UHY7xNeRnagqrxSqqKXcaBzW0hTLuIqp+pnD3Z7sTR3FaTC0yJ9z1RNNT4OxEuhUShsEqZG7EW4LwXKpxIWp4U/MZe19g8mQGJkC725870Rg4pPilSQEZGYMWUf2Sh6L0mmUf1KF0v3APCxKkJmqwdZWA7t9Bg36mJnB0bDKcJ8zo++0lJpe+wA3qC1REU06O8NuRGuPWVryFI1e6fkFPgc7iPjHa8kzqPHmUzS32KowXCCTD0MQnS9OS+kR8bBOqbTUq7Czp7oWCkiuWjrH2GWlWexTj0 hX9oKZJP d1TcaB6CJR27XHPG+tVBCDKKdNtDMIZL2KYtBUyJEeDjGprfZigBDJ8M9Uxwgfbx8fCuV8XFZsnM51rITmA0ztHRzJgYIIvn8nbwpKU5iynvngkzNP+xEvtl86i0nr/2PS3Qo/dlUY/kNpxwhXturFU7qswqAVIVxxAX7TfGBgpL/R/gAP9MjoN956FYFh7yUCvDJF1PNpiBrW3eJlRqM+SPsQpPP8Vn+KjmIdW7j0sycoHrsnsECH/qVCvcrkfEeZuOkLbiAzGvJeiZzhkcbZ60TFXTqnfYv5nUwhJ1ikkeM8aSSzu752RvBtm7x3utH8rMI07YgPnYvhkM= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Hi David Thanks email and patch for discussion. On Fri, Jul 14, 2023 at 4:48=E2=80=AFAM David Rheinsberg wrote: > > Add a new flag for memfd_create() called MFD_NOEXEC, which has the > opposite effect as MFD_EXEC (i.e., it strips the executable bits from > the inode file mode). > I previously thought about having the symmetric flags, such as MFD_NOEXEC/MFD_EXEC/MFD_NOEXEC_SEAL/MFD_EXEC_SEAL, but decided against it. The app shall decide beforehand what the memfd is created for, if it is no-executable, then it should be sealed, such that it can't be chmod to enable "X" bit. > The default mode for memfd_create() has historically been to use 0777 as > file modes. The new `MFD_EXEC` flag has now made this explicit, paving > the way to reduce the default to 0666 and thus dropping the executable > bits for security reasons. Additionally, the `MFD_NOEXEC_SEAL` flag has > been added which allows this without changing the default right now. > > Unfortunately, `MFD_NOEXEC_SEAL` enforces `MFD_ALLOW_SEALING` and > `F_SEAL_EXEC`, with no alternatives available. This leads to multiple > issues: > > * Applications that do not want to allow sealing either have to use > `MFD_EXEC` (which we don't want, unless they actually need the > executable bits), or they must add `F_SEAL_SEAL` immediately on > return of memfd_create(2) with `MFD_NOEXEC_SEAL`, since this > implicitly enables sealing. > > Note that we explicitly added `MFD_ALLOW_SEALING` when creating > memfd_create(2), because enabling seals on existing users of shmem > without them knowing about it can easily introduce DoS scenarios. The application that doesn't want MFD_NOEXEC_SEAL can use MFD_EXEC, the kernel won't add MFD_ALLOW_SEALING implicitly. MFD_EXEC makes the kernel behave the same as before, this is also why sysctl vm.memfd_noexec=3D0 can work seamlessly. > It > is unclear why `MFD_NOEXEC_SEAL` was designed to enable seals, and > this is especially dangerous with `MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL` > set via sysctl, since it will silently enable sealing on every memfd > created. > Without sealing, chmod(2) can modify the mfd to be executable, that is the consideration that MFD_NOEXEC is not provided as an option. Indeed, current design is "biased" towards promoting MFD_NOEXEC_SEAL as the safest approach, and try to avoid the pitfall that dev accidently uses "MFD_NOEXEC" without realizing it can still be chmod(). > * Applications that do not want `MFD_EXEC`, but rely on > `F_GET_SEALS` to *NOT* return `F_SEAL_EXEC` have no way of achieving > this other than using `MFD_EXEC` and clearing the executable bits > manually via fchmod(2). Using `MFD_NOEXEC_SEAL` will set > `F_SEAL_EXEC` and thus break existing code that hard-codes the > seal-set. > > This is already an issue when sending log-memfds to systemd-journald > today, which verifies the seal-set of the received FD and fails if > unknown seals are set. Hence, you have to use `MFD_EXEC` when > creating memfds for this purpose, even though you really do not need > the executable bits set. > > * Applications that want to enable the executable bit later on, > currently have no way to create the memfd without it. They have to > clear the bits immediately after creating it via fchmod(2), or just > leave them set. > Is it OK to do what you want in two steps ? What is the concern there ? i.e= . memfd_create(MFD_EXEC), then chmod to remove the "X" bit. I imagine this is probably already what the application does for creating no-executable mfd before my patch, i.e.: memfd_create(), then chmod() to remove "X" to remove "X" bit. Thanks! -Jeff -Jeff > By adding MFD_NOEXEC, user-space is now able to clear the executable > bits on all memfds immediately, clearly stating their intention, without > requiring fixups after creating the memfd. In particular, it is highly > useful for existing use-cases that cannot allow new seals to appear on > memfds. > > Signed-off-by: David Rheinsberg > --- > include/linux/pid_namespace.h | 4 ++-- > include/uapi/linux/memfd.h | 1 + > mm/memfd.c | 29 ++++++++++++++--------------- > 3 files changed, 17 insertions(+), 17 deletions(-) > > diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.= h > index c758809d5bcf..02f8acf94512 100644 > --- a/include/linux/pid_namespace.h > +++ b/include/linux/pid_namespace.h > @@ -19,9 +19,9 @@ struct fs_pin; > #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) > /* > * sysctl for vm.memfd_noexec > - * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL > + * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC > * acts like MFD_EXEC was set. > - * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL > + * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC > * acts like MFD_NOEXEC_SEAL was set. > * 2: memfd_create() without MFD_NOEXEC_SEAL will be > * rejected. > diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h > index 273a4e15dfcf..b9e13bd65817 100644 > --- a/include/uapi/linux/memfd.h > +++ b/include/uapi/linux/memfd.h > @@ -12,6 +12,7 @@ > #define MFD_NOEXEC_SEAL 0x0008U > /* executable */ > #define MFD_EXEC 0x0010U > +#define MFD_NOEXEC 0x0020U > > /* > * Huge page size encoding when MFD_HUGETLB is specified, and a huge pag= e > diff --git a/mm/memfd.c b/mm/memfd.c > index e763e76f1106..2afe49368fc5 100644 > --- a/mm/memfd.c > +++ b/mm/memfd.c > @@ -266,7 +266,9 @@ long memfd_fcntl(struct file *file, unsigned int cmd,= unsigned int arg) > #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) > #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN) > > -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | M= FD_NOEXEC_SEAL | MFD_EXEC) > +#define MFD_ALL_FLAGS \ > + (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \ > + MFD_NOEXEC_SEAL | MFD_EXEC | MFD_NOEXEC) > > SYSCALL_DEFINE2(memfd_create, > const char __user *, uname, > @@ -289,11 +291,13 @@ SYSCALL_DEFINE2(memfd_create, > return -EINVAL; > } > > - /* Invalid if both EXEC and NOEXEC_SEAL are set.*/ > - if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL)) > + if (flags & MFD_NOEXEC_SEAL) > + flags |=3D MFD_ALLOW_SEALING | MFD_NOEXEC; > + > + if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC)) > return -EINVAL; > > - if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) { > + if (!(flags & (MFD_EXEC | MFD_NOEXEC))) { > #ifdef CONFIG_SYSCTL > int sysctl =3D MEMFD_NOEXEC_SCOPE_EXEC; > struct pid_namespace *ns; > @@ -366,20 +370,15 @@ SYSCALL_DEFINE2(memfd_create, > file->f_mode |=3D FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; > file->f_flags |=3D O_LARGEFILE; > > - if (flags & MFD_NOEXEC_SEAL) { > - struct inode *inode =3D file_inode(file); > + if (flags & MFD_NOEXEC) > + file_inode(file)->i_mode &=3D ~0111; > > - inode->i_mode &=3D ~0111; > - file_seals =3D memfd_file_seals_ptr(file); > - if (file_seals) { > + file_seals =3D memfd_file_seals_ptr(file); > + if (file_seals) { > + if (flags & MFD_ALLOW_SEALING) > *file_seals &=3D ~F_SEAL_SEAL; > + if (flags & MFD_NOEXEC_SEAL) > *file_seals |=3D F_SEAL_EXEC; > - } > - } else if (flags & MFD_ALLOW_SEALING) { > - /* MFD_EXEC and MFD_ALLOW_SEALING are set */ > - file_seals =3D memfd_file_seals_ptr(file); > - if (file_seals) > - *file_seals &=3D ~F_SEAL_SEAL; > } > > fd_install(fd, file); > -- > 2.41.0 >