From: Jeff Xu
Date: Wed, 28 Jun 2023 21:13:58 -0700
Subject: Re: [PATCH v8 3/5] mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC
To: Dominique Martinet
Cc: skhan@linuxfoundation.org, keescook@chromium.org,
    akpm@linux-foundation.org, dmitry.torokhov@gmail.com,
    dverkamp@chromium.org, hughd@google.com, jeffxu@google.com,
    jorgelo@chromium.org, linux-kernel@vger.kernel.org,
    linux-kselftest@vger.kernel.org, linux-mm@kvack.org, jannh@google.com,
    linux-hardening@vger.kernel.org, linux-security-module@vger.kernel.org,
    kernel test robot
References: <20221215001205.51969-1-jeffxu@google.com>
    <20221215001205.51969-4-jeffxu@google.com>

Hello,

Thank you for your email and your interest in using memfd_noexec!

On Wed, Jun 28, 2023 at 4:43 AM Dominique Martinet wrote:
>
> jeffxu@chromium.org wrote on Thu, Dec 15, 2022 at 12:12:03AM +0000:
> > From: Jeff Xu
> >
> > The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to
> > set executable bit at creation time (memfd_create).
> >
> > When MFD_NOEXEC_SEAL is set, memfd is created without executable bit
> > (mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to
> > be executable (mode: 0777) after creation.
> >
> > when MFD_EXEC flag is set, memfd is created with executable bit
> > (mode:0777), this is the same as the old behavior of memfd_create.
> >
> > The new pid namespaced sysctl vm.memfd_noexec has 3 values:
> > 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
> >    MFD_EXEC was set.
> > 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
> >    MFD_NOEXEC_SEAL was set.
> > 2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.
>
> So, erm, I'm a bit late to the party but I was just looking at a way of
> blocking memfd_create+exec in a container and this sounded perfect: my
> reading is that this is a security feature meant to be set for
> container's namespaces that'd totally disable something like
> memfd_create followed by fexecve (because we don't want weird binaries
> coming from who knows where to be executed on a shiny secure system),
> but... is this actually supposed to work?
> (see below)
>
> > [...]
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -263,12 +264,14 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
> >  #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> >  #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
> >
> > -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> > +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
> >
> >  SYSCALL_DEFINE2(memfd_create,
> >                  const char __user *, uname,
> >                  unsigned int, flags)
> >  {
> > +        char comm[TASK_COMM_LEN];
> > +        struct pid_namespace *ns;
> >          unsigned int *file_seals;
> >          struct file *file;
> >          int fd, error;
> > @@ -285,6 +288,39 @@ SYSCALL_DEFINE2(memfd_create,
> >                          return -EINVAL;
> >          }
> >
> > +        /* Invalid if both EXEC and NOEXEC_SEAL are set.*/
> > +        if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL))
> > +                return -EINVAL;
> > +
> > +        if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
> > +                [code that checks the sysctl]
>
> If flags already has either MFD_EXEC or MFD_NOEXEC_SEAL, you don't check
> the sysctl at all.
>
> This can be verified easily:
> -----
> $ cat > memfd_exec.c <<'EOF'
> #define _GNU_SOURCE
>
> #include <stdio.h>
> #include <sys/mman.h>
> #include <unistd.h>
>
> #ifndef MFD_EXEC
> #define MFD_EXEC 0x0010U
> #endif
>
> int main() {
>     int fd = memfd_create("script", MFD_EXEC);
>     if (fd == -1)
>         perror("memfd");
>
>     char prog[] = "#!/bin/sh\necho Ran script\n";
>     if (write(fd, prog, sizeof(prog)-1) != sizeof(prog)-1)
>         perror("write");
>
>     char *const argv[] = { "script", NULL };
>     char *const envp[] = { NULL };
>     fexecve(fd, argv, envp);
>     perror("fexecve");
> }
> EOF
> $ gcc -o memfd_exec memfd_exec.c
> $ ./memfd_exec
> Ran script
> $ sysctl vm.memfd_noexec
> vm.memfd_noexec = 2
> -----
> (as opposed to failing hard on memfd_create if flag unset on sysctl=2,
> and failing on fexecve with flag unset and sysctl=1)
>
> What am I missing?
>
>
At one point, I was considering a security hook to block executable
memfds [1], which is why this sysctl only applies to applications that
don't set the EXEC bit. Now I think it makes sense for
vm.memfd_noexec = 2 to block MFD_EXEC as well.

Anyway, the commit message says:
2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.

Not doing that is a bug. I will send a fix for that.

[1] https://lore.kernel.org/lkml/20221206150233.1963717-7-jeffxu@google.com/

>
> BTW I find the current behaviour rather hard to use: setting this to 2
> should still set NOEXEC by default in my opinion, just refuse anything
> that explicitly requested EXEC.
>
At one point [2] (v2 of the patch) there were two sysctls: one to
override the flags and one to enforce them. Later I settled on a single
sysctl; the rationale is that the kernel will eventually get out of the
business of overriding user space code. Yes, it might take a long time
to migrate all of userspace.

In the meantime, to get what you want, the solution is to keep
vm.memfd_noexec = 1 (for the override) and add a new security policy
(SELinux or Landlock) that uses the security_memfd_create security
hook; this can block a process from creating an executable memfd.
Indeed, a security policy is a better fit for cases like this than a
sysctl.

[2] https://lore.kernel.org/linux-mm/CABi2SkWGo9Jrd=i1e2PoDWYGenGhR=pG=yGsQP5VLmizTmg-iA@mail.gmail.com/

> Sure there's a warn_once that memfd_create was used without seal, but
> right now on my system it's "used up" 5 seconds after boot by systemd:
> [    5.854378] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=1 'systemd'
>
> And anyway, older kernels will barf up EINVAL when calling memfd_create
> with MFD_NOEXEC_SEAL, so even if userspace will want to adapt they'll
> need to try calling memfd_create with the flag once and retry on EINVAL,
> which let's face it is going to take a while to happen.
> (Also, the flag has been added to glibc, but not in any release yet)
>
Yes.
Applications will need to detect whether the kernel supports the flag.
This is unavoidable.

> Making calls default to noexec AND refuse exec does what you want
> (forbid use of exec in an app that wasn't in a namespace that allows
> exec) while allowing apps that require it to work; that sounds better
> than making all applications that haven't taken the pain of adding the
> new flag to me.
> Well, I guess an app that did require exec without setting the flag will
> fail in a weird place instead of failing at memfd_create and having a
> chance to fallback, so it's not like it doesn't make any sense;
> I don't have such strong feelings about this if the sysctl works, but
> for my use case I'm more likely to want to take a chance at memfd_create
> not needing exec than having the flag set. Perhaps a third value if I
> cared enough...
>
> --
> Dominique Martinet | Asmadeus

Thanks
-Jeff
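
For reference, the kernel detection discussed above (try memfd_create
with MFD_NOEXEC_SEAL once, retry on EINVAL) could look roughly like the
sketch below. It is only an illustration, not code from this series:
memfd_create_noexec() is a hypothetical helper name, the fallback
definition of MFD_NOEXEC_SEAL assumes the value from the kernel uapi
header, and EINVAL can also mean that some other flag was invalid, so a
real application may want a stricter check.

```c
#define _GNU_SOURCE

#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/* Older libc headers may not define the flag yet (uapi value assumed). */
#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif

/*
 * Hypothetical helper, not part of the patch: request a non-executable,
 * sealed memfd first. Kernels that do not know MFD_NOEXEC_SEAL reject
 * the call with EINVAL, in which case we retry with the caller's
 * original flags only.
 */
static int memfd_create_noexec(const char *name, unsigned int flags)
{
	int fd = memfd_create(name, flags | MFD_NOEXEC_SEAL);

	if (fd == -1 && errno == EINVAL)
		fd = memfd_create(name, flags);

	return fd;
}
```

A caller would use it as a drop-in replacement, for example
memfd_create_noexec("buf", MFD_CLOEXEC), and handle -1 as with
memfd_create(). Note that the retry path creates the memfd without
either flag, so on a kernel with this series it is still subject to
vm.memfd_noexec.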