From: David Hildenbrand
Organization: Red Hat
Date: Mon, 3 Apr 2023 10:21:48 +0200
Subject: Re: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted
To: Ackerley Tng, kvm@vger.kernel.org, linux-api@vger.kernel.org,
 linux-arch@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, qemu-devel@nongnu.org
Cc: aarcange@redhat.com, ak@linux.intel.com, akpm@linux-foundation.org,
 arnd@arndb.de, bfields@fieldses.org, bp@alien8.de,
 chao.p.peng@linux.intel.com, corbet@lwn.net, dave.hansen@intel.com,
 ddutile@redhat.com, dhildenb@redhat.com, hpa@zytor.com, hughd@google.com,
 jlayton@kernel.org, jmattson@google.com, joro@8bytes.org,
 jun.nakajima@intel.com, kirill.shutemov@linux.intel.com,
 linmiaohe@huawei.com, luto@kernel.org, mail@maciej.szmigiero.name,
 mhocko@suse.com, michael.roth@amd.com, mingo@redhat.com,
 naoya.horiguchi@nec.com, pbonzini@redhat.com, qperret@google.com,
 rppt@kernel.org, seanjc@google.com, shuah@kernel.org, steven.price@arm.com,
 tabba@google.com, tglx@linutronix.de, vannapurve@google.com, vbabka@suse.cz,
 vkuznets@redhat.com, wanpengli@tencent.com, wei.w.wang@intel.com,
 x86@kernel.org, yu.c.zhang@linux.intel.com
In-Reply-To: <592ebd9e33a906ba026d56dc68f42d691706f865.1680306489.git.ackerleytng@google.com>
References: <592ebd9e33a906ba026d56dc68f42d691706f865.1680306489.git.ackerleytng@google.com>

On 01.04.23 01:50, Ackerley Tng wrote:
> By default, the backing shmem file for a restrictedmem fd is created
> on shmem's kernel space mount.
>
> With this patch, an optional tmpfs mount can be specified via an fd,
> which will be used as the mountpoint for backing the shmem file
> associated with a restrictedmem fd.
>
> This will help restrictedmem fds inherit the properties of the
> provided tmpfs mounts, for example, hugepage allocation hints, NUMA
> binding hints, etc.
>
> Permissions for the fd passed to memfd_restricted() is modeled after
> the openat() syscall, since both of these allow creation of a file
> upon a mount/directory.
>
> Permission to reference the mount the fd represents is checked upon fd
> creation by other syscalls (e.g. fsmount(), open(), or open_tree(),
> etc) and any process that can present memfd_restricted() with a valid
> fd is expected to have obtained permission to use the mount
> represented by the fd. This behavior is intended to parallel that of
> the openat() syscall.
>
> memfd_restricted() will check that the tmpfs superblock is
> writable, and that the mount is also writable, before attempting to
> create a restrictedmem file on the mount.
>
> Signed-off-by: Ackerley Tng
> ---
>  include/linux/syscalls.h           |  2 +-
>  include/uapi/linux/restrictedmem.h |  8 ++++
>  mm/restrictedmem.c                 | 74 +++++++++++++++++++++++++++---
>  3 files changed, 77 insertions(+), 7 deletions(-)
>  create mode 100644 include/uapi/linux/restrictedmem.h
>
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index f9e9e0c820c5..a23c4c385cd3 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,7 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>  					    unsigned long home_node,
>  					    unsigned long flags);
> -asmlinkage long sys_memfd_restricted(unsigned int flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags, int mount_fd);
>
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/linux/restrictedmem.h b/include/uapi/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..22d6f2285f6d
> --- /dev/null
> +++ b/include/uapi/linux/restrictedmem.h
> @@ -0,0 +1,8 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_RESTRICTEDMEM_H
> +#define _UAPI_LINUX_RESTRICTEDMEM_H
> +
> +/* flags for memfd_restricted */
> +#define RMFD_USERMNT		0x0001U

I wonder if we can come up with a more expressive prefix than RMFD.
Sounds more like "rm fd" ;) Maybe it should better match the
"memfd_restricted" syscall name, like "MEMFD_RSTD_USERMNT".

> +
> +#endif /* _UAPI_LINUX_RESTRICTEDMEM_H */
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index c5d869d8c2d8..f7b62364a31a 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -1,11 +1,12 @@
>  // SPDX-License-Identifier: GPL-2.0
> -#include "linux/sbitmap.h"

Looks like an unrelated change?

> +#include
>  #include
>  #include
>  #include
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  struct restrictedmem {
> @@ -189,19 +190,20 @@ static struct file *restrictedmem_file_create(struct file *memfd)
>  	return file;
>  }
>
> -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> +static int restrictedmem_create(struct vfsmount *mount)
>  {
>  	struct file *file, *restricted_file;
>  	int fd, err;
>
> -	if (flags)
> -		return -EINVAL;
> -
>  	fd = get_unused_fd_flags(0);
>  	if (fd < 0)
>  		return fd;
>
> -	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +	if (mount)
> +		file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE);
> +	else
> +		file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +
>  	if (IS_ERR(file)) {
>  		err = PTR_ERR(file);
>  		goto err_fd;
> @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
>  	return err;
>  }
>
> +static bool is_shmem_mount(struct vfsmount *mnt)
> +{
> +	return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC;
> +}
> +
> +static bool is_mount_root(struct file *file)
> +{
> +	return file->f_path.dentry == file->f_path.mnt->mnt_root;
> +}

I'd inline at least that function; it's pretty self-explanatory.

> +
> +static int restrictedmem_create_on_user_mount(int mount_fd)
> +{
> +	int ret;
> +	struct fd f;
> +	struct vfsmount *mnt;
> +
> +	f = fdget_raw(mount_fd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	ret = -EINVAL;
> +	if (!is_mount_root(f.file))
> +		goto out;
> +
> +	mnt = f.file->f_path.mnt;
> +	if (!is_shmem_mount(mnt))
> +		goto out;
> +
> +	ret = file_permission(f.file, MAY_WRITE | MAY_EXEC);
> +	if (ret)
> +		goto out;
> +
> +	ret = mnt_want_write(mnt);
> +	if (unlikely(ret))
> +		goto out;
> +
> +	ret = restrictedmem_create(mnt);
> +
> +	mnt_drop_write(mnt);
> +out:
> +	fdput(f);
> +
> +	return ret;
> +}
> +
> +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
> +{
> +	if (flags & ~RMFD_USERMNT)
> +		return -EINVAL;
> +
> +	if (flags == RMFD_USERMNT) {
> +		if (mount_fd < 0)
> +			return -EINVAL;
> +
> +		return restrictedmem_create_on_user_mount(mount_fd);
> +	} else {
> +		return restrictedmem_create(NULL);
> +	}

You can drop the else case:

if (flags == RMFD_USERMNT) {
	...
	return restrictedmem_create_on_user_mount(mount_fd);
}

return restrictedmem_create(NULL);


I do wonder if you want to properly check for a flag instead of
comparing values. Results in a more natural way to deal with flags:

if (flags & RMFD_USERMNT) {

}

> +}
> +
>  int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
>  		       struct restrictedmem_notifier *notifier, bool exclusive)
>  {

The "memfd_restricted" vs. "restrictedmem" terminology is a bit
unfortunate, but not your fault here.


I'm not a FS person, but it does look good to me.

-- 
Thanks,

David / dhildenb
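
P.S. To illustrate the flag-test point above, here is a minimal userspace sketch (not kernel code, and not part of the patch). RMFD_USERMNT is taken from the quoted header; RMFD_HYPOTHETICAL, the enum and the two pick_* helpers are made up purely for illustration, assuming a second flag gets added some day. It only shows which branch each style of test would select.

```c
#include <assert.h>

/* From the quoted patch. */
#define RMFD_USERMNT		0x0001U
/* Hypothetical second flag, NOT in the patch; it exists only for this sketch. */
#define RMFD_HYPOTHETICAL	0x0002U

/* The two flags must occupy distinct bits for the comparison to mean anything. */
static_assert((RMFD_USERMNT & RMFD_HYPOTHETICAL) == 0, "flags must not overlap");

/* Stand-ins for the two creation paths in the syscall. */
enum rmfd_path {
	RMFD_PATH_DEFAULT_MOUNT,	/* restrictedmem_create(NULL) in the patch */
	RMFD_PATH_USER_MOUNT,		/* restrictedmem_create_on_user_mount() */
};

/* Value comparison, as in the patch: matches only if RMFD_USERMNT is the sole bit set. */
static enum rmfd_path pick_by_equality(unsigned int flags)
{
	if (flags == RMFD_USERMNT)
		return RMFD_PATH_USER_MOUNT;

	return RMFD_PATH_DEFAULT_MOUNT;
}

/* Bit test, as suggested: matches whenever RMFD_USERMNT is set, whatever else is set. */
static enum rmfd_path pick_by_bit(unsigned int flags)
{
	if (flags & RMFD_USERMNT)
		return RMFD_PATH_USER_MOUNT;

	return RMFD_PATH_DEFAULT_MOUNT;
}
```

With only RMFD_USERMNT defined, both helpers behave the same. Once a second bit is accepted, passing RMFD_USERMNT together with it would silently fall through to the default mount under the equality test, while the bit test would still pick the user mount.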