From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 24636CDB484 for ; Tue, 17 Oct 2023 09:08:45 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 8BA478D010C; Tue, 17 Oct 2023 05:08:33 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 8686A8D0108; Tue, 17 Oct 2023 05:08:33 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 6BB8F8D010C; Tue, 17 Oct 2023 05:08:33 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 549648D0108 for ; Tue, 17 Oct 2023 05:08:33 -0400 (EDT) Received: from smtpin20.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 2021D1CBA71 for ; Tue, 17 Oct 2023 09:08:33 +0000 (UTC) X-FDA: 81354377706.20.2CDAB30 Received: from mail-qt1-f171.google.com (mail-qt1-f171.google.com [209.85.160.171]) by imf05.hostedemail.com (Postfix) with ESMTP id 4DEF710000E for ; Tue, 17 Oct 2023 09:08:30 +0000 (UTC) Authentication-Results: imf05.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=S1C5d0w6; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf05.hostedemail.com: domain of jeffxu@google.com designates 209.85.160.171 as permitted sender) smtp.mailfrom=jeffxu@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1697533710; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=OJB/KZh+XHTSt6vkeQK0/vvYJ6jT49v2mbim13MB91k=; b=1hBerqodIM8ZxROUQ2ArlxabaiRaCpWOp+KHLfQ10mDeyrbIiEQ0hegq1MPlsUP8tcQESE 0op7fRYV8K2TY6ETDXsQm8eDurWkwWgx4EHlFPfPDhlM5fkacU4VDUhsvNnOWy0RrL93S1 EoSR7/DNnWxTvPnin+S8LaL5JSyr1pA= ARC-Authentication-Results: i=1; imf05.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=S1C5d0w6; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf05.hostedemail.com: domain of jeffxu@google.com designates 209.85.160.171 as permitted sender) smtp.mailfrom=jeffxu@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1697533710; a=rsa-sha256; cv=none; b=xOReQzaVjp0++Izh8be7+oOmJOhsJ7hs25RUd2Qr/y5QfOPnkFLeSJPsxCOw10DuNDdDET NFTrmrPWPUxAcm+2tEqP4Vyj2Lxe0sCXXSPZTYu05es35alqywxrP035HHUYvFPhnXYTr9 VOZ7kWsp8pPG1n4LLBprr2JGw1Sg7/U= Received: by mail-qt1-f171.google.com with SMTP id d75a77b69052e-419b53acc11so194021cf.0 for ; Tue, 17 Oct 2023 02:08:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1697533709; x=1698138509; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=OJB/KZh+XHTSt6vkeQK0/vvYJ6jT49v2mbim13MB91k=; b=S1C5d0w67SBU5Ed6lD2vQnk/tg1in2de2dOjVWvGetjyAQ3rljQ6F6bjXV7fZwiz0M pXGfTDrfgVBH5AF8Aihnxaxc35IKQ/iYp4j4YIAJ2jr4qfIFtyE7g7m/WuYuRzh/j59t qiYAxxBHsJLf4+/1EM7phozwvITRKxIvUrLLsyczjKhZcKr55RCZxcJfGt4PLDPep4ir wMLDSN9Puk+BAqw/t+ydskZEq8Tgm9MIXYrUP6UczqjRmfR948R3vM0lEqEBKdybHbIA 9iIhHwsbniHhCQDAg2c7KbKduep7C93++9YDR02PiCsP0r6EtJE03071SgI6TcwS9weD fWTQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1697533709; x=1698138509; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=OJB/KZh+XHTSt6vkeQK0/vvYJ6jT49v2mbim13MB91k=; b=t4UkQMJ5UO+VHuRRBb3RLBDzRfrESaSjZlQFkWjmKXCy+JfS+73fFqkLhKeWIrM3/7 jcEp3ZeK/IhRZK2UFUnzVesjSC5u46WTzkobyCZu9GPP5u73NYEfyaz0kER6NDdYDiTK 12UcHbRFmUk8jjWqbKz17b1d0TGDip+s17Z8FkZzuBqxknjHQ+GH6oH40vxgGeYaG5ni eP7aCzxTujO9EzCSYQDdA/aCdlbKcQJpPqsT2w1GSgK3Lcxa+eQ4d2abhZkaQTTJNPL3 4WoIMcHT2DLXdcCIcqOABMdJ2E47NJ50B9hBcdjEwfbZgbMgjc/zJMHh46t5Ut3od/Sl XcHg== X-Gm-Message-State: AOJu0Yxal3C0fKKXH0H+NpoNaZQXb+AMx86wPSTlYBwmCNo+ym/+lZcX FhkD5C0r9kNzJDHxLAHefJwjfAPgApYf4DUyD940EA== X-Google-Smtp-Source: AGHT+IG1Wm4mK6Xhs/lc9gPqJBN7fbtaCLDjWvydeqel3QbWoveJj6PDGxuKsHLuwo955YiGUPJxiyZoROusg2wo3Ts= X-Received: by 2002:a05:622a:288f:b0:419:6cf4:2474 with SMTP id ke15-20020a05622a288f00b004196cf42474mr188894qtb.2.1697533709123; Tue, 17 Oct 2023 02:08:29 -0700 (PDT) MIME-Version: 1.0 References: <20231016143828.647848-1-jeffxu@chromium.org> In-Reply-To: From: Jeff Xu Date: Tue, 17 Oct 2023 02:07:52 -0700 Message-ID: Subject: Re: [RFC PATCH v1 0/8] Introduce mseal() syscall To: Linus Torvalds Cc: jeffxu@chromium.org, akpm@linux-foundation.org, keescook@chromium.org, sroettger@google.com, jorgelo@chromium.org, groeck@chromium.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org, jannh@google.com, surenb@google.com, alex.sierra@amd.com, apopple@nvidia.com, aneesh.kumar@linux.ibm.com, axelrasmussen@google.com, ben@decadent.org.uk, catalin.marinas@arm.com, david@redhat.com, dwmw@amazon.co.uk, ying.huang@intel.com, hughd@google.com, joey.gouly@arm.com, corbet@lwn.net, wangkefeng.wang@huawei.com, Liam.Howlett@oracle.com, lstoakes@gmail.com, willy@infradead.org, mawupeng1@huawei.com, linmiaohe@huawei.com, namit@vmware.com, peterx@redhat.com, peterz@infradead.org, ryan.roberts@arm.com, shr@devkernel.io, vbabka@suse.cz, xiujianfeng@huawei.com, yu.ma@intel.com, zhangpeng362@huawei.com, dave.hansen@intel.com, luto@kernel.org, linux-hardening@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 4DEF710000E X-Rspam-User: X-Rspamd-Server: rspam05 X-Stat-Signature: jg7epppn3yk7ucgn6y3cii7n3htwr6kd X-HE-Tag: 1697533710-756617 X-HE-Meta: U2FsdGVkX19fiqTk+iXSxq8+M13VdSwb74ysbf7AyvAu4P7EtRKXd6EhkQ3mb/ldo5Gr2s5h9UUUWVZzqyemJDUkemr3owUIar1LM17J8mHFQe6ErZ/ZEPVGPdcMEZ8YQpVy2HGZekEaJdZk+MgKt84iXUpv3ZHuUYjNN+u9DjuZfJxw0gcqkPcC0wV0cbr0YWKTpucfblCB187bgrpIKlT0+DmmBAJ44j6RujsGg4B8CtcG2UNp/gfKWWeRSAJ5cInMzOcFl902jGmMNrUtvqHdp5Qzy+n4rHLBRCtzJtKnttidOj4/0EiqJfmlunRxATJFyzH1Bes6vGnvAf+aHdl+a1+QgT8GJxMkxNoYIyKzxfC/zoSFi7p3moR658FYIYWBjLir5+JmluoLCiaMwzHTKma3M0+2syW7YOT2P8v7dSdnfc0duTB1vZxNHovzbC04KNjwSR6hlh+7Ul52oceDwSFerliPHhWOvp2Y1iHgmzW+n/QTc4R/txIR+kVUJCeN5jHkQB1Mfrrll73PMIxN8DxOSMzFn1lpp7ufr2VpMvwGN8yaCmwy0RcBuRM3LrcCaPv044C7lKkrcUEeDeqJX5r6t1mB62ad6YXCxBLO8VXDpKDTiWxYOrrQtEMNp1MbFdAVy5ka/DtUSloYMDzuXUHBpBoft7W3NKgw4iE1RC9mqRp0CSZtQfXpg5nxh8v2hBnSbgQCTZuco6B8pd8J+NoEAd3NTcN335De0d77qhkMG2yZJaz34mdhTh2VryMlJ0gC6BG5Cyt0qEs/z6x6OetXJT5O6oJleTzEaL0R73y24o0/8YWSd3F//yDUmOESe6SWPWui2C707tEZ8JUgViLhdFOWjTC5Dw3MqNU2Q8ONgvpvBiSBDRhZoSSPBkLMZh7EoesbdjcAe3y+HtSzyLkbny25031MaZUglgVasFYLiyjRkGyx5UlLWgCC54x4mB4zSHo8KdYUkkq Ba8e54HV rRQl5ymq53u1ga+IZBW1VcNVtVth4YTNPDu8tWK6okF0rY9BEkRvBMSpI1sWtXhNP96yqbyMKh6ZW8Uvr8pBTyIZag2m6FNk6vDfoTQvJ+y+eHt1si+8UWZ/gbV/TD1k/foEJsyuF4X656W+KREvV/v98XwSQq/P7uRgohnu6oPGMmItee41YHYdAzI4M2WjVWi5kFbfBpw9Mu/sihGwaRjK44GSByMo9fwqZP6iLtp40i1Dbm++ct6AloauckqOHvQcDAD4JUl+kI6kKleGmFD2XODr+GKgmW8lmeess2H8o9+hvcvESOPtZb8lpmiDleX3jEFZNbM3+mZqk7xhhKYQUtpGmy874gRPrW7kF1Ls6KXxxHpoHm+8c0c+sSKflGAGSPeSE87og05dAHcMQdm8rF4ALu/0sieyStpNjpAC8U4c= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Hello Linus, Thank you for the time reviewing this patch set. On Mon, Oct 16, 2023 at 10:23=E2=80=AFAM Linus Torvalds wrote: > > On Mon, 16 Oct 2023 at 07:38, wrote: > > > > This patchset proposes a new mseal() syscall for the Linux kernel. > > So I have no objections to adding some kind of "lock down memory > mappings" model, but this isn't it. > > First off, the simple stuff: the commit messages are worthless. Having > > check seal for mmap(2) > > as the commit message is not even remotely acceptable, to pick one > random example from the series (7/8). > > But that doesn't matter much, since I think the more fundamental > problems are much worse: > > - the whole "ON_BEHALF_OF_KERNEL" and "ON_BEHALF_OF_USERSPACE" is > just complete noise and totally illogical. The whole concept needs to > be redone. > > Example of complete nonsense (again, picking 7/8, but that's again > just a random example): > > > @@ -3017,8 +3022,8 @@ SYSCALL_DEFINE5(remap_file_pages, > > flags |=3D MAP_LOCKED; > > > > file =3D get_file(vma->vm_file); > > - ret =3D do_mmap(vma->vm_file, start, size, > > - prot, flags, pgoff, &populate, NULL); > > + ret =3D do_mmap(vma->vm_file, start, size, prot, flags, pgoff, > > + &populate, NULL, ON_BEHALF_OF_KERNEL); > > fput(file); > > out: > > mmap_write_unlock(mm); > > Christ. That's *literally* the remap_file_pages() system call > definition. No way in hell does "ON_BEHALF_OF_KERNEL" make any sense > in this context. > > It's not the only situation. "mremap() as the same thing. vm_munmap() > has the same thing. vm_brk_flags() has the same thing. None of them > make any sense. > > But even if those obvious kinds of complete mis-designs were to be > individually fixed, the whole naming and concept is bogus. The > "ON_BEHALF_OF_KERNEL" thing seems to actually just mean "don't check > sealing". Naming should describe what the thing *means*, not some > random policy thing that may or may not be at all relevant. > I apologize that I didn't think of a better name for ON_BEHALF_OF_XX and I should have written a more clear commit message. I prepared a V2 patchset with a more detailed commit message, hopefully that will help to explain the design. Indeed, the ON_BEHALF_OF_XX is confusing, especially with remap_file_pathes(). remap_file_pathes(2) is not supported in this patch set, this covers mprotect(2), mmap(2), munmap(2), mremap(2) as the first feature set. I could extend the sealing to more syscalls, if it is determined necessary from the outcome of this discussion. The initial set of 4 syscalls was chosen based on Chrome's initial wish list. Regarding the ON_BEHALF_OF flag, my intention is to have a flag, set at syscall entry points of mprotect(2), munmap(2), mremap(2), mmap(2), pass the flag along the call stack, till reaching can_modify_mm(), can_modify_mm() does the actual check for the sealing type. It is probably worth noting that I choose to check one and only one sealing type per syscall. i.e. munmap(2) checks MM_SEAL_MUNMAP only. With this approach, sealing can be implemented incrementally. For example, When implementing sealing for munmap(2), I don't need to care that mremap(2) can also call internal functions to unmap the VMAs. The mremap(2) will be sealed by MM_SEAL_MREMAP(), and be dealt with separately. This approach also allows dev to expand the sealing to madvice(), mlock(), or whatever syscalls or cases that modify VMA's meta data. As Yann points out, There is a list of cases that we might care about. Having all of those implemented will take time. Using bitmasks will help to add those incrementally. An application will be backward compatible when a new sealing type is added, i.e. It has to set the new sealing type explicitly. > - the whole MM_SEAL_xx vs VM_SEAL_xx artificial distinction needs to go = away. > > Not only is it unacceptable to pointlessly create two different name > spaces for no obvious reason, code like this (from 1/8) should not > exist: > > > + if (types & MM_SEAL_MSEAL) > > + newtypes |=3D VM_SEAL_MSEAL; > > + > > + if (types & MM_SEAL_MPROTECT) > > + newtypes |=3D VM_SEAL_MPROTECT; > > + > > + if (types & MM_SEAL_MUNMAP) > > + newtypes |=3D VM_SEAL_MUNMAP; > > + > > + if (types & MM_SEAL_MMAP) > > + newtypes |=3D VM_SEAL_MMAP; > > + > > + if (types & MM_SEAL_MREMAP) > > + newtypes |=3D VM_SEAL_MREMAP; > > Sure, we have that in some cases when there was a *reason* for > namespacing imposed on us from an API standpoint (ie the "open()" > flags that get turned into FMODE_xyz flags, or the MS_xyz mount flags > being turned into MNT_xyz flags), but those tend to be signs of > problems in the system call API where some mixup made it impossible to > use the flags directly (most commonly because there is some extended > internal representation of said things). > > For example, the MS_xyz namespace is a combination of "these are flags > for the new mount" (like MS_RDONLY) and "this is how you should mount > it" (like MS_REMOUNT), and so the user interface has that odd mixing > of different things, and the MNT_xyz namespace is a distillation of > the former. > > But we certainly should not strive to introduce *new* interfaces that > start out with this kind of mixup and pointless "translate from one > bit to another" code. > The two namespaces can go away, that means the bitmap will be stored as is by vm_seals in VMA struct. (1/8) Copied below. +++ b/include/linux/mm_types.h @@ -660,6 +660,13 @@ struct vm_area_struct { + unsigned long vm_seals; /* seal flags, see mm.h. */ My original considerations are: 1. vm_seals is a new field, and mseal() currently uses 5 bits. vm_seals can be repurposed to store other VMA flags in future, having two namespaces (API and internal one) can be useful. 2. vm_flags is full and it seems to me there is pending work on expanding vm_flags. [1] if that happens, we will need translation logic. [1] https://lore.kernel.org/linux-mm/4F6CA298.4000301@jp.fujitsu.com/ I might have over-engineered this, so I removed the VM_SEAL_XX in V2. > - And finally (for now), I hate that MM_ACTION_xyz thing too! > > Why do we have MM_ACTION_MREMAP, and pointless like this (from 3/8): > > > + switch (action) { > > + case MM_ACTION_MPROTECT: > > + if (vma->vm_seals & VM_SEAL_MPROTECT) > > + return false; > > + break; > > when the sane thing to do is to use the *same* MM_SEAL_xyz bitmask and > sealing bitmask and just say > > if (vma->vm_seal & vm_action) > return -EPERM; > Make sense. My original thought is that can_modify_vma() will check one and only one seal type, and having an enum type will enforce that. This restriction feels unnecessary. I removed the action type in V2. > IOW, you have pointlessly introduced not *two* different namespaces, > but *three*. All doing the exact same thing, and all just causing > pointless and ugly code to "translate" the actions of one into the > model of another. > > So get rid of the "MM_ACTION_xyz" thing. Get rid of ther "VM_SEAL_xyz" > thing. Use *one* single "these are the sealing bits". > > And get rid of "enum caller_origin" entirely. I don't know what the > right model for that thing is, but that isn't it. > > *Maybe* the right model is some MM_SEAL_OVERRIDE bit that user space > cannot set, but that the kernel can use internally - and if that is > the right model, then dammit, the *uses* should be very very obvious > why the override is fine, because that remap_file_pages() use sure as > hell was *not* obvious. > > In fact, it's not at all obvious why anything should override the > sealing bits - EVER. So I'm not convinced that the right model is > "replace ON_BEHALF_OF_KERNEL with MM_SEAL_OVERRIDE". Why would we > *ever* want to override sealing? It sounds like complete garbage. Most > of the users seem to be things like "execve()", which is nonsensical, > since the VM wouldn't have been sealed at that point _anyway_, so > having a "don't bother checking sealing bits" flag seems entirely > useless. > Would the new commit message and comments in V2 help to explain the design better ? (will send shortly) Another code change I can make to help the readability (not in v2), is to set and pass checkSeals flag from syscall entry point, all the way to can_modify_vma(). Currently, I didn't do that if I checked the internal function is only used by syscal entry point, e.g. in do_mprotect_pkey(), mremap_to(), ksys_mmap_pgoff() cases. Doing that does increase the size of the patch set though. Thanks. -Jeff -Jeff > Anyway, this is all a resounding NAK on this series in this form. My > complaints are not some kind of small "fix this up". These are > fundamental issues. > > Linus