From mboxrd@z Thu Jan 1 00:00:00 1970
References: <20231016143828.647848-1-jeffxu@chromium.org>
From: Jeff Xu
Date: Mon, 23 Oct 2023 10:42:59 -0700
Subject: Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
To: Pedro Falcato
Cc: Jeff Xu, Matthew Wilcox, akpm@linux-foundation.org,
 keescook@chromium.org, sroettger@google.com, jorgelo@chromium.org,
 groeck@chromium.org, linux-kernel@vger.kernel.org,
 linux-kselftest@vger.kernel.org, linux-mm@kvack.org, jannh@google.com,
 surenb@google.com, alex.sierra@amd.com, apopple@nvidia.com,
 aneesh.kumar@linux.ibm.com, axelrasmussen@google.com, ben@decadent.org.uk,
 catalin.marinas@arm.com, david@redhat.com, dwmw@amazon.co.uk,
 ying.huang@intel.com, hughd@google.com, joey.gouly@arm.com, corbet@lwn.net,
 wangkefeng.wang@huawei.com, Liam.Howlett@oracle.com,
 torvalds@linux-foundation.org, lstoakes@gmail.com, mawupeng1@huawei.com,
 linmiaohe@huawei.com, namit@vmware.com, peterx@redhat.com,
 peterz@infradead.org, ryan.roberts@arm.com, shr@devkernel.io,
 vbabka@suse.cz, xiujianfeng@huawei.com, yu.ma@intel.com,
 zhangpeng362@huawei.com, dave.hansen@intel.com, luto@kernel.org,
 linux-hardening@vger.kernel.org

On Thu, Oct 19, 2023 at 3:47 PM Pedro Falcato wrote:
>
> On Thu, Oct 19, 2023 at 6:30 PM Jeff Xu wrote:
> >
> > Hi Pedro
> >
> > Some followup on mmap() + mprotect():
> >
> > On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu wrote:
> > >
> > > On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato wrote:
> > > >
> > > > > >
> > > > > > I think it's worth pointing out that this suggestion (with PROT_*)
> > > > > > could easily integrate with mmap() and as such allow for one-shot
> > > > > > mmap() + mseal().
> > > > > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> > > > > > definitely sounds like a performance win as we halve the number of
> > > > > > syscalls for a sealed mapping. And if we trivially look at e.g. OpenBSD
> > > > > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> > > > > > like common patterns.
> > > > > >
> > > > > Yes. mmap() can support sealing as well, and memory is allocated as
> > > > > immutable from the beginning.
> > > > > This is orthogonal to mseal() though.
> > > >
> > > > I don't see how this can be orthogonal to mseal().
> > > > In the case we opt for adding PROT_ bits, we should more or less only
> > > > need to adapt calc_vm_prot_bits(), and the rest should work without
> > > > issues.
> > > > vma merging won't merge vmas with different prots. The current
> > > > interfaces (mmap and mprotect) would work just fine.
> > > > In this case, mseal() or mimmutable() would only be needed if you need
> > > > to set immutability over a range of VMAs with different permissions.
> > > >
> > > Agreed. By orthogonal, I meant we can have two APIs:
> > > mmap() and mseal()/mprotect()
> > > i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable().
> > > Sealing can be applied after initial memory creation.
> > >
> > > > Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe
> > > > The only annoying wrench in my plans here is that we have effectively
> > > > run out of vm_flags bits in 32-bit architectures, so this approach as
> > > > I described is not compatible with 32-bit.
> > > >
> > > > > In case of ld.so, iiuc, memory can be first allocated as W, then later
> > > > > changed to RO, for example, during symbol resolution.
> > > > > The important point is that the application can decide what type of
> > > > > sealing it wants, and when to apply it. There needs to be an api(),
> > > > > that can be mseal() or mprotect2() or mimmutable(), the naming is not
> > > > > important to me.
> > > > >
> > > > > mprotect() in Linux has the following signature:
> > > > > int mprotect(void addr[.len], size_t len, int prot);
> > > > > the prot bitmasks are all taken here.
> > > > > I have not checked the prot field in mmap(), there might be bits left,
> > > > > even if not, we could have mmap2(), so that is not an issue.
> > > >
> > > > I don't see what you mean. We have plenty of prot bits left (32-bits,
> > > > and we seem to have around 8 different bits used).
> > > > And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
> > > >
> > > > The only issue seems to be that 32-bit ran out of vm_flags, but that
> > > > can probably be worked around if need be.
> > > >
> > > Ah, you are right about this. vm_flags is full, and prot in mprotect() is not.
> > > Apologies that I was wrong previously and caused confusion.
> > >
> > > There is a slight difference in the syntax of mprotect and mseal.
> > > Each time mprotect() is called, the kernel takes all of the RWX bits
> > > and updates vm_flags.
> > > In other words, the application sets/unsets each RWX bit, and the kernel takes it.
> > >
> > > In the mseal() case, the kernel will remember which seal types were
> > > applied previously, and the application doesn’t need to repeat all
> > > existing seal types in the next mseal(). Once a seal type is applied,
> > > it can’t be unsealed.
> > >
> > > So if we want to use mprotect() for sealing, developers need to think
> > > of sealing bits differently than the rest of prot bits. It is a
> > > different programming model, might or might not be an obvious concept
> > > to developers.
> > >
> > This probably doesn't matter much to developers.
> > We can enforce the sealing bit to be the same as the rest of PROT bits.
> > If mprotect() tries to unset sealing, it will fail.
>
> Yep. Erroneous or malicious mprotects would all be caught. However, if
> we add a PROT_DOWNGRADEABLE (that could let you, let's say, mprotect()
> to less permissions or even downright munmap()) you'd want some care
> to preserve that bit when setting permissions.
>
> >
> > > There is a difference in input check and error handling as well.
> > > for mseal(), if a given address range has a gap (unallocated memory),
> > > or if one of the VMAs is sealed with the MM_SEAL_SEAL flag, none of the VMAs is
> > > updated.
> > > For mprotect(), some VMAs can be updated, till an error happens to a VMA.
> > >
> > This difference doesn't matter much.
> >
> > For mprotect()/mmap(), is the Linux implementation limited by POSIX?
>
> No. POSIX works merely as a baseline that UNIX systems aim towards.
> You can (and very frequently do) extend POSIX interfaces (in fact,
> it's how most of POSIX was written, through sheer
> "design-by-committee" on a bunch of UNIX systems' extensions).
>
> > This can be made backward compatible.
> > If there is no objection to adding Linux-specific values in mmap() and
> > mprotect(),
> > This works for me.
>
> Linux already has system-specific values for PROT_ (PROT_BTI,
> PROT_MTE, PROT_GROWSUP, PROT_GROWSDOWN, etc).
> Whether this is the right interface is another question. I do like it
> a lot, but there's of course value in being compatible with existing
> solutions (like mimmutable()).
>

Thanks Pedro for providing examples of mm extensions to POSIX.
This opens more design options for solving the sealing problem.
I will take a few days to research design options.

-Jeff

> --
> Pedro
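
The two update models compared in the thread above can be sketched with a
small userspace model: mprotect() takes the complete set of RWX bits on each
call, while seal types accumulate and cannot be removed. This is only an
illustration under stated assumptions. struct vma_model, model_mprotect(),
model_mseal(), model_demo(), the M_* and SEAL_* constants, and the -EPERM
return value are all invented for this sketch; none of them come from the
patch series or from the kernel.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical userspace model of a single mapping; not kernel code. */
struct vma_model {
	unsigned int prot;  /* current R/W/X bits */
	unsigned int seals; /* seal types applied so far */
};

/* Illustrative bit values, invented for this sketch. */
#define M_READ        0x1u
#define M_WRITE       0x2u
#define M_EXEC        0x4u
#define SEAL_MPROTECT 0x1u
#define SEAL_MUNMAP   0x2u

/*
 * mprotect()-style update: the caller passes the complete new set of
 * R/W/X bits, which replaces the old set. The call fails once the
 * mapping carries the (hypothetical) mprotect seal; the error value is
 * an assumption made for this sketch.
 */
static int model_mprotect(struct vma_model *v, unsigned int prot)
{
	if (v->seals & SEAL_MPROTECT)
		return -EPERM;
	v->prot = prot;
	return 0;
}

/*
 * mseal()-style update: new seal types are OR-ed into the ones already
 * applied, so earlier seals need not be repeated, and the model offers
 * no way to clear a seal.
 */
static int model_mseal(struct vma_model *v, unsigned int seals)
{
	v->seals |= seals;
	return 0;
}

/* Walks through the sequence discussed in the thread. */
static void model_demo(void)
{
	struct vma_model v = { M_READ | M_WRITE, 0 };

	/* Dropping write access means passing the full new set: R only. */
	assert(model_mprotect(&v, M_READ) == 0);
	assert(v.prot == M_READ);

	/* Seals are applied one at a time and accumulate. */
	assert(model_mseal(&v, SEAL_MUNMAP) == 0);
	assert(model_mseal(&v, SEAL_MPROTECT) == 0);
	assert(v.seals == (SEAL_MUNMAP | SEAL_MPROTECT));

	/* After sealing, a later protection change is rejected. */
	assert(model_mprotect(&v, M_READ | M_WRITE) == -EPERM);
	assert(v.prot == M_READ);
}
```

Folding the seal into the prot argument, as suggested in the thread, would
make the first function responsible for both jobs, which is why the caller
would need to take care to keep the sealing bit set on every later call.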