From: Pedro Falcato
Date: Thu, 19 Oct 2023 23:47:46 +0100
Subject: Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
To: Jeff Xu
Cc: Matthew Wilcox, jeffxu@chromium.org, akpm@linux-foundation.org,
 keescook@chromium.org, sroettger@google.com, jorgelo@chromium.org,
 groeck@chromium.org, linux-kernel@vger.kernel.org,
 linux-kselftest@vger.kernel.org, linux-mm@kvack.org, jannh@google.com,
 surenb@google.com, alex.sierra@amd.com, apopple@nvidia.com,
 aneesh.kumar@linux.ibm.com, axelrasmussen@google.com, ben@decadent.org.uk,
 catalin.marinas@arm.com, david@redhat.com, dwmw@amazon.co.uk,
 ying.huang@intel.com, hughd@google.com, joey.gouly@arm.com, corbet@lwn.net,
 wangkefeng.wang@huawei.com, Liam.Howlett@oracle.com,
 torvalds@linux-foundation.org, lstoakes@gmail.com, mawupeng1@huawei.com,
 linmiaohe@huawei.com, namit@vmware.com, peterx@redhat.com,
 peterz@infradead.org, ryan.roberts@arm.com, shr@devkernel.io, vbabka@suse.cz,
 xiujianfeng@huawei.com, yu.ma@intel.com, zhangpeng362@huawei.com,
 dave.hansen@intel.com, luto@kernel.org, linux-hardening@vger.kernel.org
References: <20231016143828.647848-1-jeffxu@chromium.org>

On Thu, Oct 19, 2023 at 6:30 PM Jeff Xu wrote:
>
> Hi Pedro
>
> Some followup on mmap() + mprotect():
>
> On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu wrote:
> >
> > On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato wrote:
> > >
> > > > >
> > > > > I think it's worth pointing out that this suggestion (with PROT_*)
> > > > > could easily integrate with mmap() and as such allow for one-shot
> > > > > mmap() + mseal().
> > > > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> > > > > definitely sounds like a performance win as we halve the number of
> > > > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> > > > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> > > > > like common patterns.
> > > > >
> > > > Yes. mmap() can support sealing as well, and memory is allocated as
> > > > immutable from begining.
> > > > This is orthogonal to mseal() though.
> > > >
> > > I don't see how this can be orthogonal to mseal().
> > > In the case we opt for adding PROT_ bits, we should more or less only
> > > need to adapt calc_vm_prot_bits(), and the rest should work without
> > > issues.
> > > vma merging won't merge vmas with different prots. The current
> > > interfaces (mmap and mprotect) would work just fine.
> > > In this case, mseal() or mimmutable() would only be needed if you need
> > > to set immutability over a range of VMAs with different permissions.
> > >
> > Agreed. By orthogonal, I meant we can have two APIs:
> > mmap() and mseal()/mprotect()
> > i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable().
> > Sealing can be applied after initial memory creation.
> >
> > > Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe
> > > The only annoying wrench in my plans here is that we have effectively
> > > run out of vm_flags bits in 32-bit architectures, so this approach as
> > > I described is not compatible with 32-bit.
> > >
> > > > In case of ld.so, iiuc, memory can be first allocated as W, then later
> > > > changed to RO, for example, during symbol resolution.
> > > > The important point is that the application can decide what type of
> > > > sealing it wants, and when to apply it. There needs to be an api(),
> > > > that can be mseal() or mprotect2() or mimmutable(), the naming is not
> > > > important to me.
> > > >
> > > > mprotect() in linux have the following signature:
> > > > int mprotect(void addr[.len], size_t len, int prot);
> > > > the prot bitmasks are all taken here.
> > > > I have not checked the prot field in mmap(), there might be bits left,
> > > > even not, we could have mmap2(), so that is not an issue.
> > > >
> > > I don't see what you mean. We have plenty of prot bits left (32-bits,
> > > and we seem to have around 8 different bits used).
> > > And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
> > >
> > > The only issue seems to be that 32-bit ran out of vm_flags, but that
> > > can probably be worked around if need be.
> > >
> > Ah, you are right about this. vm_flags is full, and prot in mprotect() is not.
> > Apology that I was wrong previously and caused confusion.
> >
> > There is a slight difference in the syntax of mprotect and mseal.
> > Each time when mprotect() is called, the kernel takes all of RWX bits
> > and updates vm_flags,
> > In other words, the application sets/unset each RWX, and kernel takes it.
> >
> > In the mseal() case, the kernel will remember which seal types were
> > applied previously, and the application doesn’t need to repeat all
> > existing seal types in the next mseal(). Once a seal type is applied,
> > it can’t be unsealed.
> >
> > So if we want to use mprotect() for sealing, developers need to think
> > of sealing bits differently than the rest of prot bits. It is a
> > different programming model, might or might not be an obvious concept
> > to developers.
> >
> This probably doesn't matter much to developers.
> We can enforce the sealing bit to be the same as the rest of PROT bits.
> If mprotect() tries to unset sealing, it will fail.

Yep. Erroneous or malicious mprotects would all be caught. However, if
we add a PROT_DOWNGRADEABLE (that could let you, let's say, mprotect()
to fewer permissions or even downright munmap()) you'd want some care
to preserve that bit when setting permissions.

>
> > There is a difference in input check and error handling as well.
> > for mseal(), if a given address range has a gap (unallocated memory),
> > or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is
> > updated.
> > For mprotect(), some VMAs can be updated, till an error happens to a VMA.
> >
> This difference doesn't matter much.
>
> For mprotect()/mmap(), is Linux implementation limited by POSIX ?

No. POSIX works merely as a baseline that UNIX systems aim towards.
You can (and very frequently do) extend POSIX interfaces (in fact,
it's how most of POSIX was written, through sheer
"design-by-committee" on a bunch of UNIX systems' extensions).

> This can be made backward compatible.
> If there is no objection to adding linux specific values in mmap() and
> mprotect(),
> This works for me.

Linux already has system-specific values for PROT_ (PROT_BTI,
PROT_MTE, PROT_GROWSUP, PROT_GROWSDOWN, etc). Whether this is the
right interface is another question. I do like it a lot, but there's
of course value in being compatible with existing solutions (like
mimmutable()).

-- 
Pedro