From: Jeff Xu
Date: Mon, 30 Sep 2024 17:24:39 -0700
Subject: Re: [PATCH v1 1/1] mseal: update mseal.rst
To: Pedro Falcato
Cc: akpm@linux-foundation.org, keescook@chromium.org, corbet@lwn.net,
 jorgelo@chromium.org, groeck@chromium.org, linux-kernel@vger.kernel.org,
 linux-kselftest@vger.kernel.org, linux-mm@kvack.org, jannh@google.com,
 sroettger@google.com, linux-hardening@vger.kernel.org, willy@infradead.org,
 gregkh@linuxfoundation.org, torvalds@linux-foundation.org,
 deraadt@openbsd.org, usama.anjum@collabora.com, surenb@google.com,
 merimus@google.com, rdunlap@infradead.org, lorenzo.stoakes@oracle.com,
 Liam.Howlett@oracle.com, enh@google.com
References: <20240927185211.729207-1-jeffxu@chromium.org>
 <20240927185211.729207-2-jeffxu@chromium.org>
 <2vkppisejac42wnawjkd7qzyybuycu667yxwmsd4pfk5rwhiqc@gszyo5lu24ge>
 <2q6hzkvep2g3z6m2jrwbw2j3sbydf6tgj2obwd6hgmm7xzgsg3@ddr5ghmsia5k>
In-Reply-To: <2q6hzkvep2g3z6m2jrwbw2j3sbydf6tgj2obwd6hgmm7xzgsg3@ddr5ghmsia5k>

Hi Pedro

On Sat, Sep 28, 2024 at 6:43 AM Pedro Falcato wrote:
>
> On Fri, Sep 27, 2024 at 06:29:30PM GMT, Jeff Xu wrote:
> > Hi Pedro,
> >
> > On Fri, Sep 27, 2024 at 3:59 PM Pedro Falcato wrote:
> > > > +
> > > > +   Blocked mm syscall:
> > > > +      - munmap
> > > > +      - mmap
> > > > +      - mremap
> > > > +      - mprotect and pkey_mprotect
> > > > +      - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
> > > > +        MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK
> > > > +
> > > > +   The first set of syscall to block is munmap, mremap, mmap. They can
> > > > +   either leave an empty space in the address space, therefore allow
> > > > +   replacement with a new mapping with new set of attributes, or can
> > > > +   overwrite the existing mapping with another mapping.
> > > > +
> > > > +   mprotect and pkey_mprotect are blocked because they changes the
> > > change
> > > > +   protection bits (rwx) of the mapping.
> > > > +
> > > > +   Some destructive madvice behaviors (MADV_DONTNEED, MADV_FREE,
> > > > +   MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK)
> > > > +   for anonymous memory, when users don't have write permission to the
> > > > +   memory. Those behaviors can alter region contents by discarding pages,
> > > > +   effectively a memset(0) for anonymous memory.
> > > >
> > > What's the difference between anonymous memory and MAP_PRIVATE | MAP_FILE?
> > >
> > MAP_FILE seems not used ?
> > anonymous mapping is the mapping that is not backed by a file.
>
> MAP_FILE is actually defined as 0 usually :) But I meant file-backed private mappings.
>
OK, we are on the same page for this.

> > > The feature now, as is (as far as I understand!) will allow you to do things like MADV_DONTNEED
> > > on a read-only file mapping. e.g .text. This is obviously wrong?
> > >
> > When a MADV_DONTNEED is called, pages will be freed, on file-backed
> > mapping, if the process reads from the mapping again, the content
> > will be retrieved from the file.
> >
>
> Sorry, it was late and I gave you a crap example. Consider this:
> a file-backed MAP_PRIVATE vma is marked RW. I write to it, then RO-it + mseal.
>
> The attacker later gets me to MADV_DONTNEED that VMA. You've just lost data.
>
> The big problem here is with anon _pages_, not anon vmas.
>
That depends on the app's threat model. What you described seems to be
the case below:

1. The file is rw.
2. The process opens the file as rw.
3. The process mmaps the fd as rw.
4. The process writes to the memory, and the change isn't flushed to
   the file on disk.
5. The process changes the mapping to RO.
6. The process seals the mapping.
7. MADV_DONTNEED is called on the mapping; because the change was
   never flushed to the file on disk, the change is lost (the old
   data is retrieved from disk when the mapped address is read later).

I'm not sure this is a valid use case. The problem here seems to be
that the app needs to flush the change from memory to disk if it
expects the write to be permanent.

In any case, mseal currently blocks only a subset of madvise behaviors,
those we know to have a security implication. If there is something
else mseal needs to block, one can always extend it by using the
"flags" field. I do think the bar is high though, e.g. there should be
a valid use case to support that.

> > For anonymous mapping, since there is no file backup, if process
> > reads from the mapping, 0 is filled, hence equivalent to memset(0)
> >
> > > > +
> > > > +   Kernel will return -EPERM for blocked syscalls.
> > > > +
> > > > +   When blocked syscall return -EPERM due to sealing, the memory regions may or may not be changed, depends on the syscall being blocked:
> > > > +      - munmap: munmap is atomic. If one of VMAs in the given range is
> > > > +        sealed, none of VMAs are updated.
> > > > +      - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
> > > > +        when mprotect over multiple VMAs, mprotect might update the beginning
> > > > +        VMAs before reaching the sealed VMA and return -EPERM.
> > > > +      - mmap and mremap: undefined behavior.
> > >
> > > mmap and mremap are actually not undefined as they use munmap semantics for their unmapping.
> > > Whether this is something we'd want to document, I don't know honestly (nor do I think is ever written down in POSIX?)
> > >
> > I'm not sure if I can declare mmap/mremap as atomic.
> >
> > Although, it might be possible to achieve this due to munmap being
> > atomic.
> > I'm not sure as I didn't test this. Would you like to find
> > out ?
>
> I just told you they use munmap under the hood. It's just that the requirement isn't actually
> written down anywhere.
>
I knew about mmap/mremap calling munmap. I don't know what exactly you
are asking though. In your patch and its discussion, you did not
mention that mmap/mremap (for sealing) is or should be atomic.

My point is: since there isn't a clear statement, either in your patch
description or in POSIX, that mremap/mmap is atomic, and I haven't
tested it myself with regard to sealing, let's leave them as
"undefined" for now. (I could get back to this later, after the merge
window.)

> > > >
> > > >  Use cases:
> > > >  ==========
> > > >  - glibc:
> > > >    The dynamic linker, during loading ELF executables, can apply sealing to
> > > > -  non-writable memory segments.
> > > > +  mapping segments.
> > > >
> > > >  - Chrome browser: protect some security sensitive data-structures.
> > > >
> > > > -Notes on which memory to seal:
> > > > -==============================
> > > > -
> > > > -It might be important to note that sealing changes the lifetime of a mapping,
> > > > -i.e. the sealed mapping won’t be unmapped till the process terminates or the
> > > > -exec system call is invoked. Applications can apply sealing to any virtual
> > > > -memory region from userspace, but it is crucial to thoroughly analyze the
> > > > -mapping's lifetime prior to apply the sealing.
> > > > +Don't use mseal on:
> > > > +===================
> > > > +Applications can apply sealing to any virtual memory region from userspace,
> > > > +but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
> > > > +apply the sealing. This is because the sealed mapping *won’t be unmapped*
> > > > +till the process terminates or the exec system call is invoked.
> > > >
> > > There should probably be a nice disclaimer as to how most people don't need this or shouldn't use this.
> > > At least in its current form.
> > >
> > Ya, the mseal is not for most apps. I mention the malloc example to stress that.
> >
> > > > -
> > > > -
> > > > -Additional notes:
> > > > -=================
> > > >  As Jann Horn pointed out in [3], there are still a few ways to write
> > > > -to RO memory, which is, in a way, by design. Those cases are not covered
> > > > -by mseal(). If applications want to block such cases, sandbox tools (such as
> > > > -seccomp, LSM, etc) might be considered.
> > > > +to RO memory, which is, in a way, by design. And those could be blocked
> > > > +by different security measures.
> > > >
> > > >  Those cases are:
> > > > -
> > > > -- Write to read-only memory through /proc/self/mem interface.
> > > > -- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > > > -- userfaultfd.
> > > > +   - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
> > > > +   - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > > > +   - userfaultfd.
> > > >
> > > I don't understand how this is not a problem, but MADV_DONTNEED is.
> > > To me it seems that what we have now is completely useless, because you can trivially
> > > bypass it using /proc/self/mem, which is enabled on most Linux systems.
> > >
> > > Before you mention ChromeOS or Chrome, I don't care. Kernel features aren't designed
> > > for Chrome. They need to work with every other distro and application as well.
> > >
> > > It seems to me that the most sensible change is blocking/somehow distinguishing between /proc/self/mem and
> > > /proc//mem (some other process) and ptrace. As in blocking /proc/self/mem but allowing the other FOLL_FORCE's
> > > as the traditional UNIX permission model allows.
> > >
> > IMO, it is a matter of Divide and Conquer.
> > In a nutshell, mseal only
> > prevents VMA's certain attributes (such as prot bits) from changing.
> > It doesn't mean to say that sealed RO memory is immutable. To achieve
> > that, the system needs to apply multiple security measures.
>
> No, it's a matter of providing a sane API without tons of edgecases. Making a VMA immutable should make a VMA
> immutable, and not require you to provide a crap ton of other mechanisms in order to truly make it immutable.
> If I call mseal, I expect it to be sealed, not "sealed except when it's not, lol".
>
> You haven't been able to quite specify what semantics are desirable out of this whole thing. Making
> prot flags "immutable" is completely worthless if you can simply write to a random pseudofile and
> have it bypass the whole thing (where a write to /proc/self/mem is semantically equivalent to
> mprotect RW + write + mprotect RO). Making the vma immutable is completely worthless
> if I can simply wipe anon pages. There has to be some end goal here (make contents immutable?
> make sure VMA protection can't be changed? both?) which seems to be unclear from the kernel mmap-side.
>
> If you insist on providing half-baked APIs (and waving off any concerns), I'm sure this would've been better
> implemented as a random bpf program for chrome. Maybe we could revert this whole thing and give eBPF one
> or two bits of vma flags for their own uses :)
>
> >
> > For writing to /proc/pid/mem, it can be disabled via [1]. SELINUX and
> > Landlock can achieve the same protection too.
>
> I'm not blocking /proc/pid/mem, and my distro doesn't run any of those security modules :/
>
It is a choice you can make :-)

Linux is diverse, from desktop to mobile to cloud hosting to embedded
systems. For a safe-by-default system, some of them might like those
security enhancements.

Thanks
-Jeff

> --
> Pedro
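
For reference, the madvise behavior discussed in this thread can be
reproduced from userspace. The following is only a minimal sketch, not
part of the patch. It assumes Linux and Python 3.8+ (for mmap.madvise),
and the function names are made up for illustration. It leaves out the
mprotect(RO) and mseal steps, because Python's mmap module does not
expose them, so it shows only what MADV_DONTNEED does to a modified
private page: a file-backed MAP_PRIVATE page falls back to the file
contents, and an anonymous page reads back as zero.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE


def discard_private_file_page():
    """Modify a MAP_PRIVATE file mapping, then MADV_DONTNEED it.

    Returns the first byte before and after the madvise call.
    """
    with tempfile.TemporaryFile() as f:
        f.write(b"A" * PAGE)            # contents of the backing file
        f.flush()
        m = mmap.mmap(f.fileno(), PAGE, flags=mmap.MAP_PRIVATE,
                      prot=mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            m[0:1] = b"B"               # write lands in a private copy of the page
            before = m[0:1]
            m.madvise(mmap.MADV_DONTNEED)   # private copy is discarded
            after = m[0:1]              # page is faulted in again from the file
        finally:
            m.close()
    return before, after


def discard_private_anon_page():
    """Same sequence on a private anonymous mapping (no backing file)."""
    m = mmap.mmap(-1, PAGE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                  prot=mmap.PROT_READ | mmap.PROT_WRITE)
    try:
        m[0:1] = b"B"
        before = m[0:1]
        m.madvise(mmap.MADV_DONTNEED)   # page is discarded
        after = m[0:1]                  # a fresh zero-filled page is faulted in
    finally:
        m.close()
    return before, after


if __name__ == "__main__":
    print("private file mapping:", discard_private_file_page())
    print("private anon mapping:", discard_private_anon_page())
```

On Linux I would expect the first call to return (b"B", b"A") and the
second (b"B", b"\x00"). In both cases the written data is gone after
MADV_DONTNEED, and no write permission is needed at the time of the
madvise call.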