Date: Fri, 4 Oct 2024 18:02:24 +0100
From: Pedro Falcato
To: Jeff Xu
Cc: akpm@linux-foundation.org, keescook@chromium.org, corbet@lwn.net, jorgelo@chromium.org, groeck@chromium.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org, jannh@google.com, sroettger@google.com, linux-hardening@vger.kernel.org, willy@infradead.org, gregkh@linuxfoundation.org, torvalds@linux-foundation.org, deraadt@openbsd.org, usama.anjum@collabora.com, surenb@google.com, merimus@google.com, rdunlap@infradead.org, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, enh@google.com
Subject: Re: [PATCH v1 1/1] mseal: update mseal.rst
References: <20240927185211.729207-1-jeffxu@chromium.org> <20240927185211.729207-2-jeffxu@chromium.org> <2vkppisejac42wnawjkd7qzyybuycu667yxwmsd4pfk5rwhiqc@gszyo5lu24ge> <2q6hzkvep2g3z6m2jrwbw2j3sbydf6tgj2obwd6hgmm7xzgsg3@ddr5ghmsia5k>

On Mon, Sep 30, 2024 at 05:24:39PM -0700, Jeff Xu wrote:
> Hi Pedro
>
> On Sat, Sep 28, 2024 at 6:43 AM Pedro Falcato wrote:
> >
> > On Fri, Sep 27, 2024 at 06:29:30PM GMT, Jeff Xu wrote:
> > > Hi Pedro,
> > >
> > > On Fri, Sep 27, 2024 at 3:59 PM Pedro Falcato wrote:
> >
> > > > > +
> > > > > + Blocked mm syscall:
> > > > > + - munmap
> > > > > + - mmap
> > > > > + - mremap
> > > > > + - mprotect and pkey_mprotect
> > > > > + - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
> > > > > + MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK
> > > > > +
> > > > > + The first set of syscall to block is munmap, mremap, mmap. They can
> > > > > + either leave an empty space in the address space, therefore allow
> > > > > + replacement with a new mapping with new set of attributes, or can
> > > > > + overwrite the existing mapping with another mapping.
> > > > > +
> > > > > + mprotect and pkey_mprotect are blocked because they changes the
> > > > change
> > > > > + protection bits (rwx) of the mapping.
> > > > > +
> > > > > + Some destructive madvice behaviors (MADV_DONTNEED, MADV_FREE,
> > > > > + MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK)
> > > > > + for anonymous memory, when users don't have write permission to the
> > > > > + memory. Those behaviors can alter region contents by discarding pages,
> > > > > + effectively a memset(0) for anonymous memory.
> > > >
> > > > What's the difference between anonymous memory and MAP_PRIVATE | MAP_FILE?
> > > >
> > > MAP_FILE seems not used ?
> > > anonymous mapping is the mapping that is not backed by a file.
> >
> > MAP_FILE is actually defined as 0 usually :) But I meant file-backed private mappings.
> >
> OK, we are on the same page for this.
>
> > > > The feature now, as is (as far as I understand!) will allow you to do things like MADV_DONTNEED
> > > > on a read-only file mapping. e.g .text. This is obviously wrong?
> > > >
> > > When a MADV_DONTNEED is called, pages will be freed, on file-backed
> > > mapping, if the process reads from the mapping again, the content
> > > will be retrieved from the file.
> > >
> >
> > Sorry, it was late and I gave you a crap example. Consider this:
> > a file-backed MAP_PRIVATE vma is marked RW. I write to it, then RO-it + mseal.
> >
> > The attacker later gets me to MADV_DONTNEED that VMA. You've just lost data.
> >
> > The big problem here is with anon _pages_, not anon vmas.
> >
> That depends on the app's threat-model. What you described seems to be
> a case below
> 1. The file is rw
> 2. The process opens the file as rw
> 3. the process mmap the fd as rw
> 4 The process writes the memory, and the change isn't flushed to the
> file on disk.
> 5 The process changes the mapping to RO
> 6. The process seals the mapping
> 7. The process is called MADV_DONTNEED , and because the change isn't
> flush to file on disk, so it loses the change, (retrieve the old data
> from disk when read from the mapped address later)
>
> I'm not sure this is a valid use case, the problem here seems to be
> that the app needs to flush the change from memory to disk if the
> expectation is writing is permanent.
>

MAP_PRIVATE never does writeback. That's not what this is about.
I can trivially discard anonymous pages for private "file VMAs", which aren't refilled with the exact same contents. This is a problem.

> In any case, the mseal currently just blocks a subset of madvise, those
> we know with a security implication. If there is something mseal needs
> to block additionally, one can always extend it by using the "flags" field.
> I do think the bar is high though, e.g. a valid use case to support that.

No, this has nothing to do with a flag. It's about providing sane semantics.
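
To make the MAP_PRIVATE point above concrete, here is a minimal sketch of the sequence (write to a private file mapping, make it RO, optionally seal, then MADV_DONTNEED). The helper name private_write_lost(), the temp file path and the "file"/"anon" markers are made up for illustration. It assumes SYS_mseal is exposed by the installed headers (the seal step is compiled out otherwise), and what madvise() returns on a sealed mapping depends on the kernel version, so treat that branch as something to observe rather than a guarantee.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a write made through a MAP_PRIVATE
 * file mapping is gone after MADV_DONTNEED (the file's contents came
 * back), 0 if the write survived, -1 on setup failure.
 * If try_seal is nonzero, mseal() is attempted before the madvise().
 */
int private_write_lost(int try_seal)
{
    char path[] = "/tmp/mseal-demo-XXXXXX";
    long psz = sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);

    (void)try_seal; /* unused when SYS_mseal is not available */
    if (fd < 0)
        return -1;
    unlink(path);
    /* Back the mapping with one page whose first bytes are "file". */
    if (ftruncate(fd, psz) != 0 || pwrite(fd, "file", 4, 0) != 4) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;

    /* The write COWs the page: it is now an anon page in a "file VMA". */
    memcpy(p, "anon", 4);
    assert(memcmp(p, "anon", 4) == 0);

    /* RO-it, then (optionally) seal it. */
    if (mprotect(p, psz, PROT_READ) != 0) {
        munmap(p, psz);
        return -1;
    }
#ifdef SYS_mseal
    if (try_seal && syscall(SYS_mseal, p, psz, 0) != 0)
        perror("mseal");
#endif

    /* The discard: no write permission is needed for this. */
    if (madvise(p, psz, MADV_DONTNEED) != 0)
        perror("madvise");

    /* If the anon page was dropped, the next read refaults from the file. */
    int lost = memcmp(p, "file", 4) == 0;

    munmap(p, psz); /* fails on a sealed mapping; ignored in this sketch */
    return lost;
}
```

Calling private_write_lost(0) shows the plain discard. Calling private_write_lost(1) on a kernel with mseal shows whether the seal stops it: a return of 1 there means the sealed, read-only mapping still lost its contents.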
>
> > > For anonymous mapping, since there is no file backup, if process
> > > reads from the mapping, 0 is filled, hence equivalent to memset(0)
> > >
> > > > > +
> > > > > + Kernel will return -EPERM for blocked syscalls.
> > > > > +
> > > > > + When blocked syscall return -EPERM due to sealing, the memory regions may or may not be changed, depends on the syscall being blocked:
> > > > > + - munmap: munmap is atomic. If one of VMAs in the given range is
> > > > > + sealed, none of VMAs are updated.
> > > > > + - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
> > > > > + when mprotect over multiple VMAs, mprotect might update the beginning
> > > > > + VMAs before reaching the sealed VMA and return -EPERM.
> > > > > + - mmap and mremap: undefined behavior.
> > > >
> > > > mmap and mremap are actually not undefined as they use munmap semantics for their unmapping.
> > > > Whether this is something we'd want to document, I don't know honestly (nor do I think is ever written down in POSIX?)
> > > >
> > > I'm not sure if I can declare mmap/mremap as atomic.
> > >
> > > Although, it might be possible to achieve this due to munmap being
> > > atomic. I'm not sure as I didn't test this. Would you like to find
> > > out ?
> >
> > I just told you they use munmap under the hood. It's just that the requirement isn't actually
> > written down anywhere.
> >
> I knew about mmap/mremap calling munmap. I don't know what exactly you
> are asking though. In your patch and its discussion, you did not mention
> the mmap/mremap (for sealing) is or should be atomic.
>
> My point is: since there isn't a clear statement from your patch description
> or POSIX, that mremap/mmap is atomic, and I haven't tested it myself with
> regards to sealing, let's leave them as "undefined" for now.
> (I could get back to this later after the merging window)
>
> > >
> > > > >
> > > > > Use cases:
> > > > > ==========
> > > > > - glibc:
> > > > > The dynamic linker, during loading ELF executables, can apply sealing to
> > > > > - non-writable memory segments.
> > > > > + mapping segments.
> > > > >
> > > > > - Chrome browser: protect some security sensitive data-structures.
> > > > >
> > > > > -Notes on which memory to seal:
> > > > > -==============================
> > > > > -
> > > > > -It might be important to note that sealing changes the lifetime of a mapping,
> > > > > -i.e. the sealed mapping won’t be unmapped till the process terminates or the
> > > > > -exec system call is invoked. Applications can apply sealing to any virtual
> > > > > -memory region from userspace, but it is crucial to thoroughly analyze the
> > > > > -mapping's lifetime prior to apply the sealing.
> > > > > +Don't use mseal on:
> > > > > +===================
> > > > > +Applications can apply sealing to any virtual memory region from userspace,
> > > > > +but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
> > > > > +apply the sealing. This is because the sealed mapping *won’t be unmapped*
> > > > > +till the process terminates or the exec system call is invoked.
> > > >
> > > > There should probably be a nice disclaimer as to how most people don't need this or shouldn't use this.
> > > > At least in its current form.
> > > >
> > > Ya, the mseal is not for most apps. I mention the malloc example to stress that.
> > >
> > > >
> > > > > -
> > > > > -
> > > > > -Additional notes:
> > > > > -=================
> > > > > As Jann Horn pointed out in [3], there are still a few ways to write
> > > > > -to RO memory, which is, in a way, by design. Those cases are not covered
> > > > > -by mseal(). If applications want to block such cases, sandbox tools (such as
> > > > > -seccomp, LSM, etc) might be considered.
> > > > > +to RO memory, which is, in a way, by design.
> > > > > +And those could be blocked by different security measures.
> > > > >
> > > > > Those cases are:
> > > > > -
> > > > > -- Write to read-only memory through /proc/self/mem interface.
> > > > > -- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > > > > -- userfaultfd.
> > > > > + - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
> > > > > + - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
> > > > > + - userfaultfd.
> > > >
> > > > I don't understand how this is not a problem, but MADV_DONTNEED is.
> > > > To me it seems that what we have now is completely useless, because you can trivially
> > > > bypass it using /proc/self/mem, which is enabled on most Linux systems.
> > > >
> > > > Before you mention ChromeOS or Chrome, I don't care. Kernel features aren't designed
> > > > for Chrome. They need to work with every other distro and application as well.
> > > >
> > > > It seems to me that the most sensible change is blocking/somehow distinguishing between /proc/self/mem and
> > > > /proc//mem (some other process) and ptrace. As in blocking /proc/self/mem but allowing the other FOLL_FORCE's
> > > > as the traditional UNIX permission model allows.
> > > >
> > > IMO, it is a matter of Divide and Conquer. In a nutshell, mseal only
> > > prevents VMA's certain attributes (such as prot bits) from changing.
> > > It doesn't mean to say that sealed RO memory is immutable. To achieve
> > > that, the system needs to apply multiple security measures.
> >
> > No, it's a matter of providing a sane API without tons of edgecases. Making a VMA immutable should make a VMA
> > immutable, and not require you to provide a crap ton of other mechanisms in order to truly make it immutable.
> > If I call mseal, I expect it to be sealed, not "sealed except when it's not, lol".
> >
> > You haven't been able to quite specify what semantics are desirable out of this whole thing.
> > Making prot flags "immutable" is completely worthless if you can simply write to a random pseudofile and
> > have it bypass the whole thing (where a write to /proc/self/mem is semantically equivalent to
> > mprotect RW + write + mprotect RO). Making the vma immutable is completely worthless
> > if I can simply wipe anon pages. There has to be some end goal here (make contents immutable?
> > make sure VMA protection can't be changed? both?) which seems to be unclear from the kernel mmap-side.
> >
> > If you insist on providing half-baked APIs (and waving off any concerns), I'm sure this would've been better
> > implemented as a random bpf program for chrome. Maybe we could revert this whole thing and give eBPF one
> > or two bits of vma flags for their own uses :)
> >

Please reply to the above. We're struggling to understand exactly what semantics you want from this.
*That* is what we want to document and get set in stone, and we'll move from there.

> > >
> > > For writing to /proc/pid/mem, it can be disabled via [1]. SELINUX and
> > > Landlock can achieve the same protection too.
> >
> > I'm not blocking /proc/pid/mem, and my distro doesn't run any of those security modules :/
> >
> It is a choice you can make :-)

Your feature needs to work without "extra choices".

--
Pedro