Date: Tue, 12 Mar 2024 15:48:28 +0100
From: Petr Tesařík
To: Brendan Jackman
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, x86@kernel.org,
 Junaid Shahid, Reiji Watanabe, Patrick Bellasi, Yosry Ahmed,
 Frank van der Linden, pbonzini@redhat.com, Jim Mattson, Paul Turner,
 Ofir Weisse, alexandre.chartre@oracle.com, rppt@linux.ibm.com,
 dave.hansen@linux.intel.com, Peter Zijlstra, Thomas Gleixner,
 luto@kernel.org, Andrew.Cooper3@citrix.com
Subject: Re: [LSF/MM/BPF TOPIC] Address Space Isolation
Message-ID: <20240312154828.0efc76a4@meshulam.tesarici.cz>

On Thu, 29 Feb 2024 10:57:21 +0100
Brendan Jackman wrote:

> Address Space Isolation (ASI) is a technique to mitigate broad classes
> of CPU vulnerabilities.
>
> ASI logically separates memory into “sensitive” and “nonsensitive”,
> the former is memory that may contain secrets and the latter is memory
> that, though it might be owned by a privileged component, we don’t
> actually care about leaking. The core aim is to execute comprehensive
> mitigations for protecting sensitive memory, while avoiding the cost
> of protected nonsensitive memory. This is implemented by creating
> another “restricted” address space for the kernel, in which sensitive
> data is not mapped.
>
> The implementation contains two broad areas of functionality:
>
> ::Sensitivity tracking:: provides mechanisms to determine which data
> is sensitive and keep the restricted address space page tables
> up-to-date. At present this is done by adding new allocator flags
> which allocation sites use to annotate data whenever its sensitivity
> differs from the default.
>
> The definition of “sensitive” memory isn’t a topic we’ve fully
> explored yet - it’s possible that this will vary from any given
> deployment to the next. The framework is implemented so that any given
> allocation can have an arbitrary sensitivity setting.
>
> What is “sensitive” is in reality of course contextual. User data is
> sensitive in the general sense, but we don’t really care if a user is
> able to leak _its own_ data via CPU bugs. In one implementation we
> divide “nonsensitive” data into “global” and “local” nonsensitive.
> Local-nonsensitive data is mapped into the restricted address space of
> the entity (process/KVM guest) that it belongs to. This adds quite a
> lot of complexity, so at present we’re working without
> local-nonsensitivity support - if we can achieve all the security
> coverage we want with acceptable performance then this will be big
> maintainability win.
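
The sensitivity classes quoted above can be modelled roughly as
follows. This is only a toy sketch for illustration: the names
Sensitivity, Allocation and mapped_in_restricted are hypothetical and
are not taken from the ASI patches, and it ignores page tables entirely.
It only captures which allocations would be visible in a given entity's
restricted address space.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Sensitivity(Enum):
    """Hypothetical labels for the three classes described in the text."""
    SENSITIVE = auto()            # never mapped in a restricted address space
    GLOBAL_NONSENSITIVE = auto()  # mapped in every restricted address space
    LOCAL_NONSENSITIVE = auto()   # mapped only for the owning entity


@dataclass(frozen=True)
class Allocation:
    """A toy allocation annotated at the allocation site."""
    name: str
    sensitivity: Sensitivity
    owner: Optional[str] = None   # process/KVM guest, for local data only


def mapped_in_restricted(alloc: Allocation, entity: str) -> bool:
    """Would `alloc` be visible in the restricted address space of `entity`?"""
    if alloc.sensitivity is Sensitivity.GLOBAL_NONSENSITIVE:
        return True
    if alloc.sensitivity is Sensitivity.LOCAL_NONSENSITIVE:
        return alloc.owner == entity
    return False  # sensitive data is not mapped at all


# Example: one allocation of each class, seen from two different guests.
secret = Allocation("crypto key", Sensitivity.SENSITIVE)
shared = Allocation("scheduler stats", Sensitivity.GLOBAL_NONSENSITIVE)
own = Allocation("guest A registers", Sensitivity.LOCAL_NONSENSITIVE, "guest-A")

visible_to_a = [a.name for a in (secret, shared, own)
                if mapped_in_restricted(a, "guest-A")]
visible_to_b = [a.name for a in (secret, shared, own)
                if mapped_in_restricted(a, "guest-B")]
```

Dropping local-nonsensitivity support, as described in the quoted text,
corresponds to removing the LOCAL_NONSENSITIVE branch, so that the
result no longer depends on the entity.
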
>
> The biggest challenge we’ve faced so far in sensitivity tracking is
> that transitioning memory from nonsensitive to sensitive requires
> flushing the TLB. Aside from the performance impact, this cannot be
> done with IRQs disabled. The simple workaround for this is to keep all
> free pages unmapped from the restricted address space (so that they
> can be allocated with any sensitivity without a TLB flush), and
> process freeing of nonsensitive pages (requiring a TLB flush under
> this simple scheme) via an asynchronous worker. This creates lots of
> unnecessary TLB flushes, but perhaps worse it can create artificial
> OOM conditions as pages are stranded on the asynchronous worker’s
> queue.
>
> ::Sandboxing:: is the logic that switches between address spaces and
> executes actual mitigations. Before running untrusted code, i.e.
> userspace processes and KVM guests, the kernel enters the restricted
> address space. If a later kernel entry accesses sensitive data - as
> detected by a page fault - it returns to the normal kernel address
> space. Each of these address space transitions involves a buffer
> flush: on exiting the restricted address space (that is, right before
> accessing sensitive data for the first time since running untrusted
> code) we flush branch prediction buffers that can be exploited through
> Spectre-like attacks. On entering the restricted address space (that
> is, right before running untrusted code for the first time since
> accessing sensitive data) we flush data buffers that can be exploited
> as side channels with Meltdown-like attacks. The “happy path” for ASI
> is getting back to the untrusted code without accessing any secret,
> and thus incurring no buffer flushes. If the sensitive/nonsensitive
> distinction is well-chosen, it should be possible to afford extremely
> defensive buffer-flushes on address space transitions, since those
> transitions are rare.
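
The transition rules quoted above can be modelled roughly as follows.
Again this is only a toy sketch under stated assumptions: the class
AsiSim and its methods are made up for illustration, the page fault is
reduced to a boolean argument, and the flushes are just counters. It is
not how the actual implementation is structured.

```python
class AsiSim:
    """Toy model of the sandboxing transitions (hypothetical names)."""

    def __init__(self) -> None:
        self.restricted = False   # start in the normal kernel address space
        self.data_flushes = 0     # flushes against Meltdown-like side channels
        self.branch_flushes = 0   # flushes against Spectre-like attacks

    def enter_restricted(self) -> None:
        """Called right before running untrusted code (user space, guest)."""
        if not self.restricted:
            self.data_flushes += 1    # flush data buffers on entry
            self.restricted = True

    def kernel_access(self, sensitive: bool) -> None:
        """Kernel touches memory; a sensitive access stands in for the fault."""
        if sensitive and self.restricted:
            self.branch_flushes += 1  # flush branch prediction buffers on exit
            self.restricted = False


sim = AsiSim()

# Happy path: a kernel entry that touches only nonsensitive data.
sim.enter_restricted()
sim.kernel_access(sensitive=False)
sim.enter_restricted()            # still restricted, so no additional flush
happy = (sim.data_flushes, sim.branch_flushes)

# Slow path: a kernel entry that touches a secret.
sim.kernel_access(sensitive=True)
sim.enter_restricted()
slow = (sim.data_flushes, sim.branch_flushes)
```

In this model a round trip that stays on the happy path costs nothing
beyond the initial entry, while each excursion to sensitive data costs
one flush of each kind. That is the reason the quoted text argues that
expensive flushes are affordable as long as such excursions are rare.
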
>
> Some interesting details of sandboxing logic relate to interrupt
> handling: when an interrupt triggers a transition out of the
> restricted address space, we may need to return to it before exiting
> the interrupt. A simple implementation could just unconditionally
> return to the original address space after servicing any interrupt,
> but that can also lead to unnecessary transitions. Thus ASI becomes
> something like a small state machine.
>
> ASI has been proposed and discussed several times over the years, most
> recently by Junaid Shahid & co in [1] and [2]. Since then, the
> sophistication of CPU bug exploitation has advanced Google’s interest
> in ASI has continued to grow. We’re now working on deploying an
> internal implementation, to prove that this concept has real-world
> value. Our current implementation has undergone lots of testing and is
> now close to production-ready.
>
> We’d like to share our progress since the last RFC and discuss the
> challenges we’ve faced so far in getting this feature
> production-ready. Hopefully this will prompt interesting discussion to
> guide the next upstream posting. Some areas that would be fruitful to
> discuss:
>
> - Feedback on the overall design.
>
> - How we’ve generalised ASI as a framework that goes beyond the KVM use case
>
> - How we’ve implemented sensitivity tracking as a “deny list” to ease
> initial deployment, and how to develop this into a longer-term
> solution. The policy defining what memory objects are
> sensitive/non-sensitive ought to be decoupled from the ASI framework
> and ideally even from the code that allocates memory
>
> - If/how KPTI should be implemented in the ASI framework. We plan to add
> a Userspace-ASI class that would map all non-sensitive kernel memory
> in the restricted address space, but perhaps there may be value in
> also having an ASI class that mirrors exactly how the current KPTI
> works.
>
> - How we’ve solved the TLB flushing issues in sensitivity tracking, and
> how it could be done better.

Hello and welcome!

I ran into a similar challenge with SandBox Mode. My solution was to
run sandbox code with CPL=3 (on x86) and control page access with the
U/S PTE bit rather than the P bit, which allowed me to implement lazy
TLB invalidation. The x86 folks didn't like the idea...

For the record, SandBox Mode was designed with confidentiality in mind,
although the initial patch series left out this part for simplicity. I
wonder if your objective is to protect kernel data from user space, or
if you have also considered decomposing the kernel into components that
are isolated from each other (in which case we could potentially find
some synergies).

Petr T