From: Brendan Jackman
Date: Thu, 29 Feb 2024 10:57:21 +0100
Subject: [LSF/MM/BPF TOPIC] Address Space Isolation
To: lsf-pc@lists.linux-foundation.org
Cc: linux-mm@kvack.org, x86@kernel.org, Junaid Shahid, Reiji Watanabe,
    Patrick Bellasi, Yosry Ahmed, Frank van der Linden,
    pbonzini@redhat.com, Jim Mattson, Paul Turner, Ofir Weisse,
    alexandre.chartre@oracle.com, rppt@linux.ibm.com,
    dave.hansen@linux.intel.com, Peter Zijlstra, Thomas Gleixner,
    luto@kernel.org, Andrew.Cooper3@citrix.com

Address Space Isolation (ASI) is a technique to mitigate broad classes
of CPU vulnerabilities.

ASI logically separates memory into “sensitive” and “nonsensitive”: the
former is memory that may contain secrets, and the latter is memory
that, though it might be owned by a privileged component, we don’t
actually care about leaking. The core aim is to execute comprehensive
mitigations for protecting sensitive memory, while avoiding the cost of
protecting nonsensitive memory. This is implemented by creating another
“restricted” address space for the kernel, in which sensitive data is
not mapped.
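
To make the two address spaces concrete, here is a minimal toy model in
C. It is only a sketch: every name in it (toy_page, toy_is_mapped and so
on) is hypothetical and does not come from the ASI patches, and real
page tables are of course far more involved than a per-page flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model: none of these names come from the ASI patches. */
enum toy_sensitivity { TOY_NONSENSITIVE, TOY_SENSITIVE };

enum toy_address_space { TOY_UNRESTRICTED, TOY_RESTRICTED };

struct toy_page {
    enum toy_sensitivity sensitivity;
};

/*
 * The unrestricted (normal) kernel address space maps every page.
 * The restricted address space maps only nonsensitive pages, so
 * sensitive data is simply not reachable while it is active.
 */
static bool toy_is_mapped(const struct toy_page *page,
                          enum toy_address_space as)
{
    assert(page != NULL);
    if (as == TOY_UNRESTRICTED)
        return true;
    return page->sensitivity == TOY_NONSENSITIVE;
}
```

In this model a sensitive page is visible only through TOY_UNRESTRICTED,
while a nonsensitive page is visible through both address spaces.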

The implementation contains two broad areas of functionality:

::Sensitivity tracking:: provides mechanisms to determine which data is
sensitive and keep the restricted address space page tables up-to-date.
At present this is done by adding new allocator flags which allocation
sites use to annotate data whenever its sensitivity differs from the
default.

The definition of “sensitive” memory isn’t a topic we’ve fully explored
yet - it’s possible that this will vary from one deployment to the
next. The framework is implemented so that any given allocation can
have an arbitrary sensitivity setting.

What is “sensitive” is in reality of course contextual. User data is
sensitive in the general sense, but we don’t really care if a user is
able to leak _its own_ data via CPU bugs. In one implementation we
divide “nonsensitive” data into “global” and “local” nonsensitive.
Local-nonsensitive data is mapped into the restricted address space of
the entity (process/KVM guest) that it belongs to. This adds quite a
lot of complexity, so at present we’re working without
local-nonsensitivity support - if we can achieve all the security
coverage we want with acceptable performance then this will be a big
maintainability win.

The biggest challenge we’ve faced so far in sensitivity tracking is
that transitioning memory from nonsensitive to sensitive requires
flushing the TLB. Aside from the performance impact, this cannot be
done with IRQs disabled. The simple workaround for this is to keep all
free pages unmapped from the restricted address space (so that they can
be allocated with any sensitivity without a TLB flush), and process
freeing of nonsensitive pages (requiring a TLB flush under this simple
scheme) via an asynchronous worker.
This creates lots of unnecessary TLB flushes, but perhaps worse, it can
create artificial OOM conditions as pages are stranded on the
asynchronous worker’s queue.

::Sandboxing:: is the logic that switches between address spaces and
executes actual mitigations. Before running untrusted code, i.e.
userspace processes and KVM guests, the kernel enters the restricted
address space. If a later kernel entry accesses sensitive data - as
detected by a page fault - it returns to the normal kernel address
space.

Each of these address space transitions involves a buffer flush: on
exiting the restricted address space (that is, right before accessing
sensitive data for the first time since running untrusted code) we
flush branch prediction buffers that can be exploited through
Spectre-like attacks. On entering the restricted address space (that
is, right before running untrusted code for the first time since
accessing sensitive data) we flush data buffers that can be exploited
as side channels with Meltdown-like attacks. The “happy path” for ASI
is getting back to the untrusted code without accessing any secret, and
thus incurring no buffer flushes. If the sensitive/nonsensitive
distinction is well-chosen, it should be possible to afford extremely
defensive buffer-flushes on address space transitions, since those
transitions are rare.

Some interesting details of sandboxing logic relate to interrupt
handling: when an interrupt triggers a transition out of the restricted
address space, we may need to return to it before exiting the
interrupt. A simple implementation could just unconditionally return to
the original address space after servicing any interrupt, but that can
also lead to unnecessary transitions. Thus ASI becomes something like a
small state machine.

ASI has been proposed and discussed several times over the years, most
recently by Junaid Shahid & co in [1] and [2].
Since then, the sophistication of CPU bug exploitation has advanced and
Google’s interest in ASI has continued to grow. We’re now working on
deploying an internal implementation, to prove that this concept has
real-world value. Our current implementation has undergone lots of
testing and is now close to production-ready.

We’d like to share our progress since the last RFC and discuss the
challenges we’ve faced so far in getting this feature production-ready.
Hopefully this will prompt interesting discussion to guide the next
upstream posting. Some areas that would be fruitful to discuss:

- Feedback on the overall design.

- How we’ve generalised ASI as a framework that goes beyond the KVM use
  case.

- How we’ve implemented sensitivity tracking as a “deny list” to ease
  initial deployment, and how to develop this into a longer-term
  solution. The policy defining what memory objects are
  sensitive/nonsensitive ought to be decoupled from the ASI framework
  and ideally even from the code that allocates memory.

- If/how KPTI should be implemented in the ASI framework. We plan to
  add a Userspace-ASI class that would map all nonsensitive kernel
  memory in the restricted address space, but there may be value in
  also having an ASI class that mirrors exactly how the current KPTI
  works.

- How we’ve solved the TLB flushing issues in sensitivity tracking, and
  how it could be done better.

[1] https://lore.kernel.org/all/20220223052223.1202152-1-junaids@google.com/
[2] https://www.phoronix.com/news/Google-LPC-ASI-2022
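
As a rough illustration of the sandboxing transitions described under
::Sandboxing::, the following C sketch models them as a two-state
machine and counts buffer flushes instead of performing them. All names
here (toy_asi, toy_asi_enter, toy_asi_access) are hypothetical and
invented for this example, and interrupt handling is deliberately left
out. It is not how the actual implementation is structured.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; names are illustrative, not from the ASI patches. */
enum toy_asi_state { TOY_ASI_UNRESTRICTED, TOY_ASI_RESTRICTED };

struct toy_asi {
    enum toy_asi_state state;
    unsigned int data_flushes;   /* stand-in for data buffer flushes */
    unsigned int branch_flushes; /* stand-in for branch predictor flushes */
};

/* Start unrestricted, i.e. assume sensitive data may have been touched. */
static void toy_asi_init(struct toy_asi *asi)
{
    assert(asi != NULL);
    asi->state = TOY_ASI_UNRESTRICTED;
    asi->data_flushes = 0;
    asi->branch_flushes = 0;
}

/* Called right before running untrusted code (userspace or a guest). */
static void toy_asi_enter(struct toy_asi *asi)
{
    assert(asi != NULL);
    if (asi->state == TOY_ASI_RESTRICTED)
        return; /* happy path: still restricted, nothing to flush */
    asi->data_flushes++; /* entering: flush data buffers */
    asi->state = TOY_ASI_RESTRICTED;
}

/* Models a kernel memory access; a sensitive access while restricted faults. */
static void toy_asi_access(struct toy_asi *asi, bool sensitive)
{
    assert(asi != NULL);
    if (!sensitive || asi->state == TOY_ASI_UNRESTRICTED)
        return; /* mapped in the current address space, no transition */
    asi->branch_flushes++; /* exiting: flush branch prediction buffers */
    asi->state = TOY_ASI_UNRESTRICTED;
}
```

In this model, a kernel entry that touches only nonsensitive data and
then calls toy_asi_enter() again leaves both counters unchanged, which
corresponds to the “happy path”. A sensitive access costs one
branch-predictor flush, and the next toy_asi_enter() costs one data
buffer flush.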