* [LSF/MM/BPF TOPIC] Address Space Isolation
@ 2024-02-29  9:57 Brendan Jackman
  2024-03-12 14:48 ` Petr Tesařík
  0 siblings, 1 reply; 3+ messages in thread
From: Brendan Jackman @ 2024-02-29  9:57 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-mm, x86, Junaid Shahid, Reiji Watanabe, Patrick Bellasi,
	Yosry Ahmed, Frank van der Linden, pbonzini, Jim Mattson,
	Paul Turner, Ofir Weisse, alexandre.chartre, rppt, dave.hansen,
	Peter Zijlstra, Thomas Gleixner, luto, Andrew.Cooper3

Address Space Isolation (ASI) is a technique to mitigate broad classes
of CPU vulnerabilities.

ASI logically separates memory into “sensitive” and “nonsensitive”:
the former is memory that may contain secrets, the latter memory that,
though it might be owned by a privileged component, we don’t actually
care about leaking. The core aim is to apply comprehensive mitigations
to protect sensitive memory, while avoiding the cost of protecting
nonsensitive memory. This is implemented by creating another
“restricted” address space for the kernel, in which sensitive data is
not mapped.

The implementation contains two broad areas of functionality:

::Sensitivity tracking:: provides mechanisms to determine which data
is sensitive and keep the restricted address space page tables
up-to-date. At present this is done by adding new allocator flags
which allocation sites use to annotate data whenever its sensitivity
differs from the default.
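
For illustration, an annotated allocation site might look roughly like
the sketch below. The __GFP_GLOBAL_NONSENSITIVE flag name is taken
from the earlier RFC [1]; the current flag names and the helper
functions here are ours for illustration only.

#include <linux/gfp.h>
#include <linux/slab.h>

/*
 * Scratch data that never holds secrets: annotating the allocation
 * keeps it mapped in the restricted page tables, so touching it never
 * forces an exit from the restricted address space.
 */
static void *alloc_scratch_buffer(size_t size)
{
        return kzalloc(size, GFP_KERNEL | __GFP_GLOBAL_NONSENSITIVE);
}

/*
 * No annotation: the default is "sensitive", so this memory stays
 * unmapped while the kernel runs in the restricted address space.
 */
static void *alloc_key_material(size_t size)
{
        return kzalloc(size, GFP_KERNEL);
}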

The definition of “sensitive” memory isn’t a topic we’ve fully
explored yet - it’s possible that this will vary from one deployment
to the next. The framework is implemented so that any given allocation
can have an arbitrary sensitivity setting.

What is “sensitive” is of course contextual. User data is sensitive
in the general sense, but we don’t really care if a user is able to
leak _its own_ data via CPU bugs. In one implementation we divide
“nonsensitive” data into “global” and “local” nonsensitive.
Local-nonsensitive data is mapped into the restricted address space of
the entity (process/KVM guest) that it belongs to. This adds quite a
lot of complexity, so at present we’re working without
local-nonsensitivity support - if we can achieve all the security
coverage we want with acceptable performance then this will be a big
maintainability win.

The biggest challenge we’ve faced so far in sensitivity tracking is
that transitioning memory from nonsensitive to sensitive requires
flushing the TLB. Aside from the performance impact, this cannot be
done with IRQs disabled. The simple workaround for this is to keep all
free pages unmapped from the restricted address space (so that they
can be allocated with any sensitivity without a TLB flush), and
process freeing of nonsensitive pages (requiring a TLB flush under
this simple scheme) via an asynchronous worker. This creates lots of
unnecessary TLB flushes, but perhaps worse it can create artificial
OOM conditions as pages are stranded on the asynchronous worker’s
queue.
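
Very roughly, the free path under this workaround looks like the
sketch below. All of the asi_* helpers here (asi_page_is_nonsensitive,
asi_deferred_node, asi_unmap_batch, asi_flush_tlb_all, asi_free_batch)
are invented names; only the shape of the logic reflects what was
described above.

#include <linux/llist.h>
#include <linux/mm.h>
#include <linux/workqueue.h>

static LLIST_HEAD(asi_deferred_frees);
static void asi_deferred_free_fn(struct work_struct *work);
static DECLARE_WORK(asi_deferred_free_work, asi_deferred_free_fn);

void asi_free_page(struct page *page)
{
        if (!asi_page_is_nonsensitive(page)) {
                /*
                 * Sensitive pages were never mapped into the restricted
                 * address space: nothing to unmap, no flush needed.
                 */
                __free_page(page);
                return;
        }

        /*
         * Nonsensitive page: it must be unmapped from the restricted
         * page tables and the TLB flushed before it can be reallocated,
         * possibly as sensitive memory. The flush can't happen here (we
         * may have IRQs disabled), so park the page and let a worker do
         * it. This is where pages can pile up and cause artificial OOM.
         */
        llist_add(asi_deferred_node(page), &asi_deferred_frees);
        schedule_work(&asi_deferred_free_work);
}

static void asi_deferred_free_fn(struct work_struct *work)
{
        struct llist_node *batch = llist_del_all(&asi_deferred_frees);

        asi_unmap_batch(batch);         /* drop the restricted mappings */
        asi_flush_tlb_all();            /* the expensive part */
        asi_free_batch(batch);          /* hand pages back to the buddy */
}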

::Sandboxing:: is the logic that switches between address spaces and
executes actual mitigations. Before running untrusted code, i.e.
userspace processes and KVM guests, the kernel enters the restricted
address space. If a later kernel entry accesses sensitive data - as
detected by a page fault - it returns to the normal kernel address
space. Each of these address space transitions involves a buffer
flush: on exiting the restricted address space (that is, right before
accessing sensitive data for the first time since running untrusted
code) we flush branch prediction buffers that can be exploited through
Spectre-like attacks. On entering the restricted address space (that
is, right before running untrusted code for the first time since
accessing sensitive data) we flush data buffers that can be exploited
as side channels with Meltdown-like attacks. The “happy path” for ASI
is getting back to the untrusted code without accessing any secret,
and thus incurring no buffer flushes. If the sensitive/nonsensitive
distinction is well-chosen, it should be possible to afford extremely
defensive buffer-flushes on address space transitions, since those
transitions are rare.
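
In pseudo-C, the two transition points look roughly like this. The
asi_enter()/asi_exit() names come from the earlier RFC [1]; the
helpers they call below are placeholders standing in for the CR3
switch and the relevant flush instructions.

/* Right before VM-entry or return to userspace. */
void asi_enter(struct asi *asi)
{
        /* Switch CR3 to the restricted page tables... */
        asi_switch_to_restricted_pgd(asi);

        /*
         * ...and flush data buffers (e.g. VERW / L1D flush) so the
         * sensitive data we touched while unrestricted can't be pulled
         * out via Meltdown/MDS-style side channels by the untrusted
         * code we're about to run.
         */
        asi_flush_data_buffers();
}

/*
 * Called from the page fault handler when the faulting address is
 * mapped in the full kernel page tables but not the restricted ones.
 */
void asi_exit(void)
{
        /* Switch CR3 back to the full kernel page tables... */
        asi_switch_to_unrestricted_pgd();

        /*
         * ...and flush branch prediction state so the untrusted code
         * that ran earlier can't steer speculation (Spectre-style)
         * over the sensitive data we're about to access.
         */
        asi_flush_branch_predictors();
}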

Some interesting details of sandboxing logic relate to interrupt
handling: when an interrupt triggers a transition out of the
restricted address space, we may need to return to it before exiting
the interrupt. A simple implementation could just unconditionally
return to the original address space after servicing any interrupt,
but that can also lead to unnecessary transitions. Thus ASI becomes
something like a small state machine.
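
To make that concrete, the simple variant amounts to something like
the sketch below (all names invented for illustration); the real logic
grows into a state machine precisely to avoid the wasted re-entries
flagged in the comment.

struct asi_irq_state {
        bool was_restricted;
};

void asi_irq_enter(struct asi_irq_state *state)
{
        /*
         * Remember which address space we interrupted, but don't switch
         * eagerly: if the handler touches sensitive data, the resulting
         * page fault will do the asi_exit() for us.
         */
        state->was_restricted = asi_in_restricted_address_space();
}

void asi_irq_exit(struct asi_irq_state *state)
{
        /*
         * Naive policy: unconditionally put the interrupted context back
         * where we found it. Correct, but if that context was about to
         * leave the restricted address space anyway, this re-entry (and
         * the buffer flush it implies) was wasted - hence the state
         * machine.
         */
        if (state->was_restricted && !asi_in_restricted_address_space())
                asi_enter(current_asi());
}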

ASI has been proposed and discussed several times over the years, most
recently by Junaid Shahid & co in [1] and [2]. Since then, the
sophistication of CPU bug exploitation has advanced and Google’s
interest in ASI has continued to grow. We’re now working on deploying an
internal implementation, to prove that this concept has real-world
value. Our current implementation has undergone lots of testing and is
now close to production-ready.

We’d like to share our progress since the last RFC and discuss the
challenges we’ve faced so far in getting this feature
production-ready. Hopefully this will prompt interesting discussion to
guide the next upstream posting.  Some areas that would be fruitful to
discuss:

- Feedback on the overall design.

- How we’ve generalised ASI as a framework that goes beyond the KVM use case.

- How we’ve implemented sensitivity tracking as a “deny list” to ease
initial deployment, and how to develop this into a longer-term
solution. The policy defining which memory objects are
sensitive/nonsensitive ought to be decoupled from the ASI framework,
and ideally even from the code that allocates memory.

- If/how KPTI should be implemented in the ASI framework. We plan to add
a Userspace-ASI class that would map all nonsensitive kernel memory
in the restricted address space, but there may also be value in an ASI
class that mirrors exactly how the current KPTI works (see the sketch
after this list).

- How we’ve solved the TLB flushing issues in sensitivity tracking, and
how it could be done better.
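
To illustrate the KPTI point above: with a class-registration
interface along the lines of the asi_register_class() call in [1], a
KPTI-equivalent could plausibly sit next to the other classes. The
flag names, the third argument, and the exact signature below are
illustrative, not the real API.

static int __init asi_register_default_classes(void)
{
        /* KVM guests: only deny-listed nonsensitive memory is mapped. */
        asi_register_class("KVM", ASI_MAP_STANDARD_NONSENSITIVE, NULL);

        /* Userspace-ASI: all nonsensitive kernel memory is mapped. */
        asi_register_class("userspace", ASI_MAP_ALL_NONSENSITIVE, NULL);

        /*
         * Hypothetical KPTI-equivalent class: map only what today's PTI
         * user page tables contain (entry text, cpu_entry_area, ...),
         * reproducing existing KPTI guarantees inside the framework.
         */
        asi_register_class("kpti", ASI_MAP_ENTRY_ONLY, NULL);

        return 0;
}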

[1] https://lore.kernel.org/all/20220223052223.1202152-1-junaids@google.com/
[2] https://www.phoronix.com/news/Google-LPC-ASI-2022



* Re: [LSF/MM/BPF TOPIC] Address Space Isolation
  2024-02-29  9:57 [LSF/MM/BPF TOPIC] Address Space Isolation Brendan Jackman
@ 2024-03-12 14:48 ` Petr Tesařík
  2024-03-12 16:45   ` Brendan Jackman
  0 siblings, 1 reply; 3+ messages in thread
From: Petr Tesařík @ 2024-03-12 14:48 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: lsf-pc, linux-mm, x86, Junaid Shahid, Reiji Watanabe,
	Patrick Bellasi, Yosry Ahmed, Frank van der Linden, pbonzini,
	Jim Mattson, Paul Turner, Ofir Weisse, alexandre.chartre, rppt,
	dave.hansen, Peter Zijlstra, Thomas Gleixner, luto,
	Andrew.Cooper3

On Thu, 29 Feb 2024 10:57:21 +0100
Brendan Jackman <jackmanb@google.com> wrote:

> [...]
> 
> - How we’ve solved the TLB flushing issues in sensitivity tracking, and
> how it could be done better.

Hello and welcome! I ran into a similar challenge with SandBox Mode. My
solution was to run sandbox code with CPL=3 (on x86) and control page
access with the U/S PTE bit rather than the P bit, which allowed me to
implement lazy TLB invalidation. The x86 folks didn't like the idea...

For the record, SandBox Mode was designed with confidentiality in mind,
although the initial patch series left out this part for simplicity. I
wonder if your objective is to protect kernel data from user space, or
if you have also considered decomposing the kernel into components that
are isolated from each other (and then we could potentially find
some synergies).

Petr T



* Re: [LSF/MM/BPF TOPIC] Address Space Isolation
  2024-03-12 14:48 ` Petr Tesařík
@ 2024-03-12 16:45   ` Brendan Jackman
  0 siblings, 0 replies; 3+ messages in thread
From: Brendan Jackman @ 2024-03-12 16:45 UTC (permalink / raw)
  To: Petr Tesařík
  Cc: lsf-pc, linux-mm, x86, Junaid Shahid, Reiji Watanabe,
	Patrick Bellasi, Yosry Ahmed, Frank van der Linden, pbonzini,
	Jim Mattson, Paul Turner, Ofir Weisse, alexandre.chartre, rppt,
	dave.hansen, Peter Zijlstra, Thomas Gleixner, luto,
	Andrew.Cooper3

On Tue, 12 Mar 2024 at 15:48, Petr Tesařík <petr@tesarici.cz> wrote:
> > - How we’ve solved the TLB flushing issues in sensitivity tracking, and
> > how it could be done better.
>
> Hello and welcome! I ran into a similar challenge with SandBox Mode. My
> solution was to run sandbox code with CPL=3 (on x86) and control page
> access with the U/S PTE bit rather than the P bit, which allowed me to
> implement lazy TLB invalidation. The x86 folks didn't like the idea...

Hmm, a similar idea might be to use protection keys. I'm not sure
that really works though - we haven't given it any serious thought,
since not all CPUs support it. So that would be something to explore
as a later optimisation rather than as a basic principle.

> For the record, SandBox Mode was designed with confidentiality in mind,
> although the initial patch series left out this part for simplicity. I
> wonder if your objective is to protect kernel data from user space, or
> if you have also considered decomposing the kernel into components that
> are isolated from each other (and then we could potentially find
> some synergies).

Yeah that's something we've pondered. What I've presented here is
definitely about protecting the kernel from userspace/VM guests, but
it's a framework where you could conceivably isolate all sorts of
things. Maybe there's a world where ASI makes unprivileged BPF a more
viable notion.

The thing is, what I'm presenting here doesn't protect against
software bugs at all - if you can get the kernel to architecturally
access data and do something bad with it, ASI will happily remap that
data and branch back to the buggy code. That probably simplifies
things quite a lot as compared to SBM.

But yes, the whole "sensitivity tracking" thing does seem to share
requirements with SandBox Mode; I will need to ponder this some more.


