* [LSF/MM/BPF TOPIC] Page allocation for ASI
From: Brendan Jackman @ 2025-01-29 12:40 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm
At last year’s LSF/MM/BPF I presented Address Space Isolation (ASI) [0]. ASI is
a mitigation for a broad class of CPU vulnerabilities that works by creating a
second “restricted” kernel address space which has “sensitive” data unmapped. If
you’re unfamiliar with ASI, the first 10-15 minutes of that talk provide a broad
overview of the whole system. The v1 of my RFC [2] also has some explanatory
discussion in the cover letter.
Last year my talk was pretty high-level, taking the temperature of the MM
community about how to integrate this into the broader kernel and whether there
are any major roadblocks.
Since then, I’ve posted a new RFC [1] and Google’s internal implementation has
continued to expand its footprint in production - it’s now a cornerstone of our
CPU security strategy. Nonetheless, as noted in the RFCv2 cover letter, there
are a few hurdles to overcome, at least in a proof-of-concept, before I’ll be
making actual requests to merge ASI upstream.
The one I’d like to talk about at this session is how best to integrate ASI into
the page allocator. “Sensitivity” of memory in ASI is currently all decided at
the allocation site. This means that when allocating pages we need to alter the
pagetables for the restricted address space. This is a little tricky from the
page allocator:
1. In the most general case, adding pages to the restricted address space requires
allocating pagetables. Allocating while you allocate requires some thought to
avoid spaghetti code/deadlock risk.
2. Removing them requires a TLB flush, which can’t be done from all
page-freeing/allocating contexts.
In the RFCs, we’ve simply kept all free pages unmapped from the restricted
address space. The allocator itself is largely unchanged; at the very end of
allocation we map pages (if appropriate), allocating pagetables via totally
separate allocation calls. When ASI-mapped pages are freed, they go onto a queue
that is then freed asynchronously from a context that’s able to batch up the TLB
flushes before making them available for re-allocation. Reclaim is then made
aware of this asynchronous process so that __GFP_DIRECT_RECLAIM allocations can
block on it where necessary.
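As a rough illustration of the deferred-free scheme described above, here is a
minimal userspace sketch. All the names and structures are hypothetical and the
real kernel code is nothing this simple; the point is just the shape: queue on
free, then one batched flush per drain.

```c
/* Userspace sketch of the deferred-free scheme described above.
 * All names are hypothetical; this is not the kernel code. */
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 64

struct page { bool asi_mapped; bool free; };

static struct page *unmap_queue[QUEUE_CAP];
static size_t unmap_queue_len;
static unsigned int tlb_flushes;        /* counts batched shootdowns */

/* Freeing: an ASI-mapped page is only queued; a page that was never
 * mapped into the restricted address space can be freed immediately. */
static void free_page_asi(struct page *p)
{
	if (!p->asi_mapped) {
		p->free = true;
		return;
	}
	/* Assume for the sketch that the queue never overflows. */
	unmap_queue[unmap_queue_len++] = p;
}

/* Async worker: clear all the queued "restricted PTEs", then issue a
 * single flush for the whole batch before the pages become allocatable. */
static void drain_unmap_queue(void)
{
	size_t i;

	if (!unmap_queue_len)
		return;
	for (i = 0; i < unmap_queue_len; i++)
		unmap_queue[i]->asi_mapped = false;
	tlb_flushes++;                  /* one flush per batch, not per page */
	for (i = 0; i < unmap_queue_len; i++)
		unmap_queue[i]->free = true;
	unmap_queue_len = 0;
}
```

The batching is the whole point: freeing N mapped pages costs one flush, not N,
at the price of the pages being unavailable until the worker runs (which is why
reclaim has to be taught to wait on it).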
Although we’ve been able to hammer this approach into a viable shape for the
Google workloads we’ve been concerned with so far, it’s not a general solution.
Some concrete reasons include:
a. It leads to pointless TLB shootdowns; there must be pathological cases where
lots of pages get un-mapped only to get immediately re-allocated and mapped
again.
b. The asynchronous worker creates CPU jitter.
c. It provides no ability to prioritise re-allocating pages with the same
sensitivity as prior allocations. As well as TLB issues, this creates page-
zeroing costs, as pages that were formerly sensitive need to be zeroed before
they can be mapped into the restricted address space.
d. This all creates unnecessary allocation latency and extra work to free pages.
At last year’s session I touched on the idea of instead using something akin to
migratetypes to track sensitivity (more accurately: presence in ASI’s restricted
pagetables) of free pages/pageblocks. The feedback on that idea was basically
“dunno, we would need more details”. I’m now working on a design based on this
approach and I’d like to use this session to go over such details. I don’t have
a prototype yet, but by March I hope to have shared some illustrative code.
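To make the migratetype-like idea concrete, here is a hedged userspace sketch of
sensitivity-segregated free lists. Everything in it (the names, the two-way
split, the fallback rule) is an assumption for illustration, not the proposed
design:

```c
/* Hypothetical sketch: free lists segregated by "sensitivity", by
 * analogy with migratetypes. */
enum asi_sens { ASI_NONSENSITIVE, ASI_SENSITIVE, ASI_NR_SENS };

static int nr_free[ASI_NR_SENS];        /* stand-in for per-order free lists */

/* Prefer a free page that already has the requested sensitivity, so no
 * (un)mapping, TLB flush or re-zeroing is needed; otherwise fall back
 * to converting a page of the other sensitivity (the expensive path). */
static enum asi_sens alloc_pick(enum asi_sens want)
{
	enum asi_sens other;

	if (nr_free[want] > 0) {
		nr_free[want]--;
		return want;            /* cheap: sensitivity already matches */
	}
	other = (want == ASI_SENSITIVE) ? ASI_NONSENSITIVE : ASI_SENSITIVE;
	nr_free[other]--;               /* fallback: needs pagetable/TLB work */
	return other;
}
```

The interesting questions below are exactly about when the fallback path is
taken and how the per-sensitivity pools are kept balanced.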
Some questions I’m currently investigating that I’d like to discuss details of
(hopefully, with proposed answers by the time of the conference!):
- Can we totally avoid the need to allocate pagetables during allocation, by
keeping ASI’s restricted copy of the physmap in-sync with the unrestricted one,
different only in _PAGE_PRESENT?
- If not, what’s the best way to allocate while we allocate?
- When a TLB shootdown would let us satisfy an allocation that is getting into the
deeper end of the slowpath, how is that prioritised and structured wrt. direct
compact/reclaim/other fallbacks etc?
- How do we maintain a balance of sensitivities among free pages, and what does
that desired balance look like?
- (Note: if no page-table-allocation is needed to map nonsensitive pages, the
second question goes away: since mapping is cheap but unmapping is
expensive, we would mostly just want to minimize the number of free pages
mapped into the restricted address space).
[0] https://lwn.net/Articles/974390/
https://www.youtube.com/watch?v=DxaN6X_fdlI
[1] https://lore.kernel.org/linux-mm/20250110-asi-rfc-v2-v2-0-8419288bc805@google.com/
[2] https://lore.kernel.org/linux-mm/20240712-asi-rfc-24-v1-0-144b319a40d8@google.com/
* Re: [LSF/MM/BPF TOPIC] Page allocation for ASI
From: Brendan Jackman @ 2025-01-29 16:35 UTC (permalink / raw)
To: lsf-pc
Cc: linux-mm, Andrew Morton, Mel Gorman, David Hildenbrand,
Vlastimil Babka, Michal Hocko, Johannes Weiner
On Wed, 29 Jan 2025 at 13:40, Brendan Jackman <jackmanb@google.com> wrote:
Hmm, I did not CC anyone except the list. Adding some people in case
it prompts a discussion.
* Re: [LSF/MM/BPF TOPIC] Page allocation for ASI
From: Brendan Jackman @ 2025-01-31 11:08 UTC (permalink / raw)
To: lsf-pc
Cc: linux-mm, Andrew Morton, Mel Gorman, David Hildenbrand,
Vlastimil Babka, Michal Hocko, Johannes Weiner
On Wed, 29 Jan 2025 at 17:35, Brendan Jackman <jackmanb@google.com> wrote:
>
> On Wed, 29 Jan 2025 at 13:40, Brendan Jackman <jackmanb@google.com> wrote:
> > - Can we totally avoid the need to allocate pagetables during allocation, by
> > keeping ASI’s restricted copy of the physmap in-sync with the unrestricted one,
> > different only in _PAGE_PRESENT?
Er, doing that specifically (relying on _PAGE_PRESENT) would be a bad idea. I
forgot how broken the old CPUs are; this would probably let you exploit L1TF or
something.

We would have to munge the PFN too. Anyway, as long as we only have to do that
at the leaf pagetable, I think the idea still works: we can actually just make
the entry pmd_none().
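For illustration of the point above (the helpers here are made up; only the
present-bit value mirrors the x86 definition): clearing just the present bit
leaves the PFN bits sitting in the entry, which is exactly what an L1TF-era CPU
will speculatively honour, while zeroing the whole entry removes the PFN too.

```c
/* Hypothetical helpers; only the bit-0 present flag mirrors x86. */
#define _PAGE_PRESENT 0x1UL

static unsigned long unmap_present_only(unsigned long pte)
{
	return pte & ~_PAGE_PRESENT;    /* PFN still encoded: L1TF-exposed */
}

static unsigned long unmap_fully(unsigned long pte)
{
	(void)pte;
	return 0;                       /* entry is now none(): no PFN left */
}
```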
* Re: [LSF/MM/BPF TOPIC] Page allocation for ASI
From: Brendan Jackman @ 2025-02-05 13:32 UTC (permalink / raw)
To: lsf-pc
Cc: linux-mm, Andrew Morton, Mel Gorman, David Hildenbrand,
Vlastimil Babka, Michal Hocko, Johannes Weiner
On Wed, 29 Jan 2025 at 17:35, Brendan Jackman <jackmanb@google.com> wrote:
> > Although we’ve been able to hammer this approach into a viable shape for the
> > Google workloads we’ve been concerned with so far, it’s not a general solution.
> > Some concrete reasons include:
> >
> > a. It leads to pointless TLB shootdowns; there must be pathological cases where
> > lots of pages get un-mapped only to get immediately re-allocated and mapped
> > again.
> >
> > b. The asynchronous worker creates CPU jitter.
> >
> > c. It provides no ability to prioritise re-allocating pages with the same
> > sensitivity as prior allocations. As well as TLB issues this creates page
> > zeroing costs as pages that were formerly sensitive need to be zeroed before
> > they can be mapped into the restricted address space.
> >
> > d. This all creates unnecessary allocation latency and extra work to free pages.
Oh, I forgot another important detail: this uses a pageflag to track whether a
page is currently mapped in the restricted address space. That shouldn't be
necessary; a proper page_alloc integration should get rid of it.

Anyway, while important, this isn't the hard/interesting part. If we got really
stuck on this point, we could probably always resort to walking the ASI
pagetables to see if the page is mapped.
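If it helps to picture that fallback, here is a toy two-level walk standing in
for checking ASI's restricted tables. The structure, names and geometry are all
invented for the sketch:

```c
/* Toy two-level walk standing in for "walk the ASI pagetables to see
 * if the page is mapped". Everything here is illustrative only. */
#include <stdbool.h>

#define PTRS_PER_TABLE 512

typedef unsigned long pte_t;            /* 0 means "none" */
struct toy_pmd { pte_t *tables[PTRS_PER_TABLE]; };

static bool asi_page_mapped(struct toy_pmd *pmd, unsigned long pfn)
{
	pte_t *leaf = pmd->tables[(pfn / PTRS_PER_TABLE) % PTRS_PER_TABLE];

	if (!leaf)              /* the pmd_none() case: whole range unmapped */
		return false;
	return leaf[pfn % PTRS_PER_TABLE] != 0;
}
```

The nice property is that a missing leaf table answers "unmapped" for the whole
range at once, which is why keeping the non-leaf levels pre-populated (or not)
matters so much for the questions in the original mail.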
* Re: [LSF/MM/BPF TOPIC] Page allocation for ASI
From: Brendan Jackman @ 2025-03-26 15:36 UTC (permalink / raw)
To: Brendan Jackman, lsf-pc; +Cc: linux-mm
Here are the slides for this talk:
https://docs.google.com/presentation/d/1waibhMBXhfJ2qVEz8KtXop9MZ6UyjlWmK71i0WIH7CY/edit?slide=id.p