From: Brendan Jackman <jackmanb@google.com>
To: lsf-pc@lists.linux-foundation.org
Cc: linux-mm@kvack.org
Subject: [LSF/MM/BPF TOPIC] Page allocation for ASI
Date: Wed, 29 Jan 2025 12:40:33 +0000
Message-ID: <20250129124034.2612562-1-jackmanb@google.com>
At last year’s LSF/MM/BPF I presented Address Space Isolation (ASI) [0]. ASI is
a mitigation for a broad class of CPU vulnerabilities that works by creating a
second “restricted” kernel address space which has “sensitive” data unmapped. If
you’re unfamiliar with ASI, the first 10-15 minutes of that talk provide a broad
overview of the whole system. The v1 of my RFC [2] also has some explanatory
discussion in the cover letter.
Last year my talk was pretty high-level, taking the temperature of the MM
community on how to integrate this into the broader kernel and whether there
were any major roadblocks.
Since then, I’ve posted a new RFC [1] and Google’s internal implementation has
continued to expand its footprint in production - it’s now a cornerstone of our
CPU security strategy. Nonetheless, as noted in the RFCv2 cover letter, there are
a few hurdles to overcome, at least in a proof-of-concept, before I’ll be making
actual requests to merge ASI upstream.
The one I’d like to talk about at this session is how to best integrate ASI into
the page allocator. “Sensitivty” of memory in ASI is currently all decided at
the allocation site. This means when allocating pages we need to alter the
pagetables for the restricted address space. This is a little tricky from the
page allocator:
1. In the most general case, adding pages to the restricted address space requires
allocating pagetables. Allocating while you allocate requires some thought to
avoid spaghetti code/deadlock risk.
2. Removing them requires a TLB flush, which can’t be done from all
page-freeing/allocating contexts.
In the RFCs, we’ve simply kept all free pages unmapped from the restricted
address space. The allocator itself is largely unchanged; at the very end of
allocation we map pages (if appropriate), allocating pagetables via totally
separate allocation calls. When ASI-mapped pages are freed, they go onto a queue
that is then freed asynchronously from a context that’s able to batch up the TLB
flushes before making them available for re-allocation. Reclaim is then made
aware of this asynchronous process so that __GFP_DIRECT_RECLAIM allocations can
block on it where necessary.
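To make the deferred-free scheme above concrete, here is a toy userspace model of it. This is purely illustrative; the function and field names (asi_free_page(), asi_flush_worker(), etc.) are hypothetical and do not correspond to the actual RFC code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 16

struct page {
	bool asi_mapped;	/* mapped into the restricted address space? */
	bool free;		/* available for re-allocation? */
	struct page *next;	/* link on the deferred-free queue */
};

static struct page pages[NPAGES];
static struct page *deferred_queue;
static int tlb_flushes;		/* counts flushes, to show the batching win */

void asi_free_page(struct page *p)
{
	if (p->asi_mapped) {
		/* Can't flush the TLB from this context; defer to the worker. */
		p->next = deferred_queue;
		deferred_queue = p;
	} else {
		p->free = true;	/* never mapped: immediately re-usable */
	}
}

/* Asynchronous worker: one flush covers every queued unmapping. */
void asi_flush_worker(void)
{
	struct page *p;

	if (!deferred_queue)
		return;
	for (p = deferred_queue; p; p = p->next)
		p->asi_mapped = false;	/* clear the restricted-map PTE */
	tlb_flushes++;			/* single batched shootdown */
	while (deferred_queue) {
		p = deferred_queue;
		deferred_queue = p->next;
		p->free = true;		/* only now safe to re-allocate */
	}
}
```

The point of the model is the ordering: a freed ASI-mapped page only becomes re-allocatable after the worker has both unmapped it and issued the (batched) flush, which is what reclaim then has to be taught to wait for.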
Although we’ve been able to hammer this approach into a viable shape for the
Google workloads we’ve been concerned with so far, it’s not a general solution.
Some concrete reasons include:
a. It leads to pointless TLB shootdowns; there must be pathological cases where
lots of pages get un-mapped only to get immediately re-allocated and mapped
again.
b. The asynchronous worker creates CPU jitter.
c. It provides no ability to prioritise re-allocating pages with the same
sensitivity as prior allocations. As well as TLB issues, this creates
page-zeroing costs, as pages that were formerly sensitive need to be zeroed
before they can be mapped into the restricted address space.
d. This all creates unnecessary allocation latency and extra work to free pages.
At last year’s session I touched on the idea of instead using something akin to
migratetypes to track sensitivity (more accurately: presence in ASI’s restricted
pagetables) of free pages/pageblocks. The feedback on that idea was basically
“dunno, we would need more details”. I’m now working on a design based on this
approach and I’d like to use this session to go over such details. I don’t have
a prototype yet, but by March I hope to have shared some illustrative code.
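Since there is no prototype yet, the following is only a sketch of what migratetype-style tracking could look like, as a userspace model. Everything here (asi_blocktype, asi_pick_block(), the fallback policy) is a hypothetical illustration, not proposed kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Track "presence in ASI's restricted pagetables" per pageblock,
 * analogously to migratetypes.
 */
enum asi_blocktype {
	ASI_BLOCK_MAPPED,	/* free pages here are in the restricted map */
	ASI_BLOCK_UNMAPPED,	/* free pages here are sensitive/unmapped */
};

struct pageblock {
	enum asi_blocktype type;
	int nr_free;
};

/*
 * Pick a block for an allocation of the given sensitivity, preferring a
 * block whose mapping state already matches so the fast path needs no
 * remapping (and no re-zeroing of formerly-sensitive memory). Returns a
 * block index, or -1 if nothing is free. Falling back to a mismatched
 * block implies a remap: cheap for unmapped->mapped, a TLB shootdown for
 * the other direction.
 */
int asi_pick_block(struct pageblock *blocks, int n, bool sensitive)
{
	enum asi_blocktype want =
		sensitive ? ASI_BLOCK_UNMAPPED : ASI_BLOCK_MAPPED;
	int fallback = -1;

	for (int i = 0; i < n; i++) {
		if (!blocks[i].nr_free)
			continue;
		if (blocks[i].type == want)
			return i;	/* exact match: no remap needed */
		if (fallback < 0)
			fallback = i;	/* mismatch: would need a remap */
	}
	return fallback;
}
```

The interesting policy questions (below) are exactly about when that fallback is taken and how the mapped/unmapped balance is maintained.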
Some questions I’m currently investigating that I’d like to discuss details of
(hopefully, with proposed answers by the time of the conference!):
- Can we totally avoid the need to allocate pagetables during allocation, by
keeping ASI’s restricted copy of the physmap in-sync with the unrestricted one,
different only in _PAGE_PRESENT?
- If not, what’s the best way to allocate while we allocate?
- When a TLB shootdown would let us satisfy an allocation that is getting into the
deeper end of the slowpath, how is that prioritised and structured with respect
to direct compaction/reclaim and other fallbacks?
- How do we maintain a balance of sensitivities among free pages, and what does
that desired balance look like?
- (Note: if no pagetable allocation is needed to map nonsensitive pages, the
second question goes away: since mapping is cheap but unmapping is
expensive, we would mostly just want to minimize the number of free pages
mapped into the restricted address space.)
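The first question above can also be shown as a toy model: if the restricted copy of the physmap shares the unrestricted one's pagetable structure and differs only in _PAGE_PRESENT, then mapping a page is a single bit flip in a PTE that already exists, with no allocation and no failure path. The model below flattens the pagetables to one PTE array; all names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_PRESENT	(1ULL << 0)
#define NPTES		8

static uint64_t unrestricted[NPTES];
static uint64_t restricted[NPTES];

/* Build both views up front; restricted entries start not-present. */
void asi_physmap_init(void)
{
	for (int i = 0; i < NPTES; i++) {
		unrestricted[i] = ((uint64_t)i << 12) | PAGE_PRESENT;
		restricted[i]   = unrestricted[i] & ~PAGE_PRESENT;
	}
}

/* Mapping is a single bit flip: no pagetable allocation can be needed. */
void asi_map_page(int pfn)
{
	restricted[pfn] |= PAGE_PRESENT;
}

/* Unmapping is equally simple, but still requires a TLB flush elsewhere. */
void asi_unmap_page(int pfn)
{
	restricted[pfn] &= ~PAGE_PRESENT;
}

bool asi_page_mapped(int pfn)
{
	return restricted[pfn] & PAGE_PRESENT;
}
```

The cost of this scheme, of course, is keeping the two pagetable trees structurally in sync, which the real design would have to address.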
[0] https://lwn.net/Articles/974390/
https://www.youtube.com/watch?v=DxaN6X_fdlI
[1] https://lore.kernel.org/linux-mm/20250110-asi-rfc-v2-v2-0-8419288bc805@google.com/
[2] https://lore.kernel.org/linux-mm/20240712-asi-rfc-24-v1-0-144b319a40d8@google.com/