From: Brendan Jackman
Date: Wed, 29 Jan 2025 17:35:29 +0100
Subject: Re: [LSF/MM/BPF TOPIC] Page allocation for ASI
To: lsf-pc@lists.linux-foundation.org
Cc: linux-mm@kvack.org, Andrew Morton, Mel Gorman, David Hildenbrand, Vlastimil Babka, Michal Hocko, Johannes Weiner
In-Reply-To: <20250129124034.2612562-1-jackmanb@google.com>

On Wed, 29 Jan 2025 at 13:40, Brendan Jackman wrote:
>
> At last year’s LSF/MM/BPF I presented Address Space Isolation (ASI) [0]. ASI is
> a mitigation for a broad class of CPU vulnerabilities that works by creating a
> second “restricted” kernel address space which has “sensitive” data unmapped. If
> you’re unfamiliar with ASI, the first 10-15 minutes of that talk provide a broad
> overview of the whole system.
> The v1 of my RFC [2] also has some explanatory discussion in the cover letter.
>
> Last year my talk was pretty high-level, taking the temperature of the MM
> community about how to integrate this into the broader kernel and whether there
> are any major roadblocks.
>
> Since then, I’ve posted a new RFC [1] and Google’s internal implementation has
> continued to expand its footprint in production - it’s now a cornerstone of our
> CPU security strategy. Nonetheless, as noted in the RFCv2 cover-letter there are
> a few hurdles to overcome, at least in a proof-of-concept, before I’ll be making
> actual requests to merge ASI upstream.
>
> The one I’d like to talk about at this session is how to best integrate ASI into
> the page allocator. “Sensitivity” of memory in ASI is currently all decided at
> the allocation site. This means when allocating pages we need to alter the
> pagetables for the restricted address space. This is a little tricky from the
> page allocator:
>
> 1. In the most general case, adding pages to the restricted address space requires
>    allocating pagetables. Allocating while you allocate requires some thought to
>    avoid spaghetti code/deadlock risk.
>
> 2. Removing them requires a TLB flush, which can’t be done from all
>    page-freeing/allocating contexts.
>
> In the RFCs, we’ve simply kept all free pages unmapped from the restricted
> address space. The allocator itself is largely unchanged; at the very end of
> allocation we map pages (if appropriate), allocating pagetables via totally
> separate allocation calls. When ASI-mapped pages are freed, they go onto a queue
> that is then freed asynchronously from a context that’s able to batch up the TLB
> flushes before making them available for re-allocation.
> Reclaim is then made aware of this asynchronous process so that
> __GFP_DIRECT_RECLAIM allocations can block on it where necessary.
>
> Although we’ve been able to hammer this approach into a viable shape for the
> Google workloads we’ve been concerned with so far, it’s not a general solution.
> Some concrete reasons include:
>
> a. It leads to pointless TLB shootdowns; there must be pathological cases where
>    lots of pages get un-mapped only to get immediately re-allocated and mapped
>    again.
>
> b. The asynchronous worker creates CPU jitter.
>
> c. It provides no ability to prioritise re-allocating pages with the same
>    sensitivity as prior allocations. As well as TLB issues this creates page
>    zeroing costs as pages that were formerly sensitive need to be zeroed before
>    they can be mapped into the restricted address space.
>
> d. This all creates unnecessary allocation latency and extra work to free pages.
>
> At last year’s session I touched on the idea of instead using something akin to
> migratetypes to track sensitivity (more accurately: presence in ASI’s restricted
> pagetables) of free pages/pageblocks. The feedback on that idea was basically
> “dunno, we would need more details”. I’m now working on a design based on this
> approach and I’d like to use this session to go over such details. I don’t have
> a prototype yet, but by March I hope to have shared some illustrative code.
>
> Some questions I’m currently investigating that I’d like to discuss details of
> (hopefully, with proposed answers by the time of the conference!):
>
> - Can we totally avoid the need to allocate pagetables during allocation, by
>   keeping ASI’s restricted copy of the physmap in-sync with the unrestricted one,
>   different only in _PAGE_PRESENT?
>
> - If not, what’s the best way to allocate while we allocate?
>
> - When a TLB shootdown would let us satisfy an allocation that is getting into the
>   deeper end of the slowpath, how is that prioritised and structured wrt. direct
>   compact/reclaim/other fallbacks etc?
>
> - How do we maintain a balance of sensitivities among free pages, and what does
>   that desired balance look like?
>
>   - (Note: if no page-table-allocation is needed to map nonsensitive pages, the
>     second question goes away: since mapping is cheap but unmapping is
>     expensive, we would mostly just want to minimize the number of free pages
>     mapped into the restricted address space).
>
> [0] https://lwn.net/Articles/974390/
>     https://www.youtube.com/watch?v=DxaN6X_fdlI
> [1] https://lore.kernel.org/linux-mm/20250110-asi-rfc-v2-v2-0-8419288bc805@google.com/
> [2] https://lore.kernel.org/linux-mm/20240712-asi-rfc-24-v1-0-144b319a40d8@google.com/

Hmm, I did not CC anyone except the list. Adding some people in case
it prompts a discussion.
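
For anyone skimming the thread, here is a rough user-space toy model in C of
the scheme the quoted RFCs use: free pages stay unmapped from the restricted
address space, nonsensitive pages get mapped at the end of allocation, and
freed mapped pages wait on a deferred list until one batched TLB flush makes
them reusable. This is only a sketch and not the RFC code. Every name in it
(toy_page, toy_alloc, toy_free, toy_drain, tlb_flushes) is made up for
illustration, and pagetable allocation, locking and reclaim are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 8

/* Hypothetical per-page state for this toy model; not a kernel structure. */
struct toy_page {
	bool free;        /* available for allocation */
	bool asi_mapped;  /* present in the restricted address space */
	bool pending;     /* freed, but still waiting for a TLB flush */
};

static struct toy_page pages[NR_PAGES];
static unsigned int tlb_flushes; /* number of simulated flushes so far */

/* Start with every page free and unmapped from the restricted space. */
static void toy_init(void)
{
	for (size_t i = 0; i < NR_PAGES; i++)
		pages[i] = (struct toy_page){ .free = true };
	tlb_flushes = 0;
}

/* Take the first free page; map it only if the caller says nonsensitive. */
static int toy_alloc(bool sensitive)
{
	for (size_t i = 0; i < NR_PAGES; i++) {
		if (pages[i].free) {
			pages[i].free = false;
			pages[i].asi_mapped = !sensitive;
			return (int)i;
		}
	}
	return -1; /* nothing free; pending pages only return after a drain */
}

/* Unmapped pages are reusable at once; mapped ones are deferred. */
static void toy_free(int idx)
{
	assert(idx >= 0 && idx < NR_PAGES);
	if (pages[idx].asi_mapped)
		pages[idx].pending = true;
	else
		pages[idx].free = true;
}

/* Stand-in for the asynchronous worker: unmap the batch, flush once. */
static void toy_drain(void)
{
	bool any = false;

	for (size_t i = 0; i < NR_PAGES; i++) {
		if (pages[i].pending) {
			pages[i].pending = false;
			pages[i].asi_mapped = false;
			pages[i].free = true;
			any = true;
		}
	}
	if (any)
		tlb_flushes++;
}
```

In this model, freeing two nonsensitive pages and then calling toy_drain()
costs a single simulated flush for both, while a freed sensitive page is
reusable immediately. It also shows drawback (a) from the quoted list: a page
that is freed and then wanted again as nonsensitive still goes through an
unmap, a flush and a remap.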