From: Fares Mehanna <faresx@amazon.de>
Subject: Re: [RFC PATCH 0/7] support for mm-local memory allocations and use it
Date: Wed, 25 Sep 2024 15:33:47 +0000
Message-ID: <20240925153347.94589-1-faresx@amazon.de>

Hi,

Thanks for taking a look and apologies for my delayed response.

> Having a VMA in user mappings for kernel memory seems weird to say the
> least.

I see your point and agree with you. Let me explain the motivation, pros
and cons of the approach after answering your questions.

> Core MM does not expect to have VMAs for kernel memory. What will happen
> if userspace ftruncates that VMA? Or registers it with userfaultfd?

In the patch, I make sure the pages are faulted in, locked and sealed to
make sure the VMA is practically off-limits to the owner process. Only
after that do I change the permissions so the memory can be used by the
kernel.
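To make the ordering concrete, here is a rough sketch of that setup
sequence. This is not the code from the series; the mm_local_*() helpers
are hypothetical placeholders, and the actual patches may go through the
regular mlock()/mseal() paths instead of setting VMA flags directly.

#include <linux/err.h>
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/sched.h>

/* Illustrative sketch only, not the RFC's code. */
static long mm_local_map_secret(struct file *secretmem_file, unsigned long len)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma;
	unsigned long addr;
	int err;

	/* (1) Map the secretmem file into the owner's address space. */
	addr = vm_mmap(secretmem_file, 0, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED, 0);
	if (IS_ERR_VALUE(addr))
		return addr;

	/* (2) Prefault and pin the pages so the kernel never faults on them
	 *     and so they cannot be reclaimed or migrated.
	 */
	err = mm_local_prefault_and_lock(mm, addr, len);	/* hypothetical */
	if (err)
		return err;

	mmap_write_lock(mm);
	vma = find_vma(mm, addr);

	/* (3) Seal the VMA so userspace cannot munmap()/mprotect()/mremap()
	 *     it (VM_SEALED needs mseal() support, 64-bit only).
	 */
	vm_flags_set(vma, VM_LOCKED | VM_SEALED | VM_DONTDUMP);

	/* (4) Only now flip the mapping to kernel-only permissions. */
	mm_local_make_kernel_only(vma);				/* hypothetical */

	mmap_write_unlock(mm);
	return addr;
}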
> This approach seems much more reasonable and it's not that it was
> entirely arch-specific. There is some plumbing at arch level, but the
> allocator is anyway arch-independent.

So I wanted to explore a simple solution to implement mm-local kernel
secret memory without much arch-dependent code. I also wanted to reuse as
much of memfd_secret() as possible, to benefit from what is done already
and from possible future improvements to it.

Keeping the secret pages at user virtual addresses is easier, as the page
table entries are not global by default, so no special handling is needed
for spawn(). Keeping them tracked in a VMA shouldn't require special
handling for fork().

The challenge was to keep the virtual addresses / VMA away from user
control as long as the kernel is using them, and to signal the mm core
that this VMA is special so it is not merged with other VMAs. I believe
locking the pages, sealing the VMA and prefaulting the pages should keep
it practically out of userspace influence.

But the current approach has these downsides (that I can think of):

1. Kernel secret user virtual addresses can still be used in functions
   accepting user virtual addresses, like copy_from_user() /
   copy_to_user().
2. Even if we are sure the VMA is off-limits to userspace, adding a VMA
   with kernel addresses increases the attack surface between userspace
   and the kernel.
3. Since kernel secret memory is mapped at user virtual addresses, it is
   very easy to guess the exact virtual address (using binary search), and
   since this functionality is designed to keep user data, it is fair to
   assume userspace will always be able to influence what is written
   there. So it kind of breaks KASLR for those specific pages.
4. It locks user virtual memory away; this may break software that assumes
   it can mmap() into specific places.

One way to address most of those concerns while keeping the solution
almost arch-agnostic is to allocate a reasonable chunk of user virtual
memory to be used only for kernel secret memory, and not track it in VMAs.
This is similar to the old approach, but instead of creating a non-global
kernel PGD per arch it would use a chunk of user virtual memory. This
chunk can be defined per arch, and this solution won't use memfd_secret().
We can then easily enlighten the kernel about this range so the kernel can
test for it in functions like access_ok().
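As an illustration of what that enlightenment could look like (a sketch
with made-up range macros, not an existing kernel interface):

/*
 * Illustrative sketch only: if a per-arch window of user virtual addresses
 * were reserved for mm-local kernel secrets, user pointer validation could
 * reject it explicitly.  The MM_LOCAL_SECRET_* macros are hypothetical.
 */
#define MM_LOCAL_SECRET_START	0x0000700000000000UL	/* hypothetical, per arch */
#define MM_LOCAL_SECRET_END	0x0000700040000000UL	/* hypothetical, per arch */

static inline bool mm_local_secret_range(unsigned long addr, unsigned long size)
{
	return addr < MM_LOCAL_SECRET_END &&
	       addr + size > MM_LOCAL_SECRET_START;
}

/* A check along these lines could then be folded into access_ok(). */
static inline bool mm_local_access_ok(const void __user *ptr, unsigned long size)
{
	unsigned long addr = (unsigned long)ptr;

	if (mm_local_secret_range(addr, size))
		return false;
	return access_ok(ptr, size);
}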
This approach, however, will make downside #4 even worse, as it will
reserve a bigger chunk of user virtual memory if this feature is enabled.

I'm also very okay with switching back to the old approach, at the expense
of:

1. Supporting fewer architectures, i.e. those that can afford to give away
   a single PGD.
2. More complicated arch-specific code.

Also, @graf mentioned that aarch64 uses TTBR0/TTBR1 for user and kernel
page tables. I haven't looked at this yet, but it probably means that the
kernel page table will be tracked per process and TTBR1 will be switched
during context switching.

What do you think? I would appreciate your opinion before working on the
next RFC patch set.

Thanks!
Fares.



Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597