Message-ID: <9d64c949-6d5f-06c0-47ef-caade67477e5@intel.com>
Date: Thu, 18 May 2023 14:04:08 -0700
Subject: Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
To: Jeff Xu
Cc: Stephen Röttger, jeffxu@chromium.org, luto@kernel.org,
 jorgelo@chromium.org, keescook@chromium.org, groeck@chromium.org,
 jannh@google.com, akpm@linux-foundation.org,
 linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
 linux-mm@kvack.org, linux-hardening@vger.kernel.org
References: <20230515130553.2311248-1-jeffxu@chromium.org>
 <2bcffc9f-9244-0362-2da9-ece230055320@intel.com>
 <2b14036e-aed8-4212-bc0f-51ec4fe5a5c1@intel.com>
From: Dave Hansen

On 5/18/23 13:20, Jeff Xu wrote:
>> Here's my concern about this whole thing: it's headed down a rabbit hole
>> which is *highly* specialized both in the apps that will use it and the
>> attacks it will mitigate. It probably *requires* turning off a bunch of
>> syscalls (like io_uring) that folks kinda like in general.
>>
> ChromeOS currently disables io_uring, but it is not required to do so.
> io_uring supports the IORING_OP_MADVISE operation, which calls the
> do_madvise() function. This means that io_uring will have the same
> pkey checks as the madvise() system call.
> From that perspective, we will fully support io_uring for this feature.

io_uring fundamentally doesn't have the same checks. The kernel side
work can be done from an asynchronous kernel thread. That kernel thread
doesn't have a meaningful PKRU value. The register has a value, but
it's not really related to the userspace threads that are sending it
requests.

>> We're balancing that highly specialized mitigation with a feature that
>> adds new ABI, touches core memory management code and signal handling.
>>
> The ABI change uses the existing flag field in pkey_alloc() which is
> reserved. The implementation is backward compatible with all existing
> pkey usages in both kernel and user space. Or do you have other
> concerns about ABI in mind?

I'm not worried about the past. I'm worried any time we add a new ABI
since we need to support it forever.

> Yes, you are right about the risk of touching core mm code. To
> minimize the risk, I try to control the scope of the change (it is
> about 3 lines in mprotect, more in munmap but really just 3 effective
> lines from syscall entry). I added new self-tests in mm to make sure
> it doesn't regress in API behavior. I ran those tests before and after
> my kernel code change to make sure the behavior remains the same. I
> tested it on 5.15, 6.1 and 6.4-rc1. Actually, the testing
> discovered a behavior change for mprotect() between 6.1 and 6.4 (not
> from this patch, there is refactoring work going on in mm), see this
> thread [1]
> I hope those steps will help to mitigate the risk.
>
> Agreed that signal handling is a tough part: what do you think about
> the approach (modifying PKRU from saved stack after XSAVE), is there a
> blocker?

Yes, signal entry and sigreturn are not necessarily symmetric so you
can't really have a stack.

>> On the x86 side, PKRU is a painfully special snowflake. It's exposed in
>> the "XSAVE" ABIs, but not actually managed *with* XSAVE in the kernel.
>> This would be making it an even more special snowflake because it would
>
> I admit I'm too ignorant about XSAVE to understand the above
> statement, and how that is related. Could you explain it to me, please?
> And what do you have in mind that might improve the situation?

In a nutshell: XSAVE components are classified as either user or
supervisor. User components can be modified from userspace and
supervisor ones only from the kernel. In general, user components don't
affect the kernel; the kernel doesn't care what is in ZMM11 (an
XSAVE-managed register). That lets us do fun stuff like be lazy about
when ZMM11 is saved/restored. Being lazy is good because it gives us
things like faster context switches and KVM VMEXIT handling.

PKRU is a user component, but it affects the kernel when the kernel does
copy_to/from_user() and friends. That means that the kernel can't do
any "fun stuff" with PKRU. As soon as userspace provides a new value,
the kernel needs to start respecting it. That makes PKRU a very special
snowflake.

So, even though PKRU can be managed by XSAVE, it isn't. It isn't kept
in the kernel XSAVE buffer. But it *IS* in the signal stack XSAVE
buffer. You *can* save/restore it with the other XSAVE components with
ptrace(). The user<->kernel ABI pretends that PKRU is XSAVE managed
even though it is not.

All of this is special-cased. There's a ton of code to handle this
mess. It's _complicated_. I haven't even started talking about how
this interacts with KVM and guests.

How could we improve it? A time machine would help to either change the
architecture or have Linux ignore the fact that XSAVE knows anything
about PKRU.

So, the bar is pretty high for things that want to further muck with
PKRU. Add signal and sigaltstack in particular into the fray, and we've
got a recipe for disaster. sigaltstack and XSAVE don't really get along
very well. https://lwn.net/Articles/862541/

>> need new altstack ABI and handling.
>>
> I thought adding protected memory support to signal handling is an
> independent project with its own weight. As Jann Horn points out in
> [2]: "we could prevent the attacker from corrupting the signal
> context if we can protect the signal stack with a pkey." However,
> the kernel will send SIGSEGV when the stack is protected by PKEY, so
> there is a benefit to making this work. (Maybe Jann can share some more
> thoughts on the benefits)
>
> And I believe we could do this in a way with minimum ABI change, as below:
> - allocate PKEY with a new flag (PKEY_ALTSTACK)
> - at sigaltstack() call, detect the memory is PKEY_ALTSTACK protected,
> (similar to what mprotect does in this patch) and save it along with
> stack address/size.
> - at signal handling, use the saved info to fill in PKRU.
> The ABI change is similar to PKEY_ENFORCE_API, and there is no
> backward compatibility issue.
>
> Would the above help our case? What do you think?

To be honest, no.

What you've laid out here is the tip of the complexity iceberg. There
are a lot of pieces of the kernel that are not yet factored in.

Let's also remember: protection keys is *NOT* a security feature. It's
arguable that pkeys is a square peg trying to go into a round security
hole.
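
For readers less familiar with PKRU, the io_uring point made earlier in
this message can be illustrated with a toy model. This is only a sketch
and not kernel code: the helper names pkru_set() and access_allowed()
and the worker thread's PKRU value are invented for illustration. It
assumes the x86 layout of two bits per key (access-disable at bit 2*k,
write-disable at bit 2*k+1) and the PKEY_DISABLE_* values from the Linux
uapi headers.

```python
# Toy model of the x86 PKRU register: 16 keys, 2 bits per key.
# Bit 2*k   = AD (access disable) for key k
# Bit 2*k+1 = WD (write disable) for key k
PKEY_DISABLE_ACCESS = 0x1
PKEY_DISABLE_WRITE = 0x2


def pkru_set(pkru, pkey, rights):
    """Return a new PKRU value with the 2-bit field for `pkey` replaced."""
    shift = 2 * pkey
    cleared = pkru & ~(0x3 << shift) & 0xFFFFFFFF
    return cleared | ((rights & 0x3) << shift)


def access_allowed(pkru, pkey, write):
    """Evaluate an access against whatever PKRU the *current* thread holds."""
    bits = (pkru >> (2 * pkey)) & 0x3
    if bits & PKEY_DISABLE_ACCESS:
        return False
    if write and (bits & PKEY_DISABLE_WRITE):
        return False
    return True


# The submitting userspace thread has made key 1 read-only for itself.
user_pkru = pkru_set(0, pkey=1, rights=PKEY_DISABLE_WRITE)

# Hypothetical value held by an async worker thread; the only point is
# that it is unrelated to user_pkru.
worker_pkru = 0

print(access_allowed(user_pkru, 1, write=True))    # False
print(access_allowed(worker_pkru, 1, write=True))  # True
```

The same write is refused or permitted depending only on which thread's
PKRU value is consulted, which is why a check that runs in a worker
thread is not equivalent to one that runs in the submitting thread.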