Date: Wed, 7 Apr 2021 17:36:13 +0300
From: "Kirill A. Shutemov"
To: David Hildenbrand
Cc: Dave Hansen, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
    Sean Christopherson, Jim Mattson, David Rientjes,
    "Edgecombe, Rick P", "Kleen, Andi", "Yamahata, Isaku",
    x86@kernel.org, kvm@vger.kernel.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, "Kirill A. Shutemov"
Subject: Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
Message-ID: <20210407143613.4inmmgjh2qo5avfh@box.shutemov.name>
References: <20210402152645.26680-1-kirill.shutemov@linux.intel.com>
    <20210402152645.26680-8-kirill.shutemov@linux.intel.com>
    <52518f09-7350-ebe9-7ddb-29095cd3a4d9@intel.com>
    <20210407131647.djajbwhqsmlafsyo@box.shutemov.name>
    <9c81fac4-9ac3-46d9-9ac6-da91312ad21b@redhat.com>
In-Reply-To: <9c81fac4-9ac3-46d9-9ac6-da91312ad21b@redhat.com>

On Wed, Apr 07, 2021 at 04:09:35PM +0200, David Hildenbrand wrote:
> On 07.04.21 15:16, Kirill A. Shutemov wrote:
> > On Tue, Apr 06, 2021 at 04:57:46PM +0200, David Hildenbrand wrote:
> > > On 06.04.21 16:33, Dave Hansen wrote:
> > > > On 4/6/21 12:44 AM, David Hildenbrand wrote:
> > > > > On 02.04.21 17:26, Kirill A. Shutemov wrote:
> > > > > > TDX architecture aims to provide resiliency against confidentiality and
> > > > > > integrity attacks.
> > > > > > Towards this goal, the TDX architecture helps enforce
> > > > > > the enabling of memory integrity for all TD-private memory.
> > > > > >
> > > > > > The CPU memory controller computes the integrity check value (MAC) for
> > > > > > the data (cache line) during writes, and it stores the MAC with the
> > > > > > memory as meta-data. A 28-bit MAC is stored in the ECC bits.
> > > > > >
> > > > > > Checking of memory integrity is performed during memory reads. If
> > > > > > integrity check fails, CPU poisons cache line.
> > > > > >
> > > > > > On a subsequent consumption (read) of the poisoned data by software,
> > > > > > there are two possible scenarios:
> > > > > >
> > > > > >   - Core determines that the execution can continue and it treats
> > > > > >     poison with exception semantics signaled as a #MCE
> > > > > >
> > > > > >   - Core determines execution cannot continue, and it does an unbreakable
> > > > > >     shutdown
> > > > > >
> > > > > > For more details, see Chapter 14 of Intel TDX Module EAS[1]
> > > > > >
> > > > > > As some integrity check failures may lead to system shutdown, the host
> > > > > > kernel must not allow any writes to TD-private memory. This requirement
> > > > > > clashes with KVM design: KVM expects the guest memory to be mapped into
> > > > > > host userspace (e.g. QEMU).
> > > > >
> > > > > So what you are saying is that if QEMU would write to such memory, it
> > > > > could crash the kernel? What a broken design.
> > > >
> > > > IMNHO, the broken design is mapping the memory to userspace in the first
> > > > place. Why the heck would you actually expose something with the MMU to
> > > > a context that can't possibly meaningfully access or safely write to it?
> > >
> > > I'd say the broken design is being able to crash the machine via a simple
> > > memory write, instead of only crashing a single process in case you're doing
> > > something nasty.
> > > From the evaluation of the problem it feels like this was a
> > > CPU design workaround: instead of properly cleaning up when it gets tricky
> > > within the core, just crash the machine. And that's a CPU "feature", not a
> > > kernel "feature". Now we have to fix broken HW in the kernel - once again.
> > >
> > > However, you raise a valid point: it does not make too much sense to map
> > > this into user space. Not arguing against that; but crashing the machine is
> > > just plain ugly.
> > >
> > > I wonder: why do we even *want* a VMA/mmap describing that memory? Sounds
> > > like: for hacking support for that memory type into QEMU/KVM.
> > >
> > > This all feels wrong, but I cannot really tell how it could be better. That
> > > memory can really only be used (right now?) with hardware virtualization
> > > from some point on. From that point on (right from the start?), there should
> > > be no VMA/mmap/page tables for user space anymore.
> > >
> > > Or am I missing something? Is there still valid user space access?
> >
> > There is. For IO (e.g. virtio) the guest marks a range of memory as shared
> > (or unencrypted for AMD SEV). The range is not pre-defined.
> >
>
> Ah right, rings a bell. One obvious alternative would be to let user space
> only explicitly map what is shared and can be safely accessed, instead of
> doing it the other way around. But that obviously requires more thought/work
> and clashes with future MM changes you discuss below.

IIUC, Hyper-V's VMBus uses a pre-defined range that is communicated through
ACPI. KVM/virtio can do the same in theory, but it would require changes
to the existing driver model.

> > > > This started with SEV. QEMU creates normal memory mappings with the SEV
> > > > C-bit (encryption) disabled. The kernel plumbs those into NPT, but when
> > > > those are instantiated, they have the C-bit set. So, we have mismatched
> > > > mappings. Where does that lead?
> > > > The two mappings not only differ in
> > > > the encryption bit, causing one side to read gibberish if the other
> > > > writes: they're not even cache coherent.
> > > >
> > > > That's the situation *TODAY*, even ignoring TDX.
> > > >
> > > > BTW, I'm pretty sure I know the answer to the "why would you expose this
> > > > to userspace" question: it's what QEMU/KVM did already for
> > > > non-encrypted memory, so this was the quickest way to get SEV working.
> > > >
> > >
> > > Yes, I guess so. It was the fastest way to "hack" it into QEMU.
> > >
> > > Would we ever even want a VMA/mmap/process page tables for that memory? How
> > > could user space ever do something *not so nasty* with that memory (in the
> > > current context of VMs)?
> >
> > In the future, the memory should be still manageable by host MM: migration,
> > swapping, etc. But it's a long way there. For now, the guest memory
>
> I was involved in the s390x implementation where this already works, simply
> because whenever encrypted memory is read/written from the hypervisor, you
> simply read/write the encrypted data; once the VM accesses that memory, it
> reads/writes unencrypted memory. For this reason, migration, swapping etc.
> works fairly naturally.

In the TDX case, the encryption is tied to the physical address of the
encrypted block. Moving the block to another place in memory would produce
garbage. It's done intentionally to protect against replay attacks.

> I do wonder how x86-64 wants to tackle that; In the far future, will it be
> valid to again read/write encrypted memory, especially from user space?
>

It would require assistance from the guest and/or TDX module.

-- 
 Kirill A. Shutemov
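
As an illustration of the point above about encryption being tied to the
physical address, here is a toy Python sketch. It is not the cipher TDX
actually uses: the HMAC-based keystream, the key, the addresses and the
helper names (keystream, crypt) are all assumptions made up for this example.
It only shows the general effect, namely that ciphertext copied verbatim to a
different address no longer decrypts to the original data.

```python
import hashlib
import hmac


def keystream(key: bytes, phys_addr: int, length: int) -> bytes:
    """Toy keystream that depends on both the key and the physical address.

    Hypothetical construction for illustration only, not the real hardware
    algorithm.
    """
    out = b""
    counter = 0
    while len(out) < length:
        msg = phys_addr.to_bytes(8, "little") + counter.to_bytes(4, "little")
        out += hmac.new(key, msg, hashlib.sha256).digest()
        counter += 1
    return out[:length]


def crypt(key: bytes, phys_addr: int, data: bytes) -> bytes:
    """XOR data with the address-dependent keystream (encrypts and decrypts)."""
    ks = keystream(key, phys_addr, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))


key = b"toy-td-private-key"          # made-up key
plaintext = b"guest secret data"
old_addr, new_addr = 0x1000, 0x2000  # made-up physical addresses

# The guest writes its data: the block is encrypted for old_addr.
ciphertext = crypt(key, old_addr, plaintext)

# Reading the block back at the same address recovers the data.
same_place = crypt(key, old_addr, ciphertext)

# The host copies the raw ciphertext elsewhere (e.g. naive migration);
# decrypting at the new address uses a different keystream.
moved = crypt(key, new_addr, ciphertext)

print("same address recovers data:", same_place == plaintext)
print("moved block recovers data: ", moved == plaintext)
```

In this toy model, moving a block without producing garbage needs someone who
holds the key to decrypt at the old address and re-encrypt for the new one,
which is in line with the remark that it would require assistance from the
guest and/or the TDX module.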