Date: Fri, 16 Apr 2021 17:30:30 +0000
From: Sean Christopherson
To: "Kirill A. Shutemov"
Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Jim Mattson,
	David Rientjes, "Edgecombe, Rick P", "Kleen, Andi",
	"Yamahata, Isaku", Erdem Aktas, Steve Rutherford, Peter Gonda,
	David Hildenbrand, x86@kernel.org, kvm@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages
References: <20210416154106.23721-1-kirill.shutemov@linux.intel.com>
	<20210416154106.23721-14-kirill.shutemov@linux.intel.com>
In-Reply-To: <20210416154106.23721-14-kirill.shutemov@linux.intel.com>

On Fri, Apr 16, 2021, Kirill A.
Shutemov wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1b404e4d7dd8..f8183386abe7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8170,6 +8170,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  		kvm_sched_yield(vcpu->kvm, a0);
>  		ret = 0;
>  		break;
> +	case KVM_HC_ENABLE_MEM_PROTECTED:
> +		ret = kvm_protect_memory(vcpu->kvm);
> +		break;
> +	case KVM_HC_MEM_SHARE:
> +		ret = kvm_share_memory(vcpu->kvm, a0, a1);

Can you take a look at a proposed hypercall interface for SEV live migration
and holler if you (or anyone else) thinks it will have extensibility issues?

https://lkml.kernel.org/r/93d7f2c2888315adc48905722574d89699edde33.1618498113.git.ashish.kalra@amd.com

> +		break;
>  	default:
>  		ret = -KVM_ENOSYS;
>  		break;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index fadaccb95a4c..cd2374802702 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -436,6 +436,8 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>  }
>  #endif
>
> +#define KVM_NR_SHARED_RANGES 32
> +
>  /*
>   * Note:
>   * memslots are not sorted by id anymore, please use id_to_memslot()
> @@ -513,6 +515,10 @@ struct kvm {
>  	pid_t userspace_pid;
>  	unsigned int max_halt_poll_ns;
>  	u32 dirty_ring_size;
> +	bool mem_protected;
> +	void *id;
> +	int nr_shared_ranges;
> +	struct range shared_ranges[KVM_NR_SHARED_RANGES];

Hard no for me.  IMO, anything that requires KVM to track shared/pinned pages
in a separate tree/array is a non-starter.  More specific to TDX #MCs, KVM
should not be the canonical reference for the state of a page.

> };
>
> #define kvm_err(fmt, ...) \

...

> @@ -1868,11 +1874,17 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>  		flags |= FOLL_WRITE;
>  	if (async)
>  		flags |= FOLL_NOWAIT;
> +	if (kvm->mem_protected)
> +		flags |= FOLL_ALLOW_POISONED;

This is unsafe; only the flows that are mapping the PFN into the guest should
use ALLOW_POISONED, e.g.
__kvm_map_gfn() should fail on a poisoned page.

>
>  	npages = get_user_pages_unlocked(addr, 1, &page, flags);
>  	if (npages != 1)
>  		return npages;
>
> +	if (IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) &&
> +	    kvm->mem_protected && !kvm_page_allowed(kvm, page))
> +		return -EHWPOISON;
> +
>  	/* map read fault as writable if possible */
>  	if (unlikely(!write_fault) && writable) {
>  		struct page *wpage;

...

> @@ -2338,19 +2350,93 @@ static int next_segment(unsigned long len, int offset)
>  	return len;
>  }
>
> -static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> -				 void *data, int offset, int len)
> +int copy_from_guest(struct kvm *kvm, void *data, unsigned long hva, int len)
> +{
> +	int offset = offset_in_page(hva);
> +	struct page *page;
> +	int npages, seg;
> +	void *vaddr;
> +
> +	if (!IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) ||
> +	    !kvm->mem_protected) {
> +		return __copy_from_user(data, (void __user *)hva, len);
> +	}
> +
> +	might_fault();
> +	kasan_check_write(data, len);
> +	check_object_size(data, len, false);
> +
> +	while ((seg = next_segment(len, offset)) != 0) {
> +		npages = get_user_pages_unlocked(hva, 1, &page,
> +						 FOLL_ALLOW_POISONED);
> +		if (npages != 1)
> +			return -EFAULT;
> +
> +		if (!kvm_page_allowed(kvm, page))
> +			return -EFAULT;
> +
> +		vaddr = kmap_atomic(page);
> +		memcpy(data, vaddr + offset, seg);
> +		kunmap_atomic(vaddr);

Why is KVM allowed to access a poisoned page?  I would expect shared pages to
_not_ be poisoned.  Except for pure software emulation of SEV, KVM can't
access guest private memory.

> +
> +		put_page(page);
> +		len -= seg;
> +		hva += seg;
> +		data += seg;
> +		offset = 0;
> +	}
> +
> +	return 0;
> +}

...
> @@ -2693,6 +2775,41 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
>
> +int kvm_protect_memory(struct kvm *kvm)
> +{
> +	if (mmap_write_lock_killable(kvm->mm))
> +		return -KVM_EINTR;
> +
> +	kvm->mem_protected = true;
> +	kvm_arch_flush_shadow_all(kvm);
> +	mmap_write_unlock(kvm->mm);
> +
> +	return 0;
> +}
> +
> +int kvm_share_memory(struct kvm *kvm, unsigned long gfn, unsigned long npages)
> +{
> +	unsigned long end = gfn + npages;
> +
> +	if (!npages || !IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY))
> +		return 0;
> +
> +	__kvm_share_memory(kvm, gfn, end);
> +
> +	for (; gfn < end; gfn++) {
> +		struct page *page = gfn_to_page(kvm, gfn);
> +		unsigned long pfn = page_to_pfn(page);
> +
> +		if (page == KVM_ERR_PTR_BAD_PAGE)
> +			continue;
> +
> +		kvm_share_pfn(kvm, pfn);

I like the idea of using a "special" PTE value to denote guest private memory,
e.g. in this RFC, HWPOISON.  But I strongly dislike having KVM involved in the
manipulation of the special flag/value.

Today, userspace owns the gfn->hva translations and the kernel effectively
owns the hva->pfn translations (with input from userspace).  KVM just connects
the dots.

Having KVM own the shared/private transitions means KVM is now part owner of
the entire gfn->hva->pfn translation, i.e. KVM is effectively now a secondary
MMU and a co-owner of the primary MMU.  This creates locking madness, e.g. KVM
taking mmap_sem for write, mmu_lock under page lock, etc., and also takes
control away from userspace.  E.g. a userspace strategy could be to use a
separate backing/pool for shared memory and change the gfn->hva translation
(memslots) in reaction to a shared/private conversion.  Automatically
swizzling things in KVM takes away that option.

IMO, KVM should be entirely "passive" in this process, e.g. the guest shares
or protects memory, userspace calls into the kernel to change state, and the
kernel manages the page tables to prevent bad actors.
KVM simply does the plumbing for the guest page tables.

> +		put_page(page);
> +	}
> +
> +	return 0;
> +}
> +
>  void kvm_sigset_activate(struct kvm_vcpu *vcpu)
>  {
>  	if (!vcpu->sigset_active)
> diff --git a/virt/kvm/mem_protected.c b/virt/kvm/mem_protected.c
> new file mode 100644
> index 000000000000..81882bd3232b
> --- /dev/null
> +++ b/virt/kvm/mem_protected.c
> @@ -0,0 +1,110 @@
> +#include
> +#include
> +#include
> +
> +static DEFINE_XARRAY(kvm_pfn_map);
> +
> +static bool gfn_is_shared(struct kvm *kvm, unsigned long gfn)
> +{
> +	bool ret = false;
> +	int i;
> +
> +	spin_lock(&kvm->mmu_lock);

Taking mmu_lock for write in the page fault path is a non-starter.  The TDP
MMU is being converted to support concurrent faults, which is a foundational
pillar of its scalability.

> +	for (i = 0; i < kvm->nr_shared_ranges; i++) {
> +		if (gfn < kvm->shared_ranges[i].start)
> +			continue;
> +		if (gfn >= kvm->shared_ranges[i].end)
> +			continue;
> +
> +		ret = true;
> +		break;
> +	}
> +	spin_unlock(&kvm->mmu_lock);
> +
> +	return ret;
> +}