From: Sean Christopherson
To: Jarkko Sakkinen
Cc: Chao Peng, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
	linux-doc@vger.kernel.org, qemu-devel@nongnu.org, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86@kernel.org,
	"H. Peter Anvin", Hugh Dickins, Jeff Layton, "J. Bruce Fields",
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	"Maciej S. Szmigiero", Vlastimil Babka, Vishal Annapurve, Yu Zhang,
	"Kirill A. Shutemov", luto@kernel.org, jun.nakajima@intel.com,
	dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com,
	aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com,
	Quentin Perret, Michael Roth, mhocko@suse.com, Muchun Song,
	wei.w.wang@intel.com
Subject: Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory
Date: Thu, 6 Oct 2022 15:34:58 +0000
References: <20220915142913.2213336-1-chao.p.peng@linux.intel.com>
	<20220915142913.2213336-3-chao.p.peng@linux.intel.com>

On Thu, Oct 06, 2022, Jarkko Sakkinen wrote:
> On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote:
> > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > additional KVM memslot fields, private_fd/private_offset, to allow
> > > userspace to specify that guest private memory is provided from the
> > > private_fd, with guest_phys_addr mapped at the private_offset of the
> > > private_fd, spanning a range of memory_size.
> > >
> > > The extended memslot can still have the userspace_addr (hva). When in use,
> > > a single memslot can maintain both private memory through the private
> > > fd (private_fd/private_offset) and shared memory through the
> > > hva (userspace_addr). Whether the private or shared part is visible to
> > > the guest is maintained by other KVM code.
> >
> > What is the appeal of the private_offset field anyway, instead of just
> > having a 1:1 association between regions and files, i.e. one memfd per
> > region?

Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's
VMM).  E.g. if a vCPU converts a single page, it will be forced to wait until
all other vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM
is faulting in memory.  KVM's memslot updates also hold a mutex for the entire
duration of the update, i.e. conversions on different vCPUs would be fully
serialized, exacerbating the SRCU problem.  KVM also has historical baggage
where it "needs" to zap _all_ SPTEs when any memslot is deleted.

Taking both a private_fd and a shared userspace address allows userspace to
convert between private and shared without having to manipulate memslots.

Paolo's original idea (was sent off-list):

 : The problem is that KVM_SET_USER_MEMORY_REGION and memslots in general
 : are designed around (S)RCU. It is way too slow (in both QEMU and KVM)
 : to be called on every private<->shared conversion with 4K granularity,
 : and it tends naturally to have quadratic behavior (though, at least for
 : KVM, the in-progress "fast memslots" series would avoid that).
 :
 : Since private PTEs are persistent, and userspace cannot access the memfd
 : in any other way, userspace could use fallocate() to map/unmap an
 : address range as private, and KVM can treat everything that userspace
 : hasn't mapped as shared.
 :
 : This would be a new entry in struct guest_ops, called by fallocate(),
 : and the callback can take the mmu_lock for write to avoid racing with
 : page faults. This doesn't add any more contention than
 : KVM_SET_USER_MEMORY_REGION, since the latter takes slots_lock. If
 : there's something I'm missing then the mapping operation can use an
 : ioctl, while the unmapping can keep using FALLOC_FL_PUNCH_HOLE.
 :
 : Then:
 :
 : - for simplicity, mapping a private memslot fails if there are any
 : mappings (similar to the handling when F_SEAL_GUEST is set).
 :
 : - for TDX, accessing a nonexistent private PTE will cause a userspace
 : exit for a shared->private conversion request. For SNP, the guest will
 : do a page state change VMGEXIT to request an RMPUPDATE, which can cause
 : a userspace exit too; the consequent fallocate() on the private fd
 : invokes RMPUPDATE.
 :
 : - trying to map a shared PTE where there's already a private PTE causes
 : a userspace exit for a private->shared conversion request.
 : kvm_faultin_pfn or handle_abnormal_pfn can query this in the private-fd
 : inode, which is essentially a single pagecache_get_page call.
 :
 : - if userspace asks to map a private PTE where there's already a shared
 : PTE (which it can check because it has the mmu_lock taken for write),
 : KVM unmaps the shared PTE.
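
As a rough illustration of that flow, a minimal userspace sketch is below.
This is only a sketch: the helper names are made up, and it assumes plain
fallocate()/FALLOC_FL_PUNCH_HOLE on the private fd is what drives the
conversions, which is not necessarily the series' final uAPI.

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <linux/falloc.h>

	/* Back [offset, offset + len) in the private fd; KVM then treats the
	 * range as private to the guest. */
	static int guest_mem_make_private(int private_fd, off_t offset, off_t len)
	{
		return fallocate(private_fd, 0, offset, len);
	}

	/* Punch a hole in the private fd; the range falls back to the shared
	 * (userspace_addr) backing. */
	static int guest_mem_make_shared(int private_fd, off_t offset, off_t len)
	{
		return fallocate(private_fd,
				 FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				 offset, len);
	}

I.e. flipping a page between private and shared only touches the private fd's
backing state; the memslot, and the shared hva half of it, stay untouched.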
> >
> > If this was the case, then an extended struct would not be needed in the
> > first place. A simple union inside the existing struct would do:
> >
> > union {
> > 	__u64 userspace_addr;
> > 	__u64 private_fd;
> > };
>
> Also, why is this mechanism just for fds with the MFD_INACCESSIBLE flag? I'd
> consider instead having a KVM_MEM_FD flag. For generic KVM (if the memfd does
> not have MFD_INACCESSIBLE set), KVM could just use the memory as it uses
> mapped memory today. This would simplify userspace code, as you can then use
> the same thing for both cases.

I explored this idea too[*].  Because we want to support specifying both the
private and shared backing stores in a single memslot, we need two file
descriptors so that shared memory can also use fd-based memory.

[*] https://lore.kernel.org/all/YulTH7bL4MwT5v5K@google.com
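
For reference, the dual backing store looks roughly like the sketch below:
the standard region (userspace_addr for the shared half) plus a private
fd/offset pair, selected by KVM_MEM_PRIVATE.  The private_fd, private_offset
and KVM_MEM_PRIVATE names come from the series; the exact struct layout,
padding, flag value and the helper are approximations, not the final uAPI.

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Approximate shape of the extended memslot uAPI from the series. */
	struct kvm_userspace_memory_region_ext {
		struct kvm_userspace_memory_region region; /* slot, flags, gpa, size, hva */
		__u64 private_offset;	/* offset into private_fd for this slot */
		__u32 private_fd;	/* fd of the inaccessible (private) memfd */
		__u32 pad1;
		__u64 pad2[14];
	};

	#ifndef KVM_MEM_PRIVATE
	#define KVM_MEM_PRIVATE	(1UL << 2)	/* placeholder; defined by the series */
	#endif

	/* Hypothetical helper: one memslot, shared half via hva, private half via fd. */
	static int set_private_memslot(int vm_fd, __u32 slot, __u64 gpa, __u64 size,
				       void *shared_hva, int private_fd,
				       __u64 private_offset)
	{
		struct kvm_userspace_memory_region_ext ext = {
			.region = {
				.slot            = slot,
				.flags           = KVM_MEM_PRIVATE,
				.guest_phys_addr = gpa,
				.memory_size     = size,
				.userspace_addr  = (__u64)shared_hva,	/* shared backing */
			},
			.private_fd     = (__u32)private_fd,		/* private backing */
			.private_offset = private_offset,
		};

		return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
	}

With both halves in one slot, a private<->shared conversion never has to
delete or recreate the memslot; it only moves backing between the two fds.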