Date: Sat, 4 Nov 2023 18:26:55 +0800
From: Xu Yilun <yilun.xu@linux.intel.com>
To: Sean Christopherson
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
 Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexander Viro,
 Christian Brauner, "Matthew Wilcox (Oracle)", Andrew Morton,
 kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 kvmarm@lists.linux.dev, linux-mips@vger.kernel.org,
 linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org,
 linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Xiaoyao Li, Xu Yilun,
 Chao Peng, Fuad Tabba, Jarkko Sakkinen, Anish Moorthy, David Matlack,
 Yu Zhang, Isaku Yamahata, Mickaël Salaün, Vlastimil Babka,
 Vishal Annapurve, Ackerley Tng, Maciej Szmigiero, David Hildenbrand,
 Quentin Perret, Michael Roth, Wang, Liam Merwick, "Kirill A. Shutemov"
Subject: Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
References: <20231027182217.3615211-1-seanjc@google.com> <20231027182217.3615211-17-seanjc@google.com>
In-Reply-To: <20231027182217.3615211-17-seanjc@google.com>
> +KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
> +allows mapping guest_memfd memory into a guest.  All fields shared with
> +KVM_SET_USER_MEMORY_REGION identically.  Userspace can set KVM_MEM_PRIVATE in
> +flags to have KVM bind the memory region to a given guest_memfd range of
> +[guest_memfd_offset, guest_memfd_offset + memory_size].  The target guest_memfd
                                                        ^
The range end should be exclusive, shouldn't it?
> +must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and
> +the target range must not be bound to any other memory region.  All standard
> +bounds checks apply (use common sense).
> +
>  ::
>
>    struct kvm_userspace_memory_region2 {
> @@ -6087,9 +6096,24 @@ applied.
>         __u64 guest_phys_addr;
>         __u64 memory_size; /* bytes */
>         __u64 userspace_addr; /* start of the userspace allocated memory */
> +       __u64 guest_memfd_offset;
> +       __u32 guest_memfd;
> +       __u32 pad1;
> +       __u64 pad2[14];
>    };
>
> [...]
> +static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> +{
> +       const char *anon_name = "[kvm-gmem]";
> +       struct kvm_gmem *gmem;
> +       struct inode *inode;
> +       struct file *file;
> +       int fd, err;
> +
> +       fd = get_unused_fd_flags(0);
> +       if (fd < 0)
> +               return fd;
> +
> +       gmem = kzalloc(sizeof(*gmem), GFP_KERNEL);
> +       if (!gmem) {
> +               err = -ENOMEM;
> +               goto err_fd;
> +       }
> +
> +       /*
> +        * Use the so called "secure" variant, which creates a unique inode
> +        * instead of reusing a single inode.  Each guest_memfd instance needs
> +        * its own inode to track the size, flags, etc.
> +        */
> +       file = anon_inode_getfile_secure(anon_name, &kvm_gmem_fops, gmem,
> +                                        O_RDWR, NULL);
> +       if (IS_ERR(file)) {
> +               err = PTR_ERR(file);
> +               goto err_gmem;
> +       }
> +
> +       file->f_flags |= O_LARGEFILE;
> +
> +       inode = file->f_inode;
> +       WARN_ON(file->f_mapping != inode->i_mapping);

Just curious, why should we check the mapping fields here, which are
guaranteed by another subsystem?

> +
> +       inode->i_private = (void *)(unsigned long)flags;
> +       inode->i_op = &kvm_gmem_iops;
> +       inode->i_mapping->a_ops = &kvm_gmem_aops;
> +       inode->i_mode |= S_IFREG;
> +       inode->i_size = size;
> +       mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> +       mapping_set_unmovable(inode->i_mapping);
> +       /* Unmovable mappings are supposed to be marked unevictable as well.
> +        */
> +       WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> +
> +       kvm_get_kvm(kvm);
> +       gmem->kvm = kvm;
> +       xa_init(&gmem->bindings);
> +       list_add(&gmem->entry, &inode->i_mapping->private_list);
> +
> +       fd_install(fd, file);
> +       return fd;
> +
> +err_gmem:
> +       kfree(gmem);
> +err_fd:
> +       put_unused_fd(fd);
> +       return err;
> +}

[...]

> +int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
> +                 unsigned int fd, loff_t offset)
> +{
> +       loff_t size = slot->npages << PAGE_SHIFT;
> +       unsigned long start, end;
> +       struct kvm_gmem *gmem;
> +       struct inode *inode;
> +       struct file *file;
> +
> +       BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
> +
> +       file = fget(fd);
> +       if (!file)
> +               return -EBADF;
> +
> +       if (file->f_op != &kvm_gmem_fops)
> +               goto err;
> +
> +       gmem = file->private_data;
> +       if (gmem->kvm != kvm)
> +               goto err;
> +
> +       inode = file_inode(file);
> +
> +       if (offset < 0 || !PAGE_ALIGNED(offset))
> +               return -EINVAL;

This should also "goto err"; otherwise the file reference is leaked.

> +
> +       if (offset + size > i_size_read(inode))
> +               goto err;
> +
> +       filemap_invalidate_lock(inode->i_mapping);
> +
> +       start = offset >> PAGE_SHIFT;
> +       end = start + slot->npages;
> +
> +       if (!xa_empty(&gmem->bindings) &&
> +           xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
> +               filemap_invalidate_unlock(inode->i_mapping);
> +               goto err;
> +       }
> +
> +       /*
> +        * No synchronize_rcu() needed, any in-flight readers are guaranteed to
> +        * be see either a NULL file or this new file, no need for them to go
> +        * away.
> +        */
> +       rcu_assign_pointer(slot->gmem.file, file);
> +       slot->gmem.pgoff = start;
> +
> +       xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
> +       filemap_invalidate_unlock(inode->i_mapping);
> +
> +       /*
> +        * Drop the reference to the file, even on success.  The file pins KVM,
> +        * not the other way 'round.  Active bindings are invalidated if the
                                ^
around?

Thanks,
Yilun

> +        * file is closed before memslots are destroyed.
> +        */
> +       fput(file);
> +       return 0;
> +
> +err:
> +       fput(file);
> +       return -EINVAL;
> +}
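
For what it's worth, the offset/size validation and the overlap check in
kvm_gmem_bind() boil down to half-open page ranges, which is also why the
xarray lookup uses "end - 1" (xa_find() takes an inclusive last index).  A
small userspace model of just those checks, purely illustrative and not
from the patch (gmem_bind_check(), struct binding, and the linear scan
standing in for the xarray are all made-up for this sketch):

```c
#include <errno.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Toy stand-in for one existing binding: pages [start, end). */
struct binding {
	unsigned long start;
	unsigned long end;
};

/*
 * Model of the checks in kvm_gmem_bind(): offset must be non-negative and
 * page-aligned, [offset, offset + size) must fit within the file, and the
 * page range must not overlap any existing binding.  All ranges here are
 * half-open, i.e. the end is exclusive.
 */
static int gmem_bind_check(long long offset, unsigned long npages,
			   long long file_size,
			   const struct binding *bound, int nbound)
{
	long long size = (long long)npages << PAGE_SHIFT;
	unsigned long start, end;
	int i;

	if (offset < 0 || (offset & (PAGE_SIZE - 1)))
		return -EINVAL;

	if (offset + size > file_size)
		return -EINVAL;

	start = offset >> PAGE_SHIFT;
	end = start + npages;	/* exclusive */

	for (i = 0; i < nbound; i++) {
		/* Half-open intervals overlap iff each start precedes the other's end. */
		if (start < bound[i].end && bound[i].start < end)
			return -EINVAL;
	}

	return 0;
}
```

With a 16-page file and an existing binding over pages [0, 4), binding 4
pages at offset 4 * PAGE_SIZE succeeds, while offset 3 * PAGE_SIZE overlaps
and fails, which matches the end-exclusive reading asked about above.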