Date: Mon, 16 Sep 2024 13:00:56 -0700
From: Elliot Berman <quic_eberman@quicinc.com>
To: Ackerley Tng <ackerleytng@google.com>
Subject: Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
Message-ID: <20240916120939512-0700.eberman@hu-eberman-lv.qualcomm.com>
References: <24cf7a9b1ee499c4ca4da76e9945429072014d1e.1726009989.git.ackerleytng@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <24cf7a9b1ee499c4ca4da76e9945429072014d1e.1726009989.git.ackerleytng@google.com>

On Tue, Sep 10, 2024 at 11:44:01PM +0000, Ackerley Tng wrote:
> Since guest_memfd now supports mmap(), folios have to be prepared
> before they are faulted into userspace.
>
> When memory attributes are switched between shared and private, the
> up-to-date flags will be cleared.
>
> Use the folio's up-to-date flag to indicate being ready for the guest
> usage and can be used to mark whether the folio is ready for shared OR
> private use.

Clearing the up-to-date flag also means that the page gets zeroed out
whenever it transitions between shared and private (either direction).
pKVM (Android) hypervisor policy can allow in-place conversion between
shared/private.

I believe the important thing is that sev_gmem_prepare() needs to be
called prior to giving the page to the guest. In my series, I had made
a ->prepare_inaccessible() callback where KVM would do only this part.
When transitioning to inaccessible, only that callback would be made,
besides the bookkeeping. The folio zeroing happens once when allocating
the folio, if the folio is initially accessible (faultable).
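To sketch what I mean (untested; kvm_gmem_prepare_inaccessible() is a
made-up name here, and I'm assuming the existing
kvm_arch_gmem_prepare() arch hook that sev_gmem_prepare() sits behind):

	/*
	 * Rough sketch: the to-private transition runs only the vendor
	 * prepare hook plus bookkeeping -- no zeroing here. Zeroing is
	 * done once at allocation time for initially-faultable folios,
	 * not on every conversion.
	 */
	static int kvm_gmem_prepare_inaccessible(struct kvm *kvm, gfn_t gfn,
						 struct folio *folio)
	{
		return kvm_arch_gmem_prepare(kvm, gfn, folio_pfn(folio),
					     folio_order(folio));
	}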
From the x86 CoCo perspective, I think it also makes sense not to zero
the folio when changing faultability from private to shared:

 - If the guest is sharing some data with the host, you've wiped that
   data and the guest has to copy it again.

 - Or, if SEV/TDX enforces that the page is zeroed between transitions,
   Linux has duplicated work that the trusted entity has already done.

Fuad and I can help add some details for the conversion. Hopefully we
can figure out some of the plan at Plumbers this week.

Thanks,
Elliot

>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>
> ---
>  virt/kvm/guest_memfd.c | 131 ++++++++++++++++++++++++++++++++++++++++-
>  virt/kvm/kvm_main.c    |   2 +
>  virt/kvm/kvm_mm.h      |   7 +++
>  3 files changed, 139 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 110c4bbb004b..fb292e542381 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -129,13 +129,29 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
>  }
>
>  /**
> - * Use the uptodate flag to indicate that the folio is prepared for KVM's usage.
> + * Use folio's up-to-date flag to indicate that this folio is prepared for usage
> + * by the guest.
> + *
> + * This flag can be used whether the folio is prepared for PRIVATE or SHARED
> + * usage.
>   */
>  static inline void kvm_gmem_mark_prepared(struct folio *folio)
>  {
>  	folio_mark_uptodate(folio);
>  }
>
> +/**
> + * Use folio's up-to-date flag to indicate that this folio is not yet prepared for
> + * usage by the guest.
> + *
> + * This flag can be used whether the folio is prepared for PRIVATE or SHARED
> + * usage.
> + */
> +static inline void kvm_gmem_clear_prepared(struct folio *folio)
> +{
> +	folio_clear_uptodate(folio);
> +}
> +
>  /*
>   * Process @folio, which contains @gfn, so that the guest can use it.
>   * The folio must be locked and the gfn must be contained in @slot.
> @@ -148,6 +164,12 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>  	pgoff_t index;
>  	int r;
>
> +	/*
> +	 * Defensively zero folio to avoid leaking kernel memory in
> +	 * uninitialized pages. This is important since pages can now be mapped
> +	 * into userspace, where hardware (e.g. TDX) won't be clearing those
> +	 * pages.
> +	 */
>  	if (folio_test_hugetlb(folio)) {
>  		folio_zero_user(folio, folio->index << PAGE_SHIFT);
>  	} else {
> @@ -1017,6 +1039,7 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>  {
>  	struct inode *inode;
>  	struct folio *folio;
> +	bool is_prepared;
>
>  	inode = file_inode(vmf->vma->vm_file);
>  	if (!kvm_gmem_is_faultable(inode, vmf->pgoff))
> @@ -1026,6 +1049,31 @@
>  	if (!folio)
>  		return VM_FAULT_SIGBUS;
>
> +	is_prepared = folio_test_uptodate(folio);
> +	if (!is_prepared) {
> +		unsigned long nr_pages;
> +		unsigned long i;
> +
> +		if (folio_test_hugetlb(folio)) {
> +			folio_zero_user(folio, folio->index << PAGE_SHIFT);
> +		} else {
> +			/*
> +			 * Defensively zero folio to avoid leaking kernel memory in
> +			 * uninitialized pages. This is important since pages can now be
> +			 * mapped into userspace, where hardware (e.g. TDX) won't be
> +			 * clearing those pages.
> +			 *
> +			 * Will probably need a version of kvm_gmem_prepare_folio() to
> +			 * prepare the page for SHARED use.
> +			 */
> +			nr_pages = folio_nr_pages(folio);
> +			for (i = 0; i < nr_pages; i++)
> +				clear_highpage(folio_page(folio, i));
> +		}
> +
> +		kvm_gmem_mark_prepared(folio);
> +	}
> +
>  	vmf->page = folio_file_page(folio, vmf->pgoff);
>  	return VM_FAULT_LOCKED;
>  }
> @@ -1593,6 +1641,87 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
>  }
>  EXPORT_SYMBOL_GPL(kvm_gmem_populate);
>
> +static void kvm_gmem_clear_prepared_range(struct inode *inode, pgoff_t start,
> +					   pgoff_t end)
> +{
> +	pgoff_t index;
> +
> +	filemap_invalidate_lock_shared(inode->i_mapping);
> +
> +	/* TODO: replace iteration with filemap_get_folios() for efficiency. */
> +	for (index = start; index < end;) {
> +		struct folio *folio;
> +
> +		/* Don't use kvm_gmem_get_folio to avoid allocating */
> +		folio = filemap_lock_folio(inode->i_mapping, index);
> +		if (IS_ERR(folio)) {
> +			++index;
> +			continue;
> +		}
> +
> +		kvm_gmem_clear_prepared(folio);
> +
> +		index = folio_next_index(folio);
> +		folio_unlock(folio);
> +		folio_put(folio);
> +	}
> +
> +	filemap_invalidate_unlock_shared(inode->i_mapping);
> +}
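For the TODO above, filemap_get_folios() could batch the xarray walk.
Something along these lines, maybe (untested sketch; it keeps the
shared invalidate lock and locks each folio individually, and note
filemap_get_folios() takes an inclusive end):

	static void kvm_gmem_clear_prepared_range(struct inode *inode,
						  pgoff_t start, pgoff_t end)
	{
		struct folio_batch fbatch;
		pgoff_t index = start;
		unsigned int i;

		filemap_invalidate_lock_shared(inode->i_mapping);

		folio_batch_init(&fbatch);
		while (filemap_get_folios(inode->i_mapping, &index, end - 1,
					  &fbatch)) {
			for (i = 0; i < folio_batch_count(&fbatch); i++) {
				struct folio *folio = fbatch.folios[i];

				folio_lock(folio);
				kvm_gmem_clear_prepared(folio);
				folio_unlock(folio);
			}
			folio_batch_release(&fbatch);
			cond_resched();
		}

		filemap_invalidate_unlock_shared(inode->i_mapping);
	}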
> +
> +/**
> + * Clear the prepared flag for all folios in gfn range [@start, @end) in memslot
> + * @slot.
> + */
> +static void kvm_gmem_clear_prepared_slot(struct kvm_memory_slot *slot, gfn_t start,
> +					 gfn_t end)
> +{
> +	pgoff_t start_offset;
> +	pgoff_t end_offset;
> +	struct file *file;
> +
> +	file = kvm_gmem_get_file(slot);
> +	if (!file)
> +		return;
> +
> +	start_offset = start - slot->base_gfn + slot->gmem.pgoff;
> +	end_offset = end - slot->base_gfn + slot->gmem.pgoff;
> +
> +	kvm_gmem_clear_prepared_range(file_inode(file), start_offset, end_offset);
> +
> +	fput(file);
> +}
> +
> +/**
> + * Clear the prepared flag for all folios for any slot in gfn range
> + * [@start, @end) in @kvm.
> + */
> +void kvm_gmem_clear_prepared_vm(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> +	int i;
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> +		struct kvm_memslot_iter iter;
> +		struct kvm_memslots *slots;
> +
> +		slots = __kvm_memslots(kvm, i);
> +		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> +			struct kvm_memory_slot *slot;
> +			gfn_t gfn_start;
> +			gfn_t gfn_end;
> +
> +			slot = iter.slot;
> +			gfn_start = max(start, slot->base_gfn);
> +			gfn_end = min(end, slot->base_gfn + slot->npages);
> +
> +			if (iter.slot->flags & KVM_MEM_GUEST_MEMFD)
> +				kvm_gmem_clear_prepared_slot(iter.slot, gfn_start, gfn_end);
> +		}
> +	}
> +}
> +
>  /**
>   * Returns true if pages in range [@start, @end) in inode @inode have no
>   * userspace mappings.
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1a7bbcc31b7e..255d27df7f5c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2565,6 +2565,8 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
>  		KVM_BUG_ON(r, kvm);
>  	}
>
> +	kvm_gmem_clear_prepared_vm(kvm, start, end);
> +
>  	kvm_handle_gfn_range(kvm, &post_set_range);
>
>  out_unlock:
> diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
> index d8ff2b380d0e..25fd0d9f66cc 100644
> --- a/virt/kvm/kvm_mm.h
> +++ b/virt/kvm/kvm_mm.h
> @@ -43,6 +43,7 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
>  void kvm_gmem_unbind(struct kvm_memory_slot *slot);
>  int kvm_gmem_should_set_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
>  				   unsigned long attrs);
> +void kvm_gmem_clear_prepared_vm(struct kvm *kvm, gfn_t start, gfn_t end);
>  #else
>  static inline void kvm_gmem_init(struct module *module)
>  {
> @@ -68,6 +69,12 @@ static inline int kvm_gmem_should_set_attributes(struct kvm *kvm, gfn_t start,
>  	return 0;
>  }
>
> +static inline void kvm_gmem_clear_prepared_slots(struct kvm *kvm,
> +						 gfn_t start, gfn_t end)
> +{
> +	WARN_ON_ONCE(1);
> +}
> +
>  #endif /* CONFIG_KVM_PRIVATE_MEM */
>
>  #endif /* __KVM_MM_H__ */
> --
> 2.46.0.598.g6f2099f65c-goog
>
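One nit I noticed while reading the kvm_mm.h hunk: the declaration
added under CONFIG_KVM_PRIVATE_MEM is kvm_gmem_clear_prepared_vm(), but
the stub in the #else branch is named kvm_gmem_clear_prepared_slots(),
so I'd expect the new call in kvm_vm_set_mem_attributes() to break the
build when gmem is disabled. Presumably the stub just needs renaming:

	static inline void kvm_gmem_clear_prepared_vm(struct kvm *kvm,
						      gfn_t start, gfn_t end)
	{
		WARN_ON_ONCE(1);
	}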