Date: Thu, 29 Feb 2024 16:40:14 -0800
From: Elliot Berman <quic_eberman@quicinc.com>
To: Fuad Tabba
Cc: David Hildenbrand, Quentin Perret, Matthew Wilcox, kvm@vger.kernel.org, kvmarm@lists.linux.dev, linux-mm@kvack.org
Subject: Re: folio_mmapped
Message-ID: <20240229114526893-0800.eberman@hu-eberman-lv.qualcomm.com>
References: <40a8fb34-868f-4e19-9f98-7516948fc740@redhat.com> <20240226105258596-0800.eberman@hu-eberman-lv.qualcomm.com> <925f8f5d-c356-4c20-a6a5-dd7efde5ee86@redhat.com> <755911e5-8d4a-4e24-89c7-a087a26ec5f6@redhat.com> <99a94a42-2781-4d48-8b8c-004e95db6bb5@redhat.com>
On Thu, Feb 29, 2024 at 07:01:51PM +0000, Fuad Tabba wrote:
> Hi David,
>
> ...
>
> > >>>> "mmap() the whole thing once and only access what you are supposed to
> > >>>> access" sounds reasonable to me. If you don't play by the rules, you get a
> > >>>> signal.
> > >>>
> > >>> "... you get a signal, or maybe you don't". But yes I understand your
> > >>> point, and as per the above there are real benefits to this approach so
> > >>> why not.
> > >>>
> > >>> What do we expect userspace to do when a page goes from shared back to
> > >>> being guest-private, because e.g. the guest decides to unshare? Use
> > >>> munmap() on that page? Or perhaps an madvise() call of some sort? Note
> > >>> that this will be needed when starting a guest as well, as userspace
> > >>> needs to copy the guest payload into the guestmem file prior to starting
> > >>> the protected VM.
> > >>
> > >> Let's assume we have the whole guest_memfd mapped exactly once in our
> > >> process, a single VMA.
> > >>
> > >> When setting up the VM, we'll write the payload and then fire up the VM.
> > >>
> > >> That will (I assume) trigger some shared -> private conversion.
> > >>
> > >> When we want to convert shared -> private in the kernel, we would first
> > >> check if the page is currently mapped. If it is, we could try unmapping
> > >> that page using an rmap walk.
> > >
> > > I had not considered that. That would most certainly be slow, but a
> > > well-behaved userspace process shouldn't hit it, so that's probably not
> > > a problem...
> >
> > If there really is only a single VMA that covers the page (or even mmaps
> > the guest_memfd), it should not be too bad. For example, any
> > fallocate(PUNCH_HOLE) has to do the same, to unmap the page before
> > discarding it from the pagecache.
>
> I don't think that we can assume that only a single VMA covers a page.
>
> > But of course, no rmap walk is always better.
>
> We've been thinking some more about how to handle the case where the
> host userspace has a mapping of a page that later becomes private.
>
> One idea is to refuse to run the guest (i.e., exit vcpu_run() back to
> the host with a meaningful exit reason) until the host unmaps that
> page, and to check the refcount on the page as you mentioned earlier.
> This is essentially what the RFC I sent does (minus the bugs :) ).
>
> The other idea is to use the rmap walk as you suggested to zap that
> page. If the host tries to access that page again, it would get a
> SIGBUS on the fault. This has the advantage that, as you'd mentioned,
> the host doesn't need to constantly mmap() and munmap() pages. It
> could potentially be optimised further as suggested if we have a
> cooperating VMM that would issue MADV_DONTNEED or something like
> that, but that's just an optimisation and we would still need to have
> the option of the rmap walk. However, I was wondering how practical
> this idea would be if more than a single VMA covers a page?
>

Agree with all your points here. I changed Gunyah's implementation to do
the unmap instead of erroring out, and didn't observe a significant
performance difference. However, doing the unmap might be a little
faster because we can check folio_mapped() before doing the rmap walk;
when erroring out at mmap() level, we always have to do the walk.

> Also, there's the question of what to do if the page is gupped? In
> this case I think the only thing we can do is refuse to run the guest
> until the gup (and all references) are released, which also brings us
> back to the way things (kind of) are...
>

If there are gup users who don't use FOLL_PIN, I think we either need to
fix them or live with the possibility here. We don't have a reliable
refcount telling us a folio is safe to unmap: it might be that another
vCPU is trying to get the same page, has incremented the refcount, and
is waiting for the folio_lock. This problem exists whether we block the
mmap() or deliver SIGBUS.

Thanks,
Elliot