References: <20221025151344.3784230-1-chao.p.peng@linux.intel.com> <87k03xbvkt.fsf@linaro.org> <20221116050022.GC364614@chaop.bj.intel.com>
User-agent: mu4e 1.9.2; emacs 28.2.50
From: Alex Bennée
To: Chao Peng
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, linux-arch@vger.kernel.org,
 linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
 qemu-devel@nongnu.org, Paolo Bonzini, Jonathan Corbet,
 Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
 Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
 x86@kernel.org, "H . Peter Anvin", Hugh Dickins, Jeff Layton,
 "J . Bruce Fields", Andrew Morton, Shuah Khan, Mike Rapoport,
 Steven Price, "Maciej S . Szmigiero", Vlastimil Babka,
 Vishal Annapurve, Yu Zhang, "Kirill A . Shutemov", luto@kernel.org,
 jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com,
 david@redhat.com, aarcange@redhat.com, ddutile@redhat.com,
 dhildenb@redhat.com, Quentin Perret, tabba@google.com, Michael Roth,
 mhocko@suse.com, Muchun Song, wei.w.wang@intel.com, Viresh Kumar,
 Mathieu Poirier, AKASHI Takahiro
Subject: Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM
Date: Wed, 16 Nov 2022 09:40:23 +0000
In-reply-to: <20221116050022.GC364614@chaop.bj.intel.com>
Message-ID: <87v8nf8bte.fsf@linaro.org>

Chao Peng writes:

> On Mon, Nov 14, 2022 at 11:43:37AM +0000, Alex Bennée wrote:
>>
>> Chao Peng writes:
>>
>> > Introduction
>> > ============
>> > KVM userspace being able to crash the host is horrible. Under current
>> > KVM architecture, all guest memory is inherently accessible from KVM
>> > userspace and is exposed to the mentioned crash issue. The goal of this
>> > series is to provide a solution to align mm and KVM, on a userspace
>> > inaccessible approach of exposing guest memory.
>> >
>> > Normally, KVM populates secondary page table (e.g. EPT) by using a host
>> > virtual address (hva) from core mm page table (e.g. x86 userspace page
>> > table). This requires guest memory being mmaped into KVM userspace, but
>> > this is also the source where the mentioned crash issue can happen. In
>> > theory, apart from those 'shared' memory for device emulation etc, guest
>> > memory doesn't have to be mmaped into KVM userspace.
>> >
>> > This series introduces fd-based guest memory which will not be mmaped
>> > into KVM userspace.
>> > KVM populates secondary page table by using a
>> > fd/offset pair backed by a memory file system. The fd can be created
>> > from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
>> > directly interact with them with newly introduced in-kernel interface,
>> > therefore remove the KVM userspace from the path of accessing/mmaping
>> > the guest memory.
>> >
>> > Kirill had a patch [2] to address the same issue in a different way. It
>> > tracks guest encrypted memory at the 'struct page' level and relies on
>> > HWPOISON to reject the userspace access. The patch has been discussed in
>> > several online and offline threads and resulted in a design document [3]
>> > which is also the original proposal for this series. Later this patch
>> > series evolved as more comments received in community but the major
>> > concepts in [3] still hold true so recommend reading.
>> >
>> > The patch series may also be useful for other usages, for example, pure
>> > software approach may use it to harden itself against unintentional
>> > access to guest memory. This series is designed with these usages in
>> > mind but doesn't have code directly support them and extension might be
>> > needed.
>>
>> There are a couple of additional use cases where having a consistent
>> memory interface with the kernel would be useful.
>
> Thanks very much for the info. But I'm not so confident that the current
> memfd_restricted() implementation can be useful for all these usages.
>
>>
>> - Xen DomU guests providing other domains with VirtIO backends
>>
>> Xen by default doesn't give other domains special access to a domain's
>> memory. The guest can grant access to regions of its memory to other
>> domains for this purpose.
>
> I'm trying to form my understanding on how this could work and what's
> the benefit for a DomU guest to provide memory through memfd_restricted().
> AFAICS, memfd_restricted() can help to hide the memory from DomU userspace,
> but I assume VirtIO backends are still in DomU userspace and need access
> that memory, right?

They need access to parts of the memory. At the moment you run your
VirtIO domains in the Dom0 and give them access to the whole of a DomU's
address space. However, in the Xen model a guest's memory is by default
inaccessible to other domains on the system. The DomU guest uses the Xen
grant model to expose portions of its address space to other domains,
namely the VirtIO queues themselves and any pages containing buffers
involved in the VirtIO transaction. My thought was that this looks like
a guest memory interface which is mostly inaccessible (private) with
some holes in it where memory is being explicitly shared with other
domains.

What I want to achieve is a common userspace API with defined semantics
for what happens when private and shared regions are accessed. Having
each hypervisor/confidential computing architecture define its own
special API for accessing this memory is just a recipe for fragmentation
and makes sharing common VirtIO backends impossible.

>
>>
>> - pKVM on ARM
>>
>> Similar to Xen, pKVM moves the management of the page tables into the
>> hypervisor and again doesn't allow those domains to share memory by
>> default.
>
> Right, we already had some discussions on this in the past versions.
>
>>
>> - VirtIO loopback
>>
>> This allows for VirtIO devices for the host kernel to be serviced by
>> backends running in userspace. Obviously the memory userspace is
>> allowed to access is strictly limited to the buffers and queues
>> because giving userspace unrestricted access to the host kernel would
>> have consequences.
>
> Okay, but normal memfd_create() should work for it, right? And
> memfd_restricted() instead may not work as it unmaps the memory from
> userspace.
>
>>
>> All of these VirtIO backends work with vhost-user which uses memfds to
>> pass references to guest memory from the VMM to the backend
>> implementation.
>
> Sounds to me these are the places where normal memfd_create() can act on.
> VirtIO backends work on the mmap-ed memory which currently is not the
> case for memfd_restricted(). memfd_restricted() has different design
> purpose that unmaps the memory from userspace and employs some kernel
> callbacks so other kernel modules can make use of the memory with these
> callbacks instead of userspace virtual address.

Maybe my understanding is backwards then. Are you saying a guest starts
with all its memory exposed and then selectively unmaps the private
regions? Is this driven by the VMM or the guest itself?

-- 
Alex Bennée