Date: Thu, 13 Oct 2022 21:34:57 +0800
From: Chao Peng
To: Fuad Tabba
Cc: Sean Christopherson, David Hildenbrand, kvm@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
 linux-doc@vger.kernel.org, qemu-devel@nongnu.org, Paolo Bonzini,
 Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
 Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86@kernel.org,
 "H . Peter Anvin", Hugh Dickins, Jeff Layton, "J . Bruce Fields",
 Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
 "Maciej S . Szmigiero", Vlastimil Babka, Vishal Annapurve, Yu Zhang,
 "Kirill A . Shutemov", luto@kernel.org, jun.nakajima@intel.com,
 dave.hansen@intel.com, ak@linux.intel.com, aarcange@redhat.com,
 ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret, Michael Roth,
 mhocko@suse.com, Muchun Song, wei.w.wang@intel.com, Will Deacon,
 Marc Zyngier
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
Message-ID: <20221013133457.GA3263142@chaop.bj.intel.com>
References: <20220915142913.2213336-1-chao.p.peng@linux.intel.com>
 <20220915142913.2213336-2-chao.p.peng@linux.intel.com>
 <20220926142330.GC2658254@chaop.bj.intel.com>

On Fri, Sep 30, 2022 at 05:19:00PM +0100, Fuad Tabba wrote:
> Hi,
>
> On Tue, Sep 27, 2022 at 11:47 PM Sean Christopherson wrote:
> >
> > On Mon, Sep 26, 2022, Fuad Tabba wrote:
> > > Hi,
> > >
> > > On Mon, Sep 26, 2022 at 3:28 PM Chao Peng wrote:
> > > >
> > > > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > > > >
> > > > > > 1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > > > > >    memory into the guest (after pre-boot phase).
> > > > > >
> > > > > > 2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > > > > >    and only if the entire gfn range of the associated memslot is shared.
> > > > >
> > > > > In general I think that this would work with pKVM. However, limiting
> > > > > private<->shared conversions to the granularity of a whole memslot
> > > > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > > > concept of memslots. For example, in pKVM right now, when a guest
> > > > > shares back its restricted DMA pool with the host it does so at the
> > > > > page-level.
> >
> > Y'all are killing me :-)
>
> :D
>
> > Isn't the guest enlightened? E.g. can't you tell the guest "thou shalt share at
> > granularity X"?
> > With KVM's newfangled scalable memslots and per-vCPU MRU slot,
> > X doesn't even have to be that high to get reasonable performance, e.g. assuming
> > the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is supposed to
> > work just fine in KVM.
>
> The guest is potentially enlightened, but the host doesn't necessarily
> know which memslot the guest might want to share back, since it
> doesn't know where the guest might want to place the DMA pool. If I
> understand this correctly, for this to work, all memslots would need
> to be the same size and sharing would always need to happen at that
> granularity.
>
> Moreover, for something like a small DMA pool this might scale, but
> I'm not sure about potential future workloads (e.g., multimedia
> in-place sharing).
>
> > >
> > > > > pKVM would also need a way to make an fd accessible again
> > > > > when shared back, which I think isn't possible with this patch.
> > > >
> > > > But does pKVM really want to mmap/munmap a new region at the page-level,
> > > > that can cause VMA fragmentation if the conversion is frequent as I see.
> > > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > > be the same issue.
> > >
> > > pKVM doesn't really need to unmap the memory. What is really important
> > > is that the memory is not GUP'able.
> >
> > Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> > otherwise KVM wouldn't be able to get the PFN to map into guest memory.
> >
> > The problem is that gup() and "mapped" are tied together. So yes, pKVM doesn't
> > strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> > the end result is the same.
> >
> > Emphasis above because pKVM still needs to unmap the memory _somewhere_. IIUC, the
> > current approach is to do that only in the stage-2 page tables, i.e. only in the
> > context of the hypervisor.
> > Which is also the source of the gup() problems; the
> > untrusted kernel is blissfully unaware that the memory is inaccessible.
> >
> > Any approach that moves some of that information into the untrusted kernel so that
> > the kernel can protect itself will incur fragmentation in the VMAs. Well, unless
> > all of guest memory becomes unguppable, but that's likely not a viable option.
>
> Actually, for pKVM, there is no need for the guest memory to be
> GUP'able at all if we use the new inaccessible_get_pfn().

If pKVM can use inaccessible_get_pfn() to get the pfn and avoid GUP (I
think that is the major concern?), do you see any other gap in the
existing API?

> This of
> course goes back to what I'd mentioned before in v7; it seems that
> representing the memslot memory as a file descriptor should be
> orthogonal to whether the memory is shared or private, rather than a
> private_fd for private memory and the userspace_addr for shared
> memory. The host can then map or unmap the shared/private memory using
> the fd, which allows it more freedom in even choosing to unmap shared
> memory when not needed, for example.

Using both private_fd and userspace_addr is only needed for TDX and other
confidential computing scenarios. pKVM could use private_fd alone if the
fd can also be mmap'ed as a whole into userspace, as Sean suggested.

Thanks,
Chao
>
> Cheers,
> /fuad
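
For reference, the "only" 1024 memslots figure quoted above can be checked
with a few lines of C. This is only a sketch of the arithmetic:
memslots_needed() is a hypothetical helper, not a KVM function, and the 2MiB
granularity is an assumption inferred from 2GiB / 1024, since the thread
never states a value for X.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a KVM API: number of equally sized memslots
 * needed to cover a pool when sharing happens at memslot granularity.
 */
static uint64_t memslots_needed(uint64_t pool_bytes, uint64_t granule_bytes)
{
	assert(granule_bytes != 0);

	/* Round up so a partially used last granule still gets its own slot. */
	return pool_bytes / granule_bytes + (pool_bytes % granule_bytes != 0);
}
```

With a 2GiB pool and an assumed 2MiB granule this gives 1024 slots. At 4KiB
page granularity the same pool would need 524288 slots, which is the scaling
concern raised for page-level sharing.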