From: Vishal Annapurve <vannapurve@google.com>
Date: Thu, 20 Oct 2022 16:20:58 +0530
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
To: "Kirill A. Shutemov"
Cc: "Gupta, Pankaj", Vlastimil Babka, Chao Peng, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, Paolo Bonzini, Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86@kernel.org, "H. Peter Anvin", Hugh Dickins, Jeff Layton, "J. Bruce Fields", Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price, "Maciej S. Szmigiero", Yu Zhang, luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret, Michael Roth, mhocko@suse.com, Muchun Song, wei.w.wang@intel.com
In-Reply-To: <20221019153225.njvg45glehlnjgc7@box.shutemov.name>
References: <20220915142913.2213336-1-chao.p.peng@linux.intel.com> <20220915142913.2213336-2-chao.p.peng@linux.intel.com> <20221017161955.t4gditaztbwijgcn@box.shutemov.name> <20221017215640.hobzcz47es7dq2bi@box.shutemov.name> <20221019153225.njvg45glehlnjgc7@box.shutemov.name>
Content-Type: text/plain; charset="UTF-8"
On Wed, Oct 19, 2022 at 9:02 PM Kirill A. Shutemov wrote:
>
> On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> > On Tue, Oct 18, 2022 at 3:27 AM Kirill A. Shutemov wrote:
> > >
> > > On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > > > On 10/17/2022 6:19 PM, Kirill A. Shutemov wrote:
> > > > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > > > From: "Kirill A. Shutemov"
> > > > > > >
> > > > > > > KVM can use memfd-provided memory for guest memory. For normal
> > > > > > > userspace-accessible memory, KVM userspace (e.g. QEMU) mmaps the
> > > > > > > memfd into its virtual address space and then tells KVM to use
> > > > > > > the virtual address to set up the mapping in the secondary page
> > > > > > > table (e.g. EPT).
> > > > > > >
> > > > > > > With confidential computing technologies like Intel TDX, the
> > > > > > > memfd-provided memory may be encrypted with a special key for a
> > > > > > > special software domain (e.g. a KVM guest) and is not expected
> > > > > > > to be directly accessed by userspace. Precisely, userspace
> > > > > > > access to such encrypted memory may lead to a host crash, so it
> > > > > > > should be prevented.
> > > > > > >
> > > > > > > This patch introduces the userspace inaccessible memfd (created
> > > > > > > with MFD_INACCESSIBLE). Its memory is inaccessible from
> > > > > > > userspace through ordinary MMU access (e.g. read/write/mmap)
> > > > > > > but can be accessed via an in-kernel interface, so KVM can
> > > > > > > directly interact with core-mm without the need to map the
> > > > > > > memory into KVM userspace.
> > > > > > >
> > > > > > > It provides the semantics required for KVM guest private
> > > > > > > (encrypted) memory support: a file descriptor with this flag
> > > > > > > set is going to be used as the source of guest memory in
> > > > > > > confidential computing environments such as Intel TDX/AMD SEV.
> > > > > > >
> > > > > > > KVM userspace is still in charge of the lifecycle of the memfd.
> > > > > > > It should pass the opened fd to KVM. KVM uses the kernel APIs
> > > > > > > newly added in this patch to obtain the physical memory address
> > > > > > > and then populate the secondary page table entries.
> > > > > > >
> > > > > > > The userspace inaccessible memfd can be fallocate-ed and
> > > > > > > hole-punched from userspace. When hole-punching happens, KVM
> > > > > > > can get notified through inaccessible_notifier; it then gets a
> > > > > > > chance to remove any mapped entries of the range in the
> > > > > > > secondary page tables.
> > > > > > >
> > > > > > > The userspace inaccessible memfd itself is implemented as a
> > > > > > > shim layer on top of real memory file systems like
> > > > > > > tmpfs/hugetlbfs, but this patch only implements tmpfs. The
> > > > > > > allocated memory is currently marked as unmovable and
> > > > > > > unevictable; this is required for current confidential usage.
> > > > > > > But in the future this might be changed.
> > > > > > >
> > > > > > > Signed-off-by: Kirill A. Shutemov
> > > > > > > Signed-off-by: Chao Peng
> > > > > > > ---
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > > > +                                   loff_t offset, loff_t len)
> > > > > > > +{
> > > > > > > +       struct inaccessible_data *data = file->f_mapping->private_data;
> > > > > > > +       struct file *memfd = data->memfd;
> > > > > > > +       int ret;
> > > > > > > +
> > > > > > > +       if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > > > +               if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > > > +                       return -EINVAL;
> > > > > > > +       }
> > > > > > > +
> > > > > > > +       ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > > > +       inaccessible_notifier_invalidate(data, offset, offset + len);
> > > > > >
> > > > > > Wonder if invalidate should precede the actual hole punch, otherwise
> > > > > > we open a window where the page tables point to memory no longer
> > > > > > valid?
> > > > >
> > > > > Yes, you are right. Thanks for catching this.
> > > >
> > > > I also noticed this. But then thought the memory would be anyways
> > > > zeroed (hole punched) before this call?
> > >
> > > Hole punching can free pages, given that offset/len covers full page.
> > >
> > > --
> > > Kiryl Shutsemau / Kirill A. Shutemov
> >
> > I think moving this notifier_invalidate before fallocate may not solve
> > the problem completely. Is it possible that between invalidate and
> > fallocate, KVM tries to handle the page fault for the guest VM from
> > another vcpu and uses the pages to be freed to back gpa ranges? Should
> > hole punching here also update mem_attr first to say that KVM should
> > consider the corresponding gpa ranges to be no more backed by
> > inaccessible memfd?
>
> We rely on external synchronization to prevent this. See code around
> mmu_invalidate_retry_hva().
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov

IIUC, mmu_invalidate_retry_hva/gfn ensures that page faults on gfn
ranges that are being invalidated are retried till invalidation is
complete.
In this case, is it possible that KVM tries to serve the page fault
after inaccessible_notifier_invalidate() is complete but before
fallocate() could punch a hole into the file? e.g.:

  inaccessible_notifier_invalidate(...)
  ... (system event preempting this control flow, giving a window for
      the guest to retry accessing the gfn range which was invalidated)
  fallocate(.., PUNCH_HOLE, ..)
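For readers following the thread: the "external synchronization" Kirill points at is, roughly, a sequence-count handshake between the invalidation path and the page-fault path. Below is a toy Python model of that retry logic (names like ToyMmuNotifier are illustrative, not the kernel's; the real logic lives around mmu_invalidate_retry()/mmu_invalidate_seq in KVM's mmu-notifier handling, under mmu_lock):

```python
import threading

class ToyMmuNotifier:
    """Toy model of KVM's invalidate/retry handshake (illustrative only).

    A page fault snapshots the sequence count before installing a mapping;
    if an invalidation started (or is still running) since the snapshot,
    the fault must throw its work away and retry.
    """

    def __init__(self):
        self.lock = threading.Lock()
        self.invalidate_seq = 0            # bumped when an invalidation begins
        self.invalidate_in_progress = False

    def invalidate_range_start(self):
        # Called before the backing pages are actually freed (hole punch).
        with self.lock:
            self.invalidate_seq += 1
            self.invalidate_in_progress = True

    def invalidate_range_end(self):
        with self.lock:
            self.invalidate_in_progress = False

    def fault_begin(self):
        # Page-fault path: snapshot the sequence count up front.
        with self.lock:
            return self.invalidate_seq

    def fault_must_retry(self, snapshot_seq):
        # Just before installing the mapping: retry if an invalidation
        # began after the snapshot, or one is still in flight.
        with self.lock:
            return (self.invalidate_in_progress
                    or self.invalidate_seq != snapshot_seq)

mmu = ToyMmuNotifier()

# Uncontended fault: nothing raced, the mapping may be installed.
seq = mmu.fault_begin()
assert not mmu.fault_must_retry(seq)

# Racing fault: an invalidation begins after the snapshot, so the fault
# path must retry rather than map pages that are about to be freed.
seq = mmu.fault_begin()
mmu.invalidate_range_start()
assert mmu.fault_must_retry(seq)
mmu.invalidate_range_end()
assert mmu.fault_must_retry(seq)   # seq moved on; still retry once
```

The key property is that the invalidate notifier runs (and bumps the count) before the pages are freed, so any fault that could observe the freed pages is forced through a retry. Whether that ordering holds across the invalidate/fallocate window is exactly what the question above is probing.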