From: Fuad Tabba
Date: Fri, 30 Sep 2022 17:19:00 +0100
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
To: Sean Christopherson
Cc: Chao Peng, David Hildenbrand, kvm@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
 linux-doc@vger.kernel.org, qemu-devel@nongnu.org, Paolo Bonzini,
 Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
 Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
 x86@kernel.org, "H . Peter Anvin", Hugh Dickins, Jeff Layton,
 "J . Bruce Fields", Andrew Morton, Shuah Khan, Mike Rapoport,
 Steven Price, "Maciej S . Szmigiero", Vlastimil Babka,
 Vishal Annapurve, Yu Zhang, "Kirill A . Shutemov", luto@kernel.org,
 jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com,
 aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com,
 Quentin Perret, Michael Roth, mhocko@suse.com, Muchun Song,
 wei.w.wang@intel.com, Will Deacon, Marc Zyngier

Hi,

On Tue, Sep 27, 2022 at 11:47 PM Sean Christopherson wrote:
>
> On Mon, Sep 26, 2022, Fuad Tabba wrote:
> > Hi,
> >
> > On Mon, Sep 26, 2022 at 3:28 PM Chao Peng wrote:
> > >
> > > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > > >
> > > > > 1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > > > >    memory into the guest (after pre-boot phase).
> > > > >
> > > > > 2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > > > >    and only if the entire gfn range of the associated memslot is shared.
> > > >
> > > > In general I think that this would work with pKVM. However, limiting
> > > > private<->shared conversions to the granularity of a whole memslot
> > > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > > concept of memslots. For example, in pKVM right now, when a guest
> > > > shares back its restricted DMA pool with the host it does so at the
> > > > page-level.
>
> Y'all are killing me :-)

:D

> Isn't the guest enlightened? E.g. can't you tell the guest "thou shalt share at
> granularity X"? With KVM's newfangled scalable memslots and per-vCPU MRU slot,
> X doesn't even have to be that high to get reasonable performance, e.g. assuming
> the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is supposed to
> work just fine in KVM.

The guest is potentially enlightened, but the host doesn't necessarily
know which memslot the guest might want to share back, since it
doesn't know where the guest might want to place the DMA pool. If I
understand this correctly, for this to work, all memslots would need
to be the same size and sharing would always need to happen at that
granularity.
Moreover, for something like a small DMA pool this might scale, but
I'm not sure about potential future workloads (e.g., multimedia
in-place sharing).

>
> > > > pKVM would also need a way to make an fd accessible again
> > > > when shared back, which I think isn't possible with this patch.
> > >
> > > But does pKVM really want to mmap/munmap a new region at the page-level,
> > > that can cause VMA fragmentation if the conversion is frequent as I see.
> > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > be the same issue.
> >
> > pKVM doesn't really need to unmap the memory. What is really important
> > is that the memory is not GUP'able.
>
> Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> otherwise KVM wouldn't be able to get the PFN to map into guest memory.
>
> The problem is that gup() and "mapped" are tied together. So yes, pKVM doesn't
> strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> the end result is the same.
>
> Emphasis above because pKVM still needs to unmap the memory _somewhere_. IIUC, the
> current approach is to do that only in the stage-2 page tables, i.e. only in the
> context of the hypervisor. Which is also the source of the gup() problems; the
> untrusted kernel is blissfully unaware that the memory is inaccessible.
>
> Any approach that moves some of that information into the untrusted kernel so that
> the kernel can protect itself will incur fragmentation in the VMAs. Well, unless
> all of guest memory becomes unguppable, but that's likely not a viable option.

Actually, for pKVM, there is no need for the guest memory to be
GUP'able at all if we use the new inaccessible_get_pfn().
This of course goes back to what I'd mentioned before in v7; it seems
that representing the memslot memory as a file descriptor should be
orthogonal to whether the memory is shared or private, rather than a
private_fd for private memory and the userspace_addr for shared
memory. The host can then map or unmap the shared/private memory using
the fd, which allows it more freedom in even choosing to unmap shared
memory when not needed, for example.

Cheers,
/fuad
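
For reference, the numbers in the "only 1024 memslots" estimate quoted
above can be checked with a few lines of arithmetic. This is only a
rough sketch: the helper names are made up for illustration, and the
4 KiB page size used for the page-level case is an assumption, not
something stated in the thread.

```python
GIB = 1 << 30
MIB = 1 << 20
KIB = 1 << 10


def slot_size(pool_bytes: int, nr_slots: int) -> int:
    """Bytes covered by each memslot when a pool is split into equal slots."""
    if pool_bytes % nr_slots:
        raise ValueError("pool does not divide evenly into slots")
    return pool_bytes // nr_slots


def slots_needed(pool_bytes: int, granule_bytes: int) -> int:
    """Memslots needed to cover a pool at a given sharing granularity."""
    # Ceiling division, so a partial granule still gets its own slot.
    return -(-pool_bytes // granule_bytes)


pool = 2 * GIB  # upper bound on the DMA pool used in the quoted estimate

# 1024 equal memslots over 2 GiB means each slot covers 2 MiB.
print(slot_size(pool, 1024) // MIB)

# Page-level sharing (4 KiB pages assumed) would need one slot per page.
print(slots_needed(pool, 4 * KIB))
```

Under those assumptions, sharing a 2 GiB pool at 2 MiB granularity gives
the 1024 slots from the estimate, while one-slot-per-page sharing would
need 524288 slots, which is the gap behind the granularity concern.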