References: <20220915142913.2213336-1-chao.p.peng@linux.intel.com>
 <20220915142913.2213336-2-chao.p.peng@linux.intel.com>
 <84e81d21-c800-4fd5-ad7c-f20bcdd7508b@www.fastmail.com>
In-Reply-To: <84e81d21-c800-4fd5-ad7c-f20bcdd7508b@www.fastmail.com>
From: Fuad Tabba
Date: Fri, 23 Sep 2022 16:20:13 +0100
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
To: Andy Lutomirski
Cc: Sean Christopherson, David Hildenbrand, Chao Peng, kvm list,
 Linux Kernel Mailing List, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, Linux API, linux-doc@vger.kernel.org,
 qemu-devel@nongnu.org, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
 Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
 Borislav Petkov, "the arch/x86 maintainers", "H. Peter Anvin",
 Hugh Dickins, Jeff Layton, "J . Bruce Fields", Andrew Morton, Shuah Khan,
 Mike Rapoport, Steven Price, "Maciej S . Szmigiero", Vlastimil Babka,
 Vishal Annapurve, Yu Zhang, "Kirill A. Shutemov", "Nakajima, Jun",
 Dave Hansen, Andi Kleen, aarcange@redhat.com, ddutile@redhat.com,
 dhildenb@redhat.com, Quentin Perret, Michael Roth, Michal Hocko,
 Muchun Song, wei.w.wang@intel.com, Will Deacon, Marc Zyngier

Hi,

<...>

> > Regarding pKVM's use case, with the shim approach I believe this can
> > be done by allowing userspace to mmap() the "hidden" memfd, but with
> > a ton of restrictions piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM
> > could tightly control usage without taking on too much complexity in
> > the kernel, but working through things, routing the behavior through
> > the shim itself might not be all that horrific.
> >
> > IIRC, we discarded the idea of allowing userspace to map the "private"
> > fd because things got too complex, but with the shim it doesn't seem
> > _that_ bad.
>
> What's the exact use case?  Is it just to pre-populate the memory?

Prepopulate memory and access memory that could go back and forth from
being shared to being private.

Cheers,
/fuad

> >
> > E.g. on the memfd side:
> >
> >   1. The entire memfd must be mapped, and at most one mapping is
> >      allowed, i.e. mapping is all or nothing.
> >
> >   2. Acquiring a reference via get_pfn() is disallowed if there's a
> >      mapping for the restricted memfd.
> >
> >   3. Add notifier hooks to allow downstream users to further restrict
> >      things.
> >
> >   4. Disallow splitting VMAs, e.g. to force userspace to munmap()
> >      everything in one shot.
> >
> >   5. Require that there are no outstanding references at munmap().
> >      Or if this can't be guaranteed by userspace, maybe add some way
> >      for userspace to wait until it's ok to convert to private?  E.g.
> >      so that get_pfn() doesn't need to do an expensive check every
> >      time.
>
> Hmm.  I haven't looked at the code to see if this would really work,
> but I think this could be done more in line with how the rest of the
> kernel works by using the rmap infrastructure.  When the pKVM memfd is
> in not-yet-private mode, just let it be mmapped as usual (but don't
> allow any form of GUP or pinning).  Then have an ioctl to switch to
> shared mode that takes locks or sets flags so that no new faults can be
> serviced and does unmap_mapping_range.
>
> As long as the shim arranges to have its own vm_ops, I don't
> immediately see any reason this can't work.
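
To make the flow quoted above a bit more concrete, here is a toy
userspace model of it. This is only a sketch and not kernel code:
struct rmfd, the rmfd_*() helpers and the errno values are all made up
for illustration. It models "mmap as usual while not yet private", the
"at most one mapping" and "no get_pfn() while mapped" restrictions, and a
conversion step that first stops servicing faults and then drops the
mapping (where the real thing would presumably call
unmap_mapping_range()).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model only; none of these names exist in the kernel. */
enum rmfd_state { RMFD_SHARED, RMFD_PRIVATE };

struct rmfd {
	enum rmfd_state state;
	bool mapped;	/* at most one mapping, all or nothing */
};

void rmfd_init(struct rmfd *f)
{
	f->state = RMFD_SHARED;
	f->mapped = false;
}

/* mmap() is only allowed before conversion, and only once. */
int rmfd_mmap(struct rmfd *f)
{
	if (f->state != RMFD_SHARED)
		return -EPERM;	/* assumed errno */
	if (f->mapped)
		return -EBUSY;	/* assumed errno */
	f->mapped = true;
	return 0;
}

/* A userspace fault is serviced only while shared and mapped. */
int rmfd_fault(const struct rmfd *f)
{
	return (f->state == RMFD_SHARED && f->mapped) ? 0 : -EFAULT;
}

/* get_pfn() is refused while a userspace mapping exists. */
int rmfd_get_pfn(const struct rmfd *f)
{
	return f->mapped ? -EBUSY : 0;
}

/* The conversion ioctl: block new faults, then tear down the mapping. */
int rmfd_convert(struct rmfd *f)
{
	if (f->state == RMFD_PRIVATE)
		return -EINVAL;	/* assumed errno */
	f->state = RMFD_PRIVATE;	/* no new faults from here on */
	f->mapped = false;		/* stand-in for unmap_mapping_range() */
	assert(rmfd_fault(f) == -EFAULT);
	return 0;
}
```

In this model, userspace would rmfd_mmap() once, populate the memory
through ordinary faults, and then call rmfd_convert(); after that,
rmfd_fault() and rmfd_mmap() fail and rmfd_get_pfn() starts to succeed.
Going back from private to shared is left out of the sketch.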