Date: Thu, 31 Oct 2019 08:21:13 +0100
From: Mike Rapoport
To: Andy Lutomirski
Cc: LKML, Alexey Dobriyan, Andrew Morton, Arnd Bergmann, Borislav Petkov,
    Dave Hansen, James Bottomley, Peter Zijlstra, Steven Rostedt,
    Thomas Gleixner, Ingo Molnar, "H. Peter Anvin", Linux API, Linux-MM,
    X86 ML, Mike Rapoport
Subject: Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings
Message-ID: <20191031072112.GA6990@rapoport-lnx>
References: <1572171452-7958-1-git-send-email-rppt@kernel.org>
    <20191029093254.GE18773@rapoport-lnx>
    <20191030084005.GC20624@rapoport-lnx>

On Wed, Oct 30, 2019 at 02:28:21PM -0700, Andy Lutomirski wrote:
> On Wed, Oct 30, 2019 at 1:40 AM Mike Rapoport wrote:
> >
> > On Tue, Oct 29, 2019 at 10:00:55AM -0700, Andy Lutomirski wrote:
> > > On Tue, Oct 29, 2019 at 2:33 AM Mike Rapoport wrote:
> > > >
> > > > On Mon, Oct 28, 2019 at 02:44:23PM -0600, Andy Lutomirski wrote:
> > > > >
> > > > > > On Oct 27, 2019, at 4:17 AM, Mike Rapoport wrote:
> > > > > >
> > > > > > From: Mike Rapoport
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > The patch below aims to allow applications to create mappings that have
> > > > > > pages visible only to the owning process. Such mappings could be used to
> > > > > > store secrets so that these secrets are visible neither to other
> > > > > > processes nor to the kernel.
> > > > > >
> > > > > > I've only tested the basic functionality, the changes should be verified
> > > > > > against THP/migration/compaction. Yet, I'd appreciate early feedback.
> > > > >
> > > > > I’ve contemplated the concept a fair amount, and I think you should
> > > > > consider a change to the API. In particular, rather than having it be a
> > > > > MAP_ flag, make it a chardev. You can, at least at first, allow only
> > > > > MAP_SHARED, and admins can decide who gets to use it.
> > > > > It might also play better with the VM overall, and you won’t need a
> > > > > VM_ flag for it -- you can just wire up .fault to do the right thing.
> > > >
> > > > I think mmap()/mprotect()/madvise() are the natural APIs for such
> > > > interface.
> > >
> > > Then you have a whole bunch of questions to answer. For example:
> > >
> > > What happens if you mprotect() or similar when the mapping is already
> > > in use in a way that's incompatible with MAP_EXCLUSIVE?
> >
> > Then we refuse to mprotect()? Like in any other case when vm_flags are not
> > compatible with required madvise()/mprotect() operation.
> >
>
> I'm not talking about flags. I'm talking about the case where one
> thread (or RDMA or whatever) has get_user_pages()'d a mapping and
> another thread mprotect()s it MAP_EXCLUSIVE.
>
> > > Is it actually reasonable to malloc() some memory and then make it exclusive?
> > >
> > > Are you permitted to map a file MAP_EXCLUSIVE? What does it mean?
> >
> > I'd limit MAP_EXCLUSIVE only to anonymous memory.
> >
> > > What does MAP_PRIVATE | MAP_EXCLUSIVE do?
> >
> > My preference is to have only mmap() and then the semantics is more clear:
> >
> > MAP_PRIVATE | MAP_EXCLUSIVE creates a pre-populated region, marks it locked
> > and drops the pages in this region from the direct map.
> > The pages are returned back on munmap().
> > Then there is no way to change an existing area to be exclusive or vice
> > versa.
>
> And what happens if you fork()? Limiting it to MAP_SHARED |
> MAP_EXCLUSIVE would avoid this particular nasty question.
>
> > >
> > > How does one pass exclusive memory via SCM_RIGHTS? (If it's a
> > > memfd-like or chardev interface, it's trivial. mmap(), not so much.)
> >
> > Why passing such memory via SCM_RIGHTS would be useful?
>
> Suppose I want to put a secret into exclusive memory and then send
> that secret to some other process.
> The obvious approach would be to SCM_RIGHTS an fd over, but you can't
> do that with MAP_EXCLUSIVE as you've defined it. In general, there are
> lots of use cases for memfd and other fd-backed memory.
>
> > >
> > > And finally, there's my personal giant pet peeve: a major use of this
> > > will be for virtualization. I suspect that a lot of people would like
> > > the majority of KVM guest memory to be unmapped from the host
> > > pagetables. But people might also like for guest memory to be
> > > unmapped in *QEMU's* pagetables, and mmap() is a basically worthless
> > > interface for this. Getting fd-backed memory into a guest will take
> > > some possibly major work in the kernel, but getting vma-backed memory
> > > into a guest without mapping it in the host user address space seems
> > > much, much worse.
> >
> > Well, in my view, the MAP_EXCLUSIVE is intended to keep small secrets
> > rather than use it for the entire guest memory. I even considered adding a
> > limit for the mapping size, but then I decided that since RLIMIT_MEMLOCK is
> > anyway enforced there is no need for a new one.
> >
> > I agree that getting fd-backed memory into a guest would be less pain than
> > VMA, but KVM can already use memory outside the control of the kernel via
> > /dev/map [1].
>
> That series doesn't address the problem I'm talking about at all. I'm
> saying that there is a legitimate use case where QEMU should *not*
> have a mapping of the memory. So QEMU would create some exclusive
> memory using /dev/exclusive_memory and would tell KVM to map it into
> the guest without mapping it into QEMU's address space at all.
>
> (In fact, the way that SEV currently works is *functionally* like
> this, except that there's a bogus incoherent mapping in the QEMU
> process that is a giant can of worms.)
>
>
> IMO a major benefit of a chardev approach is that you don't need a new
> VM_ flag and you don't need to worry about wiring it up everywhere in
> the core mm code.

Ok, at last I'm starting to see your and Christoph's point.

Just to reiterate: we can provide fd-backed memory through a
/dev/exclusive_memory chardev (or some other name we'll pick after long
bikeshedding), and the .mmap method of this character device can then do
interesting things with the backing physical memory. Since the memory is
not VMA-mapped, we do not have to find all the places in the core that
might require a check of a VM_ flag to ensure there are no clashes with
the exclusive memory.

Still, whatever we do with the mapping properties of this memory, we need
a solution for the splitting of the huge pages that map the direct map,
although that is an orthogonal problem in a way.

-- 
Sincerely yours,
Mike.
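
For illustration, the userspace side of the fd-backed flow discussed in
this thread might look roughly like the sketch below. It is not the
proposed interface: /dev/exclusive_memory does not exist, so
memfd_create() stands in for opening the chardev, and the helper names
(secret_region_create, secret_region_map, secret_roundtrip) are made up
for this example. A memfd also stays in the kernel direct map, so the
sketch shows only the "fd plus MAP_SHARED" usage pattern, not the
exclusivity itself.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create an fd-backed region of len bytes. ASSUMPTION: memfd_create() is
 * only a stand-in for opening the hypothetical exclusive-memory chardev.
 * Returns the fd, or -1 on error.
 */
int secret_region_create(size_t len)
{
	assert(len > 0);

	int fd = memfd_create("secret-sketch", MFD_CLOEXEC);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Map the region MAP_SHARED, as suggested in the thread; NULL on failure. */
void *secret_region_map(int fd, size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Store a secret through one mapping, drop that mapping, then read the
 * secret back through a second mapping of the same fd. The contents are
 * tied to the fd, not to any particular mapping of it.
 * Returns 0 on success, -1 on error.
 */
int secret_roundtrip(const char *secret, char *out, size_t out_len)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	size_t n = strlen(secret) + 1;

	if (n > len || n > out_len)
		return -1;

	int fd = secret_region_create(len);
	if (fd < 0)
		return -1;

	char *w = secret_region_map(fd, len);
	if (!w) {
		close(fd);
		return -1;
	}
	memcpy(w, secret, n);
	munmap(w, len);

	char *r = secret_region_map(fd, len);
	if (!r) {
		close(fd);
		return -1;
	}
	memcpy(out, r, n);
	munmap(r, len);
	close(fd);
	return 0;
}
```

Because the region is identified by an fd rather than by an address
range, the same fd could be handed to another process as SCM_RIGHTS
ancillary data over a Unix socket, which is the property argued for
above.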