To: Mike Rapoport
Cc: Michal Hocko, Mike Rapoport, Andrew Morton, Alexander Viro,
 Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Catalin Marinas,
 Christopher Lameter, Dan Williams, Dave Hansen, Elena Reshetova,
 "H. Peter Anvin", Ingo Molnar, James Bottomley, "Kirill A. Shutemov",
 Matthew Wilcox, Mark Rutland, Michael Kerrisk, Palmer Dabbelt,
 Paul Walmsley, Peter Zijlstra, Rick Edgecombe, Roman Gushchin,
 Shakeel Butt, Shuah Khan, Thomas Gleixner, Tycho Andersen, Will Deacon,
 linux-api@vger.kernel.org, linux-arch@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-kselftest@vger.kernel.org, linux-nvdimm@lists.01.org,
 linux-riscv@lists.infradead.org, x86@kernel.org, Hagen Paul Pfeifer,
 Palmer Dabbelt
References: <20210208084920.2884-1-rppt@kernel.org>
 <20210208084920.2884-8-rppt@kernel.org>
 <20210208212605.GX242749@kernel.org>
 <20210209090938.GP299309@linux.ibm.com>
 <20210211071319.GF242749@kernel.org>
 <0d66baec-1898-987b-7eaf-68a015c027ff@redhat.com>
 <20210211112702.GI242749@kernel.org>
From: David Hildenbrand
Organization: Red Hat GmbH
Subject: Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
Message-ID: <05082284-bd85-579f-2b3e-9b1af663eb6f@redhat.com>
Date: Thu, 11 Feb 2021 13:07:10 +0100
In-Reply-To: <20210211112702.GI242749@kernel.org>

On 11.02.21 12:27, Mike Rapoport wrote:
> On Thu, Feb 11, 2021 at 10:01:32AM +0100, David Hildenbrand wrote:
>> On 11.02.21 09:39, Michal Hocko wrote:
>>> On Thu 11-02-21 09:13:19, Mike Rapoport wrote:
>>>> On Tue, Feb 09, 2021 at 02:17:11PM +0100, Michal Hocko wrote:
>>>>> On Tue 09-02-21 11:09:38, Mike Rapoport wrote:
>>> [...]
>>>>>> Citing my older email:
>>>>>>
>>>>>>     I've hesitated whether to continue to use new flags to memfd_create() or to
>>>>>>     add a new system call and I've decided to use a new system call after I've
>>>>>>     started to look into man pages update. There would have been two completely
>>>>>>     independent descriptions and I think it would have been very confusing.
>>>>>
>>>>> Could you elaborate? Unmapping from the kernel address space can work
>>>>> both for sealed or hugetlb memfds, no? Those features are completely
>>>>> orthogonal AFAICS. With a dedicated syscall you will need to introduce
>>>>> this functionality on top if that is required. Have you considered that?
>>>>> I mean hugetlb pages are used to back guest memory very often. Is this
>>>>> something that will be a secret memory usecase?
>>>>>
>>>>> Please be really specific when giving arguments to back a new syscall
>>>>> decision.
>>>>
>>>> Isn't "syscalls have completely independent description" specific enough?
>>>
>>> No, it's not as you can see from questions I've had above. More on that
>>> below.
>>>
>>>> We are talking about API here, not the implementation details whether
>>>> secretmem supports large pages or not.
>>>>
>>>> The purpose of memfd_create() is to create a file-like access to memory.
>>>> The purpose of memfd_secret() is to create a way to access memory hidden
>>>> from the kernel.
>>>>
>>>> I don't think overloading memfd_create() with the secretmem flags because
>>>> they happen to return a file descriptor will be better for users, but
>>>> rather will be more confusing.
>>>
>>> This is quite a subjective conclusion. I could very well argue that it
>>> would be much better to have a single syscall to get a fd backed memory
>>> with specific requirements (sealing, unmapping from the kernel address
>>> space). Neither of us would be clearly right or wrong. A more important
>>> point is a future extensibility and usability, though.
>>> So let's just think of few usecases I have outlined above. Is it
>>> unrealistic to expect that secret memory should be sealable? What about
>>> hugetlb? Because if the answer is no then a new API is a clear win as
>>> the combination of flags would never work and then we would just suffer
>>> from the syscall multiplexing without much gain. On the other hand if
>>> combination of the functionality is to be expected then you will have to
>>> jam it into memfd_create and copy the interface likely causing more
>>> confusion. See what I mean?
>>>
>>> I by no means insist one way or the other but from what I have
>>> seen so far I have a feeling that the interface hasn't been thought
>>> through enough. Sure you have landed with fd based approach and that
>>> seems fair. But how to get that fd seems to still have some gaps IMHO.
>>>
>>
>> I agree with Michal. This has been raised by different
>> people already, including on LWN (https://lwn.net/Articles/835342/).
>>
>> I can follow Mike's reasoning (man page), and I am also fine if there is
>> a valid reason. However, IMHO the basic description seems to match quite well:
>>
>>     memfd_create() creates an anonymous file and returns a file descriptor that refers to it. The
>>     file behaves like a regular file, and so can be modified, truncated, memory-mapped, and so on.
>>     However, unlike a regular file, it lives in RAM and has a volatile backing storage. Once all
>>     references to the file are dropped, it is automatically released. Anonymous memory is used
>>     for all backing pages of the file. Therefore, files created by memfd_create() have the same
>>     semantics as other anonymous memory allocations such as those allocated using mmap(2) with the
>>     MAP_ANONYMOUS flag.
>
> Even despite my laziness and huge amount of copy-paste you can spot the
> differences (this is a very old version, update is due):
>
>     memfd_secret() creates an anonymous file and returns a file descriptor
>     that refers to it. The file can only be memory-mapped; the memory in
>     such mapping will have stronger protection than usual memory mapped
>     files, and so it can be used to store application secrets. Unlike a
>     regular file, a file created with memfd_secret() lives in RAM and has a
>     volatile backing storage. Once all references to the file are dropped,
>     it is automatically released. The initial size of the file is set to
>     0. Following the call, the file size should be set using ftruncate(2).
>
>     The memory areas obtained with mmap(2) from the file descriptor are
>     exclusive to the owning context. These areas are removed from the kernel
>     page tables and only the page table of the process holding the file
>     descriptor maps the corresponding physical memory.
>

So let's talk about the main user-visible differences to other memfd
files (especially, other purely virtual files like hugetlbfs). With
secretmem:

- File content can only be read/written via memory mappings.
- File content cannot be swapped out.

I think there are still valid ways to modify file content using
syscalls: e.g., fallocate(PUNCH_HOLE). Things like truncate also seem
to work just fine.

What else?


>> AFAIKS, we would need MFD_SECRET and disallow
>> MFD_ALLOW_SEALING and MFD_HUGETLB.
>
> So here we start to multiplex.

Yes. And as Michal said, maybe we can support combinations in the future.

>
>> In addition, we could add MFD_SECRET_NEVER_MAP, which could disallow any kind of
>> temporary mappings (or migration). TBC.
>
> Never map is the default. When we'll need to map we'll add an explicit flag
> for it.

No strong opinion. (I'd try to hurt the kernel less as default)

-- 
Thanks,

David / dhildenb
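
To make the first user-visible difference above concrete, here is a
minimal C sketch. It is an illustration only, not code from the series.
It creates either a plain memfd or a secretmem fd, sizes it with
ftruncate(2), stores a few bytes through a shared mapping, and then
checks whether write(2) on the fd is accepted. The helper names
make_fd() and probe_fd() are hypothetical. The syscall number 447 is an
assumption: it is the number memfd_secret() got when it was eventually
merged, and it is not defined by the v17 patches discussed here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * ASSUMPTION: 447 is the number memfd_secret() received once merged.
 * The C library may not provide a wrapper, so call it via syscall(2).
 */
#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447
#endif

/* Hypothetical helper: secret != 0 requests secretmem, else a plain memfd.
 * Returns the new fd, or -1 with errno set. */
static int make_fd(int secret)
{
    if (secret)
        return (int)syscall(SYS_memfd_secret, 0U);
    return memfd_create("demo", MFD_CLOEXEC);
}

/* Hypothetical helper: size the file to one page, store a string through
 * a shared mapping, then try the same store with write(2).
 * Returns 1 if write(2) was accepted, 0 if rejected, -1 if setup failed. */
static int probe_fd(int fd)
{
    static const char msg[] = "hello";
    long page = sysconf(_SC_PAGESIZE);
    char *p;
    int ok;

    assert(fd >= 0);
    if (page <= 0 || ftruncate(fd, page) != 0)
        return -1;

    /* Access through a memory mapping: expected to work for both kinds. */
    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    memcpy(p, msg, sizeof(msg));
    ok = memcmp(p, msg, sizeof(msg)) == 0;
    munmap(p, (size_t)page);
    if (!ok)
        return -1;

    /* Access through a file syscall: the point where the two differ. */
    return write(fd, msg, sizeof(msg)) == (ssize_t)sizeof(msg) ? 1 : 0;
}

#ifdef DEMO_MAIN
int main(void)
{
    for (int secret = 0; secret <= 1; secret++) {
        const char *name = secret ? "memfd_secret" : "memfd_create";
        int fd = make_fd(secret);

        if (fd < 0) {
            printf("%s: unavailable (%s)\n", name, strerror(errno));
            continue;
        }
        printf("%s: probe_fd() = %d\n", name, probe_fd(fd));
        close(fd);
    }
    return 0;
}
#endif
```

Build with -DDEMO_MAIN to get a standalone program. On a kernel without
secretmem (or with it disabled) the memfd_secret() call is expected to
fail, and the sketch simply reports that. Swap-out behaviour, sealing and
hugetlb combinations are not exercised here.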