From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Hildenbrand
Organization: Red Hat
Date: Thu, 28 Mar 2024 10:06:52 +0100
Subject: Re: folio_mmapped
To: Will Deacon
Cc: Sean Christopherson, Vishal Annapurve, Quentin Perret, Matthew Wilcox,
 Fuad Tabba, kvm@vger.kernel.org, kvmarm@lists.linux.dev,
 pbonzini@redhat.com, chenhuacai@kernel.org, mpe@ellerman.id.au,
 anup@brainfault.org, paul.walmsley@sifive.com, palmer@dabbelt.com,
 aou@eecs.berkeley.edu, viro@zeniv.linux.org.uk, brauner@kernel.org,
 akpm@linux-foundation.org, xiaoyao.li@intel.com, yilun.xu@intel.com,
 chao.p.peng@linux.intel.com, jarkko@kernel.org, amoorthy@google.com,
 dmatlack@google.com, yu.c.zhang@linux.intel.com, isaku.yamahata@intel.com,
 mic@digikod.net, vbabka@suse.cz, ackerleytng@google.com,
 mail@maciej.szmigiero.name, michael.roth@amd.com, wei.w.wang@intel.com,
 liam.merwick@oracle.com, isaku.yamahata@gmail.com,
 kirill.shutemov@linux.intel.com, suzuki.poulose@arm.com,
 steven.price@arm.com, quic_mnalajal@quicinc.com, quic_tsoni@quicinc.com,
 quic_svaddagi@quicinc.com, quic_cvanscha@quicinc.com,
 quic_pderrin@quicinc.com, quic_pheragu@quicinc.com,
 catalin.marinas@arm.com, james.morse@arm.com, yuzenghui@huawei.com,
 oliver.upton@linux.dev, maz@kernel.org, keirf@google.com,
 linux-mm@kvack.org
References: <7470390a-5a97-475d-aaad-0f6dfb3d26ea@redhat.com>
 <40f82a61-39b0-4dda-ac32-a7b5da2a31e8@redhat.com>
 <20240319143119.GA2736@willie-the-truck>
 <2d6fc3c0-a55b-4316-90b8-deabb065d007@redhat.com>
 <20240327193454.GB11880@willie-the-truck>
In-Reply-To: <20240327193454.GB11880@willie-the-truck>

On 27.03.24 20:34, Will Deacon wrote:
> Hi again, David,
>
> On Fri, Mar 22, 2024 at 06:52:14PM +0100, David Hildenbrand wrote:
>> On 19.03.24 15:31, Will Deacon wrote:
>> sorry for the late reply!
>
> Bah, you and me both!

This time I'm faster! :)

>
>>> On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
>>>> On 19.03.24 01:10, Sean Christopherson wrote:
>>>>> On Mon, Mar 18, 2024, Vishal Annapurve wrote:
>>>>>> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand wrote:
>>> From the pKVM side, we're working on guest_memfd primarily to avoid
>>> diverging from what other CoCo solutions end up using, but if it gets
>>> de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
>>> today with anonymous memory, then it's a really hard sell to switch over
>>> from what we have in production. We're also hoping that, over time,
>>> guest_memfd will become more closely integrated with the mm subsystem to
>>> enable things like hypervisor-assisted page migration, which we would
>>> love to have.
>>
>> Reading Sean's reply, he has a different view on that. And I think that's
>> the main issue: there are too many different use cases and too many
>> different requirements that could turn guest_memfd into something that maybe
>> it really shouldn't be.
>
> No argument there, and we're certainly not tied to any specific
> mechanism on the pKVM side. Maybe Sean can chime in, but we've
> definitely spoken about migration being a goal in the past, so I guess
> something changed since then on the guest_memfd side.
>
> Regardless, from our point of view, we just need to make sure that
> whatever we settle on for pKVM does the things we need it to do (or can
> at least be extended to do them) and we're happy to implement that in
> whatever way works best for upstream, guest_memfd or otherwise.
>
>>> We're happy to pursue alternative approaches using anonymous memory if
>>> you'd prefer to keep guest_memfd limited in functionality (e.g.
>>> preventing GUP of private pages by extending mapping_flags as per [1]),
>>> but we're equally willing to contribute to guest_memfd if extensions are
>>> welcome.
>>>
>>> What do you prefer?
>>
>> Let me summarize the history:
>
> First off, thanks for piecing together the archaeology...
>
>> AMD had its thing running and it worked for them (but I recall it was hacky
>> :) ).
>>
>> TDX made it possible to crash the machine when accessing secure memory from
>> user space (MCE).
>>
>> So secure memory must not be mapped into user space -- no page tables.
>> Prototypes with anonymous memory existed (and I didn't hate them, although
>> hacky), but one of the other selling points of guest_memfd was that we could
>> create VMs that wouldn't need any page tables at all, which I found
>> interesting.
>
> Are the prototypes you refer to here based on the old stuff from Kirill?

Yes.

> We followed that work at the time, thinking we were going to be using
> that before guest_memfd came along, so we've sadly been collecting
> out-of-tree patches for a little while :/

:/

>
>> There was a bit more to that (easier conversion, avoiding GUP, specifying on
>> allocation that the memory was unmovable ...), but I'll get to that later.
>>
>> The design principle was: nasty private memory (unmovable, unswappable,
>> inaccessible, un-GUPable) is allocated from guest_memfd, ordinary "shared"
>> memory is allocated from an ordinary memfd.
>>
>> This makes sense: shared memory is neither nasty nor special. You can
>> migrate it, swap it out, map it into page tables, GUP it, ... without any
>> issues.
>
> Slight aside and not wanting to derail the discussion, but we have a few
> different types of sharing which we'll have to consider:

Thanks for sharing!

>
> * Memory shared from the host to the guest. This remains owned by the
>   host and the normal mm stuff can be made to work with it.

Okay, host and guest can access it. We can just migrate memory around,
swap it out ... like ordinary guest memory today.

>
> * Memory shared from the guest to the host.
>   This remains owned by the guest, so there's a pin on the pages and the
>   normal mm stuff can't work without co-operation from the guest (see
>   next point).

Okay, host and guest can access it, but we cannot migrate memory around
or swap it out ... like ordinary guest memory today that is longterm
pinned.

>
> * Memory relinquished from the guest to the host. This actually unmaps
>   the pages from the guest and transfers ownership back to the host,
>   after which the pin is dropped and the normal mm stuff can work. We
>   use this to implement ballooning.
>

Okay, so this is essentially just a state transition between the two
above.

> I suppose the main thing is that the architecture backend can deal with
> these states, so the core code shouldn't really care as long as it's
> aware that shared memory may be pinned.

So IIUC, the states are:

(1) Private: inaccessible by the host, accessible by the guest, "owned by
    the guest"
(2) Host Shared: accessible by the host + guest, "owned by the host"
(3) Guest Shared: accessible by the host, "owned by the guest"

Memory ballooning is simply transitioning from (3) to (2), and then
discarding the memory.

Any state I am missing?

Which transitions are possible?

(1) <-> (2) ? Not sure if the direct transition is possible.
(2) <-> (3) ? IIUC yes.
(1) <-> (3) ? IIUC yes.

There is ongoing work on longterm-pinning memory from a memfd/shmem. So
thinking in terms of my vague "fd guest_memfd + fd pair", that approach
could look like the following:

(1) guest_memfd (could be "with longterm pin")
(2) memfd
(3) memfd with a longterm pin

But again, that is just one possible idea to make it work with
guest_memfd.

>
>> So if I would describe some key characteristics of guest_memfd as of today,
>> it would probably be:
>>
>> 1) Memory is unmovable and unswappable. Right from the beginning, it is
>>    allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...).
>> 2) Memory is inaccessible.
>>    It cannot be read from user space or the kernel, it cannot be
>>    GUP'ed ... only some mechanisms (e.g., hibernation, /proc/kcore)
>>    might end up touching it "by accident", and we usually can handle
>>    these cases.
>> 3) Memory can be discarded in page granularity. There should be no cases
>>    where you cannot discard memory to over-allocate memory for private
>>    pages that have been replaced by shared pages otherwise.
>> 4) Page tables are not required (well, it's a memfd), and the fd could
>>    in theory be passed to other processes.
>>
>> Having "ordinary shared" memory in there implies that 1) and 2) will have to
>> be adjusted for them, which kind-of turns it "partially" into ordinary shmem
>> again.
>
> Yes, and we'd also need a way to establish hugepages (where possible)
> even for the *private* memory so as to reduce the depth of the guest's
> stage-2 walk.
>

Understood, and as discussed, that's a bit more "hairy".

>> Going back to the beginning: with pKVM, we likely want the following
>>
>> 1) Convert pages private<->shared in-place
>> 2) Stop user space + kernel from accessing private memory in process
>>    context. Likely for pKVM we would only crash the process, which
>>    would be acceptable.
>> 3) Prevent GUP to private memory. Otherwise we could crash the kernel.
>> 4) Prevent private pages from swapout+migration until supported.
>>
>>
>> I suspect your current solution with anonymous memory gets all but 3) sorted
>> out, correct?
>
> I agree on all of these and, yes, (3) is the problem for us. We've also
> been thinking a bit about CoW recently and I suspect the use of
> vm_normal_page() in do_wp_page() could lead to issues similar to those
> we hit with GUP. There are various ways to approach that, but I'm not
> sure what's best.

Would COW be required or is that just the nasty side-effect of trying to
use anonymous memory?
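
Coming back to the states (1)-(3) from above: to make sure we are talking
about the same thing, here is a rough model of how I read them. This is
only a sketch in Python, not pKVM or guest_memfd code; all names (State,
ALLOWED, BACKING, balloon, ...) are made up for illustration. It encodes
only the transitions marked "IIUC yes" and leaves (1) <-> (2) out, because
it is unclear whether that direct transition exists.

```python
from enum import Enum


class State(Enum):
    PRIVATE = 1       # (1) inaccessible by the host, "owned by the guest"
    HOST_SHARED = 2   # (2) accessible by host + guest, "owned by the host"
    GUEST_SHARED = 3  # (3) accessible by the host, "owned by the guest", pinned


# Transitions assumed possible ("IIUC yes"), in both directions.
# (1) <-> (2) is deliberately missing: not sure a direct transition exists.
ALLOWED = {
    frozenset({State.HOST_SHARED, State.GUEST_SHARED}),
    frozenset({State.PRIVATE, State.GUEST_SHARED}),
}

# Hypothetical backing per state, following the vague "fd pair" idea.
BACKING = {
    State.PRIVATE: "guest_memfd",
    State.HOST_SHARED: "memfd",
    State.GUEST_SHARED: "memfd with a longterm pin",
}


def can_transition(src, dst):
    """Return True if the direct transition src <-> dst is assumed possible."""
    return frozenset({src, dst}) in ALLOWED


def can_migrate_or_swap(state):
    """Only host-owned shared memory behaves like ordinary guest memory."""
    return state is State.HOST_SHARED


def balloon(state):
    """Ballooning: go from (3) to (2); the memory can then be discarded."""
    if state is not State.GUEST_SHARED:
        raise ValueError("ballooning starts from guest-shared memory")
    return State.HOST_SHARED
```

With that model, private memory would first have to become guest-shared
before it can be relinquished to the host, unless a direct (1) <-> (2)
transition turns out to exist.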
>
>> I'm curious, may there be a requirement in the future that shared memory
>> could be mapped into other processes? (thinking vhost-user and such things).
>
> It's not impossible. We use crosvm as our VMM, and that has a
> multi-process sandbox mode which I think relies on just that...
>

Okay, so basing the design on anonymous memory might not be the best
choice ... :/

> Cheers,
>
> Will
>
> (btw: I'm getting some time away from the computer over Easter, so I'll be
> a little slow on email again. Nothing personal!).

Sure, no worries! Enjoy!

-- 
Cheers,

David / dhildenb