From: David Hildenbrand <david@redhat.com>
Date: Thu, 17 Oct 2024 17:02:22 +0200
Subject: Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
To: Peter Xu
Cc: Ackerley Tng, tabba@google.com, quic_eberman@quicinc.com, roypat@amazon.co.uk, jgg@nvidia.com, rientjes@google.com, fvdl@google.com, jthoughton@google.com, seanjc@google.com, pbonzini@redhat.com, zhiquan1.li@intel.com, fan.du@intel.com, jun.miao@intel.com, isaku.yamahata@intel.com, muchun.song@linux.dev, erdemaktas@google.com, vannapurve@google.com, qperret@google.com, jhubbard@nvidia.com, willy@infradead.org, shuah@kernel.org, brauner@kernel.org, bfoster@redhat.com, kent.overstreet@linux.dev, pvorel@suse.cz, rppt@kernel.org, richard.weiyang@gmail.com, anup@brainfault.org, haibo1.xu@intel.com, ajones@ventanamicro.com, vkuznets@redhat.com, maciej.wieczor-retman@intel.com, pgonda@google.com, oliver.upton@linux.dev, linux-kernel@vger.kernel.org, linux-mm@kvack.org, kvm@vger.kernel.org, linux-kselftest@vger.kernel.org
References: <1d243dde-2ddf-4875-890d-e6bb47931e40@redhat.com>

On 16.10.24 22:16, Peter Xu wrote:
> On Wed, Oct 16, 2024 at 10:45:43AM +0200, David Hildenbrand wrote:
>> On 16.10.24 01:42, Ackerley Tng wrote:
>>> Peter Xu writes:
>>>
>>>> On Fri, Oct 11, 2024 at 11:32:11PM +0000, Ackerley Tng wrote:
>>>>> Peter Xu writes:
>>>>>
>>>>>> On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
>>>>>>> The faultability xarray is stored on the inode since faultability is a
>>>>>>> property of the guest_memfd's memory contents.
>>>>>>>
>>>>>>> In this RFC, presence of an entry in the xarray indicates faultable,
>>>>>>> but this could be flipped so that presence indicates unfaultable. For
>>>>>>> flexibility, a special value "FAULT" is used instead of a simple
>>>>>>> boolean.
>>>>>>>
>>>>>>> However, at some stages of a VM's lifecycle there could be more
>>>>>>> private pages, and at other stages there could be more shared pages.
>>>>>>>
>>>>>>> This is likely to be replaced by a better data structure in a future
>>>>>>> revision to better support ranges.
>>>>>>>
>>>>>>> Also store struct kvm_gmem_hugetlb as a pointer in
>>>>>>> inode->i_mapping->i_private_data.
>>>>>>
>>>>>> Could you help explain the difference between faultability vs. the
>>>>>> existing KVM_MEMORY_ATTRIBUTE_PRIVATE? Not sure if I'm the only one who's
>>>>>> confused; otherwise it might be good to enrich the commit message.
>>>>>
>>>>> Thank you for this question, I'll add this to the commit message in the
>>>>> next revision if Fuad's patch set [1] doesn't make it first.
>>>>>
>>>>> Reason (a): To elaborate on the explanation in [1],
>>>>> KVM_MEMORY_ATTRIBUTE_PRIVATE is whether userspace wants this page to be
>>>>> private or shared, and faultability is whether the page is allowed to be
>>>>> faulted in by userspace.
>>>>>
>>>>> These two are similar but may not be the same thing. In pKVM, pKVM
>>>>> cannot trust userspace's configuration of private/shared, and other
>>>>> information will go into determining the private/shared setting in
>>>>> faultability.
>>>>
>>>> It makes sense to me that the kernel has the right to decide which page is
>>>> shared / private. No matter if it's for pKVM or CoCo, I believe the normal
>>>> case is most / all pages are private, until there are some requests to
>>>> share them for special purposes (like DMA). But that'll need to be
>>>> initiated as a request from the guest, not the userspace hypervisor.
>>>
>>> For TDX, the plan is that the guest will request the page to be remapped
>>> as shared or private, and the handler for that request will exit to
>>> the userspace VMM.
>>>
>>> The userspace VMM will then do any necessary coordination (e.g. for a
>>> shared to private conversion it may need to unpin pages from DMA), and
>>> then use the KVM_SET_MEMORY_ATTRIBUTES ioctl to indicate agreement with
>>> the guest's requested conversion. This is where
>>> KVM_MEMORY_ATTRIBUTE_PRIVATE will be provided.
>>>
>>> Patch 38 [1] updates
>>> tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c to
>>> demonstrate the usage flow for x86.
>>>
>>> Fuad will be in a better position to explain the flow for pKVM.
>>>
>>>> I must confess I totally have no idea how KVM_MEMORY_ATTRIBUTE_PRIVATE is
>>>> planned to be used in the future. Currently it's always set at least in
>>>> QEMU if gmemfd is enabled, so it doesn't yet tell me anything...
>>>>
>>>> If it's driven by the userspace side of the hypervisor, I wonder when
>>>> the user app should request some value different from what it already was,
>>>> if the kernel already has an answer in this case. It made me even more
>>>> confused, as we have this in the API doc:
>>>>
>>>>   Note, there is no "get" API. Userspace is responsible for
>>>>   explicitly tracking the state of a gfn/page as needed.
>>>>
>>>> And I do wonder whether we will still need some API just to query whether
>>>> the kernel allows the page to be mapped or not (aka, the "real" shared /
>>>> private status of a guest page).
>>>> I guess that's not directly relevant to the faultability to be introduced
>>>> here, but if you or anyone else knows, please kindly share, I'd love to
>>>> learn about it.
>>>
>>> The userspace VMM will track the initial shared/private state, in the
>>> sense that when the VM is created, the mem_attr_array is initialized
>>> such that the guest pages are all shared.
>>>
>>> Then when the userspace VMM calls the KVM_SET_MEMORY_ATTRIBUTES ioctl,
>>> it should record all changes so it knows what the state is in the
>>> kernel.
>>>
>>> Even if the userspace VMM doesn't record the state properly, if the
>>> KVM_SET_MEMORY_ATTRIBUTES ioctl is used to request no change
>>> (e.g. setting an already private page to private), it will just be a
>>> no-op in the kernel.
>>>
>>>>>
>>>>> Perhaps Fuad can elaborate more here.
>>>>>
>>>>> Reason (b): In this patch series (mostly focused on x86 first), we're
>>>>> using faultability to prevent any future faults before checking that
>>>>> there are no mappings.
>>>>>
>>>>> Having a different xarray from mem_attr_array allows us to disable
>>>>> faulting before committing to changing mem_attr_array. Please see
>>>>> `kvm_gmem_should_set_attributes_private()` in this patch [2].
>>>>>
>>>>> We're not completely sure about the effectiveness of using faultability
>>>>> to block off future faults here; in future revisions we may be using a
>>>>> different approach. The folio_lock() is probably important if we need to
>>>>> check mapcount. Please let me know if you have any ideas!
>>>>>
>>>>> The starting point of having a different xarray was pKVM's requirement
>>>>> of having separate xarrays, and we later realized that the xarray could
>>>>> be used for reason (b). For x86 we could perhaps eventually remove the
>>>>> second xarray? Not sure as of now.
>>>>
>>>> Just had a quick look at patch 27:
>>>>
>>>> https://lore.kernel.org/all/5a05eb947cf7aa21f00b94171ca818cc3d5bdfee.1726009989.git.ackerleytng@google.com/
>>>>
>>>> I'm not yet sure what's protecting faultability from being modified against
>>>> a concurrent fault().
>>>>
>>>> I wonder whether one can use the folio lock to serialize that, so that one
>>>> needs to take the folio lock to modify/lookup the folio's faultability;
>>>> then it may naturally match with the fault() handler design, where
>>>> kvm_gmem_get_folio() needs to lock the page first.
>>>>
>>>> But then kvm_gmem_is_faultable() will also need to be called only after the
>>>> folio is locked to avoid races.
>>>
>>> My bad. In our rush to get this series out before LPC, the patch series
>>> was not organized very well. Patch 39 [2] adds the lock.
>>> filemap_invalidate_lock_shared() should make sure that faulting
>>> doesn't race with faultability updates.
>>>
>>>>>> The latter is per-slot, so one level higher; however, I don't think it's a
>>>>>> common use case to map the same gmemfd into multiple slots anyway for
>>>>>> KVM (besides corner cases like live upgrade). So perhaps this is not about
>>>>>> layering but something else? For example, any use case where PRIVATE and
>>>>>> FAULTABLE can be reported with different values.
>>>>>>
>>>>>> Another higher-level question is, is there any plan to support a non-CoCo
>>>>>> context for 1G?
>>>>>
>>>>> I believe guest_memfd users are generally in favor of eventually using
>>>>> guest_memfd for non-CoCo use cases, which means we do want 1G (shared,
>>>>> in the case of CoCo) page support.
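
For context, the userspace side of the KVM_SET_MEMORY_ATTRIBUTES flow
described above boils down to a single ioctl on the VM fd. A minimal,
untested sketch (set_range_private, vm_fd, gpa and size are placeholder
names, not taken from the series):

#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Placeholder helper: after the guest requested a conversion and the VMM
 * finished its coordination (e.g., unpinning pages used for DMA), flip the
 * attribute for the GPA range. There is no "get" ioctl, so the VMM has to
 * record the new state itself; re-requesting the current state is a no-op.
 */
static int set_range_private(int vm_fd, uint64_t gpa, uint64_t size,
			     bool to_private)
{
	struct kvm_memory_attributes attrs = {
		.address    = gpa,
		.size       = size,
		.attributes = to_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
	};

	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}
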
>>>>>
>>>>> However, core-mm's fault path does not support mapping at anything
>>>>> higher than the PMD level (other than hugetlb_fault(), which the
>>>>> community wants to move away from), so core-mm wouldn't be able to map
>>>>> 1G pages taken from HugeTLB.
>>>>
>>>> Have you looked at vm_operations_struct.huge_fault()? Or maybe you're
>>>> referring to some other challenges?
>>>>
>>>
>>> IIUC vm_operations_struct.huge_fault() is used when creating a PMD, but
>>> PUD mappings will be needed for 1G pages, so 1G pages can't be mapped by
>>> core-mm using vm_operations_struct.huge_fault().
>>
>>
>> Just to clarify a bit for Peter: as has been discussed previously, there are
>> rather big differences between CoCo and non-CoCo VMs.
>>
>> In CoCo VMs, the primary portion of all pages is private, and those are not
>> mapped into user space. Only a handful of pages are commonly shared and
>> mapped into user space.
>>
>> In non-CoCo VMs, all pages are shared and (for the time being) all pages are
>> mapped into user space, from where KVM will consume them.
>>
>>
>> Installing pmd/pud mappings into user space (recall: shared memory only) is
>> currently not really a requirement for CoCo VMs, and therefore not the focus
>> of this work.
>>
>> Further, it's currently considered to be incompatible with getting in-place
>> private<->shared conversion on *page* granularity right, as we will be
>> exposing huge/gigantic folios via individual small folios to core-MM.
>> Mapping a PMD/PUD that is composed of multiple folios into core-mm is not
>> going to fly, unless using a PFNMAP, which has been briefly discussed as
>> well, but disregarded so far (no page pinning support).
>>
>> So in the context of this work here, huge faults and PUD/PMD *user space
>> page tables* do not apply.
>>
>> For non-CoCo VMs there is no in-place conversion problem. One could use the
>> same CoCo implementation, but without user space pud/pmd mappings. KVM and
>> VFIO would have to consume this memory via the guest_memfd in memslots
>> instead of via the user space mappings to more easily get PMD/PUD mappings
>> into the secondary MMU. And the downsides would be sacrificing the vmemmap
>
> Is there a chance that, when !CoCo is supported, external modules
> (e.g. VFIO) can reuse the old user mappings, just like before gmemfd?

I expect this at least initially to be the case. At some point, we might
see a transition to fd+offset for some interfaces.

I recall that there was a similar discussion when specifying "shared"
memory in a KVM memory slot that will be backed by a guest_memfd:
initially, this would be via VA and not via guest_memfd+offset. I recall
Sean and James want it to stay that way (sorry if I am wrong!), and
James might require that to get the fancy uffd mechanism flying.

>
> To support CoCo, I understand gmem+offset is required all over the places.
> However, in a non-CoCo context, I wonder whether the other modules are
> required to stick with gmem+offset, or whether they can reuse the old VA
> ways, because how it works can fundamentally be the same as before, except
> that the folios will now be managed by gmemfd.
>
> I think the good thing with such an approach is that when developing CoCo
> support for all these modules, there are fewer constraints / concerns about
> staying compatible with the non-CoCo use case; it'll also make it even
> easier to use in production before all CoCo facilities are ready, as most
> infrastructures are already around and have been in use for years if VA can
> be mapped and GUPed like before.
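
To make the gmem+offset vs. VA contrast concrete: with today's uAPI a slot
can carry both a guest_memfd (for the private side) and a plain VA (for the
shared side that other modules can keep GUPing like before). A rough,
untested sketch (bind_slot, vm_fd, mem_size and shared_va are placeholder
names, error handling omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Placeholder helper: create a guest_memfd and bind one slot to it. */
static int bind_slot(int vm_fd, uint64_t mem_size, void *shared_va)
{
	struct kvm_create_guest_memfd gmem = { .size = mem_size };
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	struct kvm_userspace_memory_region2 region = {
		.slot               = 0,
		.flags              = KVM_MEM_GUEST_MEMFD,
		.guest_phys_addr    = 0,
		.memory_size        = mem_size,
		/* the old VA way: mappable and GUP-able like before */
		.userspace_addr     = (uintptr_t)shared_va,
		/* the gmem+offset way: what CoCo-aware consumers use */
		.guest_memfd        = gmem_fd,
		.guest_memfd_offset = 0,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
}

The question is essentially whether non-CoCo consumers keep using the
userspace_addr mapping or eventually move to guest_memfd+offset.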

Right, but even if most interfaces support guest_memfd+offset, things
like DIRECT_IO to shared guest memory will require VA+GUP (someone
brought that up at LPC).

-- 
Cheers,

David / dhildenb