From: Vishal Annapurve <vannapurve@google.com>
Date: Mon, 18 Mar 2024 10:06:11 -0700
Subject: Re: folio_mmapped
To: David Hildenbrand
Cc: Sean Christopherson, Quentin Perret, Matthew Wilcox, Fuad Tabba,
 kvm@vger.kernel.org, kvmarm@lists.linux.dev, pbonzini@redhat.com,
 chenhuacai@kernel.org, mpe@ellerman.id.au, anup@brainfault.org,
 paul.walmsley@sifive.com, palmer@dabbelt.com, aou@eecs.berkeley.edu,
 viro@zeniv.linux.org.uk, brauner@kernel.org, akpm@linux-foundation.org,
 xiaoyao.li@intel.com, yilun.xu@intel.com, chao.p.peng@linux.intel.com,
 jarkko@kernel.org, amoorthy@google.com, dmatlack@google.com,
 yu.c.zhang@linux.intel.com, isaku.yamahata@intel.com, mic@digikod.net,
 vbabka@suse.cz, ackerleytng@google.com, mail@maciej.szmigiero.name,
 michael.roth@amd.com, wei.w.wang@intel.com, liam.merwick@oracle.com,
 isaku.yamahata@gmail.com, kirill.shutemov@linux.intel.com,
 suzuki.poulose@arm.com, steven.price@arm.com, quic_mnalajal@quicinc.com,
 quic_tsoni@quicinc.com, quic_svaddagi@quicinc.com,
 quic_cvanscha@quicinc.com, quic_pderrin@quicinc.com,
 quic_pheragu@quicinc.com, catalin.marinas@arm.com, james.morse@arm.com,
 yuzenghui@huawei.com, oliver.upton@linux.dev, maz@kernel.org,
 will@kernel.org, keirf@google.com, linux-mm@kvack.org

On Mon, Mar 4, 2024 at 12:17 PM David Hildenbrand wrote:
>
> On 04.03.24 20:04, Sean Christopherson wrote:
> > On Mon, Mar 04, 2024, Quentin Perret wrote:
> >>> As discussed in the sub-thread, that might still be required.
> >>>
> >>> One could think about completely forbidding GUP on these mmap'ed
> >>> guest-memfds. But likely, there might be use cases in the future
> >>> where you want to use GUP on shared memory inside a guest_memfd.
> >>>
> >>> (the iouring example I gave might currently not work because
> >>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and
> >>> guest_memfd will likely not be detected as shmem; 8ac268436e6d
> >>> contains some details)
> >>
> >> Perhaps it would be wise to start with GUP being forbidden if the
> >> current users do not need it (not sure if that is the case in Android,
> >> I'll check)? We can always relax this constraint later when/if the
> >> use-cases arise, which is obviously much harder to do the other way
> >> around.
> >
> > +1000. At least on the KVM side, I would like to be as conservative
> > as possible when it comes to letting anything other than the guest
> > access guest_memfd.
>
> So we'll have to do it similar to any occurrences of "secretmem" in
> gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code
> similar to e.g., folio_is_secretmem().
>
> IIRC, we might not be able to de-reference the actual mapping because
> it could get freed concurrently ...
>
> That will then prohibit any kind of GUP access to these pages, including
> reading/writing for ptrace/debugging purposes, for core dumping purposes
> etc. But at least, you know that nobody was able to obtain page
> references using GUP that might be used for reading/writing later.
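
To make the comparison concrete, here is a purely illustrative sketch of
what a secretmem-style check in the GUP-fast path could look like. Both
folio_is_guest_memfd() and guest_memfd_aops are hypothetical names
(nothing like them is exposed today), and, as you note above, safely
reading folio->mapping without racing against the mapping being freed
is exactly the part that still needs to be worked out:

/*
 * Purely illustrative, not upstream: a secretmem-style check that
 * GUP-fast could use to refuse taking references on guest_memfd folios.
 */
static inline bool folio_is_guest_memfd(struct folio *folio)
{
	/* Read the mapping without a reference; may race with freeing. */
	struct address_space *mapping = READ_ONCE(folio->mapping);

	/* Anon/movable folios carry flag bits here and can't be guest_memfd. */
	if (!mapping || ((unsigned long)mapping & PAGE_MAPPING_FLAGS))
		return false;

	return mapping->a_ops == &guest_memfd_aops;	/* hypothetical symbol */
}

	/*
	 * Call-site sketch, mirroring the existing secretmem check in the
	 * GUP-fast path (mm/gup.c): drop the reference and bail instead of
	 * taking a pin.
	 */
	if (unlikely(folio_is_guest_memfd(folio))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}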

There has been little discussion about supporting 1G pages with
guest_memfd for TDX/SNP or pKVM, and I would like to restart that
discussion [1]. 1G pages should be a very important use case for
guest_memfd, especially considering large VM sizes supporting
confidential GPU/TPU workloads.

Using separate backing stores for private and shared memory ranges is
not going to work effectively when using 1G pages. Consider the
following memory-conversion scenario when 1G pages back private memory:

* Guest requests conversion of a 4KB range from private to shared; in
  response, the host ideally does the following steps:
   a) Update the guest memory attributes.
   b) Unback the corresponding private memory.
   c) Allocate the corresponding shared memory, or let it be faulted in
      when the guest accesses it.
  (A minimal userspace sketch of this flow is appended at the end of
  this mail.)

Step b above can't be skipped here, otherwise we would have two
physical pages (one backing the private memory, another backing the
shared memory) for the same GPA range, causing "double allocation".

With 1G pages, it would be difficult to punch KB- or even MB-sized
holes, since supporting that would require splitting the 1G page
(which hugetlbfs doesn't support today, for good reasons), causing:
  - loss of the vmemmap optimization [3]
  - loss of the ability to reconstitute the huge page later, especially
    as private pages in CVMs are not relocatable today, increasing
    overall fragmentation over time
  - unless a smarter memory-reclaim algorithm is devised to
    reconstitute large pages for unmovable memory.

With the above limitations in place, the best thing could be to allow:
  - a single backing store for both shared and private memory ranges
  - host userspace to mmap the guest_memfd (as this series is trying
    to do)
  - userspace to fault in memfd file ranges that correspond to shared
    GPA ranges
     - page table mappings will need to be restricted to shared memory
       ranges, causing higher-granularity mappings than 1G (somewhat
       similar to what the HGM series from James [2] was trying to do)
  - the IOMMU to also map those pages (pfns would be requested using
    get_user_pages* APIs) so that devices can access shared memory.
    IOMMU management code would have to be enlightened, or somehow
    restricted, to map only shared regions of the guest_memfd.
  - upon conversion from shared to private, the host to ensure that
    there are no mappings/references present for the memory ranges
    being converted to private.

If the above use case sounds reasonable, GUP access to guest_memfd
pages should be allowed.

[1] https://lore.kernel.org/lkml/CAGtprH_H1afUJ2cUnznWqYLTZVuEcOogRwXF6uBAeHbLMQsrsQ@mail.gmail.com/
[2] https://lore.kernel.org/lkml/20230218002819.1486479-2-jthoughton@google.com/
[3] https://docs.kernel.org/mm/vmemmap_dedup.html

> --
> Cheers,
>
> David / dhildenb
>
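
Appendix (referenced above): a minimal userspace sketch of the
private -> shared conversion steps a) - c), assuming a kernel with
guest_memfd support. The file descriptors, the GPA-to-offset mapping
and the error handling are placeholders meant only to illustrate the
ordering of the steps, not a complete VMM implementation.

#define _GNU_SOURCE
#include <fcntl.h>        /* fallocate(), FALLOC_FL_* */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>    /* KVM_SET_MEMORY_ATTRIBUTES */

/*
 * Convert [gpa, gpa + size) from private to shared.
 *  vm_fd       - KVM VM file descriptor
 *  gmem_fd     - guest_memfd backing the private side of this range
 *  gmem_offset - offset of 'gpa' within the guest_memfd (derived from
 *                the memslot layout in a real VMM)
 */
static int convert_private_to_shared(int vm_fd, int gmem_fd,
				     uint64_t gpa, uint64_t size,
				     uint64_t gmem_offset)
{
	/* a) Update the guest memory attributes: clear the PRIVATE bit. */
	struct kvm_memory_attributes attrs = {
		.address = gpa,
		.size = size,
		.attributes = 0,  /* no longer KVM_MEMORY_ATTRIBUTE_PRIVATE */
	};
	if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
		return -1;

	/*
	 * b) Unback the private memory so the same GPA range is not backed
	 *    twice ("double allocation"). With 1G backing, this is the step
	 *    that would force splitting the huge page.
	 */
	if (fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      gmem_offset, size))
		return -1;

	/*
	 * c) Nothing to do eagerly for the shared side: in the dual backing
	 *    store model, the shared mapping is faulted in on guest or
	 *    userspace access.
	 */
	return 0;
}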