Date: Fri, 27 Jun 2025 09:51:42 -0400
From: Peter Xu
To: Nikita Kalyazin
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Hugh Dickins,
	Oscar Salvador, Michal Hocko, David Hildenbrand, Muchun Song,
	Andrea Arcangeli, Ujwal Kundur, Suren Baghdasaryan, Andrew Morton,
	Vlastimil Babka, "Liam R. Howlett",
	James Houghton, Mike Rapoport, Lorenzo Stoakes, Axel Rasmussen
Subject: Re: [PATCH 0/4] mm/userfaultfd: modulize memory types
References: <20250620190342.1780170-1-peterx@redhat.com>
 <114133f5-0282-463d-9d65-3143aa658806@amazon.com>
 <7666ee96-6f09-4dc1-8cb2-002a2d2a29cf@amazon.com>
In-Reply-To: <7666ee96-6f09-4dc1-8cb2-002a2d2a29cf@amazon.com>

On Thu, Jun 26, 2025 at 05:09:47PM +0100, Nikita Kalyazin wrote:
> On 25/06/2025 21:17, Peter Xu wrote:
> > On Wed, Jun 25, 2025 at 05:56:23PM +0100, Nikita Kalyazin wrote:
> > > On 20/06/2025 20:03, Peter Xu wrote:
> > > > [based on akpm/mm-new]
> > > > 
> > > > This series is an alternative proposal of what Nikita proposed here on the
> > > > initial three patches:
> > > > 
> > > > https://lore.kernel.org/r/20250404154352.23078-1-kalyazin@amazon.com
> > > > 
> > > > This is not yet relevant to any guest-memfd support, but paving the way for it.
> > > 
> > > Hi Peter,
> > 
> > Hi, Nikita,
> > 
> > > Thanks for posting this.  I confirmed that minor fault handling was working
> > > for guest_memfd based on this series and looked simple (a draft based on
> > > mmap support in guest_memfd v7 [1]):
> > 
> > Thanks for the quick spin, glad to know it works.  Some trivial things to
> > mention below..
> 
> Following up, I drafted UFFDIO_COPY support for guest_memfd to confirm it
> works as well:

Appreciated.  While at it, I'll comment quickly below.

> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 8c44e4b9f5f8..b5458a22fff4 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -349,12 +349,19 @@ static bool kvm_gmem_offset_is_shared(struct file *file, pgoff_t index)
> 
>  static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
>  {
> +	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
>  	struct inode *inode = file_inode(vmf->vma->vm_file);
>  	struct folio *folio;
>  	vm_fault_t ret = VM_FAULT_LOCKED;
> 
>  	filemap_invalidate_lock_shared(inode->i_mapping);
> 
> +	folio = filemap_get_entry(inode->i_mapping, vmf->pgoff);
> +	if (!folio && vma && userfaultfd_missing(vma)) {
> +		filemap_invalidate_unlock_shared(inode->i_mapping);
> +		return handle_userfault(vmf, VM_UFFD_MISSING);
> +	}

There is likely a refcount leak when folio != NULL here:
filemap_get_entry() returns the folio with a reference held, and that
reference is never dropped on this path.

> +
>  	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>  	if (IS_ERR(folio)) {
>  		int err = PTR_ERR(folio);
> @@ -438,10 +445,57 @@ static int kvm_gmem_uffd_get_folio(struct inode *inode, pgoff_t pgoff,
>  	return 0;
>  }
> 
> +static int kvm_gmem_mfill_atomic_pte(pmd_t *dst_pmd,
> +				      struct vm_area_struct *dst_vma,
> +				      unsigned long dst_addr,
> +				      unsigned long src_addr,
> +				      uffd_flags_t flags,
> +				      struct folio **foliop)
> +{
> +	struct inode *inode = file_inode(dst_vma->vm_file);
> +	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> +	struct folio *folio;
> +	int ret;
> +
> +	folio = kvm_gmem_get_folio(inode, pgoff);
> +	if (IS_ERR(folio)) {
> +		ret = PTR_ERR(folio);
> +		goto out;
> +	}
> +
> +	folio_unlock(folio);
> +
> +	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) {
> +		void *vaddr = kmap_local_folio(folio, 0);
> +		ret = copy_from_user(vaddr, (const void __user *)src_addr, PAGE_SIZE);
> +		kunmap_local(vaddr);
> +		if (unlikely(ret)) {
> +			*foliop = folio;
> +			ret = -ENOENT;
> +			goto out;
> +		}
> +	} else { /* ZEROPAGE */
> +		clear_user_highpage(&folio->page, dst_addr);
> +	}
> +
> +	kvm_gmem_mark_prepared(folio);

Since Fuad's series hasn't landed yet, I'm mostly looking at the current
code base and imagining what might happen.

In general, MISSING trapping for guest-memfd could be slightly trickier.
So far, IIUC, the guest-memfd page cache pool is populated only by a
prior fallocate() syscall, not during fault.  So I suppose we will need
to use the uptodate bit to mark a folio ready, like what's done here.
If so, we may want to make sure that in the fault path any !uptodate
fault also gets trapped as MISSING, even though it's not strictly a
"cache miss"...  slightly confusing, but it sounds necessary.

Meanwhile, I'm not 100% sure how this goes once CoCo is taken into
account, because CoCo needs to prepare the pages, so marking a folio
uptodate may not be enough?  I don't know the CoCo side well enough to
tell.  Otherwise we'll at least need to restrict MISSING traps to fully
shared guest-memfds.

OTOH, MINOR should be much easier to do for guest-memfd, not only
because the code to support it would be minimal (which is definitely
lovely), but also because monitoring pgtable entries is still a pretty
common idea, and it should logically apply even to CoCo: in fault(), we
need to check whether the guest-memfd folio is "shared" and/or
"faultable" first; the fault() should already fail if it's a private
folio.  Then, if the folio is visible (aka, "faultable") in the HVA
namespace, it's legal to trap MINOR too.  For !CoCo it'll always trap,
as it's always faultable.
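
To illustrate the above (completely untested, just a sketch on top of the
same draft, ignoring shadow entries and folio locking for brevity),
kvm_gmem_fault() could do something like:

	/*
	 * Sketch only: trap MISSING when the page cache has no ready folio,
	 * trap MINOR when it does, and drop the reference taken by
	 * filemap_get_entry() so it isn't leaked.
	 */
	folio = filemap_get_entry(inode->i_mapping, vmf->pgoff);
	if (vma && userfaultfd_missing(vma) &&
	    (!folio || !folio_test_uptodate(folio))) {
		if (folio)
			folio_put(folio);
		filemap_invalidate_unlock_shared(inode->i_mapping);
		return handle_userfault(vmf, VM_UFFD_MISSING);
	}
	if (folio) {
		folio_put(folio);
		if (vma && userfaultfd_minor(vma)) {
			filemap_invalidate_unlock_shared(inode->i_mapping);
			return handle_userfault(vmf, VM_UFFD_MINOR);
		}
	}

The exact checks would of course depend on how "shared"/"faultable" ends
up being represented once Fuad's series lands.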

MINOR also makes more sense for the future 1G postcopy support on top of
gmem, because it is almost the only way to go there.  It looks like we've
made up our minds to reuse hugetlb pages for gmem, which sounds good; but
hugetlb pages are allocated at 1G granularity, and we can't easily do 4K
miss trapping on a 1G huge page.  MINOR is simpler but actually more
powerful from that POV.

To summarize, I think after this we can do MINOR before MISSING for
guest-memfd, if MINOR already works for you.  We can leave MISSING until
we know how we would use it.

Thanks,

> +
> +	ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
> +				       &folio->page, true, flags);
> +
> +	if (ret)
> +		folio_put(folio);
> +out:
> +	return ret;
> +}
> +
>  static const vm_uffd_ops kvm_gmem_uffd_ops = {
> -	.uffd_features = VM_UFFD_MINOR,
> -	.uffd_ioctls = BIT(_UFFDIO_CONTINUE),
> +	.uffd_features = VM_UFFD_MISSING | VM_UFFD_MINOR,
> +	.uffd_ioctls = BIT(_UFFDIO_COPY) |
> +		       BIT(_UFFDIO_ZEROPAGE) |
> +		       BIT(_UFFDIO_CONTINUE),
>  	.uffd_get_folio = kvm_gmem_uffd_get_folio,
> +	.uffd_copy = kvm_gmem_mfill_atomic_pte,
>  };
>  #endif
> 
> > 
> > > 
> > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > index 5abb6d52a375..6ddc73419724 100644
> > > --- a/virt/kvm/guest_memfd.c
> > > +++ b/virt/kvm/guest_memfd.c
> > > @@ -5,6 +5,9 @@
> > >  #include
> > >  #include
> > >  #include
> > > +#ifdef CONFIG_USERFAULTFD
> > 
> > This ifdef not needed, userfaultfd_k.h has taken care of all cases.
> 
> Good to know, thanks.
> 
> > > +#include <linux/userfaultfd_k.h>
> > > +#endif
> > > 
> > >  #include "kvm_mm.h"
> > > 
> > > @@ -396,6 +399,14 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> > >  		kvm_gmem_mark_prepared(folio);
> > >  	}
> > > 
> > > +#ifdef CONFIG_USERFAULTFD
> > 
> > Same here.  userfaultfd_minor() is always defined.
> 
> Thank you.
> 
> > I'll wait for a few more days for reviewers, and likely send v2 before next
> > week.
> > 
> > Thanks,
> > 
> > -- 
> > Peter Xu
> 

-- 
Peter Xu