From: James Houghton <jthoughton@google.com>
Date: Wed, 5 Mar 2025 11:35:27 -0800
Subject: Re: [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing
To: Peter Xu
Cc: Nikita Kalyazin, akpm@linux-foundation.org, pbonzini@redhat.com, shuah@kernel.org, kvm@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, lorenzo.stoakes@oracle.com, david@redhat.com, ryan.roberts@arm.com, quic_eberman@quicinc.com, graf@amazon.de, jgowans@amazon.com, roypat@amazon.co.uk, derekmn@amazon.com, nsaenz@amazon.es, xmarcalx@amazon.com
References: <20250303133011.44095-1-kalyazin@amazon.com>
On Mon, Mar 3, 2025 at 1:29 PM Peter Xu wrote:
>
> On Mon, Mar 03, 2025 at 01:30:06PM +0000, Nikita Kalyazin wrote:
> > This series is built on top of the v3 write syscall support [1].
> >
> > With James's KVM userfault [2], it is possible to handle stage-2 faults
> > in guest_memfd in userspace.
> > However, KVM itself also triggers faults
> > in guest_memfd in some cases, for example: PV interfaces like kvmclock,
> > PV EOI, and page table walking code when fetching the MMIO instruction
> > on x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
> > that KVM would be accessing those pages via userspace page tables. In
> > order for such faults to be handled in userspace, guest_memfd needs to
> > support userfaultfd.
> >
> > This series proposes limited support for userfaultfd in guest_memfd:
> > - userfaultfd support is conditional on `CONFIG_KVM_GMEM_SHARED_MEM`
> >   (as is fault support in general)
> > - Only the `page missing` event is currently supported
> > - Userspace is supposed to respond to the event with the `write`
> >   syscall followed by the `UFFDIO_CONTINUE` ioctl to unblock the
> >   faulting process. Note that we can't use `UFFDIO_COPY` here because
> >   userfaultfd code does not know how to prepare guest_memfd pages, e.g.
> >   remove them from the direct map [4].
> >
> > Not included in this series:
> > - A proper interface for userfaultfd to recognise guest_memfd mappings
> > - Proper handling of truncation cases after locking the page
> >
> > Request for comments:
> > - Is it a sensible workflow for guest_memfd to resolve a userfault
> >   `page missing` event with the `write` syscall + `UFFDIO_CONTINUE`?
> >   One of the alternatives is teaching `UFFDIO_COPY` how to deal with
> >   guest_memfd pages.
>
> Probably not.. I don't see what protects against a thread faulting
> concurrently while write() is happening and seeing partial data. Since
> you only check the page cache, the fault will be let through, and the
> partially written page will be faulted in.

+1 here. I think the simplest way to make it work would be to also
check folio_test_uptodate() in the userfaultfd_missing() check [1]. It
would pair with kvm_gmem_mark_prepared() in the write() path [2]. I'm
not sure if that's the "right" way, but I think it would prevent
threads from reading data as it is being written.
[1]: https://lore.kernel.org/kvm/20250303133011.44095-3-kalyazin@amazon.com/
[2]: https://lore.kernel.org/kvm/20250303130838.28812-2-kalyazin@amazon.com/

> I think we may need to either go with full MISSING or full MINOR traps.

I agree, and just to clarify: you've basically implemented the MISSING
model, just using write() to resolve userfaults instead of UFFDIO_COPY.
The UFFDIO_CONTINUE implementation you have isn't really doing much;
when the page cache has a page, the fault handler will populate the
PTE for you.

I think it's probably simpler to implement the MINOR model, where
userspace can populate the page cache however it wants; write() is
perfectly fine/natural. UFFDIO_CONTINUE just needs to populate PTEs
for gmem, and the fault handler needs to check for the presence of
PTEs. The `struct vm_fault` you have should contain enough info.

> One thing to mention is we probably need MINOR sooner or later to support
> gmem huge pages. The thing is, for huge folios in gmem we can't rely on
> missing in the page cache, as we always need to allocate in hugetlb sizes.
>
> > - What is a way forward to make userfaultfd code aware of guest_memfd?
> >   I saw that Patrick hit a somewhat similar problem in [5] when trying
> >   to use direct map manipulation functions in KVM and was pointed by
> >   David at Elliot's guestmem library [6] that might include a shim for
> >   that. Would the library be the right place to expose required
> >   interfaces like `vma_is_gmem`?
>
> Not sure what's best to do, but IIUC the current way this series uses
> may not work as long as one tries to reference a KVM symbol from core mm..
>
> One trick I used so far is leveraging vm_ops and providing a hook function
> to report specialties when it's gmem. In general, I did not yet dare to
> overload vm_area_struct, but I'm thinking maybe vm_ops is more likely to
> be accepted. E.g.
something like this:
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5e742738240c..b068bb79fdbc 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -653,8 +653,26 @@ struct vm_operations_struct {
>          */
>         struct page *(*find_special_page)(struct vm_area_struct *vma,
>                                           unsigned long addr);
> +       /*
> +        * When set, return the allowed orders bitmask in faults of mmap()
> +        * ranges (e.g. for follow-up huge_fault() processing). Drivers
> +        * can use this to bypass THP setups for specific types of VMAs.
> +        */
> +       unsigned long (*get_supported_orders)(struct vm_area_struct *vma);
>  };
>
> +static inline bool vma_has_supported_orders(struct vm_area_struct *vma)
> +{
> +       return vma->vm_ops && vma->vm_ops->get_supported_orders;
> +}
> +
> +static inline unsigned long vma_get_supported_orders(struct vm_area_struct *vma)
> +{
> +       if (!vma_has_supported_orders(vma))
> +               return 0;
> +       return vma->vm_ops->get_supported_orders(vma);
> +}
>
> In my case I used that to allow gmem to report huge page support on faults.
>
> That said, the above only exists in my own tree so far, so I also don't
> know whether something like that could be accepted (even if it'll work
> for you).

I think it might be useful to implement an fs-generic MINOR mode. The
fault handler is already easy enough to do generically (though it
would become more difficult to determine whether a "MINOR" fault is
actually a MISSING fault, but at least for my userspace, the
distinction isn't important. :))

So the question becomes: what should UFFDIO_CONTINUE look like? And I
think it would be nice if UFFDIO_CONTINUE just called vm_ops->fault()
to get the page we want to map and then mapped it, instead of having
shmem-specific and hugetlb-specific versions (though maybe we need to
keep the hugetlb specialization...). That would avoid putting
kvm/gmem/etc. symbols in mm/userfaultfd code.

I've actually wanted to do this for a while but haven't had a good
reason to pursue it.
I wonder if it can be done in a backwards-compatible fashion...