From: Nikita Kalyazin <kalyazin@amazon.com>
To: "David Hildenbrand (Red Hat)" <david@kernel.org>,
Peter Xu <peterx@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>, <linux-mm@kvack.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
"Axel Rasmussen" <axelrasmussen@google.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Hugh Dickins <hughd@google.com>,
"James Houghton" <jthoughton@google.com>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Michal Hocko <mhocko@suse.com>,
Paolo Bonzini <pbonzini@redhat.com>,
"Sean Christopherson" <seanjc@google.com>,
Shuah Khan <shuah@kernel.org>,
"Suren Baghdasaryan" <surenb@google.com>,
Vlastimil Babka <vbabka@suse.cz>, <linux-kernel@vger.kernel.org>,
<kvm@vger.kernel.org>, <linux-kselftest@vger.kernel.org>
Subject: Re: [PATCH v3 4/5] guest_memfd: add support for userfaultfd minor mode
Date: Thu, 4 Dec 2025 17:27:04 +0000
Message-ID: <2afda7a3-3c48-44e4-b462-49e0d223208b@amazon.com>
In-Reply-To: <6b21d20c-447f-4059-8cbd-76a8eeebe834@amazon.com>

On 03/12/2025 10:03, Nikita Kalyazin wrote:
> On 03/12/2025 09:23, David Hildenbrand (Red Hat) wrote:
>> On 12/2/25 12:50, Nikita Kalyazin wrote:
>>>
>>>
>>> On 01/12/2025 20:57, Peter Xu wrote:
>>>> On Mon, Dec 01, 2025 at 08:12:38PM +0000, Nikita Kalyazin wrote:
>>>>>
>>>>>
>>>>> On 01/12/2025 18:35, Peter Xu wrote:
>>>>>> On Mon, Dec 01, 2025 at 04:48:22PM +0000, Nikita Kalyazin wrote:
>>>>>>> I believe I found the precise point where we convinced ourselves that
>>>>>>> minor support was sufficient: [1]. If at this moment we don't find that
>>>>>>> reasoning valid anymore, then indeed implementing missing is the only
>>>>>>> option.
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/kvm/Z9GsIDVYWoV8d8-C@x1.local
>>>>>>
>>>>>> Now after I re-read the discussion, I may have made a wrong statement
>>>>>> there, sorry. I could have got slightly confused on when the write()
>>>>>> syscall can be involved.
>>>>>>
>>>>>> I agree if you want to get an event when cache missed with the current
>>>>>> uffd definitions and when pre-population is forbidden, then MISSING trap
>>>>>> is required. That is, with/without the need of UFFDIO_COPY being
>>>>>> available.
>>>>>>
>>>>>> Do I understand it right that UFFDIO_COPY is not allowed in your case,
>>>>>> but only write()?
>>>>>
>>>>> No, UFFDIO_COPY would work perfectly fine. We will still use write()
>>>>> whenever we resolve stage-2 faults as they aren't visible to UFFD. When a
>>>>> userfault occurs at an offset that already has a page in the cache, we
>>>>> will have to keep using UFFDIO_CONTINUE so it looks like both will be
>>>>> required:
>>>>>
>>>>> - user mapping major fault -> UFFDIO_COPY (fills the cache and sets up
>>>>>   userspace PT)
>>>>> - user mapping minor fault -> UFFDIO_CONTINUE (only sets up userspace PT)
>>>>> - stage-2 fault -> write() (only fills the cache)
>>>>
>>>> Is stage-2 fault about KVM_MEMORY_EXIT_FLAG_USERFAULT, per James's
>>>> series?
>>>
>>> Yes, that's the one ([1]).
>>>
>>> [1] https://lore.kernel.org/kvm/20250618042424.330664-1-jthoughton@google.com
>>>
>>>>
>>>> It looks fine indeed, but it looks slightly weird then, as you'll have two
>>>> ways to populate the page cache. Logically here atomicity is indeed not
>>>> needed when you trap both MISSING + MINOR.
>>>
>>> I reran the test based on the UFFDIO_COPY prototype I had using your
>>> series [2], and UFFDIO_COPY is slower than write() to populate 512 MiB:
>>> 237 vs 202 ms (+17%). Even though UFFDIO_COPY alone is functionally
>>> sufficient, I would prefer to have an option to use write() where
>>> possible and only falling back to UFFDIO_COPY for userspace faults to
>>> have better performance.
>>
>> Just so I understand correctly: we could even do without UFFDIO_COPY for
>> that scenario by using write() + minor faults?
>
> We still need major fault notifications as well (which we were
> accidentally generating until this version). But we can resolve them
> with write() + UFFDIO_CONTINUE instead of UFFDIO_COPY.

We had a conversation about that at the guest_memfd sync today:

Q: Is it possible from the API point of view to support MISSING
notifications without supporting UFFDIO_COPY?
A: The manpage [1] says about UFFDIO_REGISTER_MODE_MISSING that "the page
fault is resolved from user-space by either an UFFDIO_COPY or an
UFFDIO_ZEROPAGE ioctl", but I don't think it's actually enforced
anywhere in the code.
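
To make the alternative concrete, here is a rough sketch of resolving a
MISSING notification with write() + UFFDIO_CONTINUE instead of
UFFDIO_COPY. It is only an illustration under assumptions:
resolve_missing() and its parameters are hypothetical names, gmem_fd is
assumed to accept pwrite(), and the faulting range is assumed to be
registered with the uffd so that UFFDIO_CONTINUE is allowed on it.

```c
#define _GNU_SOURCE
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from any series in this thread.
 * Assumes gmem_fd supports pwrite() and that [hva, hva + len) is
 * registered with uffd. Returns 0 on success, -1 on error.
 */
int resolve_missing(int uffd, int gmem_fd, off_t offset, uint64_t hva,
		    const void *src, size_t len)
{
	struct uffdio_continue cont = {
		.range = { .start = hva, .len = len },
		.mode = 0,
	};

	/* Step 1: fill the page cache; a short write is treated as failure. */
	if (pwrite(gmem_fd, src, len, offset) != (ssize_t)len)
		return -1;

	/* Step 2: map the now-present page and wake the faulting thread. */
	return ioctl(uffd, UFFDIO_CONTINUE, &cont) ? -1 : 0;
}
```

Unlike UFFDIO_COPY, the two steps are separate syscalls, so the page
sits in the cache for a while before it is mapped.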

Q: UFFDIO_COPY is supposed to provide atomic semantics, while write() +
UFFDIO_CONTINUE does not. Is it a problem?
A: It isn't a problem for the particular Firecracker use case because 1)
vCPUs can be prevented from seeing partially populated pages in the
cache via KVM userfault intercept [2] and 2) we do not use other
userspace mappings. However, as James pointed out, in the general case,
other actors may observe partially populated pages via other userspace
mappings.
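
As a summary of which steps each fault type would take if UFFDIO_COPY is
not used at all, here is a small model. It is purely illustrative: the
enum, the STEP_* flags and resolve_steps() are made-up names and do not
correspond to any kernel or Firecracker interface.

```c
/* Hypothetical model of the fault types discussed in this thread. */
enum fault_kind {
	FAULT_USER_MISSING,	/* userspace mapping, no page in the cache */
	FAULT_USER_MINOR,	/* userspace mapping, page already in the cache */
	FAULT_STAGE2,		/* KVM userfault exit, not visible to uffd */
};

#define STEP_WRITE	(1u << 0)	/* write(): only fills the cache */
#define STEP_CONTINUE	(1u << 1)	/* UFFDIO_CONTINUE: only sets up userspace PT */

/* Returns the set of steps userspace performs to resolve the fault. */
unsigned int resolve_steps(enum fault_kind kind)
{
	switch (kind) {
	case FAULT_USER_MISSING:
		/* Two separate steps, hence no atomicity between them. */
		return STEP_WRITE | STEP_CONTINUE;
	case FAULT_USER_MINOR:
		return STEP_CONTINUE;
	case FAULT_STAGE2:
		return STEP_WRITE;
	}
	return 0;
}
```

Only the MISSING case takes two steps, and that is where the atomicity
question above comes from.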

[1] https://man7.org/linux/man-pages/man2/userfaultfd.2.html
[2] https://lore.kernel.org/kvm/20250618042424.330664-1-jthoughton@google.com

>
>>
>> But what you are saying is that there might be a performance benefit in
>> using UFFDIO_COPY for userspace faults, to avoid the write()+minor fault
>> overhead?
>
> UFFDIO_COPY _may_ be faster to resolve userspace faults because it's a
> single syscall instead of two, but the amount of userspace faults, at
> least in our scenario, is negligible compared to the amount of stage-2
> faults, so I wouldn't use it as an argument for supporting UFFDIO_COPY
> if it can be avoided.
>
>>
>> --
>> Cheers
>>
>> David
>