PROBLEM: userfaultfd REGISTER minor mode on MAP

linux-mm.kvack.org archive mirror

* PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE  range fails
@ 2025-09-15 20:13 David P. Reed
  2025-09-15 20:24 ` James Houghton
  0 siblings, 1 reply; 26+ messages in thread
From: David P. Reed @ 2025-09-15 20:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm


[1.] One line summary of the problem: userfaultfd REGISTER minor mode on MAP_PRIVATE fails
[2.] Full description of the problem/report:
The userfaultfd man page and the kernel docs seem to indicate that an area mapped
MAP_PRIVATE|MAP_ANONYMOUS can be registered to handle MINOR page faults on regular pages.
However, testing showed that not to work. MAP_SHARED does allow registration for MINOR
page fault events, though.
Either the documentation or the code should be fixed, IMO. Now reading the code that rejects
this case in the kernel source, the test in vma_can_userfault() that rejects this is this
line: 
	if ((vm_flags & VM_UFFD_MINOR) &&
	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
		return false;
which probably should include !vma_is_anonymous(vma).

Or maybe the COW that might happen if the program were forked is something that can't be handled, which seems odd.


[3.] Keywords (i.e., modules, networking, kernel): kernel, memory management
[4.] Kernel information
[4.1.] Kernel version (from /proc/version): Linux version 6.15.10-200.fc42.x86_64 (mockbuild@14a33d64645143cab3659d1335d9f80c) (gcc (GCC) 15.2.1 20250808 (Red Hat 15.2.1-1), GNU ld version 2.44-6.fc42) #1 SMP PREEMPT_DYNAMIC Fri Aug 15 15:57:06 UTC 2025


I can construct a program that exhibits just this bug, if that will help. For now, though, I have just replaced MAP_PRIVATE with MAP_SHARED on my anonymous pages, as a workaround. I don't fork, so there's no need for COW.
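
A minimal sketch of such a reproducer, assuming kernel headers new enough to define UFFD_FEATURE_MINOR_SHMEM and UFFD_USER_MODE_ONLY. The helper name try_register_minor() is made up for illustration; it returns 0 if UFFDIO_REGISTER accepts minor mode on a one-page anonymous mapping created with the given flags, otherwise the errno of the step that failed:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: try to register one anonymous page for minor faults.
 * Returns 0 on success, otherwise the errno of the first failing step. */
int try_register_minor(int map_flags)
{
	long page = sysconf(_SC_PAGESIZE);
	assert(page > 0);

	/* UFFD_USER_MODE_ONLY lets unprivileged callers create the fd. */
	int uffd = (int)syscall(SYS_userfaultfd,
				O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
	if (uffd < 0)
		return errno;

	/* API handshake, asking for minor-fault support on shmem. */
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_MINOR_SHMEM,
	};
	if (ioctl(uffd, UFFDIO_API, &api) < 0) {
		int err = errno;
		close(uffd);
		return err;
	}

	void *addr = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
			  map_flags, -1, 0);
	if (addr == MAP_FAILED) {
		int err = errno;
		close(uffd);
		return err;
	}

	/* This is the step the report is about. */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = (unsigned long)page },
		.mode = UFFDIO_REGISTER_MODE_MINOR,
	};
	int err = ioctl(uffd, UFFDIO_REGISTER, &reg) < 0 ? errno : 0;

	munmap(addr, (size_t)page);
	close(uffd);
	return err;
}
```

Calling it with MAP_SHARED|MAP_ANONYMOUS and then with MAP_PRIVATE|MAP_ANONYMOUS should show the difference described above. I'd expect EINVAL for the private case, but a restricted environment may fail earlier (for example EPERM or ENOSYS from userfaultfd() itself).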




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-15 20:13 PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails David P. Reed
@ 2025-09-15 20:24 ` James Houghton
  2025-09-15 22:58   ` David P. Reed
  0 siblings, 1 reply; 26+ messages in thread
From: James Houghton @ 2025-09-15 20:24 UTC (permalink / raw)
  To: David P. Reed; +Cc: Andrew Morton, linux-mm, Peter Xu, Axel Rasmussen

On Mon, Sep 15, 2025 at 1:13 PM David P. Reed <dpreed@deepplum.com> wrote:
>
>
> [1.] One line summary of the problem: userfaultfd REGISTER minor mode on MAP_PRIVATE fails
> [2.] Full description of the problem/report:
> The userfaultfd man page and the kernel docs seem to indicate that an area mapped
> MAP_PRIVATE|MAP_ANONYMOUS can be registered to handle MINOR page faults on regular pages.
> However, testing showed that not to work. MAP_SHARED does allow registration for MINOR
> page fault events, though.
> Either the documentation or the code should be fixed, IMO. Now reading the code that rejects
> this case in the kernel source, the test in vma_can_userfault() that rejects this is this
> line:
>         if ((vm_flags & VM_UFFD_MINOR) &&
>             (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
>                 return false;
> which probably should include !vma_is_anonymous(vma).
>
> Or maybe the COW that might happen if the program were forked is something that can't be handled, which seems odd.

UFFDIO_CONTINUE, the resolution ioctl for userfaultfd minor faults,
doesn't have defined semantics for MAP_PRIVATE mappings. The
documentation is unclear that MAP_PRIVATE + userfaultfd minor faults
is invalid, but this is intentional behavior.

What would you like UFFDIO_CONTINUE on MAP_PRIVATE to do? Should it
populate a read-only PTE? Should it do CoW and populate a writable
PTE? I'm curious to hear more about your use case (and why UFFDIO_COPY
doesn't do what you want).


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-15 20:24 ` James Houghton
@ 2025-09-15 22:58   ` David P. Reed
  2025-09-16  0:31     ` James Houghton
  0 siblings, 1 reply; 26+ messages in thread
From: David P. Reed @ 2025-09-15 22:58 UTC (permalink / raw)
  To: James Houghton; +Cc: Andrew Morton, linux-mm, Peter Xu, Axel Rasmussen

On Monday, September 15, 2025 16:24, "James Houghton" <jthoughton@google.com> said:

> On Mon, Sep 15, 2025 at 1:13 PM David P. Reed <dpreed@deepplum.com> wrote:
>>
>>
>> [1.] One line summary of the problem: userfaultfd REGISTER minor mode on
>> MAP_PRIVATE fails
>> [2.] Full description of the problem/report:
>> The userfaultfd man page and the kernel docs seem to indicate that an area
>> mapped
>> MAP_PRIVATE|MAP_ANONYMOUS can be registered to handle MINOR page faults on
>> regular pages.
>> However, testing showed that not to work. MAP_SHARED does allow registration for
>> MINOR
>> page fault events, though.
>> Either the documentation or the code should be fixed, IMO. Now reading the code
>> that rejects
>> this case in the kernel source, the test in vma_can_userfault() that rejects this
>> is this
>> line:
>>         if ((vm_flags & VM_UFFD_MINOR) &&
>>             (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
>>                 return false;
>> which probably should include !vma_is_anonymous(vma).
>>
>> Or maybe the COW that might happen if the program were forked is something that
>> can't be handled, which seems odd.
> 
> UFFDIO_CONTINUE, the resolution ioctl for userfaultfd minor faults,
> doesn't have defined semantics for MAP_PRIVATE mappings. The
> documentation is unclear that MAP_PRIVATE + userfaultfd minor faults
> is invalid, but this is intentional behavior.
> 
> What would you like UFFDIO_CONTINUE on MAP_PRIVATE to do? Should it
> populate a read-only PTE? Should it do CoW and populate a writable
> PTE? I'm curious to hear more about your use case (and why UFFDIO_COPY
> doesn't do what you want).
> 

Well, I was just expecting UFFDIO_CONTINUE to do whatever "normally" gets done. So, the normal case for MAP_PRIVATE|MAP_ANONYMOUS, if the page is in the swap cache and thus takes a minor fault, would depend on whether the access was a write or a read.

For a read, the page just gets installed in the page map from the swap cache.
For a write, if the page hasn't yet been copied, a copy is made of the swap cache contents of that page at that point, and the new copy is installed into the page table of the writing process.

However, the problem I'm reporting is that I can't even register such a page for minor page faults. 

Now there is a question of what the meaning of UFFDIO_COPY (not CONTINUE) should be. If the page is MAP_PRIVATE, UFFDIO_COPY is like writing to the page at the time of the minor fault. So the version of the data in the swap cache for the page should be ignored; replacing the local version makes sense. Any other process that still has the original version from the time of the fork() that shared the page should not be affected, I would think.

There is a confusing possibility, however, with the file descriptor for uffd. In the case of a fork(), the file descriptor would be shared, and so either fork could end up listening via poll/select.

It's hard to decide what is right semantically, because the normal use of userfault is to monitor from another process, though you can use read() in the same process as the faulting one - this seems to be because either fork or a unix-socket can be the path for sending the file descriptor to another process. But this is just definitional, the actual user design would have to handle faults in one place or another.

Now in this case, whichever process does the first read() on the file descriptor would get the information about the minor fault. (I assume both would NOT, but I'm early in my use of userfaultfd). So it could continue or copy, as desired.

Generally, anyone using userfaultfd would understand the nuances of fork() and file handle duplication. So they would probably close the fd in one process or the other, as appropriate. (I admit I haven't tested what happens if both forks try to use the file descriptor, but I can imagine it might be useful if they coordinate carefully).

Now, if many forks end up sharing the uffd file descriptor and also end up with copy-on-write shared pages in the MAP_PRIVATE region, the above definitions of the continue and copy would continue to make sense - to me anyway.

Hope this helps

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-15 22:58   ` David P. Reed
@ 2025-09-16  0:31     ` James Houghton
  2025-09-16 14:48       ` Peter Xu
  2025-09-16 15:37       ` David P. Reed
  0 siblings, 2 replies; 26+ messages in thread
From: James Houghton @ 2025-09-16  0:31 UTC (permalink / raw)
  To: David P. Reed; +Cc: Andrew Morton, linux-mm, Peter Xu, Axel Rasmussen

On Mon, Sep 15, 2025 at 3:58 PM David P. Reed <dpreed@deepplum.com> wrote:
>
>
>
> On Monday, September 15, 2025 16:24, "James Houghton" <jthoughton@google.com> said:
>
> > On Mon, Sep 15, 2025 at 1:13 PM David P. Reed <dpreed@deepplum.com> wrote:
> >>
> >>
> >> [1.] One line summary of the problem: userfaultfd REGISTER minor mode on
> >> MAP_PRIVATE fails
> >> [2.] Full description of the problem/report:
> >> The userfaultfd man page and the kernel docs seem to indicate that an area
> >> mapped
> >> MAP_PRIVATE|MAP_ANONYMOUS can be registered to handle MINOR page faults on
> >> regular pages.
> >> However, testing showed that not to work. MAP_SHARED does allow registration for
> >> MINOR
> >> page fault events, though.
> >> Either the documentation or the code should be fixed, IMO. Now reading the code
> >> that rejects
> >> this case in the kernel source, the test in vma_can_userfault() that rejects this
> >> is this
> >> line:
> >>         if ((vm_flags & VM_UFFD_MINOR) &&
> >>             (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
> >>                 return false;
> >> which probably should include !vma_is_anonymous(vma).
> >>
> >> Or maybe the COW that might happen if the program were forked is something that
> >> can't be handled, which seems odd.
> >
> > UFFDIO_CONTINUE, the resolution ioctl for userfaultfd minor faults,
> > doesn't have defined semantics for MAP_PRIVATE mappings. The
> > documentation is unclear that MAP_PRIVATE + userfaultfd minor faults
> > is invalid, but this is intentional behavior.
> >
> > What would you like UFFDIO_CONTINUE on MAP_PRIVATE to do? Should it
> > populate a read-only PTE? Should it do CoW and populate a writable
> > PTE? I'm curious to hear more about your use case (and why UFFDIO_COPY
> > doesn't do what you want).
> >
>
> Well, I was just expecting UFFDIO_CONTINUE to do whatever "normally" gets done. So, the normal case for MAP_PRIVATE|MAP_ANONYMOUS, if the page is in the swap cache and thus takes a minor fault, would depend on whether the access was a write or a read.

This minor fault is not a *userfaultfd* minor fault, and even if
registering UFFD_REGISTER_MODE_MINOR on this VMA were allowed, you
wouldn't get userfaults. This is because swap-outs for MAP_ANONYMOUS
VMAs leave behind a swap entry (!pte_present() && !pte_none()).
UFFDIO_CONTINUE cannot resolve this condition, so no minor fault is
generated in the first place.

Why can't UFFDIO_CONTINUE resolve this condition? Well UFFDIO_CONTINUE
only populates pte_none() PTEs; it will not and should not obliterate
a swap entry. And no one has a use-case for making it trigger a
swap-in.

The same logic applies to CoW; CoW faults are not (minor) userfaults
because UFFDIO_CONTINUE cannot resolve them.
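
To restate the PTE argument above as a toy model (plain userspace C, not kernel code; the enum and function names are made up for illustration, and the page-cache condition is my paraphrase of how the userfaultfd docs define a minor fault):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the three PTE states discussed above. */
enum pte_state {
	PTE_NONE,	/* pte_none(): nothing installed yet */
	PTE_PRESENT,	/* pte_present(): page is mapped */
	PTE_SWAP_ENTRY,	/* !pte_present() && !pte_none(): swapped-out anon page */
};

/* UFFDIO_CONTINUE only fills pte_none() slots; it never replaces a
 * swap entry or an already-present mapping. */
bool continue_can_resolve(enum pte_state s)
{
	assert(s == PTE_NONE || s == PTE_PRESENT || s == PTE_SWAP_ENTRY);
	return s == PTE_NONE;
}

/* A userfaultfd minor fault is only raised for a condition that
 * UFFDIO_CONTINUE could resolve: the backing page already exists in the
 * page cache, but this PTE is still empty. */
bool raises_uffd_minor_fault(enum pte_state s, bool page_in_page_cache)
{
	return page_in_page_cache && continue_can_resolve(s);
}
```

In this model a swapped-out anonymous page (PTE_SWAP_ENTRY) never produces a userfaultfd minor fault, whatever the registration mode, which is the point being made about swap-in and CoW.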

> For a read, the page just gets installed in the page map from the swap cache.
> For a write, if the page hasn't yet been copied, a copy is made of the swap cache contents of that page at that point, and the new copy is installed into the page table of the writing process.

Sure, but if this is the behavior you want, why do you want/need userfaultfd?

> However, the problem I'm reporting is that I can't even register such a page for minor page faults.

I understand; I find it easier to speak in terms of the behavior of
the resolution ioctl (it is equivalent).

> Now there is a question of what the meaning of UFFDIO_COPY (not CONTINUE) should be. If the page is MAP_PRIVATE, UFFDIO_COPY is like writing to the page at the time of the minor fault. So the version of the data in the swap cache for the page should be ignored; replacing the local version makes sense. Any other process that still has the original version from the time of the fork() that shared the page should not be affected, I would think.
>
> There is a confusing possibility, however, with the file descriptor for uffd. In the case of a fork(), the file descriptor would be shared, and so either fork could end up listening via poll/select.
>
> It's hard to decide what is right semantically, because the normal use of userfault is to monitor from another process, though you can use read() in the same process as the faulting one - this seems to be because either fork or a unix-socket can be the path for sending the file descriptor to another process. But this is just definitional, the actual user design would have to handle faults in one place or another.
>
> Now in this case, whichever process does the first read() on the file descriptor would get the information about the minor fault. (I assume both would NOT, but I'm early in my use of userfaultfd). So it could continue or copy, as desired.
>
> Generally, anyone using userfaultfd would understand the nuances of fork() and file handle duplication. So they would probably close the fd in one process or the other, as appropriate. (I admit I haven't tested what happens if both forks try to use the file descriptor, but I can imagine it might be useful if they coordinate carefully).

I am not really following how the above connects to not being able to
use userfaultfd minor faults for MAP_PRIVATE.

>
> Now, if many forks end up sharing the uffd file descriptor and also end up with copy-on-write shared pages in the MAP_PRIVATE region, the above definitions of the continue and copy would continue to make sense - to me anyway.
>
> Hope this helps

I still don't have a solid grasp of what your use case is.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-16  0:31     ` James Houghton
@ 2025-09-16 14:48       ` Peter Xu
  2025-09-16 15:52         ` David P. Reed
  2025-09-16 15:37       ` David P. Reed
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Xu @ 2025-09-16 14:48 UTC (permalink / raw)
  To: James Houghton; +Cc: David P. Reed, Andrew Morton, linux-mm, Axel Rasmussen

On Mon, Sep 15, 2025 at 05:31:51PM -0700, James Houghton wrote:
> I still don't have a solid grasp of what your use case is.

David, are you trying to provide a synchronous trap for an anon swapin
event?  Say, you want to be able to stop a thread from swapping in anything
from disk (or swap cache), do something, and UFFDIO_CONTINUE to kick it off
again?

That might make some sense when trying to match what MINOR mode means
v.s. the mm's minor faults, but some explanation of why you wanted to do
that would be helpful.  I agree with James that it was at least not the
intention when userfaultfd MINOR trap was introduced.

The other thing to mention is, AFAIU userfaultfd's major use case is not
through a fork(), even though it should work.  In many cases, userfaultfd
is used within a process, with a dedicated thread resolving faults. When
it's used across processes, fork() should work but UFFD_FEATURE_EVENT_FORK
is required, or otherwise via SCM_RIGHTS.  For the latter, the tracee needs
to not only share the uffd object, but tell the tracer explicitly about the
memory layout, because those addresses in the events will be reported in
tracee's mm address space.
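
For the SCM_RIGHTS route, a minimal sketch of passing a descriptor (such as the uffd) over an AF_UNIX socket. send_fd() and recv_fd() are hypothetical helper names, and the memory layout still has to be communicated separately:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send one file descriptor over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
	assert(sock >= 0 && fd >= 0);

	char byte = 0;	/* carry one data byte along with the control message */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align;	/* forces correct alignment */
		char buf[CMSG_SPACE(sizeof(int))];
	} ctrl;
	struct msghdr msg;

	memset(&ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor sent by send_fd().
 * Returns the new descriptor, or -1 on error. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} ctrl;
	struct msghdr msg;
	int fd;

	memset(&ctrl, 0, sizeof(ctrl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets its own descriptor number for the same open file, so the tracer can read() events from it while the tracee keeps or closes its copy.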

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-16  0:31     ` James Houghton
  2025-09-16 14:48       ` Peter Xu
@ 2025-09-16 15:37       ` David P. Reed
  1 sibling, 0 replies; 26+ messages in thread
From: David P. Reed @ 2025-09-16 15:37 UTC (permalink / raw)
  To: James Houghton; +Cc: Andrew Morton, linux-mm, Peter Xu, Axel Rasmussen



On Monday, September 15, 2025 20:31, "James Houghton" <jthoughton@google.com> said:

> On Mon, Sep 15, 2025 at 3:58 PM David P. Reed <dpreed@deepplum.com> wrote:
>>
>>
>>
>> On Monday, September 15, 2025 16:24, "James Houghton" <jthoughton@google.com>
>> said:
>>
>> > On Mon, Sep 15, 2025 at 1:13 PM David P. Reed <dpreed@deepplum.com>
>> wrote:
>> >>
>> >>
>> >> [1.] One line summary of the problem: userfaultfd REGISTER minor mode on
>> >> MAP_PRIVATE fails
>> >> [2.] Full description of the problem/report:
>> >> The userfaultfd man page and the kernel docs seem to indicate that an area
>> >> mapped
>> >> MAP_PRIVATE|MAP_ANONYMOUS can be registered to handle MINOR page faults on
>> >> regular pages.
>> >> However, testing showed that not to work. MAP_SHARED does allow registration
>> for
>> >> MINOR
>> >> page fault events, though.
>> >> Either the documentation or the code should be fixed, IMO. Now reading the
>> code
>> >> that rejects
>> >> this case in the kernel source, the test in vma_can_userfault() that rejects
>> this
>> >> is this
>> >> line:
>> >>         if ((vm_flags & VM_UFFD_MINOR) &&
>> >>             (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
>> >>                 return false;
>> >> which probably should include !vma_is_anonymous(vma).
>> >>
>> >> Or maybe the COW that might happen if the program were forked is something
>> that
>> >> can't be handled, which seems odd.
>> >
>> > UFFDIO_CONTINUE, the resolution ioctl for userfaultfd minor faults,
>> > doesn't have defined semantics for MAP_PRIVATE mappings. The
>> > documentation is unclear that MAP_PRIVATE + userfaultfd minor faults
>> > is invalid, but this is intentional behavior.
>> >
>> > What would you like UFFDIO_CONTINUE on MAP_PRIVATE to do? Should it
>> > populate a read-only PTE? Should it do CoW and populate a writable
>> > PTE? I'm curious to hear more about your use case (and why UFFDIO_COPY
>> > doesn't do what you want).
>> >
>>
>> Well, I was just expecting UFFDIO_CONTINUE to do whatever "normally" gets
>> done. So, the normal case for MAP_PRIVATE|MAP_ANONYMOUS, if the page is in the
>> swap cache and thus takes a minor fault, would depend on whether the access was a
>> write or a read.
> 
> This minor fault is not a *userfaultfd* minor fault, and even if
> registering UFFD_REGISTER_MODE_MINOR on this VMA were allowed, you
> wouldn't get userfaults. This is because swap-outs for MAP_ANONYMOUS
> VMAs leave behind a swap entry (!pte_present() && !pte_none()).
> UFFDIO_CONTINUE cannot resolve this condition, so no minor fault is
> generated in the first place.
> 
> Why can't UFFDIO_CONTINUE resolve this condition? Well UFFDIO_CONTINUE
> only populates pte_none() PTEs; it will not and should not obliterate
> a swap entry. And no one has a use-case for making it trigger a
> swap-in.

Well, it's not a page-missing fault, because the page may be in the swap cache and not yet on disk, which the documentation says is not a major (page missing) fault. Maybe it's a documentation problem, if "userfault minor" and "minor fault" aren't the same?
As I note in the report, the documentation is pretty unclear on this point (and also on why MAP_PRIVATE doesn't work).

> 
> The same logic applies to CoW; CoW faults are not (minor) userfaults
> because UFFDIO_CONTINUE cannot resolve them.
> 
>> For a read, the page just gets installed in the page map from the swap cache.
>> For a write, if the page hasn't yet been copied, a copy is made of the swap cache
>> contents of that page at that point, and the new copy is installed into the page
>> table of the writing process.
> 
> Sure, but if this is the behavior you want, why do you want/need userfaultfd?

Because I am tracking page creation events. The COW case creates a new page, and UFFDIO_COPY isn't able to express "just proceed". If the COW is silent, then I won't see page creation by COW. Maybe we need another "mode" besides "missing" and "minor" that gets triggered by COW? (Note that write-protect isn't quite the same as that, because it also gets triggered by writes that don't cause COW, if it is even allowed for the case of MAP_ANONYMOUS|MAP_PRIVATE.)

> 
>> However, the problem I'm reporting is that I can't even register such a page for
>> minor page faults.
> 
> I understand; I find it easier to speak in terms of the behavior of
> the resolution ioctl (it is equivalent).
>
How is it equivalent to the REGISTER rejecting a mode?
 
>> Now there is a question of what the meaning of UFFDIO_COPY (not CONTINUE) should
>> be. If the page is MAP_PRIVATE, UFFDIO_COPY is like writing to the page at the
>> time of the minor fault. So the version of the data in the swap cache for the page
>> should be ignored; replacing the local version makes sense. Any other process that still
>> has the original version from the time of the fork() that shared the page should
>> not be affected, I would think.
>>
>> There is a confusing possibility, however, with the file descriptor for uffd. In
>> the case of a fork(), the file descriptor would be shared, and so either fork
>> could end up listening via poll/select.
>>
>> It's hard to decide what is right semantically, because the normal use of
>> userfault is to monitor from another process, though you can use read() in the
>> same process as the faulting one - this seems to be because either fork or a
>> unix-socket can be the path for sending the file descriptor to another process.
>> But this is just definitional, the actual user design would have to handle faults
>> in one place or another.
>>
>> Now in this case, whichever process does the first read() on the file descriptor
>> would get the information about the minor fault. (I assume both would NOT, but
>> I'm early in my use of userfaultfd). So it could continue or copy, as desired.
>>
>> Generally, anyone using userfaultfd would understand the nuances of fork() and
>> file handle duplication. So they would probably close the fd in one process or
>> the other, as appropriate. (I admit I haven't tested what happens if both forks
>> try to use the file descriptor, but I can imagine it might be useful if they
>> coordinate carefully).
> 
> I am not really following how the above connects to not being able to
> use userfaultfd minor faults for MAP_PRIVATE.
> 
>>
>> Now, if many forks end up sharing the uffd file descriptor and also end up with
>> copy-on-write shared pages in the MAP_PRIVATE region, the above definitions of
>> the continue and copy would continue to make sense - to me anyway.
>>
>> Hope this helps
> 
> I still don't have a solid grasp of what your use case is.
> 

My use case is simple, and has been described elsewhere by others: creating a userspace paging monitor process that can catch page faults in userspace, either tracking them or modifying their behavior. Not being able to handle anonymous, private pages at all seems to make it useless for that purpose. (I am doing research on using that fault info to drive madvise LRU management for certain cases. I can do this with kretprobes through a kernel driver, but since the mm code is rapidly evolving, it's nothing like an ABI that works across versions.)

It's interesting that anonymous private huge page minor fault mode is not rejected, just regular pages. (The code snippet above is what rejects regular pages but not huge pages mapped privately.)

Just curious - are you a designer or maintainer of userfaultfd? You aren't listed as a maintainer. I would be able to provide a patch set that "fixes" the behavior to be the way I believe would be the most useful, but the point of reporting this as a problem is to avoid rejection by the maintainer, Andrew Morton, if somehow I've missed a subtle concern that isn't explained in the documentation. 
I could also provide a documentation patch to clarify that minor fault handling is rejected for MAP_PRIVATE|MAP_ANONYMOUS mappings.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-16 14:48       ` Peter Xu
@ 2025-09-16 15:52         ` David P. Reed
  2025-09-16 16:13           ` Peter Xu
  0 siblings, 1 reply; 26+ messages in thread
From: David P. Reed @ 2025-09-16 15:52 UTC (permalink / raw)
  To: Peter Xu; +Cc: James Houghton, Andrew Morton, linux-mm, Axel Rasmussen

On Tuesday, September 16, 2025 10:48, "Peter Xu" <peterx@redhat.com> said:

> On Mon, Sep 15, 2025 at 05:31:51PM -0700, James Houghton wrote:
>> I still don't have a solid grasp of what your use case is.
> 
> David, are you trying to provide a synchronous trap for an anon swapin
> event?  Say, you want to be able to stop a thread from swapping in anything
> from disk (or swap cache), do something, and UFFDIO_CONTINUE to kick it off
> again?

Synchronous would be better. But what I want to do is at least get notifications of swapin events (including the case when the page is in swap cache). Also, using UFFDIO_COPY might make sense for the swap-in case (but rarely, because there's no way to access the data that was swapped out).

> 
> That might make some sense when trying to match what MINOR mode means
> v.s. the mm's minor faults, but some explanation of why you wanted to do
> that would be helpful.  I agree with James that it was at least not the
> intention when userfaultfd MINOR trap was introduced.

I suspected that - however, notice that the rejection of registering minor mode is what I was reporting. It's oddly coded as if ordinary (4K) pages that are MAP_PRIVATE are the problem - no comment in the line of code I quoted explains why.

> 
> The other thing to mention is, AFAIU userfaultfd's major use case is not
> through a fork(), even though it should work.  In many cases, userfaultfd
> is used within a process, with a dedicated thread resolving faults. When
> it's used across processes, fork() should work but UFFD_FEATURE_EVENT_FORK
> is required, or otherwise via SCM_RIGHTS.  For the latter, the tracee needs
> to not only share the uffd object, but tell the tracer explicitly about the
> memory layout, because those addresses in the events will be reported in
> tracee's mm address space.

Yeah, the documentation tends to suggest that the file descriptor should be shared via a Unix socket. But the case of a fork() should work. (The examples use O_CLOEXEC, but of course that isn't "close on fork".)

> 
> Thanks,
> 
> --
> Peter Xu
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-16 15:52         ` David P. Reed
@ 2025-09-16 16:13           ` Peter Xu
  2025-09-16 17:09             ` David P. Reed
  2025-09-16 17:27             ` David P. Reed
  0 siblings, 2 replies; 26+ messages in thread
From: Peter Xu @ 2025-09-16 16:13 UTC (permalink / raw)
  To: David P. Reed; +Cc: James Houghton, Andrew Morton, linux-mm, Axel Rasmussen

On Tue, Sep 16, 2025 at 11:52:18AM -0400, David P. Reed wrote:
> Synchronous would be better. But what I want to do is at least get
> notifications of swapin events (including the case when the page is in
> swap cache). Also, using UFFDIO_COPY might make sense for the swap-in
> case (but rarely, because there's no way to access the data that was
> swapped out).

Some more info on the use case might be helpful.  I can start with some
more questions if that helps.

- If it's about page hotness / coldness, have you tried existing facilities
  (page idle, DAMON, etc.)?  If so, why won't they work?

- Assuming it's async reports that can be collected, what do you plan to do
  with the info?  Do you care about swap outs prior to swap ins?

- How would sync events be better in this case?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-16 16:13           ` Peter Xu
@ 2025-09-16 17:09             ` David P. Reed
  2025-09-26 22:16               ` Peter Xu
  2025-09-16 17:27             ` David P. Reed
  1 sibling, 1 reply; 26+ messages in thread
From: David P. Reed @ 2025-09-16 17:09 UTC (permalink / raw)
  To: Peter Xu; +Cc: James Houghton, Andrew Morton, linux-mm, Axel Rasmussen

Than -

Thanks for your interest. Some clarifications on my current use case are interposed below. 

On Tuesday, September 16, 2025 12:13, "Peter Xu" <peterx@redhat.com> said:

> On Tue, Sep 16, 2025 at 11:52:18AM -0400, David P. Reed wrote:
>> Synchronous would be better. But what I want to do is at least get
>> notifications of swapin events (including the case when the page is in
>> swap cache). Also, using UFFDIO_COPY might make sense for the swap-in
>> case (but rarely, because there's no way to access the data that was
>> swapped out).
> 
> Some more info on the use case might be helpful.  I can start with some
> more questions if that helps.
> 
> - If it's about page hotness / coldness, have you tried existing facilities
>   (page idle, DAMON, etc.)?  If so, why won't they work?

Yes. Those functions are just summarizers, giving counts and averages. They provide zero detail about specific pages to the application running in the process.

I can clarify my use case focus by pointing out what inspired me to use userfaultfd for application specific memory management (which is, after all, what userfaultfd was promoted for on Linux Weekly News a while back when it first came out). This 2024 paper is along the same lines as what I'm researching, and was published in 2024 Usenix proceedings.
ExtMem: Enabling Application Aware Virtual Memory Management for Data Intensive Applications
https://www.usenix.org/conference/atc24/presentation/jalalian

See figure 3 of the paper for their performance problem with userfaultfd vs. their kernel modifications (upcall). 

They had tried using userfaultfd for their work, and found it was "too slow" compared to what they call the "upcall" technique they achieved by modifying the kernel page fault handling path. (See the paper for details - and Jalalian's thesis dives into it more deeply.)
I could code up an equivalent to their "upcall" - but that would be completely non-standard (and fraught with security issues, as well as not being able to use a separate management process).
For me, the performance concern is less problematic - I'm doing application analysis and experimentation. And I don't want to have to maintain a kernel patch set.

Note that I expect to use madvise() and process_madvise() to manipulate page coldness and swapping as well.

> 
> - Assuming it's async reports that can be collected, what do you plan to do
>   with the info?  Do you care about swap outs prior to swap ins?

Detailed application paging measurements, modeling, and so forth.
I'm not asking for a big enhancement to userfaultfd - just expecting that it should (as is) basically work, if UFFDIO_REGISTER actually allowed registering the minor page fault mode.

> 
> - How would sync events be better in this case?

Simpler to coordinate the interaction with the faulting process by far.

> 
> Than
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-16 16:13           ` Peter Xu
  2025-09-16 17:09             ` David P. Reed
@ 2025-09-16 17:27             ` David P. Reed
  2025-09-16 18:35               ` Axel Rasmussen
  1 sibling, 1 reply; 26+ messages in thread
From: David P. Reed @ 2025-09-16 17:27 UTC (permalink / raw)
  To: Peter Xu; +Cc: James Houghton, Andrew Morton, linux-mm, Axel Rasmussen

Than -

Just to clarify - 
Looking at the man page for UFFDIO_API, there are two "feature bits" that indicate cases where "minor" handling is now supported, and can be enabled.
UFFD_FEATURE_MINOR_HUGETLBFS and UFFD_FEATURE_MINOR_SHMEM 
In my reading of the documents, these seem to imply that MAP_PRIVATE|MAP_ANONYMOUS mappings were already supported before these features were added, and that the "new" additions to the MINOR mode were just for the HUGETLBFS and MAP_SHARED cases.

It seems odd that anonymous page faults and COW would not be handled, given that context.

Anyway, none of the documentation makes that clear. This just adds to my last response where I explain my use case.


* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-16 17:27             ` David P. Reed
@ 2025-09-16 18:35               ` Axel Rasmussen
  2025-09-16 19:10                 ` James Houghton
  2025-09-16 19:52                 ` David P. Reed
  0 siblings, 2 replies; 26+ messages in thread
From: Axel Rasmussen @ 2025-09-16 18:35 UTC (permalink / raw)
  To: David P. Reed; +Cc: Peter Xu, James Houghton, Andrew Morton, linux-mm

On Tue, Sep 16, 2025 at 10:27 AM David P. Reed <dpreed@deepplum.com> wrote:

> Than -
>
> Just to clarify -
> Looking at the man page for UFFDIO_API, there are two "feature bits" that
> indicate cases where "minor" handling is now supported, and can be enabled.
> UFFD_FEATURE_MINOR_HUGETLBFS and UFFD_FEATURE_MINOR_SHMEM
> In my reading of the documents, these seem to imply that before they were
> added as new features, that MAP_PRIVATE|MAP_ANONYMOUS mappings were
> supported, and that the "new" additions to the MINOR mode were just for
> HUGETLBFS and MAP_SHARED cases.
>

Actually minor fault support didn't exist at all before those two features
were added. :)

You are right that userfaultfd's use of "minor fault" is (unfortunately)
slightly different from the meaning in other contexts. I think the more
normal meaning is, faults which do not incur I/O (i.e., swap faults and
file faults [i.e., faults on non-swap-backed pages] are major, other faults
are minor).

For userfaultfd, a minor fault is a fault where the page already exists in
the page cache, but the page table entry wasn't set up. I don't think that
scenario can ever happen for anonymous, private mappings, so it doesn't
really make sense to be able to register such mappings in this mode. If you
create a mapping with mmap(MAP_ANON|MAP_PRIVATE) and then access it (read
or write), that fault requires allocation of a new page, so userfaultfd
does not consider that a "minor fault". My recollection though is if you
make a file on tmpfs or hugetlbfs, fallocate() it or whatever, and you
MAP_PRIVATE that file, *that* registration will work.


* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-16 18:35               ` Axel Rasmussen
@ 2025-09-16 19:10                 ` James Houghton
  2025-09-16 19:47                   ` David P. Reed
  2025-09-16 22:04                   ` Axel Rasmussen
  2025-09-16 19:52                 ` David P. Reed
  1 sibling, 2 replies; 26+ messages in thread
From: James Houghton @ 2025-09-16 19:10 UTC (permalink / raw)
  To: Axel Rasmussen; +Cc: David P. Reed, Peter Xu, Andrew Morton, linux-mm

On Tue, Sep 16, 2025 at 11:35 AM Axel Rasmussen
<axelrasmussen@google.com> wrote:
>
>
>
> On Tue, Sep 16, 2025 at 10:27 AM David P. Reed <dpreed@deepplum.com> wrote:
>>
>> Than -
>>
>> Just to clarify -
>> Looking at the man page for UFFDIO_API, there are two "feature bits" that indicate cases where "minor" handling is now supported, and can be enabled.
>> UFFD_FEATURE_MINOR_HUGETLBFS and UFFD_FEATURE_MINOR_SHMEM
>> In my reading of the documents, these seem to imply that before they were added as new features, that MAP_PRIVATE|MAP_ANONYMOUS mappings were supported, and that the "new" additions to the MINOR mode were just for HUGETLBFS and MAP_SHARED cases.
>
>
> Actually minor fault support didn't exist at all before those two features were added. :)
>
> You are right that userfaultfd's use of "minor fault" is (unfortunately) slightly different from the meaning in other contexts. I think the more normal meaning is, faults which do not incur I/O (i.e., swap faults and file faults [i.e., faults on non-swap-backed pages] are major, other faults are minor).
>
> For userfaultfd, a minor fault is a fault where the page already exists in the page cache, but the page table entry wasn't setup. I don't think that scenario can ever happen for anonymous, private mappings, so it doesn't really make sense to be able to register such mappings in this mode. If you create a mapping with mmap(MAP_ANON|MAP_PRIVATE) and then access it (read or write), that fault requires allocation of a new page, so userfaultfd does not consider that a "minor fault". My recollection though is if you make a file on tmpfs or hugetlbfs, fallocate() it or whatever, and you MAP_PRIVATE that file, *that* registration will work.

Ah! You're right... MAP_PRIVATE *is* supported (for tmpfs and
hugetlbfs only), and UFFDIO_CONTINUE will, upon finding the page in
the page cache, install a RO PTE for it.

But what happens when the write comes after installing the RO PTE? My
reading of the code today makes me think that we'd get a minor
userfault and then be unable to continue...! (The only reasonable
behavior is that CoW is done without triggering a userfault... I
assumed/thought this was the behavior today. I wish I had time to test
this -- I hope I'm misreading it.)

:( Here I was thinking I understood how userfaultfd minor faults worked.



* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-16 19:10                 ` James Houghton
@ 2025-09-16 19:47                   ` David P. Reed
  2025-09-16 22:04                   ` Axel Rasmussen
  1 sibling, 0 replies; 26+ messages in thread
From: David P. Reed @ 2025-09-16 19:47 UTC (permalink / raw)
  To: James Houghton; +Cc: Axel Rasmussen, Peter Xu, Andrew Morton, linux-mm



On Tuesday, September 16, 2025 15:10, "James Houghton" <jthoughton@google.com> said:

> On Tue, Sep 16, 2025 at 11:35 AM Axel Rasmussen
> <axelrasmussen@google.com> wrote:
>>
>>
>>
>> On Tue, Sep 16, 2025 at 10:27 AM David P. Reed <dpreed@deepplum.com>
>> wrote:
>>>
>>> Than -
>>>
>>> Just to clarify -
>>> Looking at the man page for UFFDIO_API, there are two "feature bits" that
>>> indicate cases where "minor" handling is now supported, and can be enabled.
>>> UFFD_FEATURE_MINOR_HUGETLBFS and UFFD_FEATURE_MINOR_SHMEM
>>> In my reading of the documents, these seem to imply that before they were added
>>> as new features, that MAP_PRIVATE|MAP_ANONYMOUS mappings were supported, and
>>> that the "new" additions to the MINOR mode were just for HUGETLBFS and
>>> MAP_SHARED cases.
>>
>>
>> Actually minor fault support didn't exist at all before those two features were
>> added. :)
>>
>> You are right that userfaultfd's use of "minor fault" is (unfortunately) slightly
>> different from the meaning in other contexts. I think the more normal meaning is,
>> faults which do not incur I/O (i.e., swap faults and file faults [i.e., faults on
>> non-swap-backed pages] are major, other faults are minor).
>>
>> For userfaultfd, a minor fault is a fault where the page already exists in the
>> page cache, but the page table entry wasn't setup. I don't think that scenario
>> can ever happen for anonymous, private mappings, so it doesn't really make sense
>> to be able to register such mappings in this mode. If you create a mapping with
>> mmap(MAP_ANON|MAP_PRIVATE) and then access it (read or write), that fault
>> requires allocation of a new page, so userfaultfd does not consider that a "minor
>> fault". My recollection though is if you make a file on tmpfs or hugetlbfs,
>> fallocate() it or whatever, and you MAP_PRIVATE that file, *that* registration
>> will work.
> 
> Ah! You're right... MAP_PRIVATE *is* supported (for tmpfs and
> hugetlbfs only), and UFFDIO_CONTINUE will, upon finding the page in
> the page cache, install a RO PTE for it.
> 
> But what happens when the write comes after installing the RO PTE? My
> reading of the code today makes me think that we'd get a minor
> userfault and then be unable to continue...! (The only reasonable
> behavior is that CoW is done without triggering a userfault... I
> assumed/thought this was the behavior today. I wish I had time to test
> this -- I hope I'm misreading it.)
> 
> :( Here I was thinking I understood how userfaultfd minor faults worked.
> 

So did I. It's kind of confusing. I suppose `git blame` might show when the code that rejects MAP_PRIVATE|MAP_ANONYMOUS pages in the UFFDIO_REGISTER minor call was introduced... I know that MAP_SHARED|MAP_ANONYMOUS pages allow minor faults to be registered on them, and the only real difference is the passing of the mapping to forked clones (and the COW behavior resulting from sharing until written, but minor mode has nothing to do with write faults in particular; reads can cause minor faults too).






* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-16 18:35               ` Axel Rasmussen
  2025-09-16 19:10                 ` James Houghton
@ 2025-09-16 19:52                 ` David P. Reed
  2025-09-17 16:13                   ` Axel Rasmussen
  1 sibling, 1 reply; 26+ messages in thread
From: David P. Reed @ 2025-09-16 19:52 UTC (permalink / raw)
  To: Axel Rasmussen; +Cc: Peter Xu, James Houghton, Andrew Morton, linux-mm



On Tuesday, September 16, 2025 14:35, "Axel Rasmussen" <axelrasmussen@google.com> said:

> On Tue, Sep 16, 2025 at 10:27 AM David P. Reed <dpreed@deepplum.com> wrote:
> 
>> Than -
>>
>> Just to clarify -
>> Looking at the man page for UFFDIO_API, there are two "feature bits" that
>> indicate cases where "minor" handling is now supported, and can be enabled.
>> UFFD_FEATURE_MINOR_HUGETLBFS and UFFD_FEATURE_MINOR_SHMEM
>> In my reading of the documents, these seem to imply that before they were
>> added as new features, that MAP_PRIVATE|MAP_ANONYMOUS mappings were
>> supported, and that the "new" additions to the MINOR mode were just for
>> HUGETLBFS and MAP_SHARED cases.
>>
> 
> Actually minor fault support didn't exist at all before those two features
> were added. :)

Thanks for commenting. I'm not sure that's exactly true. Why is SHMEM (MAP_SHARED) supported, but not ordinary pages? I wasn't party to the evolution here, but so far no one has explained why there's a special difference between SHMEM and ordinary VMAs.


* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-16 19:10                 ` James Houghton
  2025-09-16 19:47                   ` David P. Reed
@ 2025-09-16 22:04                   ` Axel Rasmussen
  2025-09-26 22:00                     ` Peter Xu
  1 sibling, 1 reply; 26+ messages in thread
From: Axel Rasmussen @ 2025-09-16 22:04 UTC (permalink / raw)
  To: James Houghton; +Cc: David P. Reed, Peter Xu, Andrew Morton, linux-mm

On Tue, Sep 16, 2025 at 12:11 PM James Houghton <jthoughton@google.com>
wrote:

> On Tue, Sep 16, 2025 at 11:35 AM Axel Rasmussen
> <axelrasmussen@google.com> wrote:
> >
> >
> >
> > On Tue, Sep 16, 2025 at 10:27 AM David P. Reed <dpreed@deepplum.com>
> wrote:
> >>
> >> Than -
> >>
> >> Just to clarify -
> >> Looking at the man page for UFFDIO_API, there are two "feature bits"
> that indicate cases where "minor" handling is now supported, and can be
> enabled.
> >> UFFD_FEATURE_MINOR_HUGETLBFS and UFFD_FEATURE_MINOR_SHMEM
> >> In my reading of the documents, these seem to imply that before they
> were added as new features, that MAP_PRIVATE|MAP_ANONYMOUS mappings were
> supported, and that the "new" additions to the MINOR mode were just for
> HUGETLBFS and MAP_SHARED cases.
> >
> >
> > Actually minor fault support didn't exist at all before those two
> features were added. :)
> >
> > You are right that userfaultfd's use of "minor fault" is (unfortunately)
> slightly different from the meaning in other contexts. I think the more
> normal meaning is, faults which do not incur I/O (i.e., swap faults and
> file faults [i.e., faults on non-swap-backed pages] are major, other faults
> are minor).
> >
> > For userfaultfd, a minor fault is a fault where the page already exists
> in the page cache, but the page table entry wasn't setup. I don't think
> that scenario can ever happen for anonymous, private mappings, so it
> doesn't really make sense to be able to register such mappings in this
> mode. If you create a mapping with mmap(MAP_ANON|MAP_PRIVATE) and then
> access it (read or write), that fault requires allocation of a new page, so
> userfaultfd does not consider that a "minor fault". My recollection though
> is if you make a file on tmpfs or hugetlbfs, fallocate() it or whatever,
> and you MAP_PRIVATE that file, *that* registration will work.
>
> Ah! You're right... MAP_PRIVATE *is* supported (for tmpfs and
> hugetlbfs only), and UFFDIO_CONTINUE will, upon finding the page in
> the page cache, install a RO PTE for it.
>

Why does it have to be RO? I think it depends on the PROT_ flag you
specified when you created the private mapping.


>
> But what happens when the write comes after installing the RO PTE? My
> reading of the code today makes me think that we'd get a minor
> userfault and then be unable to continue...! (The only reasonable
> behavior is that CoW is done without triggering a userfault... I
> assumed/thought this was the behavior today. I wish I had time to test
> this -- I hope I'm misreading it.)
>

It's possible my memory is wrong, but I don't think UFFD minor fault
handling really interacts with CoW faults. IOW, I think you get a UFFD
minor fault when the PTE is missing, not when it's RO resulting in CoW. I
think there we just CoW the page as per normal and no fault is reported via
UFFD?


>
> :( Here I was thinking I understood how userfaultfd minor faults worked.
>


* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-16 19:52                 ` David P. Reed
@ 2025-09-17 16:13                   ` Axel Rasmussen
  2025-09-19 18:29                     ` David P. Reed
  0 siblings, 1 reply; 26+ messages in thread
From: Axel Rasmussen @ 2025-09-17 16:13 UTC (permalink / raw)
  To: David P. Reed; +Cc: Peter Xu, James Houghton, Andrew Morton, linux-mm

On Tue, Sep 16, 2025 at 12:52 PM David P. Reed <dpreed@deepplum.com> wrote:
>
>
>
> On Tuesday, September 16, 2025 14:35, "Axel Rasmussen" <axelrasmussen@google.com> said:
>
> > On Tue, Sep 16, 2025 at 10:27 AM David P. Reed <dpreed@deepplum.com> wrote:
> >
> >> Than -
> >>
> >> Just to clarify -
> >> Looking at the man page for UFFDIO_API, there are two "feature bits" that
> >> indicate cases where "minor" handling is now supported, and can be enabled.
> >> UFFD_FEATURE_MINOR_HUGETLBFS and UFFD_FEATURE_MINOR_SHMEM
> >> In my reading of the documents, these seem to imply that before they were
> >> added as new features, that MAP_PRIVATE|MAP_ANONYMOUS mappings were
> >> supported, and that the "new" additions to the MINOR mode were just for
> >> HUGETLBFS and MAP_SHARED cases.
> >>
> >
> > Actually minor fault support didn't exist at all before those two features
> > were added. :)
>
> Thanks for commenting. I'm not sure that's exactly true. Why is SHMEM (MAP_SHARED) supported, but not ordinary pages? I wasn't party to the evolution here, but so far no one has explained why there's a special difference between SHMEM and ordinary VMAs.

I promise it's true, I wrote the UFFD minor fault handling feature. :)

As for why... Like I said above, UFFD calls it a "minor" fault if the
PTE doesn't exist, but the page already exists in the page cache. If
the PTE does exist, you won't get either a minor *or* a missing fault.
If the page does not already exist in the page cache, you'll get a
missing fault, not a minor fault.
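
The rule in the paragraph above can be written down as a tiny model. This is only a sketch of the decision as described in this message, not kernel code: the enum and function names are made up for illustration, and it assumes the range is registered for both missing and minor modes and ignores write-protect mode entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; none of these exist in the kernel. */
enum uffd_model_event {
    MODEL_NO_EVENT, /* PTE present: neither a missing nor a minor userfault */
    MODEL_MINOR,    /* PTE absent, page already in the page cache */
    MODEL_MISSING   /* PTE absent, no page in the page cache */
};

/* Classify a fault on a range registered for missing and minor modes. */
static enum uffd_model_event classify_fault(bool pte_present,
                                            bool page_in_page_cache)
{
    if (pte_present)
        return MODEL_NO_EVENT;
    return page_in_page_cache ? MODEL_MINOR : MODEL_MISSING;
}

/*
 * For MAP_PRIVATE|MAP_ANONYMOUS memory there is no page cache entry to
 * find, so the second input is always false and MODEL_MINOR is unreachable.
 */
static enum uffd_model_event classify_anon_private_fault(bool pte_present)
{
    return classify_fault(pte_present, false);
}

/* Walk through the three cases described in the text. */
static void model_self_check(void)
{
    assert(classify_fault(true, true) == MODEL_NO_EVENT);
    assert(classify_fault(false, true) == MODEL_MINOR);
    assert(classify_fault(false, false) == MODEL_MISSING);
    assert(classify_anon_private_fault(false) == MODEL_MISSING);
}
```

Seen this way, the minor outcome needs a page that exists independently of the page table entry, which is what the page cache provides for shmem and hugetlbfs.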

So "ordinary" VMAs are not supported because I don't think there is
any way to create that condition with them? If you just
mmap(MAP_ANON|MAP_PRIVATE), those pages will never be in the page
cache, right? How would you go about doing so? You don't have an fd,
you can't fallocate it. If you specified MAP_POPULATE, the PTEs would
also be installed, so you just wouldn't get userfaults at all. If you
create the mapping, then fork, then write to it in the child, I think
the pages just get CoWed, I don't think userfaults are generated for
that, because the PTE was already there (albeit, with RO permissions).

I guess maybe a way to make progress here is, can you list out what
sequence of steps you believe should result in a UFFD minor fault?
Like (for example):

fd = memfd_create()
fallocate(fd, 0, 0, size)
mmap(fd, MAP_PRIVATE)
/* register mapping for UFFD minor faults */
/* read or write to mapping */

Now we get a minor fault.
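
The steps listed above can be turned into a small registration probe. This is a minimal sketch, assuming a kernel and headers recent enough to provide UFFD_FEATURE_MINOR_SHMEM, UFFDIO_REGISTER_MODE_MINOR and UFFD_USER_MODE_ONLY; the function name and its return convention are made up for illustration. It stops at UFFDIO_REGISTER and does not trigger or resolve a fault, so it only reports whether the kernel accepts the registration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Try UFFDIO_REGISTER in minor mode on a one-page MAP_PRIVATE mapping.
 * use_memfd != 0: map a memfd (shmem) file; use_memfd == 0: anonymous memory.
 * Returns 1 if registration was accepted, 0 if the kernel rejected it, and
 * -1 if the probe could not be set up (no userfaultfd access, no minor
 * fault support, or an allocation failure).
 */
static int probe_minor_register(int use_memfd)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;
    int fd = -1, uffd = -1, result = -1;
    void *addr = MAP_FAILED;
    struct uffdio_api api = { .api = UFFD_API,
                              .features = UFFD_FEATURE_MINOR_SHMEM };
    struct uffdio_register reg = { 0 };

    if (use_memfd) {
        fd = memfd_create("uffd-minor-probe", MFD_CLOEXEC);
        /* fallocate() allocates the page in the page cache up front. */
        if (fd < 0 || fallocate(fd, 0, 0, (off_t)len) != 0)
            goto out;
        addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    } else {
        addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    if (addr == MAP_FAILED)
        goto out;

    /* UFFD_USER_MODE_ONLY allows creation without extra privileges. */
    uffd = (int)syscall(SYS_userfaultfd,
                        O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
    if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) != 0)
        goto out;

    reg.range.start = (unsigned long)addr;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_MINOR;
    result = (ioctl(uffd, UFFDIO_REGISTER, &reg) == 0) ? 1 : 0;

out:
    if (uffd >= 0)
        close(uffd);
    if (addr != MAP_FAILED)
        munmap(addr, len);
    if (fd >= 0)
        close(fd);
    return result;
}
```

Calling probe_minor_register(1) exercises the memfd case from the list, and probe_minor_register(0) exercises the MAP_PRIVATE|MAP_ANONYMOUS case from the original report, which the check in vma_can_userfault() rejects.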




* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-17 16:13                   ` Axel Rasmussen
@ 2025-09-19 18:29                     ` David P. Reed
  2025-09-25 19:20                       ` Axel Rasmussen
  0 siblings, 1 reply; 26+ messages in thread
From: David P. Reed @ 2025-09-19 18:29 UTC (permalink / raw)
  To: Axel Rasmussen; +Cc: Peter Xu, James Houghton, Andrew Morton, linux-mm



On Wednesday, September 17, 2025 12:13, "Axel Rasmussen" <axelrasmussen@google.com> said:

> On Tue, Sep 16, 2025 at 12:52 PM David P. Reed <dpreed@deepplum.com> wrote:
>>
>>
>>
>> On Tuesday, September 16, 2025 14:35, "Axel Rasmussen" <axelrasmussen@google.com>
>> said:
>>
>> > On Tue, Sep 16, 2025 at 10:27 AM David P. Reed <dpreed@deepplum.com>
>> wrote:
>> >
>> >> Than -
>> >>
>> >> Just to clarify -
>> >> Looking at the man page for UFFDIO_API, there are two "feature bits" that
>> >> indicate cases where "minor" handling is now supported, and can be enabled.
>> >> UFFD_FEATURE_MINOR_HUGETLBFS and UFFD_FEATURE_MINOR_SHMEM
>> >> In my reading of the documents, these seem to imply that before they were
>> >> added as new features, that MAP_PRIVATE|MAP_ANONYMOUS mappings were
>> >> supported, and that the "new" additions to the MINOR mode were just for
>> >> HUGETLBFS and MAP_SHARED cases.
>> >>
>> >
>> > Actually minor fault support didn't exist at all before those two features
>> > were added. :)
>>
>> Thanks for commenting. I'm not sure that's exactly true. Why is SHMEM
>> (MAP_SHARED) supported, but not ordinary pages? I wasn't party to the evolution
>> here, but so far no one has explained why there's a special difference between
>> SHMEM and ordinary VMAs.
> 
> I promise it's true, I wrote the UFFD minor fault handling feature. :)
OK, but I am still confused as to why SHMEM VMAs are supported and non-SHMEM are not, in the case of an anonymous mapped range.

> 
> As for why... Like I said above, UFFD calls it a "minor" fault if the
> PTE doesn't exist, but the page already exists in the page cache. If
> the PTE does exist, you won't get either a minor *or* a missing fault.
> If the page does not already exist in the page cache, you'll get a
> missing fault, not a minor fault.
I'm assuming that you understand there is a profound difference between the "page cache" and the "swap cache" in Linux. I am referring to what happens when a page is in the swap cache (which is primarily about anonymous pages, but a weird corner case is that "tmpfs" is backed by the swap cache and the swap system, not by the page cache).

The "historical reasons" for the swap cache not being the page cache are weirdly difficult to decode. I've spent a chunk of months doing historical research on how this came about and, more importantly, why. No luck on the why. (If I were to guess, the main reason is that the folks who built it wanted to avoid using "inodes", which are required by the whole page cache mechanism, perhaps because they thought inodes were "expensive".)

Anyway, I'm now understanding that UFFD's chosen a variant meaning of "minor page fault" that seems tied to pages that are file backed or SHMEM.

A "swapped" page is anonymous by definition of what "swap" means in Linux. In Unix and other systems, swapping was a generic term that included file-backed paging as well as non-file-backed pages.

Anyway, I'm quite puzzled why I can't seem to monitor MAP_PRIVATE|MAP_ANONYMOUS page faults with userfaultfd.  The reason I focus on CoW is that CoW and fork() behavior is basically the only user visible difference between MAP_PRIVATE and MAP_SHARED. And if you read random examples of how to use mmap(), quite often MAP_PRIVATE is suggested as if it were the "normal" usage (despite what happens on fork()).


* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-19 18:29                     ` David P. Reed
@ 2025-09-25 19:20                       ` Axel Rasmussen
  2025-09-27 18:45                         ` David P. Reed
  0 siblings, 1 reply; 26+ messages in thread
From: Axel Rasmussen @ 2025-09-25 19:20 UTC (permalink / raw)
  To: David P. Reed; +Cc: Peter Xu, James Houghton, Andrew Morton, linux-mm

On Fri, Sep 19, 2025 at 11:29 AM David P. Reed <dpreed@deepplum.com> wrote:
>
>
>
> On Wednesday, September 17, 2025 12:13, "Axel Rasmussen" <axelrasmussen@google.com> said:
>
> > On Tue, Sep 16, 2025 at 12:52 PM David P. Reed <dpreed@deepplum.com> wrote:
> >>
> >>
> >>
> >> On Tuesday, September 16, 2025 14:35, "Axel Rasmussen" <axelrasmussen@google.com>
> >> said:
> >>
> >> > On Tue, Sep 16, 2025 at 10:27 AM David P. Reed <dpreed@deepplum.com>
> >> wrote:
> >> >
> >> >> Than -
> >> >>
> >> >> Just to clarify -
> >> >> Looking at the man page for UFFDIO_API, there are two "feature bits" that
> >> >> indicate cases where "minor" handling is now supported, and can be enabled.
> >> >> UFFD_FEATURE_MINOR_HUGETLBFS and UFFD_FEATURE_MINOR_SHMEM
> >> >> In my reading of the documents, these seem to imply that before they were
> >> >> added as new features, that MAP_PRIVATE|MAP_ANONYMOUS mappings were
> >> >> supported, and that the "new" additions to the MINOR mode were just for
> >> >> HUGETLBFS and MAP_SHARED cases.
> >> >>
> >> >
> >> > Actually minor fault support didn't exist at all before those two features
> >> > were added. :)
> >>
> >> Thanks for commenting. I'm not sure that's exactly true. Why is SHMEM
> >> (MAP_SHARED) supported, but not ordinary pages? I wasn't party to the evolution
> >> here, but so far no one has explained why there's a special difference between
> >> SHMEM and ordinary VMAs.
> >
> > I promise it's true, I wrote the UFFD minor fault handling feature. :)
> OK, but I am still confused as to why SHMEM VMAs are supported and non-SHMEM are not, in the case of an anonymous mapped range.
>
> >
> > As for why... Like I said above, UFFD calls it a "minor" fault if the
> > PTE doesn't exist, but the page already exists in the page cache. If
> > the PTE does exist, you won't get either a minor *or* a missing fault.
> > If the page does not already exist in the page cache, you'll get a
> > missing fault, not a minor fault.
> I'm assuming that you understand there is a profound difference between the "page cache" and the "swap cache" in Linux. I am referring to what happens when a page is in the swap cache (which is primarily about anonymous pages, but a weird corner case is that "tmpfs" is backed by the swap cache and the swap system, not by the page cache).
>
> The "historical reasons" for the swap cache not being the page cache are weirdly difficult to decode. I've spent a chunk of months doing historical research on how this came about and, more importantly, why. No luck on the why. (If I were to guess, the main reason is that the folks who built it wanted to avoid using "inodes", which are required by the whole page cache mechanism, perhaps because they thought inodes were "expensive".)
>
> Anyway, I'm now understanding that UFFD's chosen a variant meaning of "minor page fault" that seems tied to pages that are file backed or SHMEM.
>
> A "swapped" page is anonymous by definition of what "swap" means in Linux. In Unix and other systems, swapping was a generic term that included file-backed paging as well as non-file-backed pages.
>
> Anyway, I'm quite puzzled why I can't seem to monitor MAP_PRIVATE|MAP_ANONYMOUS page faults with userfaultfd.  The reason I focus on CoW is that CoW and fork() behavior is basically the only user visible difference between MAP_PRIVATE and MAP_SHARED. And if you read random examples of how to use mmap(), quite often MAP_PRIVATE is suggested as if it were the "normal" usage (despite what happens on fork()).

You can monitor MAP_PRIVATE|MAP_ANONYMOUS faults with userfaultfd;
it's just that they're missing faults, not minor faults, in userfaultfd
terminology, because resolving them requires a new page to be
allocated (UFFDIO_COPY, not UFFDIO_CONTINUE). The only exception I can
think of is swap faults. I could see anon swap faults (perhaps
specifically when the page is in the swap cache?) being considered
UFFD minor faults, but I would be curious to know what the use case is
for that / why you would want to do that. The original use case for
UFFD minor fault support was demand paging for VMs, where you have
some kind of shared memory (shmem or hugetlb), one side of the
mapping is given to the VM, and the other side of the shared mapping
is used by the hypervisor to populate guest memory on demand in
response to userfaultfd events.

As I see it, userfaultfd minor events are not intended to be generated
for write-protect faults; that's the domain of userfaultfd-wp, not
minor faults. James might be right that these unintentionally trigger
minor faults today, but I would need to do some more reading of the code
to be certain.


* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-16 22:04                   ` Axel Rasmussen
@ 2025-09-26 22:00                     ` Peter Xu
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Xu @ 2025-09-26 22:00 UTC (permalink / raw)
  To: Axel Rasmussen; +Cc: James Houghton, David P. Reed, Andrew Morton, linux-mm

On Tue, Sep 16, 2025 at 03:04:46PM -0700, Axel Rasmussen wrote:
> On Tue, Sep 16, 2025 at 12:11 PM James Houghton <jthoughton@google.com>
> wrote:
> 
> > On Tue, Sep 16, 2025 at 11:35 AM Axel Rasmussen
> > <axelrasmussen@google.com> wrote:
> > >
> > >
> > >
> > > On Tue, Sep 16, 2025 at 10:27 AM David P. Reed <dpreed@deepplum.com>
> > wrote:
> > >>
> > >> Than -
> > >>
> > >> Just to clarify -
> > >> Looking at the man page for UFFDIO_API, there are two "feature bits"
> > that indicate cases where "minor" handling is now supported, and can be
> > enabled.
> > >> UFFD_FEATURE_MINOR_HUGETLBFS and UFFD_FEATURE_MINOR_SHMEM
> > >> In my reading of the documents, these seem to imply that before they
> > were added as new features, that MAP_PRIVATE|MAP_ANONYMOUS mappings were
> > supported, and that the "new" additions to the MINOR mode were just for
> > HUGETLBFS and MAP_SHARED cases.
> > >
> > >
> > > Actually minor fault support didn't exist at all before those two
> > features were added. :)
> > >
> > > You are right that userfaultfd's use of "minor fault" is (unfortunately)
> > slightly different from the meaning in other contexts. I think the more
> > normal meaning is, faults which do not incur I/O (i.e., swap faults and
> > file faults [i.e., faults on non-swap-backed pages] are major, other faults
> > are minor).
> > >
> > > For userfaultfd, a minor fault is a fault where the page already exists
> > in the page cache, but the page table entry wasn't setup. I don't think
> > that scenario can ever happen for anonymous, private mappings, so it
> > doesn't really make sense to be able to register such mappings in this
> > mode. If you create a mapping with mmap(MAP_ANON|MAP_PRIVATE) and then
> > access it (read or write), that fault requires allocation of a new page, so
> > userfaultfd does not consider that a "minor fault". My recollection though
> > is if you make a file on tmpfs or hugetlbfs, fallocate() it or whatever,
> > and you MAP_PRIVATE that file, *that* registration will work.
> >
> > Ah! You're right... MAP_PRIVATE *is* supported (for tmpfs and
> > hugetlbfs only), and UFFDIO_CONTINUE will, upon finding the page in
> > the page cache, install a RO PTE for it.
> >
> 
> Why does it have to be RO? I think it depends on the PROT_ flag you
> specified when you created the private mapping.

It needs to be RO because we're installing a page cache page into a
PRIVATE mapping: we don't want the private mapper to update the page
cache, we want the 1st write to CoW there.  I believe you wrote the
code. :)

Relevant lines in mfill_atomic_install_pte():

	if (page_in_cache && !vm_shared)
		writable = false;

> 
> 
> >
> > But what happens when the write comes after installing the RO PTE? My
> > reading of the code today makes me think that we'd get a minor
> > userfault and then be unable to continue...! (The only reasonable
> > behavior is that CoW is done without triggering a userfault... I
> > assumed/thought this was the behavior today. I wish I had time to test
> > this -- I hope I'm misreading it.)
> >
> 
> It's possible my memory is wrong, but I don't think UFFD minor fault
> handling really interacts with CoW faults. IOW, I think you get a UFFD
> minor fault when the PTE is missing, not when it's RO resulting in CoW. I
> think there we just CoW the page as per normal and no fault is reported via
> UFFD?

Yes.  Even though I don't think PRIVATE was a goal for minor faults, IIUC
we support it: a CoW should follow if the minor fault was triggered by a
write.  If it was a read, then the RO entry should just work.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-16 17:09             ` David P. Reed
@ 2025-09-26 22:16               ` Peter Xu
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Xu @ 2025-09-26 22:16 UTC (permalink / raw)
  To: David P. Reed
  Cc: James Houghton, Andrew Morton, linux-mm, Axel Rasmussen,
	Mike Rapoport, Andrea Arcangeli

Hello, David,

On Tue, Sep 16, 2025 at 01:09:43PM -0400, David P. Reed wrote:
> Than -
> 
> Thanks for your interest. Some clarifications on my current use case are interposed below. 

Sorry for the late response.

I think I get a much better picture now with the answers below, thanks.
Looping in Mike and Andrea.

> 
> On Tuesday, September 16, 2025 12:13, "Peter Xu" <peterx@redhat.com> said:
> 
> > On Tue, Sep 16, 2025 at 11:52:18AM -0400, David P. Reed wrote:
> >> synchronous would be better. But what I want to do is at least get
> >> notifications of swapin events (including the case when the page is in
> >> swap cache). Also, using UFFDIO_COPY can be useful for the swap in case
> >> might make sense (but rarely, because there's no way to access the data
> >> that was swapped out).
> > 
> > Some more info on the use case might be helpful.  I can start with some
> > more questions if that helps.
> > 
> > - If it's about page hotness / coldness, have you tried existing facilities
> >   (page idle, DAMON, etc.)?  If so, why they won't work?
> 
> Yes. Those functions are just summarizers, giving counts and
> averages. They provide zero detail about specific pages to the
> application running in the process.
> 
> I can clarify my use case focus by pointing out what inspired me to use
> userfaultfd for application specific memory management (which is, after
> all, what userfaultfd was promoted for on Linux Weekly News a while back
> when it first came out). This 2024 paper is along the same lines as what
> I'm researching, and was published in 2024 Usenix proceedings.  ExtMem:
> Enabling Application Aware Virtual Memory Management for Data Intensive
> Applications
> https://www.usenix.org/conference/atc24/presentation/jalalian
>
> See figure 3 of the paper for their performance problem with userfaultfd
> vs. their kernel modifications (upcall).

Correct, userfaultfd does have such IPC overhead. SIGBUS sometimes can be
better, but AFAIU it has limitations.  E.g., I am not sure if the signals
will always work when the fault is triggered in either:

  (1) a kernel context using copy_from_user() / copy_to_user() or GUP

  (2) when a fault is scheduled somehow onto, for example, a kworker

AFAIU, (1) can really easily happen if one tries to do syscall read(),
write()..., where the buffer is userfaultfd protected.

Meanwhile, (2) normally can't happen but it can still happen in at least
the KVM use case that we heavily rely on, where KVM has a feature to be
able to offload a vCPU page fault to a kworker (we call it KVM async page
fault).

IIUC, the same limitations will also apply to the upcall solution they
provided.  From what I read in the paper, it is essentially a mimic of
signal handling, but made per-thread, under the same task context.

>
> They had tried using userfaultfd for their work, and found it was "too
> slow" compared to what they call the "upcall" technique they achieved by
> modifying the kernel page fault handling path. (see paper for details -
> and Jalalian's thesis dives into it more deeply).  I could code up an
> equivalent to their "upcall" - but that would mean completely
> non-standard (and fraught with security issues, as well as not being able
> to use a separate management process).  For me, the performance concern
> is less problematic - I'm doing application analysis and
> experimentation. And I don't want to have to maintain a kernel patch set.
>
> Note that I expect to use madvise() and process_madvise() to manipulate
> page coldness and swapping as well.

Yes, looks like the right tools.

> 
> > 
> > - Assuming it's async reports that can be collected, what do you plan to do
> >   with the info?  Do you care about swap outs prior to swap ins?
>
> Detailed application paging measurements, modeling, and so forth.  I'm
> not asking for a big enhancement to userfaultfd - just expecting it
> should (as is) basically work, if the UFFDIO_REGISTER actually allowed to
> register the minor page fault mode,
>
> > 
> > - How sync events would be better in this case?
> 
> Simpler to coordinate the interaction with the faulting process by far.

I think I get the gist of how you would like to use it.  However, my
question is: if you want to fine-tune "which layer of memory should hold
what data", IOW replace the Linux mm swap system with a likely better one
that is more suitable for your workload, then why do you have swap in/out
at all?  Why do you care about that?

I was expecting your PoC (or ExtMEM) to completely bypass Linux swap, then
you can freely move memory pages between system RAM, NVMe, RDMA, etc. like
what ExtMEM paper mentioned.

So far I don't see a blocker.  If one considers the swap cache to be one
kind of page cache, then it could make sense to add MINOR fault trapping
to anonymous memory, but only trap it on swapin.  That seemed OK when I
initially read about it, and it doesn't sound too hard to implement
either.  It's just that I still want to double check with your use case
first, because it really sounds like you should have turned swap off.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-25 19:20                       ` Axel Rasmussen
@ 2025-09-27 18:45                         ` David P. Reed
  2025-09-29  5:30                           ` James Houghton
  0 siblings, 1 reply; 26+ messages in thread
From: David P. Reed @ 2025-09-27 18:45 UTC (permalink / raw)
  To: Axel Rasmussen; +Cc: Peter Xu, James Houghton, Andrew Morton, linux-mm

OK - responses below.
I'm still unclear what my role is vs. that of the others cc'ed on this problem report.

Is anyone here (other than Andrew) a decision maker on what userfaultfd is supposed to do? I can see what the current code DOES - and honestly, it's seriously whacked semantically. (see the ExtMem paper for a reasonable use case that it cannot serve, my use case is quite similar). So is anyone here wanting to improve the functionality? I'm sure its current functions are used by some folks here - Google employees presumably focused on ChromeOS or Android, I suppose, suggest that there's a use case there.

My role started out by reporting that the documentation is incomplete and confusing, both in the man pages and the "kernel documentation". And the rationale presented in the documentation doesn't make sense. Some of you guys admit that you really don't understand how "swap" is different from "file-backed paging" (except for the corner cases of hugetlbfs [sort of "file backed"], "file-backed by /dev/zero" [which ends up using "swap"], and tmpfs [also "file backed" but using "swap"]). And yet "anonymous, private" uses "swap" and the "swap cache", not the "page cache".

Now, after digging into the question, I feel like there was never, ever a coherent architectural design for userfaultfd as a function. It's apparently just a "hack", not a "feature".

I'd be happy to propose a much more coherent design (in my opinion as an operating systems designer for the past more than 20 years, starting with Multics in 1970). You guys may not be interested in my input, which is fair. Is Linus interested? That would be a bunch of work for me, because I would do a thorough job, not just a bunch of random patches. But I'm not proposing to join the maintainer-club - I'm retired from that space, and I find the Linux kernel contributors poorly organized and chaotic.

Or, I can just drop this interaction - concluding that userfaultfd is kind of useless as is, and really badly documented to boot.

On Thursday, September 25, 2025 15:20, "Axel Rasmussen" <axelrasmussen@google.com> said:

> On Fri, Sep 19, 2025 at 11:29 AM David P. Reed <dpreed@deepplum.com>
> wrote:
> >
> >
> >
> > On Wednesday, September 17, 2025 12:13, "Axel Rasmussen"
> <axelrasmussen@google.com> said:
> >
> > > On Tue, Sep 16, 2025 at 12:52 PM David P. Reed
> <dpreed@deepplum.com> wrote:
> > >>
> > >>
> > >>
> > >> On Tuesday, September 16, 2025 14:35, "Axel Rasmussen"
> <axelrasmussen@google.com>
> > >> said:
> > >>
> > >> > On Tue, Sep 16, 2025 at 10:27 AM David P. Reed
> <dpreed@deepplum.com>
> > >> wrote:
> > >> >
> > >> >> Than -
> > >> >>
> > >> >> Just to clarify -
> > >> >> Looking at the man page for UFFDIO_API, there are two
> "feature bits" that
> > >> >> indicate cases where "minor" handling is now supported, and
> can be enabled.
> > >> >> UFFD_FEATURE_MINOR_HUGETLBFS and UFFD_FEATURE_MINOR_SHMEM
> > >> >> In my reading of the documents, these seem to imply that
> before they were
> > >> >> added as new features, that MAP_PRIVATE|MAP_ANONYMOUS
> mappings were
> > >> >> supported, and that the "new" additions to the MINOR mode
> were just for
> > >> >> HUGETLBFS and MAP_SHARED cases.
> > >> >>
> > >> >
> > >> > Actually minor fault support didn't exist at all before those
> two features
> > >> > were added. :)
> > >>
> > >> Thanks for commenting. I'm not sure that's exactly true. Why is
> SHMEM
> > >> (MAP_SHARED) supported, but not ordinary pages? I wasn't party to
> the evolution
> > >> here, but so far no one has explained why there's a special
> difference between
> > >> SHMEM and ordinary VMAs.
> > >
> > > I promise it's true, I wrote the UFFD minor fault handling feature. :)
> > OK, but I am still confused as to why SHMEM VMAs are supported and non-SHMEM are
> not, in the case of an anonymous mapped range.
> >
> > >
> > > As for why... Like I said above, UFFD calls it a "minor" fault if the
> > > PTE doesn't exist, but the page already exists in the page cache. If
> > > the PTE does exist, you won't get either a minor *or* a missing fault.
> > > If the page does not already exist in the page cache, you'll get a
> > > missing fault, not a minor fault.
> > I'm assuming that you understand there is a profound difference between the
> "page cache" and the "swap cache" in Linux. I am referring to what happens when a
> page is in the swap cache, (which is primarily about anonymous pages, but a weird
> corner case is that "tmpfs" is backed by the swap cache and the swap system, not
> by the page cache).
> >
> > The "historical reasons" for the swap cache not being the page cache are weirdly
> difficult to decode - I've spent a chunk of months trying to do historical
> research on how this came about, but more importantly, why. No luck on the why.
> (And the main reason seems to be that, if I were to guess, that the folks who
> built it wanted to avoid using "inodes", which are required by the whole page
> cache mechanism, perhaps because they thought inodes were "expensive").
> >
> > Anyway, I'm now understanding that UFFD's chosen a variant meaning of "minor
> page fault" that seems tied to pages that are file backed or SHMEM.
> >
> > A "swapped" page is anonymous by definition of what "swap" means in Linux. In
> Unix and other systems, swapping was a generic term that included file-backed
> paging as well as non-file-backed pages.
> >
> > Anyway, I'm quite puzzled why I can't seem to monitor
> MAP_PRIVATE|MAP_ANONYMOUS page faults with userfaultfd. The reason I focus on CoW
> is that CoW and fork() behavior is basically the only user visible difference
> between MAP_PRIVATE and MAP_SHARED. And if you read random examples of how to use
> mmap(), quite often MAP_PRIVATE is suggested as if it were the "normal" usage
> (despite what happens on fork()).
> 
> You can monitor MAP_PRIVATE|MAP_ANONYMOUS faults with userfaultfd,
> it's just that they're missing faults, not minor in userfaultfd
> terminology, because resolving them requires a new page to be
> allocated (UFFDIO_COPY, not UFFDIO_CONTINUE).

There is no sensible way to use UFFDIO_COPY or UFFDIO_ZEROPAGE to respond to a "missing event" when "missing" means the page is swapped out (to SWAP). That's just weird, and you continue to insist on it. Where is the page that was swapped out? Well, one could look at the PTE via /proc/pid/pagemap, and you find that its "swap entry" is there as an index into a block device. (So maybe you can open the swap device using some file descriptor and mmap() it into the manager process, then UFFDIO_COPY. But what if the swap page is actually in the "swap cache"? You can't mmap any swap cache page via any userspace API - do you know a way to do that?)

Now I reported a bug in UFFDIO_REGISTER, which you keep saying is the same as UFFDIO_CONTINUE. Well, it isn't! I can register a minor handler (which allows continue) if I use MAP_ANONYMOUS|MAP_SHARED. The same "swap cache" mechanics exactly apply. The only "sharing" is potential future sharing after that process forks, in which case, the same "swap page" is shared until a Copy on Write forces the page to be unshared - it is a writeable page, just sharing the same physical block. It can be swapped out to the swap cache and the swap device, which sets the PTE to be a "swap entry" that causes a page fault.
The swap device doesn't know where the pages are mapped. You need to look at the PTEs of all the processes to find the translation to swap cache entry, and if you want to go backward from swap entry to pages, you need to use a special XArray that finds VMAs given swap entry.

But the point here I keep making is that UFFDIO_REGISTER rejects only MAP_ANONYMOUS that are MAP_PRIVATE and also not huge pages. To me that's weird.

If it is the CoW case that doesn't work (I doubt it), well, you have to read the swapped out page into memory before copying it anyway. Then you copy on write, from the page read or found in the swap cache.

Now, as you say, that may require allocating a new page, also in the swap cache. Is that a "missing" page in the weird userfaultfd terminology? If so, it can't be handled with UFFDIO_COPY, because you can't access the contents from userspace. And it's not "write protected" from the perspective of WP.

> The only exception I can
> think of is swap faults, I could see anon swap faults (perhaps
> specifically when the page is in the swap cache?) being considered
> UFFD minor faults, but I would be curious to know what the use case is
> for that / why you would want to do that. The original use case for
> UFFD minor fault support was demand paging for VMs, where you have
> some kind of shared memory (shmem or hugetlb) where one side of the
> mapping is given to the VM, and the other side of the shared mapping
> is used by the hypervisor to populate guest memory on-demand in
> response to userfaultfd events.

I think I've just answered this. userfaultfd doesn't support the "swap out" part of anonymous swapping at all. So, how could a manager get the page contents as of the instant it is put in the swap cache for writing out to the swap device? There's no "swap out" event mechanism, and no way to treat the swap device cached into the swap cache as a page source. (not to mention the zswap mechanism, which compresses some of the pages into an invisible piece of memory).

> 
> To me it's not intended userfaultfd minor events are generated for
> writeprotect faults, to me that's the domain of userfaultfd-wp, not
> minor faults. James might be right that these unintentionally trigger
> minor faults today, I would need to do some more reading of the code
> to be certain though.

I don't particularly care about writeprotect faults, but CoW probably shouldn't be considered the same as a writeprotect fault, because CoW is triggered by a write into a writeable area, ONLY in one of the mappings, whichever is written first. The process doesn't think of it as a "write" - it just is a kernel optimization of a common case where fork is followed by non-use, so the actual copy could have been done at fork time, semantically. It's a deferred read and allocation.

I hope this helps clarify my concerns.

There are several reasonable outcomes -

1. Much better documentation of what the code actually does (and why).
2. Fix the "bug" that prevents REGISTER of "minor" handler on private, anonymous mappings (obviously, you can REGISTER missing handlers as well), then document what actually happens during the life cycle of swapping of pages in detail, including MAP_PRIVATE|MAP_ANONYMOUS VMAs.
3. Do a thorough analysis of what userfaultfd really should do, if the goal is to provide the ability of a "manager process" to get to handle all cases of page fault behavior on a case-by-case basis for regions of user addressable pages.

I'd be happy to contribute to (but not manage) whichever outcome - and I have what I think is a reasonable use case. (and I'm aware that this API accidentally created a serious hacker exploit earlier in its life, by creating a way to hang one process from another. I think that's no longer so easy.)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-27 18:45                         ` David P. Reed
@ 2025-09-29  5:30                           ` James Houghton
  2025-09-29 19:44                             ` David P. Reed
  0 siblings, 1 reply; 26+ messages in thread
From: James Houghton @ 2025-09-29  5:30 UTC (permalink / raw)
  To: David P. Reed; +Cc: Axel Rasmussen, Peter Xu, Andrew Morton, linux-mm

On Sat, Sep 27, 2025 at 11:45 AM David P. Reed <dpreed@deepplum.com> wrote:
>
> OK - responses below.

I think Peter will be able to help you the most, but I want to give my
two cents anyway.

>
> I'm still unclear what my role is vs. that of the others cc'ed on this problem report.
>
> Is anyone here (other than Andrew) a decision maker on what userfaultfd is supposed to do? I can see what the current code DOES - and honestly, it's seriously whacked semantically. (see the ExtMem paper for a reasonable use case that it cannot serve, my use case is quite similar). So is anyone here wanting to improve the functionality? I'm sure its current functions are used by some folks here - Google employees presumably focused on ChromeOS or Android, I suppose, suggest that there's a use case there.

I think all of us want userfaultfd to be as useful as possible. :)
Peter, Axel, and I are quite familiar with userfaultfd's use as a tool
for enabling post-copy live migration for virtual machines.
Userfaultfd minor faults were created expressly for this purpose. Axel
wrote the userfaultfd minor fault support; I wrote the corresponding
userspace code to use it in Google Cloud.

Peter is quite a bit more familiar with userfaultfd than me (and I
think Axel, but I don't want to speak for him), so please excuse our
mistakes. (mm is complicated!)

There are a few others who care about userfaultfd who might jump in as
soon as patches get sent. I think these folks (so on top of Peter and
Andrew, people like Suren, Lorenzo, David Hildenbrand) will be the
folks who Ack or Nak the patches.

>
>
>
> My role started out by reporting that the documentation is both incomplete and confusing, both in the man pages and the "kernel documentation". And the rationale presented in the documentation doesn't make sense. Some of you guys admit that you really don't understand how "swap" is different from "file-backed paging" (except for the corner cases of hugetlbfs [sort of "file backed"], "file-backed by /dev/zero" [which ends up using "swap"], and tmpfs [also "file backed" but using "swap"]. And yet "anonymous, private" uses "swap" and the "swap cache", not the "page cache".

The documentation is confusing; I agreed with you originally that it
should be updated. (Do you want to send a patch? Perhaps I could give
it a go when I find the time.)

I spent some time writing out how I define the various terms being
used here, I'll leave it at the end of this email in case it is
helpful, but otherwise please just ignore it. I wouldn't say that the
rationale in the documentation doesn't make sense. Userfaultfd exists
to solve specific problems.

>
> Now, after digging into the question, I feel like there was never, ever a coherent architectural design for userfaultfd as a function. It's apparently just a "hack", not a "feature".

Userfaultfd certainly isn't perfect, but it is critical for things
like VM live migration, Android GC, CRIU, etc.

>
> I'd be happy to propose a much more coherent design (in my opinion as an operating systems designer for the past more than 20 years, starting with Multics in 1970 - you guys may not be interested in my input, which is fair. Is Linus interested? That would be a bunch of work for me, because I would do a thorough job, not just a bunch of random patches. But I'm not proposing to join the maintainer-club - I'm retired from that space, and I find the Linux kernel contributors poorly organized and chaotic.
>
> Or, I can just drop this interaction - concluding that userfaultfd is kind of useless as is, and really badly documented to boot.

I am interested to hear your ideas for how you think userfaultfd
should work and how it solves your problem. :) At the end of the day,
I'm just trying (though clearly failing miserably) to help you solve
your problem.

Your characterization of userfaultfd as a "useless" "bunch of random
patches" that is just a "hack" is wrong. I understand; it doesn't
support your needs. I think what Peter, Axel, and I have been trying
to understand is what exactly you're trying to do and how userfaultfd
could (or may not) help you get there. You've shared some[1]
details[2] about what you're looking for, so thank you for that, but I
am still struggling to understand how the flexibility that you're
asking for is actually the right tool for the problem(s) you're trying
to solve.

[1]: https://lore.kernel.org/linux-mm/1758037039.08578612@apps.rackspace.com/
[2]: https://lore.kernel.org/linux-mm/1758042583.108320755@apps.rackspace.com/

> There is no sensible way to use UFFDIO_COPY or UFFDIO_ZEROPAGE to respond to a "missing event" when "missing" means the page is swapped out (to SWAP). That's just weird, and you continue to insist on it. Where is the page that was swapped out? Well, one could look at the PTE via /proc/pid/pagemap, and you find that its "swap entry" is there as an index into a block device. (So maybe you can open the swap device using some file descriptor and mmap() it into the manager process, then UFFDIO_COPY. But what if the swap page is actually in the "swap cache"? You can't mmap any swap cache page via any userspace API - do you know a way to do that?)

(Please see the terms that I use at the bottom of this email; let me
reply using those terms.)

UFFDIO_COPY has quite well-defined semantics (albeit, perhaps not
*documented* well):

* For anonymous VMAs: UFFDIO_COPY will allocate page(s), copy some
user memory into the page(s) and map those pages at the specified VAs.
* For hugetlbfs and shmem/tmpfs VMAs, UFFDIO_COPY will fill holes in
the file's page cache with new pages, copy the user memory in, and map
those pages. UFFDIO_CONTINUE is additionally supported; it skips the
hole-filling step and requires the page cache to be populated.

For UFFDIO_COPY, if a page at a to-be-populated VA has already been
allocated (including if it has been reclaimed), the call will be
rejected. It would effectively be overwriting the contents of the
page; this is not supported today.

If "missing" includes swapped out pages, UFFDIO_COPY and
UFFDIO_ZEROPAGE would need to be allowed to overwrite the existing
contents. "Sensible" or not, there has been no need for this yet.
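
The rejection described above can be sketched as follows, under the
same assumption that userfaultfd is available to the caller. The helper
name copy_onto_populated_page() is hypothetical. It issues UFFDIO_COPY
twice for the same page of an anonymous VMA; the first call finds nothing
at the VA and populates it, and the second targets a VA that already has
a page:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Issue UFFDIO_COPY twice on the same page of an anonymous VMA.
 * Returns 0 if every step (including the second copy) was accepted,
 * -errno otherwise, including when userfaultfd itself is unavailable.
 */
static int copy_onto_populated_page(void)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = { .mode = UFFDIO_REGISTER_MODE_MISSING };
	struct uffdio_copy copy = { .len = page_size };
	char *dst, *src;
	int uffd, ret = 0;

	uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC);
	if (uffd < 0)
		return -errno;

	dst = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	src = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (dst == MAP_FAILED || src == MAP_FAILED) {
		ret = -ENOMEM;
		goto out;
	}
	memset(src, 'A', page_size); /* source data for the copy */

	reg.range.start = (unsigned long)dst;
	reg.range.len = page_size;
	copy.dst = (unsigned long)dst;
	copy.src = (unsigned long)src;

	if (ioctl(uffd, UFFDIO_API, &api) < 0 ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg) < 0 ||
	    /* First copy: nothing at dst yet, so a page is allocated and mapped. */
	    ioctl(uffd, UFFDIO_COPY, &copy) < 0 ||
	    /* Second copy: dst is populated now, so this one should be refused. */
	    ioctl(uffd, UFFDIO_COPY, &copy) < 0)
		ret = -errno;
out:
	if (dst != MAP_FAILED)
		munmap(dst, page_size);
	if (src != MAP_FAILED)
		munmap(src, page_size);
	close(uffd);
	return ret;
}
```

Where userfaultfd is usable I would expect the return value to be
-EEXIST, coming from the second UFFDIO_COPY; any other negative value
points at an earlier step (for example the userfaultfd() call itself).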

> Now I reported a bug in UFFDIO_REGISTER [...]

The bug you reported is in the documentation only.

> [...] which you keep saying is the same as UFFDIO_CONTINUE. Well, it isn't! I can register a minor handler (which allows continue) if I use MAP_ANONYMOUS|MAP_SHARED. The same "swap cache" mechanics exactly apply. The only "sharing" is potential future sharing after that process forks, in which case, the same "swap page" is shared until a Copy on Write forces the page to be unshared - it is a writeable page, just sharing the same physical block. It can be swapped out to the swap cache and the swap device, which sets the PTE to be a "swap entry" that causes a page fault.

(Using the terms at the bottom of this email.)

For UFFDIO_CONTINUE, the swap cache mechanics are as follows:

1. For anonymous pages in the VMA: swap-outs will not clear the PTEs,
touching the page will swap it back in again, UFFDIO_CONTINUE on it is
disallowed.
2. For page cache pages in the VMA (i.e., not-yet-written-to pages for
MAP_PRIVATE, any page for MAP_SHARED): swap-outs will clear the PTEs,
and touching the page will trigger a minor fault, and UFFDIO_CONTINUE
will swap it back in.

For MAP_ANONYMOUS|MAP_PRIVATE, all pages in the VMA will be anonymous
pages, so UFFDIO_CONTINUE will never be allowed, therefore
registration in the first place is disallowed.
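
A small probe for the registration behavior discussed here, as a sketch
only. The helper name try_register_minor() is hypothetical, and the code
assumes kernel headers new enough to define UFFDIO_REGISTER_MODE_MINOR
and UFFD_FEATURE_MINOR_SHMEM. It maps one page with the given flags and
attempts a minor-mode UFFDIO_REGISTER on it:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Map one page with `map_flags` (backed by `fd`, or -1 for anonymous
 * memory) and try to register it for userfaultfd minor faults.
 * Returns 0 if registration succeeds, -errno otherwise.
 */
static int try_register_minor(int fd, int map_flags)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_MINOR_SHMEM,
	};
	struct uffdio_register reg = { .mode = UFFDIO_REGISTER_MODE_MINOR };
	void *addr;
	int uffd, ret = 0;

	uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC);
	if (uffd < 0)
		return -errno;

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, map_flags, fd, 0);
	if (addr == MAP_FAILED) {
		ret = -errno;
		close(uffd);
		return ret;
	}

	reg.range.start = (unsigned long)addr;
	reg.range.len = len;
	/* The API handshake must precede registration on a new userfaultfd. */
	if (ioctl(uffd, UFFDIO_API, &api) < 0 ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
		ret = -errno;

	munmap(addr, len);
	close(uffd);
	return ret;
}
```

Going by this thread, try_register_minor(-1, MAP_PRIVATE | MAP_ANONYMOUS)
should fail (EINVAL where userfaultfd is usable), while
try_register_minor(-1, MAP_SHARED | MAP_ANONYMOUS) is the case David
reports as working, since that mapping is shmem-backed. Passing a sized
memfd with MAP_PRIVATE should exercise the tmpfs case mentioned earlier.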

(IMHO, it was dubious to have even allowed registering userfaultfd
minor faults with *any* MAP_PRIVATE VMA.)

> The swap device doesn't know where the pages are mapped. You need to look at the PTEs of all the processes to find the translation to swap cache entry, and if you want to go backward from swap entry to pages, you need to use a special XArray that finds VMAs given swap entry.
>
> But the point here I keep making is that UFFDIO_REGISTER rejects only MAP_ANONYMOUS that are MAP_PRIVATE and also not huge pages. To me that's weird.

I hope my above explanation (of sorts) makes it a little less weird.

> If it is the CoW case that doesn't work (I doubt it), well, you have to read the swapped out page into memory before copying it anyway. Then you copy on write, from the page read or found in the swap cache.
>
> Now, as you say, that may require allocating a new page, also in the swap cache. Is that a "missing" page in the weird userfaultfd terminology? If so, to handle it can't be done with UFFDIO_COPY, because you can't access the contents from userspace. And it's not "write protected" from the perspective of WP.

No it isn't a missing userfault. Data exists at the VA for which a
userfault would be generated, therefore it cannot be "missing".

>
>
>
> > The only exception I can
> > think of is swap faults, I could see anon swap faults (perhaps
> > specifically when the page is in the swap cache?) being considered
> > UFFD minor faults, but I would be curious to know what the use case is
> > for that / why you would want to do that. The original use case for
> > UFFD minor fault support was demand paging for VMs, where you have
> > some kind of shared memory (shmem or hugetlb) where one side of the
> > mapping is given to the VM, and the other side of the shared mapping
> > is used by the hypervisor to populate guest memory on-demand in
> > response to userfaultfd events.
>
>
>
> I think I've just answered this. userfaultfd doesn't support the "swap out" part of anonymous swapping at all. So, how could a manager get the page contents as of the instant it is put in the swap cache for writing out to the swap device? There's no "swap out" event mechanism, and no way to treat the swap device cached into the swap cache as a page source. (not to mention the zswap mechanism, which compresses some of the pages into an invisible piece of memory).
>
>
> >
> > To me it's not intended userfaultfd minor events are generated for
> > writeprotect faults, to me that's the domain of userfaultfd-wp, not
> > minor faults. James might be right that these unintentionally trigger
> > minor faults today, I would need to do some more reading of the code
> > to be certain though.
>
> I don't particularly care about writeprotect faults, but CoW probably shouldn't be considered the same as a writeprotect fault, because CoW is triggered by a write into a writeable area, ONLY in one of the mappings, whichever is written first. The process doesn't think of it as a "write" - it just is a kernel optimization of a common case where fork is followed by non-use, so the actual copy could have been done at fork time, semantically. It's a deferred read and allocation.
>
>
>
> I hope this helps clarify my concerns.
>
> There are several reasonable outcomes -
>
> 1. Much better documentation of what the code actually does (and why).

Agreed.

> 2. Fix the "bug" that prevents REGISTER of "minor" handler on private, anonymous mappings (obviously, you can REGISTER missing handlers as well), then document actually what happens during the life cycle of swapping of pages in detail, including MAP_PRIVATE|MAP_ANONYMOUS VMAs.

Not a bug.

> 3. Do a thorough analysis of what userfaultfd really should do, if the goal is to provide the ability of a "manager process" to get to handle all cases of page fault behavior on a case-by-case basis for regions of user addressable pages.

What userfaultfd "should do" is up to the problems we need it to solve.

> I'd be happy to contribute to (but not manage) whichever outcome - and I have what I think is a reasonable use case. (and I'm aware that this API accidentally created a serious hacker exploit earlier in its life, by creating a way to hang one process from another. I think that's no longer so easy.)

I would be glad to hear what changes you think should be made to
userfaultfd to better suit your needs.

Sorry if this reply is somewhat incoherent; I've gone back and forth a
few times on how to respond to your points in the most helpful way I
can. I've tried to be as clear as possible without being too verbose.

- James

--

Alrighty, here are the terms/definitions I use, as I mentioned above.
Again, feel free to ignore them if they are unhelpful:

A "file-backed VMA" will load pages into the page cache. For most
filesystems, the page is loaded from a disk (or a proper device), but
for special filesystems like tmpfs, hugetlbfs, and ramfs, the page
cache is populated with zeroed pages initially.

tmpfs is kind of like a filesystem API for shmem, but they are so
interconnected that many people use the terms interchangeably. (To
clarify, I don't think of "shmem" as shorthand for "shared memory"; to
me, it is the name of an mm subsystem.) Every MAP_ANONYMOUS|MAP_SHARED
VMA is a shmem VMA; it is as if there is a tmpfs file backing VMAs
like these, so they are in some contexts considered "file-backed". See
shmem_zero_setup(). As far as I'm concerned, vma->vm_file is set, so
the VMA is file-backed (even though the mmap flags included
MAP_ANONYMOUS). I assume this is what you are referring to when you
say "file-backed by /dev/zero".

For any MAP_PRIVATE VMA, some pages may be "anonymous", in that no
page cache is holding a reference to it (i.e., generally speaking, the
only references on the page are the ones taken by the PTEs mapping the
page). Reclaim of pages like these will put them in a swap cache.

For pages where a reference is held in a page cache, if the page is
dirty, it can be written out to disk. shmem implements "writeout" by
swapping just like anonymous pages, but other filesystems implement it
how you would expect.
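
A small sketch of how the definitions above map onto mmap() flags, assuming Linux and glibc. The two helper names are hypothetical; they only encode the statements made here: MAP_ANONYMOUS|MAP_PRIVATE is the one combination with no backing file (vma->vm_file unset), and only MAP_PRIVATE VMAs can hold anonymous pages.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for MAP_ANONYMOUS and MAP_TYPE on glibc */
#endif
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

/* True when the mapping would have a backing file in the sense used
 * above: a real file, or the internal shmem file that backs
 * MAP_ANONYMOUS|MAP_SHARED. */
bool vma_is_file_backed(int mmap_flags)
{
    bool anonymous = (mmap_flags & MAP_ANONYMOUS) != 0;
    bool is_private = (mmap_flags & MAP_TYPE) == MAP_PRIVATE;
    return !(anonymous && is_private);
}

/* True when some pages of the mapping may be anonymous, i.e. held by
 * no page cache and reclaimed through the swap cache. */
bool vma_may_hold_anon_pages(int mmap_flags)
{
    return (mmap_flags & MAP_TYPE) == MAP_PRIVATE;
}

/* Usage example: the two anonymous mapping types from this thread. */
void vma_terms_example(void)
{
    assert(!vma_is_file_backed(MAP_ANONYMOUS | MAP_PRIVATE));
    assert(vma_is_file_backed(MAP_ANONYMOUS | MAP_SHARED));
}
```

MAP_TYPE is used instead of testing the MAP_PRIVATE bit directly because MAP_SHARED_VALIDATE shares bits with MAP_PRIVATE.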

* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-29  5:30                           ` James Houghton
@ 2025-09-29 19:44                             ` David P. Reed
  2025-09-29 20:30                               ` Peter Xu
  0 siblings, 1 reply; 26+ messages in thread
From: David P. Reed @ 2025-09-29 19:44 UTC (permalink / raw)
  To: James Houghton; +Cc: Axel Rasmussen, Peter Xu, Andrew Morton, linux-mm


James -
 
This was greatly helpful, as now I can decode a bit better.

The "big picture" insight you provided, that userfaultfd is primarily (exclusively?) focused on post-copy live migration as its motivating use case, was never clear to me before you clarified it in this message. Aha!
 
That's certainly different from what I'm hoping to use it for. (Just as an aside, starting in 2012 or so, I did a lot of design and implementation work on VM-based virtual memory, both at SAP Labs Research and at a startup I co-founded called TidalScale that created an "inverse virtualization" platform that moved memory among nodes of a tightly coupled "distributed x86 virtual machine". Essentially, that was a system that was constantly executing as if it was in post-copy live migration - the pages flowed between nodes, as did the virtual cpus. HPE acquired the product, which worked very well.)
 
I'm not focused on live migration at all, so you can see why I might be confused. What really interests me here is moving "kernel functions" out of the kernel - there's been a lot of work, for example, in I/O from userspace, which I follow closely. I grew up doing OS research in the early 1970s, where for lots of reasons the "monolithic kernel" design was resisted (e.g. in the Unix sphere, Mach at CMU). I worked during my M.S. on the Multics operating system, in particular with paging, and even in my Bachelor's thesis, on dealing with multiprocessor and multiprocess paging behavior, since Multics was what we now call a "multiprocessor-centric" operating system with many CPUs sharing memory.

So what I have spent a lot of time over the years thinking about is how a system with many processes on many cpus can effectively share memory when competing for RAM and cache and "disk".
My 1973 bachelor's thesis was "Estimating Working Sets on Multics" (which was a time sharing system that supported ~100 concurrent users if provisioned with 3 processors, more if provisioned with 8-10 processors). The B.S. thesis recognized that by abandoning common shared LRU list reclaim, the OS could make more efficient use of RAM while swapping out to a "paging drum" that was super low latency at the time. So my brain is wired to know that the current Linux paging (reclaim and fault handling) isn't great. [well, it was born as a uniprocessor OS, and still is architected to privilege working well on a uniprocessor - rather than starting as Multics did, with the idea that there are lots of cores. You can see the mess in Linux with all the global locks in the mm kernel code, slowly being addressed.]
 
That's more context.

So userfaultfd is a tool I think can be used to move monitoring (which may include supervising reclaim) into userspace. It's not complete. process_madvise() may allow moving more into ring 3, but unfortunately it doesn't support MADV_PAGEOUT from MADVISE.

That may give you more context.  (I am not a believer in the idea that the Linux kernel is where you protect the system from hackers. The argument for moving function out of the kernel is that you untangle the spaghetti mess of the Linux kernel. Userspace processes can be inside the security perimeter of the system, if they are well designed and the kernel supports the right abstractions and the right protection mechanisms between processes. I am not sure that Torvalds agrees, but I am a LOT more experienced than he. It's his system, though.)
 
Comments intercalated below.
 
On Monday, September 29, 2025 01:30, "James Houghton" <jthoughton@google.com> said:



> On Sat, Sep 27, 2025 at 11:45 AM David P. Reed <dpreed@deepplum.com>
> wrote:
> >
> > OK - responses below.
> 
> I think Peter will be able to help you the most, but I want to give my
> two cents anyway.
> 
> >
> > I'm still unclear what my role is vs. the others cc'ed on this problem report
> is.
> >
> > Is anyone here (other than Andrew) a decision maker on what userfaultfd is
> supposed to do? I can see what the current code DOES - and honestly, it's
> seriously whacked semantically. (see the ExtMem paper for a reasonable use case
> that it cannot serve, my use case is quite similar). So is anyone here wanting to
> improve the functionality? I'm sure its current functions are used by some folks
> here - Google employees presumably focused on ChromeOS or Android, I suppose,
> suggest that there's a use case there.
> 
> I think all of us want userfaultfd to be as useful as possible. :)
> Peter, Axel, and I are quite familiar with userfaultfd's use as a tool
> for enabling post-copy live migration for virtual machines.
> Userfaultfd minor faults were created expressly for this purpose. Axel
> wrote the userfaultfd minor fault support; I wrote the corresponding
> userspace code to use it in Google Cloud.
 
Excellent clarification. And congratulations on making that work.

> 
> Peter is quite a bit more familiar with userfaultfd than me (and I
> think Axel, but I don't want to speak for him), so please excuse our
> mistakes. (mm is complicated!)
> 
> There are a few others who care about userfaultfd who might jump in as
> soon as patches get sent. I think these folks (so on top of Peter and
> Andrew, people like Suren, Lorenzo, David Hildenbrand) will be the
> folks who Ack or Nak the patches.
 
That's good to know.
 
> 
> >
> >
> >
> > My role started out by reporting that the documentation is both incomplete
> and confusing, both in the man pages and the "kernel documentation". And the
> rationale presented in the documentation doesn't make sense. Some of you guys
> admit that you really don't understand how "swap" is different from "file-backed
> paging" (except for the corner cases of hugetlbfs [sort of "file backed"],
> "file-backed by /dev/zero" [which ends up using "swap"], and tmpfs [also "file
> backed" but using "swap"]. And yet "anonymous, private" uses "swap" and the "swap
> cache", not the "page cache".
> 
> The documentation is confusing; I agreed with you originally that it
> should be updated. (Do you want to send a patch? Perhaps I could give
> it a go when I find the time.)
> 
> I spent some time writing out how I define the various terms being
> used here, I'll leave it at the end of this email in case it is
> helpful, but otherwise please just ignore it. I wouldn't say that the
> rationale in the documentation doesn't make sense. Userfaultfd exists
> to solve specific problems.

I thought it was a general purpose interface. My mistake. But I think it can be more general, at least encompassing my goal of having a userspace "interface" that monitors processes' page faults.

> 
> >
> > Now, after digging into the question, I feel like there was never, ever a
> coherent architectural design for userfaultfd as a function. It's apparently just
> a "hack", not a "feature".
> 
> Userfaultfd certainly isn't perfect, but it is critical for things
> like VM live migration, Android GC, CRIU, etc..
> 
> >
> > I'd be happy to propose a much more coherent design (in my opinion as an
> operating systems designer for the past more than 20 years, starting with Multics
> in 1970 - you guys may not be interested in my input, which is fair. Is Linus
> interested? That would be a bunch of work for me, because I would do a thorough
> job, not just a bunch of random patches. But I'm not proposing to join the
> maintainer-club - I'm retired from that space, and I find the Linux kernel
> contributors poorly organized and chaotic.
> >
> > Or, I can just drop this interaction - concluding that userfaultfd is kind of
> useless as is, and really badly documented to boot.
> 
> I am interested to hear your ideas for how you think userfaultfd
> should work and how it solves your problem. :) At the end of the day,
> I'm just trying (though clearly failing miserably) to help you solve
> your problem.
> 
> Your characterization of userfaultfd as a "useless" "bunch of random
> patches" that is just a "hack" is wrong. I understand; it doesn't
> support your needs. I think what Peter, Axel, and I have been trying
> to understand is what exactly you're trying to do and how userfaultfd
> could (or may not) help you get there. You've shared some[1]
> details[2] about what you're looking for, so thank you for that, but I
> am still struggling to understand how the flexibility that you're
> asking for is actually the right tool for the problem(s) you're trying
> to solve.
> 
> [1]: https://lore.kernel.org/linux-mm/1758037039.08578612@apps.rackspace.com/
> [2]: https://lore.kernel.org/linux-mm/1758042583.108320755@apps.rackspace.com/
> 
> > There is no sensible way to respond to a "missing event" when "missing" means
> the page is swapped out (to SWAP) by UFFDIO_COPY or UFFDIO_ZEROPAGE. That's just
> weird, and you continue to insist on it. Where is the page that was swapped out?
> Well, one could look at the PTE in /proc/pid/maps, and you find that its "swap
> entry" is there as an index into a block device. (so, maybe you can open the swap
> device using some file descriptor and mmap() it into the manager process, then
> UFFDIO_COPY, but what if the swap page is actually in the "swap cache", you can't
> mmap any swap cache page via any userspace API - do you know a way to do that?)
> 
> (Please see the terms that I use at the bottom of this email; let me
> reply using those terms.)
> 
> UFFDIO_COPY has quite well-defined semantics (albeit, perhaps not
> *documented* well):
> 
> * For anonymous VMAs: UFFDIO_COPY will allocate page(s), copy some
> user memory into the page(s) and map those pages at the specified VAs.
> * For hugetlbfs and shmem/tmpfs VMAs, UFFDIO_COPY will fill holes in
> the file's page cache with new pages, copy the user memory in, and map
> those pages. UFFDIO_CONTINUE is additionally supported; it skips the
> hole-filling step and requires the page cache to be populated.
> 
> For UFFDIO_COPY, if a page at a to-be-populated VA has already been
> allocated (including if it has been reclaimed), the call will be
> rejected. It would effectively be overwriting the contents of the
> page; this is not supported today.
> 
> If "missing" includes swapped out pages, UFFDIO_COPY and
> UFFDIO_ZEROPAGE would need to be allowed to overwrite the existing
> contents. "Sensible" or not, there has been no need for this yet.
> 
> > Now I reported a bug in UFFDIO_REGISTER [...]
> 
> The bug you reported is in the documentation only.
> 
> > [...] which you keep saying is the same as UFFDIO_CONTINUE. Well, it isn't! I
> can register a minor handler (which allows continue) if I use
> MAP_ANONYMOUS|MAP_SHARED. The same "swap cache" mechanics exactly apply. The only
> "sharing" is potential future sharing after that process forks, in which case, the
> same "swap page" is shared until a Copy on Write forces the page to be unshared -
> it is a writeable page, just sharing the same physical block. It can be swapped
> out to the swap cache and the swap device, which sets the PTE to be a "swap entry"
> that causes a page fault.
> 
> (Using the terms at the bottom of this email.)
> 
> For UFFDIO_CONTINUE, the swap cache mechanics are like:
> 
> 1. For anonymous pages in the VMA: swap-outs will not clear the PTEs,
> touching the page will swap it back in again, UFFDIO_CONTINUE on it is
> disallowed.
> 2. For page cache pages in the VMA (i.e., not-yet-written-to pages for
> MAP_PRIVATE, any page for MAP_SHARED): swap-outs will clear the PTEs,
> and touching the page will trigger a minor fault, and UFFDIO_CONTINUE
> will swap it back in.
> 
> For MAP_ANONYMOUS|MAP_PRIVATE, all pages in the VMA will be anonymous
> pages, so UFFDIO_CONTINUE will never be allowed, therefore
> registration in the first place is disallowed.
> 
> (IMHO, it was dubious to have even allowed registering userfaultfd
> minor faults with *any* MAP_PRIVATE VMA.)
> 
> > The swap device doesn't know where the pages are mapped. You need to look at
> the PTEs of all the processes to find the translation to swap cache entry, and if
> you want to go backward from swap entry to pages, you need to use a special XArray
> that finds VMAs given swap entry.
> >
> > But the point here I keep making is that UFFDIO_REGISTER rejects only
> MAP_ANONYMOUS that are MAP_PRIVATE and also not huge pages. To me that's weird.
> 
> I hope my above explanation (of sorts) makes it a little less weird.
> 
> > If it is the CoW case that doesn't work (I doubt it), well, you have to read
> the swapped out page into memory before copying it anyway. Then you copy on write,
> from the page read or found in the swap cache.
> >
> > Now, as you say, that may require allocating a new page, also in the swap
> cache. Is that a "missing" page in the weird userfaultfd terminology? If so, to
> handle it can't be done with UFFDIO_COPY, because you can't access the contents
> from userspace. And it's not "write protected" from the perspective of WP.
> 
> No it isn't a missing userfault. Data exists at the VA for which a
> userfault would be generated, therefore it cannot be "missing".
> 
> >
> >
> >
> > > The only exception I can
> > > think of is swap faults, I could see anon swap faults (perhaps
> > > specifically when the page is in the swap cache?) being considered
> > > UFFD minor faults, but I would be curious to know what the use case is
> > > for that / why you would want to do that. The original use case for
> > > UFFD minor fault support was demand paging for VMs, where you have
> > > some kind of shared memory (shmem or hugetlb) where one side of the
> > > mapping is given to the VM, and the other side of the shared mapping
> > > is used by the hypervisor to populate guest memory on-demand in
> > > response to userfaultfd events.
> >
> >
> >
> > I think I've just answered this. userfaultfd doesn't support the "swap out"
> part of anonymous swapping at all. So, how could a manager get the page contents
> as of the instant it is put in the swap cache for writing out to the swap device?
> There's no "swap out" event mechanism, and no way to treat the swap device cached
> into the swap cache as a page source. (not to mention the zswap mechanism, which
> compresses some of the pages into an invisible piece of memory).
> >
> >
> > >
> > > To me it's not intended userfaultfd minor events are generated for
> > > writeprotect faults, to me that's the domain of userfaultfd-wp, not
> > > minor faults. James might be right that these unintentionally trigger
> > > minor faults today, I would need to do some more reading of the code
> > > to be certain though.
> >
> > I don't particularly care about writeprotect faults, but CoW probably
> shouldn't be considered the same as a writeprotect fault, because CoW is triggered
> by a write into a writeable area, ONLY in one of the mappings, whichever is
> written first. The process doesn't think of it as a "write" - it just is a kernel
> optimization of a common case where fork is followed by non-use, so the actual
> copy could have been done at fork time, semantically. It's a deferred read and
> allocation.
> >
> >
> >
> > I hope this helps clarify my concerns.
> >
> > There are several reasonable outcomes -
> >
> > 1. Much better documentation of what the code actually does (and why).
> 
> Agreed.
> 
> > 2. Fix the "bug" that prevents REGISTER of "minor" handler on private,
> anonymous mappings (obviously, you can REGISTER missing handlers as well), then
> document actually what happens during the life cycle of swapping of pages in
> detail, including MAP_PRIVATE|MAP_ANONYMOUS VMAs.
> 
> Not a bug.
> 
> > 3. Do a thorough analysis of what userfaultfd really should do, if the goal
> is to provide the ability of a "manager process" to get to handle all cases of
> page fault behavior on a case-by-case basis for regions of user addressable
> pages.
> 
> What userfaultfd "should do" is up to the problems we need it to solve.
> 
> > I'd be happy to contribute to (but not manage) whichever outcome - and I have
> what I think is a reasonable use case. (and I'm aware that this API accidentally
> created a serious hacker exploit earlier in its life, by creating a way to hang
> one process from another. I think that's no longer so easy.)
> 
> I would be glad to hear what changes you think should be made to
> userfaultfd to better suit your needs.
> 
> Sorry if this reply is somewhat incoherent; I've gone back and forth a
> few times on how to respond to your points in the most helpful way I
> can. I've tried to be as clear as possible without being too verbose.
> 
> - James
> 
> --
> 
> Alrighty, here are the terms/definitions I use, as I mentioned above.
> Again, feel free to ignore them if they are unhelpful:
> 
> A "file-backed VMA" will load pages into the page cache. For most
> filesystems, the page is loaded from a disk (or a proper device), but
> for special filesystems like tmpfs, hugetlbfs, and ramfs, the page
> cache is populated with zeroed pages initially.
> 
> tmpfs is kind of like a filesystem API for shmem, but they are so
> interconnected that many people use the terms interchangeably. (To
> clarify, I don't think of "shmem" as shorthand for "shared memory"; to
> me, it is the name of an mm subsystem.) Every MAP_ANONYMOUS|MAP_SHARED
> VMA is a shmem VMA; it is as if there is a tmpfs file backing VMAs
> like these, so they are in some contexts considered "file-backed". See
> shmem_zero_setup(). As far as I'm concerned, vma->vm_file is set, so
> the VMA is file-backed (even though the mmap flags included
> MAP_ANONYMOUS). I assume this is what you are referring to when you
> say "file-backed by /dev/zero".
> 
> For any MAP_PRIVATE VMA, some pages may be "anonymous", in that no
> page cache is holding a reference to it (i.e., generally speaking, the
> only references on the page are the ones taken by the PTEs mapping the
> page). Reclaim of pages like these will put them in a swap cache.
> 
> For pages where a reference is held in a page cache, if the page is
> dirty, it can be written out to disk. shmem implements "writeout" by
> swapping just like anonymous pages, but other filesystems implement it
> how you would expect.
> 

* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-29 19:44                             ` David P. Reed
@ 2025-09-29 20:30                               ` Peter Xu
  2025-10-01 22:16                                 ` Axel Rasmussen
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Xu @ 2025-09-29 20:30 UTC (permalink / raw)
  To: David P. Reed; +Cc: James Houghton, Axel Rasmussen, Andrew Morton, linux-mm

On Mon, Sep 29, 2025 at 03:44:52PM -0400, David P. Reed wrote:
> I thought it was a general purpose interface. My mistake. But I think it
> can be more general, at least encompassing my goal of having a userspace
> "interface" that monitors processes' page faults.

To James: thanks for the great writeup.  Somehow, I just feel like userfaultfd
(as a linux submodule) got some sheer luck to have you around. :)

To David: just to say, I still think it's a general purpose interface, at
least that's the hope..

I agree with you at least on one point you mentioned, that shmem also can
swap, and that was accounted as minor faults when swapin happens at a
specific virtual address.  It doesn't sound fair if anon isn't doing the
same. Indeed.

It was just not on the radar when minor fault was introduced by Axel; at
that time it served the sole purpose of live migration, but the hope is
that the interface as designed should serve a generic purpose.

Now the problem is, userfaultfd wasn't initially used for monitoring system
activities.  As its name implies, it provides the userspace a way to
resolve a fault, but only if a fault happens first..

Meanwhile, system activities should definitely at least involve swapouts,
which unfortunately don't involve page faults, but only happen the other
way round when the system wants to secretly move things out.. that is what
is outside userfaultfd's control.

It just sounds like it won't suffice for your needs even if we could add minor
fault support for anon private memory in the swap cache. However, if
userfaultfd is used to do everything (including swap in/outs), then it's by
nature all trappable + accountable, on both swap in/outs to/from any media.
Then swapout will be driven by the userspace too, then everything will be
in solid control, including monitoring of the activities.

Thanks,

-- 
Peter Xu

* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-09-29 20:30                               ` Peter Xu
@ 2025-10-01 22:16                                 ` Axel Rasmussen
  2025-10-17 21:07                                   ` David P. Reed
  0 siblings, 1 reply; 26+ messages in thread
From: Axel Rasmussen @ 2025-10-01 22:16 UTC (permalink / raw)
  To: Peter Xu; +Cc: David P. Reed, James Houghton, Andrew Morton, linux-mm

Thanks for linking the ExtMem paper David, that makes it a lot more
clear to me what your expectations are.

I think basically, userfaultfd has evolved incrementally, and it only
has a handful of features needed to address pretty specific use cases,
it doesn't have the full flexibility / generality you would need to do
"full memory management in userspace". Not to say I think it shouldn't
be able to do that from a philosophical point of view, I just mean to
say it would take quite a lot of work to get there.

Performance is also a big concern. Userfaultfd performance is not
great, in fact scalability issues are one of the reasons we have been
pursuing guest_memfd based approaches to VM demand paging, instead of
userfaultfd.

I don't disagree that in principle it makes sense for anon private
swap faults to generate userfaultfd minor fault events, it's just
until now nobody had ever wanted to do that, so it hasn't been
implemented yet. :) For what it's worth, I don't think this would get
you where you want to go by itself though, because the only action you
could take in response to such an event today is UFFDIO_CONTINUE,
which would simply swap in + map the page, you would have no
opportunity to e.g. populate the page contents from elsewhere, you'd
be delegating all of that to the existing in-kernel swap
implementation. So it doesn't really get you all the way to "full
userspace memory management".

On Mon, Sep 29, 2025 at 1:30 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Sep 29, 2025 at 03:44:52PM -0400, David P. Reed wrote:
> > I thought it was a general purpose interface. My mistake. But I think it
> > can be more general, at least encompassing my goal of having a userspace
> > "interface" that monitors processes' page faults.
>
> To James: thanks for the great writeup.  Somehow, I just feel like userfaultfd
> (as a linux submodule) got some sheer luck to have you around. :)
>
> To David: just to say, I still think it's a general purpose interface, at
> least that's the hope..
>
> I agree with you at least on one point you mentioned, that shmem also can
> swap, and that was accounted as minor faults when swapin happens at a
> specific virtual address.  It doesn't sound fair if anon isn't doing the
> same. Indeed.
>
> It was just not on the radar when minor fault was introduced by Axel; at
> that time it served the sole purpose of live migration, but the hope is
> that the interface as designed should serve a generic purpose.
>
> Now the problem is, userfaultfd wasn't initially used for monitoring system
> activities.  As its name implies, it provides the userspace a way to
> resolve a fault, but only if a fault happens first..
>
> Meanwhile, system activities should definitely at least involve swapouts,
> which unfortunately don't involve page faults, but only happen the other
> way round when the system wants to secretly move things out.. that is what
> is outside userfaultfd's control.
>
> It just sounds like it won't suffice for your needs even if we could add minor
> fault support for anon private memory in the swap cache. However, if
> userfaultfd is used to do everything (including swap in/outs), then it's by
> nature all trappable + accountable, on both swap in/outs to/from any media.
> Then swapout will be driven by the userspace too, then everything will be
> in solid control, including monitoring of the activities.
>
> Thanks,
>
> --
> Peter Xu
>

* Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
  2025-10-01 22:16                                 ` Axel Rasmussen
@ 2025-10-17 21:07                                   ` David P. Reed
  0 siblings, 0 replies; 26+ messages in thread
From: David P. Reed @ 2025-10-17 21:07 UTC (permalink / raw)
  To: Axel Rasmussen; +Cc: Peter Xu, James Houghton, Andrew Morton, linux-mm

Hi Axel -

Thanks for the long reply. I've been focused elsewhere for a couple weeks, but I'm getting back to this.

Comments below:

On Wednesday, October 1, 2025 18:16, "Axel Rasmussen" <axelrasmussen@google.com> said:

> Thanks for linking the ExtMem paper David, that makes it a lot more
> clear to me what your expectations are.

Note that personally, I'm not trying to do what the ExtMem people tried to do - implement full memory management in userspace. I'm simply trying to monitor all the paging activity in a set of processes. So in many ways my goal is less ambitious than theirs. And userfaultfd almost does everything I want, with the exception of the case of anonymous+private paging.

> 
> I think basically, userfaultfd has evolved incrementally, and it only
> has a handful of features needed to address pretty specific use cases,
> it doesn't have the full flexibility / generality you would need to do
> "full memory management in userspace". Not to say I think it shouldn't
> be able to do that from a philosophical point of view, I just mean to
> say it would take quite a lot of work to get there.
> 
> Performance is also a big concern. Userfaultfd performance is not
> great, in fact scalability issues are one of the reasons we have been
> pursuing guest_memfd based approaches to VM demand paging, instead of
> userfaultfd.

I don't really want to do VM demand paging. I've done that before, at a previous startup I co-founded, called TidalScale. (Well, you could call it demand paging, but in fact it was more complex.) However, HPE now owns TidalScale, and it would be silly for me to focus on that stuff (distributing virtual memory, virtual processors, and virtual I/O devices throughout a set of big servers, and migrating them among the different servers). We did some amazing things with that, but I also learned that there's a limit to what virtualization+migration can do, performance-wise.

> 
> I don't disagree that in principle it makes sense for anon private
> swap faults to generate userfaultfd minor fault events, it's just
> until now nobody had ever wanted to do that, so it hasn't been
> implemented yet. :) For what it's worth, I don't think this would get
> you where you want to go by itself though, because the only action you
> could take in response to such an event today is UFFDIO_CONTINUE,
> which would simply swap in + map the page, you would have no
> opportunity to e.g. populate the page contents from elsewhere, you'd
> be delegating all of that to the existing in-kernel swap
> implementation. So it doesn't really get you all the way to "full
> userspace memory management".

Yet, that's exactly the additional capability I want - just to get the event and continue, after doing some stuff with the information at the time of the event.

So if I could have just that, it would be great. I thought that it was there already, since the restriction isn't mentioned in the documentation.
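
To make "get the event and continue" concrete, here is a minimal sketch of the handler I have in mind. It assumes a userfaultfd that has already been opened, passed through UFFDIO_API, and registered in minor mode on some range; today that registration only succeeds for shmem or hugetlbfs mappings, so this will not see anonymous+private faults. The function name log_and_continue_one() and the fprintf() logging are my own placeholders, not anything from the kernel or the man page.

```c
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Read one message from an already-registered userfaultfd. If it is a
 * minor page fault, log the faulting page and resume the faulting thread
 * with UFFDIO_CONTINUE.
 *
 * Returns 1 if a minor fault was logged and continued, 0 if the message
 * was some other event, and -1 on error (errno comes from read()/ioctl()).
 * page_size must be a power of two: the page size of the registered range.
 */
static inline int log_and_continue_one(int uffd, uint64_t page_size)
{
    struct uffd_msg msg;
    ssize_t n;

    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    n = read(uffd, &msg, sizeof(msg));
    if (n < 0)
        return -1;
    if (n != (ssize_t)sizeof(msg)) {
        errno = EIO;            /* short read: treat as an I/O error */
        return -1;
    }

    if (msg.event != UFFD_EVENT_PAGEFAULT ||
        !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR))
        return 0;               /* not a minor fault; caller decides */

    /* Round the faulting address down to the start of its page. */
    uint64_t addr = msg.arg.pagefault.address & ~(page_size - 1);

    /* Placeholder for "doing some stuff with the information". */
    fprintf(stderr, "minor fault at %#llx\n", (unsigned long long)addr);

    /* Ask the kernel to map the existing page and wake the faulter. */
    struct uffdio_continue cont = {
        .range = { .start = addr, .len = page_size },
        .mode = 0,
    };
    if (ioctl(uffd, UFFDIO_CONTINUE, &cont) < 0)
        return -1;

    return 1;
}
```

A monitoring thread would call this in a loop after poll() reports the descriptor readable. The only work done between the event and the continue is the logging, which is all I need.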

The alternative for me is to write a lot of "out-of-tree" kernel code that hooks (using k[ret]probes?) into all the paging mechanisms in the kernel, and then maintain it across releases. I don't really want to do that. And to create a hypervisor extension just to do this from deep below the applications seems silly.

I realize that there is a performance drag to using userfaultfd, but for my purposes that is pretty irrelevant.

And I'm kind of surprised that this case doesn't "just work", since supposedly one can register for minor page faults on other non-file-backed pages, just not "MAP_PRIVATE" ones, which get rejected at the "register" ioctl.
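
For concreteness, here is a toy userspace model of the minor-mode check in vma_can_userfault() that I quoted at the start of this thread. It is only a sketch, not kernel code: struct toy_vma and its boolean fields are hypothetical stand-ins for is_vm_hugetlb_page(), vma_is_shmem() and vma_is_anonymous(), and the "proposed" variant just encodes my earlier suggestion. It says nothing about whether the fault path would then deliver the events, which as I understand it is not implemented.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the three VMA properties the check looks at. */
struct toy_vma {
    bool hugetlb;    /* models is_vm_hugetlb_page(vma) */
    bool shmem;      /* models vma_is_shmem(vma) */
    bool anonymous;  /* models vma_is_anonymous(vma) */
};

/* MAP_PRIVATE|MAP_ANONYMOUS: anonymous, not backed by shmem. */
static const struct toy_vma PRIVATE_ANON = { .anonymous = true };
/* MAP_SHARED|MAP_ANONYMOUS: backed by shmem, so not "anonymous" here. */
static const struct toy_vma SHARED_ANON = { .shmem = true };
/* A hugetlbfs mapping. */
static const struct toy_vma HUGETLB = { .hugetlb = true };

/* Current behaviour: minor mode is accepted only for hugetlb or shmem. */
static inline bool minor_allowed_current(const struct toy_vma *v)
{
    return v->hugetlb || v->shmem;
}

/* My suggestion: also accept anonymous VMAs at registration time. */
static inline bool minor_allowed_proposed(const struct toy_vma *v)
{
    return v->hugetlb || v->shmem || v->anonymous;
}

/* Checks that the model matches what I observed when registering. */
static inline void toy_self_check(void)
{
    assert(!minor_allowed_current(&PRIVATE_ANON)); /* register fails */
    assert(minor_allowed_current(&SHARED_ANON));   /* register works */
    assert(minor_allowed_proposed(&PRIVATE_ANON)); /* would be accepted */
}
```

The model reproduces my workaround: switching the mapping from MAP_PRIVATE to MAP_SHARED moves it from the rejected case to the accepted one.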

Regards,
David

> 
> 
> On Mon, Sep 29, 2025 at 1:30 PM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Mon, Sep 29, 2025 at 03:44:52PM -0400, David P. Reed wrote:
>> > I thought it was a general purpose interface. My mistake. But I think it
>> > can be more general, at least encompassing my goal of having a userspace
>> > "interface" that monitors processes' page faults.
>>
>> To James: thanks for the great writeup.  Somehow, I just feel like userfaultfd
>> (as a linux submodule) got some sheer luck to have you around. :)
>>
>> To David: just to say, I still think it's a general purpose interface, at
>> least that's the hope..
>>
>> I agree with you at least on one point you mentioned, that shmem also can
>> swap, and that was accounted as minor faults when swapin happens at a
>> specific virtual address.  It doesn't sound fair if anon isn't doing the
>> same. Indeed.
>>
>> It was just not in the radar when minor fault was introduced by Axel, even
>> it was for a solo purpose for live migration at that time.. but the hope is
>> the interface designed should service a generic purpose.
>>
>> Now the problem is, userfaultfd wasn't initially used for monitoring system
>> activities.  As its name implies, it provides the userspace a way to
>> resolve a fault, but only if a fault happens first..
>>
>> Meanwhile, system activities should definitely at least involve swapouts,
>> which unfortunately doesn't involve page faults, but only happen the other
>> way round when the system wants to secretly move things out.. that is what
>> userfaultfd is out of control.
>>
>> It just sounds like it won't suffice your need even if we could add minor
>> fault support for anon private memories on swap cache. However, if
>> userfaultfd is used to do everything (including swap in/outs), then it's by
>> nature all trappable + accountable, on both swap in/outs to/from any media.
>> Then swapout will be driven by the userspace too, then everything will be
>> in solid control, including monitoring of the activities.
>>
>> Thanks,
>>
>> --
>> Peter Xu
>>
> 


end of thread, other threads:[~2025-10-17 21:08 UTC | newest]

Thread overview: 26+ messages
2025-09-15 20:13 PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails David P. Reed
2025-09-15 20:24 ` James Houghton
2025-09-15 22:58   ` David P. Reed
2025-09-16  0:31     ` James Houghton
2025-09-16 14:48       ` Peter Xu
2025-09-16 15:52         ` David P. Reed
2025-09-16 16:13           ` Peter Xu
2025-09-16 17:09             ` David P. Reed
2025-09-26 22:16               ` Peter Xu
2025-09-16 17:27             ` David P. Reed
2025-09-16 18:35               ` Axel Rasmussen
2025-09-16 19:10                 ` James Houghton
2025-09-16 19:47                   ` David P. Reed
2025-09-16 22:04                   ` Axel Rasmussen
2025-09-26 22:00                     ` Peter Xu
2025-09-16 19:52                 ` David P. Reed
2025-09-17 16:13                   ` Axel Rasmussen
2025-09-19 18:29                     ` David P. Reed
2025-09-25 19:20                       ` Axel Rasmussen
2025-09-27 18:45                         ` David P. Reed
2025-09-29  5:30                           ` James Houghton
2025-09-29 19:44                             ` David P. Reed
2025-09-29 20:30                               ` Peter Xu
2025-10-01 22:16                                 ` Axel Rasmussen
2025-10-17 21:07                                   ` David P. Reed
2025-09-16 15:37       ` David P. Reed
