linux-mm.kvack.org archive mirror
From: "David P. Reed" <dpreed@deepplum.com>
To: "James Houghton" <jthoughton@google.com>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
	linux-mm@kvack.org, "Peter Xu" <peterx@redhat.com>,
	"Axel Rasmussen" <axelrasmussen@google.com>
Subject: Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
Date: Tue, 16 Sep 2025 11:37:19 -0400 (EDT)
Message-ID: <1758037039.08578612@apps.rackspace.com>
In-Reply-To: <CADrL8HX78-oh0k2qAgqPvNVAhi4ESYvjRsScPGR2P2Dts13Bfw@mail.gmail.com>



On Monday, September 15, 2025 20:31, "James Houghton" <jthoughton@google.com> said:

> On Mon, Sep 15, 2025 at 3:58 PM David P. Reed <dpreed@deepplum.com> wrote:
>>
>>
>>
>> On Monday, September 15, 2025 16:24, "James Houghton" <jthoughton@google.com>
>> said:
>>
>> > On Mon, Sep 15, 2025 at 1:13 PM David P. Reed <dpreed@deepplum.com> wrote:
>> >>
>> >>
>> >> [1.] One line summary of the problem: userfaultfd REGISTER minor mode on
>> >> MAP_PRIVATE fails
>> >> [2.] Full description of the problem/report:
>> >> The userfaultfd man page and the kernel docs seem to indicate that an area
>> >> mapped MAP_PRIVATE|MAP_ANONYMOUS can be registered to handle MINOR page
>> >> faults on regular pages.
>> >> However, testing showed that not to work. MAP_SHARED does allow registration
>> >> for MINOR page fault events, though.
>> >> Either the documentation or the code should be fixed, IMO. Now reading the
>> >> code that rejects this case in the kernel source, the test in
>> >> vma_can_userfault() that rejects this is this line:
>> >>         if ((vm_flags & VM_UFFD_MINOR) &&
>> >>             (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
>> >>                 return false;
>> >> which probably should include !vma_is_anonymous(vma).
>> >>
>> >> Or maybe the COW that might happen if the program were forked is something
>> >> that can't be handled, which seems odd.
>> >
>> > UFFDIO_CONTINUE, the resolution ioctl for userfaultfd minor faults,
>> > doesn't have defined semantics for MAP_PRIVATE mappings. The
>> > documentation is unclear that MAP_PRIVATE + userfaultfd minor faults
>> > is invalid, but this is intentional behavior.
>> >
>> > What would you like UFFDIO_CONTINUE on MAP_PRIVATE to do? Should it
>> > populate a read-only PTE? Should it do CoW and populate a writable
>> > PTE? I'm curious to hear more about your use case (and why UFFDIO_COPY
>> > doesn't do what you want).
>> >
>>
>> Well, I was just expecting UFFDIO_CONTINUE to do whatever "normally" gets
>> done. So, the normal case for MAP_PRIVATE|MAP_ANONYMOUS, if the page is in the
>> swap cache and thus takes a minor fault, would depend on whether the access was a
>> write or a read.
> 
> This minor fault is not a *userfaultfd* minor fault, and even if
> registering UFFD_REGISTER_MODE_MINOR on this VMA were allowed, you
> wouldn't get userfaults. This is because swap-outs for MAP_ANONYMOUS
> VMAs leave behind a swap entry (!pte_present() && !pte_none()).
> UFFDIO_CONTINUE cannot resolve this condition, so no minor fault is
> generated in the first place.
> 
> Why can't UFFDIO_CONTINUE resolve this condition? Well UFFDIO_CONTINUE
> only populates pte_none() PTEs; it will not and should not obliterate
> a swap entry. And no one has a use-case for making it trigger a
> swap-in.

Well, it's not a page-missing fault, because the page may be in the swap cache and not yet on disk, which the documentation says is not a major (page missing) fault. Maybe this is a documentation problem, if a "userfaultfd minor fault" and an ordinary "minor fault" aren't the same thing?
As I noted in the report, the documentation is pretty unclear on this point (and also on why MAP_PRIVATE doesn't work).

> 
> The same logic applies to CoW; CoW faults are not (minor) userfaults
> because UFFDIO_CONTINUE cannot resolve them.
> 
>> For a read, the page just gets installed in the page map from the swap cache.
>> For a write, if the page hasn't yet been copied, a copy is made of the swap cache
>> contents of that page at that point, and the new copy is installed into the page
>> table of the writing process.
> 
> Sure, but if this is the behavior you want, why do you want/need userfaultfd?

Because I am tracking page creation events. The COW case creates a new page, and UFFDIO_COPY isn't able to express "just proceed". If the COW is silent, then I won't see page creation by COW. Maybe we need another mode besides "missing" and "minor" that gets triggered by COW? (Note that write-protect isn't quite the same thing, because it is also triggered by writes that don't cause COW, if it is even allowed for MAP_ANONYMOUS|MAP_PRIVATE.)

> 
>> However, the problem I'm reporting is that I can't even register such a page for
>> minor page faults.
> 
> I understand; I find it easier to speak in terms of the behavior of
> the resolution ioctl (it is equivalent).
>
How is that equivalent to UFFDIO_REGISTER rejecting a mode?
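
The registration result itself can be observed directly. Below is a minimal sketch of a probe, assuming a kernel and headers recent enough to define UFFD_USER_MODE_ONLY and UFFDIO_REGISTER_MODE_MINOR. The helper name probe_uffd_register() is hypothetical and not taken from the kernel tree or the man page. It maps one page with the given mmap flags, attempts UFFDIO_REGISTER with the given mode bits, and returns 0 or a negative errno.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to register a fresh one-page mapping with the
 * given UFFDIO_REGISTER_MODE_* bits. Returns 0 on success, -errno otherwise.
 */
static int probe_uffd_register(int map_flags, unsigned long long mode)
{
	struct uffdio_api api = { .api = UFFD_API, .features = 0 };
	struct uffdio_register reg = { .mode = mode };
	long page = sysconf(_SC_PAGESIZE);
	void *addr;
	int uffd;
	int ret = 0;

	assert(page > 0);

	/* UFFD_USER_MODE_ONLY allows creation without extra privileges. */
	uffd = (int)syscall(SYS_userfaultfd,
			    O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
	if (uffd < 0)
		return -errno;

	/* The API handshake must precede any other ioctl on the fd. */
	if (ioctl(uffd, UFFDIO_API, &api) < 0) {
		ret = -errno;
		goto out_close;
	}

	addr = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		    map_flags, -1, 0);
	if (addr == MAP_FAILED) {
		ret = -errno;
		goto out_close;
	}

	reg.range.start = (unsigned long)addr;
	reg.range.len = (unsigned long)page;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
		ret = -errno;	/* the rejection under discussion lands here */

	munmap(addr, (size_t)page);
out_close:
	close(uffd);
	return ret;
}
```

Calling probe_uffd_register(MAP_PRIVATE | MAP_ANONYMOUS, UFFDIO_REGISTER_MODE_MINOR) reproduces the reported failure. The same helper with MAP_SHARED | MAP_ANONYMOUS, or with UFFDIO_REGISTER_MODE_WP, shows what a given kernel accepts. The result also depends on whether userfaultfd is available to the calling process at all, so a negative return is not always the registration check.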
 
>> Now there is the question of what UFFDIO_COPY (not continue) should mean. If
>> the page is MAP_PRIVATE, MAP_COPY is like writing to the page at the time of the
>> minor fault. So the version of the data in the swap cache for the page should be
>> ignored; replacing the local version makes sense. Any other process that still
>> has the original version from the time of the fork() that shared the page should
>> not be affected, I would think.
>>
>> There is a confusing possibility, however, with the file descriptor for uffd. In
>> the case of a fork(), the file descriptor would be shared, and so either fork
>> could end up listening via poll/select.
>>
>> It's hard to decide what is right semantically, because the normal use of
>> userfault is to monitor from another process, though you can use read() in the
>> same process as the faulting one - this seems to be because either fork or a
>> unix-socket can be the path for sending the file descriptor to another process.
>> But this is just definitional, the actual user design would have to handle faults
>> in one place or another.
>>
>> Now in this case, whichever process does the first read() on the file descriptor
>> would get the information about the minor fault. (I assume both would NOT, but
>> I'm early in my use of userfaultfd). So it could continue or copy, as desired.
>>
>> Generally, anyone using userfaultfd would understand the nuances of fork() and
>> file handle duplication. So they would probably close the fd in one process or
>> the other, as appropriate. (I admit I haven't tested what happens if both forks
>> try to use the file descriptor, but I can imagine it might be useful if they
>> coordinate carefully).
> 
> I am not really following how the above connects to not being able to
> use userfaultfd minor faults for MAP_PRIVATE.
> 
>>
>> Now, if many forks end up sharing the uffd file descriptor and also end up with
>> copy-on-write shared pages in the MAP_PRIVATE region, the above definitions of
>> the continue and copy would continue to make sense - to me anyway.
>>
>> Hope this helps
> 
> I still don't have a solid grasp of what your use case is.
> 

My use case is simple, and has been described elsewhere by others: creating a userspace paging monitor process that can catch page faults, either tracking them or modifying their behavior. Not being able to handle anonymous private pages at all seems to make userfaultfd useless for that purpose. (I am doing research on using that fault information to drive madvise-based LRU management for certain cases. I can do this with kretprobes through a kernel driver, but since the mm code is evolving rapidly, that is nothing like an ABI that works across versions.)
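
For concreteness, the classification step of such a monitor might look like the sketch below. The struct uffd_msg layout and the UFFD_PAGEFAULT_FLAG_* bits come from <linux/userfaultfd.h>. The enum fault_kind and the function classify_uffd_msg() are hypothetical names made up for illustration. The sketch assumes the monitor has already read() one message from the userfaultfd.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stddef.h>

/* Hypothetical labels for what a monitor might want to count. */
enum fault_kind {
	FAULT_NOT_A_FAULT,	/* fork, remap, remove, ... events */
	FAULT_MISSING_READ,
	FAULT_MISSING_WRITE,
	FAULT_WRITE_PROTECT,
	FAULT_MINOR,
};

/* Map one message read from the userfaultfd to a label. */
static enum fault_kind classify_uffd_msg(const struct uffd_msg *msg)
{
	unsigned long long flags;

	assert(msg != NULL);

	if (msg->event != UFFD_EVENT_PAGEFAULT)
		return FAULT_NOT_A_FAULT;

	flags = msg->arg.pagefault.flags;

	/* MINOR and WP may be combined with WRITE, so test them first. */
	if (flags & UFFD_PAGEFAULT_FLAG_MINOR)
		return FAULT_MINOR;
	if (flags & UFFD_PAGEFAULT_FLAG_WP)
		return FAULT_WRITE_PROTECT;
	if (flags & UFFD_PAGEFAULT_FLAG_WRITE)
		return FAULT_MISSING_WRITE;
	return FAULT_MISSING_READ;
}
```

None of these labels covers the silent COW case discussed earlier in this message, since no message is delivered for it.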

It's interesting that minor fault mode is not rejected for anonymous private huge pages, only for regular pages. (The code snippet above is what rejects regular pages but not privately mapped huge pages.)
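
The effect of the quoted check can be written out as a small userspace truth-table model. This is only a sketch: struct vma_model and both function names are hypothetical, the three booleans stand in for is_vm_hugetlb_page(), vma_is_shmem() and vma_is_anonymous(), and nothing here is kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the three VMA predicates; not a kernel structure. */
struct vma_model {
	bool hugetlb;	/* models is_vm_hugetlb_page(vma) */
	bool shmem;	/* models vma_is_shmem(vma) */
	bool anonymous;	/* models vma_is_anonymous(vma) */
};

/* The check as quoted: minor mode requires hugetlb or shmem backing. */
static bool minor_allowed_as_quoted(const struct vma_model *v)
{
	assert(v != NULL);
	return v->hugetlb || v->shmem;
}

/* The variant floated in the report: also accept anonymous VMAs. */
static bool minor_allowed_if_anon_added(const struct vma_model *v)
{
	assert(v != NULL);
	return v->hugetlb || v->shmem || v->anonymous;
}
```

The quoted check never looks at MAP_PRIVATE versus MAP_SHARED directly. A MAP_SHARED|MAP_ANONYMOUS mapping is shmem-backed, which would explain why it passes, while a hugetlb mapping passes whether or not it is private.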

Just curious: are you a designer or maintainer of userfaultfd? You aren't listed as a maintainer. I could provide a patch set that "fixes" the behavior to work the way I believe would be most useful, but the point of reporting this as a problem first is to avoid rejection by the maintainer, Andrew Morton, in case I've missed a subtle concern that isn't explained in the documentation.
I could also provide a documentation patch to clarify that minor fault registration is rejected for MAP_PRIVATE|MAP_ANONYMOUS mappings.


