linux-mm.kvack.org archive mirror
From: James Houghton <jthoughton@google.com>
To: "David P. Reed" <dpreed@deepplum.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>,
	Peter Xu <peterx@redhat.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org
Subject: Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
Date: Sun, 28 Sep 2025 22:30:10 -0700
Message-ID: <CADrL8HW0eNsHnEsEdKYRNvFRBMvrDMrHawa55Kik9QFeVNEwgA@mail.gmail.com>
In-Reply-To: <1758998720.44976697@apps.rackspace.com>

On Sat, Sep 27, 2025 at 11:45 AM David P. Reed <dpreed@deepplum.com> wrote:
>
> OK - responses below.

I think Peter will be able to help you the most, but I want to give my
two cents anyway.

>
> I'm still unclear what my role is vs. the others cc'ed on this problem report is.
>
> Is anyone here (other than Andrew) a decision maker on what userfaultfd is supposed to do? I can see what the current code DOES - and honestly, it's seriously whacked semantically. (see the ExtMem paper for a reasonable use case that it cannot serve, my use case is quite similar). So is anyone here wanting to improve the functionality? I'm sure its current functions are used by some folks here - Google employees presumably focused on ChromeOS or Android, I suppose, suggest that there's a use case there.

I think all of us want userfaultfd to be as useful as possible. :)
Peter, Axel, and I are quite familiar with userfaultfd's use as a tool
for enabling post-copy live migration for virtual machines.
Userfaultfd minor faults were created expressly for this purpose. Axel
wrote the userfaultfd minor fault support; I wrote the corresponding
userspace code to use it in Google Cloud.

Peter is quite a bit more familiar with userfaultfd than me (and I
think Axel, but I don't want to speak for him), so please excuse our
mistakes. (mm is complicated!)

A few others who care about userfaultfd might jump in as soon as
patches get sent. I think those folks (on top of Peter and Andrew:
people like Suren, Lorenzo, and David Hildenbrand) will be the ones
who Ack or Nak the patches.

>
>
>
> My role started out by reporting that the documentation is both incomplete and confusing, both in the man pages and the "kernel documentation". And the rationale presented in the documentation doesn't make sense. Some of you guys admit that you really don't understand how "swap" is different from "file-backed paging" (except for the corner cases of hugetlbfs [sort of "file backed"], "file-backed by /dev/zero" [which ends up using "swap"], and tmpfs [also "file backed" but using "swap"]. And yet "anonymous, private" uses "swap" and the "swap cache", not the "page cache".

The documentation is confusing; I agreed with you originally that it
should be updated. (Do you want to send a patch? Perhaps I could give
it a go when I find the time.)

I spent some time writing out how I define the various terms being
used here. I'll leave that at the end of this email in case it is
helpful; otherwise, please just ignore it. I wouldn't say that the
rationale in the documentation doesn't make sense. Userfaultfd exists
to solve specific problems.

>
> Now, after digging into the question, I feel like there was never, ever a coherent architectural design for userfaultfd as a function. It's apparently just a "hack", not a "feature".

Userfaultfd certainly isn't perfect, but it is critical for things
like VM live migration, Android GC, CRIU, etc.

>
> I'd be happy to propose a much more coherent design (in my opinion as an operating systems designer for the past more than 20 years, starting with Multics in 1970 - you guys may not be interested in my input, which is fair. Is Linus interested? That would be a bunch of work for me, because I would do a thorough job, not just a bunch of random patches. But I'm not proposing to join the maintainer-club - I'm retired from that space, and I find the Linux kernel contributors poorly organized and chaotic.
>
> Or, I can just drop this interaction - concluding that userfaultfd is kind of useless as is, and really badly documented to boot.

I am interested to hear your ideas for how you think userfaultfd
should work and how it solves your problem. :) At the end of the day,
I'm just trying (though clearly failing miserably) to help you solve
your problem.

Your characterization of userfaultfd as a "useless" "bunch of random
patches" that is just a "hack" is wrong. I understand; it doesn't
support your needs. I think what Peter, Axel, and I have been trying
to understand is what exactly you're trying to do and how userfaultfd
could (or could not) help you get there. You've shared some[1]
details[2] about what you're looking for, so thank you for that, but I
am still struggling to understand how the flexibility that you're
asking for is actually the right tool for the problem(s) you're trying
to solve.

[1]: https://lore.kernel.org/linux-mm/1758037039.08578612@apps.rackspace.com/
[2]: https://lore.kernel.org/linux-mm/1758042583.108320755@apps.rackspace.com/

> There is no sensible way to respond to a "missing event" when "missing" means the page is swapped out (to SWAP) by UFFDIO_COPY or UFFDIO_ZEROPAGE. That's just weird, and you continue to insist on it. Where is the page that was swapped out? Well, one could look at the PTE in /proc/pid/maps, and you find that its "swap entry" is there as an index into a block device. (so, maybe you can open the swap device using some file descriptor and mmap() it into the manager process, then UFFDIO_COPY, but what if the swap page is actually in the "swap cache", you can't mmap any swap cache page via any userspace API - do you know a way to do that?)

(Please see the terms that I use at the bottom of this email; let me
reply using those terms.)

UFFDIO_COPY has quite well-defined semantics (albeit, perhaps not
*documented* well):

* For anonymous VMAs: UFFDIO_COPY will allocate page(s), copy some
user memory into the page(s) and map those pages at the specified VAs.
* For hugetlbfs and shmem/tmpfs VMAs, UFFDIO_COPY will fill holes in
the file's page cache with new pages, copy the user memory in, and map
those pages. UFFDIO_CONTINUE is additionally supported; it skips the
hole-filling step and requires the page cache to be populated.

For UFFDIO_COPY, if a page at a to-be-populated VA has already been
allocated (including if it has since been reclaimed), the call will be
rejected. Allowing it would effectively overwrite the contents of the
page; this is not supported today.

If "missing" includes swapped out pages, UFFDIO_COPY and
UFFDIO_ZEROPAGE would need to be allowed to overwrite the existing
contents. "Sensible" or not, there has been no need for this yet.

> Now I reported a bug in UFFDIO_REGISTER [...]

The bug you reported is in the documentation only.

> [...] which you keep saying is the same as UFFDIO_CONTINUE. Well, it isn't! I can register a minor handler (which allows continue) if I use MAP_ANONYMOUS|MAP_SHARED. The same "swap cache" mechanics exactly apply. The only "sharing" is potential future sharing after that process forks, in which case, the same "swap page" is shared until a Copy on Write forces the page to be unshared - it is a writeable page, just sharing the same physical block. It can be swapped out to the swap cache and the swap device, which sets the PTE to be a "swap entry" that causes a page fault.

(Using the terms at the bottom of this email.)

For UFFDIO_CONTINUE, the swap cache mechanics work like this:

1. For anonymous pages in the VMA: swap-outs will not clear the PTEs,
touching the page will swap it back in again, UFFDIO_CONTINUE on it is
disallowed.
2. For page cache pages in the VMA (i.e., not-yet-written-to pages for
MAP_PRIVATE, any page for MAP_SHARED): swap-outs will clear the PTEs,
and touching the page will trigger a minor fault, and UFFDIO_CONTINUE
will swap it back in.

For MAP_ANONYMOUS|MAP_PRIVATE, all pages in the VMA will be anonymous
pages, so UFFDIO_CONTINUE will never be allowed; therefore,
registration is disallowed in the first place.

(IMHO, it was dubious to have even allowed registering userfaultfd
minor faults with *any* MAP_PRIVATE VMA.)

> The swap device doesn't know where the pages are mapped. You need to look at the PTEs of all the processes to find the translation to swap cache entry, and if you want to go backward from swap entry to pages, you need to use a special XArray that finds VMAs given swap entry.
>
> But the point here I keep making is that UFFDIO_REGISTER rejects only MAP_ANONYMOUS that are MAP_PRIVATE and also not huge pages. To me that's weird.

I hope my above explanation (of sorts) makes it a little less weird.

> If it is the CoW case that doesn't work (I doubt it), well, you have to read the swapped out page into memory before copying it anyway. Then you copy on write, from the page read or found in the swap cache.
>
> Now, as you say, that may require allocating a new page, also in the swap cache. Is that a "missing" page in the weird userfaultfd terminology? If so, to handle it can't be done with UFFDIO_COPY, because you can't access the contents from userspace. And it's not "write protected" from the perspective of WP.

No, it isn't a missing userfault. Data exists at the VA for which a
userfault would be generated; therefore, it cannot be "missing".

>
>
>
> > The only exception I can
> > think of is swap faults, I could see anon swap faults (perhaps
> > specifically when the page is in the swap cache?) being considered
> > UFFD minor faults, but I would be curious to know what the use case is
> > for that / why you would want to do that. The original use case for
> > UFFD minor fault support was demand paging for VMs, where you have
> > some kind of shared memory (shmem or hugetlb) where one side of the
> > mapping is given to the VM, and the other side of the shared mapping
> > is used by the hypervisor to populate guest memory on-demand in
> > response to userfaultfd events.
>
>
>
> I think I've just answered this. userfaultfd doesn't support the "swap out" part of anonymous swapping at all. So, how could a manager get the page contents as of the instant it is put in the swap cache for writing out to the swap device? There's no "swap out" event mechanism, and no way to treat the swap device cached into the swap cache as a page source. (not to mention the zswap mechanism, which compresses some of the pages into an invisible piece of memory).
>
>
> >
> > To me it's not intended userfaultfd minor events are generated for
> > writeprotect faults, to me that's the domain of userfaultfd-wp, not
> > minor faults. James might be right that these unintentionally trigger
> > minor faults today, I would need to do some more reading of the code
> > to be certain though.
>
> I don't particularly care about writeprotect faults, but CoW probably shouldn't be considered the same as a writeprotect fault, because CoW is triggered by a write into a writeable area, ONLY in one of the mappings, whichever is written first. The process doesn't think of it as a "write" - it just is a kernel optimization of a common case where fork is followed by non-use, so the actual copy could have been done at fork time, semantically. It's a deferred read and allocation.
>
>
>
> I hope this helps clarify my concerns.
>
> There are several reasonable outcomes -
>
> 1. Much better documentation of what the code actually does (and why).

Agreed.

> 2. Fix the "bug" that prevents REGISTER of "minor" handler on private, anonymous mappings (obviously, you can REGISTER missing handlers as well), then document actually what happens during the life cycle of swapping of pages in detail, including MAP_PRIVATE|MAP_ANONYMOUS VMAs.

Not a bug.

> 3. Do a thorough analysis of what userfaultfd really should do, if the goal is to provide the ability of a "manager process" to get to handle all cases of page fault behavior on a case-by-case basis for regions of user addressable pages.

What userfaultfd "should do" is up to the problems we need it to solve.

> I'd be happy to contribute to (but not manage) whichever outcome - and I have what I think is a reasonable use case. (and I'm aware that this API accidentally created a serious hacker exploit earlier in its life, by creating a way to hang one process from another. I think that's no longer so easy.)

I would be glad to hear what changes you think should be made to
userfaultfd to better suit your needs.

Sorry if this reply is somewhat incoherent; I've gone back and forth a
few times on how to respond to your points in the most helpful way I
can. I've tried to be as clear as possible without being too verbose.

- James

--

Alrighty, here are the terms/definitions I use, as I mentioned above.
Again, feel free to ignore them if they are unhelpful:

A "file-backed VMA" will load pages into the page cache. For most
filesystems, the page is loaded from a disk (or a proper device), but
for special filesystems like tmpfs, hugetlbfs, and ramfs, the page
cache is populated with zeroed pages initially.

tmpfs is kind of like a filesystem API for shmem, but they are so
interconnected that many people use the terms interchangeably. (To
clarify, I don't think of "shmem" as shorthand for "shared memory"; to
me, it is the name of an mm subsystem.) Every MAP_ANONYMOUS|MAP_SHARED
VMA is a shmem VMA; it is as if there is a tmpfs file backing VMAs
like these, so they are in some contexts considered "file-backed". See
shmem_zero_setup(). As far as I'm concerned, vma->vm_file is set, so
the VMA is file-backed (even though the mmap flags included
MAP_ANONYMOUS). I assume this is what you are referring to when you
say "file-backed by /dev/zero".

For any MAP_PRIVATE VMA, some pages may be "anonymous", in that no
page cache is holding a reference to them (i.e., generally speaking,
the only references on such a page are the ones taken by the PTEs
mapping it). Reclaim of pages like these will put them in a swap cache.

For pages where a reference is held in a page cache, if the page is
dirty, it can be written out to disk. shmem implements "writeout" by
swapping just like anonymous pages, but other filesystems implement it
how you would expect.


