James -


This was very helpful; I can now decode things a bit better.


The "big picture" insight you provided is that it is primarily (exclusively?) focused on post-copy Live Migration as its motivating use case never was clear to me before you clarified that in this message. Aha!


That's certainly different from what I'm hoping to use it for. (Just as an aside: starting in 2012 or so, I did a lot of design and implementation work on VM-based virtual memory, both at SAP Labs Research and at a startup I co-founded, TidalScale, which created an "inverse virtualization" platform that moved memory among the nodes of a tightly coupled "distributed x86 virtual machine". Essentially, it was a system constantly executing as if it were in post-copy live migration - the pages flowed between nodes, as did the virtual CPUs. HPE acquired the product, which worked very well.)


I'm not focused on live migration at all, so you can see why I might have been confused. What really interests me here is moving "kernel functions" out of the kernel - there has been a lot of work, for example, on I/O from userspace, which I follow closely. I grew up doing OS research in the early 1970s, when for lots of reasons the "monolithic kernel" design was resisted (e.g., in the Unix sphere, Mach at CMU). During my M.S. I worked on the Multics operating system, in particular on paging, and my bachelor's thesis dealt with multiprocessor and multiprocess paging behavior, since Multics was what we would now call a "multiprocessor-centric" operating system, with many CPUs sharing memory.

So what I have spent a lot of time thinking about over the years is how a system with many processes on many CPUs can share memory effectively when competing for RAM, cache, and "disk".

My 1973 bachelor's thesis was "Estimating Working Sets on Multics" (a time-sharing system that supported ~100 concurrent users when provisioned with 3 processors, and more with 8-10). The thesis recognized that by abandoning reclaim based on a common shared LRU list, the OS could make more efficient use of RAM while swapping out to a "paging drum", which had very low latency for its time. So my brain is wired to notice that current Linux paging (reclaim and fault handling) isn't great. [Well, Linux was born as a uniprocessor OS, and it is still architected to privilege working well on a uniprocessor, rather than starting, as Multics did, from the idea that there are lots of cores. You can see the mess in all the global locks in the Linux mm code, which are slowly being addressed.]


That's more context.

So userfaultfd is a tool that I think can be used to move monitoring (which may include supervising reclaim) into userspace. It's not complete. process_madvise() may allow moving more into ring 3, but unfortunately it doesn't support MADV_PAGEOUT from madvise().
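
As a concrete sketch of the shape I have in mind (error handling elided; on recent kernels an unprivileged caller may also need UFFD_USER_MODE_ONLY or the vm.unprivileged_userfaultfd sysctl): a monitor registers a region for missing faults and resolves them from userspace. The policy comment marks where the interesting monitoring work would go.

    /* Minimal monitor skeleton (error handling elided). A real monitor
     * would receive the uffd from the monitored process over a unix
     * socket, or run the faulting work in other threads. */
    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <poll.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        long psize = sysconf(_SC_PAGESIZE);
        int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);

        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);          /* handshake comes first */

        char *region = mmap(NULL, 16 * psize, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct uffdio_register reg = {
            .range = { .start = (unsigned long)region, .len = 16 * psize },
            .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        struct pollfd pfd = { .fd = uffd, .events = POLLIN };
        while (poll(&pfd, 1, -1) > 0) {
            struct uffd_msg msg;
            if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
                continue;
            if (msg.event == UFFD_EVENT_PAGEFAULT) {
                /* Monitoring/policy decision goes here; then resolve
                 * with UFFDIO_COPY or (as here) UFFDIO_ZEROPAGE. */
                struct uffdio_zeropage zp = { .range = {
                    .start = msg.arg.pagefault.address & ~(psize - 1),
                    .len   = psize } };
                ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
            }
        }
        return 0;
    }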

(I am not a believer in the idea that the Linux kernel is where you protect the system from hackers. The argument for moving function out of the kernel is that you untangle the spaghetti mess of the Linux kernel. Userspace processes can be inside the security perimeter of the system if they are well designed and the kernel supports the right abstractions and the right protection mechanisms between processes. I am not sure that Torvalds agrees, but I am a lot more experienced than he is. It's his system, though.)


Comments intercalated below.


On Monday, September 29, 2025 01:30, "James Houghton" <jthoughton@google.com> said:

> On Sat, Sep 27, 2025 at 11:45 AM David P. Reed <dpreed@deepplum.com>
> wrote:
> >
> > OK - responses below.
>
> I think Peter will be able to help you the most, but I want to give my
> two cents anyway.
>
> >
> > I'm still unclear what my role is vs. the others cc'ed on this problem report
> > is.
> >
> > Is anyone here (other than Andrew) a decision maker on what userfaultfd is
> > supposed to do? I can see what the current code DOES - and honestly, it's
> > seriously whacked semantically. (see the ExtMem paper for a reasonable use case
> > that it cannot serve, my use case is quite similar). So is anyone here wanting to
> > improve the functionality? I'm sure its current functions are used by some folks
> > here - Google employees presumably focused on ChromeOS or Android, I suppose,
> > suggest that there's a use case there.
>
> I think all of us want userfaultfd to be as useful as possible. :)
> Peter, Axel, and I are quite familiar with userfaultfd's use as a tool
> for enabling post-copy live migration for virtual machines.
> Userfaultfd minor faults were created expressly for this purpose. Axel
> wrote the userfaultfd minor fault support; I wrote the corresponding
> userspace code to use it in Google Cloud.


Excellent clarification. And congratulations on making that work.


>
> Peter is quite a bit more familiar with userfaultfd than me (and I
> think Axel, but I don't want to speak for him), so please excuse our
> mistakes. (mm is complicated!)
>
> There are a few others who care about userfaultfd who might jump in as
> soon as patches get sent. I think these folks (so on top of Peter and
> Andrew, people like Suren, Lorenzo, David Hildenbrand) will be the
> folks who Ack or Nak the patches.


That's good to know.


>
> >
> >
> >
> > My role started out by reporting that the documentation is both incomplete
> > and confusing, both in the man pages and the "kernel documentation". And the
> > rationale presented in the documentation doesn't make sense. Some of you guys
> > admit that you really don't understand how "swap" is different from "file-backed
> > paging" (except for the corner cases of hugetlbfs [sort of "file backed"],
> > "file-backed by /dev/zero" [which ends up using "swap"], and tmpfs [also "file
> > backed" but using "swap"]). And yet "anonymous, private" uses "swap" and the "swap
> > cache", not the "page cache".
>
> The documentation is confusing; I agreed with you originally that it
> should be updated. (Do you want to send a patch? Perhaps I could give
> it a go when I find the time.)
>
> I spent some time writing out how I define the various terms being
> used here, I'll leave it at the end of this email in case it is
> helpful, but otherwise please just ignore it. I wouldn't say that the
> rationale in the documentation doesn't make sense. Userfaultfd exists
> to solve specific problems.

I thought it was a general-purpose interface. My mistake. But I think it can be made more general, at least general enough to encompass my goal of a userspace "interface" that monitors processes' page faults.


>
> >
> > Now, after digging into the question, I feel like there was never, ever a
> > coherent architectural design for userfaultfd as a function. It's apparently just
> > a "hack", not a "feature".
>
> Userfaultfd certainly isn't perfect, but it is critical for things
> like VM live migration, Android GC, CRIU, etc.
>
> >
> > I'd be happy to propose a much more coherent design (in my opinion as an
> > operating systems designer for the past more than 20 years, starting with Multics
> > in 1970 - you guys may not be interested in my input, which is fair). Is Linus
> > interested? That would be a bunch of work for me, because I would do a thorough
> > job, not just a bunch of random patches. But I'm not proposing to join the
> > maintainer-club - I'm retired from that space, and I find the Linux kernel
> > contributors poorly organized and chaotic.
> >
> > Or, I can just drop this interaction - concluding that userfaultfd is kind of
> > useless as is, and really badly documented to boot.
>
> I am interested to hear your ideas for how you think userfaultfd
> should work and how it solves your problem. :) At the end of the day,
> I'm just trying (though clearly failing miserably) to help you solve
> your problem.
>
> Your characterization of userfaultfd as a "useless" "bunch of random
> patches" that is just a "hack" is wrong. I understand; it doesn't
> support your needs. I think what Peter, Axel, and I have been trying
> to understand is what exactly you're trying to do and how userfaultfd
> could (or may not) help you get there. You've shared some[1]
> details[2] about what you're looking for, so thank you for that, but I
> am still struggling to understand how the flexibility that you're
> asking for is actually the right tool for the problem(s) you're trying
> to solve.
>
> [1]: https://lore.kernel.org/linux-mm/1758037039.08578612@apps.rackspace.com/
> [2]: https://lore.kernel.org/linux-mm/1758042583.108320755@apps.rackspace.com/
>
> > There is no sensible way to respond with UFFDIO_COPY or UFFDIO_ZEROPAGE to a
> > "missing event" when "missing" means the page is swapped out (to SWAP). That's
> > just weird, and you continue to insist on it. Where is the page that was swapped
> > out? Well, one could look at the PTE via /proc/pid/pagemap, and you find that its
> > "swap entry" is there as an index into a block device. (so, maybe you can open
> > the swap device using some file descriptor and mmap() it into the manager
> > process, then UFFDIO_COPY, but what if the swap page is actually in the "swap
> > cache"? you can't mmap any swap cache page via any userspace API - do you know a
> > way to do that?)
>
> (Please see the terms that I use at the bottom of this email; let me
> reply using those terms.)
>
> UFFDIO_COPY has quite well-defined semantics (albeit, perhaps not
> *documented* well):
>
> * For anonymous VMAs: UFFDIO_COPY will allocate page(s), copy some
> user memory into the page(s) and map those pages at the specified VAs.
> * For hugetlbfs and shmem/tmpfs VMAs, UFFDIO_COPY will fill holes in
> the file's page cache with new pages, copy the user memory in, and map
> those pages. UFFDIO_CONTINUE is additionally supported; it skips the
> hole-filling step and requires the page cache to be populated.
>
> For UFFDIO_COPY, if a page at a to-be-populated VA has already been
> allocated (including if it has been reclaimed), the call will be
> rejected. It would effectively be overwriting the contents of the
> page; this is not supported today.
>
> If "missing" includes swapped out pages, UFFDIO_COPY and
> UFFDIO_ZEROPAGE would need to be allowed to overwrite the existing
> contents. "Sensible" or not, there has been no need for this yet.
>
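
To make sure I follow, here is the rejected-overwrite case as I understand it, as a sketch (the helper name is mine, error handling is minimal):

    #include <errno.h>
    #include <stddef.h>
    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>

    /* Resolve a missing fault at page-aligned 'addr' by installing a
     * newly allocated page whose contents come from 'src', a buffer in
     * the monitor's own address space. Assumes 'uffd' is registered
     * for MISSING faults over the range containing 'addr'. */
    static int resolve_missing(int uffd, unsigned long addr,
                               void *src, size_t page_size)
    {
        struct uffdio_copy copy = {
            .dst = addr,
            .src = (unsigned long)src,
            .len = page_size,
        };
        if (ioctl(uffd, UFFDIO_COPY, &copy) == -1) {
            if (errno == EEXIST) {
                /* A page was already allocated at dst - even if it has
                 * since been swapped out. This is the case described
                 * above as rejected, since COPY would overwrite it. */
            }
            return -1;
        }
        return 0;
    }
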
> > Now I reported a bug in UFFDIO_REGISTER [...]
>
> The bug you reported is in the documentation only.
>
> > [...] which you keep saying is the same as UFFDIO_CONTINUE. Well, it isn't! I
> > can register a minor handler (which allows continue) if I use
> > MAP_ANONYMOUS|MAP_SHARED. The same "swap cache" mechanics exactly apply. The only
> > "sharing" is potential future sharing after that process forks, in which case, the
> > same "swap page" is shared until a Copy on Write forces the page to be unshared -
> > it is a writeable page, just sharing the same physical block. It can be swapped
> > out to the swap cache and the swap device, which sets the PTE to be a "swap entry"
> > that causes a page fault.
>
> (Using the terms at the bottom of this email.)
>
> For UFFDIO_CONTINUE, the swap cache mechanics are like:
>
> 1. For anonymous pages in the VMA: swap-outs will not clear the PTEs,
> touching the page will swap it back in again, UFFDIO_CONTINUE on it is
> disallowed.
> 2. For page cache pages in the VMA (i.e., not-yet-written-to pages for
> MAP_PRIVATE, any page for MAP_SHARED): swap-outs will clear the PTEs,
> and touching the page will trigger a minor fault, and UFFDIO_CONTINUE
> will swap it back in.
>
> For MAP_ANONYMOUS|MAP_PRIVATE, all pages in the VMA will be anonymous
> pages, so UFFDIO_CONTINUE will never be allowed, therefore
> registration in the first place is disallowed.
>
> (IMHO, it was dubious to have even allowed registering userfaultfd
> minor faults with *any* MAP_PRIVATE VMA.)
>
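
For the record, here's the minor-fault flow you describe as I understand it, in sketch form (assumes UFFD_FEATURE_MINOR_SHMEM was negotiated via UFFDIO_API and that 'memfd' is a memfd_create() fd already sized with ftruncate(); names are mine, error handling elided):

    #include <linux/userfaultfd.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    /* Two mappings of the same shmem file: the "guest" view is
     * registered for minor faults; the "hypervisor" view is used to
     * populate the page cache before UFFDIO_CONTINUE maps the page. */
    static void minor_fault_demo(int uffd, int memfd, size_t len,
                                 size_t page_size)
    {
        char *guest = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, memfd, 0);
        char *hyper = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, memfd, 0);

        struct uffdio_register reg = {
            .range = { .start = (unsigned long)guest, .len = len },
            .mode  = UFFDIO_REGISTER_MODE_MINOR,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);   /* OK: MAP_SHARED shmem */

        /* On a minor fault at guest+off: fill the page cache through
         * the other view, then install the now-present page. */
        size_t off = 0;
        memset(hyper + off, 0x5a, page_size);
        struct uffdio_continue cont = { .range = {
            .start = (unsigned long)guest + off, .len = page_size } };
        ioctl(uffd, UFFDIO_CONTINUE, &cont);
    }
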
> > The swap device doesn't know where the pages are mapped. You need to look at
> > the PTEs of all the processes to find the translation to swap cache entry, and if
> > you want to go backward from swap entry to pages, you need to use a special XArray
> > that finds VMAs given swap entry.
> >
> > But the point here I keep making is that UFFDIO_REGISTER rejects only
> > MAP_ANONYMOUS that are MAP_PRIVATE and also not huge pages. To me that's weird.
>
> I hope my above explanation (of sorts) makes it a little less weird.
>
> > If it is the CoW case that doesn't work (I doubt it), well, you have to read
> > the swapped out page into memory before copying it anyway. Then you copy on write,
> > from the page read or found in the swap cache.
> >
> > Now, as you say, that may require allocating a new page, also in the swap
> > cache. Is that a "missing" page in the weird userfaultfd terminology? If so,
> > handling it can't be done with UFFDIO_COPY, because you can't access the
> > contents from userspace. And it's not "write protected" from the perspective
> > of WP.
>
> No, it isn't a missing userfault. Data exists at the VA for which a
> userfault would be generated, therefore it cannot be "missing".
>
> >
> >
> >
> > > The only exception I can
> > > think of is swap faults, I could see anon swap faults (perhaps
> > > specifically when the page is in the swap cache?) being considered
> > > UFFD minor faults, but I would be curious to know what the use case is
> > > for that / why you would want to do that. The original use case for
> > > UFFD minor fault support was demand paging for VMs, where you have
> > > some kind of shared memory (shmem or hugetlb) where one side of the
> > > mapping is given to the VM, and the other side of the shared mapping
> > > is used by the hypervisor to populate guest memory on-demand in
> > > response to userfaultfd events.
> >
> >
> >
> > I think I've just answered this. userfaultfd doesn't support the "swap out"
> > part of anonymous swapping at all. So, how could a manager get the page contents
> > as of the instant it is put in the swap cache for writing out to the swap device?
> > There's no "swap out" event mechanism, and no way to treat the swap device cached
> > into the swap cache as a page source. (not to mention the zswap mechanism, which
> > compresses some of the pages into an invisible piece of memory).
> >
> >
> > >
> > > To me it's not intended userfaultfd minor events are generated for
> > > writeprotect faults, to me that's the domain of userfaultfd-wp, not
> > > minor faults. James might be right that these unintentionally trigger
> > > minor faults today, I would need to do some more reading of the code
> > > to be certain though.
> >
> > I don't particularly care about writeprotect faults, but CoW probably
> > shouldn't be considered the same as a writeprotect fault, because CoW is triggered
> > by a write into a writeable area, ONLY in one of the mappings, whichever is
> > written first. The process doesn't think of it as a "write" - it just is a kernel
> > optimization of a common case where fork is followed by non-use, so the actual
> > copy could have been done at fork time, semantically. It's a deferred read and
> > allocation.
> >
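
(To illustrate what I mean by a deferred copy - nothing here beyond standard fork()/mmap() semantics: parent and child share the physical page until the first write, and only the writer's copy changes.)

    #include <assert.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        p[0] = 'A';

        if (fork() == 0) {
            p[0] = 'B';      /* write fault: kernel copies the page */
            _exit(0);
        }
        wait(NULL);
        assert(p[0] == 'A'); /* parent still sees the original page */
        printf("parent sees: %c\n", p[0]);
        return 0;
    }
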
> >
> >
> > I hope this helps clarify my concerns.
> >
> > There are several reasonable outcomes -
> >
> > 1. Much better documentation of what the code actually does (and why).
>
> Agreed.
>
> > 2. Fix the "bug" that prevents REGISTER of "minor" handler on private,
> > anonymous mappings (obviously, you can REGISTER missing handlers as well), then
> > document actually what happens during the life cycle of swapping of pages in
> > detail, including MAP_PRIVATE|MAP_ANONYMOUS VMAs.
>
> Not a bug.
>
> > 3. Do a thorough analysis of what userfaultfd really should do, if the goal
> > is to provide the ability of a "manager process" to get to handle all cases of
> > page fault behavior on a case-by-case basis for regions of user addressable
> > pages.
>
> What userfaultfd "should do" is up to the problems we need it to solve.
>
> > I'd be happy to contribute to (but not manage) whichever outcome - and I have
> > what I think is a reasonable use case. (and I'm aware that this API accidentally
> > created a serious hacker exploit earlier in its life, by creating a way to hang
> > one process from another. I think that's no longer so easy.)
>
> I would be glad to hear what changes you think should be made to
> userfaultfd to better suit your needs.
>
> Sorry if this reply is somewhat incoherent; I've gone back and forth a
> few times on how to respond to your points in the most helpful way I
> can. I've tried to be as clear as possible without being too verbose.
>
> - James
>
> --
>
> Alrighty, here are the terms/definitions I use, as I mentioned above.
> Again, feel free to ignore them if they are unhelpful:
>
> A "file-backed VMA" will load pages into the page cache. For most
> filesystems, the page is loaded from a disk (or a proper device), but
> for special filesystems like tmpfs, hugetlbfs, and ramfs, the page
> cache is populated with zeroed pages initially.
>
> tmpfs is kind of like a filesystem API for shmem, but they are so
> interconnected that many people use the terms interchangeably. (To
> clarify, I don't think of "shmem" as shorthand for "shared memory"; to
> me, it is the name of an mm subsystem.) Every MAP_ANONYMOUS|MAP_SHARED
> VMA is a shmem VMA; it is as if there is a tmpfs file backing VMAs
> like these, so they are in some contexts considered "file-backed". See
> shmem_zero_setup(). As far as I'm concerned, vma->vm_file is set, so
> the VMA is file-backed (even though the mmap flags included
> MAP_ANONYMOUS). I assume this is what you are referring to when you
> say "file-backed by /dev/zero".
>
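
Right - and one can observe that backing directly. A sketch (the "/dev/zero (deleted)" name is what I see in /proc/pid/maps on my systems; it comes from shmem_zero_setup()):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* MAP_ANONYMOUS|MAP_SHARED: shmem-backed, vm_file is set. */
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        printf("mapped at %p\n", p);

        char cmd[64];
        snprintf(cmd, sizeof(cmd), "grep zero /proc/%d/maps", getpid());
        system(cmd);    /* prints: ... /dev/zero (deleted) */
        return 0;
    }
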
> For any MAP_PRIVATE VMA, some pages may be "anonymous", in that no
> page cache is holding a reference to it (i.e., generally speaking, the
> only references on the page are the ones taken by the PTEs mapping the
> page). Reclaim of pages like these will put them in a swap cache.
>
> For pages where a reference is held in a page cache, if the page is
> dirty, it can be written out to disk. shmem implements "writeout" by
> swapping just like anonymous pages, but other filesystems implement it
> how you would expect.
>
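
One last note, tying this back to my question about where a swapped-out page "is": the per-page state (present vs. swapped, and for swapped pages the swap type and offset) is visible in /proc/pid/pagemap. A sketch (bit layout as documented in Documentation/admin-guide/mm/pagemap.rst; the helper name is mine; reading the PFN requires privilege):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Bit 63 = present, bit 62 = swapped; for swapped pages, bits 0-4
     * are the swap type and bits 5-54 the offset into the swap area. */
    static void show_page_state(void *addr)
    {
        long psize = sysconf(_SC_PAGESIZE);
        int fd = open("/proc/self/pagemap", O_RDONLY);
        uint64_t e;
        off_t off = ((uintptr_t)addr / psize) * sizeof(e);

        pread(fd, &e, sizeof(e), off);
        if (e & (1ULL << 63))
            printf("%p: present, pfn %llu\n", addr,
                   (unsigned long long)(e & ((1ULL << 55) - 1)));
        else if (e & (1ULL << 62))
            printf("%p: swapped, type %llu, offset %llu\n", addr,
                   (unsigned long long)(e & 0x1f),
                   (unsigned long long)((e >> 5) & ((1ULL << 50) - 1)));
        else
            printf("%p: not present\n", addr);
        close(fd);
    }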