Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails


From: "David P. Reed" <dpreed@deepplum.com>
To: "Axel Rasmussen" <axelrasmussen@google.com>
Cc: "Peter Xu" <peterx@redhat.com>,
	"James Houghton" <jthoughton@google.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	linux-mm@kvack.org
Subject: Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
Date: Sat, 27 Sep 2025 14:45:20 -0400 (EDT)
Message-ID: <1758998720.44976697@apps.rackspace.com>
In-Reply-To: <CAJHvVcj_gd=48k-dgbLeEoqn_f+QD-ifscu_DPvpAmPd1Kg=GA@mail.gmail.com>


OK - responses below.
I'm still unclear what my role is vs. that of the others cc'ed on this problem report.

Is anyone here (other than Andrew) a decision maker on what userfaultfd is supposed to do? I can see what the current code DOES - and honestly, it's seriously whacked semantically (see the ExtMem paper for a reasonable use case that it cannot serve; my use case is quite similar). So is anyone here wanting to improve the functionality? I'm sure its current functions are used by some folks here - the presence of Google employees, presumably focused on ChromeOS or Android, suggests that there's a use case there.

My role started out as reporting that the documentation, in both the man pages and the "kernel documentation", is incomplete and confusing. And the rationale presented in the documentation doesn't make sense. Some of you guys admit that you really don't understand how "swap" is different from "file-backed paging" (except for the corner cases of hugetlbfs [sort of "file backed"], "file-backed by /dev/zero" [which ends up using "swap"], and tmpfs [also "file backed" but using "swap"]). And yet "anonymous, private" uses "swap" and the "swap cache", not the "page cache".

Now, after digging into the question, I feel like there was never, ever a coherent architectural design for userfaultfd as a function. It's apparently just a "hack", not a "feature".

I'd be happy to propose a much more coherent design (in my opinion as an operating systems designer for more than 20 years, starting with Multics in 1970). You guys may not be interested in my input, which is fair. Is Linus interested? That would be a bunch of work for me, because I would do a thorough job, not just a bunch of random patches. But I'm not proposing to join the maintainer-club - I'm retired from that space, and I find the Linux kernel contributors poorly organized and chaotic.

Or, I can just drop this interaction - concluding that userfaultfd is kind of useless as is, and really badly documented to boot.

On Thursday, September 25, 2025 15:20, "Axel Rasmussen" <axelrasmussen@google.com> said:

> On Fri, Sep 19, 2025 at 11:29 AM David P. Reed <dpreed@deepplum.com> wrote:
> >
> > On Wednesday, September 17, 2025 12:13, "Axel Rasmussen" <axelrasmussen@google.com> said:
> >
> > > On Tue, Sep 16, 2025 at 12:52 PM David P. Reed <dpreed@deepplum.com> wrote:
> > >>
> > >> On Tuesday, September 16, 2025 14:35, "Axel Rasmussen" <axelrasmussen@google.com> said:
> > >>
> > >> > On Tue, Sep 16, 2025 at 10:27 AM David P. Reed <dpreed@deepplum.com> wrote:
> > >> >
> > >> >> Thanks -
> > >> >>
> > >> >> Just to clarify -
> > >> >> Looking at the man page for UFFDIO_API, there are two "feature bits" that
> > >> >> indicate cases where "minor" handling is now supported, and can be enabled:
> > >> >> UFFD_FEATURE_MINOR_HUGETLBFS and UFFD_FEATURE_MINOR_SHMEM.
> > >> >> In my reading of the documents, these seem to imply that before they were
> > >> >> added as new features, MAP_PRIVATE|MAP_ANONYMOUS mappings were
> > >> >> supported, and that the "new" additions to the MINOR mode were just for
> > >> >> HUGETLBFS and MAP_SHARED cases.
> > >> >>
> > >> >
> > >> > Actually minor fault support didn't exist at all before those two features
> > >> > were added. :)
> > >>
> > >> Thanks for commenting. I'm not sure that's exactly true. Why is SHMEM
> > >> (MAP_SHARED) supported, but not ordinary pages? I wasn't party to the evolution
> > >> here, but so far no one has explained why there's a special difference between
> > >> SHMEM and ordinary VMAs.
> > >
> > > I promise it's true, I wrote the UFFD minor fault handling feature. :)
> > OK, but I am still confused as to why SHMEM VMAs are supported and non-SHMEM are
> > not, in the case of an anonymous mapped range.
> >
> > >
> > > As for why... Like I said above, UFFD calls it a "minor" fault if the
> > > PTE doesn't exist, but the page already exists in the page cache. If
> > > the PTE does exist, you won't get either a minor *or* a missing fault.
> > > If the page does not already exist in the page cache, you'll get a
> > > missing fault, not a minor fault.
> > I'm assuming that you understand there is a profound difference between the
> > "page cache" and the "swap cache" in Linux. I am referring to what happens when a
> > page is in the swap cache (which is primarily about anonymous pages, but a weird
> > corner case is that "tmpfs" is backed by the swap cache and the swap system, not
> > by the page cache).
> >
> > The "historical reasons" for the swap cache not being the page cache are weirdly
> > difficult to decode - I've spent a chunk of months trying to do historical
> > research on how this came about, but more importantly, why. No luck on the why.
> > (If I were to guess, the main reason seems to be that the folks who
> > built it wanted to avoid using "inodes", which are required by the whole page
> > cache mechanism, perhaps because they thought inodes were "expensive".)
> >
> > Anyway, I'm now understanding that UFFD has chosen a variant meaning of "minor
> > page fault" that seems tied to pages that are file backed or SHMEM.
> >
> > A "swapped" page is anonymous by definition of what "swap" means in Linux. In
> > Unix and other systems, swapping was a generic term that included file-backed
> > paging as well as non-file-backed pages.
> >
> > Anyway, I'm quite puzzled why I can't seem to monitor
> > MAP_PRIVATE|MAP_ANONYMOUS page faults with userfaultfd. The reason I focus on CoW
> > is that CoW and fork() behavior is basically the only user visible difference
> > between MAP_PRIVATE and MAP_SHARED. And if you read random examples of how to use
> > mmap(), quite often MAP_PRIVATE is suggested as if it were the "normal" usage
> > (despite what happens on fork()).
> 
> You can monitor MAP_PRIVATE|MAP_ANONYMOUS faults with userfaultfd,
> it's just that they're missing faults, not minor in userfaultfd
> terminology, because resolving them requires a new page to be
> allocated (UFFDIO_COPY, not UFFDIO_CONTINUE).
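
(For readers following the terminology: the classification Axel describes above can be written down as a toy model. This is only a sketch of the rules as stated in this thread, not kernel code; the enum, the function names, and the two boolean inputs are made up for illustration.)

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical names for illustration only; nothing here is a kernel API. */
enum fault_kind { FAULT_NONE, FAULT_MINOR, FAULT_MISSING };

/*
 * Model of the rules as stated in the thread:
 *  - PTE present                      -> no userfaultfd event at all
 *  - PTE absent, page in page cache   -> "minor" fault
 *  - PTE absent, page not in cache    -> "missing" fault
 */
static enum fault_kind classify_fault(bool pte_present, bool page_in_page_cache)
{
    if (pte_present)
        return FAULT_NONE;
    return page_in_page_cache ? FAULT_MINOR : FAULT_MISSING;
}

/* Name of the ioctl a handler would typically use to resolve each kind. */
static const char *resolving_ioctl(enum fault_kind kind)
{
    switch (kind) {
    case FAULT_MINOR:
        return "UFFDIO_CONTINUE"; /* install a PTE for the existing page */
    case FAULT_MISSING:
        return "UFFDIO_COPY";     /* or UFFDIO_ZEROPAGE: supply new contents */
    default:
        return "none";
    }
}

/* Print one row of the decision table. */
static void print_case(bool pte_present, bool page_in_page_cache)
{
    enum fault_kind kind = classify_fault(pte_present, page_in_page_cache);
    printf("pte=%d cache=%d -> %s\n", pte_present, page_in_page_cache,
           resolving_ioctl(kind));
}
```

Note that the model has no input for "page is in the swap cache", which is exactly the case disputed in the reply that follows.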

There is no sensible way to use UFFDIO_COPY or UFFDIO_ZEROPAGE to respond to a "missing" event when "missing" means the page is swapped out (to swap). That's just weird, and you continue to insist on it. Where is the page that was swapped out? Well, one could look up the PTE via /proc/pid/pagemap and find that it holds a "swap entry", an index into a block device. So maybe you can open the swap device through some file descriptor, mmap() it into the manager process, and then UFFDIO_COPY. But what if the swapped page is actually in the "swap cache"? You can't mmap any swap cache page via any userspace API - do you know a way to do that?
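
(To make the "swap entry" lookup concrete: the kernel exposes one 64-bit word per virtual page in /proc/<pid>/pagemap, and for a swapped-out page that word carries the swap type and offset. The sketch below decodes such a word using the bit layout from the kernel's pagemap documentation. The struct and function names are invented for illustration, and the code only shows where the entry can be seen, not any way to get at the page contents.)

```c
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative container; not a kernel or libc type. */
struct pagemap_info {
    bool     present;     /* bit 63: page is in RAM */
    bool     swapped;     /* bit 62: page is swapped out */
    unsigned swap_type;   /* bits 0-4 when swapped: which swap area */
    uint64_t swap_offset; /* bits 5-54 when swapped: slot in that area */
};

/* Decode one 64-bit pagemap word. */
static struct pagemap_info decode_pagemap_entry(uint64_t entry)
{
    struct pagemap_info info = {0};
    info.present = (entry >> 63) & 1;
    info.swapped = (entry >> 62) & 1;
    if (info.swapped) {
        info.swap_type   = (unsigned)(entry & 0x1f);
        info.swap_offset = (entry >> 5) & ((UINT64_C(1) << 50) - 1);
    }
    return info;
}

/* Read the pagemap word for an address in the calling process.
 * Returns 0 on success, -1 on failure. */
static int read_pagemap_entry(const void *addr, uint64_t *out)
{
    long page = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;
    /* One 8-byte word per virtual page, indexed by virtual page number. */
    off_t off = (off_t)((uintptr_t)addr / (uintptr_t)page) *
                (off_t)sizeof(uint64_t);
    ssize_t n = pread(fd, out, sizeof(*out), off);
    close(fd);
    return n == (ssize_t)sizeof(*out) ? 0 : -1;
}
```

A manager process would call read_pagemap_entry() on the faulting address and pass the result to decode_pagemap_entry(). Knowing the swap type and offset still leaves the question raised above open: nothing here maps the swap cache page itself into userspace.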

Now I reported a bug in UFFDIO_REGISTER, which you keep saying is the same as UFFDIO_CONTINUE. Well, it isn't! I can register a minor handler (which allows continue) if I use MAP_ANONYMOUS|MAP_SHARED. Exactly the same "swap cache" mechanics apply. The only "sharing" is potential future sharing after that process forks, in which case the same "swap page" is shared until a Copy on Write forces the page to be unshared - it is a writeable page, just sharing the same physical block. It can be swapped out to the swap cache and the swap device, which sets the PTE to a "swap entry" that causes a page fault.
The swap device doesn't know where the pages are mapped. You need to look at the PTEs of all the processes to find the translation to a swap cache entry, and if you want to go backward from a swap entry to pages, you need to use a special XArray that finds VMAs given a swap entry.

But the point I keep making is that UFFDIO_REGISTER rejects only MAP_ANONYMOUS mappings that are MAP_PRIVATE and not backed by huge pages. To me that's weird.
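
(A minimal reproducer sketch for the behavior being reported, assuming kernel headers new enough to define UFFDIO_REGISTER_MODE_MINOR and UFFD_FEATURE_MINOR_SHMEM. The helper name try_register_minor() is made up here. It does not handle any faults; it only attempts the registration on a one-page anonymous mapping and reports what the kernel said.)

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Try to register a one-page anonymous mapping for minor faults.
 * map_flags is MAP_PRIVATE or MAP_SHARED (MAP_ANONYMOUS is added here).
 * Returns 0 if UFFDIO_REGISTER succeeded, the positive errno value if
 * it failed, or -1 if userfaultfd could not be set up at all (old
 * headers, old kernel, sandboxing, or missing privileges).
 */
static int try_register_minor(int map_flags)
{
#if defined(UFFDIO_REGISTER_MODE_MINOR) && defined(UFFD_FEATURE_MINOR_SHMEM)
    long page = sysconf(_SC_PAGESIZE);
    int open_flags = O_CLOEXEC | O_NONBLOCK;
#ifdef UFFD_USER_MODE_ONLY
    /* Lets unprivileged callers create the fd on kernels that support it. */
    open_flags |= UFFD_USER_MODE_ONLY;
#endif
    int uffd = (int)syscall(SYS_userfaultfd, open_flags);
    if (uffd < 0)
        return -1;

    /* Handshake; fails if the kernel lacks shmem minor-fault support. */
    struct uffdio_api api = { .api = UFFD_API,
                              .features = UFFD_FEATURE_MINOR_SHMEM };
    if (ioctl(uffd, UFFDIO_API, &api) < 0) {
        close(uffd);
        return -1;
    }

    void *addr = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                      map_flags | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED) {
        close(uffd);
        return -1;
    }

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)addr,
                   .len = (unsigned long)page },
        .mode = UFFDIO_REGISTER_MODE_MINOR,
    };
    int rc = ioctl(uffd, UFFDIO_REGISTER, &reg) < 0 ? errno : 0;

    munmap(addr, (size_t)page);
    close(uffd);
    return rc;
#else
    (void)map_flags;
    return -1;
#endif
}
```

Calling try_register_minor(MAP_SHARED) and try_register_minor(MAP_PRIVATE) side by side shows the asymmetry. On the kernels discussed in this thread the private case is expected to fail (EINVAL being the usual errno) while the shared case can succeed, but the sketch only reports what it observes.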

If it is the CoW case that doesn't work (I doubt it), well, you have to read the swapped out page into memory before copying it anyway. Then you copy on write, from the page read or found in the swap cache.

Now, as you say, that may require allocating a new page, also in the swap cache. Is that a "missing" page in the weird userfaultfd terminology? If so, it can't be handled with UFFDIO_COPY, because you can't access the contents from userspace. And it's not "write protected" from the perspective of WP.

> The only exception I can
> think of is swap faults, I could see anon swap faults (perhaps
> specifically when the page is in the swap cache?) being considered
> UFFD minor faults, but I would be curious to know what the use case is
> for that / why you would want to do that. The original use case for
> UFFD minor fault support was demand paging for VMs, where you have
> some kind of shared memory (shmem or hugetlb) where one side of the
> mapping is given to the VM, and the other side of the shared mapping
> is used by the hypervisor to populate guest memory on-demand in
> response to userfaultfd events.

I think I've just answered this. userfaultfd doesn't support the "swap out" part of anonymous swapping at all. So, how could a manager get the page contents as of the instant it is put in the swap cache for writing out to the swap device? There's no "swap out" event mechanism, and no way to treat the swap device cached into the swap cache as a page source. (not to mention the zswap mechanism, which compresses some of the pages into an invisible piece of memory).

> 
> To me it's not intended that userfaultfd minor events are generated for
> writeprotect faults, to me that's the domain of userfaultfd-wp, not
> minor faults. James might be right that these unintentionally trigger
> minor faults today, I would need to do some more reading of the code
> to be certain though.

I don't particularly care about writeprotect faults, but CoW probably shouldn't be considered the same as a writeprotect fault, because CoW is triggered by a write into a writeable area, ONLY in one of the mappings, whichever is written first. The process doesn't think of it as a "write" - it just is a kernel optimization of a common case where fork is followed by non-use, so the actual copy could have been done at fork time, semantically. It's a deferred read and allocation.
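
(The user-visible side of this can be shown with a small sketch: after fork(), a write by the child into a MAP_PRIVATE|MAP_ANONYMOUS page is not seen by the parent, because the write lands in the child's own copy, while the same write into a MAP_SHARED|MAP_ANONYMOUS page is seen. The helper name parent_sees_child_write() is invented for illustration.)

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Map one anonymous page with the given sharing flag (MAP_PRIVATE or
 * MAP_SHARED), fork, let the child overwrite the first int, and report
 * whether the parent observes that write.
 * Returns 1 if the parent sees the child's value, 0 if it still sees
 * its own, and -1 on setup failure.
 */
static int parent_sees_child_write(int map_flags)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    volatile int *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           map_flags | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    *p = 1;                 /* parent's value, written before fork */

    pid_t pid = fork();
    if (pid < 0) {
        munmap((void *)p, len);
        return -1;
    }
    if (pid == 0) {
        *p = 2;             /* private mapping: this write triggers CoW */
        _exit(0);
    }

    int status = 0;
    waitpid(pid, &status, 0);

    int seen = (*p == 2);   /* did the child's write reach the parent? */
    munmap((void *)p, len);
    return seen;
}
```

parent_sees_child_write(MAP_PRIVATE) returns 0 and parent_sees_child_write(MAP_SHARED) returns 1. From the process's point of view the private case behaves as if the copy had been made at fork time, which is the semantic point made above.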

I hope this helps clarify my concerns.

There are several reasonable outcomes -

1. Much better documentation of what the code actually does (and why).
2. Fix the "bug" that prevents REGISTER of a "minor" handler on private, anonymous mappings (obviously, you can REGISTER missing handlers as well), then document in detail what actually happens during the life cycle of swapping of pages, including MAP_PRIVATE|MAP_ANONYMOUS VMAs.
3. Do a thorough analysis of what userfaultfd really should do, if the goal is to provide the ability of a "manager process" to get to handle all cases of page fault behavior on a case-by-case basis for regions of user addressable pages.

I'd be happy to contribute to (but not manage) whichever outcome - and I have what I think is a reasonable use case. (and I'm aware that this API accidentally created a serious hacker exploit earlier in its life, by creating a way to hang one process from another. I think that's no longer so easy.)

