From: "David P. Reed" <dpreed@deepplum.com>
To: "Axel Rasmussen" <axelrasmussen@google.com>
Cc: "Peter Xu" <peterx@redhat.com>,
"James Houghton" <jthoughton@google.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
linux-mm@kvack.org
Subject: Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
Date: Fri, 17 Oct 2025 17:07:57 -0400 (EDT)
Message-ID: <1760735277.29994480@apps.rackspace.com>
In-Reply-To: <CAJHvVciLe9--JnNn1=q=ycrgQSTVt7Zq0e01KhaRmOeSLO86dQ@mail.gmail.com>
Hi Axel -
Thanks for the long reply. I've been focused elsewhere for a couple weeks, but I'm getting back to this.
Comments below:
On Wednesday, October 1, 2025 18:16, "Axel Rasmussen" <axelrasmussen@google.com> said:
> Thanks for linking the ExtMem paper David, that makes it a lot more
> clear to me what your expectations are.
Note that personally, I'm not trying to do what the ExtMem people tried to do - implement full memory management in userspace. I'm simply trying to monitor all the paging activity in a set of processes. So in many ways my goal is less ambitious than theirs. And userfaultfd almost does everything I want, with the exception of the case of anonymous+private paging.
>
> I think basically, userfaultfd has evolved incrementally, and it only
> has a handful of features needed to address pretty specific use cases;
> it doesn't have the full flexibility / generality you would need to do
> "full memory management in userspace". Not to say I think it shouldn't
> be able to do that from a philosophical point of view, I just mean to
> say it would take quite a lot of work to get there.
>
> Performance is also a big concern. Userfaultfd performance is not
> great, in fact scalability issues are one of the reasons we have been
> pursuing guest_memfd based approaches to VM demand paging, instead of
> userfaultfd.
I don't really want to do VM demand paging. I've done that before, at a previous startup I co-founded, called TidalScale (well, you could call it demand paging, but in fact it was more complex: distributing virtual memory, virtual processors, and virtual I/O devices across a set of big servers and migrating them among those servers). HPE now owns TidalScale, though, and it would be silly for me to focus on that stuff. We did some amazing things with it, but I also learned that there's a limit to what virtualization plus migration can do, performance-wise.
>
> I don't disagree that in principle it makes sense for anon private
> swap faults to generate userfaultfd minor fault events, it's just
> until now nobody had ever wanted to do that, so it hasn't been
> implemented yet. :) For what it's worth, I don't think this would get
> you where you want to go by itself though, because the only action you
> could take in response to such an event today is UFFDIO_CONTINUE,
> which would simply swap in + map the page, you would have no
> opportunity to e.g. populate the page contents from elsewhere, you'd
> be delegating all of that to the existing in-kernel swap
> implementation. So it doesn't really get you all the way to "full
> userspace memory management".
Yet, that's exactly the additional capability I want - just to get the event and continue, after doing some stuff with the information at the time of the event.
So if I could have just that, it would be great. I thought that it was there already, since the restriction isn't mentioned in the documentation.
The alternative for me is to write a lot of "out-of-tree" kernel code that hooks (using k[ret]probes?) into all the paging mechanisms in the kernel, and then maintain it across releases. I don't really want to do that. And to create a hypervisor extension just to do this from deep below the applications seems silly.
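To make that concrete, here is roughly what just one of those hooks would look like - a sketch only, untested, assuming handle_mm_fault is an allowed kprobe point on the running kernel, and it's only one of several places I'd have to hook:
/*
 * Hypothetical out-of-tree module sketch: a single kprobe on
 * handle_mm_fault.  Keep the handler minimal; a real monitor would
 * need several more hook points (swap-out paths, etc.).
 */
#include <linux/module.h>
#include <linux/kprobes.h>

static int fault_pre(struct kprobe *p, struct pt_regs *regs)
{
	/* Record or log the fault here. */
	return 0;
}

static struct kprobe fault_kp = {
	.symbol_name = "handle_mm_fault",
	.pre_handler = fault_pre,
};

static int __init faultmon_init(void)
{
	return register_kprobe(&fault_kp);
}

static void __exit faultmon_exit(void)
{
	unregister_kprobe(&fault_kp);
}

module_init(faultmon_init);
module_exit(faultmon_exit);
MODULE_LICENSE("GPL");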
I realize that there is a performance drag to using userfaultfd, but for my purposes that is pretty irrelevant.
And I'm kind of surprised that this case doesn't "just work", since supposedly one can register for minor page faults on other non-file-backed memory (shmem, for instance), just not MAP_PRIVATE anonymous ranges, which get rejected at the UFFDIO_REGISTER ioctl.
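Here's a minimal sketch of what I'm asking for (untested, error handling mostly omitted, and assuming a 4 KiB page size): register minor mode on an anonymous MAP_PRIVATE range, log each minor fault, and immediately UFFDIO_CONTINUE it. Today the UFFDIO_REGISTER step is exactly where this gets rejected.
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define AREA_LEN (64UL * 4096)

int main(void)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = { .api = UFFD_API };

	ioctl(uffd, UFFDIO_API, &api);

	char *area = mmap(NULL, AREA_LEN, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = AREA_LEN },
		.mode  = UFFDIO_REGISTER_MODE_MINOR,
	};
	if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1) {
		/* Today this is where anon+private gets rejected (EINVAL). */
		fprintf(stderr, "UFFDIO_REGISTER: %s\n", strerror(errno));
		return 1;
	}

	/* If registration were allowed, the monitor loop is just this: */
	for (;;) {
		struct uffd_msg msg;

		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event != UFFD_EVENT_PAGEFAULT ||
		    !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR))
			continue;

		/* "Do some stuff" with the event, e.g. log the address... */

		/* ...then let the kernel's own swapped-in page be mapped. */
		struct uffdio_continue cont = {
			.range = {
				.start = msg.arg.pagefault.address & ~4095UL,
				.len   = 4096,
			},
		};
		ioctl(uffd, UFFDIO_CONTINUE, &cont);
	}
}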
Regards,
David
>
>
> On Mon, Sep 29, 2025 at 1:30 PM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Mon, Sep 29, 2025 at 03:44:52PM -0400, David P. Reed wrote:
>> > I thought it was a general purpose interface. My mistake. But I think it
>> > can be more general, at least encompassing my goal of having a userspace
>> > "interface" that monitors processes' page faults.
>>
>> To James: thanks for the great writeup. Somehow, I just feel like userfaultfd
>> (as a linux submodule) got some sheer luck to have you around. :)
>>
>> To David: just to say, I still think it's a general purpose interface, at
>> least that's the hope..
>>
>> I agree with you on at least one point you mentioned: shmem can also
>> swap, and that is accounted as a minor fault when swap-in happens at a
>> specific virtual address. It doesn't seem fair that anon isn't doing
>> the same. Indeed.
>>
>> It was just not on the radar when minor faults were introduced by
>> Axel; they were added for the sole purpose of live migration at that
>> time, but the hope is that the interface, as designed, should serve a
>> generic purpose.
>>
>> Now the problem is, userfaultfd wasn't originally intended for
>> monitoring system activity. As its name implies, it provides userspace
>> a way to resolve a fault, but only if a fault happens first.
>>
>> Meanwhile, system activity definitely at least involves swap-outs,
>> which unfortunately don't involve page faults; they happen in the other
>> direction, when the system quietly decides to move things out, and that
>> is something userfaultfd has no control over.
>>
>> So it sounds like it won't satisfy your needs even if we could add
>> minor fault support for anon private memory in the swap cache. However,
>> if userfaultfd is used to do everything (including swap in/out), then
>> by nature it is all trappable and accountable, on both swap-in and
>> swap-out, to and from any medium. Swap-out would then be driven by
>> userspace too, so everything would be under solid control, including
>> monitoring of the activity.
>>
>> Thanks,
>>
>> --
>> Peter Xu
>>
>