linux-mm.kvack.org archive mirror
From: Peter Xu <peterx@redhat.com>
To: "David P. Reed" <dpreed@deepplum.com>
Cc: James Houghton <jthoughton@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, Axel Rasmussen <axelrasmussen@google.com>,
	Mike Rapoport <rppt@kernel.org>,
	Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: PROBLEM: userfaultfd REGISTER minor mode on MAP_PRIVATE range fails
Date: Fri, 26 Sep 2025 18:16:43 -0400	[thread overview]
Message-ID: <aNcQyzvre07V4UUC@x1.local> (raw)
In-Reply-To: <1758042583.108320755@apps.rackspace.com>

Hello, David,

On Tue, Sep 16, 2025 at 01:09:43PM -0400, David P. Reed wrote:
> Than -
> 
> Thanks for your interest. Some clarifications on my current use case are interposed below. 

Sorry for the late response.

I think I get a much better picture now with the answers below, thanks.
Looping in Mike and Andrea.

> 
> On Tuesday, September 16, 2025 12:13, "Peter Xu" <peterx@redhat.com> said:
> 
> > On Tue, Sep 16, 2025 at 11:52:18AM -0400, David P. Reed wrote:
> >> synchronous would be better. But what I want to do is at least get
> >> notifications of swapin events (including the case when the page is in
> >> swap cache). Also, using UFFDIO_COPY for the swapin case might make
> >> sense (but rarely, because there's no way to access the data that was
> >> swapped out).
> > 
> > Some more info on the use case might be helpful.  I can start with some
> > more questions if that helps.
> > 
> > - If it's about page hotness / coldness, have you tried existing facilities
> >   (page idle, DAMON, etc.)?  If so, why won't they work?
> 
> Yes. Those functions are just summarizers, giving counts and
> averages. They provide zero detail about specific pages to the
> application running in the process.
> 
> I can clarify my use case focus by pointing out what inspired me to use
> userfaultfd for application specific memory management (which is, after
> all, what userfaultfd was promoted for on Linux Weekly News a while back
> when it first came out). This 2024 paper is along the same lines as what
> I'm researching, and was published in 2024 Usenix proceedings.  ExtMem:
> Enabling Application Aware Virtual Memory Management for Data Intensive
> Applications
> https://www.usenix.org/conference/atc24/presentation/jalalian
>
> See figure 3 of the paper for their performance problem with userfaultfd
> vs. their kernel modifications (upcall).

Correct, userfaultfd does have such IPC overhead.  SIGBUS can sometimes be
better, but AFAIU it has limitations.  E.g., I am not sure whether signals
will always work when the fault is triggered in either:

  (1) a kernel context using copy_from_user() / copy_to_user() or GUP

  (2) when a fault is scheduled somehow onto, for example, a kworker

AFAIU, (1) can happen very easily if one does a syscall like read() or
write() where the buffer is userfaultfd-protected.

Meanwhile, (2) normally can't happen, but it can in at least the KVM use
case that we heavily rely on, where KVM has a feature to offload a vCPU
page fault to a kworker (we call it KVM async page fault).

IIUC, such a limitation will also apply to the upcall solution they provided.
From what I read in the paper, that is really a mimicked version of signal
handling, but made per-thread, under the same task context.

>
> They had tried using userfaultfd for their work, and found it was "too
> slow" compared to what they call the "upcall" technique they achieved by
> modifying the kernel page fault handling path. (see paper for details -
> and Jalalian's thesis dives into it more deeply).  I could code up an
> equivalent to their "upcall" - but that would be completely non-standard
> (and fraught with security issues, as well as unable to use a separate
> management process).  For me, the performance concern
> is less problematic - I'm doing application analysis and
> experimentation. And I don't want to have to maintain a kernel patch set.
>
> Note that I expect to use madvise() and process_madvise() to manipulate
> page coldness and swapping as well.

Yes, looks like the right tools.

> 
> > 
> > - Assuming it's async reports that can be collected, what do you plan to do
> >   with the info?  Do you care about swap outs prior to swap ins?
>
> Detailed application paging measurements, modeling, and so forth.  I'm
> not asking for a big enhancement to userfaultfd - just expecting it
> should (as is) basically work, if UFFDIO_REGISTER actually allowed
> registering the minor page fault mode.
>
> > 
> > - How sync events would be better in this case?
> 
> Simpler to coordinate the interaction with the faulting process by far.

I think I get the gist of how you would like to use it.  However, my
question is: if you want to do fine tuning of "which layer of memory should
hold what data", IOW trying to replace the Linux mm swap system with a
likely better one, more suitable for your workload, then why do you have
swap in/out at all?  Why do you care about that?

I was expecting your PoC (or ExtMem) to completely bypass Linux swap; then
you can freely move memory pages between system RAM, NVMe, RDMA, etc., like
what the ExtMem paper described.

So far I don't see it as a blocker.  If one would like to say that the swap
cache is also one kind of page cache, then would it make sense to add MINOR
fault trapping to anonymous memory, but only trap on swapin?  It sounded
kind of OK when I initially read about it, and doesn't sound too hard to
implement either.  It's just that I still want to double check your use
case first, because it really sounds like you should have turned swap off.

Thanks,

-- 
Peter Xu



Thread overview: 26+ messages
2025-09-15 20:13 David P. Reed
2025-09-15 20:24 ` James Houghton
2025-09-15 22:58   ` David P. Reed
2025-09-16  0:31     ` James Houghton
2025-09-16 14:48       ` Peter Xu
2025-09-16 15:52         ` David P. Reed
2025-09-16 16:13           ` Peter Xu
2025-09-16 17:09             ` David P. Reed
2025-09-26 22:16               ` Peter Xu [this message]
2025-09-16 17:27             ` David P. Reed
2025-09-16 18:35               ` Axel Rasmussen
2025-09-16 19:10                 ` James Houghton
2025-09-16 19:47                   ` David P. Reed
2025-09-16 22:04                   ` Axel Rasmussen
2025-09-26 22:00                     ` Peter Xu
2025-09-16 19:52                 ` David P. Reed
2025-09-17 16:13                   ` Axel Rasmussen
2025-09-19 18:29                     ` David P. Reed
2025-09-25 19:20                       ` Axel Rasmussen
2025-09-27 18:45                         ` David P. Reed
2025-09-29  5:30                           ` James Houghton
2025-09-29 19:44                             ` David P. Reed
2025-09-29 20:30                               ` Peter Xu
2025-10-01 22:16                                 ` Axel Rasmussen
2025-10-17 21:07                                   ` David P. Reed
2025-09-16 15:37       ` David P. Reed
