* Suppress pte soft-dirty bit with UFFDIO_COPY?
From: Kyle Huey @ 2025-05-05 16:37 UTC
To: Andrew Morton, Peter Xu; +Cc: open list, linux-mm, criu, Robert O'Callahan
tl;dr I'd like to add UFFDIO_COPY_MODE_DONTSOFTDIRTY that does not add
the _PAGE_SOFT_DIRTY bit to the relevant pte flags. Any
thoughts/objections?
The kernel has a "soft-dirty" bit on ptes which tracks if they've been
written to since the last time /proc/pid/clear_refs was used to clear
the soft-dirty bit. CRIU uses this to track which pages have been
modified since a previous checkpoint and reduce the size of the
checkpoints taken. I would like to use this in my debugger[0] to track
which pages a program function dirties when that function is invoked
from the debugger.
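As a rough illustration of the mechanism described above, the soft-dirty state of a single page can be read from /proc/pid/pagemap, where bit 55 of each 64-bit entry is documented as the soft-dirty flag. The helper names below (is_soft_dirty, read_pagemap_entry) are hypothetical and only sketch the idea; they are not taken from any existing tool.

```python
import os
import struct

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
PM_SOFT_DIRTY_BIT = 55  # pagemap bit 55: pte is soft-dirty
PM_ENTRY_BYTES = 8      # one 64-bit entry per virtual page


def is_soft_dirty(entry: int) -> bool:
    """Return True if a raw 64-bit pagemap entry has the soft-dirty bit set."""
    return bool((entry >> PM_SOFT_DIRTY_BIT) & 1)


def read_pagemap_entry(pid: int, vaddr: int) -> int:
    """Read the raw pagemap entry covering virtual address vaddr in process pid."""
    # Unbuffered, so exactly one 8-byte entry is requested from the kernel.
    with open(f"/proc/{pid}/pagemap", "rb", buffering=0) as f:
        f.seek((vaddr // PAGE_SIZE) * PM_ENTRY_BYTES)
        data = f.read(PM_ENTRY_BYTES)
    if len(data) != PM_ENTRY_BYTES:
        raise OSError(f"no pagemap entry for address {vaddr:#x}")
    # Entries are in the machine's native byte order.
    return struct.unpack("=Q", data)[0]
```

A caller would combine the two, e.g. is_soft_dirty(read_pagemap_entry(pid, addr)), once per page of interest.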
However, the runtime environment for this function is rather unusual.
In my debugger, the process being debugged doesn't actually exist
while it's being debugged. Instead, we have a database of all program
state (including registers and memory values) from when the process
was executed. It's in some sense a giant core dump that spans multiple
points in time. To execute a program function from the debugger we
rematerialize the program state at the desired point in time from our
database.
For performance reasons, we fill in the memory lazily[1] via
userfaultfd. This makes it difficult to use the soft-dirty bit to
track the writes the function triggers, because UFFDIO_COPY (and
friends) mark every page they touch as soft-dirty. Because we have the
canonical source of truth for the pages we materialize via UFFDIO_COPY
we're only interested in what happens after the userfaultfd operation.
Clearing the soft-dirty bit is complicated by two things:
1. There's no way to clear the soft-dirty bit on a single pte, so
instead we have to clear the soft-dirty bits for the entire process.
That requires us to process all the soft-dirty bits on every other pte
immediately to avoid data loss.
2. We need to clear the soft-dirty bits after the userfaultfd
operation, but in order to avoid racing with the task that triggered
the page fault we have to do a non-waking copy, then clear the bits,
and then separately wake up the task.
To work around all of this, we currently have a 4 step process:
1. Read /proc/pid/pagemap and note all ptes that are soft-dirty.
2. Do the UFFDIO_COPY with UFFDIO_COPY_MODE_DONTWAKE.
3. Write to /proc/pid/clear_refs to clear soft-dirty bits across the process.
4. Do a UFFDIO_WAKE.
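The four steps above can be sketched as follows. This is only an outline of the ordering, under the assumption that the caller supplies the actual userfaultfd operations: snapshot_soft_dirty, copy_dontwake and wake are hypothetical callables, not real library functions. The one concrete piece is the clear_refs write, where the value "4" is the documented way to clear soft-dirty bits for the whole process.

```python
from typing import Callable, Set


def clear_all_soft_dirty(pid: int) -> None:
    """Step 3: clear soft-dirty on every pte of the process (no per-pte option)."""
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("4")


def materialize_page(
    dirty_so_far: Set[int],
    snapshot_soft_dirty: Callable[[], Set[int]],  # hypothetical: scans pagemap
    copy_dontwake: Callable[[], None],            # hypothetical: UFFDIO_COPY + DONTWAKE
    clear_soft_dirty: Callable[[], None],         # e.g. clear_all_soft_dirty(pid)
    wake: Callable[[], None],                     # hypothetical: UFFDIO_WAKE
) -> None:
    # 1. Remember every page that is already soft-dirty, since step 3 wipes them.
    dirty_so_far.update(snapshot_soft_dirty())
    # 2. Install the page without waking the faulting task, so it cannot write yet.
    copy_dontwake()
    # 3. Drop the soft-dirty bit that the copy itself just set (and all others).
    clear_soft_dirty()
    # 4. Only now let the faulting task run again.
    wake()
```

Keeping the accumulated set outside the function reflects that the soft-dirty information has to be carried in userspace across every whole-process clear.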
The overhead of all of this (particularly step 1) is a millisecond or
two *per page* that we lazily materialize, and while that's not
crippling for our purposes, it is rather undesirable. What I would
like to have instead is a UFFDIO_COPY mode that leaves the soft-dirty
bit unchanged, i.e. a UFFDIO_COPY_MODE_DONTSOFTDIRTY. Since we clear
all the soft-dirty bits once after setting up all the mmaps in the
process the relevant ptes would then "just do the right thing" from
our perspective.
But I do want to get some feedback on this before I spend time writing
any code. Is there a reason not to do this? Or an alternate way to
achieve the same goal?
If this is generally sensible, then a couple questions:
1. Do I need a UFFD_FEATURE flag for this, or is it enough for a
program to be able to detect the existence of a
UFFDIO_COPY_MODE_DONTSOFTDIRTY by whether the ioctl accepts the flag
or returns EINVAL? I would tend to think the latter.
2. Should I add this mode for the other UFFDIO variants (ZEROPAGE,
MOVE, etc) at the same time even if I don't have any use for them?
- Kyle
[0] https://pernos.co/
[1] Conceptually this is similar to CRIU's `restore --lazy-pages`. We
set up all the mappings at the beginning but we don't back them.
Instead we UFFDIO_REGISTER them all and when they're touched for the
first time we go get the pages from our database and then UFFDIO_COPY
them into the address space.
* Re: Suppress pte soft-dirty bit with UFFDIO_COPY?
From: Peter Xu @ 2025-05-05 20:05 UTC
To: Kyle Huey; +Cc: Andrew Morton, open list, linux-mm, criu, Robert O'Callahan
Hi, Kyle,
On Mon, May 05, 2025 at 09:37:01AM -0700, Kyle Huey wrote:
> But I do want to get some feedback on this before I spend time writing
> any code. Is there a reason not to do this? Or an alternate way to
> achieve the same goal?
Have you looked at the write-protect mode, and UFFDIO_COPY_MODE_WP for _COPY?
If synchronous faults are a performance concern for frequent writes, note that
recent Linux also supports async tracking (UFFD_FEATURE_WP_ASYNC), which to me
is almost exactly soft-dirty bits, except that it solves a few of soft-dirty's
issues, e.g. false positives on vma merging and swapping, or, as you said, the
lack of a finer-grained reset mechanism.
Maybe you also want to have a look at the pagemap ioctl introduced some
time ago ("Pagemap Scan IOCTL", which, IIRC, was trying to use uffd-wp in a
soft-dirty-like way):
https://www.kernel.org/doc/Documentation/admin-guide/mm/pagemap.rst
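To make the uffd-wp idea concrete, here is a small sketch of how a raw pagemap entry could be interpreted under that scheme. It assumes the page was installed with UFFDIO_COPY_MODE_WP in a range registered for write-protect tracking with UFFD_FEATURE_WP_ASYNC, so that the first write clears the uffd-wp marker without a userspace fault. The bit positions (63 for "present", 57 for "uffd-wp write-protected") are the ones given in the pagemap documentation; the helper name is hypothetical, and the Pagemap Scan ioctl is likely the more efficient interface in practice.

```python
PM_PRESENT_BIT = 63  # pagemap bit 63: page present in RAM
PM_UFFD_WP_BIT = 57  # pagemap bit 57: pte is uffd-wp write-protected


def written_since_protect(entry: int) -> bool:
    """Return True if a present page has lost its uffd-wp marker.

    Assumption: the page was installed write-protected, so a cleared
    marker means the page was written afterwards. Pages that are not
    present are conservatively reported as not written in this sketch.
    """
    present = (entry >> PM_PRESENT_BIT) & 1
    uffd_wp = (entry >> PM_UFFD_WP_BIT) & 1
    return bool(present and not uffd_wp)
```

Unlike soft-dirty, the marker can be re-armed on a sub-range, which is what removes the need for a whole-process clear.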
>
> If this is generally sensible, then a couple questions:
> 1. Do I need a UFFD_FEATURE flag for this, or is it enough for a
> program to be able to detect the existence of a
> UFFDIO_COPY_MODE_DONTSOFTDIRTY by whether the ioctl accepts the flag
> or returns EINVAL? I would tend to think the latter.
The latter requires all the setup to be done first, plus a useless ioctl just
to probe. Not a huge issue, but since userfaultfd is extensible, a feature flag
might be better as long as the new feature is well defined.
> 2. Should I add this mode for the other UFFDIO variants (ZEROPAGE,
> MOVE, etc) at the same time even if I don't have any use for them?
Probably not. I don't see a need to implement something just to make the
API look good. If a chunk of code in the Linux kernel has no planned user,
we probably shouldn't add it in the first place.
Thanks,
--
Peter Xu
* Re: Suppress pte soft-dirty bit with UFFDIO_COPY?
From: Kyle Huey @ 2025-05-05 22:15 UTC
To: Peter Xu; +Cc: Andrew Morton, open list, linux-mm, criu, Robert O'Callahan
On Mon, May 5, 2025 at 1:05 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Kyle,
>
> Have you looked at the wr-protect mode, and UFFDIO_COPY_MODE_WP for _COPY?
>
> If sync fault is a perf concern for frequent writes, just to mention at
> least latest Linux also supports async tracking (UFFD_FEATURE_WP_ASYNC),
> which is almost exactly soft dirty bits to me, though it solves a few
> issues it has on e.g. false positives over vma merging and swapping, or
> like you said missing of finer granule reset mechanisms.
>
> Maybe you also want to have a look at the pagemap ioctl introduced some
> time ago ("Pagemap Scan IOCTL", which, IIRC was trying to use uffd-wp in
> soft-dirty-like way):
>
> https://www.kernel.org/doc/Documentation/admin-guide/mm/pagemap.rst
Thanks. This is all very helpful and I think I can construct what I
need out of these building blocks.
- Kyle
* Re: Suppress pte soft-dirty bit with UFFDIO_COPY?
From: Kyle Huey @ 2025-05-12 3:06 UTC
To: Peter Xu; +Cc: Andrew Morton, open list, linux-mm, criu, Robert O'Callahan
On Mon, May 5, 2025 at 3:15 PM Kyle Huey <me@kylehuey.com> wrote:
> Thanks. This is all very helpful and I think I can construct what I
> need out of these building blocks.
>
> - Kyle
That works like a charm, thanks.
The only problem I ran into is that the man page for userfaultfd(2)
claims there's a handshake pattern where you can call UFFDIO_API
twice, once with 0 to enumerate all supported features, and then again
with the feature mask you want to initialize the API. In reality the
API only permits a single UFFDIO_API call because of the internal
UFFD_FEATURE_INITIALIZED flag, so doing this handshake requires
creating a sacrificial fd.
If the man page is not just totally wrong then this may have been an
unintentional regression from 22e5fe2a2a279.
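For reference, the sacrificial-fd handshake described above might look roughly like this. It is a sketch under several assumptions: the syscall number 323 is for x86-64 only, the UFFDIO_API request value and the UFFD_FEATURE_WP_ASYNC bit are written out by hand from my reading of linux/userfaultfd.h and should be checked against the local headers, and the helper names are hypothetical.

```python
import ctypes
import fcntl
import os
import struct

SYS_USERFAULTFD = 323            # assumption: x86-64 syscall number
UFFD_API = 0xAA
UFFDIO_API = 0xC018AA3F          # _IOWR(0xAA, 0x3F, struct uffdio_api), 24 bytes
UFFD_FEATURE_WP_ASYNC = 1 << 15  # assumption: value from linux/userfaultfd.h

_libc = ctypes.CDLL(None, use_errno=True)


def open_uffd() -> int:
    """Create a fresh userfaultfd via the raw syscall."""
    fd = _libc.syscall(ctypes.c_long(SYS_USERFAULTFD), ctypes.c_long(os.O_CLOEXEC))
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return fd


def uffdio_api(fd: int, features: int) -> int:
    """Issue the one permitted UFFDIO_API call; return the kernel's feature mask."""
    buf = bytearray(struct.pack("=QQQ", UFFD_API, features, 0))
    fcntl.ioctl(fd, UFFDIO_API, buf)  # the bytearray is updated in place
    _api, supported, _ioctls = struct.unpack("=QQQ", buf)
    return supported


def pick_features(supported: int, wanted: int) -> int:
    """Fail early if any wanted feature bit is not supported."""
    missing = wanted & ~supported
    if missing:
        raise RuntimeError(f"unsupported userfaultfd features: {missing:#x}")
    return wanted


def open_with_features(wanted: int) -> int:
    probe = open_uffd()  # sacrificial fd, used only to enumerate features
    try:
        supported = uffdio_api(probe, 0)
    finally:
        os.close(probe)
    fd = open_uffd()     # the fd that is actually kept
    try:
        uffdio_api(fd, pick_features(supported, wanted))
    except BaseException:
        os.close(fd)
        raise
    return fd
```

A caller would then use open_with_features(UFFD_FEATURE_WP_ASYNC) once at startup. Whether opening the fd succeeds without privileges depends on system configuration, which this sketch does not handle.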
- Kyle
* Re: Suppress pte soft-dirty bit with UFFDIO_COPY?
From: Peter Xu @ 2025-05-12 15:54 UTC
To: Kyle Huey
Cc: Andrew Morton, open list, linux-mm, criu, Robert O'Callahan,
Axel Rasmussen, Mike Rapoport, Andrea Arcangeli
On Sun, May 11, 2025 at 08:06:03PM -0700, Kyle Huey wrote:
> That works like a charm, thanks.
>
> The only problem I ran into is that the man page for userfaultfd(2)
> claims there's a handshake pattern where you can call UFFDIO_API
> twice, once with 0 to enumerate all supported features, and then again
> with the feature mask you want to initialize the API. In reality the
> API only permits a single UFFDIO_API call because of the internal
> UFFD_FEATURE_INITIALIZED flag, so doing this handshake requires
> creating a sacrificial fd.
This is true; almost all apps I'm aware of that use userfaultfd need
that. It's indeed confusing.
>
> If the man page is not just totally wrong then this may have been an
> unintentional regression from 22e5fe2a2a279.
IMHO 22e5fe2a2a279 was correct, and it fixed a possible race due to
ctx->state before. The new cmpxchg() plus the INITIALIZED flag should avoid
the race.
In this case it is the man page that has been wrong, since this man-pages
commit, afaict:
commit a252b3345f5b0a4ecafa7d4fb1ac73cb4fd4877f (HEAD)
Author: Axel Rasmussen <axelrasmussen@google.com>
Date: Tue Oct 3 12:45:43 2023 -0700
ioctl_userfaultfd.2: Describe two-step feature handshake
I'll see if Axel / Mike / Andrea have any comments; otherwise I'll propose a
patch to fix the man-pages and state the fact (that we need a sacrificial
fd).
Maybe I should really add a UFFDIO_FEATURES ioctl to allow fetching the
feature flags from the kernel separately, considering how much trouble we've
hit with this whole thing.
Thanks,
--
Peter Xu
* Re: Suppress pte soft-dirty bit with UFFDIO_COPY?
From: Kyle Huey @ 2025-05-12 17:16 UTC
To: Peter Xu
Cc: Andrew Morton, open list, linux-mm, criu, Robert O'Callahan,
Axel Rasmussen, Mike Rapoport, Andrea Arcangeli
On Mon, May 12, 2025 at 8:54 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Sun, May 11, 2025 at 08:06:03PM -0700, Kyle Huey wrote:
> > On Mon, May 5, 2025 at 3:15 PM Kyle Huey <me@kylehuey.com> wrote:
> > >
> > > On Mon, May 5, 2025 at 1:05 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Hi, Kyle,
> > > >
> > > > On Mon, May 05, 2025 at 09:37:01AM -0700, Kyle Huey wrote:
> > > > > tl;dr I'd like to add UFFDIO_COPY_MODE_DONTSOFTDIRTY that does not add
> > > > > the _PAGE_SOFT_DIRTY bit to the relevant pte flags. Any
> > > > > thoughts/objections?
> > > > >
> > > > > The kernel has a "soft-dirty" bit on ptes which tracks if they've been
> > > > > written to since the last time /proc/pid/clear_refs was used to clear
> > > > > the soft-dirty bit. CRIU uses this to track which pages have been
> > > > > modified since a previous checkpoint and reduce the size of the
> > > > > checkpoints taken. I would like to use this in my debugger[0] to track
> > > > > which pages a program function dirties when that function is invoked
> > > > > from the debugger.
> > > > >
> > > > > However, the runtime environment for this function is rather unusual.
> > > > > In my debugger, the process being debugged doesn't actually exist
> > > > > while it's being debugged. Instead, we have a database of all program
> > > > > state (including registers and memory values) from when the process
> > > > > was executed. It's in some sense a giant core dump that spans multiple
> > > > > points in time. To execute a program function from the debugger we
> > > > > rematerialize the program state at the desired point in time from our
> > > > > database.
> > > > >
> > > > > For performance reasons, we fill in the memory lazily[1] via
> > > > > userfaultfd. This makes it difficult to use the soft-dirty bit to
> > > > > track the writes the function triggers, because UFFDIO_COPY (and
> > > > > friends) mark every page they touch as soft-dirty. Because we have the
> > > > > canonical source of truth for the pages we materialize via UFFDIO_COPY
> > > > > we're only interested in what happens after the userfaultfd operation.
> > > > >
> > > > > Clearing the soft-dirty bit is complicated by two things:
> > > > > 1. There's no way to clear the soft-dirty bit on a single pte, so
> > > > > instead we have to clear the soft-dirty bits for the entire process.
> > > > > That requires us to process all the soft-dirty bits on every other pte
> > > > > immediately to avoid data loss.
> > > > > 2. We need to clear the soft-dirty bits after the userfaultfd
> > > > > operation, but in order to avoid racing with the task that triggered
> > > > > the page fault we have to do a non-waking copy, then clear the bits,
> > > > > and then separately wake up the task.
> > > > >
> > > > > To work around all of this, we currently have a 4 step process:
> > > > > 1. Read /proc/pid/pagemap and note all ptes that are soft-dirty.
> > > > > 2. Do the UFFDIO_COPY with UFFDIO_COPY_MODE_DONTWAKE.
> > > > > 3. Write to /proc/pid/clear_refs to clear soft-dirty bits across the process.
> > > > > 4. Do a UFFDIO_WAKE.
> > > > >
> > > > > The overhead of all of this (particularly step 1) is a millisecond or
> > > > > two *per page* that we lazily materialize, and while that's not
> > > > > crippling for our purposes, it is rather undesirable. What I would
> > > > > like to have instead is a UFFDIO_COPY mode that leaves the soft-dirty
> > > > > bit unchanged, i.e. a UFFDIO_COPY_MODE_DONTSOFTDIRTY. Since we clear
> > > > > all the soft-dirty bits once after setting up all the mmaps in the
> > > > > process the relevant ptes would then "just do the right thing" from
> > > > > our perspective.
> > > > >
> > > > > But I do want to get some feedback on this before I spend time writing
> > > > > any code. Is there a reason not to do this? Or an alternate way to
> > > > > achieve the same goal?
> > > >
> > > > Have you looked at the wr-protect mode, and UFFDIO_COPY_MODE_WP for _COPY?
> > > >
> > > > If sync fault is a perf concern for frequent writes, just to mention at
> > > > least latest Linux also supports async tracking (UFFD_FEATURE_WP_ASYNC),
> > > > which is almost exactly soft dirty bits to me, though it solves a few
> > > > issues it has on e.g. false positives over vma merging and swapping, or
> > > > like you said missing of finer granule reset mechanisms.
> > > >
> > > > Maybe you also want to have a look at the pagemap ioctl introduced some
> > > > time ago ("Pagemap Scan IOCTL", which, IIRC, was trying to use uffd-wp in
> > > > a soft-dirty-like way):
> > > >
> > > > https://www.kernel.org/doc/Documentation/admin-guide/mm/pagemap.rst
> > >
> > >
> > > Thanks. This is all very helpful and I think I can construct what I
> > > need out of these building blocks.
> > >
> > > - Kyle
> >
> > That works like a charm, thanks.
> >
> > The only problem I ran into is that the man page for userfaultfd(2)
> > claims there's a handshake pattern where you can call UFFDIO_API
> > twice, once with 0 to enumerate all supported features, and then again
> > with the feature mask you want to initialize the API. In reality the
> > API only permits a single UFFDIO_API call because of the internal
> > UFFD_FEATURE_INITIALIZED flag, so doing this handshake requires
> > creating a sacrificial fd.
>
> This is true; almost all apps I'm aware of that use userfaultfd need
> that. It's indeed confusing.
>
> >
> > If the man page is not just totally wrong then this may have been an
> > unintentional regression from 22e5fe2a2a279.
>
> IMHO 22e5fe2a2a279 was correct, and it fixed a possible race due to
> ctx->state before. The new cmpxchg() plus the INITIALIZED flag should avoid
> the race.
>
> In this case it should be the man page that has been wrong, ever since this
> man-pages commit, afaict:
>
> commit a252b3345f5b0a4ecafa7d4fb1ac73cb4fd4877f (HEAD)
> Author: Axel Rasmussen <axelrasmussen@google.com>
> Date: Tue Oct 3 12:45:43 2023 -0700
>
> ioctl_userfaultfd.2: Describe two-step feature handshake
>
> I'll see if Axel / Mike / Andrea have any comments, otherwise I'll propose a
> patch to fix the man-pages and state the fact (that we need a sacrificial
> fd).
>
> Maybe I should really add the UFFDIO_FEATURES ioctl to allow fetching the
> feature flags from kernel separately, considering how much trouble we've
> hit with this whole thing..
Personally I don't think it's a real issue to have to create a
sacrificial fd once at process initialization to see what features are
available. I wouldn't have even said anything if the man page hadn't
explicitly told me there was another way.
- Kyle
> Thanks,
>
> --
> Peter Xu
>
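For readers following along, the four-step workaround quoted near the top of this message can be sketched roughly as below. This is only an illustration under assumptions: the helper names (page_is_soft_dirty, clear_soft_dirty, materialize_quietly) are made up here, the caller is assumed to have already recorded the soft-dirty state it needs (step 1), and partial copies and EAGAIN retries are not handled. It relies on the documented pagemap layout (bit 55 is soft-dirty) and on writing "4" to clear_refs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Step 1 helper (hypothetical name): read one pagemap entry and report
 * its soft-dirty bit (bit 55). Returns 1, 0, or -1 on error. */
static int page_is_soft_dirty(int pagemap_fd, uintptr_t addr, long page_size)
{
    assert(page_size > 0);
    uint64_t entry;
    off_t off = (off_t)(addr / (uintptr_t)page_size) * (off_t)sizeof(entry);

    if (pread(pagemap_fd, &entry, sizeof(entry), off) != (ssize_t)sizeof(entry))
        return -1;
    return (int)((entry >> 55) & 1);
}

/* Step 3 helper (hypothetical name): writing "4" to clear_refs clears the
 * soft-dirty bits for the whole process, not for a single pte. */
static int clear_soft_dirty(pid_t pid)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/clear_refs", (int)pid);

    int fd = open(path, O_WRONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, "4", 1);
    int saved = errno;
    close(fd);
    errno = saved;
    return n == 1 ? 0 : -1;
}

/* Steps 2-4 (hypothetical name): copy without waking the faulting task,
 * clear soft-dirty, then wake the task explicitly. */
static int materialize_quietly(int uffd, pid_t pid, void *dst,
                               const void *src, size_t len)
{
    struct uffdio_copy copy = {
        .dst = (uintptr_t)dst,
        .src = (uintptr_t)src,
        .len = len,
        .mode = UFFDIO_COPY_MODE_DONTWAKE,
    };
    if (ioctl(uffd, UFFDIO_COPY, &copy) < 0)
        return -1;

    int clear_rc = clear_soft_dirty(pid);

    /* Wake even if the clear failed, so the faulting task is not left asleep. */
    struct uffdio_range range = { .start = (uintptr_t)dst, .len = len };
    int wake_rc = ioctl(uffd, UFFDIO_WAKE, &range);

    return (clear_rc < 0 || wake_rc < 0) ? -1 : 0;
}
```

The process-wide clear in step 3 is what forces step 1 in the first place, which is where the per-page cost described in the quoted text comes from.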
* Re: Suppress pte soft-dirty bit with UFFDIO_COPY?
2025-05-12 17:16 ` Kyle Huey
@ 2025-05-13 13:24 ` Peter Xu
2025-05-23 20:32 ` Axel Rasmussen
0 siblings, 1 reply; 8+ messages in thread
From: Peter Xu @ 2025-05-13 13:24 UTC (permalink / raw)
To: Kyle Huey
Cc: Andrew Morton, open list, linux-mm, criu, Robert O'Callahan,
Axel Rasmussen, Mike Rapoport, Andrea Arcangeli
On Mon, May 12, 2025 at 10:16:12AM -0700, Kyle Huey wrote:
> Personally I don't think it's a real issue to have to create a
> sacrificial fd once at process initialization to see what features are
> available. I wouldn't have even said anything if the man page hadn't
> explicitly told me there was another way.
Yes, that's indeed the part that could be confusing and needs fixing. Just
to keep a record (I have you copied), I sent the man-pages changes here:
https://lore.kernel.org/r/20250512171922.356408-1-peterx@redhat.com
We can stick with the sacrificial fd until there's a solid clue showing
that we should introduce a new way to probe.
Thanks,
--
Peter Xu
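As a record of what the sacrificial-fd pattern discussed above looks like in practice, here is a minimal sketch. It assumes only what the thread states: UFFDIO_API may be called once per fd, and calling it with features set to 0 makes the kernel report the features it supports. The function names (uffd_open_api, uffd_probe_features, uffd_open_with) are invented for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a fresh userfaultfd and issue the single permitted UFFDIO_API call.
 * On success returns the fd and, if asked, the feature mask the kernel
 * wrote back into the struct. */
static int uffd_open_api(uint64_t request, uint64_t *reported)
{
    int fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (fd < 0)
        return -1;

    struct uffdio_api api = { .api = UFFD_API, .features = request };
    if (ioctl(fd, UFFDIO_API, &api) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    if (reported)
        *reported = api.features;
    return fd;
}

/* Probe with a sacrificial fd: request no features, read back what the
 * kernel supports, then throw the fd away. */
static int uffd_probe_features(uint64_t *supported)
{
    assert(supported != NULL);
    int fd = uffd_open_api(0, supported);
    if (fd < 0)
        return -1;
    close(fd);
    return 0;
}

/* Open the fd that will actually be used, after checking that every
 * wanted feature was reported by the probe. */
static int uffd_open_with(uint64_t wanted)
{
    uint64_t supported = 0;

    if (uffd_probe_features(&supported) < 0)
        return -1;
    if (wanted & ~supported) {
        errno = ENOTSUP;
        return -1;
    }
    return uffd_open_api(wanted, NULL);
}
```

A caller would typically run the probe once at process initialization and cache the mask, e.g. uffd_open_with(UFFD_FEATURE_WP_ASYNC) on kernels whose headers define that flag.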
* Re: Suppress pte soft-dirty bit with UFFDIO_COPY?
2025-05-13 13:24 ` Peter Xu
@ 2025-05-23 20:32 ` Axel Rasmussen
0 siblings, 0 replies; 8+ messages in thread
From: Axel Rasmussen @ 2025-05-23 20:32 UTC (permalink / raw)
To: Peter Xu
Cc: Kyle Huey, Andrew Morton, open list, linux-mm, criu,
Robert O'Callahan, Mike Rapoport, Andrea Arcangeli
On Tue, May 13, 2025 at 6:25 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, May 12, 2025 at 10:16:12AM -0700, Kyle Huey wrote:
> > Personally I don't think it's a real issue to have to create a
> > sacrificial fd once at process initialization to see what features are
> > available. I wouldn't have even said anything if the man page hadn't
> > explicitly told me there was another way.
>
> Yes, that's indeed the part that could be confusing and needs fixing. Just
> to keep a record (I have you copied), I sent the man-pages changes here:
>
> https://lore.kernel.org/r/20250512171922.356408-1-peterx@redhat.com
Agreed, at a high level I think this is the right fix. I believe I
just forgot the probing required a separate FD when I wrote that
version of the man page. :)
>
> We can stick with the sacrificial fd until there's a solid clue showing
> that we should introduce a new way to probe.
For what it's worth, I'm still convinced the whole handshake / probing
thing is overcomplicated, and it would be simpler to just do:
1. Userspace asks for the features it wants (UFFDIO_API)
2. Kernel responds (fills in the struct) with the (possibly smaller)
subset of those features that it supports
3. Userspace can react as it sees fit if it gets a subset (fail with
error, gracefully degrade, ...)
But, based on previous discussion of that I believe I'm in the minority. :)
If we are sticking with the handshake approach, I agree needing a
second uffd is no big deal. We could add an ioctl to just probe
without configuring, but that would purely be for convenience, and I
don't think it saves many lines of code in userspace. So, on balance /
considering the small benefit I would probably prefer keeping the
kernel simpler.
>
> Thanks,
>
> --
> Peter Xu
>
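To make the single-call alternative in steps 1-3 of the message above concrete, here is a sketch of the userspace side only. It is purely hypothetical: it is not how UFFDIO_API behaves today, and the enum and function names are invented. It assumes the kernel would hand back a "granted" mask that may be a subset of what was asked for, and that the application knows which of its wanted features are hard requirements.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcomes of a single-call feature negotiation. */
enum negotiation {
    NEG_FULL,     /* everything wanted was granted */
    NEG_DEGRADED, /* only optional features are missing */
    NEG_FAILED,   /* at least one required feature is missing */
};

/* Decide how to react to the mask the kernel filled in (step 3).
 * 'required' must be a subset of 'wanted'. */
static enum negotiation classify_grant(uint64_t wanted, uint64_t required,
                                       uint64_t granted)
{
    assert((required & ~wanted) == 0);

    if (required & ~granted)
        return NEG_FAILED;
    if (wanted & ~granted)
        return NEG_DEGRADED;
    return NEG_FULL;
}
```

With this shape the application needs one fd and one ioctl, at the cost of having to handle the degraded case after the fd is already configured.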
Thread overview: 8+ messages
2025-05-05 16:37 Suppress pte soft-dirty bit with UFFDIO_COPY? Kyle Huey
2025-05-05 20:05 ` Peter Xu
2025-05-05 22:15 ` Kyle Huey
2025-05-12 3:06 ` Kyle Huey
2025-05-12 15:54 ` Peter Xu
2025-05-12 17:16 ` Kyle Huey
2025-05-13 13:24 ` Peter Xu
2025-05-23 20:32 ` Axel Rasmussen