linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* Re: [GIT PULL] tracing: Fixes for 7.0
       [not found] <20260305103941.11f1b27d@gandalf.local.home>
@ 2026-03-05 16:44 ` Linus Torvalds
  2026-03-05 16:52   ` Steven Rostedt
                     ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Linus Torvalds @ 2026-03-05 16:44 UTC (permalink / raw)
  To: Steven Rostedt, David Hildenbrand, Jason Gunthorpe, Leon Romanovsky
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Huiwen He, Jerome Marchand,
	Qing Wang, Shengming Hu, Linux-MM, linux-rdma

[ Adding linux-mm and the rdma people ]

On Thu, 5 Mar 2026 at 07:39, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> - Fix accounting of the memory mapped ring buffer on fork
>
>   Memory mapping the ftrace ring buffer sets the vm_flags to DONTCOPY. But
>   this does not prevent the application from calling madvise(MADVISE_DOFORK).

I wonder how many other things have this assumption.

Now, many (most?) of the VM_DONTCOPY users set VM_IO too, because the
most common reason they don't want to be copied is that it's a special
mapping.

And then madvise() does nothing.

But I also get the feeling that the whole *reason* for MADV_DOFORK
existing in the first place simply doesn't exist any more.

It was added two decades ago when as a hack for the rdma people who
wanted to mix fork (with COW) and concurrent DMA, which just didn't
work reliably because the COW would break either way.

See commit f822566165dd ("[PATCH] madvise MADV_DONTFORK/MADV_DOFORK").

And that should just not be an issue any more thanks to how it's now
done with page pinning rather than with the old GUP interfaces.

So while I've pulled the tracing fix, I get the feeling that people
should at least think about just making MADV_{DO,DONT}FORK go away.

Now, Debian code search does show some users (libfabric, libibverbs),
and maybe they actually want the forking behavior for other reasons
too.

But I get the feeling that maybe we should at least limit MADV_DOFORK
only to the case where the *source* of the DONTFORK was the user, not
some kernel mapping.

                  Linus


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [GIT PULL] tracing: Fixes for 7.0
  2026-03-05 16:44 ` [GIT PULL] tracing: Fixes for 7.0 Linus Torvalds
@ 2026-03-05 16:52   ` Steven Rostedt
  2026-03-05 17:00   ` David Hildenbrand (Arm)
  2026-03-05 19:07   ` Jason Gunthorpe
  2 siblings, 0 replies; 6+ messages in thread
From: Steven Rostedt @ 2026-03-05 16:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, Jason Gunthorpe, Leon Romanovsky,
	Masami Hiramatsu, Mathieu Desnoyers, Huiwen He, Jerome Marchand,
	Qing Wang, Shengming Hu, Linux-MM, linux-rdma, Lorenzo Stoakes

On Thu, 5 Mar 2026 08:44:27 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [ Adding linux-mm and the rdma people ]
> 
> On Thu, 5 Mar 2026 at 07:39, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > - Fix accounting of the memory mapped ring buffer on fork
> >
> >   Memory mapping the ftrace ring buffer sets the vm_flags to DONTCOPY. But
> >   this does not prevent the application from calling madvise(MADVISE_DOFORK).  
> 
> I wonder how many other things have this assumption.
> 
> Now, many (most?) of the VM_DONTCOPY users set VM_IO too, because the
> most common reason they don't want to be copied is that it's a special
> mapping.
> 
> And then madvise() does nothing.
> 
> But I also get the feeling that the whole *reason* for MADV_DOFORK
> existing in the first place simply doesn't exist any more.
> 
> It was added two decades ago when as a hack for the rdma people who
> wanted to mix fork (with COW) and concurrent DMA, which just didn't
> work reliably because the COW would break either way.
> 
> See commit f822566165dd ("[PATCH] madvise MADV_DONTFORK/MADV_DOFORK").
> 
> And that should just not be an issue any more thanks to how it's now
> done with page pinning rather than with the old GUP interfaces.
> 
> So while I've pulled the tracing fix, I get the feeling that people
> should at least think about just making MADV_{DO,DONT}FORK go away.
> 
> Now, Debian code search does show some users (libfabric, libibverbs),
> and maybe they actually want the forking behavior for other reasons
> too.
> 
> But I get the feeling that maybe we should at least limit MADV_DOFORK
> only to the case where the *source* of the DONTFORK was the user, not
> some kernel mapping.

Right, I was a bit confused when I saw this too. I asked the memory folks
about adding the VM_IO, and Lorenzo suggested against it. You can read the
discussion here:

   https://lore.kernel.org/all/e4deff21-2fb5-4f37-a7d3-ede5f69a4489@lucifer.local/

-- Steve


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [GIT PULL] tracing: Fixes for 7.0
  2026-03-05 16:44 ` [GIT PULL] tracing: Fixes for 7.0 Linus Torvalds
  2026-03-05 16:52   ` Steven Rostedt
@ 2026-03-05 17:00   ` David Hildenbrand (Arm)
  2026-03-05 17:17     ` Linus Torvalds
  2026-03-05 19:07   ` Jason Gunthorpe
  2 siblings, 1 reply; 6+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-05 17:00 UTC (permalink / raw)
  To: Linus Torvalds, Steven Rostedt, Jason Gunthorpe, Leon Romanovsky
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Huiwen He, Jerome Marchand,
	Qing Wang, Shengming Hu, Linux-MM, linux-rdma

On 3/5/26 17:44, Linus Torvalds wrote:
> [ Adding linux-mm and the rdma people ]
> 
> On Thu, 5 Mar 2026 at 07:39, Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> - Fix accounting of the memory mapped ring buffer on fork
>>
>>   Memory mapping the ftrace ring buffer sets the vm_flags to DONTCOPY. But
>>   this does not prevent the application from calling madvise(MADVISE_DOFORK).
> 
> I wonder how many other things have this assumption.
> 
> Now, many (most?) of the VM_DONTCOPY users set VM_IO too, because the
> most common reason they don't want to be copied is that it's a special
> mapping.
> 
> And then madvise() does nothing.
> 
> But I also get the feeling that the whole *reason* for MADV_DOFORK
> existing in the first place simply doesn't exist any more.
> 
> It was added two decades ago when as a hack for the rdma people who
> wanted to mix fork (with COW) and concurrent DMA, which just didn't
> work reliably because the COW would break either way.

Yes.

> 
> See commit f822566165dd ("[PATCH] madvise MADV_DONTFORK/MADV_DOFORK").
> 
> And that should just not be an issue any more thanks to how it's now
> done with page pinning rather than with the old GUP interfaces.
> 

There are still weird cases with O_DIRECT and concurrent fork, where we
don't use FOLL_PIN just yet (I know, I know, ... all shaky).

QEMU traditionally sets MADV_DONTFORK on guest RAM. One reason is to
speed up fork(), because it doesn't need all the guest RAM in fork'ed
child processes.

> So while I've pulled the tracing fix, I get the feeling that people
> should at least think about just making MADV_{DO,DONT}FORK go away.
> 
> Now, Debian code search does show some users (libfabric, libibverbs),
> and maybe they actually want the forking behavior for other reasons
> too.
> 

I suspect we cannot get rid of it that easily. But ...

> But I get the feeling that maybe we should at least limit MADV_DOFORK
> only to the case where the *source* of the DONTFORK was the user, not
> some kernel mapping.
... that makes sense. Forbid toggling it on something that has
VM_SPECIAL set, maybe.

Or at least forbid re-enabling it for such mappings.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [GIT PULL] tracing: Fixes for 7.0
  2026-03-05 17:00   ` David Hildenbrand (Arm)
@ 2026-03-05 17:17     ` Linus Torvalds
  2026-03-05 18:59       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2026-03-05 17:17 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Steven Rostedt, Jason Gunthorpe, Leon Romanovsky,
	Masami Hiramatsu, Mathieu Desnoyers, Huiwen He, Jerome Marchand,
	Qing Wang, Shengming Hu, Linux-MM, linux-rdma

On Thu, 5 Mar 2026 at 09:00, David Hildenbrand (Arm) <david@kernel.org> wrote:
>
> QEMU traditionally sets MADV_DONTFORK on guest RAM. One reason is to
> speed up fork(), because it doesn't need all the guest RAM in fork'ed
> child processes.

Yes, I think the MADV_DONTFORK thing makes sense on its own - more so
than MADV_DOFORK does.

Because it's a very valid thing for user space to do exactly for that
"speed up fork()" case.

It's similar to how we also export a MADV_WIPEONFORK - for a different
use-case, where we don't want the copying behavior (typically because
we want the child to re-create its own set of data: I thin the main
reason tends to be for things like reseeding random number generation
after fork etc).

So it's just MADV_DOFORK I don't particularly like, because it had
pre-existing kernel semantics (the VM_DONTCOPY bit predates the MADV_*
bits by many many years).

Not copying on fork is always safe. But copying something that the
kernel has said "don't copy" just sounds *wrong*.

> > But I get the feeling that maybe we should at least limit MADV_DOFORK
> > only to the case where the *source* of the DONTFORK was the user, not
> > some kernel mapping.
>
> ... that makes sense. Forbid toggling it on something that has
> VM_SPECIAL set, maybe.

Yeah, I think VM_SPECIAL would be a better match than just checking
VM_IO.  At least it would also catch things like that VM_DONTEXPAND,
and PFN mappings.

So just changing the existing VM_IO test to cover all the VM_SPECIAL
bits would be a simple improvement.

Maybe I should just do that and see if anybody even notices (and
revert and re-think if somebody does)

                 Linus


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [GIT PULL] tracing: Fixes for 7.0
  2026-03-05 17:17     ` Linus Torvalds
@ 2026-03-05 18:59       ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 6+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-05 18:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, Jason Gunthorpe, Leon Romanovsky,
	Masami Hiramatsu, Mathieu Desnoyers, Huiwen He, Jerome Marchand,
	Qing Wang, Shengming Hu, Linux-MM, linux-rdma, Lorenzo Stoakes

On 3/5/26 18:17, Linus Torvalds wrote:
> On Thu, 5 Mar 2026 at 09:00, David Hildenbrand (Arm) <david@kernel.org> wrote:
>>
>> QEMU traditionally sets MADV_DONTFORK on guest RAM. One reason is to
>> speed up fork(), because it doesn't need all the guest RAM in fork'ed
>> child processes.
> 
> Yes, I think the MADV_DONTFORK thing makes sense on its own - more so
> than MADV_DOFORK does.
> 
> Because it's a very valid thing for user space to do exactly for that
> "speed up fork()" case.
> 
> It's similar to how we also export a MADV_WIPEONFORK - for a different
> use-case, where we don't want the copying behavior (typically because
> we want the child to re-create its own set of data: I thin the main
> reason tends to be for things like reseeding random number generation
> after fork etc).
> 
> So it's just MADV_DOFORK I don't particularly like, because it had
> pre-existing kernel semantics (the VM_DONTCOPY bit predates the MADV_*
> bits by many many years).
> 
> Not copying on fork is always safe. But copying something that the
> kernel has said "don't copy" just sounds *wrong*.
> 
>>> But I get the feeling that maybe we should at least limit MADV_DOFORK
>>> only to the case where the *source* of the DONTFORK was the user, not
>>> some kernel mapping.
>>
>> ... that makes sense. Forbid toggling it on something that has
>> VM_SPECIAL set, maybe.

CCing Lorenzo.

> 
> Yeah, I think VM_SPECIAL would be a better match than just checking
> VM_IO.  At least it would also catch things like that VM_DONTEXPAND,
> and PFN mappings.
> 
> So just changing the existing VM_IO test to cover all the VM_SPECIAL
> bits would be a simple improvement.

Ack.

> 
> Maybe I should just do that and see if anybody even notices (and
> revert and re-think if somebody does)

Agreed. We could think about letting it sit a bit in -next before moving
it to mainline.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [GIT PULL] tracing: Fixes for 7.0
  2026-03-05 16:44 ` [GIT PULL] tracing: Fixes for 7.0 Linus Torvalds
  2026-03-05 16:52   ` Steven Rostedt
  2026-03-05 17:00   ` David Hildenbrand (Arm)
@ 2026-03-05 19:07   ` Jason Gunthorpe
  2 siblings, 0 replies; 6+ messages in thread
From: Jason Gunthorpe @ 2026-03-05 19:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, David Hildenbrand, Leon Romanovsky,
	Masami Hiramatsu, Mathieu Desnoyers, Huiwen He, Jerome Marchand,
	Qing Wang, Shengming Hu, Linux-MM, linux-rdma

On Thu, Mar 05, 2026 at 08:44:27AM -0800, Linus Torvalds wrote:
> Now, Debian code search does show some users (libfabric, libibverbs),
> and maybe they actually want the forking behavior for other reasons
> too.

DOFORK in libibverbs is a consequence of its automatic DONTFORK, is is
explcitily there to undo a prior DONTFORK only.

This is because the library wrappers would automatically DONTFORK
regions of memory based on library usage, and then DOFORK them back to
normal after the library is done with them.

This whole path is disabled on modern kernels in favour of the fixed
fork support.

I think your point that DOFORK should only take action on previous
DONTFORK is correct.

Jason


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-03-05 19:07 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20260305103941.11f1b27d@gandalf.local.home>
2026-03-05 16:44 ` [GIT PULL] tracing: Fixes for 7.0 Linus Torvalds
2026-03-05 16:52   ` Steven Rostedt
2026-03-05 17:00   ` David Hildenbrand (Arm)
2026-03-05 17:17     ` Linus Torvalds
2026-03-05 18:59       ` David Hildenbrand (Arm)
2026-03-05 19:07   ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox