linux-mm.kvack.org archive mirror
From: Andrew Lunn <andrew@lunn.ch>
To: David Howells <dhowells@redhat.com>
Cc: David Hildenbrand <david@redhat.com>,
	John Hubbard <jhubbard@nvidia.com>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	willy@infradead.org, netdev@vger.kernel.org, linux-mm@kvack.org
Subject: Re: MSG_ZEROCOPY and the O_DIRECT vs fork() race
Date: Fri, 2 May 2025 16:21:55 +0200
Message-ID: <165f5d5b-34f2-40de-b0ec-8c1ca36babe8@lunn.ch>
In-Reply-To: <1021352.1746193306@warthog.procyon.org.uk>

On Fri, May 02, 2025 at 02:41:46PM +0100, David Howells wrote:
> Andrew Lunn <andrew@lunn.ch> wrote:
> 
> > > I'm looking into making the sendmsg() code properly handle the 'DIO vs
> > > fork' issue (where pages need pinning rather than refs taken) and also
> > > getting rid of the taking of refs entirely as the page refcount is going
> > > to go away in the relatively near future.
> > 
> > Sorry, new to this conversation, and i don't know what you mean by DIO
> > vs fork.
> 
> As I understand it, there's a race between O_DIRECT I/O and fork whereby if
> you, say, start a DIO read operation on a page and then fork, the target page
> gets attached to the child and a copy made for the parent (because the
> refcount is elevated by the I/O) - and so only the child sees the result.
> This is made more interesting by things such as AIO, where the parent gets
> the completion notification, but not the data.
> 
> Further, a DIO write is then alterable by the child if the DMA has not yet
> happened.
> 
> One of the things mm/gup.c does is to work around this issue...  However, I
> don't think that MSG_ZEROCOPY handles this - and so zerocopy sendmsg is, I
> think, subject to the same race.
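
For readers new to the CoW side of this, here is a minimal userspace
sketch of the ordinary fork() semantics the race interferes with. It
does not reproduce the race itself, which needs an in-flight DMA
holding a reference on the page. The function name is hypothetical and
only for illustration: the parent writes to a buffer after fork(), and
the child reports what it still sees.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the byte the child observes in buf[0]
 * after the parent has overwritten its own view post-fork, or -1 on
 * error. With normal CoW semantics the child still sees the old value.
 */
int demo_child_view_after_parent_write(void)
{
	static char buf[4096];
	int ready[2], result[2];
	char seen = 0, token = 0;
	pid_t pid;

	memset(buf, 'A', sizeof(buf));	/* touch the page before fork */
	if (pipe(ready) < 0 || pipe(result) < 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: wait for the parent's write, then report buf[0]. */
		close(ready[1]);
		close(result[0]);
		if (read(ready[0], &token, 1) != 1)
			_exit(1);
		if (write(result[1], &buf[0], 1) != 1)
			_exit(1);
		_exit(0);
	}

	close(ready[0]);
	close(result[1]);
	buf[0] = 'B';	/* write to a CoW-shared page: stays private to the parent */
	if (write(ready[1], "x", 1) != 1 || read(result[0], &seen, 1) != 1)
		seen = -1;
	waitpid(pid, NULL, 0);
	close(ready[1]);
	close(result[0]);
	return seen;
}
```

The parent's write is private to the parent, so the child keeps seeing
the pre-fork contents. The DIO case goes wrong because the in-flight
I/O targets a specific physical page, and which process keeps that page
after the CoW fault is what matters.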

For zerocopy, you should probably be talking to Eric Dumazet and David Wei.

I don't know too much about this, but from the Ethernet driver's
perspective, I _think_ it has no idea about zerocopy. It is just
passed an skbuf containing data, with nothing special about it. Once
the interface says the data is on the wire, the driver tells the netdev
core it has finished with the skbuf.

So, I guess your question about CRC is to do with CoW? If the driver
does not touch the data and just DMAs it out, the page could be shared
between the processes. If it needs to modify the data, e.g. to put CRCs
into the packet, that write means the page cannot be shared? If you
have scatter/gather, you can place the headers in kernel memory and do
the writes to set the CRCs without touching the userspace data. I don't
know, but I suspect this is how it is done. There is also an skbuf
operation to linearize a packet, which will allocate a new skbuf big
enough to contain the whole packet in a single segment and memcpy the
fragments into it. That is not what you want for zerocopy, but if your
interface does not have the needed support, there is not much choice.
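
To make the two options concrete, here is a rough userspace analogy in
C. It is not kernel code and does not use the skbuf API. The struct
demo_hdr layout, the additive checksum and all demo_* names are
assumptions made up for illustration. The gather list keeps the header
in separately owned memory and only points at the payload, while the
linearize step copies everything into one buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical header, not a real protocol layout. */
struct demo_hdr {
	uint32_t len;
	uint32_t csum;
};

/* Simple additive checksum standing in for a real CRC/checksum. */
uint32_t demo_csum(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += p[i];
	return sum;
}

/*
 * Scatter/gather: the checksum is written into hdr, which the sender
 * owns. The payload is only read and then referenced in place.
 */
void demo_build_sg(struct demo_hdr *hdr, const void *payload, size_t len,
		   struct iovec iov[2])
{
	hdr->len = (uint32_t)len;
	hdr->csum = demo_csum(payload, len);
	iov[0].iov_base = hdr;
	iov[0].iov_len = sizeof(*hdr);
	iov[1].iov_base = (void *)payload;	/* no copy, just a pointer */
	iov[1].iov_len = len;
}

/*
 * Linearize: allocate one contiguous buffer and memcpy every fragment
 * into it. Returns NULL on allocation failure; the caller frees it.
 */
void *demo_linearize(const struct iovec *iov, int n, size_t *total)
{
	size_t sum = 0, off = 0;
	uint8_t *buf;

	for (int i = 0; i < n; i++)
		sum += iov[i].iov_len;
	buf = malloc(sum ? sum : 1);
	if (!buf)
		return NULL;
	for (int i = 0; i < n; i++) {
		memcpy(buf + off, iov[i].iov_base, iov[i].iov_len);
		off += iov[i].iov_len;
	}
	*total = sum;
	return buf;
}
```

After demo_build_sg(), iov[1].iov_base is still the caller's payload
pointer, so nothing wrote to the user data. demo_linearize() gives up
that property in exchange for a single segment, which is the trade-off
described above.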

	Andrew

