From: Mina Almasry <almasrymina@google.com>
Date: Fri, 30 May 2025 09:22:17 -0700
Subject: Re: Device mem changes vs pinning/zerocopy changes
To: David Howells
Cc: willy@infradead.org, hch@infradead.org, Jakub Kicinski, Eric Dumazet,
    netdev@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <770012.1748618092@warthog.procyon.org.uk>

On Fri, May 30, 2025 at 8:15 AM David Howells wrote:
>
> Hi Mina,
>
> I've seen your transmission-side TCP devicemem stuff has just gone in
> and it conflicts somewhat with what I'm trying to do.  I think you're
> working on the problem bottom up and I'm working on it top down, so if
> you're willing to collaborate on it...?
>

Hi David! Yes, very happy to collaborate.

FWIW, my initial gut feeling is that the work doesn't conflict that
much. The TCP devmem netmem/net_iov stuff is designed to follow the
page stuff, and as the usage of struct page changes we're happy to move
net_iovs and netmems in the same direction. My read is that it will
take a small amount of extra work, but there are no in-principle design
conflicts, at least AFAICT so far.

> So, to summarise what we need to change (you may already know all of
> this):
>
> (*) The refcount in struct page is going to go away.  The sk_buff
>     fragment wrangling code, however, occasionally decides to override
>     the zerocopy mode and grab refs on the pages pointed to by those
>     fragments.  sk_buffs *really* want those page refs - and it does
>     simplify memory handling.  But.
>

I believe the main challenge here is that there are many code paths in
the net stack that expect to be able to grab a ref on skb frags. See
all the callers of skb_frag_ref/skb_frag_unref: tcp_grow_skb,
__skb_zcopy_downgrade_managed, __pskb_copy_fclone, pskb_expand_head,
skb_zerocopy, skb_split, pskb_carve_inside_header,
pskb_carve_inside_nonlinear, tcp_clone_payload, skb_segment.

I think to accomplish what you're describing we need to modify
skb_frag_ref to do something other than taking a reference on the page
or net_iov. Taking a reference on the skb itself may be acceptable, so
that the skb can 'guarantee' that the individual frags underneath it
don't disappear while these functions are executing. But devmem TCP
doesn't get much in the way here, AFAICT. It's really the fact that so
many networking code paths want to obtain page refs via skb_frag_ref,
so there is potentially a lot of code to touch.
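
Concretely, something like this is what I have in mind - a rough sketch
only, completely untested, and 'frag_lifetime' and friends are made-up
names, not anything that exists today:

        #include <linux/refcount.h>

        /* Hypothetical: one refcounted lifetime object shared by all
         * frags of an skb; skb_frag_ref()/skb_frag_unref() would bump
         * this instead of the page/net_iov refcount. */
        struct frag_lifetime {
                refcount_t ref;
                void (*destruct)(struct frag_lifetime *fl);
        };

        static inline void frag_lifetime_get(struct frag_lifetime *fl)
        {
                refcount_inc(&fl->ref);
        }

        static inline void frag_lifetime_put(struct frag_lifetime *fl)
        {
                /* Unpin/free the underlying memory exactly once, via
                 * whatever destructor the memory provider supplied. */
                if (refcount_dec_and_test(&fl->ref))
                        fl->destruct(fl);
        }

That way callers like skb_split or skb_segment keep a cheap "the frags
can't disappear under me" guarantee without ever touching a page
refcount.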
> Anyway, we need to stop taking refs where possible.  A fragment may in
> future point to a sequence of pages and we would only be getting a ref
> on one of them.
>
> (*) Further, the page struct is intended to be slimmed down to a
>     single typed pointer if possible, so all the metadata in the
>     net_iov struct will have to be separately allocated.
>

Yes, I'm already collaborating with Byungchul on this, and we're making
great progress. I think this may be close to getting fully reviewed:

https://lore.kernel.org/netdev/20250509115126.63190-1-byungchul@sk.com/

According to his cover letter, he's actually finding the work that I
did with netmem/net_iovs useful for him.

> (*) Currently, when performing MSG_ZEROCOPY, we just take refs on the
>     user pages specified by the iterator but we need to stop doing
>     that.  We need to call GUP to take a "pin" instead (and must not
>     take any refs).  The pages we get access to may be folio-type,
>     anon-type, some sort of device type.
>

This also doesn't conflict with devmem changes, AFAICT. Currently in
devmem we take references on the user devmem only because the page path
also takes references on the user pages (we mirror devmem and pages to
keep the netstack uniform without excessive mem-type checks). If the
code path for zerocopy_fill_skb_from_iter no longer needs to take refs
on the pages, we'll just migrate zerocopy_fill_skb_from_devmem to do
the same thing.
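
For reference, my understanding of the ref->pin switch you're
describing, in zerocopy_fill_skb_from_iter terms (sketch only,
untested; the helper names are made up, and whether tx wants gup_flags
like FOLL_LONGTERM is a policy question):

        #include <linux/mm.h>

        static long zc_pin_user_range(unsigned long uaddr,
                                      unsigned long nr_pages,
                                      struct page **pages)
        {
                /* FOLL_PIN is implied by pin_user_pages(); tx only
                 * reads the buffer, so no FOLL_WRITE.  This replaces
                 * the raw page refs we take today. */
                return pin_user_pages(uaddr, nr_pages, 0, pages);
        }

        static void zc_unpin_user_range(struct page **pages, long nr)
        {
                /* Pairs with the pin above; replaces put_page() in
                 * the completion-notification path. */
                unpin_user_pages(pages, nr);
        }

If that's the shape of it, the devmem side stays trivial, because
net_iovs never came from GUP in the first place.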
> (*) It would be good to do a batch lookup of user buffers to cut down
>     on the number of page table trawls we do - but, on the other hand,
>     that might generate more page faults upfront.
>
> (*) Splice and vmsplice.  If only I could uninvent them...  Anyway,
>     they give us buffers from a pipe - but the buffers come with
>     destructors and should not have refs taken on the pages we might
>     think they have, but use the destructor instead.
>

The above 2 points are orthogonal to devmem. There are no page table
walks, splice, or vmsplice with devmem. We just have to make sure that
the generic changes you're implementing also work with the devmem
paths. I can test your changes and point out any issues I see, but I
don't see a conflict in principle.

> (*) The intention is to change struct bio_vec to be just physical
>     address and length, with no page pointer.  You'd then use, say,
>     kmap_local_phys() or kmap_local_bvec() to access the contents from
>     the cpu.  We could then revert the fragment pointers to being
>     bio_vecs.
>

Ok, this part conflicts a bit, maybe. skb_frag_ref no longer uses
struct bio_vec. I merged that change before the 6.12 kernel; it's not a
very recent change:

https://lore.kernel.org/netdev/20240214223405.1972973-3-almasrymina@google.com/

But, AFAICT, skb_frag_t needs a struct page inside of it, not just a
physical address. skb_frags can be mmap'd into userspace for TCP
zerocopy, see tcp_zerocopy_vm_insert_batch (which is a very old
feature, not a recent change). There may be other call paths in the net
stack that require a full page, where just a physical address won't do
(unless somehow we can mmap a physical address into userspace).

> (*) Kernel services, such as network filesystems, can't pass
>     kmalloc()'d data to sendmsg(MSG_SPLICE_PAGES) because slabs don't
>     have refcounts and, in any case, the object lifetime is not
>     managed by refcount.  However, if we had a destructor, this
>     restriction could go away.
>

This also sounds orthogonal to devmem.

> So what I'd like to do is:
>
> (1) Separate fragment lifetime management from sk_buff.  No more
>     wangling of refcounts in the skbuff code.  If you clone an skb,
>     you stick an extra ref on the lifetime management struct, not the
>     page.
>

Agreed, and AFAICT devmem doesn't get in the way. Let me know if you
disagree or are seeing something different.

> (2) Create a chainable 'network buffer' struct, e.g.:
>
>         enum net_txbuf_type {
>                 NET_TXBUF_BUFFERED,     /* Buffered copy of data */
>                 NET_TXBUF_ZCOPY_USER,   /* Zerocopy of user buffers */
>                 NET_TXBUF_ZCOPY_KERNEL, /* Zerocopy of kernel buffers */
>         };
>
>         struct net_txbuf {
>                 struct net_txbuf        *next;
>                 struct mmpin            mm_pin;
>                 unsigned int            start_pos;
>                 unsigned int            end_pos;
>                 unsigned int            extracted_to;
>                 refcount_t              ref;
>                 enum net_txbuf_type     type;
>                 u8                      nr_used;
>                 bool                    wmem_charged;
>                 bool                    got_copied;
>                 union {
>                         /* For NET_TXBUF_BUFFERED: */
>                         struct {
>                                 void    *bufs[16];
>                                 u8      bufs_orders[16];
>                                 bool    last_buf_freeable;
>                         };
>                         /* For NET_TXBUF_ZCOPY_*: */
>                         struct {
>                                 struct sock             *sk;
>                                 struct sk_buff          *notify;
>                                 msg_completion_t        completion;
>                                 void                    *completion_data;
>                                 struct bio_vec          frags[12];
>                         };
>                 };
>         };
>
>     (Note this is very much still a WiP and very much subject to
>     change)
>
>     So how I envision it working depends on the type of flow in the
>     socket.  For the transmission side of streaming sockets (e.g.
>     TCP), the socket maintains a single chain of these.  Each txbuf is
>     of a single type, but multiple types can be interleaved.
>
>     For non-ZC flow, as data is imported, it's copied into pages
>     attached to the current head txbuf of type BUFFERED, with more
>     pages being attached as we progress.  Successive writes just keep
>     adding to the space in the latest page added and each skbuff
>     generated pins the txbuf it starts at and each txbuf pins its
>     successor.
>
>     As skbuffs are consumed, they unpin the root txbuf.  However, this
>     could leave an awful lot of memory pinned for a long time, so I
>     would mitigate this in two ways: firstly, where possible, keep
>     track of the transmitted byte position and progressively destruct
>     the txbuf; secondly, if we completely use up a partially filled
>     txbuf then reset the queue.
>
>     An skbuff's frag list then has a bio_vec[] that refers to
>     fragments of the buffers recorded in the txbuf chain.  An skbuff
>     may span multiple txbufs and a txbuf may provision multiple
>     skbuffs.
>
>     For the transmission side of datagram sockets (e.g. UDP) where the
>     messages may complete out of order, I think I would give each
>     datagram its own series of txbufs, but link the tails together to
>     manage the SO_EE_ORIGIN_ZEROCOPY notification generation if
>     dealing with userspace.  If dealing with the kernel, there's no
>     need to link them together as the kernel can provide a destructor
>     for each datagram.
>

To be honest, I didn't follow the entirety of point #2 here, but the
problem is not related to devmem. Is struct net_txbuf intended to
replace struct sk_buff in the tx path only? If so, I'm not sure that
works. Currently TX and RX memory share a single data structure
(sk_buff), and I believe that is critical, because RX skbs can be
forwarded to TX and vice versa. You will need to implement a
net_txbuf_to_skb helper and vice versa if you go this route, no?
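
Something like these - hypothetical signatures, purely to make the
point concrete:

        /* Every boundary where a packet can cross between tx and rx
         * (loopback, veth, IP forwarding, tc redirect, ...) would need
         * conversions along these lines: */
        struct sk_buff *net_txbuf_to_skb(struct net_txbuf *txb, gfp_t gfp);
        struct net_txbuf *skb_to_net_txbuf(struct sk_buff *skb, gfp_t gfp);

And every one of those conversion points is a place to get the lifetime
rules subtly wrong.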
So I think, maybe, instead of introducing a new struct, you have to
make the modifications you envision to struct sk_buff itself?

> (3) When doing zerocopy from userspace, do calls to GUP to get
>     batches of non-contiguous pages into a bio_vec array.
>
> (4) Because AF_UNIX and the loopback driver transfer packets from the
>     transmission queue of one socket down into the reception queue of
>     another, the use of txbufs would also need to extend onto the
>     receive side (and so "txbufs" would be a misnomer).
>

OK, so you realize that TX packets can be forwarded to RX. The opposite
is also true: RX can be forwarded to TX. And it's not just AF_UNIX and
loopback. Packets can be forwarded via IP forwarding, and tc, and
probably another half dozen features in the net stack I don't know
about. I think you need to modify the existing sk_buff. I think adding
a new struct and migrating the entire net stack to use it is a bit too
ambitious. But up to you. Just my 2 cents here.

>     When receiving a packet, a txbuf would need to be allocated and
>     the received buffers attached to it.  The pages wouldn't
>     necessarily need refcounts as the txbuf holds them.  The skbuff
>     holds a ref on the txbuf.
>
> (5) Cloning an skbuff would involve just taking an extra ref on the
>     first txbuf.  Splitting off part of an skbuff would involve
>     fast-forwarding the txbuf chain for the second part and pinning
>     that.
>
> (6) I have a chained-bio_vec array concept with an iov_iter type for
>     it that might make it easier to string together the fragments in a
>     reassembled packet and represent it as an iov_iter, thereby
>     allowing us to use common iterator routines for things like ICMP
>     and packet crypto.
>
> (7) We need to separate net_iov from struct page, and it might make
>     things easier if we do that now, allocating net_iov from a slab.
>

net_iov is already separate from struct page. The only commonality they
share is that we static_assert that the offsets of some page pool
fields are the same in struct page and struct net_iov, but that's it.
Byungchul is already handling the entire netmem_desc/shrink-struct-page
work in the series I linked to earlier, and I would say his work is
going well.

> (8) Reference the txbuf in a splice and provide a destructor that
>     drops that reference.  For small splices, I'd be very tempted to
>     simply copy the data.  For splice-out of data that was spliced
>     into an AF_UNIX socket or zerocopy data that passed through a
>     loopback device, I'm also very tempted to make splice copy at that
>     point.  There's a potential DoS attack whereby someone can
>     endlessly splice tiny bits of a message or just sit on them,
>     preventing the original provider from recovering its memory.
>
> (9) Make it easy for a network filesystem to create an entire compound
>     message and present it to the socket in a single sendmsg() with a
>     destructor.
>
> I've pushed my current changes (very incomplete as they are) to:
>
>     https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-experimental
>
> I'm writing functions to abstract out the loading of data into the
> txbuf chain and attaching it to skbuffs.  These can be found in
> skbuff.c as net_txbuf_*().  I've modified the TCP sendmsg code to use
> them.

-- 
Thanks,
Mina