From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Christoph Lameter <cl@linux.com>
Cc: "John Fastabend" <john.fastabend@gmail.com>,
"Mike Rapoport" <rppt@linux.vnet.ibm.com>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>,
"Willem de Bruijn" <willemdebruijn.kernel@gmail.com>,
"Björn Töpel" <bjorn.topel@intel.com>,
"Karlsson, Magnus" <magnus.karlsson@intel.com>,
"Alexander Duyck" <alexander.duyck@gmail.com>,
"Mel Gorman" <mgorman@techsingularity.net>,
"Tom Herbert" <tom@herbertland.com>,
"Brenden Blanco" <bblanco@plumgrid.com>,
"Tariq Toukan" <tariqt@mellanox.com>,
"Saeed Mahameed" <saeedm@mellanox.com>,
"Jesse Brandeburg" <jesse.brandeburg@intel.com>,
"Kalman Meth" <METH@il.ibm.com>,
"Vladislav Yasevich" <vyasevich@gmail.com>,
brouer@redhat.com
Subject: Re: Designing a safe RX-zero-copy Memory Model for Networking
Date: Tue, 13 Dec 2016 17:10:28 +0100
Message-ID: <20161213171028.24dbf519@redhat.com>
In-Reply-To: <alpine.DEB.2.20.1612121200280.13607@east.gentwo.org>
On Mon, 12 Dec 2016 12:06:59 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:
> On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:
>
> > Hmmm. If you can rely on the hardware setup to give you steering and
> > dedicated access to the RX rings, then in those cases, I guess, the
> > "push" model could be a more direct API approach.
>
> If the hardware does not support steering then one should be able to
> provide those services in software.
This is the early demux problem. With the push model of registering
memory, you need hardware steering support for zero-copy, because the
software step happens after the DMA engine has written into the memory.
My model pre-VMA-maps all the pages in the RX ring (if zero-copy gets
enabled by a single user). The software step can then filter and
zero-copy send packet-pages to the application/socket that requested
them. The disadvantage is that all zero-copy applications need to share
this VMA mapping. This is solved by configuring HW filters into an
RX queue, and then only attaching your zero-copy application to that
queue.
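To make the pull model above concrete, here is a minimal C sketch, assuming the RX ring pages are already mapped into one shared VMA, so that handing a packet to the application means passing a page index rather than copying bytes. All names (`rx_desc`, `zc_filter`, `rx_demux`, ...) are hypothetical and do not correspond to any kernel API, and the filter is simplified to a destination-port match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: RX ring pages are assumed to be pre-mapped into
 * one shared VMA, so zero-copy delivery means handing over a page index
 * instead of copying payload bytes. */
#define RING_SIZE 8

struct rx_desc {
    uint32_t page_idx;  /* index of the page in the shared mapping */
    uint16_t dst_port;  /* simplified packet metadata used by the filter */
    uint16_t len;
};

/* Hypothetical software filter, installed when zero-copy is enabled. */
struct zc_filter {
    uint16_t dst_port;
};

struct page_queue {
    uint32_t idx[RING_SIZE];
    size_t count;
};

static void queue_push(struct page_queue *q, uint32_t page_idx)
{
    assert(q->count < RING_SIZE);   /* sketch: no overflow handling */
    q->idx[q->count++] = page_idx;
}

/* Demux n completed descriptors: matching packets go to the zero-copy
 * application queue, everything else to the normal (copying) stack path. */
static void rx_demux(const struct rx_desc *ring, size_t n,
                     const struct zc_filter *f,
                     struct page_queue *zc_app, struct page_queue *stack)
{
    for (size_t i = 0; i < n; i++) {
        if (ring[i].dst_port == f->dst_port)
            queue_push(zc_app, ring[i].page_idx);
        else
            queue_push(stack, ring[i].page_idx);
    }
}
```

Pages routed to the normal stack are still part of the same mapping, which is exactly the sharing disadvantage described above; a HW filter that isolates the flow into its own RX queue removes it.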
> > I was shooting for a model that worked without hardware support,
> > and that would then transparently benefit from HW support by
> > configuring a HW filter into a specific RX queue and attaching to
> > that queue.
>
> The discussion here is a bit amusing since these issues have been
> resolved a long time ago with the design of the RDMA subsystem. Zero
> copy is already in wide use. Memory registration is used to pin down
> memory areas. Work requests can be filed with the RDMA subsystem that
> then send and receive packets from the registered memory regions.
> This is not strictly remote memory access but this is a basic mode of
> operations supported by the RDMA subsystem. The mlx5 driver quoted
> here supports all of that.
I hear what you are saying. I will look into a push model, as it might
be a better solution.
I will read up on RDMA + verbs and learn more about their API model. I
even plan to write a small sample program to get a feeling for the API,
and maybe we can use that as a baseline for the performance target we
can obtain on the same HW. (Thanks to Björn for already giving me some
pointers here.)
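For comparison, the push model Christoph describes (register memory, then post work requests that the hardware completes into) can be simulated in a few lines. This is a self-contained sketch, not libibverbs code: in the real verbs API the corresponding steps are `ibv_reg_mr()`, `ibv_post_recv()` and `ibv_poll_cq()`, whereas `qp_init`, `post_recv`, `nic_deliver` and `poll_cq` below are hypothetical stand-ins, with `memcpy` playing the role of the NIC's DMA write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical simulation of the push model; no verbs calls are made. */
#define MAX_WR 16

struct mem_region { uint8_t *base; size_t len; };   /* "registered" memory */

struct recv_wr {            /* receive work request: where a packet lands */
    uint64_t wr_id;
    size_t offset;          /* offset into the registered region */
    size_t len;             /* capacity of this buffer */
};

struct completion { uint64_t wr_id; size_t byte_len; };

struct queue_pair {
    struct mem_region mr;
    struct recv_wr rq[MAX_WR];      /* posted, not yet consumed */
    size_t rq_head, rq_tail;
    struct completion cq[MAX_WR];   /* completions not yet polled */
    size_t cq_count;
};

static void qp_init(struct queue_pair *qp, uint8_t *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    memset(qp, 0, sizeof(*qp));
    qp->mr = (struct mem_region){ buf, len };
}

/* Post a receive buffer; it must lie inside the registered region. */
static int post_recv(struct queue_pair *qp, struct recv_wr wr)
{
    if (wr.offset + wr.len > qp->mr.len || qp->rq_tail - qp->rq_head == MAX_WR)
        return -1;
    qp->rq[qp->rq_tail % MAX_WR] = wr;
    qp->rq_tail++;
    return 0;
}

/* Simulated NIC: write the packet straight into the next posted buffer
 * (the application's memory) and report a completion. Returns -1 when no
 * buffer is posted, the CQ is full, or the packet does not fit. */
static int nic_deliver(struct queue_pair *qp, const void *pkt, size_t len)
{
    if (qp->rq_head == qp->rq_tail || qp->cq_count == MAX_WR)
        return -1;
    struct recv_wr wr = qp->rq[qp->rq_head % MAX_WR];
    if (len > wr.len)
        return -1;
    qp->rq_head++;
    memcpy(qp->mr.base + wr.offset, pkt, len);   /* stands in for DMA */
    qp->cq[qp->cq_count++] = (struct completion){ wr.wr_id, len };
    return 0;
}

/* Application side: reap up to max completions, oldest first. */
static size_t poll_cq(struct queue_pair *qp, struct completion *out, size_t max)
{
    size_t n = qp->cq_count < max ? qp->cq_count : max;
    memcpy(out, qp->cq, n * sizeof(*out));
    memmove(qp->cq, qp->cq + n, (qp->cq_count - n) * sizeof(*out));
    qp->cq_count -= n;
    return n;
}
```

The application decides up front where each packet will land, so no demux step is needed after DMA, provided the hardware steers the flow to this queue in the first place.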
> What is bad about RDMA is that it is a separate kernel subsystem.
> What I would like to see is a deeper integration with the network
> stack so that memory regions can be registered with a network socket
> and work requests can then be submitted and processed that directly
> read and write in these regions. The network stack should provide the
> services that the hardware of the NIC does not support, as usual.
Interesting. So you even imagine sockets registering memory regions
with the NIC. If we had a proper NIC HW filter API across the drivers
to register the steering rule (like ibv_create_flow), this would be
doable, but we don't (DPDK actually has an interesting proposal[1]).
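A steering rule of the kind `ibv_create_flow` installs essentially boils down to a match on header fields plus a target RX queue. The following is a hypothetical C sketch of such a rule table (it is neither the verbs API nor rte_flow); `flow_steer` is what the NIC would evaluate in hardware, and what the stack could emulate in software when the NIC lacks support.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rule table; a field value of 0 acts as a wildcard, and
 * the first matching rule wins. */
struct flow_key {
    uint8_t  proto;      /* e.g. 17 for UDP */
    uint32_t dst_ip;     /* host byte order, for simplicity */
    uint16_t dst_port;
};

struct flow_rule {
    struct flow_key match;
    unsigned int rx_queue;
};

#define MAX_RULES 32

struct flow_table {
    struct flow_rule rules[MAX_RULES];
    size_t n;
    unsigned int default_queue;   /* used when no rule matches */
};

static int flow_add(struct flow_table *t, struct flow_key match, unsigned int q)
{
    if (t->n == MAX_RULES)
        return -1;
    t->rules[t->n++] = (struct flow_rule){ match, q };
    return 0;
}

static int field_match(uint32_t want, uint32_t have)
{
    return want == 0 || want == have;
}

/* Select the RX queue for a packet's header fields. */
static unsigned int flow_steer(const struct flow_table *t, const struct flow_key *pkt)
{
    assert(t != NULL && pkt != NULL);
    for (size_t i = 0; i < t->n; i++) {
        const struct flow_key *m = &t->rules[i].match;
        if (field_match(m->proto, pkt->proto) &&
            field_match(m->dst_ip, pkt->dst_ip) &&
            field_match(m->dst_port, pkt->dst_port))
            return t->rules[i].rx_queue;
    }
    return t->default_queue;
}
```

Using 0 as a wildcard is a simplification for the sketch (it cannot express "port 0"); a real API would carry explicit masks, as rte_flow does with its spec/mask pairs.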
> The RX/TX ring in user space should be an additional mode of
> operation of the socket layer. Once that is in place the "Remote
> memory access" can be trivially implemented on top of that and the
> ugly RDMA sidecar subsystem can go away.
I cannot follow that 100%, but I guess you are saying we also need a
more efficient mode of handing over pages/packets to userspace (than
going through the normal socket API calls).
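One common shape for such a hand-over mode is a shared single-producer/single-consumer descriptor ring: the kernel publishes page indices, the application consumes them, and, at least in a polling setup, no per-packet system call is needed. The C11 sketch below assumes exactly one producer and one consumer; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical SPSC descriptor ring shared between a producer (kernel)
 * and a consumer (application). Only descriptors move through the ring;
 * packet data stays in pages both sides already have mapped. */
#define RING_ENTRIES 8
static_assert((RING_ENTRIES & (RING_ENTRIES - 1)) == 0,
              "RING_ENTRIES must be a power of two");

struct ring_desc {
    uint32_t page_idx;
    uint32_t len;
};

struct spsc_ring {
    _Atomic uint32_t prod;   /* written only by the producer */
    _Atomic uint32_t cons;   /* written only by the consumer */
    struct ring_desc desc[RING_ENTRIES];
};

static bool ring_produce(struct spsc_ring *r, struct ring_desc d)
{
    uint32_t prod = atomic_load_explicit(&r->prod, memory_order_relaxed);
    /* acquire: the consumer has finished reading a slot before we reuse it */
    uint32_t cons = atomic_load_explicit(&r->cons, memory_order_acquire);
    if (prod - cons == RING_ENTRIES)
        return false;                        /* ring full */
    r->desc[prod & (RING_ENTRIES - 1)] = d;
    /* release: descriptor contents become visible before the new index */
    atomic_store_explicit(&r->prod, prod + 1, memory_order_release);
    return true;
}

static bool ring_consume(struct spsc_ring *r, struct ring_desc *out)
{
    uint32_t cons = atomic_load_explicit(&r->cons, memory_order_relaxed);
    /* acquire: pairs with the producer's release store of prod */
    uint32_t prod = atomic_load_explicit(&r->prod, memory_order_acquire);
    if (cons == prod)
        return false;                        /* ring empty */
    *out = r->desc[cons & (RING_ENTRIES - 1)];
    /* release: the slot has been read before it is handed back */
    atomic_store_explicit(&r->cons, cons + 1, memory_order_release);
    return true;
}
```

The free-running 32-bit indices are masked on access, so full and empty are distinguished without wasting a slot.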
I appreciate your input; it challenged my thinking.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
[1] https://rawgit.com/6WIND/rte_flow/master/rte_flow.html