From: Christoph Lameter <cl@linux.com>
To: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: "John Fastabend" <john.fastabend@gmail.com>,
	"Mike Rapoport" <rppt@linux.vnet.ibm.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>,
	"Willem de Bruijn" <willemdebruijn.kernel@gmail.com>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	"Karlsson, Magnus" <magnus.karlsson@intel.com>,
	"Alexander Duyck" <alexander.duyck@gmail.com>,
	"Mel Gorman" <mgorman@techsingularity.net>,
	"Tom Herbert" <tom@herbertland.com>,
	"Brenden Blanco" <bblanco@plumgrid.com>,
	"Tariq Toukan" <tariqt@mellanox.com>,
	"Saeed Mahameed" <saeedm@mellanox.com>,
	"Jesse Brandeburg" <jesse.brandeburg@intel.com>,
	"Kalman Meth" <METH@il.ibm.com>,
	"Vladislav Yasevich" <vyasevich@gmail.com>
Subject: Re: Designing a safe RX-zero-copy Memory Model for Networking
Date: Tue, 13 Dec 2016 10:36:55 -0600 (CST)
Message-ID: <alpine.DEB.2.20.1612131030310.32350@east.gentwo.org>
In-Reply-To: <20161213171028.24dbf519@redhat.com>

On Tue, 13 Dec 2016, Jesper Dangaard Brouer wrote:

> This is the early demux problem.  With the push-mode of registering
> memory, you need hardware steering support for zero-copy, as the
> software step happens after the DMA engine has written into the memory.

Right. But we could fall back to software: transfer to a kernel buffer
and then move the data over. Not much of an improvement, but it will make
things work.

> > The discussion here is a bit amusing since these issues were
> > resolved a long time ago with the design of the RDMA subsystem. Zero
> > copy is already in wide use. Memory registration is used to pin down
> > memory areas. Work requests can be filed with the RDMA subsystem,
> > which then sends and receives packets from the registered memory
> > regions. This is not strictly remote memory access, but it is a basic
> > mode of operation supported by the RDMA subsystem. The mlx5 driver
> > quoted here supports all of that.
>
> I hear what you are saying.  I will look into a push-model, as it might
> be a better solution.
> I will read up on RDMA + verbs and learn more about their API model.  I
> even plan to write a small sample program to get a feeling for the API,
> and maybe we can use that as a baseline for the performance target we
> can obtain on the same HW. (Thanks to Björn for already giving me some
> pointers here.)

Great.
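
As a starting point for such a sample: the receive side is essentially
"register a buffer, post a receive work request against it, and let the
NIC DMA into it". A rough libibverbs sketch (pd and qp are assumed to be
set up already, error handling trimmed):

#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch only: register a user-space buffer and post one receive
 * work request pointing into it.  PD/QP/CQ setup is elided. */
static struct ibv_mr *post_zero_copy_recv(struct ibv_pd *pd,
					  struct ibv_qp *qp,
					  void *buf, size_t len)
{
	struct ibv_sge sge;
	struct ibv_recv_wr wr = { 0 }, *bad_wr;
	struct ibv_mr *mr;

	/* Pin the region and allow the HCA to write into it. */
	mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
	if (!mr)
		return NULL;

	sge.addr   = (uintptr_t)buf;	/* user-space address */
	sge.length = (uint32_t)len;
	sge.lkey   = mr->lkey;

	wr.wr_id   = (uintptr_t)buf;	/* cookie returned in the completion */
	wr.sg_list = &sge;
	wr.num_sge = 1;

	/* The next matching packet is DMA'ed straight into buf. */
	if (ibv_post_recv(qp, &wr, &bad_wr)) {
		ibv_dereg_mr(mr);
		return NULL;
	}
	return mr;
}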

> > What is bad about RDMA is that it is a separate kernel subsystem.
> > What I would like to see is a deeper integration with the network
> > stack, so that memory regions can be registered with a network socket
> > and work requests can then be submitted and processed that directly
> > read and write these regions. The network stack should provide the
> > services that the NIC hardware does not support, as usual.
>
> Interesting.  So you even imagine sockets registering memory regions
> with the NIC.  If we had a proper NIC HW filter API across the drivers
> to register steering rules (like ibv_create_flow), this would be
> doable, but we don't. (DPDK actually has an interesting proposal[1])

Well, doing this would mean adding some features, and even then it would
at best allow general support for zero-copy direct to user space, with a
fallback to software if the hardware is missing some feature.
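
For comparison, the verbs steering hook you mention looks roughly like
the sketch below: ibv_create_flow() attaches a match rule (here a UDP
destination port, value picked arbitrarily) to a QP, so the NIC steers
those packets to that queue before software ever sees them. The qp and
the physical port number are assumed; needs <infiniband/verbs.h> and
<arpa/inet.h>:

/* Sketch: steer one UDP destination port to a given QP. */
static struct ibv_flow *steer_udp_port_to_qp(struct ibv_qp *qp)
{
	struct {
		struct ibv_flow_attr         attr;
		struct ibv_flow_spec_eth     eth;
		struct ibv_flow_spec_ipv4    ip4;
		struct ibv_flow_spec_tcp_udp udp;
	} __attribute__((packed)) rule = {
		.attr = {
			.type         = IBV_FLOW_ATTR_NORMAL,
			.size         = sizeof(rule),
			.num_of_specs = 3,
			.port         = 1,	/* physical port, assumed */
		},
		.eth = { .type = IBV_FLOW_SPEC_ETH,
			 .size = sizeof(struct ibv_flow_spec_eth) },
		.ip4 = { .type = IBV_FLOW_SPEC_IPV4,
			 .size = sizeof(struct ibv_flow_spec_ipv4) },
		.udp = { .type = IBV_FLOW_SPEC_UDP,
			 .size = sizeof(struct ibv_flow_spec_tcp_udp),
			 .val.dst_port  = htons(4789),	/* example port */
			 .mask.dst_port = 0xffff },
	};

	/* Empty eth/ip4 masks: match any MAC/IP, filter on UDP port only. */
	return ibv_create_flow(qp, &rule.attr);
}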

> > The RX/TX ring in user space should be an additional mode of
> > operation of the socket layer. Once that is in place, the "Remote
> > memory access" can be trivially implemented on top of that, and the
> > ugly RDMA sidecar subsystem can go away.
>
> I cannot follow that 100%, but I guess you are saying we also need a
> more efficient mode of handing over pages/packets to userspace (than
> going through the normal socket API calls).

A work request contains the user space address of the data to be sent
and/or received. The address must be in a registered memory region. This
is different from copying the packet into kernel data structures.
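
Concretely, on the send side the work request just carries an SGE with
the user-space address and the lkey of the registered region; something
like the following sketch (mr, qp, buf and len are assumed to exist):

struct ibv_sge sge = {
	.addr   = (uintptr_t)buf,	/* user-space address inside mr */
	.length = len,
	.lkey   = mr->lkey,
};
struct ibv_send_wr wr = {
	.wr_id      = 42,		/* arbitrary cookie */
	.sg_list    = &sge,
	.num_sge    = 1,
	.opcode     = IBV_WR_SEND,
	.send_flags = IBV_SEND_SIGNALED,	/* ask for a completion */
}, *bad_wr;

/* No copy into kernel buffers: the HCA reads directly from buf. */
ibv_post_send(qp, &wr, &bad_wr);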

I think this can easily be generalized. We need support for registering
memory regions, submission of work requests and processing of completion
requests. QP (queue-pair) processing is probably the basis for the whole
scheme; it is used in multiple contexts these days.
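
Completion processing is then just draining the CQ; each completion
hands a buffer back to the application. A sketch (cq is assumed, and
<stdio.h> for the error print):

struct ibv_wc wc;

while (ibv_poll_cq(cq, 1, &wc) > 0) {
	if (wc.status != IBV_WC_SUCCESS) {
		fprintf(stderr, "WR %llu failed: %s\n",
			(unsigned long long)wc.wr_id,
			ibv_wc_status_str(wc.status));
		continue;
	}
	/* wc.wr_id identifies the buffer; it is now owned by the
	 * application again and can be reused or re-posted. */
}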
