linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Mike Rapoport <rppt@linux.vnet.ibm.com>
To: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: "netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>,
	"John Fastabend" <john.fastabend@gmail.com>,
	"Willem de Bruijn" <willemdebruijn.kernel@gmail.com>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	"Karlsson, Magnus" <magnus.karlsson@intel.com>,
	"Alexander Duyck" <alexander.duyck@gmail.com>,
	"Mel Gorman" <mgorman@techsingularity.net>,
	"Tom Herbert" <tom@herbertland.com>,
	"Brenden Blanco" <bblanco@plumgrid.com>,
	"Tariq Toukan" <tariqt@mellanox.com>,
	"Saeed Mahameed" <saeedm@mellanox.com>,
	"Jesse Brandeburg" <jesse.brandeburg@intel.com>,
	"Kalman Meth" <METH@il.ibm.com>
Subject: Re: Designing a safe RX-zero-copy Memory Model for Networking
Date: Tue, 13 Dec 2016 10:43:38 +0200	[thread overview]
Message-ID: <20161213084337.GE19987@rapoport-lnx> (raw)
In-Reply-To: <20161212161026.0dfd2e13@redhat.com>

On Mon, Dec 12, 2016 at 04:10:26PM +0100, Jesper Dangaard Brouer wrote:
> On Mon, 12 Dec 2016 16:14:33 +0200
> Mike Rapoport <rppt@linux.vnet.ibm.com> wrote:
> > 
> > They are copied :-)
> > Presuming we are dealing only with vhost backend, the received skb
> > eventually gets converted to IOVs, which in turn are copied to the guest
> > memory. The IOVs point to the guest memory that is allocated by virtio-net
> > running in the guest.
> 
> Thanks for explaining that. It seems like a lot of overhead. I have to
> wrap my head around this... so, the hardware NIC is receiving the
> packet/page, in the RX ring, and after converting it to IOVs, it is
> conceptually transmitted into the guest, and then the guest-side have a
> RX-function to handle this packet. Correctly understood?

Almost :)
For the hardware NIC driver, the receive just follows the "normal" path.
It creates an skb for the packet and passes it to the net core RX. Then the
skb is delivered to tap/macvtap. The later converts the skb to IOVs and
IOVs are pushed to the guest address space.

On the guest side, virtio-net sees these IOVs as a part of its RX ring, it
creates an skb for the packet and passes the skb to the net core of the
guest.

> > I'm not very familiar with XDP eBPF, and it's difficult for me to estimate
> > what needs to be done in BPF program to do proper conversion of skb to the
> > virtio descriptors.
> 
> XDP is a step _before_ the SKB is allocated.  The XDP eBPF program can
> modify the packet-page data, but I don't think it is needed for your
> use-case.  View XDP (primarily) as an early (demux) filter.
> 
> XDP is missing a feature your need, which is TX packet into another
> net_device (I actually imagine a port mapping table, that point to a
> net_device).  This require a new "TX-raw" NDO that takes a page (+
> offset and length). 
> 
> I imagine, the virtio driver (virtio_net or a new driver?) getting
> extended with this new "TX-raw" NDO, that takes "raw" packet-pages.
>  Whether zero-copy is possible is determined by checking if page
> originates from a page_pool that have enabled zero-copy (and likely
> matching against a "protection domain" id number).
 
That could be quite a few drivers that will need to implement "TX-raw" then
:)
In general case, the virtual NIC may be connected to the physical network
via long chain of virtual devices such as bridge, veth and ovs.
Actually, because of that we wanted to concentrate on macvtap...
 
> > We were not considered using XDP yet, so we've decided to limit the initial
> > implementation to macvtap because we can ensure correspondence between a
> > NIC queue and virtual NIC, which is not the case with more generic tap
> > device. It could be that use of XDP will allow for a generic solution for
> > virtio case as well.
> 
> You don't need an XDP filter, if you can make the HW do the early demux
> binding into a queue.  The check for if memory is zero-copy enabled
> would be the same.
> 
> > >   
> > > > Have you considered using "push" model for setting the NIC's RX memory?  
> > > 
> > > I don't understand what you mean by a "push" model?  
> > 
> > Currently, memory allocation in NIC drivers boils down to alloc_page with
> > some wrapping code. I see two possible ways to make NIC use of some
> > preallocated pages: either NIC driver will call an API (probably different
> > from alloc_page) to obtain that memory, or there will be NDO API that
> > allows to set the NIC's RX buffers. I named the later case "push".
> 
> As you might have guessed, I'm not into the "push" model, because this
> means I cannot share the queue with the normal network stack.  Which I
> believe is possible as outlined (in email and [2]) and can be done with
> out HW filter features (like macvlan).

I think I should sleep on it a bit more :)
Probably we can add page_pool "backend" implementation to vhost...

--
Sincerely yours,
Mike. 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

      reply	other threads:[~2016-12-13  8:43 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-12-05 14:31 Jesper Dangaard Brouer
2016-12-12  8:38 ` Mike Rapoport
2016-12-12  9:40   ` Jesper Dangaard Brouer
2016-12-12 14:14     ` Mike Rapoport
2016-12-12 14:49       ` John Fastabend
2016-12-12 17:13         ` Jesper Dangaard Brouer
2016-12-12 18:06           ` Christoph Lameter
2016-12-13 16:10             ` Jesper Dangaard Brouer
2016-12-13 16:36               ` Christoph Lameter
2016-12-13 17:43               ` John Fastabend
2016-12-13 19:53                 ` David Miller
2016-12-13 20:08                   ` John Fastabend
2016-12-14  9:39                     ` Jesper Dangaard Brouer
2016-12-14 16:32                       ` John Fastabend
2016-12-14 16:45                         ` Alexander Duyck
2016-12-14 21:29                           ` Jesper Dangaard Brouer
2016-12-14 22:45                             ` Alexander Duyck
2016-12-15  8:28                               ` Jesper Dangaard Brouer
2016-12-15 15:59                                 ` Alexander Duyck
2016-12-15 16:38                                 ` Christoph Lameter
2016-12-14 21:04                         ` Jesper Dangaard Brouer
2016-12-13 18:39               ` Hannes Frederic Sowa
2016-12-14 17:00                 ` Christoph Lameter
2016-12-14 17:37                   ` David Laight
2016-12-14 19:43                     ` Christoph Lameter
2016-12-14 20:37                       ` Hannes Frederic Sowa
2016-12-14 21:22                         ` Christoph Lameter
2016-12-13  9:42         ` Mike Rapoport
2016-12-12 15:10       ` Jesper Dangaard Brouer
2016-12-13  8:43         ` Mike Rapoport [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161213084337.GE19987@rapoport-lnx \
    --to=rppt@linux.vnet.ibm.com \
    --cc=METH@il.ibm.com \
    --cc=alexander.duyck@gmail.com \
    --cc=bblanco@plumgrid.com \
    --cc=bjorn.topel@intel.com \
    --cc=brouer@redhat.com \
    --cc=jesse.brandeburg@intel.com \
    --cc=john.fastabend@gmail.com \
    --cc=linux-mm@kvack.org \
    --cc=magnus.karlsson@intel.com \
    --cc=mgorman@techsingularity.net \
    --cc=netdev@vger.kernel.org \
    --cc=saeedm@mellanox.com \
    --cc=tariqt@mellanox.com \
    --cc=tom@herbertland.com \
    --cc=willemdebruijn.kernel@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox