linux-mm.kvack.org archive mirror
From: Jesper Dangaard Brouer <brouer@redhat.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: David Miller <davem@davemloft.net>,
	cl@linux.com, rppt@linux.vnet.ibm.com, netdev@vger.kernel.org,
	linux-mm@kvack.org, willemdebruijn.kernel@gmail.com,
	bjorn.topel@intel.com, magnus.karlsson@intel.com,
	alexander.duyck@gmail.com, mgorman@techsingularity.net,
	tom@herbertland.com, bblanco@plumgrid.com, tariqt@mellanox.com,
	saeedm@mellanox.com, jesse.brandeburg@intel.com, METH@il.ibm.com,
	vyasevich@gmail.com, brouer@redhat.com
Subject: Re: Designing a safe RX-zero-copy Memory Model for Networking
Date: Wed, 14 Dec 2016 22:04:38 +0100	[thread overview]
Message-ID: <20161214220438.4608f2bb@redhat.com> (raw)
In-Reply-To: <5851740A.2080806@gmail.com>

On Wed, 14 Dec 2016 08:32:10 -0800
John Fastabend <john.fastabend@gmail.com> wrote:

> On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote:
> > On Tue, 13 Dec 2016 12:08:21 -0800
> > John Fastabend <john.fastabend@gmail.com> wrote:
> >   
> >> On 16-12-13 11:53 AM, David Miller wrote:  
> >>> From: John Fastabend <john.fastabend@gmail.com>
> >>> Date: Tue, 13 Dec 2016 09:43:59 -0800
> >>>     
> >>>> What does "zero-copy send packet-pages to the application/socket that
> >>>> requested this" mean? At the moment on x86, page-flipping appears to be
> >>>> more expensive than memcpy (I can post some data shortly), and shared
> >>>> memory was proposed and rejected for security reasons when we were
> >>>> working on the bifurcated driver.
> >>>
> >>> The whole idea is that we map all the active RX ring pages into
> >>> userspace from the start.
> >>>
> >>> And just as Jesper's page pool work will avoid DMA map/unmap,
> >>> it will also avoid changing the userspace mapping of the pages.
> >>>
> >>> Thus avoiding the TLB/VM overhead altogether.
> >>>     
> > 
> > Exactly.  It is worth mentioning that pages entering the page pool need
> > to be cleared (measured cost: 143 cycles), in order not to leak any
> > kernel info.  The primary focus of this design is to make sure not to
> > leak kernel info to userspace, but the "exclusive" mode also supports
> > isolation between applications.
> > 
> >   
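As a side note, a minimal sketch (hypothetical names, not the actual
page_pool code) of what that clearing step could look like when handing
out a page from a pool whose pages are mmap'ed to userspace:

  /* Sketch only: pages destined for a userspace-mapped pool are cleared
   * before they can ever hold packet data, so no stale kernel memory is
   * exposed. */
  static struct page *zc_pool_get_page(struct zc_pool *pool)
  {
          struct page *page = __dev_alloc_page(GFP_ATOMIC);

          if (!page)
                  return NULL;
          if (pool->userspace_mapped)
                  clear_page(page_address(page)); /* the ~143 cycle cost */
          return page;
  }
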
> >> I get this, but it requires applications to be isolated. The pages from
> >> a queue cannot be shared between multiple applications in different
> >> trust domains. And the application has to be cooperative, meaning it
> >> can't "look" at data that has not been marked by the stack as OK. In
> >> these schemes we tend to end up with something like virtio/vhost or
> >> af_packet.
> > 
> > I expect 3 modes when enabling RX-zero-copy on a page_pool. The first
> > two would require CAP_NET_ADMIN privileges.  All modes have a trust
> > domain id that needs to match, e.g. when a page reaches the socket.
> 
> Even mode 3 should require CAP_NET_ADMIN; we don't want userspace to
> grab queues off the NIC without it, IMO.

Good point.
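
To make this concrete, a rough sketch (purely illustrative; the names are
made up and this is not a final API proposal) of how the mode and trust
domain id could be expressed when enabling RX-zero-copy on a page_pool,
with CAP_NET_ADMIN enforced for all three modes as you suggest:

  /* Illustrative sketch only. */
  enum zc_pool_mode {
          ZC_MODE_SHARED,         /* mode-1: multiple apps may mmap the VMA */
          ZC_MODE_SINGLE_USER,    /* mode-2: only one app may mmap the VMA  */
          ZC_MODE_EXCLUSIVE,      /* mode-3: app owns the RX queue entirely */
  };

  struct zc_pool_config {
          enum zc_pool_mode mode;
          u32 domain_id;          /* must match when a page reaches a socket */
  };

  static int zc_pool_enable(struct page_pool *pool,
                            const struct zc_pool_config *cfg)
  {
          if (!capable(CAP_NET_ADMIN))    /* required for all three modes */
                  return -EPERM;
          /* ... switch SKB allocation mode, reconfigure the RX-ring ... */
          return 0;
  }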

> > 
> > Mode-1 "Shared": The application chooses the lowest isolation level,
> >  allowing multiple applications to mmap the VMA area.
> 
> My only point here is that applications can read each other's data, and
> all applications need to cooperate; for example, one app could try to
> write continuously to read-only pages, causing faults and whatnot. This
> is all non-standard and doesn't play well with cgroups and "normal"
> applications. It requires a new orchestration model.
> 
> I'm a bit skeptical of the use case but I know of a handful of reasons
> to use this model. Maybe take a look at the ivshmem implementation in
> DPDK.
> 
> Also this still requires a hardware filter to push "application" traffic
> onto reserved queues/pages as far as I can tell.
> 
> > 
> > Mode-2 "Single-user": The application requests to be the only user
> >  of the RX queue.  This blocks other applications from mmap'ing the
> >  VMA area.
> >   
> 
> Assuming data is read-only, sharing with the stack is possibly OK :/. I
> guess you would need two pools of memory, one for data and one for skbs,
> so you don't leak skbs into user space.

Yes, as described in the original email and here[1]: "once an application
requests zero-copy RX, the driver must use a specific SKB allocation
mode and might have to reconfigure the RX-ring."

The SKB allocation mode is "read-only packet page", which is the current
default mode of using skb-frags (also described in the document[1]).

[1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
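
For reference, a minimal sketch of that allocation mode in a driver RX
path (the helper name is hypothetical): the payload stays in the page and
is only referenced as an skb frag, never copied into the skb linear area:

  static struct sk_buff *zc_build_skb(struct net_device *dev,
                                      struct page *page, unsigned int len)
  {
          /* Small linear area for headers only; the packet payload is
           * never copied, it stays in the (possibly userspace-mapped)
           * page and is attached as a read-only frag. */
          struct sk_buff *skb = netdev_alloc_skb(dev, 128);

          if (!skb)
                  return NULL;
          skb_add_rx_frag(skb, 0, page, 0, len, PAGE_SIZE);
          /* A real driver would pull the protocol headers into the
           * linear area here (e.g. via pskb_may_pull()) before passing
           * the skb up the stack. */
          return skb;
  }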
 
> The devil's in the details here. There are lots of hooks in the kernel
> that can, for example, push the packet elsewhere with a 'redirect' tc
> action. And letting an app "read" data of, or impact the performance of,
> an unrelated application is wrong IMO. Stacked devices also provide
> another set of details that are a bit difficult to track down; see all
> the hardware offload efforts.
> 
> I assume all these concerns are shared between mode-1 and mode-2.
> 
> > Mode-3 "Exclusive": The application requests to own the RX queue.
> >  Packets are no longer allowed for normal netstack delivery.
> >   
> 
> I have patches for this mode already but haven't pushed them due to
> an alternative solution using VFIO.

Interesting.

> > Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
> > still allowed to travel the netstack and thus can contain packet data from
> > other normal applications.  This is part of the design, to share the
> > NIC between netstack and an accelerated userspace application using RX
> > zero-copy delivery.
> >   
> 
> I don't think this is acceptable to be honest. Letting an application
> potentially read/impact other arbitrary applications on the system
> seems like a non-starter even with CAP_NET_ADMIN. At least this was
> the conclusion from the bifurcated driver work some time ago.

I thought the bifurcated driver work was rejected because it could leak
kernel info in the pages. This approach cannot.

   
> >> Any ACLs/filtering/switching/headers need to be done in hardware or
> >> the application trust boundaries are broken.  
> > 
> > The software solution outlined allows the application to make the
> > choice of what trust boundary it wants.
> > 
> > The "exclusive" mode-3 makes most sense together with HW filters.
> > Already today, we support creating a new RX queue based on an ethtool
> > ntuple HW filter; then you simply attach your application to that
> > queue in mode-3, and have full isolation.
> >   
> 
> I'm still pretty fuzzy on why mode-1 and mode-2 do not need HW filters.
> Without hardware filters we have no way of knowing whose/what data is
> put in the page.

For sockets, an SKB carrying an RX zero-copy-able page can be steered
(as normal) into a given socket. Then we check whether the socket
requested zero-copy, and verify that the domain-id matches between the
page_pool and the socket.
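
In pseudo-code, that check at the socket could look roughly like this
(the socket fields and the page_pool_domain_id() helper are made-up
names, for illustration only):

  /* Illustrative only: may this zero-copy page be exposed through this
   * socket's mmap'ed area? */
  static bool skb_zc_allowed(struct sk_buff *skb, struct sock *sk)
  {
          struct page *page = skb_frag_page(&skb_shinfo(skb)->frags[0]);

          if (!sk->sk_zerocopy_rx)        /* socket did not opt in */
                  return false;
          return page_pool_domain_id(page) == sk->sk_zc_domain_id;
  }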

You can also use XDP to filter and steer the packet (which will be
faster than using the normal steering code).
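
As a minimal sketch of the filtering side (the steering part into the
application's socket/queue needs infrastructure beyond today's
XDP_PASS/XDP_DROP/XDP_TX actions and is not shown), an XDP program on
the application's dedicated RX queue could drop everything that is not
the application's own traffic; the port number is just an example:

  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <linux/in.h>
  #include <linux/ip.h>
  #include <linux/udp.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_endian.h>

  #define APP_PORT 4242   /* example: matches whatever the HW filter steers */

  SEC("xdp")
  int xdp_app_filter(struct xdp_md *ctx)
  {
          void *data     = (void *)(long)ctx->data;
          void *data_end = (void *)(long)ctx->data_end;
          struct ethhdr *eth = data;
          struct iphdr *iph;
          struct udphdr *udph;

          if ((void *)(eth + 1) > data_end ||
              eth->h_proto != bpf_htons(ETH_P_IP))
                  return XDP_PASS;        /* non-IP: leave to the stack */

          iph = (void *)(eth + 1);
          if ((void *)(iph + 1) > data_end || iph->protocol != IPPROTO_UDP)
                  return XDP_PASS;

          udph = (void *)(iph + 1);       /* assumes no IP options, for brevity */
          if ((void *)(udph + 1) > data_end)
                  return XDP_DROP;

          return udph->dest == bpf_htons(APP_PORT) ? XDP_PASS : XDP_DROP;
  }

  char _license[] SEC("license") = "GPL";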

> >    
> >> If the above cannot be met then a copy is needed. What I am trying
> >> to tease out is the above comment, along with other statements like
> >> this: "can be done without HW filter features".
> > 
> > Does this address your concerns?
> >   
> 
> I think we need to enforce strong isolation. An application should not
> be able to read data from, or impact, other applications. I gather this
> is the case per the comment about normal applications in mode-2. A
> slightly weaker statement would be to say applications can only
> impact/read data of other applications in their domain. This might be
> OK as well.

I think this approach covers the "weaker statement", because only pages
within the pool are "exposed".  Thus, the domain is the NIC (possibly
restricted to a single RX queue).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

