Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: "Michael S. Tsirkin" <mst@redhat.com>
To: Arnd Bergmann <arnd@arndb.de>
Cc: virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, mingo@elte.hu, linux-mm@kvack.org,
	akpm@linux-foundation.org, hpa@zytor.com,
	gregory.haskins@gmail.com, Or Gerlitz <ogerlitz@voltaire.com>
Subject: Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server
Date: Wed, 19 Aug 2009 16:04:17 +0300	[thread overview]
Message-ID: <20090819130417.GB3080@redhat.com> (raw)
In-Reply-To: <200908191104.50672.arnd@arndb.de>

On Wed, Aug 19, 2009 at 11:04:50AM +0200, Arnd Bergmann wrote:
> On Sunday 16 August 2009, Michael S. Tsirkin wrote:
> > On Fri, Aug 14, 2009 at 01:40:36PM +0200, Arnd Bergmann wrote:
> > > 
> > > * most of the transports are sockets, tap uses a character device.
> > >   This could be dealt with by having both a struct socket * in
> > >   struct vhost_net *and* a struct file *, or by always keeping the
> > >   struct file and calling vfs_readv/vfs_writev for the data transport
> > >   in both cases.
> > 
> > I am concerned that character devices might have weird side effects with
> > read/write operations and that calling them from kernel thread the way I
> > do might have security implications. Can't point at anything specific
> > though at the moment.
> 
> I understand your feelings about passing a chardev fd into your driver
> and I agree that we need to be very careful if we want to allow it.
> 
> Maybe we could instead extend the 'splice' system call to work on a
> vhost_net file descriptor.  If we do that, we can put the access back
> into a user thread (or two) that stays in splice indefinetely

An issue with exposing internal threading model to userspace
in this way is that we lose control of e.g. CPU locality -
and it is very hard for userspace to get it right.

> to
> avoid some of the implications of kernel threads like the missing
> ability to handle transfer errors in user space.

Are you talking about TCP here?
Transfer errors are typically asynchronous - possibly eventfd
as I expose for vhost net is sufficient there.

> > I wonder - can we expose the underlying socket used by tap, or will that
> > create complex lifetime issues?
> 
> I think this could get more messy in the long run than calling vfs_readv
> on a random fd. It would mean deep internal knowledge of the tap driver
> in vhost_net, which I really would prefer to avoid.

No, what I had in mind is adding a GET_SOCKET ioctl to tap.
vhost would then just use the socket.

> > > * Each transport has a slightly different header, we have
> > >   - raw ethernet frames (raw, udp multicast, tap)
> > >   - 32-bit length + raw frames, possibly fragmented (tcp)
> > >   - 80-bit header + raw frames, possibly fragmented (tap with vnet_hdr)
> > >   To handle these three cases, we need either different ioctl numbers
> > >   so that vhost_net can choose the right one, or a flags field in
> > >   VHOST_NET_SET_SOCKET, like
> > > 
> > >   #define VHOST_NET_RAW		1
> > >   #define VHOST_NET_LEN_HDR	2
> > >   #define VHOST_NET_VNET_HDR	4
> > > 
> > >   struct vhost_net_socket {
> > > 	unsigned int flags;
> > > 	int fd;
> > >   };
> > >   #define VHOST_NET_SET_SOCKET _IOW(VHOST_VIRTIO, 0x30, struct vhost_net_socket)
> > 
> > It seems we can query the socket to find out the type, 
> 
> yes, I understand that you can do that, but I still think that decision
> should be left to user space. Adding a length header for TCP streams but
> not for UDP is something that we would normally want to do, but IMHO
> vhost_net should not need to know about this.
> 
> > or use the features ioctl.
> 
> Right, I had forgotten about that one. It's probably equivalent
> to the flags I suggested, except that one allows you to set features
> after starting the communication, while the other one prevents
> you from doing that.
> 
> > > Qemu could then automatically try to use vhost_net, if it's available
> > > in the kernel, or just fall back on software vlan otherwise.
> > > Does that make sense?
> > 
> > I agree, long term it should be enabled automatically when possible.
> 
> So how about making the qemu command line interface an extension to
> what Or Gerlitz has done for the raw packet sockets?
> 
> 	Arnd <><

Not sure I see the connection, but I have not thought about qemu
side of things too much yet - trying to get kernel bits in place
first so that there's a stable ABI to work with.

-- 
MST

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2009-08-19 13:06 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <cover.1250187913.git.mst@redhat.com>
2009-08-13 18:29 ` [PATCHv3 1/2] mm: export use_mm/unuse_mm to modules Michael S. Tsirkin
2009-08-13 18:29 ` [PATCHv3 2/2] vhost_net: a kernel-level virtio server Michael S. Tsirkin
2009-08-14 11:40   ` Arnd Bergmann
2009-08-16  6:51     ` Michael S. Tsirkin
2009-08-19  9:04       ` Arnd Bergmann
2009-08-19 13:04         ` Michael S. Tsirkin [this message]
2009-08-19 13:46           ` Arnd Bergmann
2009-08-19 14:20             ` Michael S. Tsirkin
2009-08-19 15:27               ` Arnd Bergmann
2009-08-20  8:31                 ` Michael S. Tsirkin
2009-08-20 13:10                   ` Arnd Bergmann
2009-08-20 13:38                     ` Michael S. Tsirkin
2009-08-20 14:31                       ` Arnd Bergmann
2009-08-20 14:42                         ` Michael S. Tsirkin
2009-08-20 15:10                           ` Arnd Bergmann
2009-08-21 13:20                       ` Gleb Natapov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090819130417.GB3080@redhat.com \
    --to=mst@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=arnd@arndb.de \
    --cc=gregory.haskins@gmail.com \
    --cc=hpa@zytor.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mingo@elte.hu \
    --cc=netdev@vger.kernel.org \
    --cc=ogerlitz@voltaire.com \
    --cc=virtualization@lists.linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox