linux-mm.kvack.org archive mirror
From: Stefan Hajnoczi <stefanha@redhat.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Jens Axboe <axboe@kernel.dk>,
	brauner@kernel.org, viro@zeniv.linux.org.uk,
	Bernd Schubert <bernd.schubert@fastmail.fm>,
	linux-mm@kvack.org, Josef Bacik <josef@toxicpanda.com>,
	Ming Lei <ming.lei@redhat.com>,
	kwolf@redhat.com
Subject: Re: [PATCH 0/5] sys_ringbuffer
Date: Thu, 6 Jun 2024 21:49:54 -0400	[thread overview]
Message-ID: <20240607014954.GA219708@fedora.redhat.com> (raw)
In-Reply-To: <20240603003306.2030491-1-kent.overstreet@linux.dev>

On Sun, Jun 02, 2024 at 08:32:57PM -0400, Kent Overstreet wrote:
> New syscall for mapping generic ringbuffers for arbitrary (supported)
> file descriptors.
> 
> Ringbuffers can be created either when requested or at file open time,
> and can be mapped into multiple address spaces (naturally, since files
> can be shared as well).
> 
> Initial motivation is for fuse, but I plan on adding support to pipes
> and possibly sockets as well - pipes are a particularly interesting use
> case, because if both the sender and receiver of a pipe opt in to the
> new ringbuffer interface, we can make them the _same_ ringbuffer for
> true zero copy IO, while being backwards compatible with existing pipes.

Hi Kent,
I recently came across a similar use case where the ability to "upgrade"
an fd into a more efficient interface would be useful, much like the pipe
scenario you describe.

My use case involves a block device that uses the ublk driver. ublk
lets userspace servers implement block devices. ublk is great when
compatibility is required with applications that expect block device
fds, but when an application is willing to implement a shared memory
interface to communicate directly with the ublk server, going through a
block device is inefficient.

In my case the application is QEMU, where the virtual machine runs a
virtio-blk driver that could talk directly to the ublk server via
vhost-user-blk. vhost-user-blk is a protocol that lets the virtual
machine exchange requests with the server over shared memory without
going through QEMU or the host kernel block layer.

QEMU would need a way to upgrade from a ublk block device file to a
vhost-user socket. Just like in your pipe example, this approach relies
on being able to go from a "compatibility" fd to a more efficient
interface gracefully when both sides support this feature.

The generic ringbuffer approach in this series would not work for
the vhost-user protocol because the client must be able to provide its
own memory, and file descriptor passing is needed in general. The
protocol spec is here:
https://gitlab.com/qemu-project/qemu/-/blob/master/docs/interop/vhost-user.rst

A different way to approach the fd upgrading problem is to treat this as
an AF_UNIX connectivity feature rather than a new ring buffer API.
Imagine adding a new address type to AF_UNIX for looking up connections
in a struct file (e.g. the pipe fd) instead of on the file system (or
the other AF_UNIX address types).

The first program creates the pipe and also an AF_UNIX socket. It calls
bind(2) on the socket with the sockaddr_un path
"/dev/self/fd/<fd>/<discriminator>" where fd is a pipe fd and
discriminator is a string like "ring-buffer" that describes the
service/protocol. The AF_UNIX kernel code parses this special path and
stores an association with the pipe file for future connect(2) calls.
The program listens on the AF_UNIX socket and then continues doing its
stuff.

The second program runs and inherits the pipe fd on stdin. It creates an
AF_UNIX socket and attempts to connect(2) to
"/dev/self/fd/0/ring-buffer". The AF_UNIX kernel code parses this
special path and establishes a connection between the connecting and
listening sockets inside the pipe fd's struct file. If connect(2) fails
then the second program knows that this is an ordinary pipe that does
not support upgrading to ring buffer operation.
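
To make the connect(2) side concrete, here is a rough sketch of what the
second program might do. The "/dev/self/fd/<fd>/<discriminator>" path is
only the address format proposed in this mail and does not exist in any
kernel today, and the helper names format_upgrade_path() and
try_upgrade_connect() are made up for illustration. On a current kernel
the connect(2) simply fails, which is exactly the "ordinary pipe"
fallback case:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Build the special path for a given fd and discriminator.
 * HYPOTHETICAL: this path format is the proposal from this mail,
 * not an existing kernel interface.
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
int format_upgrade_path(char *buf, size_t len, int fd,
                        const char *discriminator)
{
    int n;

    assert(buf != NULL && discriminator != NULL);
    n = snprintf(buf, len, "/dev/self/fd/%d/%s", fd, discriminator);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Try to upgrade fd by connecting an AF_UNIX socket to the special path.
 * Returns a connected socket on success, or -1 if the peer (or the
 * kernel) does not support the upgrade and the caller should keep
 * using the plain fd.
 */
int try_upgrade_connect(int fd, const char *discriminator)
{
    struct sockaddr_un addr;
    int sock;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    if (format_upgrade_path(addr.sun_path, sizeof(addr.sun_path),
                            fd, discriminator) < 0)
        return -1;

    sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;

    if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(sock);    /* no listener: this is an ordinary pipe */
        return -1;
    }
    return sock;
}
```

A caller would do something like try_upgrade_connect(STDIN_FILENO,
"ring-buffer") at startup and fall back to read(2) on stdin when it
returns -1. The listening side would build the same path and pass it to
bind(2) followed by listen(2).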

Now the AF_UNIX socket can be used to pass shared memory for the ring
buffer and futexes. This AF_UNIX approach also works for my ublk block
device to vhost-user-blk upgrade use case. It does not require a new
ring buffer API but instead involves extending AF_UNIX.
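
Once the socket is connected, passing the shared memory uses only
existing interfaces. As a minimal sketch, assuming the ring buffer lives
in a memfd (my choice for illustration, any mmap-able fd would do), the
fd can be sent with SCM_RIGHTS and mapped on both sides. The function
names send_fd(), recv_fd() and demo_shared_memory_roundtrip() are
hypothetical, and a socketpair(2) stands in for the upgraded connection:

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Properly aligned ancillary data buffer for one file descriptor. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one fd over a connected AF_UNIX socket using SCM_RIGHTS. */
int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl.buf,
                          .msg_controllen = sizeof(ctrl.buf) };
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd sent by send_fd(); returns the new fd or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl.buf,
                          .msg_controllen = sizeof(ctrl.buf) };
    struct cmsghdr *cmsg;
    int fd;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Pass a memfd across a socketpair and check both mappings alias. */
int demo_shared_memory_roundtrip(void)
{
    const size_t size = 4096;
    int sv[2], memfd, peer_fd, ok;
    char *a, *b;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    memfd = memfd_create("ring", MFD_CLOEXEC);
    if (memfd < 0 || ftruncate(memfd, size) < 0)
        return -1;
    if (send_fd(sv[0], memfd) < 0 || (peer_fd = recv_fd(sv[1])) < 0)
        return -1;

    a = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);
    b = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, peer_fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED)
        return -1;

    strcpy(a, "hello");                 /* written via the sender's map */
    ok = strcmp(b, "hello") == 0;       /* visible via the receiver's map */

    munmap(a, size); munmap(b, size);
    close(memfd); close(peer_fd); close(sv[0]); close(sv[1]);
    return ok ? 0 : -1;
}
```

In the real flow the two ends would be separate processes, and an
eventfd or a futex word inside the shared mapping would handle wakeups.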

You have more use cases than just the pipe scenario, and my half-baked
idea may not cover all of them, but I wanted to see what you think.

Stefan

> The ringbuffer_wait and ringbuffer_wakeup syscalls are probably going
> away in a future iteration, in favor of just using futexes.
> 
> In my testing, reading/writing from the ringbuffer 16 bytes at a time is
> ~7x faster than using read/write syscalls - and I was testing with
> mitigations off; the real-world benefit will be even higher.
> 
> Kent Overstreet (5):
>   darray: lift from bcachefs
>   darray: Fix darray_for_each_reverse() when darray is empty
>   fs: sys_ringbuffer
>   ringbuffer: Test device
>   ringbuffer: Userspace test helper
> 
>  MAINTAINERS                             |   7 +
>  arch/x86/entry/syscalls/syscall_32.tbl  |   3 +
>  arch/x86/entry/syscalls/syscall_64.tbl  |   3 +
>  fs/Makefile                             |   2 +
>  fs/bcachefs/Makefile                    |   1 -
>  fs/bcachefs/btree_types.h               |   2 +-
>  fs/bcachefs/btree_update.c              |   2 +
>  fs/bcachefs/btree_write_buffer_types.h  |   2 +-
>  fs/bcachefs/fsck.c                      |   2 +-
>  fs/bcachefs/journal_io.h                |   2 +-
>  fs/bcachefs/journal_sb.c                |   2 +-
>  fs/bcachefs/sb-downgrade.c              |   3 +-
>  fs/bcachefs/sb-errors_types.h           |   2 +-
>  fs/bcachefs/sb-members.h                |   3 +-
>  fs/bcachefs/subvolume.h                 |   1 -
>  fs/bcachefs/subvolume_types.h           |   2 +-
>  fs/bcachefs/thread_with_file_types.h    |   2 +-
>  fs/bcachefs/util.h                      |  28 +-
>  fs/ringbuffer.c                         | 474 ++++++++++++++++++++++++
>  fs/ringbuffer_test.c                    | 209 +++++++++++
>  {fs/bcachefs => include/linux}/darray.h |  61 +--
>  include/linux/darray_types.h            |  22 ++
>  include/linux/fs.h                      |   2 +
>  include/linux/mm_types.h                |   4 +
>  include/linux/ringbuffer_sys.h          |  18 +
>  include/uapi/linux/futex.h              |   1 +
>  include/uapi/linux/ringbuffer_sys.h     |  40 ++
>  init/Kconfig                            |   9 +
>  kernel/fork.c                           |   2 +
>  lib/Kconfig.debug                       |   5 +
>  lib/Makefile                            |   2 +-
>  {fs/bcachefs => lib}/darray.c           |  12 +-
>  tools/ringbuffer/Makefile               |   3 +
>  tools/ringbuffer/ringbuffer-test.c      | 254 +++++++++++++
>  34 files changed, 1125 insertions(+), 62 deletions(-)
>  create mode 100644 fs/ringbuffer.c
>  create mode 100644 fs/ringbuffer_test.c
>  rename {fs/bcachefs => include/linux}/darray.h (63%)
>  create mode 100644 include/linux/darray_types.h
>  create mode 100644 include/linux/ringbuffer_sys.h
>  create mode 100644 include/uapi/linux/ringbuffer_sys.h
>  rename {fs/bcachefs => lib}/darray.c (56%)
>  create mode 100644 tools/ringbuffer/Makefile
>  create mode 100644 tools/ringbuffer/ringbuffer-test.c
> 
> -- 
> 2.45.1
> 
