On Mon, Mar 28, 2022 at 12:35:33AM +0800, Ming Lei wrote:
>On Tue, Feb 22, 2022 at 07:57:27AM +0100, Hannes Reinecke wrote:
>> On 2/21/22 20:59, Gabriel Krisman Bertazi wrote:
>> > I'd like to discuss an interface for implementing user space block
>> > devices, while avoiding local network NBD solutions. There has been
>> > repeated interest in the topic, both from researchers [1] and from
>> > the community, including a proposed session in LSFMM2018 [2] (though
>> > I don't think it happened).
>> >
>> > I've been working on top of the Google iblock implementation to find
>> > something upstreamable and would like to present my design and gather
>> > feedback on some points, in particular zero-copy and the overall user
>> > space interface.
>> >
>> > The design I'm leaning towards uses special fds opened by the driver
>> > to transfer data to/from the block driver, preferably through direct
>> > splicing as much as possible, to keep data only in kernel space. This
>> > is because, in my use case, the driver usually only manipulates
>> > metadata, while data is forwarded directly through the network, or
>> > similar. It would be neat if we could leverage the existing
>> > splice/copy_file_range syscalls so that we never need to bring disk
>> > data to user space, if we can avoid it. I've also experimented with
>> > regular pipes, but I found no way around keeping a lot of pipes open,
>> > one for each possible command 'slot'.
>> >
>> > [1] https://dl.acm.org/doi/10.1145/3456727.3463768
>> > [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
>> >
>> Actually, I'd rather have something like an 'inverse io_uring', where
>> an application creates a memory region separated into several 'rings'
>> for submission and completion.
>> Then the kernel could write/map the incoming data onto the rings, and
>> the application could read from there.
>> Maybe it'll be worthwhile to look at virtio here.
>
>IMO an 'inverse io_uring' isn't needed; the normal io_uring SQE/CQE
>model covers this case. The userspace part can submit SQEs beforehand
>to get a notification for each incoming IO request from the kernel
>driver; then, once an IO request is queued to the driver, the driver
>can complete a CQE for a previously submitted SQE. The recently posted
>IORING_OP_URING_CMD patch [1] is perfect for this purpose.

I had added that as one of the potential use cases to discuss for
uring-cmd:
https://lore.kernel.org/linux-block/20220228092511.458285-1-joshi.k@samsung.com/

And your email already brings a lot of clarity to this.

>I have recently written one such userspace block driver: [2] is the
>kernel-side blk-mq driver (the ubd driver), and the userspace part is
>ubdsrv [3]. Both parts look quite simple, but they are still at a very
>early stage; so far only the ubd-loop and ubd-null targets are
>implemented in [3]. Not only is the IO command communication channel
>done via IORING_OP_URING_CMD, but IO handling for ubd-loop is also
>implemented via plain io_uring.
>
>It is basically working: for ubd-loop, I see no regression in
>'xfstests -g auto' on the ubd block device compared with the same
>xfstests on the underlying disk, and my simple performance test in a VM
>shows results no worse than the kernel loop driver with dio, or even
>much better in some test situations.

Added this to my to-be-read list. Thanks for sharing.
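
To make sure I understand the flow, here is a rough sketch of how the
userspace side of such a driver could look. This is only a guess, not
the actual ubd/ubdsrv ABI: the char device name /dev/ubdc0, struct
ubdsrv_io_cmd and the FETCH_REQ opcode are made-up placeholders, and
the sqe->cmd_op/sqe->cmd fields assume the uring-cmd patches in [1].
The daemon queues one IORING_OP_URING_CMD SQE per tag up front, the
driver completes a CQE when a block request arrives for that tag, and
after handling the IO the daemon re-submits the command for that tag:

/*
 * Rough, illustrative sketch only: /dev/ubdc0, struct ubdsrv_io_cmd and
 * the FETCH_REQ opcode are made-up placeholders, not the real ubd ABI.
 * Assumes a kernel with the IORING_OP_URING_CMD patches applied.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <linux/types.h>
#include <liburing.h>

#define QUEUE_DEPTH	64
#define FETCH_REQ	0x01	/* hypothetical driver-defined cmd_op */

/* hypothetical per-tag descriptor understood by the kernel driver */
struct ubdsrv_io_cmd {
	__u16	tag;
	__u16	q_id;
	__u32	result;		/* result of the previous request, if any */
	__u64	addr;		/* userspace data buffer for this tag */
};

/* queue one URING_CMD SQE that asks the driver for the next request */
static void queue_fetch_cmd(struct io_uring *ring, int dev_fd,
			    const struct ubdsrv_io_cmd *cmd)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode	= IORING_OP_URING_CMD;
	sqe->fd		= dev_fd;
	sqe->cmd_op	= FETCH_REQ;
	sqe->user_data	= cmd->tag;
	/* 16 bytes of inline payload fit in sqe->cmd[] without SQE128 */
	memcpy(sqe->cmd, cmd, sizeof(*cmd));
}

int main(void)
{
	struct ubdsrv_io_cmd cmds[QUEUE_DEPTH] = {};
	struct io_uring ring;
	unsigned int tag;
	int dev_fd;

	dev_fd = open("/dev/ubdc0", O_RDWR);
	if (dev_fd < 0 || io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) {
		perror("setup");
		return 1;
	}

	/* submit one "fetch" SQE per tag before any IO arrives */
	for (tag = 0; tag < QUEUE_DEPTH; tag++) {
		cmds[tag].tag = tag;
		queue_fetch_cmd(&ring, dev_fd, &cmds[tag]);
	}
	io_uring_submit(&ring);

	for (;;) {
		struct io_uring_cqe *cqe;

		/* the driver posts a CQE once a block request is queued */
		if (io_uring_wait_cqe(&ring, &cqe) < 0)
			break;
		tag = (unsigned int)cqe->user_data;
		io_uring_cqe_seen(&ring, cqe);

		/*
		 * Handle the IO for 'tag' here (e.g. with plain io_uring
		 * reads/writes on a backing file), fill cmds[tag].result,
		 * then re-submit the command so the driver can complete
		 * the block request and hand us the next one.
		 */
		queue_fetch_cmd(&ring, dev_fd, &cmds[tag]);
		io_uring_submit(&ring);
	}
	return 0;
}

I imagine the real ubdsrv combines the "commit previous result" and
"fetch next request" steps into a single command, but the inverted
notification flow above is what I understood from your description.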