On Mon, Mar 28, 2022 at 12:35:33AM +0800, Ming Lei wrote:
>On Tue, Feb 22, 2022 at 07:57:27AM +0100, Hannes Reinecke wrote:
>> On 2/21/22 20:59, Gabriel Krisman Bertazi wrote:
>> > I'd like to discuss an interface for implementing user space block
>> > devices, while avoiding local network NBD solutions. There has been
>> > repeated interest in the topic, both from researchers [1] and from
>> > the community, including a proposed session in LSFMM2018 [2] (though
>> > I don't think it happened).
>> >
>> > I've been working on top of the Google iblock implementation to find
>> > something upstreamable and would like to present my design and gather
>> > feedback on some points, in particular zero-copy and the overall user
>> > space interface.
>> >
>> > The design I'm leaning towards uses special fds opened by the driver
>> > to transfer data to/from the block driver, preferably through direct
>> > splicing as much as possible, to keep data only in kernel space. This
>> > is because, in my use case, the driver usually only manipulates
>> > metadata, while data is forwarded directly through the network, or
>> > similar. It would be neat if we could leverage the existing
>> > splice/copy_file_range syscalls so that we never need to bring disk
>> > data to user space, if we can avoid it. I've also experimented with
>> > regular pipes, but I found no way around keeping a lot of pipes open,
>> > one for each possible command 'slot'.
>> >
>> > [1] https://dl.acm.org/doi/10.1145/3456727.3463768
>> > [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
>> >
>> Actually, I'd rather have something like an 'inverse io_uring', where
>> an application creates a memory region separated into several 'rings'
>> for submission and completion.
>> Then the kernel could write/map the incoming data onto the rings, and
>> the application could read from there.
>> Maybe it'll be worthwhile to look at virtio here.
>
>IMO an 'inverse io_uring' isn't needed; the normal io_uring SQE/CQE
>model covers this case. The userspace part can submit SQEs beforehand
>to get a notification for each incoming IO request from the kernel
>driver; then, once an IO request is queued to the driver, the driver
>can complete a CQE for a previously submitted SQE. The recently posted
>IORING_OP_URING_CMD patch [1] is perfect for this purpose.

I had added that as one of the potential use cases to discuss for
uring-cmd:
https://lore.kernel.org/linux-block/20220228092511.458285-1-joshi.k@samsung.com/

And your email already brings a lot of clarity to this.

>I have recently written one such userspace block driver: [2] is the
>kernel-side blk-mq driver (the ubd driver), and the userspace part is
>ubdsrv [3]. Both parts look quite simple, but they are still at a very
>early stage; so far only the ubd-loop and ubd-null targets are
>implemented in [3]. Not only is the IO command communication channel
>done via IORING_OP_URING_CMD, but IO handling for ubd-loop is also
>implemented via plain io_uring.
>
>It is basically working: for ubd-loop, I see no regression in
>'xfstests -g auto' on the ubd block device compared with the same
>xfstests on the underlying disk, and my simple performance test in a VM
>shows results no worse than the kernel loop driver with dio, or even
>much better in some test situations.

Added this to my to-be-read list. Thanks for sharing.
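
To make sure I understand the flow, here is a rough sketch of how the
userspace side of such a driver could look. This is only a guess, not
the actual ubd/ubdsrv ABI: the char device name /dev/ubdc0, struct
ubdsrv_io_cmd and the FETCH_REQ opcode are made-up placeholders, and
the sqe->cmd_op/sqe->cmd fields assume the uring-cmd patches in [1].
The daemon queues one IORING_OP_URING_CMD SQE per tag up front, the
driver completes a CQE when a block request arrives for that tag, and
after handling the IO the daemon re-submits the command for that tag:

/*
 * Rough, illustrative sketch only: /dev/ubdc0, struct ubdsrv_io_cmd and
 * the FETCH_REQ opcode are made-up placeholders, not the real ubd ABI.
 * Assumes a kernel with the IORING_OP_URING_CMD patches applied.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <linux/types.h>
#include <liburing.h>

#define QUEUE_DEPTH	64
#define FETCH_REQ	0x01	/* hypothetical driver-defined cmd_op */

/* hypothetical per-tag descriptor understood by the kernel driver */
struct ubdsrv_io_cmd {
	__u16	tag;
	__u16	q_id;
	__u32	result;		/* result of the previous request, if any */
	__u64	addr;		/* userspace data buffer for this tag */
};

/* queue one URING_CMD SQE that asks the driver for the next request */
static void queue_fetch_cmd(struct io_uring *ring, int dev_fd,
			    const struct ubdsrv_io_cmd *cmd)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode	= IORING_OP_URING_CMD;
	sqe->fd		= dev_fd;
	sqe->cmd_op	= FETCH_REQ;
	sqe->user_data	= cmd->tag;
	/* 16 bytes of inline payload fit in sqe->cmd[] without SQE128 */
	memcpy(sqe->cmd, cmd, sizeof(*cmd));
}

int main(void)
{
	struct ubdsrv_io_cmd cmds[QUEUE_DEPTH] = {};
	struct io_uring ring;
	unsigned int tag;
	int dev_fd;

	dev_fd = open("/dev/ubdc0", O_RDWR);
	if (dev_fd < 0 || io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) {
		perror("setup");
		return 1;
	}

	/* submit one "fetch" SQE per tag before any IO arrives */
	for (tag = 0; tag < QUEUE_DEPTH; tag++) {
		cmds[tag].tag = tag;
		queue_fetch_cmd(&ring, dev_fd, &cmds[tag]);
	}
	io_uring_submit(&ring);

	for (;;) {
		struct io_uring_cqe *cqe;

		/* the driver posts a CQE once a block request is queued */
		if (io_uring_wait_cqe(&ring, &cqe) < 0)
			break;
		tag = (unsigned int)cqe->user_data;
		io_uring_cqe_seen(&ring, cqe);

		/*
		 * Handle the IO for 'tag' here (e.g. with plain io_uring
		 * reads/writes on a backing file), fill cmds[tag].result,
		 * then re-submit the command so the driver can complete
		 * the block request and hand us the next one.
		 */
		queue_fetch_cmd(&ring, dev_fd, &cmds[tag]);
		io_uring_submit(&ring);
	}
	return 0;
}

I imagine the real ubdsrv combines the "commit previous result" and
"fetch next request" steps into a single command, but the inverted
notification flow above is what I understood from your description.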