From: Gabriel Krisman Bertazi
To: Ming Lei
Cc: Hannes Reinecke, lsf-pc@lists.linux-foundation.org,
    linux-block@vger.kernel.org, Xiaoguang Wang, linux-mm@kvack.org
Subject: Re: [LSF/MM/BPF TOPIC] block drivers in user space
Organization: Collabora
References: <87tucsf0sr.fsf@collabora.com> <986caf55-65d1-0755-383b-73834ec04967@suse.de>
Date: Mon, 28 Mar 2022 16:20:03 -0400
In-Reply-To: (Ming Lei's message of "Mon, 28 Mar 2022 00:35:33 +0800")
Message-ID: <87o81prfrg.fsf@collabora.com>

Ming Lei writes:

> IMO it needn't 'inverse io_uring', the normal io_uring SQE/CQE model
> does cover this case, the userspace part can submit SQEs beforehand
> for getting notification of each incoming io request from kernel driver,
> then after one io request is queued to the driver, the driver can
> queue a CQE for the previous submitted SQE. Recent posted patch of
> IORING_OP_URING_CMD[1] is perfect for such purpose.
>
> I have written one such userspace block driver recently, and [2] is the
> kernel part blk-mq driver(ubd driver), the userspace part is ubdsrv[3].
> Both the two parts look quite simple, but still in very early stage, so
> far only ubd-loop and ubd-null targets are implemented in [3].
> Not only the io command communication channel is done via
> IORING_OP_URING_CMD, but also IO handling for ubd-loop is implemented
> via plain io_uring too.
>
> It is basically working, for ubd-loop, not see regression in 'xfstests -g auto'
> on the ubd block device compared with same xfstests on underlying disk, and
> my simple performance test on VM shows the result isn't worse than kernel loop
> driver with dio, or even much better on some test situations.

Thanks for sharing.  This is a very interesting implementation that
seems to cover the original use case quite well.  I'm giving it a try
and will report back.

> Wrt. this userspace block driver things, I am more interested in the following
> sub-topics:
>
> 1) zero copy
> - the ubd driver[2] needs one data copy: for WRITE request, copy pages
>   in io request to userspace buffer before handling the WRITE IO by ubdsrv;
>   for READ request, the reverse copy is done after READ request is
>   handled by ubdsrv
>
> - I tried to apply zero copy via remap_pfn_range() for avoiding this
>   data copy, but looks it can't work for ubd driver, since pages in the
>   remapped vm area can't be retrieved by get_user_pages_*() which is called in
>   direct io code path
>
> - recently Xiaoguang Wang posted one RFC patch[4] for support zero copy on
>   tcmu, and vm_insert_page(s)_mkspecial() is added for such purpose, but
>   it has same limit of remap_pfn_range; Also Xiaoguang mentioned that
>   vm_insert_pages may work, but anonymous pages can not be remapped by
>   vm_insert_pages.
>
> - here the requirement is to remap either anonymous pages or page cache
>   pages into userspace vm, and the mapping/unmapping can be done for
>   each IO runtime. Is this requirement reasonable? If yes, is there any
>   easy way to implement it in kernel?

I've run into the same issue with my fd implementation and haven't been
able to work around it.
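
To make the single copy in 1) concrete, here is a toy model in Python.
It is only a sketch and is not taken from the ubd driver or ubdsrv:
Request, handle_io, SECTOR and the bytearray standing in for the backing
disk are all hypothetical names made up for illustration.  It just shows
the direction of the extra copy for WRITE and READ.

```python
# Toy model of the one data copy described in 1).  Not ubd/ubdsrv code:
# Request, handle_io, SECTOR and the bytearray "disk" are hypothetical.
from dataclasses import dataclass

SECTOR = 512


@dataclass
class Request:
    op: str            # "READ" or "WRITE"
    sector: int        # starting sector on the backing store
    pages: bytearray   # stands in for the pages attached to the io request


def handle_io(req: Request, disk: bytearray) -> int:
    """Serve one request via a server-side buffer; return bytes copied."""
    offset = req.sector * SECTOR
    length = len(req.pages)
    if offset + length > len(disk):
        raise ValueError("request beyond end of device")

    user_buf = bytearray(length)  # the server's own userspace buffer

    if req.op == "WRITE":
        # Extra copy: request pages -> userspace buffer, before handling.
        user_buf[:] = req.pages
        disk[offset:offset + length] = user_buf
    elif req.op == "READ":
        # The server handles the READ into its buffer first ...
        user_buf[:] = disk[offset:offset + length]
        # ... then the reverse copy: userspace buffer -> request pages.
        req.pages[:] = user_buf
    else:
        raise ValueError(f"unsupported op: {req.op}")
    return length
```

Zero copy would mean dropping user_buf and letting the server operate on
the request pages directly, which is where the remapping problem above
comes in.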

> 4) apply eBPF in userspace block driver
> - it is one open topic, still not have specific or exact idea yet,
>
> - is there chance to apply ebpf for mapping ubd io into its target handling
>   for avoiding data copy and remapping cost for zero copy?

I was thinking of something like this, or having a way for the server to
only operate on the fds and do splice/sendfile.  But I don't know if it
would be useful for many use cases.  We also want to be able to send the
data to userspace, for instance, for userspace networking.

-- 
Gabriel Krisman Bertazi