Date: Tue, 29 Mar 2022 08:30:57 +0800
From: Ming Lei <ming.lei@redhat.com>
To: Gabriel Krisman Bertazi
Cc: Hannes Reinecke, lsf-pc@lists.linux-foundation.org,
	linux-block@vger.kernel.org, Xiaoguang Wang, linux-mm@kvack.org
Subject: Re: [LSF/MM/BPF TOPIC] block drivers in user space
References: <87tucsf0sr.fsf@collabora.com>
	<986caf55-65d1-0755-383b-73834ec04967@suse.de>
	<87o81prfrg.fsf@collabora.com>
In-Reply-To: <87o81prfrg.fsf@collabora.com>
On Mon, Mar 28, 2022 at 04:20:03PM -0400, Gabriel Krisman Bertazi wrote:
> Ming Lei writes:
>
> > IMO it doesn't need an 'inverse io_uring'; the normal io_uring SQE/CQE
> > model covers this case. The userspace part can submit SQEs beforehand
> > to get a notification for each incoming io request from the kernel
> > driver, and once an io request is queued to the driver, the driver can
> > queue a CQE for the previously submitted SQE. The recently posted
> > IORING_OP_URING_CMD patches [1] are perfect for this purpose.
> >
> > I have written one such userspace block driver recently: [2] is the
> > kernel-side blk-mq driver (the ubd driver), and the userspace part is
> > ubdsrv [3]. Both parts look quite simple, but they are still at a very
> > early stage; so far only the ubd-loop and ubd-null targets are
> > implemented in [3]. Not only is the io command communication channel
> > done via IORING_OP_URING_CMD, the IO handling for ubd-loop is
> > implemented via plain io_uring too.
> >
> > It is basically working. For ubd-loop, 'xfstests -g auto' on the ubd
> > block device shows no regression compared with the same xfstests run
> > on the underlying disk, and my simple performance test in a VM shows
> > results no worse than the kernel loop driver with dio, or even much
> > better in some test situations.
>
> Thanks for sharing. This is a very interesting implementation that
> seems to cover the original use case quite well. I'm giving it a try
> and will report back.
>
> > Wrt. userspace block drivers, I am more interested in the following
> > sub-topics:
> >
> > 1) zero copy
> > - the ubd driver [2] needs one data copy: for a WRITE request, pages
> >   in the io request are copied to the userspace buffer before ubdsrv
> >   handles the WRITE; for a READ request, the reverse copy is done
> >   after ubdsrv handles the READ
> >
> > - I tried to apply zero copy via remap_pfn_range() to avoid this data
> >   copy, but it can't work for the ubd driver, since pages in the
> >   remapped vm area can't be retrieved by get_user_pages_*(), which is
> >   called in the direct io code path
> >
> > - recently Xiaoguang Wang posted an RFC patch [4] to support zero
> >   copy on tcmu, adding vm_insert_page(s)_mkspecial() for that purpose,
> >   but it has the same limitation as remap_pfn_range; Xiaoguang also
> >   mentioned that vm_insert_pages may work, but anonymous pages cannot
> >   be remapped by vm_insert_pages
> >
> > - the requirement here is to remap either anonymous pages or page
> >   cache pages into userspace vm, with the mapping/unmapping done at
> >   per-IO runtime. Is this requirement reasonable? If yes, is there
> >   any easy way to implement it in the kernel?
>
> I've run into the same issue with my fd implementation and haven't
> been able to work around it.
>
> > 4) apply eBPF in userspace block driver
> > - this is an open topic; I don't have a specific or exact idea yet
> >
> > - is there a chance to apply eBPF to map ubd io to its target
> >   handling, to avoid the data copy and the remapping cost of zero
> >   copy?
>
> I was thinking of something like this, or having a way for the server
> to only operate on the fds and do splice/sendfile. But I don't know if
> it would be useful for many use cases. We also want to be able to send
> the data to userspace, for instance, for userspace networking.
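For context before answering: the fetch/commit flow quoted above looks
roughly like the sketch below on the ubdsrv side. The liburing calls
are real; the UBD_IO_* opcodes, the /dev/ubdc0 node name and the prep
helper are illustrative stand-ins for the actual ubd ABI in [2]/[3],
which may differ:

	#include <fcntl.h>
	#include <liburing.h>

	#define NR_TAGS 64

	/* Illustrative opcodes; the real ones live in the ubd UAPI. */
	enum { UBD_IO_FETCH_REQ = 1, UBD_IO_COMMIT_AND_FETCH_REQ = 2 };

	/* Illustrative helper: arm one uring-cmd SQE for a given tag. */
	static void prep_ubd_cmd(struct io_uring_sqe *sqe, int dev_fd,
				 unsigned cmd, unsigned tag)
	{
		io_uring_prep_nop(sqe);		/* start from a clean SQE */
		sqe->opcode = IORING_OP_URING_CMD; /* from the RFC in [1] */
		sqe->fd = dev_fd;
		/* the real ABI carries cmd/tag in the command payload */
		sqe->user_data = ((unsigned long long)cmd << 32) | tag;
	}

	static void handle_io(unsigned tag)
	{
		/* loop/null target does the actual IO for this tag */
	}

	int main(void)
	{
		struct io_uring ring;
		struct io_uring_cqe *cqe;
		int dev_fd = open("/dev/ubdc0", O_RDWR); /* illustrative */
		unsigned tag;

		io_uring_queue_init(NR_TAGS, &ring, 0);

		/* submit SQEs beforehand: one FETCH per tag, completed
		 * by the driver when a block request arrives */
		for (tag = 0; tag < NR_TAGS; tag++)
			prep_ubd_cmd(io_uring_get_sqe(&ring), dev_fd,
				     UBD_IO_FETCH_REQ, tag);
		io_uring_submit(&ring);

		for (;;) {
			/* driver queues a CQE once a request is queued */
			io_uring_wait_cqe(&ring, &cqe);
			tag = cqe->user_data & 0xffffffff;
			io_uring_cqe_seen(&ring, cqe);

			handle_io(tag);

			/* commit the result and re-arm in one command */
			prep_ubd_cmd(io_uring_get_sqe(&ring), dev_fd,
				     UBD_IO_COMMIT_AND_FETCH_REQ, tag);
			io_uring_submit(&ring);
		}
	}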
Now, to your splice/sendfile point: I understand the big point is how
to pass the io data to the ubd driver's request/bio pages. But
splice/sendfile just transfers data between two FDs, so how can the
block request/bio's pages get filled with the expected data? Can you
explain in a bit more detail?

If the block layer is bypassed, the device won't be exposed as a block
disk to userspace.

Thanks,
Ming
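P.S. To make the copy-vs-splice point concrete, the ubd-loop data path
on the ubdsrv side is roughly the sketch below, assuming a per-tag
buffer that the driver copies bio pages to/from. The struct name and
UBD_IO_OP_* values are illustrative, not the real ubd ABI:

	#include <unistd.h>

	#define UBD_IO_OP_READ  0	/* illustrative values */
	#define UBD_IO_OP_WRITE 1

	/* Per-tag io descriptor as the daemon might see it. */
	struct ubd_io_desc {
		unsigned int op;
		unsigned int len;
		unsigned long long off;
	};

	/*
	 * Handle one request against the loop target's backing file.
	 * 'buf' is the per-tag buffer the ubd driver copies bio pages
	 * to/from:
	 *
	 *   READ:  pread() fills buf; the driver copies buf into the
	 *          request's bio pages when the result is committed.
	 *   WRITE: the driver copied the bio pages into buf before
	 *          completing the fetch, so pwrite() writes buf out.
	 *
	 * Either way there is exactly one copy between bio pages and
	 * buf. splice()/sendfile() cannot remove it: both of their
	 * endpoints are fds, and no fd exposes the bio pages of the
	 * in-flight block request.
	 */
	static ssize_t ubd_loop_handle_io(int backing_fd,
					  const struct ubd_io_desc *iod,
					  void *buf)
	{
		if (iod->op == UBD_IO_OP_READ)
			return pread(backing_fd, buf, iod->len, iod->off);
		return pwrite(backing_fd, buf, iod->len, iod->off);
	}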