From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 12 Jun 2024 18:15:57 +0200
Subject: Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
To: Kent Overstreet, Bernd Schubert
Cc: Miklos Szeredi, Amir Goldstein, "linux-fsdevel@vger.kernel.org",
 Andrew Morton, "linux-mm@kvack.org", Ingo Molnar, Peter Zijlstra,
 Andrei Vagin, "io-uring@vger.kernel.org"
References: <20240529-fuse-uring-for-6-9-rfc2-out-v1-0-d149476b1d65@ddn.com>
 <99d13ae4-8250-4308-b86d-14abd1de2867@fastmail.fm>
 <62ecc4cf-97c8-43e6-84a1-72feddf07d29@fastmail.fm>
 <4e5a84ab-4aa5-4d8b-aa12-625082d92073@ddn.com>
From: Bernd Schubert

On 6/12/24 17:55, Kent Overstreet wrote:
> On Wed, Jun 12, 2024 at 03:40:14PM GMT, Bernd Schubert wrote:
>> On 6/12/24 16:19, Kent Overstreet wrote:
>>> On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote:
>>>> I will definitely look at it this week. Although I don't like the idea
>>>> to have a new kthread. We already have an application thread and have
>>>> the fuse server thread, why do we need another one?
>>>
>>> Ok, I hadn't found the fuse server thread - that should be fine.
>>>
>>>>>
>>>>> The next thing I was going to look at is how you guys are using splice,
>>>>> we want to get away from that too.
>>>>
>>>> Well, Ming Lei is working on that for ublk_drv and I guess that new approach
>>>> could be adapted as well onto the current way of io-uring.
>>>> It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV.
>>>>
>>>> https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/
>>>>
>>>>>
>>>>> Brian was also saying the fuse virtio_fs code may be worth
>>>>> investigating, maybe that could be adapted?
>>>>
>>>> I need to check, but really, the majority of the new additions
>>>> is just to set up things, shutdown and to have sanity checks.
>>>> Request sending/completing to/from the ring is not that much new lines.
>>>
>>> What I'm wondering is how read/write requests are handled. Are the data
>>> payloads going in the same ringbuffer as the commands? That could work,
>>> if the ringbuffer is appropriately sized, but alignment is an issue.
>>
>> That is exactly the big discussion Miklos and I have. Basically in my
>> series another buffer is vmalloced, mmaped and then assigned to ring entries.
>> Fuse meta headers and application payload goes into that buffer.
>> In both kernel/userspace directions. io-uring only allows 80B, so only a
>> really small request would fit into it.
>
> Well, the generic ringbuffer would lift that restriction.

Yeah, kind of. Instead of allocating the buffer in fuse, it would now be
allocated in that code. At least all that setup code would be moved out
of fuse. I will eventually get to your patches today.
Now we only need to convince Miklos that your ring is better ;)

>
>> Legacy /dev/fuse has an alignment issue as payload follows directly as the fuse
>> header - intrinsically fixed in the ring patches.
>
> *nod*
>
> That's the big question, put the data inline (with potential alignment
> hassles) or manage (and map) a separate data structure.
>
> Maybe padding could be inserted to solve alignment?

Right now I have this struct:

struct fuse_ring_req {
	union {
		/* The first 4K are command data */
		char ring_header[FUSE_RING_HEADER_BUF_SIZE];

		struct {
			uint64_t flags;

			/* enum fuse_ring_buf_cmd */
			uint32_t in_out_arg_len;
			uint32_t padding;

			/* kernel fills in, reads out */
			union {
				struct fuse_in_header in;
				struct fuse_out_header out;
			};
		};
	};

	char in_out_arg[];
};

Data go into in_out_arg, i.e. headers are padded by the union.
I actually wonder if FUSE_RING_HEADER_BUF_SIZE should be page size
and not a fixed 4K.

(I just see the stale comment 'enum fuse_ring_buf_cmd',
will remove it in the next series)

>
> A separate data structure would only really be useful if it enabled zero
> copy, but that should probably be a secondary enhancement.
>
>> I will now try without mmap and just provide a user buffer as pointer in the 80B
>> section.
>>
>>
>>>
>>> We just looked up the device DMA requirements and with modern NVME only
>>> 4 byte alignment is required, but the block layer likely isn't set up to
>>> handle that.
>>
>> I think existing fuse headers and their data have a 4 byte alignment.
>> Maybe even 8 byte, I don't remember without looking through all request types.
>> If you try a simple O_DIRECT read/write to libfuse/example_passthrough_hp
>> without the ring patches it will fail because of alignment. Needs to be fixed
>> in legacy fuse and would also avoid compat issues we had in libfuse when the
>> kernel header was updated.
>>
>>>
>>> So - prearranged buffer? Or are you using splice to get pages that
>>> userspace has read into into the kernel pagecache?
>>
>> I didn't even try to use splice yet, because for the DDN (my employer) use case
>> we cannot use zero copy, at least not without violating the rule that one
>> cannot access the application buffer in userspace.
>
> DDN - lustre related?

I have a bit of ancient Lustre background, also with DDN, then went to
Fraunhofer for FhGFS/BeeGFS (kind of competing with Lustre).
Back at DDN initially on IME (burst buffer) and now Infinia. Lustre is
mostly HPC only, Infinia is kind of everything.