From: Bernd Schubert
Date: Thu, 30 May 2024 14:09:26 +0200
Subject: Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
To: Amir Goldstein, Bernd Schubert
Cc: Miklos Szeredi, linux-fsdevel@vger.kernel.org, Andrew Morton,
    linux-mm@kvack.org, Ingo Molnar, Peter Zijlstra, Andrei Vagin,
    io-uring@vger.kernel.org, Josef Bacik
References: <20240529-fuse-uring-for-6-9-rfc2-out-v1-0-d149476b1d65@ddn.com>

On 5/30/24 09:07, Amir Goldstein wrote:
> On Wed, May 29, 2024 at 9:01 PM Bernd Schubert wrote:
>>
>> From: Bernd Schubert
>>
>> This adds support for uring communication between kernel and
>> userspace daemon using the opcode IORING_OP_URING_CMD. The basic
>> approach was taken from ublk. The patches are in RFC state,
>> some major changes are still to be expected.
>>
>> The motivation for these patches is to increase fuse performance.
>> In fuse-over-io-uring, requests avoid core switching (application
>> on core X, processing of fuse server on random core Y) and use
>> shared memory between kernel and userspace to transfer data.
>> Similar approaches have been taken by ZUFS and FUSE2, though
>> not over io-uring, but through ioctl IOs.
>>
>> https://lwn.net/Articles/756625/
>> https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/log/?h=fuse2
>>
>> Avoiding cache line bouncing / numa systems was discussed
>> between Amir and Miklos before and Miklos had posted
>> part of the private discussion here
>> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/
>>
>> This cache line bouncing should be addressed by these patches
>> as well.
>>
>> I had also noticed waitq wake-up latencies in fuse before
>> https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@fastmail.fm/T/
>>
>> This spinning approach helped with performance (>40% improvement
>> for file creates), but due to random server side thread/core utilization
>> spinning cannot be well controlled in /dev/fuse mode.
>> With fuse-over-io-uring, requests are handled on the same core
>> (sync requests) or on core+1 (large async requests) and performance
>> improvements are achieved without spinning.
>>
>> Splice/zero-copy is not supported yet. Ming Lei is working
>> on io-uring support for ublk_drv, but I think so far there
>> is no final agreement on the approach to be taken.
>> Fuse-over-io-uring runs significantly faster than reads/writes
>> over /dev/fuse, even with splice enabled, so missing zc
>> should not be a blocking issue.
>>
>> The patches have been tested with multiple xfstest runs in a VM
>> (32 cores) with a kernel that has several debug options
>> enabled (like KASAN and MSAN).
>> For some tests xfstests reports that O_DIRECT is not supported,
>> I need to investigate that. The interesting part is that exactly
>> these tests fail in plain /dev/fuse posix mode. I had to disable
>> generic/650, which is enabling/disabling cpu cores - given ring
>> threads are bound to cores, issues with that are not totally
>> unexpected, but then there are (scheduler) kernel messages that
>> core binding for these threads is removed - this needs
>> to be further investigated.
>> A nice effect in io-uring mode is that tests run faster (like
>> generic/522 ~2400s /dev/fuse vs. ~1600s patched), though still
>> slow as this is with ASAN/leak-detection/etc.
>>
>> The corresponding libfuse patches are on my uring branch,
>> but need cleanup for submission - will happen during the next
>> days.
>> https://github.com/bsbernd/libfuse/tree/uring
>>
>> If it should make review easier, patches posted here are on
>> this branch
>> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.9-rfc2
>>
>> TODO list for next RFC versions
>> - Let the ring configure ioctl return information, like mmap/queue-buf size
>> - Request kernel side address and len for a request - avoid calculation in userspace?
>> - multiple IO sizes per queue (avoiding a calculation in userspace is probably even
>> more important)
>> - FUSE_INTERRUPT handling?
>> - Logging (adds fields in the ioctl and also ring-request),
>> any mismatch between client and server is currently very hard to understand
>> through error codes
>>
>> Future work
>> - notifications, probably on their own ring
>> - zero copy
>>
>> I had run quite some benchmarks with linux-6.2 before LSFMMBPF2023,
>> which resulted in some tuning patches (at the end of the
>> patch series).
>>
>> Some benchmark results
>> ======================
>>
>> System used for the benchmark is a 32 core (HyperThreading enabled)
>> Xeon E5-2650 system. I don't have local disks attached that could
>> do > 5GB/s IOs, for paged and dio results a patched version of
>> passthrough-hp was used that bypasses final reads/writes.
>>
>> paged reads
>> -----------
>>              128K IO size              1024K IO size
>> jobs   /dev/fuse   uring   gain   /dev/fuse   uring   gain
>> 1           1117    1921   1.72        1902    1942   1.02
>> 2           2502    3527   1.41        3066    3260   1.06
>> 4           5052    6125   1.21        5994    6097   1.02
>> 8           6273   10855   1.73        7101   10491   1.48
>> 16          6373   11320   1.78        7660   11419   1.49
>> 24          6111    9015   1.48        7600    9029   1.19
>> 32          5725    7968   1.39        6986    7961   1.14
>>
>> dio reads (1024K)
>> -----------------
>>
>> jobs   /dev/fuse   uring   gain
>> 1           2023    3998   2.42
>> 2           3375    7950   2.83
>> 4           3823   15022   3.58
>> 8           7796   22591   2.77
>> 16          8520   27864   3.27
>> 24          8361   20617   2.55
>> 32          8717   12971   1.55
>>
>> mmap reads (4K)
>> ---------------
>> (sequential, I probably should have made it random, sequential exposes
>> a rather interesting/weird 'optimized' memcpy issue - sequential becomes
>> reversed order 4K read)
>> https://lore.kernel.org/linux-fsdevel/aae918da-833f-7ec5-ac8a-115d66d80d0e@fastmail.fm/
>>
>> jobs   /dev/fuse   uring   gain
>> 1            130     323   2.49
>> 2            219     538   2.46
>> 4            503    1040   2.07
>> 8           1472    2039   1.38
>> 16          2191    3518   1.61
>> 24          2453    4561   1.86
>> 32          2178    5628   2.58
>>
>> (Results on request, setting MAP_HUGETLB much improves performance
>> for both, io-uring mode then has a slight advantage only.)
>>
>> creates/s
>> ----------
>> threads   /dev/fuse    uring   gain
>> 1              3944    10121   2.57
>> 2              8580    24524   2.86
>> 4             16628    44426   2.67
>> 8             46746    56716   1.21
>> 16            79740   102966   1.29
>> 20            80284   119502   1.49
>>
>> (the gain drop with >=8 cores needs to be investigated)
>

Hi Amir,

> Hi Bernd,
>
> Those are impressive results!

thank you!

>
> When approaching the FUSE uring feature from marketing POV,
> I think that putting the emphasis on metadata operations is the
> best approach.

I can add in some more results and probably need to redo at least the
metadata tests. I have all the results in google docs and in plain text
files, it is just a bit cumbersome, and maybe also spam, to post all of
it here.

>
> Not that the dio reads are not important (I know that is part of your use case),
> but I imagine there are a lot more people out there waiting for
> improvement in metadata operations overhead.

I think the DIO use case is declining. My fuse work is now related to
the DDN Infina project, which has a DLM - this will all go via cache and
notifications (into from/to client/server). I need to start to work on
that asap... I'm also not too happy yet about cached writes/reads - need
to find time to investigate where the limit is.

>
> To me it helps to know what the current main pain points are
> for people using FUSE filesystems wrt performance.
>
> Although it may not be up to date, the most comprehensive
> study about FUSE performance overhead is this FAST17 paper:
>
> https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf

Yeah, I had seen it. Just checking again, what is actually interesting
is their instrumentation branch

https://github.com/sbu-fsl/fuse-kernel-instrumentation

This should be very useful upstream, in combination with Josef's fuse
tracepoints (btw, thanks for the tracepoint patch Josef! I'm going to
look at it and test it tomorrow).

>
> In this paper, table 3 summarizes the different overheads observed
> per workload. According to this table, the workloads that degrade
> performance worst on an optimized passthrough fs over SSD are:
> - many file creates
> - many file deletes
> - many small file reads
> In all these workloads, it was millions of files over many directories.
> The highest performance regression reported was -83% on many
> small file creations.
>
> The moral of this long story is that it would be nice to know
> what performance improvement FUSE uring can aspire to.
> This is especially relevant for people that would be interested
> in combining the benefits of FUSE passthrough (for data) and
> FUSE uring (for metadata).

As written above, I can add some more data. But if possible I wouldn't
like to concentrate on benchmarking - this can be super time consuming
and doesn't help unless one investigates what is actually limiting
performance. Right now we see that io-uring helps, fixing the other
limits is then the next step, imho.

>
> What did passthrough_hp do in your patched version with creates?
> Did it actually create the files?

Yeah, it creates files, I think on xfs (or ext4). I had tried tmpfs
first, but it had issues with seekdir/telldir until recently - will
switch back to tmpfs for next tests.

> In how many directories?
> Maybe the directory inode lock impeded performance improvement
> with >=8 threads?

I don't think the directory inode lock is an issue - this should be one
(or more) directories per thread.

Basically

/usr/lib64/openmpi/bin/mpirun \
    --mca btl self -n $i --oversubscribe \
    ./mdtest -F -n40000 -i1 \
    -d /scratch/dest -u -b2 | tee ${fname}-$i.out

mdtest is really convenient for meta operations, although it requires
MPI. Recent versions are here (the initial LLNL project merged with
ior):

https://github.com/hpc/ior

"-F"  Perform test on files only (no directories).
"-n" number_of_items  Every process will create/stat/remove # directories and files
"-i" iterations  The number of iterations the test will run
"-u"  Create a unique working directory for each task
"-b" branching_factor  The branching factor of the hierarchical directory structure [default: 1].

(The older LLNL repo has a better mdtest README
https://github.com/LLNL/mdtest)

Also, regarding metadata, I definitely need to find time to resume work
on atomic-open. Besides performance, there is another use case
https://github.com/libfuse/libfuse/issues/945. Sweet Tea Dorminy / Josef
also seem to need that.

>
>>
>> Remaining TODO list for RFCv3:
>> --------------------------------
>> 1) Let the ring configure ioctl return information,
>> like mmap/queue-buf size
>>
>> Right now libfuse and kernel have lots of duplicated setup code
>> and any kind of pointer/offset mismatch results in a non-working
>> ring that is hard to debug - probably better when the kernel does
>> the calculations and returns that to server side
>>
>> 2) In combination with 1, ring requests should retrieve their
>> userspace address and length from kernel side instead of
>> calculating it through the mmaped queue buffer on their own.
>> (Introduction of FUSE_URING_BUF_ADDR_FETCH)
>>
>> 3) Add log buffer into the ioctl and ring-request
>>
>> This is to provide better error messages (instead of just
>> errno)
>>
>> 4) Multiple IO sizes per queue
>>
>> Small IOs and metadata requests do not need large buffer sizes,
>> we need multiple IO sizes per queue.
>>
>> 5) FUSE_INTERRUPT handling
>>
>> These are not handled yet, kernel side is probably not difficult
>> anymore as ring entries take fuse requests through lists.
>>
>> Long term TODO:
>> --------------
>> Notifications through io-uring, maybe with a separated ring,
>> but I'm not sure yet.
>
> Is that going to improve performance in any real life workload?
>

I'm rather sure that we at DDN will need it for our project with the DLM.
I have other priorities for now - once it comes up, adding notifications
over uring shouldn't be difficult.


Thanks,
Bernd
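
PS: For anyone who wants to repeat the create runs, the mdtest
invocation given earlier in this mail can be wrapped in a small sweep
over the rank count. This is only a minimal sketch, not the script that
produced the numbers: the rank list is simply the "threads" column of
the creates/s table, and the DRY_RUN switch, the overridable
MPIRUN/MDTEST/DEST variables, the mdtest_cmd helper and the default
value of fname are assumptions added for illustration.

```shell
#!/bin/sh
# Sketch only: sweep the MPI rank count for the mdtest create test.
# DRY_RUN, the variable overrides, mdtest_cmd and the fname default are
# illustrative assumptions; the mpirun/mdtest options are the ones
# quoted in the mail.

MPIRUN=${MPIRUN:-/usr/lib64/openmpi/bin/mpirun}
MDTEST=${MDTEST:-./mdtest}
DEST=${DEST:-/scratch/dest}     # directory on the fuse mount under test
fname=${fname:-creates}         # hypothetical prefix for the log files
DRY_RUN=${DRY_RUN:-1}           # 1: print the commands, do not run them

# Print the command line for a run with $1 MPI ranks.
mdtest_cmd() {
    printf '%s --mca btl self -n %s --oversubscribe %s -F -n40000 -i1 -d %s -u -b2\n' \
        "$MPIRUN" "$1" "$MDTEST" "$DEST"
}

for i in 1 2 4 8 16 20; do
    cmd=$(mdtest_cmd "$i")
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd | tee ${fname}-$i.out"
    else
        # Relies on word splitting, so the paths must not contain spaces.
        $cmd | tee "${fname}-$i.out"
    fi
done
```

With DRY_RUN=1 (the default in this sketch) it only prints the six
command lines; DRY_RUN=0 runs them and writes one log file per rank
count.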