From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id DB13BC369A9 for ; Thu, 10 Apr 2025 15:11:25 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id C131F2800CE; Thu, 10 Apr 2025 11:11:23 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id BC2D72800C8; Thu, 10 Apr 2025 11:11:23 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A8B642800CE; Thu, 10 Apr 2025 11:11:23 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 8A7172800C8 for ; Thu, 10 Apr 2025 11:11:23 -0400 (EDT) Received: from smtpin14.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id A42B45764A for ; Thu, 10 Apr 2025 15:11:24 +0000 (UTC) X-FDA: 83318472888.14.45B38F9 Received: from out30-119.freemail.mail.aliyun.com (out30-119.freemail.mail.aliyun.com [115.124.30.119]) by imf02.hostedemail.com (Postfix) with ESMTP id D9C8280014 for ; Thu, 10 Apr 2025 15:11:21 +0000 (UTC) Authentication-Results: imf02.hostedemail.com; dkim=pass header.d=linux.alibaba.com header.s=default header.b=NNMVSCRD; spf=pass (imf02.hostedemail.com: domain of jefflexu@linux.alibaba.com designates 115.124.30.119 as permitted sender) smtp.mailfrom=jefflexu@linux.alibaba.com; dmarc=pass (policy=none) header.from=linux.alibaba.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1744297883; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=QekY4r+1ekyzezfOqZ2L++WB3a0HUkST9Br2kULSuMM=; b=B4DfKi0GJEW13bQbiZ7muDhGg20odYtjtWGFKmiylqC5iJHKp86pf5c6LXOaX5PyQ7YFKJ Q3ACVvoQpXtjiS4kG/GP1Q2mEmmD1gOb1HNgJO4rmGKrSx3GW4ZLRdkXAZENJCI/Rk1/AE pcGWzLBJQI1mOZ0434EJkHLCgT2goyg= ARC-Authentication-Results: i=1; imf02.hostedemail.com; dkim=pass header.d=linux.alibaba.com header.s=default header.b=NNMVSCRD; spf=pass (imf02.hostedemail.com: domain of jefflexu@linux.alibaba.com designates 115.124.30.119 as permitted sender) smtp.mailfrom=jefflexu@linux.alibaba.com; dmarc=pass (policy=none) header.from=linux.alibaba.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1744297883; a=rsa-sha256; cv=none; b=N+Sdy6A9OLXHdPIbKm79LcydtPhJ4CGw+Gcu4SCwUBUza2rDuOApcwreTjG+GMEz+c35L/ j0ZZMRt66wmywKn/O9vOawWofuwgi9yCAE2SeMFQtY3v1VWwJtQTHi/RBLuYgY69cHgssZ TO0MVL3fNXmUyRKlucp5Jn43C2rNl9I= DKIM-Signature:v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.alibaba.com; s=default; t=1744297877; h=Message-ID:Date:MIME-Version:Subject:To:From:Content-Type; bh=QekY4r+1ekyzezfOqZ2L++WB3a0HUkST9Br2kULSuMM=; b=NNMVSCRDg5tVWFI5OGLQe9kMe4Pp5U1G+y7YXM8WjmiOfr2uix2/Qmg59v8af+5Q9Z0J90NvW01j9PQVca6ehhWSnyLzzVa1Q1jRdRjRzz8x0ElzvXONmCgMP+RjhIT5gZ1lEmCC1rPkRckXIt+1Vg3r5r2gXyqNGydBUNQ6Nas= Received: from 192.168.31.58(mailfrom:jefflexu@linux.alibaba.com fp:SMTPD_---0WWPSFxj_1744297875 cluster:ay36) by smtp.aliyun-inc.com; Thu, 10 Apr 2025 23:11:16 +0800 Message-ID: <9a3cfb55-faae-4551-9bef-b9650432848a@linux.alibaba.com> Date: Thu, 10 Apr 2025 23:11:15 +0800 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree To: Joanne Koong Cc: miklos@szeredi.hu, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, shakeel.butt@linux.dev, david@redhat.com, bernd.schubert@fastmail.fm, ziy@nvidia.com, jlayton@kernel.org, kernel-team@meta.com, Miklos Szeredi References: <20250404181443.1363005-1-joannelkoong@gmail.com> <20250404181443.1363005-4-joannelkoong@gmail.com> <7e9b1a40-4708-42a8-b8fc-44fa50227e5b@linux.alibaba.com> Content-Language: en-US From: Jingbo Xu In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Rspamd-Queue-Id: D9C8280014 X-Rspam-User: X-Rspamd-Server: rspam02 X-Stat-Signature: fthoxixht9d3q5xhi4ecoha1qgsxf6j1 X-HE-Tag: 1744297881-473898 X-HE-Meta: U2FsdGVkX1+bItsvbXJX5m6sHa2oTHj775Iljr0iaD0nuZGnPxfi4B+gDFhcE/AyYGKBnoEP2DlENzelXLZwA8J/uCIhrXmcCOzUOV1GX5UMa94leOWZWYmHPRBH3/MmV3wvVxUX329OYjx5mm9aLdfTPJjqnCouHBt2sA6OYRf+WPZSkz3ImTfLVQeucyHBqAXEGbmRAzNhGrV3VqL+QOBw2/HsRgYP7UJQT2OcuSbqrT4DGkr9tt7dlLdp9t1jk9czEGIRSUnES6smIMGti+VpNY+eWJU485sZK7SyC+k87n4dUB237k5aCFJfK5QVgw25dnYcup1nIxUY/fCinlbupjrXryHQzUQ41hD1Eiz1HrqZi4cOcfyDa8auOykv7CidRxcmqOnSPwPYttmv0k+OPei3xOOIsiD5nQqLgBz3Zmf2CfbWFLKSQsqlcDFnQmYfDSqbsn86adNIi1a7TtRQUoANU+XJtxW0ZhgVKspinTJHMFBp/Lm7u/OA0nKJWqiTRH3RH7niw50MHiYjoygPXD1X7jmW+Ok8cHRZD+jDjCiplT/FEGgYAG3SWJr3cFF9tMktnfDJxsGYPswL6G1UBeurR8gWuKwwJqcVpCGbVBB5h8MIr0mRW/EURgkryrSQ9tYCIzyiclLneHnTtsAu/51IMZW1//Bs+u4QFtIk+yJwkchhX6UvkkjC9S2PTyNNz3Ih7VDI7R/bluc2wAo/U5m4ebIPtmTYssNiCPEpTqLOeRMl0fWFVAF7B321eXixV8cJMjiBIA0yvmm4sDvm5Uuz06fia/4XCicaHH4QSNTANG6lGFuIuFbHhbxeH/emM1f2k0IOU94k9majziHAxB+NF+kbEN3GCr1fEarXZ7edLTB+4VulZftVHDczRayD8HG9tpbTBInFodG9IkOAiG/hOGll X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 4/10/25 11:07 PM, Joanne Koong wrote: > On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu wrote: >> >> >> >> On 4/10/25 7:47 AM, Joanne Koong wrote: >>> On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu wrote: >>>> >>>> Hi Joanne, >>>> >>>> On 4/5/25 2:14 AM, Joanne Koong wrote: >>>>> In the current FUSE writeback design (see commit 3be5a52b30aa >>>>> ("fuse: support writable mmap")), a temp page is allocated for every >>>>> dirty page to be written back, the contents of the dirty page are copied over >>>>> to the temp page, and the temp page gets handed to the server to write back. >>>>> >>>>> This is done so that writeback may be immediately cleared on the dirty page, >>>>> and this in turn is done in order to mitigate the following deadlock scenario >>>>> that may arise if reclaim waits on writeback on the dirty page to complete: >>>>> * single-threaded FUSE server is in the middle of handling a request >>>>> that needs a memory allocation >>>>> * memory allocation triggers direct reclaim >>>>> * direct reclaim waits on a folio under writeback >>>>> * the FUSE server can't write back the folio since it's stuck in >>>>> direct reclaim >>>>> >>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates >>>>> the situations described above, FUSE writeback does not need to use >>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings. >>>>> >>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings >>>>> and removes the temporary pages + extra copying and the internal rb >>>>> tree. >>>>> >>>>> fio benchmarks -- >>>>> (using averages observed from 10 runs, throwing away outliers) >>>>> >>>>> Setup: >>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount >>>>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount >>>>> >>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G >>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount >>>>> >>>>> bs = 1k 4k 1M >>>>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s >>>>> After 341 MiB/s 2246 MiB/s 2685 MiB/s >>>>> % diff -3% 23% 45% >>>>> >>>>> Signed-off-by: Joanne Koong >>>>> Reviewed-by: Jingbo Xu >>>>> Acked-by: Miklos Szeredi >>>> >>> >>> Hi Jingbo, >>> >>> Thanks for sharing your analysis for this. >>> >>>> Overall this patch LGTM. >>>> >>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is >>>> also unneeded then, at least the DIRECT IO routine (i.e. >>> >>> I took a look at fi->writectr and fi->queued_writes and my >>> understanding is that we do still need this. For example, for >>> truncates (I'm looking at fuse_do_setattr()), I think we still need to >>> prevent concurrent writeback or else the setattr request and the >>> writeback request could race which would result in a mismatch between >>> the file's reported size and the actual data written to disk. >> >> I haven't looked into the truncate routine yet. I will see it later. >> >>> >>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore. That is >>>> because after removing the temp page, the DIRECT IO routine has already >>>> been waiting for all inflight WRITE requests, see >>>> >>>> # DIRECT read >>>> generic_file_read_iter >>>> kiocb_write_and_wait >>>> filemap_write_and_wait_range >>> >>> Where do you see generic_file_read_iter() getting called for direct io reads? >> >> # DIRECT read >> fuse_file_read_iter >> fuse_cache_read_iter >> generic_file_read_iter >> kiocb_write_and_wait >> filemap_write_and_wait_range >> a_ops->direct_IO(),i.e. fuse_direct_IO() >> > > Oh I see, I thought files opened with O_DIRECT automatically call the > .direct_IO handler for reads/writes but you're right, it first goes > through .read_iter / .write_iter handlers, and the .direct_IO handler > only gets invoked through generic_file_read_iter() / > generic_file_direct_write() in mm/filemap.c > > There's two paths for direct io in FUSE: > a) fuse server sets fi->direct_io = true when a file is opened, which > will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side > b) fuse server doesn't set fi->direct_io = true, but the client opens > the file with O_DIRECT > > We only go through the stack trace you listed above for the b) case. > For the a) case, we'll hit > > if (ff->open_flags & FOPEN_DIRECT_IO) > return fuse_direct_read_iter(iocb, to); > > and > > if (ff->open_flags & FOPEN_DIRECT_IO) > return fuse_direct_write_iter(iocb, from); > > which will invoke fuse_direct_IO() / fuse_direct_io() without going > through the kiocb_write_and_wait() -> filemap_write_and_wait_range() / > kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed > above. > > So for the a) case I think we'd still need the fuse_sync_writes() in > case there's still pending writeback. > > Do you agree with this analysis or am I missing something here? Yeah, that's true. But instead of calling fuse_sync_writes(), we can call filemap_wait_range() or something similar here. >> filp_close() >> filp_flush() >> filp->f_op->flush() >> fuse_flush() >> write_inode_now >> writeback_single_inode(WB_SYNC_ALL) >> do_writepages >> # flush dirty page >> filemap_fdatawait >> # wait for WRITE completion > > Nice. I missed that write_inode_now() will invoke filemap_fdatawait(). > This seems pretty straightforward. I'll remove the fuse_sync_writes() > call in fuse_flush() when I send out v8. > > The direct io one above is less straight-forward. I won't add that to > v8 but that can be done in a separate future patch when we figure that > out. Thanks for keep working on this. Appreciated. -- Thanks, Jingbo