From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id DF462C369A2 for ; Thu, 10 Apr 2025 02:12:44 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 6FBF06B02B8; Wed, 9 Apr 2025 22:12:42 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 683746B02B9; Wed, 9 Apr 2025 22:12:42 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 524676B02BA; Wed, 9 Apr 2025 22:12:42 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 315CD6B02B8 for ; Wed, 9 Apr 2025 22:12:42 -0400 (EDT) Received: from smtpin29.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id CE4D95395B for ; Thu, 10 Apr 2025 02:12:43 +0000 (UTC) X-FDA: 83316510606.29.7E502EB Received: from out30-97.freemail.mail.aliyun.com (out30-97.freemail.mail.aliyun.com [115.124.30.97]) by imf01.hostedemail.com (Postfix) with ESMTP id 6137C40007 for ; Thu, 10 Apr 2025 02:12:37 +0000 (UTC) Authentication-Results: imf01.hostedemail.com; dkim=pass header.d=linux.alibaba.com header.s=default header.b="h/PVKnA/"; spf=pass (imf01.hostedemail.com: domain of jefflexu@linux.alibaba.com designates 115.124.30.97 as permitted sender) smtp.mailfrom=jefflexu@linux.alibaba.com; dmarc=pass (policy=none) header.from=linux.alibaba.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1744251162; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=R9121b9yjqO2sKle5RkcrFD3eZ/Qsdfxici1u6jpN8k=; b=ZBGaaodxIIwBtRTYMU6XAAxQAL/FxFJ6omt0gvk7/Dw/QXaaXCaffTn1M8IcCt4Xriq/6q rigQVum4TLO2H881AogvftZcHqgXCRxdlEqbiqsaSyh3kqzmXPLilSpKHD7p2R30rMveeK QfVe2siayOBgZQmTkg2ztuD4nhgXr5w= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1744251162; a=rsa-sha256; cv=none; b=6lhDWcPhTqSeH0qGnyUagsk1z0vlI3ePS6ZlQeI7JVzkTIOiTQvIC8DAdtl0LVKbyg8dv/ P2WcqU60qeSonirehE0NyC/OynyVmcBVbDV1mEKCCFXGt0/fvxBxFPPhwJGpyY79fvonu6 Ux+LNQFGhP4Lg6LNWZwRAlkuQqRJRrE= ARC-Authentication-Results: i=1; imf01.hostedemail.com; dkim=pass header.d=linux.alibaba.com header.s=default header.b="h/PVKnA/"; spf=pass (imf01.hostedemail.com: domain of jefflexu@linux.alibaba.com designates 115.124.30.97 as permitted sender) smtp.mailfrom=jefflexu@linux.alibaba.com; dmarc=pass (policy=none) header.from=linux.alibaba.com DKIM-Signature:v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.alibaba.com; s=default; t=1744251155; h=Message-ID:Date:MIME-Version:Subject:To:From:Content-Type; bh=R9121b9yjqO2sKle5RkcrFD3eZ/Qsdfxici1u6jpN8k=; b=h/PVKnA/9uap2SKbGY3JE0tIksGJMBRXARVJm2F5tSekJ2q3jdOxQLXI8YCwQer3Umi/o39Y6m+TwNQ0+NjmM4nFax45xsuSSbmpWa9TGEphTaHhAC9fO8kN7kmlTtGVZjb3PE6kHb30UEDWbeT/VkdNKNnI6kGvyfuBDNMVtA4= Received: from 30.222.18.156(mailfrom:jefflexu@linux.alibaba.com fp:SMTPD_---0WWMTlsK_1744251153 cluster:ay36) by smtp.aliyun-inc.com; Thu, 10 Apr 2025 10:12:33 +0800 Message-ID: <7e9b1a40-4708-42a8-b8fc-44fa50227e5b@linux.alibaba.com> Date: Thu, 10 Apr 2025 10:12:32 +0800 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree To: Joanne Koong Cc: miklos@szeredi.hu, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, shakeel.butt@linux.dev, david@redhat.com, bernd.schubert@fastmail.fm, ziy@nvidia.com, jlayton@kernel.org, kernel-team@meta.com, Miklos Szeredi References: <20250404181443.1363005-1-joannelkoong@gmail.com> <20250404181443.1363005-4-joannelkoong@gmail.com> Content-Language: en-US From: Jingbo Xu In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Stat-Signature: 6k6mai173uu3boa4w9nkiihdemqx9x4h X-Rspam-User: X-Rspamd-Queue-Id: 6137C40007 X-Rspamd-Server: rspam08 X-HE-Tag: 1744251157-895867 X-HE-Meta: U2FsdGVkX19vQ6GQGhTkukq7n3mlcMkp55a1JDFSkEvdXe9+mHIVeQeWtShPZX+vaTaZ1smR4HUlvBIEpgQsNJ9i0O7gnBVaKCO8F8uQapY1dJwjb06WZt7cqxeUQEGEfnuLy3fb52VwD6+SmTViNwBW0fRIVodnWIEo+Md6GT0mMAhuXz4pJGsqVnRuLdkX9GdXHEAO0HwfiLJLmhEhj7RC3xC9Ebo41bzVklXdxcIrN/eaUQcMcN8mnXJyFBtkKi0IfFYggYEnVgHupbOjAhgiKKe4kYH7b8UpJo/MIFk5VO+m68AMNT/cJGHnXJ6eiBokidDUUeCodRVSBax6M7tI7UCOws+W6SWLTdzV95+T3Gr88LZdexB8yoK1LJVfkrgrdmBFtpB6xpbyTz/ls0ToWUqSlpzQEyt4abtZ7BtRyrLnSGq33xEVP9tVpf29N5l3rWjt5zq4NCCHqmCSRvu1F6qNbLVCFLDeo1YMkk2RwDE31qwV9mPVpwiKKFR92ImTuceS4Yh3j6nELQQ4i9EzyriBW4Y4y4JeNVSpXGBp8tZI3IFNZeAuSZsBOC0Jbz3Xk/MWpHqBTGyYVhkM0MZiwSFOJOzuHa7ZCpGA6gBNLsstiBmS33H4FWBcTR4tXhzG7BbynT1LiGZaZ2n0KJPcsNqicaPo7/wS9YymBtcy+r/G+q+n7cJB6Nm0IE+tEt9DOTZl6/S6+b2DdnA31r4P11eqMm2zGrKdYIOYgZ9XuiaAPY2scxuCaCtq/dXJH5F0RQyX7wRYDyHcONl7tSmBeEDk6itIcQnEQF46Gj14LB1jfPQ8OgqUZwGTNGNzUxiE9uSdFmFnHRPs43qeYVJ52e+hOjoSCJ6EfRvTEPi8/wtgaKWlDrTHXMeU64Y7tLYq0FUg3eCetwGIl5isTcvE3N/ai2fKw59aQz6oA0tNwzX8DzobfyLzUL3uu6jkRLAaGdmY0I4/09LiKLb DwwGO+ls W91GlH8eJFT9v/6uLnar7FnptBOhylSr8KYN9O3lws1Ub5B+ZJIhXnhkGfzhQZad1wD648MW86uKX90lvUSpDjv+eCM+JegcndNCDRq5N87DJNV9rz9ZCMA0JNMaZ6ZtIbToWH5qe2+jwI7P0lgNZD4BTBojSEszPhcqaCbnAdCGJIi0Mtazf06mDvh+F07NEYDlsIi7u6eR7iV63WkkTBMIFhTOWB0iIY+jZWAcKdWVnagnC+G1vdGQPzKz7JyGuRWRqbonjKXAEpn3U5oXsbfoinvIkQ4CXqb4lFh6r4q9o3d24Mw+OLhOuUlCGFsN2HjCW9PIjmzChXijAZh5PvZbAs70ALOrBlHao X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 4/10/25 7:47 AM, Joanne Koong wrote: > On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu wrote: >> >> Hi Joanne, >> >> On 4/5/25 2:14 AM, Joanne Koong wrote: >>> In the current FUSE writeback design (see commit 3be5a52b30aa >>> ("fuse: support writable mmap")), a temp page is allocated for every >>> dirty page to be written back, the contents of the dirty page are copied over >>> to the temp page, and the temp page gets handed to the server to write back. >>> >>> This is done so that writeback may be immediately cleared on the dirty page, >>> and this in turn is done in order to mitigate the following deadlock scenario >>> that may arise if reclaim waits on writeback on the dirty page to complete: >>> * single-threaded FUSE server is in the middle of handling a request >>> that needs a memory allocation >>> * memory allocation triggers direct reclaim >>> * direct reclaim waits on a folio under writeback >>> * the FUSE server can't write back the folio since it's stuck in >>> direct reclaim >>> >>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates >>> the situations described above, FUSE writeback does not need to use >>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings. >>> >>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings >>> and removes the temporary pages + extra copying and the internal rb >>> tree. >>> >>> fio benchmarks -- >>> (using averages observed from 10 runs, throwing away outliers) >>> >>> Setup: >>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount >>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount >>> >>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G >>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount >>> >>> bs = 1k 4k 1M >>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s >>> After 341 MiB/s 2246 MiB/s 2685 MiB/s >>> % diff -3% 23% 45% >>> >>> Signed-off-by: Joanne Koong >>> Reviewed-by: Jingbo Xu >>> Acked-by: Miklos Szeredi >> > > Hi Jingbo, > > Thanks for sharing your analysis for this. > >> Overall this patch LGTM. >> >> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is >> also unneeded then, at least the DIRECT IO routine (i.e. > > I took a look at fi->writectr and fi->queued_writes and my > understanding is that we do still need this. For example, for > truncates (I'm looking at fuse_do_setattr()), I think we still need to > prevent concurrent writeback or else the setattr request and the > writeback request could race which would result in a mismatch between > the file's reported size and the actual data written to disk. I haven't looked into the truncate routine yet. I will see it later. > >> fuse_direct_io()) doesn't need fuse_sync_writes() anymore. That is >> because after removing the temp page, the DIRECT IO routine has already >> been waiting for all inflight WRITE requests, see >> >> # DIRECT read >> generic_file_read_iter >> kiocb_write_and_wait >> filemap_write_and_wait_range > > Where do you see generic_file_read_iter() getting called for direct io reads? # DIRECT read fuse_file_read_iter fuse_cache_read_iter generic_file_read_iter kiocb_write_and_wait filemap_write_and_wait_range a_ops->direct_IO(),i.e. fuse_direct_IO() > Similarly, where do you see generic_file_write_iter() getting called > for direct io writes? # DIRECT read fuse_file_write_iter fuse_cache_write_iter generic_file_write_iter generic_file_direct_write kiocb_invalidate_pages filemap_invalidate_pages filemap_write_and_wait_range a_ops->direct_IO(),i.e. fuse_direct_IO() > Where do you see fi->writectr / fi->queued-writes preventing this > race? IMO overall fi->writectr / fi->queued-writes are introduced to prevent DIRECT IO and writeback from sending duplicate (inflight) WRITE requests for the same page. For the DIRECT write routine: # non-FOPEN_DIRECT_IO DIRECT write fuse_cache_write_iter fuse_direct_IO fuse_direct_io fuse_sync_writes # FOPEN_DIRECT_IO DIRECT write fuse_direct_write_iter fuse_direct_IO fuse_direct_io fuse_sync_writes For the writeback routine: fuse_writepages() fuse_writepages_fill fuse_writepages_send # buffer the WRITE request in queued_writes list fuse_flush_writepages # flush WRITE only when fi->writectr >= 0 > It looks to me like in the existing code, this race condition > you described of direct write invalidating the page cache, then > another buffer write reads the page cache and dirties it, then > writeback is called on that, and the 2 write requests racing, could > still happen? > > >> However it seems that the writeback >> won't wait for previous inflight DIRECT WRITE requests, so I'm not much >> sure about that. Maybe other folks could offer more insights... > > My understanding is that these lines > > if (!cuse && filemap_range_has_writeback(...)) { > ... > fuse_sync_writes(inode); > ... > } > > in fuse_direct_io() is what waits on previous inflight direct write > requests to complete before the direct io happens. Right. > > >> >> Also fuse_sync_writes() is not needed in fuse_flush() anymore, with >> which I'm pretty sure. > > Why don't we still need this for fuse_flush()? > > If a caller calls close(), this will call > > filp_close() > filp_flush() > filp->f_op->flush() > fuse_flush() > > it seems like we should still be waiting for all writebacks to finish > before sending the fuse server the fuse_flush request, no? > filp_close() filp_flush() filp->f_op->flush() fuse_flush() write_inode_now writeback_single_inode(WB_SYNC_ALL) do_writepages # flush dirty page filemap_fdatawait # wait for WRITE completion >> >>> --- >>> fs/fuse/file.c | 360 ++++------------------------------------------- >>> fs/fuse/fuse_i.h | 3 - >>> 2 files changed, 28 insertions(+), 335 deletions(-) >>> >>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c >>> index 754378dd9f71..91ada0208863 100644 >>> --- a/fs/fuse/file.c >>> +++ b/fs/fuse/file.c >>> @@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id) >>> >>> struct fuse_writepage_args { >>> struct fuse_io_args ia; >>> - struct rb_node writepages_entry; >>> struct list_head queue_entry; >>> - struct fuse_writepage_args *next; >>> struct inode *inode; >>> struct fuse_sync_bucket *bucket; >>> }; >>> >>> -static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi, >>> - pgoff_t idx_from, pgoff_t idx_to) >>> -{ >>> - struct rb_node *n; >>> - >>> - n = fi->writepages.rb_node; >>> - >>> - while (n) { >>> - struct fuse_writepage_args *wpa; >>> - pgoff_t curr_index; >>> - >> >> -- >> Thanks, >> Jingbo -- Thanks, Jingbo