From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 973E8C369AB for ; Tue, 15 Apr 2025 07:49:16 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 2E4BC2801CD; Tue, 15 Apr 2025 03:49:15 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 291222800C7; Tue, 15 Apr 2025 03:49:15 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 10E8E2801CD; Tue, 15 Apr 2025 03:49:15 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id E126F2800C7 for ; Tue, 15 Apr 2025 03:49:14 -0400 (EDT) Received: from smtpin06.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 59CD95DD21 for ; Tue, 15 Apr 2025 07:49:15 +0000 (UTC) X-FDA: 83335502670.06.E12F450 Received: from out30-111.freemail.mail.aliyun.com (out30-111.freemail.mail.aliyun.com [115.124.30.111]) by imf19.hostedemail.com (Postfix) with ESMTP id 391561A000D for ; Tue, 15 Apr 2025 07:49:11 +0000 (UTC) Authentication-Results: imf19.hostedemail.com; dkim=pass header.d=linux.alibaba.com header.s=default header.b=uFvZ1YOf; spf=pass (imf19.hostedemail.com: domain of jefflexu@linux.alibaba.com designates 115.124.30.111 as permitted sender) smtp.mailfrom=jefflexu@linux.alibaba.com; dmarc=pass (policy=none) header.from=linux.alibaba.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1744703353; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=zLPR7liFdTw1eqeX46Q2j532O9oTxRIw1mmcFVNsuFE=; b=vgqL0IQsrj251pIsQYIPp9uwfgQQYJLDSoLcStCmnBYNMDj5GuMYFK6GykpZrL/kGTgYaY 20JuSNE+oZjhjlKRfMv0b0YFmsm+zO4ypnE/XvSn5OykCKNu+nO9kRkASZkOyvOCpr9yZM JlzvoFUMgOCI0IqaGnx8Nv3uoExDmvQ= ARC-Authentication-Results: i=1; imf19.hostedemail.com; dkim=pass header.d=linux.alibaba.com header.s=default header.b=uFvZ1YOf; spf=pass (imf19.hostedemail.com: domain of jefflexu@linux.alibaba.com designates 115.124.30.111 as permitted sender) smtp.mailfrom=jefflexu@linux.alibaba.com; dmarc=pass (policy=none) header.from=linux.alibaba.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1744703353; a=rsa-sha256; cv=none; b=fyM8oTumGq4E3ZUuq14ZwfCxYkStyvQrCP2bI+989WhGNnGbFZsourC9wM/g2cO/88zyG9 RBWEAxZ+S3l8UDJqfbBztBPeAXwd8MOqKJ2T+UfQ4ufgE54dVOhdnbawiB2lWXhLibl+4i ypMjBq5r3/TehfbSYnQgh6EoKbB8KkY= DKIM-Signature:v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.alibaba.com; s=default; t=1744703349; h=Message-ID:Date:MIME-Version:Subject:To:From:Content-Type; bh=zLPR7liFdTw1eqeX46Q2j532O9oTxRIw1mmcFVNsuFE=; b=uFvZ1YOfH8ek7oio0U1fPpwXwfUPHiYzC4yjGgOlYYknnKYendO2dxLWMWC6j2ry9sMjhqboppRkUYYlc0OmDHKgUckcfKJZS6F9yMAOpqfu4Xi8gY/6+zUOxwsEg0fgL6Qgd8FiDInQEIfyZ3t/XpfkYFSVmxOHIUbT0RGfEI4= Received: from 30.221.145.234(mailfrom:jefflexu@linux.alibaba.com fp:SMTPD_---0WX4-mjH_1744703346 cluster:ay36) by smtp.aliyun-inc.com; Tue, 15 Apr 2025 15:49:07 +0800 Message-ID: Date: Tue, 15 Apr 2025 15:49:06 +0800 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree To: Joanne Koong Cc: miklos@szeredi.hu, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, shakeel.butt@linux.dev, david@redhat.com, bernd.schubert@fastmail.fm, ziy@nvidia.com, jlayton@kernel.org, kernel-team@meta.com, Miklos Szeredi References: <20250404181443.1363005-1-joannelkoong@gmail.com> <20250404181443.1363005-4-joannelkoong@gmail.com> <7e9b1a40-4708-42a8-b8fc-44fa50227e5b@linux.alibaba.com> <9a3cfb55-faae-4551-9bef-b9650432848a@linux.alibaba.com> Content-Language: en-US From: Jingbo Xu In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Rspamd-Queue-Id: 391561A000D X-Rspamd-Server: rspam05 X-Rspam-User: X-Stat-Signature: 4seaw69drekte74d9g9zhhi3hnix86hk X-HE-Tag: 1744703351-573447 X-HE-Meta: U2FsdGVkX1+YEJhBkch/irP2luFS5lWnripL0xHr0RzFPEbopFuuf7+pOPnlyiHbN36WEcOHgkjedQQ2shFB7cs9Y6VJOkLjkZTCFqwz8FU7fjoIDC+VgmWltCjzilki3g8qM0YjLlfMkHEDyXFOrY1ZIfQwSADFe+61awClGOfbn6XxCc9GKfOcONu2Z3iCCKSgAgehHP1eEd2bU4dBle8Zggxwnzk3dv10QAI75Gk0eCToJaWnj9rQ8gURHukqpKxJDYg6fIrrrvl52MzAOCCsLko4a6BNdqBc0NC6RD6rvARJPPWSEUutr5XVPDErgW1uNpUc9FFdUXmQzTO1qNn1KijbjQsu9FlL1+ZoCZJQGZGx4vypqjQHzzxPHZHPPFIk7ghzcT5mzqxbRcAKwzZ6z/TNuSTitwROt+Vj3GSkDZdWAX465xAoWzTnnZH5V0r42Bwek3pFbnJqHf1WLRiTS0omUjbVJbdKVqTVxpiOgWXshRQ2peAIpybZshTQtP5HJbqQczYOv/9mvPIjY9qBvV9bx9A/dXPgwNhVYwIE1tkWMuBvtbRyW5RV8E2LqdVnzoZyKLmXTJtWTPYb/MipBh7jI9aqQSXoTDwVpm9wSdDlbJRNUK7iV1Lis6mTAVmglMJFFGmgzox4jGKI3o0iiV4QSIqgze9zpaVROdTYcX27ZHPcnZmHx7FWB35uERtWNI1z7qWWkQxrlx6CUQsfgOOal/eUDKDR48Vo5ButZLHrIO8P4KKwfeCdPjAym+6y1VuxuEfpIx/0M1xw4zRp5EZNuC1OaUIbtz3PpLxPdWFV3BPgxPJE4KKBQ0Mn66MY36VZSgreKIlVFWwOcixEzLJJtsiTbaVj8DeO8j+GA0e35vJOoI6ib9Nxk5RRYOQSyAunCan6zyhYrsMn2vK3Q5+t0ag7K4uGuW/xwdJbM9hnsNDFCk/YgvFaiw27R9eE6RE6vP8eRzIYpcB pDDUFAAn ry/6VywtP/yMwaggLej791oNk9O3w8zT76cdlK1hQZTlZbXY+nYY+CWf7NJ/UDnPIFXw3vzcmG+s+9GIZA/U7HBILmY1FnNQZv9WBZDO79Rd6NFZfssBlDoaNh15GcKOAs+HSRpOijlEZx85gUjd4HyzTSfc6szUrZ/q5k3iaHDcpyC3mVcciW6Ad+oNgtI5bXeXkFLo0bf6b0v2VA+LLa8QqgA5brW1VvI/mNGATZaP3tTB8qFt7/5bRUX/jeMfDmWcvF093pfrblaZVjazv4jxccsQGik6F+rFF4emr8X62dflG1j+Wjpmxp2vl8K6I0sMBh3RNkIj5qfqA1yKzuE/xDv38J1yFaQUl X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Hi Joanne, Sorry for the late reply... On 4/11/25 12:11 AM, Joanne Koong wrote: > On Thu, Apr 10, 2025 at 8:11 AM Jingbo Xu wrote: >> >> On 4/10/25 11:07 PM, Joanne Koong wrote: >>> On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu wrote: >>>> >>>> >>>> >>>> On 4/10/25 7:47 AM, Joanne Koong wrote: >>>>> On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu wrote: >>>>>> >>>>>> Hi Joanne, >>>>>> >>>>>> On 4/5/25 2:14 AM, Joanne Koong wrote: >>>>>>> In the current FUSE writeback design (see commit 3be5a52b30aa >>>>>>> ("fuse: support writable mmap")), a temp page is allocated for every >>>>>>> dirty page to be written back, the contents of the dirty page are copied over >>>>>>> to the temp page, and the temp page gets handed to the server to write back. >>>>>>> >>>>>>> This is done so that writeback may be immediately cleared on the dirty page, >>>>>>> and this in turn is done in order to mitigate the following deadlock scenario >>>>>>> that may arise if reclaim waits on writeback on the dirty page to complete: >>>>>>> * single-threaded FUSE server is in the middle of handling a request >>>>>>> that needs a memory allocation >>>>>>> * memory allocation triggers direct reclaim >>>>>>> * direct reclaim waits on a folio under writeback >>>>>>> * the FUSE server can't write back the folio since it's stuck in >>>>>>> direct reclaim >>>>>>> >>>>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates >>>>>>> the situations described above, FUSE writeback does not need to use >>>>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings. >>>>>>> >>>>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings >>>>>>> and removes the temporary pages + extra copying and the internal rb >>>>>>> tree. >>>>>>> >>>>>>> fio benchmarks -- >>>>>>> (using averages observed from 10 runs, throwing away outliers) >>>>>>> >>>>>>> Setup: >>>>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount >>>>>>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount >>>>>>> >>>>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G >>>>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount >>>>>>> >>>>>>> bs = 1k 4k 1M >>>>>>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s >>>>>>> After 341 MiB/s 2246 MiB/s 2685 MiB/s >>>>>>> % diff -3% 23% 45% >>>>>>> >>>>>>> Signed-off-by: Joanne Koong >>>>>>> Reviewed-by: Jingbo Xu >>>>>>> Acked-by: Miklos Szeredi >>>>>> >>>>> >>>>> Hi Jingbo, >>>>> >>>>> Thanks for sharing your analysis for this. >>>>> >>>>>> Overall this patch LGTM. >>>>>> >>>>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is >>>>>> also unneeded then, at least the DIRECT IO routine (i.e. >>>>> >>>>> I took a look at fi->writectr and fi->queued_writes and my >>>>> understanding is that we do still need this. For example, for >>>>> truncates (I'm looking at fuse_do_setattr()), I think we still need to >>>>> prevent concurrent writeback or else the setattr request and the >>>>> writeback request could race which would result in a mismatch between >>>>> the file's reported size and the actual data written to disk. >>>> >>>> I haven't looked into the truncate routine yet. I will see it later. >>>> >>>>> >>>>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore. That is >>>>>> because after removing the temp page, the DIRECT IO routine has already >>>>>> been waiting for all inflight WRITE requests, see >>>>>> >>>>>> # DIRECT read >>>>>> generic_file_read_iter >>>>>> kiocb_write_and_wait >>>>>> filemap_write_and_wait_range >>>>> >>>>> Where do you see generic_file_read_iter() getting called for direct io reads? >>>> >>>> # DIRECT read >>>> fuse_file_read_iter >>>> fuse_cache_read_iter >>>> generic_file_read_iter >>>> kiocb_write_and_wait >>>> filemap_write_and_wait_range >>>> a_ops->direct_IO(),i.e. fuse_direct_IO() >>>> >>> >>> Oh I see, I thought files opened with O_DIRECT automatically call the >>> .direct_IO handler for reads/writes but you're right, it first goes >>> through .read_iter / .write_iter handlers, and the .direct_IO handler >>> only gets invoked through generic_file_read_iter() / >>> generic_file_direct_write() in mm/filemap.c >>> >>> There's two paths for direct io in FUSE: >>> a) fuse server sets fi->direct_io = true when a file is opened, which >>> will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side >>> b) fuse server doesn't set fi->direct_io = true, but the client opens >>> the file with O_DIRECT >>> >>> We only go through the stack trace you listed above for the b) case. >>> For the a) case, we'll hit >>> >>> if (ff->open_flags & FOPEN_DIRECT_IO) >>> return fuse_direct_read_iter(iocb, to); >>> >>> and >>> >>> if (ff->open_flags & FOPEN_DIRECT_IO) >>> return fuse_direct_write_iter(iocb, from); >>> >>> which will invoke fuse_direct_IO() / fuse_direct_io() without going >>> through the kiocb_write_and_wait() -> filemap_write_and_wait_range() / >>> kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed >>> above. >>> >>> So for the a) case I think we'd still need the fuse_sync_writes() in >>> case there's still pending writeback. >>> >>> Do you agree with this analysis or am I missing something here? >> >> Yeah, that's true. But instead of calling fuse_sync_writes(), we can >> call filemap_wait_range() or something similar here. >> > > Agreed. Actually, the more I look at this, the more I think we can > replace all fuse_sync_writes() and get rid of it entirely. I have seen your latest reply that this cleaning up won't be included in this series, which is okay. > fuse_sync_writes() is called in: > > fuse_fsync(): > /* > * Start writeback against all dirty pages of the inode, then > * wait for all outstanding writes, before sending the FSYNC > * request. > */ > err = file_write_and_wait_range(file, start, end); > if (err) > goto out; > > fuse_sync_writes(inode); > > /* > * Due to implementation of fuse writeback > * file_write_and_wait_range() does not catch errors. > * We have to do this directly after fuse_sync_writes() > */ > err = file_check_and_advance_wb_err(file); > if (err) > goto out; > > > We can get rid of the fuse_sync_writes() and > file_check_and_advance_wb_err() entirely since now without temp pages, > the file_write_and_wait_range() call actually ensures that writeback > is completed > > > > fuse_writeback_range(): > static int fuse_writeback_range(struct inode *inode, loff_t > start, loff_t end) > { > int err = > filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX); > > if (!err) > fuse_sync_writes(inode); > > return err; > } > > > We can replace fuse_writeback_range() entirely with > filemap_write_and_wait_range(). > > > > fuse_direct_io(): > if (fopen_direct_io && fc->direct_io_allow_mmap) { > res = filemap_write_and_wait_range(mapping, pos, pos + > count - 1); > if (res) { > fuse_io_free(ia); > return res; > } > } > if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + > count - 1))) { > if (!write) > inode_lock(inode); > fuse_sync_writes(inode); > if (!write) > inode_unlock(inode); > } > > > I think this can just replaced with > if (fopen_direct_io && (fc->direct_io_allow_mmap || !cuse)) { > res = filemap_write_and_wait_range(mapping, > pos, pos + count - 1); > if (res) { > fuse_io_free(ia); > return res; > } > } Alright. But I would prefer doing this filemap_write_and_wait_range() in fuse_direct_write_iter() rather than fuse_direct_io() if possible. > since for the !fopen_direct_io case, it will already go through > filemap_write_and_wait_range(), as you mentioned in your previous > message. I think this also fixes a bug (?) in the original code - in > the fopen_direct_io && !fc->direct_io_allow_mmap case, I think we > still need to write out dirty pages first, which we don't currently > do. Nope. In case of fopen_direct_io && !fc->direct_io_allow_mmap, there won't be any page cache at all, right? -- Thanks, Jingbo