From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id BA5DCC369A2 for ; Thu, 10 Apr 2025 15:07:44 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 1A49928010B; Thu, 10 Apr 2025 11:07:43 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 12C53280109; Thu, 10 Apr 2025 11:07:43 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id EC2A728010B; Thu, 10 Apr 2025 11:07:42 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id C54ED280109 for ; Thu, 10 Apr 2025 11:07:42 -0400 (EDT) Received: from smtpin17.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id 751431A0EFB for ; Thu, 10 Apr 2025 15:07:43 +0000 (UTC) X-FDA: 83318463606.17.9FEAC16 Received: from mail-lf1-f43.google.com (mail-lf1-f43.google.com [209.85.167.43]) by imf18.hostedemail.com (Postfix) with ESMTP id 67BFF1C000D for ; Thu, 10 Apr 2025 15:07:41 +0000 (UTC) Authentication-Results: imf18.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Flq0J0TY; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf18.hostedemail.com: domain of joannelkoong@gmail.com designates 209.85.167.43 as permitted sender) smtp.mailfrom=joannelkoong@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1744297661; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=KnneFsSBHQjs5Y+8keCO9raiB8uFLlirdcef9i/OwQA=; b=PEiBF/qYpmUy7Lq029pa+VNxlIu0yVvvKm77h/eiuGJH0AYZnDLssak35uWeFhHNK6RHmG NBKY7l3K/QXcKk75oywPeeyFLST3r/roDxlhMLU5j0+Uo/KPuxaIjL4NIlWPlJjR4imyJN 2R5Iwfa2mm2AbGLvQdPwriRhBsXlIfY= ARC-Authentication-Results: i=1; imf18.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Flq0J0TY; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf18.hostedemail.com: domain of joannelkoong@gmail.com designates 209.85.167.43 as permitted sender) smtp.mailfrom=joannelkoong@gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1744297661; a=rsa-sha256; cv=none; b=SQd/8yPYcDVKTCy15eghresAmnoanwXIfR4pSoQGu9p+ZnXqzmoq7DlvAmeC4/dcJf9d6c b/LZTWEB8QSvGe7C25gXrtlKXUPLd28WjIvYFSXTBsQxHo+p8OHxXPe+DzHt4KO/3r6iOU KTbagl4+NdSYNVQJQPlkFRg70HRioCU= Received: by mail-lf1-f43.google.com with SMTP id 2adb3069b0e04-5499d2134e8so1160314e87.0 for ; Thu, 10 Apr 2025 08:07:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1744297659; x=1744902459; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=KnneFsSBHQjs5Y+8keCO9raiB8uFLlirdcef9i/OwQA=; b=Flq0J0TY9kKCUDJIK+6fQwfZV54Ltu19NcEJyxXFmG1Qk58X/iu6D+yehW5+OOf02h OYcmRH71+GOcqEKN1t7pUjk+8D/92RUxN7xCh593f4dsUUW+PKtKnM+xanxJyKOyv1vK aFWTcMgEmyV4hYTv5SS76sgylKv7ku9rKBLvgKEGX7A4K9ozwu9WQFC7qGIR/BBc7VbM o1+aSjzR/qZT0vWXwQQyRgGoV3+h7QCBHWWRWo/MgYRpjHM0Iw6nZ+B1CevlMrPopJ0o 0PSmE6j/B17ttMQj5bWsLmgLveDdD/X+IRe7mm32Ej8MPoC+0lDwVBnWZXYahC3S5RYg ek9Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1744297659; x=1744902459; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=KnneFsSBHQjs5Y+8keCO9raiB8uFLlirdcef9i/OwQA=; b=H5r2/UvqmqVe01A4TFoC6Gq+idEX6kefTvJ2/LKYrFCTa9rcis/piV1IE6w0Rfuyub /YrO4YLiEUByduIelKptvfHjmDfh+P5OC3YED9+ZfmG/MqWqnRjeUBV+drCGnOcRGnyy 9pG0fDTnpGJ9iaroAfUTBRanQdLDv8J1YYkz03VZ3hMgkrx/J2iKof4VPkH4IakIVAfo w+XdDFBZgHdGeHztwCnOmbmEEH6ILxOznB7Cllrogj9gh8iDjkpcq4e8jYHAyxiB37kw G2vtPYzQHx17BGXhXWQm1/sghD+PtJnw0XWVE1AzyQtHTd+6jQS04ugfltDGhNuaXswe FO6A== X-Forwarded-Encrypted: i=1; AJvYcCV8lYjKV7Wn3Va4l/a+gIvrctY1Ezo7BAVMoyceexJkcek3gNh6nxQc3A5sdezN7WxWV9LRwmg4Gg==@kvack.org X-Gm-Message-State: AOJu0YzwN/5DXwbirIYmUpMtMdFzH9A5Yd/NrkUf+HZTOiWZuhJB8t9u WaOAr0PjElF/Tka4V1V1J4uhTTyjn+ZmN+v9N8vIbN5/uS+KLwPuxRFRXSkzMhG/H1qdt9AviTL DEdOBCBe7LrtJYAOBoz+9tJN+zj0= X-Gm-Gg: ASbGncuVsdtavcsO4PI7KRPrqJqVubx+NM7MGbfPmU4C10fpvgQtNIl+Cus2iGCCN/Z 2f1tbv7HTnDNHh69Hi9OU/2Q+1bvrtTuMBV+CzkjJPyO9EFeS0Crjl3ZtWPPXRK9SnlfpTRR06s EfauMiynHKnAnP1BRvDf5kV+m7KMRZyHG2 X-Google-Smtp-Source: AGHT+IHRtUYNWUNxDzi4/nSfOt7OrVyYp8hLoKm65hQ1sGNELiu2MmmusPeAcCf+CDu76CeULym1+xQDETP1rC9STGs= X-Received: by 2002:a05:6512:238c:b0:549:4e79:d4c0 with SMTP id 2adb3069b0e04-54cb689a69dmr940995e87.53.1744297658890; Thu, 10 Apr 2025 08:07:38 -0700 (PDT) MIME-Version: 1.0 References: <20250404181443.1363005-1-joannelkoong@gmail.com> <20250404181443.1363005-4-joannelkoong@gmail.com> <7e9b1a40-4708-42a8-b8fc-44fa50227e5b@linux.alibaba.com> In-Reply-To: <7e9b1a40-4708-42a8-b8fc-44fa50227e5b@linux.alibaba.com> From: Joanne Koong Date: Thu, 10 Apr 2025 08:07:22 -0700 X-Gm-Features: ATxdqUH1Oqc5Eyd8iLIPoAOljO4GF5V74mMh2sLenANHKIAzSIQEix_4HUbWNKs Message-ID: Subject: Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree To: Jingbo Xu Cc: miklos@szeredi.hu, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, shakeel.butt@linux.dev, david@redhat.com, bernd.schubert@fastmail.fm, ziy@nvidia.com, jlayton@kernel.org, kernel-team@meta.com, Miklos Szeredi Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam11 X-Rspamd-Queue-Id: 67BFF1C000D X-Stat-Signature: 86hkwirsaqz147szj3oxuxzo6sqbkx9m X-Rspam-User: X-HE-Tag: 1744297661-583736 X-HE-Meta: U2FsdGVkX18sj1EO01UDOIeYb4UNKTlpN3lKbydnqZyU1SGoeoq7Bn6hLI+fz9HRRa5KaXmWjzluCDDc3gZzCJ281UFnROpUp8yyZmr81Jzl9VqZX3dYgW6dnb+5hsiKGiez5oos3kUe6Lm8YK94ebS24xMstLQ2envQ2zi+p3XoG4YoNXQzlHyiClT+zMN0dqR699YRCeWgsad+JnA4AByLTnhyWGxT4NcvRaJUbhHamyobAYdTcJjBADLJjxWUZb8byYer+YEF6NkyXH2UI5aVYVauVixTy6lUKMyO7TtcZCgphh+NvSb/Msn6rlbCvdXVcqV3+U4kprP6+f+xa9d8U+NF9eF0sSf3VIi6shlNGsaOSNXG4VLxBRUn1ElN3zZ2Evt8HYfOX2cShhzqhtOnYjsveXQc6XbMGgCBVSmfld46t5StOkrbnNnchQUmd6LwciXGjhllljKXKvLTZKhscKmznNDxz8yiGBjwXMzNSfE6WCwcNc+BOKd50NeGszTUdhfMQ381uVmrCIagWRtRF6vO2KYVzmYafAc2FN93tM8aoR/ppQraURqOdSK2xqQK86vWRMzsPb2clxuRflQLs3Up+i1rRHpMobbCF4K5tpgqFQZ1XQKzzPkXiCW3Vu7rWkj88oseF6lG5nzn+FlRWaareqvD4jXL2zSU/Rny/oLyfKK6YffD/yEvbS34ZwQ7ZteAPNp+8PRnKpY9cpr1dU8iWr6DWmQZ+pZyDDMROtrByQPKPCoQyQEc6PQKV8Ja73yFycLrEzr4BNHLOtAp1ZFnx+WPdaS9qoaq4UgWLek6rS95U1ARWFMICRaOqmOoWyA9shZc9WkWl2v4cFDRxAbpSBtzTWFLKRkuIWWXuy+LSKQyjkjnCROQJZBapkhh/Jd51xMpZ/BX2z7ve5I5VMcTdqPGZB7tgrXvJPqCBOvQAKGVpA8PBxXEXro0o2+nn2h86mgWXKDJqBB cUOD7/vF z9EZ1f85EtL0qejRiJ+w23k7ykKNYDWbPgFVpZIFsnDpzkxenpv7pvfnvh86nhRyfiX+80MAT6kJcK3/yAJHGZ0O0bkPYh5ml12n1old7/xm5BGigHWUo+DWT0jobCEgm8aNOBoW5cl0fEb6KR2I5jKmLSYuGg/wUc5EzWCEkf3bP+0Dw93P2E+TwoKeqcL9M0mXuVluPL0UjIYnakLMVr20S17IfTLSIxWtv12QpUlk3xLNClwQko8WCUjmoLWPjK/KwLjoyiXLiP4URq+3+3Kfc9V1karkvTgKf0/kpRKjAL0EwxBDfYj90APFtOiMvTErzaG4AweazxiMgig2es1ZfdITcvnkrw8GagMxFrLqUru3PBhIGz5fXrYuLlthnVqyjrYvBeo3u0Kg= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, Apr 9, 2025 at 7:12=E2=80=AFPM Jingbo Xu wrote: > > > > On 4/10/25 7:47 AM, Joanne Koong wrote: > > On Tue, Apr 8, 2025 at 7:43=E2=80=AFPM Jingbo Xu wrote: > >> > >> Hi Joanne, > >> > >> On 4/5/25 2:14 AM, Joanne Koong wrote: > >>> In the current FUSE writeback design (see commit 3be5a52b30aa > >>> ("fuse: support writable mmap")), a temp page is allocated for every > >>> dirty page to be written back, the contents of the dirty page are cop= ied over > >>> to the temp page, and the temp page gets handed to the server to writ= e back. > >>> > >>> This is done so that writeback may be immediately cleared on the dirt= y page, > >>> and this in turn is done in order to mitigate the following deadlock = scenario > >>> that may arise if reclaim waits on writeback on the dirty page to com= plete: > >>> * single-threaded FUSE server is in the middle of handling a request > >>> that needs a memory allocation > >>> * memory allocation triggers direct reclaim > >>> * direct reclaim waits on a folio under writeback > >>> * the FUSE server can't write back the folio since it's stuck in > >>> direct reclaim > >>> > >>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitiga= tes > >>> the situations described above, FUSE writeback does not need to use > >>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mapping= s. > >>> > >>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings > >>> and removes the temporary pages + extra copying and the internal rb > >>> tree. > >>> > >>> fio benchmarks -- > >>> (using averages observed from 10 runs, throwing away outliers) > >>> > >>> Setup: > >>> sudo mount -t tmpfs -o size=3D30G tmpfs ~/tmp_mount > >>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads= =3D4 -o source=3D~/tmp_mount ~/fuse_mount > >>> > >>> fio --name=3Dwriteback --ioengine=3Dsync --rw=3Dwrite --bs=3D{1k,4k,1= M} --size=3D2G > >>> --numjobs=3D2 --ramp_time=3D30 --group_reporting=3D1 --directory=3D/r= oot/fuse_mount > >>> > >>> bs =3D 1k 4k 1M > >>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s > >>> After 341 MiB/s 2246 MiB/s 2685 MiB/s > >>> % diff -3% 23% 45% > >>> > >>> Signed-off-by: Joanne Koong > >>> Reviewed-by: Jingbo Xu > >>> Acked-by: Miklos Szeredi > >> > > > > Hi Jingbo, > > > > Thanks for sharing your analysis for this. > > > >> Overall this patch LGTM. > >> > >> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism = is > >> also unneeded then, at least the DIRECT IO routine (i.e. > > > > I took a look at fi->writectr and fi->queued_writes and my > > understanding is that we do still need this. For example, for > > truncates (I'm looking at fuse_do_setattr()), I think we still need to > > prevent concurrent writeback or else the setattr request and the > > writeback request could race which would result in a mismatch between > > the file's reported size and the actual data written to disk. > > I haven't looked into the truncate routine yet. I will see it later. > > > > >> fuse_direct_io()) doesn't need fuse_sync_writes() anymore. That is > >> because after removing the temp page, the DIRECT IO routine has alread= y > >> been waiting for all inflight WRITE requests, see > >> > >> # DIRECT read > >> generic_file_read_iter > >> kiocb_write_and_wait > >> filemap_write_and_wait_range > > > > Where do you see generic_file_read_iter() getting called for direct io = reads? > > # DIRECT read > fuse_file_read_iter > fuse_cache_read_iter > generic_file_read_iter > kiocb_write_and_wait > filemap_write_and_wait_range > a_ops->direct_IO(),i.e. fuse_direct_IO() > Oh I see, I thought files opened with O_DIRECT automatically call the .direct_IO handler for reads/writes but you're right, it first goes through .read_iter / .write_iter handlers, and the .direct_IO handler only gets invoked through generic_file_read_iter() / generic_file_direct_write() in mm/filemap.c There's two paths for direct io in FUSE: a) fuse server sets fi->direct_io =3D true when a file is opened, which will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side b) fuse server doesn't set fi->direct_io =3D true, but the client opens the file with O_DIRECT We only go through the stack trace you listed above for the b) case. For the a) case, we'll hit if (ff->open_flags & FOPEN_DIRECT_IO) return fuse_direct_read_iter(iocb, to); and if (ff->open_flags & FOPEN_DIRECT_IO) return fuse_direct_write_iter(iocb, from); which will invoke fuse_direct_IO() / fuse_direct_io() without going through the kiocb_write_and_wait() -> filemap_write_and_wait_range() / kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed above. So for the a) case I think we'd still need the fuse_sync_writes() in case there's still pending writeback. Do you agree with this analysis or am I missing something here? > > > Similarly, where do you see generic_file_write_iter() getting called > > for direct io writes? > > # DIRECT read > fuse_file_write_iter > fuse_cache_write_iter > generic_file_write_iter > generic_file_direct_write > kiocb_invalidate_pages > filemap_invalidate_pages > filemap_write_and_wait_range > a_ops->direct_IO(),i.e. fuse_direct_IO() > > > > Where do you see fi->writectr / fi->queued-writes preventing this > > race? > > IMO overall fi->writectr / fi->queued-writes are introduced to prevent > DIRECT IO and writeback from sending duplicate (inflight) WRITE requests > for the same page. > > For the DIRECT write routine: > > # non-FOPEN_DIRECT_IO DIRECT write > fuse_cache_write_iter > fuse_direct_IO > fuse_direct_io > fuse_sync_writes > > > # FOPEN_DIRECT_IO DIRECT write > fuse_direct_write_iter > fuse_direct_IO > fuse_direct_io > fuse_sync_writes > > > For the writeback routine: > fuse_writepages() > fuse_writepages_fill > fuse_writepages_send > # buffer the WRITE request in queued_writes list > fuse_flush_writepages > # flush WRITE only when fi->writectr >=3D 0 > > > > > It looks to me like in the existing code, this race condition > > you described of direct write invalidating the page cache, then > > another buffer write reads the page cache and dirties it, then > > writeback is called on that, and the 2 write requests racing, could > > still happen? > > > > > >> However it seems that the writeback > >> won't wait for previous inflight DIRECT WRITE requests, so I'm not muc= h > >> sure about that. Maybe other folks could offer more insights... > > > > My understanding is that these lines > > > > if (!cuse && filemap_range_has_writeback(...)) { > > ... > > fuse_sync_writes(inode); > > ... > > } > > > > in fuse_direct_io() is what waits on previous inflight direct write > > requests to complete before the direct io happens. > > Right. > > > > > > >> > >> Also fuse_sync_writes() is not needed in fuse_flush() anymore, with > >> which I'm pretty sure. > > > > Why don't we still need this for fuse_flush()? > > > > If a caller calls close(), this will call > > > > filp_close() > > filp_flush() > > filp->f_op->flush() > > fuse_flush() > > > > it seems like we should still be waiting for all writebacks to finish > > before sending the fuse server the fuse_flush request, no? > > > > filp_close() > filp_flush() > filp->f_op->flush() > fuse_flush() > write_inode_now > writeback_single_inode(WB_SYNC_ALL) > do_writepages > # flush dirty page > filemap_fdatawait > # wait for WRITE completion Nice. I missed that write_inode_now() will invoke filemap_fdatawait(). This seems pretty straightforward. I'll remove the fuse_sync_writes() call in fuse_flush() when I send out v8. The direct io one above is less straight-forward. I won't add that to v8 but that can be done in a separate future patch when we figure that out. Thanks, Joanne > > >> > >>> --- > >>> fs/fuse/file.c | 360 ++++-----------------------------------------= -- > >>> fs/fuse/fuse_i.h | 3 - > >>> 2 files changed, 28 insertions(+), 335 deletions(-) > >>> > >>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c > >>> index 754378dd9f71..91ada0208863 100644 > >>> --- a/fs/fuse/file.c > >>> +++ b/fs/fuse/file.c > >>> @@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl= _owner_t id) > >>> > >>> struct fuse_writepage_args { > >>> struct fuse_io_args ia; > >>> - struct rb_node writepages_entry; > >>> struct list_head queue_entry; > >>> - struct fuse_writepage_args *next; > >>> struct inode *inode; > >>> struct fuse_sync_bucket *bucket; > >>> }; > >>> > >>> -static struct fuse_writepage_args *fuse_find_writeback(struct fuse_i= node *fi, > >>> - pgoff_t idx_from, pgoff_t i= dx_to) > >>> -{ > >>> - struct rb_node *n; > >>> - > >>> - n =3D fi->writepages.rb_node; > >>> - > >>> - while (n) { > >>> - struct fuse_writepage_args *wpa; > >>> - pgoff_t curr_index; > >>> - > >> > >> -- > >> Thanks, > >> Jingbo > > -- > Thanks, > Jingbo