From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id D4C4DC3601E for ; Thu, 10 Apr 2025 16:11:42 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 1013F28007D; Thu, 10 Apr 2025 12:11:42 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 0B02C6B0350; Thu, 10 Apr 2025 12:11:42 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E6BA528007D; Thu, 10 Apr 2025 12:11:41 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id C70B36B034F for ; Thu, 10 Apr 2025 12:11:41 -0400 (EDT) Received: from smtpin14.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id CFC37120F99 for ; Thu, 10 Apr 2025 16:11:41 +0000 (UTC) X-FDA: 83318624802.14.20664AB Received: from mail-qt1-f178.google.com (mail-qt1-f178.google.com [209.85.160.178]) by imf11.hostedemail.com (Postfix) with ESMTP id DF85E40004 for ; Thu, 10 Apr 2025 16:11:39 +0000 (UTC) Authentication-Results: imf11.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=baVR1CQT; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf11.hostedemail.com: domain of joannelkoong@gmail.com designates 209.85.160.178 as permitted sender) smtp.mailfrom=joannelkoong@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1744301499; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=usXxrhDoknZeVBxvwY3ZQkR0+/Xv+tfgIblFwC2g8ns=; b=qzwE+CN+F7w7Pc8kYVOav3MWcOrlQhs50EfqItLTr5vYPeVGFCKjPBWfDTLaoFTpWZ8K+c ztZFditPI4CL5+V5ck/ZFwmSqW9MjQ/eIncznZzsRhGLvO55hulxxkwYWk2hhyayZLXG+1 W4SbSpZC8n2uKgJuaaTbB6hrezzI2JM= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1744301499; a=rsa-sha256; cv=none; b=vl0YxxXa++dg+A4uKXhieg6jgXCCmYOe7rSZwv+D8yC3olUuGtGdfNO5HRIbJ33doYDn1D gS614MXOmuFQTXKNC7di29BYj9vBrpaT9WOVT55byHPxRhnDNW/esCFUJavy03rye2a0dW FHT1vy7AQH2aBiyDdsWtiDp7iCxK8uQ= ARC-Authentication-Results: i=1; imf11.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=baVR1CQT; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf11.hostedemail.com: domain of joannelkoong@gmail.com designates 209.85.160.178 as permitted sender) smtp.mailfrom=joannelkoong@gmail.com Received: by mail-qt1-f178.google.com with SMTP id d75a77b69052e-47663aeff1bso9406881cf.0 for ; Thu, 10 Apr 2025 09:11:39 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1744301499; x=1744906299; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=usXxrhDoknZeVBxvwY3ZQkR0+/Xv+tfgIblFwC2g8ns=; b=baVR1CQTj4boBZyvXOATY22BmKYHw/8FZKXlUy9BfSPy3LQ+riwY+PX1THKVtPdnFf vPS6rdDtKEw1ulFM78e4sqFGSDzqC6rafKV6cISpPpNUdVbOEG/+6T7b4CWyfFsisu6l nTcn0w4AN6yP+CtTK5BARQ6Av4RyrJVZF8e8Tb2G8ExIa9KMIj39nQFgprJTniAWM8mR 5HUeDv8N57+GGNpg7u37wtO6zKoh3c4IolbtA9Ui7hLKWToE2meAI8SXioJFl9GY3bNd coUggDrC/Kiy1M1fnN+ClBGjYE44WVaJudqqHsc6Cs0CRzgGrOl0VMGowcTN/CkEOhlY 9xXA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1744301499; x=1744906299; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=usXxrhDoknZeVBxvwY3ZQkR0+/Xv+tfgIblFwC2g8ns=; b=aer/nWi2sBAvm40dZ9pgBqgwEPtSFHHUVNznBMF+h476+u2EYLGeLRi2HoyXivVcat DJ/kznDyyD7xe3oSV8HK9+H/I1iUVFRo3Jw2w+RzUnvxAvw0rxzl2y8CpUMmUb0X7Edr Xep1Wd3HKWXkEkh3O2OCLOg3Zn8d6fliH/f9wsSi7/44sliL+FT4wnRuQUqLO8fe5mMq xR7lR9UmDd8xOG9s774/obU9N70xsvOJzSwkZqx3VQiF/14zBs3OIOrBWqz14Y3ZdD3F ULsu55yBFGuQWOwzfYLYn1EsgCoK3WfQvN6KOc7jC+yEFJY3KiRY073LPB+6J/zgUTQJ wfdA== X-Forwarded-Encrypted: i=1; AJvYcCUGClsX0C40N7vS16jpVE1hfmMkO41oQKMdRUdiVSp430v2wMp26FEi9gfdbK+aYotXWmHshGOa1Q==@kvack.org X-Gm-Message-State: AOJu0Yy3ph9SebYSWoZeDmiHc0mlgkKk2ppLbWAlUlZdr95JQYl4jLEM g77GDErItigVB1peo7TEg9tyRsmMSx07iFktZJeqMHxc4+s3yd+aS2bs7JiAHj6cPff1BmhgT93 eJIn8nwmrVQeTr0Ky65a0nBX7/zk= X-Gm-Gg: ASbGncuQQrvpWF+3dSa2CtMXpBiurwXjduHrss1kzeJLOla0Pd2sXEEiSL2s+FKupbj 3VkcrR2Ycswcb5PC9PV9ZNPVRd2h+ibmmB1juwSzY18bdg8uMloXEkBwNel6mG7flLIW9GnstIL ktb5+M1v9/qKlYQFRzoJgRJA== X-Google-Smtp-Source: AGHT+IHOMKNDcjT83ijbG7cVz9nkjKDhfcORp5mkizohR3T5dQeIgZsw9Iz92fWM33rFrVyo+F81AdObBNWx5Ui1m+8= X-Received: by 2002:a05:622a:11cf:b0:474:f4a4:2ca9 with SMTP id d75a77b69052e-4796cbc4b84mr40317001cf.25.1744301498809; Thu, 10 Apr 2025 09:11:38 -0700 (PDT) MIME-Version: 1.0 References: <20250404181443.1363005-1-joannelkoong@gmail.com> <20250404181443.1363005-4-joannelkoong@gmail.com> <7e9b1a40-4708-42a8-b8fc-44fa50227e5b@linux.alibaba.com> <9a3cfb55-faae-4551-9bef-b9650432848a@linux.alibaba.com> In-Reply-To: <9a3cfb55-faae-4551-9bef-b9650432848a@linux.alibaba.com> From: Joanne Koong Date: Thu, 10 Apr 2025 09:11:28 -0700 X-Gm-Features: ATxdqUHBqT9PC48ClM1VKz8vkcfzG5ve54G0Utl1ySAgNzES2LqGM6vvDM6LZ9s Message-ID: Subject: Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree To: Jingbo Xu Cc: miklos@szeredi.hu, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, shakeel.butt@linux.dev, david@redhat.com, bernd.schubert@fastmail.fm, ziy@nvidia.com, jlayton@kernel.org, kernel-team@meta.com, Miklos Szeredi Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: DF85E40004 X-Stat-Signature: n3jrgj16qens5p8gpacyjb3rs4ywpqkx X-Rspam-User: X-HE-Tag: 1744301499-44178 X-HE-Meta: U2FsdGVkX18wZGad329g03GJJPVCJIqV6GzsuEdcGnStih2neW+W98OfyQ2Xeya8spL2y1utguSIXHNmynGfRCB12vVQ9ps5xGKapEx4QnZoUg9VBCEYdQmbqiiap4Ax+8RACiow5s6dL809LctNoj/rzAvgxasyi2sLcGrFitSgiKov671txoEIsA/93uxmmKgiOCEgXwXEHkCRmPKNN1Ix5ckcIr3pyEH+K2QdvfWkRNxKF3sJHfsfro45ofL9ZNaqiwEnuK3D5isgWQFi+S2SDxzX34df/kH4K4cNj1Lu4yooKzKY0zr+Efk9BSk+2fcdBY3qsxZY0bJyUTLh7hx4/CfMzmtiwmS6aiMvkt5zhLTTcSG9AvJU/4sydI6kJDd5hXbvolce20ggh6pruxHsIerjwLeW8YsIe9mOLuW+QKUE8XdLsNHt7zEU8BMuHD2KUusPa2HySKtb89ZU8+Hz+PBlSaK9RQy2yHCL0M69iFGl5keOek/4tIEf4tYU8SD8an1QCgERZeX4dqDDGv5fm/sbx2993wHrmSD+P8CDfR0tNcooGrHkVNCxZJbFlqtvUidYK2ZqzM9U1o4JraLC4/wAnwG24PqYHwUnKl9rg/JHQH2CmLJ8tM70mIw7sNgZCRULBK/he4jKbqVJreQIvHI92JdkiqyNBGdp7A0IKWreJU3o63kbuhrBMswEvDHIxp6gMtBiAUOYxb1fCyszrKAwz4X1Cyp1FpDQDvgK8TvGLmt21ckSDgYSD7tiV8bzPGfeA77bFGPIZQPFOwux5aSX9BQwaW7ARjKK/v4myJ4hI3BjVvQarc1UJAOmIAxstbR/cRNk+eAEaPagCBkDPznSNdwQM41fhFhDmfkKPyz226zkozPulGLFToHTL7WmbN7Jfx5mLNhirTNN3w3Bq0VQgPh3mVdLbxKTwm87QPFAMFXRwywUEIGYF+kn3r52FNsPEMxGOkqbolG beLR0aWG HJMwLSqgs6NCbVaJAiw4oejfC+D6Wgyt4H7Z97ozYtwuuDRkGQRT4/brkE6sxOA3lac6CPbZXACrErQ5QnLP1a20WeYhy3LpC/wYXVCpSwvOr9lwbNkAV8MYNUXRzkBFr9a+1q6S4pMg1JilQoHlPDnveKUnbn1fiWlbAyspzu1GMTTYv9upqCes8NBg1x7Lb1+UPv9pE5BjVHlCJyUl8Vpha9G7U0I5uH4icB3urW6Q5vO9Jw+ld5+82X8BPDZW7F/oKGLEq9vpzNpECGKj+/jIEpLMM5300SzcGaQrT9MujAcbhYK1EmLU46HAuOrdmI42FsKTDCtXpnUHjHi8CrSp50haLvAXU0/yyamK/LWOSEebP7NE0ns2N8GihrSiHKVbKcgr7SRdig5k= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Apr 10, 2025 at 8:11=E2=80=AFAM Jingbo Xu wrote: > > On 4/10/25 11:07 PM, Joanne Koong wrote: > > On Wed, Apr 9, 2025 at 7:12=E2=80=AFPM Jingbo Xu wrote: > >> > >> > >> > >> On 4/10/25 7:47 AM, Joanne Koong wrote: > >>> On Tue, Apr 8, 2025 at 7:43=E2=80=AFPM Jingbo Xu wrote: > >>>> > >>>> Hi Joanne, > >>>> > >>>> On 4/5/25 2:14 AM, Joanne Koong wrote: > >>>>> In the current FUSE writeback design (see commit 3be5a52b30aa > >>>>> ("fuse: support writable mmap")), a temp page is allocated for ever= y > >>>>> dirty page to be written back, the contents of the dirty page are c= opied over > >>>>> to the temp page, and the temp page gets handed to the server to wr= ite back. > >>>>> > >>>>> This is done so that writeback may be immediately cleared on the di= rty page, > >>>>> and this in turn is done in order to mitigate the following deadloc= k scenario > >>>>> that may arise if reclaim waits on writeback on the dirty page to c= omplete: > >>>>> * single-threaded FUSE server is in the middle of handling a reques= t > >>>>> that needs a memory allocation > >>>>> * memory allocation triggers direct reclaim > >>>>> * direct reclaim waits on a folio under writeback > >>>>> * the FUSE server can't write back the folio since it's stuck in > >>>>> direct reclaim > >>>>> > >>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and miti= gates > >>>>> the situations described above, FUSE writeback does not need to use > >>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappi= ngs. > >>>>> > >>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings > >>>>> and removes the temporary pages + extra copying and the internal rb > >>>>> tree. > >>>>> > >>>>> fio benchmarks -- > >>>>> (using averages observed from 10 runs, throwing away outliers) > >>>>> > >>>>> Setup: > >>>>> sudo mount -t tmpfs -o size=3D30G tmpfs ~/tmp_mount > >>>>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads= =3D4 -o source=3D~/tmp_mount ~/fuse_mount > >>>>> > >>>>> fio --name=3Dwriteback --ioengine=3Dsync --rw=3Dwrite --bs=3D{1k,4k= ,1M} --size=3D2G > >>>>> --numjobs=3D2 --ramp_time=3D30 --group_reporting=3D1 --directory=3D= /root/fuse_mount > >>>>> > >>>>> bs =3D 1k 4k 1M > >>>>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s > >>>>> After 341 MiB/s 2246 MiB/s 2685 MiB/s > >>>>> % diff -3% 23% 45% > >>>>> > >>>>> Signed-off-by: Joanne Koong > >>>>> Reviewed-by: Jingbo Xu > >>>>> Acked-by: Miklos Szeredi > >>>> > >>> > >>> Hi Jingbo, > >>> > >>> Thanks for sharing your analysis for this. > >>> > >>>> Overall this patch LGTM. > >>>> > >>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanis= m is > >>>> also unneeded then, at least the DIRECT IO routine (i.e. > >>> > >>> I took a look at fi->writectr and fi->queued_writes and my > >>> understanding is that we do still need this. For example, for > >>> truncates (I'm looking at fuse_do_setattr()), I think we still need t= o > >>> prevent concurrent writeback or else the setattr request and the > >>> writeback request could race which would result in a mismatch between > >>> the file's reported size and the actual data written to disk. > >> > >> I haven't looked into the truncate routine yet. I will see it later. > >> > >>> > >>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore. That is > >>>> because after removing the temp page, the DIRECT IO routine has alre= ady > >>>> been waiting for all inflight WRITE requests, see > >>>> > >>>> # DIRECT read > >>>> generic_file_read_iter > >>>> kiocb_write_and_wait > >>>> filemap_write_and_wait_range > >>> > >>> Where do you see generic_file_read_iter() getting called for direct i= o reads? > >> > >> # DIRECT read > >> fuse_file_read_iter > >> fuse_cache_read_iter > >> generic_file_read_iter > >> kiocb_write_and_wait > >> filemap_write_and_wait_range > >> a_ops->direct_IO(),i.e. fuse_direct_IO() > >> > > > > Oh I see, I thought files opened with O_DIRECT automatically call the > > .direct_IO handler for reads/writes but you're right, it first goes > > through .read_iter / .write_iter handlers, and the .direct_IO handler > > only gets invoked through generic_file_read_iter() / > > generic_file_direct_write() in mm/filemap.c > > > > There's two paths for direct io in FUSE: > > a) fuse server sets fi->direct_io =3D true when a file is opened, which > > will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side > > b) fuse server doesn't set fi->direct_io =3D true, but the client opens > > the file with O_DIRECT > > > > We only go through the stack trace you listed above for the b) case. > > For the a) case, we'll hit > > > > if (ff->open_flags & FOPEN_DIRECT_IO) > > return fuse_direct_read_iter(iocb, to); > > > > and > > > > if (ff->open_flags & FOPEN_DIRECT_IO) > > return fuse_direct_write_iter(iocb, from); > > > > which will invoke fuse_direct_IO() / fuse_direct_io() without going > > through the kiocb_write_and_wait() -> filemap_write_and_wait_range() / > > kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed > > above. > > > > So for the a) case I think we'd still need the fuse_sync_writes() in > > case there's still pending writeback. > > > > Do you agree with this analysis or am I missing something here? > > Yeah, that's true. But instead of calling fuse_sync_writes(), we can > call filemap_wait_range() or something similar here. > Agreed. Actually, the more I look at this, the more I think we can replace all fuse_sync_writes() and get rid of it entirely. Right now, fuse_sync_writes() is called in: fuse_fsync(): /* * Start writeback against all dirty pages of the inode, then * wait for all outstanding writes, before sending the FSYNC * request. */ err =3D file_write_and_wait_range(file, start, end); if (err) goto out; fuse_sync_writes(inode); /* * Due to implementation of fuse writeback * file_write_and_wait_range() does not catch errors. * We have to do this directly after fuse_sync_writes() */ err =3D file_check_and_advance_wb_err(file); if (err) goto out; We can get rid of the fuse_sync_writes() and file_check_and_advance_wb_err() entirely since now without temp pages, the file_write_and_wait_range() call actually ensures that writeback is completed fuse_writeback_range(): static int fuse_writeback_range(struct inode *inode, loff_t start, loff_t end) { int err =3D filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX); if (!err) fuse_sync_writes(inode); return err; } We can replace fuse_writeback_range() entirely with filemap_write_and_wait_range(). fuse_direct_io(): if (fopen_direct_io && fc->direct_io_allow_mmap) { res =3D filemap_write_and_wait_range(mapping, pos, pos + count - 1); if (res) { fuse_io_free(ia); return res; } } if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + count - 1))) { if (!write) inode_lock(inode); fuse_sync_writes(inode); if (!write) inode_unlock(inode); } I think this can just replaced with if (fopen_direct_io && (fc->direct_io_allow_mmap || !cuse))= { res =3D filemap_write_and_wait_range(mapping, pos, pos + count - 1); if (res) { fuse_io_free(ia); return res; } } since for the !fopen_direct_io case, it will already go through filemap_write_and_wait_range(), as you mentioned in your previous message. I think this also fixes a bug (?) in the original code - in the fopen_direct_io && !fc->direct_io_allow_mmap case, I think we still need to write out dirty pages first, which we don't currently do. What do you think? Thanks for all your careful review on this patchset throughout all of its iterations. > > > >> filp_close() > >> filp_flush() > >> filp->f_op->flush() > >> fuse_flush() > >> write_inode_now > >> writeback_single_inode(WB_SYNC_ALL) > >> do_writepages > >> # flush dirty page > >> filemap_fdatawait > >> # wait for WRITE completion > > > > Nice. I missed that write_inode_now() will invoke filemap_fdatawait(). > > This seems pretty straightforward. I'll remove the fuse_sync_writes() > > call in fuse_flush() when I send out v8. > > > > The direct io one above is less straight-forward. I won't add that to > > v8 but that can be done in a separate future patch when we figure that > > out. > > Thanks for keep working on this. Appreciated. > > -- > Thanks, > Jingbo