From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id E96C1D6A224 for ; Thu, 14 Nov 2024 18:19:52 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E77C06B008A; Thu, 14 Nov 2024 13:19:51 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id E27B56B008C; Thu, 14 Nov 2024 13:19:51 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id CEFF86B0092; Thu, 14 Nov 2024 13:19:51 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id AF80D6B008A for ; Thu, 14 Nov 2024 13:19:51 -0500 (EST) Received: from smtpin12.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 4948E413CF for ; Thu, 14 Nov 2024 18:19:51 +0000 (UTC) X-FDA: 82785512922.12.140C437 Received: from mail-qt1-f173.google.com (mail-qt1-f173.google.com [209.85.160.173]) by imf08.hostedemail.com (Postfix) with ESMTP id B92E6160006 for ; Thu, 14 Nov 2024 18:19:19 +0000 (UTC) Authentication-Results: imf08.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=eRXsxXtX; spf=pass (imf08.hostedemail.com: domain of joannelkoong@gmail.com designates 209.85.160.173 as permitted sender) smtp.mailfrom=joannelkoong@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1731608152; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=vQmTHDuEtoYVL2J2etb8ouHOFXYwfhcn0IXo14bDCmA=; b=xxgVLDDe1f/u3luxoP+vlaTQYB43fSjtuq7p+FU/LzwXvymEadKWr6lH5XOO3D8k1bXPv7 5RVCzMK4POhSav1mY0ZMGJ7M/KCiZpM2fHvbfZwyji1FkxMT5+kHzYR1thpXvHFu+tbsRw /hE5pyZoVA4A1fi5oqsWVim1PKTOU8o= ARC-Authentication-Results: i=1; imf08.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=eRXsxXtX; spf=pass (imf08.hostedemail.com: domain of joannelkoong@gmail.com designates 209.85.160.173 as permitted sender) smtp.mailfrom=joannelkoong@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1731608152; a=rsa-sha256; cv=none; b=YXOtKsSjB+Aawmt0B+TrJ+QeskdEPAKNRD9a0j2Yjss38IGjGK5bOveNHgLrfbyBQ+s2Hx 8CaTPOzXGPiD6wDTNQk22YuPklTNEX6uVZpITuD2Gj1LXakLTM/hBS8XrtSJhjwxWFfhIO seDuGCBgd0c1cU3JV7ASMJxWwXo2OTY= Received: by mail-qt1-f173.google.com with SMTP id d75a77b69052e-460af1a1154so5823021cf.0 for ; Thu, 14 Nov 2024 10:19:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1731608388; x=1732213188; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=vQmTHDuEtoYVL2J2etb8ouHOFXYwfhcn0IXo14bDCmA=; b=eRXsxXtXsOMCDw4FPte8uzrywD8VZnryyPmCoV0mbYltWeoamQaix9St7Up553cOOV n3aSjjuXYGAR1Ey55yaIDF7Ep73aTumhIFb1/1ivI11oDdcVG7yMmQeFoAi7YJVcmt3e G2zxFuihJL2SMjbsOU4Sb1jGUzlM5kILv+m7yjgQwaQ9fdCHJBwBhIfJviCqJty97zbW uTUaeiy9Jy6ASIRp0dlEMpOHK5tuQRXnIVCFZ+x2oSVb2vrVMVXWdoLsqJhbQ6jorn4n rWRgp5vSWhaLgbT7HL/Ug7L525/1XKb0zwTCmvdFTAx9Qf+mgu7CaZ0l6MsvPYdpbkdD Xafg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1731608388; x=1732213188; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=vQmTHDuEtoYVL2J2etb8ouHOFXYwfhcn0IXo14bDCmA=; b=Mitzkew4eRp61SlrCSZO6tkCr9JeT8p7rhBL1yzZQCCYdcsVZgd53gL+fyahjalsxb esJGbo/ZuFME0JX4UffO7hUFFp/CHYZNDWtnZZ9BNiueYBwPs2LBFgp71v55cvggZvsP F6gaF0jS6sBiZ9SegVM7GlSAExcmF5Ykl5c0mqzlyIyS/+B8ECZBFviXp6ZASQodGZYS GRXCb+nH79LPQh/quWC2gkj6dC+U8V9abxSPEnVrjNTbZ8SZfwB4uv1TZko/MaCWIWMS ahu8+xFpSQdpTi8uKFtUmjBTLf4aiTUQQXSf/Bw3dfhq69Gy+lYu0ZDCf9apQVZy33rJ 47mQ== X-Forwarded-Encrypted: i=1; AJvYcCWiRz6vU/5YtOtkKfIMDV1yCdCMVmw9wOsyO+HNeEW2YttTPefTV4hlpAb1AD94QeM56Ac6oIi11w==@kvack.org X-Gm-Message-State: AOJu0YxR6S/dlON8vY1LXhmTMycMcBxrbE6oj2A/VUqV4Al73u0YqoW1 1TtvGqFZS05aBYvpPy1f/fR2iBewfqWh/+SUgQoHKceL4du/iDAuLYwzkxbWfg9nJD6LExPMyfm YXDz+0ZgKuWahxPak25jmdAj/9tw= X-Google-Smtp-Source: AGHT+IGQww8gmC2Pzz5s2vUBPlXM6kYc+70EQ9eTZNXhBo9J8EpEQU9p/KGYbYMkSA7AmfcTN43Bn88giUGrvqr0ogI= X-Received: by 2002:a05:622a:5b94:b0:447:e769:76fc with SMTP id d75a77b69052e-4634b537fffmr90511671cf.34.1731608388309; Thu, 14 Nov 2024 10:19:48 -0800 (PST) MIME-Version: 1.0 References: <20241107235614.3637221-1-joannelkoong@gmail.com> <20241107235614.3637221-7-joannelkoong@gmail.com> <47661fe5-8616-4133-8d9c-faeb1ab96962@linux.alibaba.com> In-Reply-To: <47661fe5-8616-4133-8d9c-faeb1ab96962@linux.alibaba.com> From: Joanne Koong Date: Thu, 14 Nov 2024 10:19:37 -0800 Message-ID: Subject: Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree To: Jingbo Xu Cc: miklos@szeredi.hu, linux-fsdevel@vger.kernel.org, shakeel.butt@linux.dev, josef@toxicpanda.com, linux-mm@kvack.org, bernd.schubert@fastmail.fm, kernel-team@meta.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: B92E6160006 X-Stat-Signature: pn46zt8fat8kr68jcri6kpsg948encm4 X-Rspam-User: X-HE-Tag: 1731608359-823423 X-HE-Meta: U2FsdGVkX1+vqI80buHjQVk6er5uMw5+eF3t+cMbNzYfLTX5UMKuVGAY+og/u1z1IhPsQbdaJrlU4Q1mNh8xCN92F2d+zckiUMpBf2Qc4+MAjrMOp6K0zGjj1wMWTLX67PAwAvpNN/BC9sCuTlMkNWSC+zG+rvriGGfl8E/sEgN2ayJRSWVB2HMAfZDmnfw64hDP1+iQi1773zZIkrCzp2qWH2GRT/gqMboJfyOw+6nPWsimU6J5W87uFsb05Ujthd5vT21fVBsJ/62yLZVn5fL18uJAv5A1XM9hokziY09aDHPDTlZWC8A3YSMQz9AaBSo7sI2THnziLOPXr0ROaVyb6J8Ubmuhm3nr/QvGe+Pst4clje42uRHgRasLKWP35Enm8NP6PX5I+VehRHMsEfYzw8pzbA3nxs5CJm1OMiexCLT3Oik/Zmt36i6CYBWvVuXSzbqUg06Sf13I6WsMWWN83ZsmRjZYwVgE+5kmup2A5AlzPTPe7CGDMlBs6rj7Ej9Ei6LJMRThT5azwCGK5JB+9ohuToQNUfemKGXQZPMWTop4z5KCMCJyyma0S7/iyU/oWTs4zxAbI2XFnbZxmFSbNhaf3C5Jt8KeGg5568L+0Ih9sqqTDqIwLyS9TQx6rw8dRIJkbEfHr8DQL8YlA0m/QW3eXjMBdT6WiiHtgaQ9i2vPqXg6jlV5+DtlroOXuOrJnEQIOIMU2cY+tSAhdHB849LyyB2EdZ0cUn/es6R7jIdm5ILUfUR9yPgmH/M/InVwIQ2GvXoWyaBfXtiJzO2DxATeHnuBc0nx3rJewyasLfAD6SPbckjMRwxP4SJHnybUPyXbBn+wrYAVW7+XFBEfa3/hfQxcryctIfpg4xyQ2F71FJ6zdglTkPzkHvE6HF4KpwElQhUWKDdJeT7GYLfBD1hRMLViEHXLFSF043vIzcQca0REZMfv7FLeNwr+DmRbIqUm7rTpTx9IvdZ vn3A8Pfb 64XZ7M2rZ9xP/9kzQ6c0yscwB104jwmBZuLf770zU0CUaC9i7fuS3uvwMJXlpDwkYWXf7jUFCUPMTWOOtHkxvLE2oYtSBn/9mkIHI8VO4uDMldmx1mY9mA8gnfNK9khIV+6CAEd/mbTNuygofYOKghzL40/MhMH8blg0GrJYDVMYvCh6WfRgntANBo4wIIZL+pAhFGql/6Wsd0YWLAVBH31PjeFH8NBVpe5vQx8gbEHB8NnriFrT9uOVt5ymp3mOO+xCPf6rrYGGwGEqGb6QfzI3GS56R8Ih10aI0tU7SjJT2kE+55Y0DohFtkjCTxLdznW0+v+FvxVEk+dpB/7KBeYpyLsH62GmAh1YRlbYA8ERKhiwXhZPvPVHv6w== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, Nov 13, 2024 at 5:47=E2=80=AFPM Jingbo Xu wrote: > > > > On 11/14/24 8:39 AM, Joanne Koong wrote: > > On Tue, Nov 12, 2024 at 1:25=E2=80=AFAM Jingbo Xu wrote: > >> > >> Hi Joanne, > >> > >> On 11/8/24 7:56 AM, Joanne Koong wrote: > >>> Currently, we allocate and copy data to a temporary folio when > >>> handling writeback in order to mitigate the following deadlock scenar= io > >>> that may arise if reclaim waits on writeback to complete: > >>> * single-threaded FUSE server is in the middle of handling a request > >>> that needs a memory allocation > >>> * memory allocation triggers direct reclaim > >>> * direct reclaim waits on a folio under writeback > >>> * the FUSE server can't write back the folio since it's stuck in > >>> direct reclaim > >>> > >>> To work around this, we allocate a temporary folio and copy over the > >>> original folio to the temporary folio so that writeback can be > >>> immediately cleared on the original folio. This additionally requires= us > >>> to maintain an internal rb tree to keep track of writeback state on t= he > >>> temporary folios. > >>> > >>> A recent change prevents reclaim logic from waiting on writeback for > >>> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it. > >>> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which > >>> will prevent FUSE folios from running into the reclaim deadlock descr= ibed > >>> above) and removes the temporary folio + extra copying and the intern= al > >>> rb tree. > >>> > >>> fio benchmarks -- > >>> (using averages observed from 10 runs, throwing away outliers) > >>> > >>> Setup: > >>> sudo mount -t tmpfs -o size=3D30G tmpfs ~/tmp_mount > >>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads= =3D4 -o source=3D~/tmp_mount ~/fuse_mount > >>> > >>> fio --name=3Dwriteback --ioengine=3Dsync --rw=3Dwrite --bs=3D{1k,4k,1= M} --size=3D2G > >>> --numjobs=3D2 --ramp_time=3D30 --group_reporting=3D1 --directory=3D/r= oot/fuse_mount > >>> > >>> bs =3D 1k 4k 1M > >>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s > >>> After 341 MiB/s 2246 MiB/s 2685 MiB/s > >>> % diff -3% 23% 45% > >>> > >>> Signed-off-by: Joanne Koong > >> > >> I think there are some places checking or waiting for writeback could = be > >> reconsidered if they are still needed or not after we dropping the tem= p > >> page design. > >> > >> As they are inherited from the original implementation, at least they > >> are harmless. I think they could be remained in this patch, and could > >> be cleaned up later if really needed. > >> > > > > Thank you for the thorough inspection! > > > >> > >>> @@ -891,7 +813,7 @@ static int fuse_do_readfolio(struct file *file, s= truct folio *folio) > >>> * have writeback that extends beyond the lifetime of the folio= . So > >>> * make sure we read a properly synced folio. > >>> */ > >>> - fuse_wait_on_folio_writeback(inode, folio); > >>> + folio_wait_writeback(folio); > >> > >> I doubt if wait-on-writeback is needed here, as now page cache won't b= e > >> freed until the writeback IO completes. > >> > >> The routine attempts to free page cache, e.g. invalidate_inode_pages2(= ) > >> (generally called by distributed filesystems when the file content has > >> been modified from remote) or truncate_inode_pages() (called from > >> truncate(2) or inode eviction) will wait for writeback completion (if > >> any) before freeing page. > >> > >> Thus I don't think there's any possibility that .read_folio() or > >> .readahead() will be called when the writeback has not completed. > >> > > > > Great point. I'll remove this line and the comment above it too. > > > >> > >>> @@ -1172,7 +1093,7 @@ static ssize_t fuse_send_write_pages(struct fus= e_io_args *ia, > >>> int err; > >>> > >>> for (i =3D 0; i < ap->num_folios; i++) > >>> - fuse_wait_on_folio_writeback(inode, ap->folios[i]); > >>> + folio_wait_writeback(ap->folios[i]); > >> > >> Ditto. > > Actually this is a typo and I originally meant that waiting for > writeback in fuse_send_readpages() is unneeded as page cache won't be > freed until the writeback IO completes. > > > - wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first,= last)); > > + filemap_fdatawait_range(inode->i_mapping, first, last); > Gotcha and agreed. I'll remove this wait from readahead(). > > In fact the above waiting for writeback in fuse_send_write_pages() is > needed. The reason is as follows: > > >> > > > > Why did we need this fuse_wait_on_folio_writeback() even when we had > > the temp pages? If I'm understanding it correctly, > > fuse_send_write_pages() is only called for the writethrough case (by > > fuse_perform_write()), so these folios would never even be under > > writeback, no? > > I think mmap write could make the page dirty and the writeback could be > triggered then. > Ohhh I see, thanks for the explanation. > > > >> > >> > >>> static void fuse_writepage_args_page_fill(struct fuse_writepage_args= *wpa, struct folio *folio, > >>> - struct folio *tmp_folio, uint= 32_t folio_index) > >>> + uint32_t folio_index) > >>> { > >>> struct inode *inode =3D folio->mapping->host; > >>> struct fuse_args_pages *ap =3D &wpa->ia.ap; > >>> > >>> - folio_copy(tmp_folio, folio); > >>> - > >>> - ap->folios[folio_index] =3D tmp_folio; > >>> + folio_get(folio); > >> > >> I still think this folio_get() here is harmless but redundant. > >> > >> Ditto page cache won't be freed before writeback completes. > >> > >> Besides, other .writepages() implementaions e.g. iomap_writepages() al= so > >> doen't get the refcount when constructing the writeback IO. > >> > > > > Point taken. I'll remove this then, since other .writepages() also > > don't obtain a refcount. > > > >> > >>> @@ -2481,7 +2200,7 @@ static int fuse_write_begin(struct file *file, = struct address_space *mapping, > >>> if (IS_ERR(folio)) > >>> goto error; > >>> > >>> - fuse_wait_on_page_writeback(mapping->host, folio->index); > >>> + folio_wait_writeback(folio); > >> > >> I also doubt if wait_on_writeback() is needed here, as now there won't > >> be duplicate writeback IOs for the same page. > > > > What prevents there from being a duplicate writeback write for the > > same page? This is the path I'm looking at: > > > > ksys_write() > > vfs_write() > > new_sync_write() > > op->write_iter() > > fuse_file_write_iter() > > fuse_cache_write_iter() > > generic_file_write_iter() > > __generic_file_write_iter() > > generic_perform_write() > > op->write_begin() > > fuse_write_begin() > > > > but I'm not seeing where there's anything that prevents a duplicate > > write from happening. > > I mean there won't be duplicate *writeback* rather than *write* for the > same page. You could write the page cache and make it dirty at the time > when the writeback for the same page is still on going, as long as we > can ensure that even when the page is dirtied again, there won't be a > duplicate writeback IO for the same page when the previous writeback IO > has not completed yet. > I think we still need this folio_wait_writeback() since we're calling fuse_do_readfolio() and removing the folio_wait_writeback() from fuse_do_readfolio(). else we could read back stale data if the writeback hasn't gone through yet. I think we could probably move the folio_wait_writeback() here in fuse_write_begin() to be right before the fuse_do_readfolio() call and skip waiting on writeback if we hit the "success" gotos. Thanks, Joanne > > > -- > Thanks, > Jingbo