From mboxrd@z Thu Jan  1 00:00:00 1970
From: Joanne Koong <joannelkoong@gmail.com>
Date: Fri, 15 Nov 2024 10:29:56 -0800
Subject: Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
To: Jingbo Xu
Cc: miklos@szeredi.hu, linux-fsdevel@vger.kernel.org, shakeel.butt@linux.dev, josef@toxicpanda.com, linux-mm@kvack.org, bernd.schubert@fastmail.fm, kernel-team@meta.com
References: <20241107235614.3637221-1-joannelkoong@gmail.com> <20241107235614.3637221-7-joannelkoong@gmail.com> <47661fe5-8616-4133-8d9c-faeb1ab96962@linux.alibaba.com> <8cd0a96c-7d8f-4a38-afc6-2c48bc701da8@linux.alibaba.com>
In-Reply-To: <8cd0a96c-7d8f-4a38-afc6-2c48bc701da8@linux.alibaba.com>
Content-Type: text/plain; charset="UTF-8"

On Thu, Nov 14, 2024 at 6:18 PM Jingbo Xu wrote:
>
>
> On 11/15/24 2:19 AM, Joanne Koong wrote:
> > On Wed, Nov 13, 2024 at 5:47 PM Jingbo Xu wrote:
> >>
> >>
> >> On 11/14/24 8:39 AM, Joanne Koong wrote:
> >>> On Tue, Nov 12, 2024 at 1:25 AM Jingbo Xu wrote:
> >>>>
> >>>> Hi Joanne,
> >>>>
> >>>> On 11/8/24 7:56 AM, Joanne Koong wrote:
> >>>>> Currently, we allocate and copy data to a temporary
folio when
> >>>>> handling writeback in order to mitigate the following deadlock scenario
> >>>>> that may arise if reclaim waits on writeback to complete:
> >>>>> * single-threaded FUSE server is in the middle of handling a request
> >>>>>   that needs a memory allocation
> >>>>> * memory allocation triggers direct reclaim
> >>>>> * direct reclaim waits on a folio under writeback
> >>>>> * the FUSE server can't write back the folio since it's stuck in
> >>>>>   direct reclaim
> >>>>>
> >>>>> To work around this, we allocate a temporary folio and copy over the
> >>>>> original folio to the temporary folio so that writeback can be
> >>>>> immediately cleared on the original folio. This additionally requires us
> >>>>> to maintain an internal rb tree to keep track of writeback state on the
> >>>>> temporary folios.
> >>>>>
> >>>>> A recent change prevents reclaim logic from waiting on writeback for
> >>>>> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it.
> >>>>> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> >>>>> will prevent FUSE folios from running into the reclaim deadlock described
> >>>>> above) and removes the temporary folio + extra copying and the internal
> >>>>> rb tree.
> >>>>>
> >>>>> fio benchmarks --
> >>>>> (using averages observed from 10 runs, throwing away outliers)
> >>>>>
> >>>>> Setup:
> >>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >>>>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >>>>>
> >>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> >>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >>>>>
> >>>>> bs =       1k          4k          1M
> >>>>> Before     351 MiB/s   1818 MiB/s  1851 MiB/s
> >>>>> After      341 MiB/s   2246 MiB/s  2685 MiB/s
> >>>>> % diff     -3%         23%         45%
> >>>>>
> >>>>> Signed-off-by: Joanne Koong
> >>>>
> >>>> I think there are some places that check or wait for writeback which
> >>>> could be reconsidered as to whether they are still needed after we drop
> >>>> the temp page design.
> >>>>
> >>>> As they are inherited from the original implementation, they are at
> >>>> least harmless. I think they could remain in this patch and be cleaned
> >>>> up later if really needed.
> >>>
> >>> Thank you for the thorough inspection!
> >>>
> >>>>> @@ -891,7 +813,7 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
> >>>>>      * have writeback that extends beyond the lifetime of the folio. So
> >>>>>      * make sure we read a properly synced folio.
> >>>>>      */
> >>>>> -    fuse_wait_on_folio_writeback(inode, folio);
> >>>>> +    folio_wait_writeback(folio);
> >>>>
> >>>> I doubt whether the wait-on-writeback is needed here, as now the page
> >>>> cache won't be freed until the writeback IO completes.
> >>>>
> >>>> The routines that attempt to free page cache, e.g. invalidate_inode_pages2()
> >>>> (generally called by distributed filesystems when the file content has
> >>>> been modified remotely) or truncate_inode_pages() (called from
> >>>> truncate(2) or inode eviction), will wait for writeback completion (if
> >>>> any) before freeing the page.
> >>>>
> >>>> Thus I don't think there's any possibility that .read_folio() or
> >>>> .readahead() will be called when the writeback has not completed.
> >>>
> >>> Great point. I'll remove this line and the comment above it too.
> >>>
> >>>>> @@ -1172,7 +1093,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
> >>>>>      int err;
> >>>>>
> >>>>>      for (i = 0; i < ap->num_folios; i++)
> >>>>> -        fuse_wait_on_folio_writeback(inode, ap->folios[i]);
> >>>>> +        folio_wait_writeback(ap->folios[i]);
> >>>>
> >>>> Ditto.
> >>
> >> Actually this is a typo, and I originally meant that the waiting for
> >> writeback in fuse_send_readpages() is unneeded, as the page cache won't
> >> be freed until the writeback IO completes.
> >>
> >>> -    wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last));
> >>> +    filemap_fdatawait_range(inode->i_mapping, first, last);
> >
> > Gotcha and agreed. I'll remove this wait from readahead().
> >
> >> In fact, the above waiting for writeback in fuse_send_write_pages() is
> >> needed. The reason is as follows:
> >>
> >>> Why did we need this fuse_wait_on_folio_writeback() even when we had
> >>> the temp pages? If I'm understanding it correctly,
> >>> fuse_send_write_pages() is only called for the writethrough case (by
> >>> fuse_perform_write()), so these folios would never even be under
> >>> writeback, no?
> >>
> >> I think an mmap write could make the page dirty, and writeback could be
> >> triggered then.
> >
> > Ohhh I see, thanks for the explanation.
> >
> >>>
> >>>>
> >>>>> static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
> >>>>> -                                         struct folio *tmp_folio, uint32_t folio_index)
> >>>>> +                                         uint32_t folio_index)
> >>>>> {
> >>>>>      struct inode *inode = folio->mapping->host;
> >>>>>      struct fuse_args_pages *ap = &wpa->ia.ap;
> >>>>>
> >>>>> -    folio_copy(tmp_folio, folio);
> >>>>> -
> >>>>> -    ap->folios[folio_index] = tmp_folio;
> >>>>> +    folio_get(folio);
> >>>>
> >>>> I still think this folio_get() here is harmless but redundant.
> >>>>
> >>>> Ditto, the page cache won't be freed before writeback completes.
> >>>>
> >>>> Besides, other .writepages() implementations, e.g. iomap_writepages(),
> >>>> also don't get the refcount when constructing the writeback IO.
> >>>
> >>> Point taken. I'll remove this then, since other .writepages() also
> >>> don't obtain a refcount.
> >>>
> >>>>> @@ -2481,7 +2200,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
> >>>>>      if (IS_ERR(folio))
> >>>>>              goto error;
> >>>>>
> >>>>> -    fuse_wait_on_page_writeback(mapping->host, folio->index);
> >>>>> +    folio_wait_writeback(folio);
> >>>>
> >>>> I also doubt whether the wait_on_writeback() is needed here, as now
> >>>> there won't be duplicate writeback IOs for the same page.
> >>>
> >>> What prevents there from being a duplicate writeback write for the
> >>> same page? This is the path I'm looking at:
> >>>
> >>> ksys_write()
> >>>   vfs_write()
> >>>     new_sync_write()
> >>>       op->write_iter()
> >>>         fuse_file_write_iter()
> >>>           fuse_cache_write_iter()
> >>>             generic_file_write_iter()
> >>>               __generic_file_write_iter()
> >>>                 generic_perform_write()
> >>>                   op->write_begin()
> >>>                     fuse_write_begin()
> >>>
> >>> but I'm not seeing where there's anything that prevents a duplicate
> >>> write from happening.
> >>
> >> I mean there won't be a duplicate *writeback* rather than *write* for the
> >> same page.
> >> You could write the page cache and make it dirty at the time
> >> when the writeback for the same page is still ongoing, as long as we
> >> can ensure that even when the page is dirtied again, there won't be a
> >> duplicate writeback IO for the same page when the previous writeback IO
> >> has not completed yet.
> >
> > I think we still need this folio_wait_writeback() since we're calling
> > fuse_do_readfolio() and removing the folio_wait_writeback() from
> > fuse_do_readfolio(), else we could read back stale data if the
> > writeback hasn't gone through yet.
> > I think we could probably move the folio_wait_writeback() here in
> > fuse_write_begin() to be right before the fuse_do_readfolio() call and
> > skip waiting on writeback if we hit the "success" gotos.
>
> IIUC, if there is a cache hit when fuse_write_begin() is called, the page
> is marked with PG_uptodate, which indicates that the IO request
> (.readpage()) reading data from disk into the page is already done. Thus
> fuse_write_begin() will do nothing and return directly.
>
> >     if (folio_test_uptodate(folio) || len >= folio_size(folio))
> >             goto success;
>
> Otherwise, fuse_do_readfolio() is called to read data on a cache miss,
> i.e. the page doesn't exist in the page cache or it was invalidated
> previously. Since we can ensure that no page can be invalidated until the
> writeback IO completes, we can ensure that there's no in-flight writeback
> IO when fuse_do_readfolio() is called.
>
> Other folks can also correct me if I get this wrong, since I have also
> been trying to figure out all these details of page cache management
> recently.

You're right, this writeback wait is unneeded. I'll remove it for v5.

Thanks,
Joanne

> --
> Thanks,
> Jingbo