From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 89758D75BA5
	for <linux-mm@archiver.kernel.org>; Thu, 21 Nov 2024 03:08:44 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 880C56B007B; Wed, 20 Nov 2024 22:08:43 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 8308C6B0083; Wed, 20 Nov 2024 22:08:43 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 6F7AF6B0085; Wed, 20 Nov 2024 22:08:43 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15])
	by kanga.kvack.org (Postfix) with ESMTP id 51D906B007B
	for <linux-mm@kvack.org>; Wed, 20 Nov 2024 22:08:43 -0500 (EST)
Received: from smtpin09.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay07.hostedemail.com (Postfix) with ESMTP id C3051160AF2
	for <linux-mm@kvack.org>; Thu, 21 Nov 2024 03:08:42 +0000 (UTC)
X-FDA: 82808616786.09.6815E05
Received: from out30-119.freemail.mail.aliyun.com (out30-119.freemail.mail.aliyun.com [115.124.30.119])
	by imf09.hostedemail.com (Postfix) with ESMTP id 90C8D14000A
	for <linux-mm@kvack.org>; Thu, 21 Nov 2024 03:08:03 +0000 (UTC)
Authentication-Results: imf09.hostedemail.com;
	dkim=pass header.d=linux.alibaba.com header.s=default header.b="LEH+/mPi";
	dmarc=pass (policy=none) header.from=linux.alibaba.com;
	spf=pass (imf09.hostedemail.com: domain of jefflexu@linux.alibaba.com designates 115.124.30.119 as permitted sender) smtp.mailfrom=jefflexu@linux.alibaba.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1732158335;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=KCXXX5o9L2eLUDNzQHtOaw7+2ZdwxM3iR4LbK/u0Zys=;
	b=v0gk7uC03T8SPaDu+bGADA6p+JEVESPS8iQR5VJoJIrv0SlAgJ5RJYLDyIt2U2Eo6rjlZM
	prTlTz71dewh5Q0huK2/zm/EFskctKp5oB+jNa1n4WAQFDxJXtOcj2l9X/7Td34L3YkE5k
	3WK55SkN+c30xDUfFVAmbFOgWk9RtA4=
ARC-Authentication-Results: i=1;
	imf09.hostedemail.com;
	dkim=pass header.d=linux.alibaba.com header.s=default header.b="LEH+/mPi";
	dmarc=pass (policy=none) header.from=linux.alibaba.com;
	spf=pass (imf09.hostedemail.com: domain of jefflexu@linux.alibaba.com designates 115.124.30.119 as permitted sender) smtp.mailfrom=jefflexu@linux.alibaba.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1732158335; a=rsa-sha256;
	cv=none;
	b=lK40g+JH0iyvXVPlX6uUNEF0shl3uExQtgNDsT++6wrB3CwBN42WPS6Mh4Pn0J0kuiXwdJ
	1IVQ/0bRBDsn5kUq8Ebgmc1q5Cjt9vsEqbue0k78XIjno92/u1S3n1g+6DOjALRCHKZQPq
	pQghn/eU4ELP4nkHlHN0dp2V9Zk99zg=
DKIM-Signature:v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=linux.alibaba.com; s=default;
	t=1732158516; h=Message-ID:Date:MIME-Version:Subject:To:From:Content-Type;
	bh=KCXXX5o9L2eLUDNzQHtOaw7+2ZdwxM3iR4LbK/u0Zys=;
	b=LEH+/mPipiBJDZzDe7i+76c/ZDEjS8+svRqz+mU3vuv8Wir6ijEn57vv2SWhYNj/K4T2geOqn866B1qnMtIiOiwA878BXmZ91S/vaIKDNs1Ub+u3fal6LtBG3SBNDGZlsbwcglOB68EcaCc9JiKBE9udU455XXgvK5u1nwLCW4E=
Received: from 30.221.144.254(mailfrom:jefflexu@linux.alibaba.com fp:SMTPD_---0WJuAsIM_1732158514 cluster:ay36)
          by smtp.aliyun-inc.com;
          Thu, 21 Nov 2024 11:08:35 +0800
Message-ID: <53277184-6c03-469a-bb5e-3249460258ba@linux.alibaba.com>
Date: Thu, 21 Nov 2024 11:08:32 +0800
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v5 5/5] fuse: remove tmp folio for writebacks and internal
 rb tree
To: Joanne Koong <joannelkoong@gmail.com>
Cc: miklos@szeredi.hu, linux-fsdevel@vger.kernel.org, shakeel.butt@linux.dev,
 josef@toxicpanda.com, linux-mm@kvack.org, bernd.schubert@fastmail.fm,
 kernel-team@meta.com
References: <20241115224459.427610-1-joannelkoong@gmail.com>
 <20241115224459.427610-6-joannelkoong@gmail.com>
 <cad4a8b3-8065-4187-875f-1810263b988c@linux.alibaba.com>
 <CAJnrk1aiNZM_JhCwNX+XCdBWsqWxujLi3sUYaQEuN-qnA2gneQ@mail.gmail.com>
Content-Language: en-US
From: Jingbo Xu <jefflexu@linux.alibaba.com>
In-Reply-To: <CAJnrk1aiNZM_JhCwNX+XCdBWsqWxujLi3sUYaQEuN-qnA2gneQ@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Rspamd-Server: rspam04
X-Rspamd-Queue-Id: 90C8D14000A
X-Stat-Signature: 59wwq6awfjeykx4p1c6soom99kc1e9jm
X-Rspam-User: 
X-HE-Tag: 1732158483-426421
X-HE-Meta: U2FsdGVkX19nP/esFQoxVDggB13YpFw3NgfcE1GmwXH3nPupDMivUdc4lxJn/SxZcIajNqx8Nqg70IYBP2bY4RzGSUXkv8KqgKZxZsOLX6Mgvdn/Qn+sTS7f4mN1NWsc7Yc3c6P/li2b5r4zLGQmWeUqt51aVjOEVNcnqPuAu+vFhCAYnqw84AiKzZnX2YWEjEVBwA8wNyiB4plmD4qY37JDeaeszs+6w593klKksFmIfC04S+JfVCS9T6fZYMTgzOEstd6nlHZWr93aUIS0r/zFQfG9c8yqZQnYbiKk+7Ptc3h2VqTLxpyYkTakNQ3S6YmkOizhWuNz41/CJujcY6iuO8S8U5b/k+qIxt76rIEdNvHB9swOWC/qfvVxcTo/pqw18WAg08OPl37K9OjkcsvDvaJ8ZG5O1OukCq90B/COmp8WuXUzYQeW/zF08Xgc1j9/qGgMwUQTKh4iMoKc3clczAEdQsMCZvyPLVR8SZGA6DMygv9GZn0AbEzSg5ujCe3hX38aPJ+l/02ywdafILFycnrDjSLrKYmPou5fFiFNDbJCq+NB54sYcoQvofbsHhmO9azh9F6906Nw+IoLpcu/ZtKaYwA2cRbNvPdoHGkkJi8fD3I96Gt12sBbIdaVtQhfiT/rqNlOi1F9owJCtqeNUl1WQ7bKr2nVoMPa7nyDcpfb9ejE6g47LgSSZ89e1L8MaII5uAi2qnKSl9hNlNuExcpkVfWAjZDT/KEUAhXC71MVUc+qo5SEXFmHJ1oh0ZXZIX5XWiuYaZ6veWiXXLdhAAZ6uGAlMK+CEPu4BNADQElS+RLxa8UbnqKWI8l8yFbE7fak0+O8xwG8R+BmXTfs1vVWms7MT1wWYz9xOmn2Pr2MKvMqnhWMIwiPupsrfpn+BTicQI+f5j07YytFGkx1PxrlpgFK4zABYlcq396d6YGE3rjKeDsJQlkIDn7e5Goj9LmWHbsC8QBuI74
 QxzMKO98
 Aw+qVOPu/mUwJGi0uc1kJdeHKIgf6Lqbkr4rVB4rZwswzDNRSYpp+igdM4iv2b1SG4t+xWR3OuM1I1gI3lvRbmunDFzBv/eRb3B6Fi2wHpR2VRpAsCEF67nCLdko8hBcIy6OMSnONe71WtzDa32G1nJFv0CHEyaiHnKOWfbQq1IOzYE//40yHxyK7oSRVKJn6tkoT+4d356JXlBkdjrJ0+IxVzcE8QJOLbrwN1Etwl7x20v3zLMbHJV6p+h1hb43QecrQIfrhBi+XSMcHnsU/XSL5Tsl2S1tuUNYK4S6crGobJLW8ne+tuWbNceOrm9SQNlqCAAJ4rWA0o3WezFWnzT2vpHqNMOHgv41Y
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>


On 11/21/24 5:53 AM, Joanne Koong wrote:
> On Wed, Nov 20, 2024 at 1:56 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> On 11/16/24 6:44 AM, Joanne Koong wrote:
>>> In the current FUSE writeback design (see commit 3be5a52b30aa
>>> ("fuse: support writable mmap")), a temp page is allocated for every
>>> dirty page to be written back, the contents of the dirty page are copied over
>>> to the temp page, and the temp page gets handed to the server to write back.
>>>
>>> This is done so that writeback may be immediately cleared on the dirty page,
>>> and this in turn is done for two reasons:
>>> a) in order to mitigate the following deadlock scenario that may arise
>>> if reclaim waits on writeback on the dirty page to complete:
>>> * single-threaded FUSE server is in the middle of handling a request
>>>   that needs a memory allocation
>>> * memory allocation triggers direct reclaim
>>> * direct reclaim waits on a folio under writeback
>>> * the FUSE server can't write back the folio since it's stuck in
>>>   direct reclaim
>>> b) in order to unblock internal (eg sync, page compaction) waits on
>>> writeback without needing the server to complete writing back to disk,
>>> which may take an indeterminate amount of time.
>>>
>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
>>> the situations described above, FUSE writeback does not need to use
>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
>>>
>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
>>> and removes the temporary pages + extra copying and the internal rb
>>> tree.
>>>
>>> fio benchmarks --
>>> (using averages observed from 10 runs, throwing away outliers)
>>>
>>> Setup:
>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
>>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>>>
>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>>>
>>>         bs =  1k          4k            1M
>>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
>>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
>>> % diff        -3%          23%         45%
>>>
>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>> ---
>>>  fs/fuse/file.c | 339 +++----------------------------------------------
>>>  1 file changed, 20 insertions(+), 319 deletions(-)
>>>
>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>> index 88d0946b5bc9..56289ac58596 100644
>>> --- a/fs/fuse/file.c
>>> +++ b/fs/fuse/file.c
>>> @@ -1172,7 +1082,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
>>>       int err;
>>>
>>>       for (i = 0; i < ap->num_folios; i++)
>>> -             fuse_wait_on_folio_writeback(inode, ap->folios[i]);
>>> +             folio_wait_writeback(ap->folios[i]);
>>>
>>>       fuse_write_args_fill(ia, ff, pos, count);
>>>       ia->write.in.flags = fuse_write_flags(iocb);
>>> @@ -1622,7 +1532,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>>>                       return res;
>>>               }
>>>       }
>>> -     if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) {
>>> +     if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + count - 1))) {
>>>               if (!write)
>>>                       inode_lock(inode);
>>>               fuse_sync_writes(inode);
>>> @@ -1825,7 +1735,7 @@ static void fuse_writepage_free(struct fuse_writepage_args *wpa)
>>>               fuse_sync_bucket_dec(wpa->bucket);
>>>
>>>       for (i = 0; i < ap->num_folios; i++)
>>> -             folio_put(ap->folios[i]);
>>> +             folio_end_writeback(ap->folios[i]);
>>
>> I noticed that if we folio_end_writeback() in fuse_writepage_finish()
>> (rather than fuse_writepage_free()), there's ~50% buffer write
>> bandwridth performance gain (5500MB -> 8500MB)[*]
>>
>> The fuse server is generally implemented in multi-thread style, and
>> multi (fuse server) worker threads could fetch and process FUSE_WRITE
>> requests of one fuse inode.  Then there's serious lock contention for
>> the xarray lock (of the address space) when these multi worker threads
>> call fuse_writepage_end->folio_end_writeback when they are sending
>> replies of FUSE_WRITE requests.
>>
>> The lock contention is greatly alleviated when folio_end_writeback() is
>> serialized with fi->lock.  IOWs in the current implementation
>> (folio_end_writeback() in fuse_writepage_free()), each worker thread
>> needs to compete for the xarray lock for 256 times (one fuse request can
>> contain at most 256 pages if FUSE_MAX_MAX_PAGES is 256) when completing
>> a FUSE_WRITE request.
>>
>> After moving folio_end_writeback() to fuse_writepage_finish(), each
>> worker thread needs to compete for fi->lock only once.  IOWs the locking
>> granularity is larger now.
>>
> 
> Interesting! Thanks for sharing. Are you able to consistently repro
> these results and on different machines? When I run it locally on my
> machine using the commands you shared, I'm seeing roughly the same
> throughput:
> 
> Current implementation (folio_end_writeback() in fuse_writepage_free()):
>   WRITE: bw=385MiB/s (404MB/s), 385MiB/s-385MiB/s (404MB/s-404MB/s),
> io=113GiB (121GB), run=300177-300177msec
>   WRITE: bw=384MiB/s (403MB/s), 384MiB/s-384MiB/s (403MB/s-403MB/s),
> io=113GiB (121GB), run=300178-300178msec
> 
> fuse_end_writeback() in fuse_writepage_finish():
>   WRITE: bw=387MiB/s (406MB/s), 387MiB/s-387MiB/s (406MB/s-406MB/s),
> io=113GiB (122GB), run=300165-300165msec
>   WRITE: bw=381MiB/s (399MB/s), 381MiB/s-381MiB/s (399MB/s-399MB/s),
> io=112GiB (120GB), run=300143-300143msec
> 
> I wonder if it's because your machine is so much faster that lock
> contention makes a difference for you whereas on my machine there's
> other things that slow it down before lock contention comes into play.

Yeah, I agree that the lock contention matters only when the writeback
kworker consumes 100% CPU, i.e. when the writeback kworker is the
bottleneck.  To expose that, the passthrough_hp daemon works in
benchmark[*] mode (I noticed that passthrough_hp can be the bottleneck
when disabling "--bypass-rw" mode).

[*]
https://github.com/libfuse/libfuse/pull/807/commits/e83789cc6e83ca42ccc9899c4f7f8c69f31cbff9


> 
> I see your point about why it would make sense that having
> folio_end_writeback() in fuse_writepage_finish() inside the scope of
> the fi->lock could make it faster, but I also could see how having it
> outside the lock could make it faster as well. I'm thinking about the
> scenario where if there's 8 threads all executing
> fuse_send_writepage() at the same time, calling folio_end_writeback()
> outside the fi->lock would unblock other threads trying to get the
> fi->lock and that other thread could execute while
> folio_end_writeback() gets executed.
> 
> Looking at it some more, it seems like it'd be useful if there was
> some equivalent api to folio_end_writeback() that takes in an array of
> folios and would only need to grab the xarray lock once to clear
> writeback on all the folios in the array.

Yes it's exactly what we need.


> 
> When fuse supports large folios [*] this will help lock contention on
> the xarray lock as well because there'll be less folio_end_writeback()
> calls.

Cool, it definitely helps.


> 
> I'm happy to move the fuse_end_writeback() call to
> fuse_writepage_finish() considering what you're seeing. 5500 Mb ->
> 8800 Mb is a huge perf improvement!

This statistics is tested in benchmark ("--bypass-rw") mode.  When
disabling "--bypass-rw" mode and testing fuse passthrough_hp over a ext4
over nvme, the performance gain is ~10% (4009MB/s ->4428MB/s).


>>> @@ -2367,54 +2111,23 @@ static int fuse_writepages_fill(struct folio *folio,
>>>               data->wpa = NULL;
>>>       }
>>>
>>> -     err = -ENOMEM;
>>> -     tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
>>> -     if (!tmp_folio)
>>> -             goto out_unlock;
>>> -
>>> -     /*
>>> -      * The page must not be redirtied until the writeout is completed
>>> -      * (i.e. userspace has sent a reply to the write request).  Otherwise
>>> -      * there could be more than one temporary page instance for each real
>>> -      * page.
>>> -      *
>>> -      * This is ensured by holding the page lock in page_mkwrite() while
>>> -      * checking fuse_page_is_writeback().  We already hold the page lock
>>> -      * since clear_page_dirty_for_io() and keep it held until we add the
>>> -      * request to the fi->writepages list and increment ap->num_folios.
>>> -      * After this fuse_page_is_writeback() will indicate that the page is
>>> -      * under writeback, so we can release the page lock.
>>> -      */
>>>       if (data->wpa == NULL) {
>>>               err = -ENOMEM;
>>>               wpa = fuse_writepage_args_setup(folio, data->ff);
>>> -             if (!wpa) {
>>> -                     folio_put(tmp_folio);
>>> +             if (!wpa)
>>>                       goto out_unlock;
>>> -             }
>>>               fuse_file_get(wpa->ia.ff);
>>>               data->max_folios = 1;
>>>               ap = &wpa->ia.ap;
>>>       }
>>>       folio_start_writeback(folio);
>>
>> There's also a lock contention for the xarray lock when calling
>> folio_start_writeback().
>>
>> I also noticed a strange thing that, if we lock fi->lock and unlock
>> immediately, the write bandwidth improves by 5% (8500MB -> 9000MB).  The
> 
> Interesting! By lock fi->lock and unlock immediately, do you mean
> locking it, then unlocking it, then calling folio_start_writeback() or
> locking it, calling folio_start_writeback() and then unlocking it?

Either way works, as long as we lock/unlock fi->lock in
fuse_writepages_fill()...  The lock contention is further alleviated
when folio_start_writeback() is inside the critical area of fi->lock.


-- 
Thanks,
Jingbo