Date: Tue, 15 Apr 2025 16:08:19 +0800
Subject: Re: [PATCH v8 0/2] fuse: remove temp page copies in writeback
From: Jingbo Xu
To: Joanne Koong, miklos@szeredi.hu, akpm@linux-foundation.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Cc: shakeel.butt@linux.dev, david@redhat.com, bernd.schubert@fastmail.fm,
 ziy@nvidia.com, jlayton@kernel.org, kernel-team@meta.com
References: <20250414222210.3995795-1-joannelkoong@gmail.com>
In-Reply-To: <20250414222210.3995795-1-joannelkoong@gmail.com>

On 4/15/25 6:22 AM, Joanne Koong wrote:
> The purpose of this patchset is to help make writeback in FUSE filesystems as
> fast as possible.
>
> In the current FUSE writeback design (see commit 3be5a52b30aa
> ("fuse: support writable mmap")), a temp page is allocated for every dirty
> page to be written back, the contents of the dirty page are copied over to the
> temp page, and the temp page gets handed to the server to write back. This is
> done so that writeback may be immediately cleared on the dirty page, and this
> in turn is done in order to mitigate the following deadlock scenario that may
> arise if reclaim waits on writeback on the dirty page to complete (more
> details can be found in this thread [1]):
> * single-threaded FUSE server is in the middle of handling a request
>   that needs a memory allocation
> * memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback
> * the FUSE server can't write back the folio since it's stuck in
>   direct reclaim
>
> Allocating and copying dirty pages to temp pages is the biggest performance
> bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
> altogether (which will also allow us to get rid of the internal FUSE rb tree
> that is needed to keep track of writeback status on the temp pages).
> Benchmarks show approximately a 20% improvement in throughput for 4k
> block-size writes and a 45% improvement for 1M block-size writes.
>
> In the current reclaim code, there is one scenario where writeback is waited
> on, which is the case where the system is running legacy cgroup v1 and reclaim
> encounters a folio that already has the reclaim flag set and the caller did
> not have __GFP_FS (or __GFP_IO if swap) set.
>
> This patchset adds a new mapping flag, AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM,
> which filesystems may set on their inode mappings to indicate that reclaim
> should not wait on writeback. FUSE will set this flag on its mappings. Reclaim
> for the legacy cgroup v1 case described above will skip reclaim of folios with
> that flag set.
> With this flag set, FUSE can now remove temp pages altogether.
>
> With this change, writeback state is now only cleared on the dirty page after
> the server has written it back to disk. If the server is deliberately
> malicious or well-intentioned but buggy, this may stall sync(2) and page
> migration, but for sync(2), a malicious server may already stall this by not
> replying to the FUSE_SYNCFS request and for page migration, there are already
> many easier ways to stall this by having FUSE permanently hold the folio lock.
> A fuller discussion on this can be found in [2]. Long-term, there needs to be
> a more comprehensive solution for addressing migration of FUSE pages that
> handles all scenarios where FUSE may permanently hold the lock, but that is
> outside the scope of this patchset and will be done as future work. Please
> also note that this change now ensures that when sync(2) returns, FUSE
> filesystems will have persisted writeback changes.
>
> For this patchset, it would be ideal if the first patch could be taken by
> Andrew to the mm tree and the second patch could be taken by Miklos into the
> fuse tree, as the fuse large folios patchset [3] depends on the second patch.
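
The reclaim rule described in the cover letter can be modeled with a short
userspace C sketch. This is an illustration only, not code from the patch:
the flag name comes from the text above, while the bit number, the struct
layouts and the helper names (mapping_set_may_deadlock, mapping_may_deadlock,
reclaim_would_wait) are assumptions made for the example. The boolean
legacy_wait_case stands in for "reclaim is in the legacy cgroup v1 situation
in which it would wait on writeback".

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace model only, not kernel code. The flag name is taken from the
 * cover letter; the bit number, structs and helper names are assumptions.
 */
#define AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM 0 /* assumed bit number */

struct mapping_model {
	unsigned long flags;
};

struct folio_model {
	struct mapping_model *mapping; /* may be NULL */
	bool under_writeback;
};

/* What a filesystem such as FUSE would do for its inode mappings. */
static void mapping_set_may_deadlock(struct mapping_model *m)
{
	m->flags |= 1UL << AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM;
}

static bool mapping_may_deadlock(const struct mapping_model *m)
{
	return m != NULL &&
	       (m->flags & (1UL << AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM)) != 0;
}

/*
 * Returns true if reclaim would wait on writeback for this folio.
 * legacy_wait_case is true when reclaim is in the legacy cgroup v1
 * situation in which it would otherwise wait on writeback.
 */
static bool reclaim_would_wait(const struct folio_model *f,
			       bool legacy_wait_case)
{
	if (!f->under_writeback || !legacy_wait_case)
		return false;
	/* Flagged mappings are skipped instead of waited on. */
	return !mapping_may_deadlock(f->mapping);
}
```

In this model, for a folio under writeback with legacy_wait_case true,
reclaim_would_wait() returns false once mapping_set_may_deadlock() has been
called on the folio's mapping, and true for a mapping without the flag.
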
>
> Thanks,
> Joanne
>
> [1] https://lore.kernel.org/linux-kernel/495d2400-1d96-4924-99d3-8b2952e05fc3@linux.alibaba.com/
> [2] https://lore.kernel.org/linux-fsdevel/20241122232359.429647-1-joannelkoong@gmail.com/
> [3] https://lore.kernel.org/linux-fsdevel/20241213221818.322371-1-joannelkoong@gmail.com/
>
> Changelog
> ---------
> v7: https://lore.kernel.org/linux-fsdevel/20250404181443.1363005-1-joannelkoong@gmail.com/
> Changes from v7 -> v8:
> * Rename from AS_WRITEBACK_INDETERMINATE to
>   AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM (David) and merge patch 1 + 2
> * Remove unnecessary fuse_sync_writes() call in fuse_flush() (Jingbo)
>
> v6: https://lore.kernel.org/linux-fsdevel/20241122232359.429647-1-joannelkoong@gmail.com/
> Changes from v6 -> v7:
> * Drop migration and sync patches, as they are useless if a server is
>   determined to be malicious
>
> v5: https://lore.kernel.org/linux-fsdevel/20241115224459.427610-1-joannelkoong@gmail.com/
> Changes from v5 -> v6:
> * Add Shakeel and Jingbo's reviewed-bys
> * Move folio_end_writeback() to fuse_writepage_finish() (Jingbo)
> * Embed fuse_writepage_finish_stat() logic inline (Jingbo)
> * Remove node_stat NR_WRITEBACK inc/sub (Jingbo)
>
> v4: https://lore.kernel.org/linux-fsdevel/20241107235614.3637221-1-joannelkoong@gmail.com/
> Changes from v4 -> v5:
> * AS_WRITEBACK_MAY_BLOCK -> AS_WRITEBACK_INDETERMINATE (Shakeel)
> * Drop memory hotplug patch (David and Shakeel)
> * Remove some more unnecessary writeback waits in fuse code (Jingbo)
> * Make commit message for reclaim patch more concise - drop part about
>   deadlock and just focus on how it may stall waits
>
> v3: https://lore.kernel.org/linux-fsdevel/20241107191618.2011146-1-joannelkoong@gmail.com/
> Changes from v3 -> v4:
> * Use filemap_fdatawait_range() instead of filemap_range_has_writeback() in
>   readahead
>
> v2: https://lore.kernel.org/linux-fsdevel/20241014182228.1941246-1-joannelkoong@gmail.com/
> Changes from v2 -> v3:
> * Account for sync and page migration cases as well (Miklos)
> * Change AS_NO_WRITEBACK_RECLAIM to the more generic AS_WRITEBACK_MAY_BLOCK
> * For fuse inodes, set mapping_writeback_may_block only if fc->writeback_cache
>   is enabled
>
> v1: https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@gmail.com/T/#t
> Changes from v1 -> v2:
> * Have flag in "enum mapping_flags" instead of creating asop_flags (Shakeel)
> * Set fuse inodes to use AS_NO_WRITEBACK_RECLAIM (Shakeel)
>
> Joanne Koong (2):
>   mm: skip folio reclaim in legacy memcg contexts for deadlockable
>     mappings
>   fuse: remove tmp folio for writebacks and internal rb tree
>
>  fs/fuse/file.c          | 364 ++++------------------------------------
>  fs/fuse/fuse_i.h        |   3 -
>  include/linux/pagemap.h |  11 ++
>  mm/vmscan.c             |  12 +-
>  4 files changed, 48 insertions(+), 342 deletions(-)
>

LGTM.

Reviewed-by: Jingbo Xu

-- 
Thanks,
Jingbo