From: Miklos Szeredi
Date: Fri, 13 Dec 2024 12:52:44 +0100
Subject: Re: [PATCH v6 0/5] fuse: remove temp page copies in writeback
To: Joanne Koong
Cc: linux-fsdevel@vger.kernel.org, shakeel.butt@linux.dev, jefflexu@linux.alibaba.com, josef@toxicpanda.com, bernd.schubert@fastmail.fm, linux-mm@kvack.org, kernel-team@meta.com
In-Reply-To: <20241122232359.429647-1-joannelkoong@gmail.com>

On Sat, 23 Nov 2024 at 00:24, Joanne Koong wrote:
>
> The purpose of this patchset is to help make writeback-cache write
> performance in FUSE filesystems as fast as possible.
>
> In the current FUSE writeback design (see commit 3be5a52b30aa
> ("fuse: support writable mmap")), a temp page is allocated for every dirty
> page to be written back, the contents of the dirty page are copied over to the
> temp page, and the temp page gets handed to the server to write back.
> This is done so that writeback may be immediately cleared on the dirty page,
> and this in turn is done for two reasons:
> a) in order to mitigate the following deadlock scenario that may arise if
> reclaim waits on writeback on the dirty page to complete (more details can be
> found in this thread [1]):
> * a single-threaded FUSE server is in the middle of handling a request
>   that needs a memory allocation
> * the memory allocation triggers direct reclaim
> * direct reclaim waits on a folio under writeback
> * the FUSE server can't write back the folio since it's stuck in
>   direct reclaim
> b) in order to unblock internal (e.g. sync, page compaction) waits on writeback
> without needing the server to complete writing back to disk, which may take
> an indeterminate amount of time.
>
> Allocating and copying dirty pages to temp pages is the biggest performance
> bottleneck for FUSE writeback. This patchset aims to get rid of the temp page
> altogether (which will also allow us to get rid of the internal FUSE rb tree
> that is needed to keep track of writeback status on the temp pages).
> Benchmarks show approximately a 20% improvement in throughput for 4k
> block-size writes and a 45% improvement for 1M block-size writes.
>
> With the temp page removed, writeback state is only cleared on the dirty
> page after the server has written it back to disk. This may take an
> indeterminate amount of time. There is also the possibility of malicious,
> or well-intentioned but buggy, servers where writeback may, in the worst
> case, never complete. This means that any folio_wait_writeback() on a dirty
> page belonging to a FUSE filesystem needs to be carefully audited.
>
> In particular, these are the cases that need to be accounted for:
> * potentially deadlocking in reclaim, as mentioned above
> * potentially stalling sync(2)
> * potentially stalling page migration / compaction
>
> This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
> filesystems may set on their inode mappings to indicate that writeback
> operations may take an indeterminate amount of time to complete. FUSE will set
> this flag on its mappings. This patchset adds checks to the critical parts of
> reclaim, sync, and page migration logic where writeback may be waited on.
>
> Please note the following:
> * For sync(2), waiting on writeback will be skipped for FUSE, but this has no
>   effect on existing behavior. Dirty FUSE pages are already not guaranteed to
>   be written to disk by the time sync(2) returns (e.g. writeback is cleared on
>   the dirty page but the server may not have written out the temp page to disk
>   yet). If the caller wishes to ensure the data has actually been synced to
>   disk, they should use fsync(2)/fdatasync(2) instead.
> * AS_WRITEBACK_INDETERMINATE does not indicate that the folios should never be
>   waited on when in writeback. There are some cases where the wait is
>   desirable. For example, for the sync_file_range() syscall, it is fine to
>   wait on the writeback since the caller passes in an fd for the operation.

Looks good, thanks.

Acked-by: Miklos Szeredi

I think this should go via the mm tree.

Thanks,
Miklos