Date: Wed, 20 Nov 2024 17:56:09 +0800
Subject: Re: [PATCH v5 5/5] fuse: remove tmp folio for writebacks and internal rb tree
From: Jingbo Xu <jefflexu@linux.alibaba.com>
To: Joanne Koong, miklos@szeredi.hu, linux-fsdevel@vger.kernel.org
Cc: shakeel.butt@linux.dev, josef@toxicpanda.com, linux-mm@kvack.org,
    bernd.schubert@fastmail.fm, kernel-team@meta.com
References: <20241115224459.427610-1-joannelkoong@gmail.com>
    <20241115224459.427610-6-joannelkoong@gmail.com>
In-Reply-To: <20241115224459.427610-6-joannelkoong@gmail.com>
Content-Type: text/plain; charset=UTF-8

On 11/16/24 6:44 AM, Joanne Koong wrote:
> In the current FUSE writeback design (see commit 3be5a52b30aa
> ("fuse: support writable mmap")), a temp page is allocated for every
> dirty page to be written back, the contents of the dirty page are
> copied over to the temp page, and the temp page gets handed to the
> server to write back.
>
> This is done so that writeback may be immediately cleared on the dirty
> page, and this in turn is done for two reasons:
> a) in order to mitigate the following deadlock scenario that may arise
>    if reclaim waits on writeback on the dirty page to complete:
>    * single-threaded FUSE server is in the middle of handling a request
>      that needs a memory allocation
>    * memory allocation triggers direct reclaim
>    * direct reclaim waits on a folio under writeback
>    * the FUSE server can't write back the folio since it's stuck in
>      direct reclaim
> b) in order to unblock internal (eg sync, page compaction) waits on
>    writeback without needing the server to complete writing back to
>    disk, which may take an indeterminate amount of time.
>
> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
> the situations described above, FUSE writeback does not need to use
> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
>
> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
> and removes the temporary pages + extra copying and the internal rb
> tree.
>
> fio benchmarks --
> (using averages observed from 10 runs, throwing away outliers)
>
> Setup:
> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 \
>     -o source=~/tmp_mount ~/fuse_mount
>
> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>
> bs =      1k          4k          1M
> Before    351 MiB/s   1818 MiB/s  1851 MiB/s
> After     341 MiB/s   2246 MiB/s  2685 MiB/s
> % diff    -3%         23%         45%
>
> Signed-off-by: Joanne Koong
> ---
>  fs/fuse/file.c | 339 +++----------------------------------------------
>  1 file changed, 20 insertions(+), 319 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 88d0946b5bc9..56289ac58596 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
>  
>  struct fuse_writepage_args {
>  	struct fuse_io_args ia;
> -	struct rb_node writepages_entry;
>  	struct list_head queue_entry;
> -	struct fuse_writepage_args *next;
>  	struct inode *inode;
>  	struct fuse_sync_bucket *bucket;
>  };
>  
> -static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
> -						pgoff_t idx_from, pgoff_t idx_to)
> -{
> -	struct rb_node *n;
> -
> -	n = fi->writepages.rb_node;
> -
> -	while (n) {
> -		struct fuse_writepage_args *wpa;
> -		pgoff_t curr_index;
> -
> -		wpa = rb_entry(n, struct fuse_writepage_args, writepages_entry);
> -		WARN_ON(get_fuse_inode(wpa->inode) != fi);
> -		curr_index = wpa->ia.write.in.offset >> PAGE_SHIFT;
> -		if (idx_from >= curr_index + wpa->ia.ap.num_folios)
> -			n = n->rb_right;
> -		else if (idx_to < curr_index)
> -			n = n->rb_left;
> -		else
> -			return wpa;
> -	}
> -	return NULL;
> -}
> -
> -/*
> - * Check if any page in a range is under writeback
> - */
> -static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
> -				   pgoff_t idx_to)
> -{
> -	struct fuse_inode *fi = get_fuse_inode(inode);
> -	bool found;
> -
> -	if (RB_EMPTY_ROOT(&fi->writepages))
> -		return false;
> -
> -	spin_lock(&fi->lock);
> -	found = fuse_find_writeback(fi, idx_from, idx_to);
> -	spin_unlock(&fi->lock);
> -
> -	return found;
> -}
> -
> -static inline bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
> -{
> -	return fuse_range_is_writeback(inode, index, index);
> -}
> -
> -/*
> - * Wait for page writeback to be completed.
> - *
> - * Since fuse doesn't rely on the VM writeback tracking, this has to
> - * use some other means.
> - */
> -static void fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
> -{
> -	struct fuse_inode *fi = get_fuse_inode(inode);
> -
> -	wait_event(fi->page_waitq, !fuse_page_is_writeback(inode, index));
> -}
> -
> -static inline bool fuse_folio_is_writeback(struct inode *inode,
> -					struct folio *folio)
> -{
> -	pgoff_t last = folio_next_index(folio) - 1;
> -
> -	return fuse_range_is_writeback(inode, folio_index(folio), last);
> -}
> -
> -static void fuse_wait_on_folio_writeback(struct inode *inode,
> -					struct folio *folio)
> -{
> -	struct fuse_inode *fi = get_fuse_inode(inode);
> -
> -	wait_event(fi->page_waitq, !fuse_folio_is_writeback(inode, folio));
> -}
> -
>  /*
>   * Wait for all pending writepages on the inode to finish.
>   *
> @@ -886,13 +808,6 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
>  	ssize_t res;
>  	u64 attr_ver;
>  
> -	/*
> -	 * With the temporary pages that are used to complete writeback, we can
> -	 * have writeback that extends beyond the lifetime of the folio. So
> -	 * make sure we read a properly synced folio.
> -	 */
> -	fuse_wait_on_folio_writeback(inode, folio);
> -
>  	attr_ver = fuse_get_attr_version(fm->fc);
>  
>  	/* Don't overflow end offset */
> @@ -1003,17 +918,12 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file)
>  static void fuse_readahead(struct readahead_control *rac)
>  {
>  	struct inode *inode = rac->mapping->host;
> -	struct fuse_inode *fi = get_fuse_inode(inode);
>  	struct fuse_conn *fc = get_fuse_conn(inode);
>  	unsigned int max_pages, nr_pages;
> -	pgoff_t first = readahead_index(rac);
> -	pgoff_t last = first + readahead_count(rac) - 1;
>  
>  	if (fuse_is_bad(inode))
>  		return;
>  
> -	wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last));
> -
>  	max_pages = min_t(unsigned int, fc->max_pages,
>  			  fc->max_read / PAGE_SIZE);
>  
> @@ -1172,7 +1082,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
>  	int err;
>  
>  	for (i = 0; i < ap->num_folios; i++)
> -		fuse_wait_on_folio_writeback(inode, ap->folios[i]);
> +		folio_wait_writeback(ap->folios[i]);
>  
>  	fuse_write_args_fill(ia, ff, pos, count);
>  	ia->write.in.flags = fuse_write_flags(iocb);
> @@ -1622,7 +1532,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>  			return res;
>  		}
>  	}
> -	if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) {
> +	if (!cuse && filemap_range_has_writeback(mapping, pos, (pos + count - 1))) {
>  		if (!write)
>  			inode_lock(inode);
>  		fuse_sync_writes(inode);
> @@ -1825,7 +1735,7 @@ static void fuse_writepage_free(struct fuse_writepage_args *wpa)
>  		fuse_sync_bucket_dec(wpa->bucket);
>  
>  	for (i = 0; i < ap->num_folios; i++)
> -		folio_put(ap->folios[i]);
> +		folio_end_writeback(ap->folios[i]);

I noticed that if we call folio_end_writeback() in fuse_writepage_finish()
(rather than in fuse_writepage_free()), there's a ~50% buffer write
bandwidth gain (5500 MB/s -> 8500 MB/s) [*].

The fuse server is generally implemented in a multi-threaded style, so
multiple (fuse server) worker threads may fetch and process FUSE_WRITE
requests of one fuse inode.  There is then serious contention on the
xarray lock (of the address space) when these worker threads call
fuse_writepage_end -> folio_end_writeback while sending replies to
FUSE_WRITE requests.  The contention is greatly alleviated when
folio_end_writeback() is serialized under fi->lock.

IOW, in the current implementation (folio_end_writeback() in
fuse_writepage_free()), each worker thread has to compete for the xarray
lock up to 256 times when completing one FUSE_WRITE request (one fuse
request can contain at most 256 pages when FUSE_MAX_MAX_PAGES is 256).
After moving folio_end_writeback() into fuse_writepage_finish(), each
worker thread competes for fi->lock only once, i.e. the locking
granularity is larger now.

> @@ -2367,54 +2111,23 @@ static int fuse_writepages_fill(struct folio *folio,
>  		data->wpa = NULL;
>  	}
>  
> -	err = -ENOMEM;
> -	tmp_folio = folio_alloc(GFP_NOFS | __GFP_HIGHMEM, 0);
> -	if (!tmp_folio)
> -		goto out_unlock;
> -
> -	/*
> -	 * The page must not be redirtied until the writeout is completed
> -	 * (i.e. userspace has sent a reply to the write request). Otherwise
> -	 * there could be more than one temporary page instance for each real
> -	 * page.
> -	 *
> -	 * This is ensured by holding the page lock in page_mkwrite() while
> -	 * checking fuse_page_is_writeback(). We already hold the page lock
> -	 * since clear_page_dirty_for_io() and keep it held until we add the
> -	 * request to the fi->writepages list and increment ap->num_folios.
> -	 * After this fuse_page_is_writeback() will indicate that the page is
> -	 * under writeback, so we can release the page lock.
> -	 */
>  	if (data->wpa == NULL) {
>  		err = -ENOMEM;
>  		wpa = fuse_writepage_args_setup(folio, data->ff);
> -		if (!wpa) {
> -			folio_put(tmp_folio);
> +		if (!wpa)
>  			goto out_unlock;
> -		}
>  		fuse_file_get(wpa->ia.ff);
>  		data->max_folios = 1;
>  		ap = &wpa->ia.ap;
>  	}
>  	folio_start_writeback(folio);

There is also contention on the xarray lock when calling
folio_start_writeback().

I also noticed a strange thing: if we lock fi->lock and unlock it
immediately, the write bandwidth improves by ~5% (8500 MB/s -> 9000 MB/s).
The place where the "lock fi->lock and unlock" is inserted doesn't
actually matter.  "perf lock contention" shows that the contention on the
xarray lock is greatly alleviated, though I can't quite understand why.
As the performance gain is not significant (~5%), I think we can leave
this strange phenomenon aside for now.

[*] test case:

./passthrough_hp --bypass-rw 2 /tmp /mnt
(testbench mode in
https://github.com/libfuse/libfuse/pull/807/commits/e83789cc6e83ca42ccc9899c4f7f8c69f31cbff9
bypassing the buffer copy along with the persistence procedure)

fio -fallocate=0 -numjobs=32 -iodepth=1 -ioengine=sync -sync=0 --direct=0
-rw=write -bs=1M -size=100G --time_based --runtime=300 -directory=/mnt/
--group_reporting --name=Fio

-- 
Thanks,
Jingbo