From: "Darrick J. Wong"
To: Zhang Yi
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org, tytso@mit.edu,
    adilger.kernel@dilger.ca, jack@suse.cz, ritesh.list@gmail.com,
    hch@infradead.org, willy@infradead.org, zokeefe@google.com,
    yi.zhang@huawei.com, chengzhihao1@huawei.com, yukuai3@huawei.com,
    wangkefeng.wang@huawei.com
Subject: Re: [RFC PATCH v3 00/26] ext4: use iomap for regular file's buffered IO path and enable large foilo
Date: Sun, 11 Feb 2024 22:18:42 -0800
Message-ID: <20240212061842.GB6180@frogsfrogsfrogs>
In-Reply-To: <20240127015825.1608160-1-yi.zhang@huaweicloud.com>
References: <20240127015825.1608160-1-yi.zhang@huaweicloud.com>

On Sat, Jan 27, 2024 at 09:57:59AM +0800, Zhang Yi wrote:
> From: Zhang Yi
>
> Hello,
>
> This is the third version of the RFC patch series that converts ext4
> regular file's buffered IO path to iomap and enables large folio. It's
> rebased on 6.7 and Christoph's "map multiple blocks per ->map_blocks in
> iomap writeback" series [1]. I've fixed all issues found in v2 over the
> last 3 weeks or so of stress tests and fault injection tests. I hope
> I've covered most of the corner cases, and any comments are welcome. :)
>
> Changes since v2:
>  - Update patch 1-6 to v3 [2].
>  - iomap_zero and iomap_unshare don't need to update i_size or call
>    iomap_write_failed(); introduce a new helper iomap_write_end_simple()
>    to avoid doing that.
>  - Factor out ext4_[ext|ind]_map_blocks() parts from ext4_map_blocks(),
>    introduce a new helper ext4_iomap_map_one_extent() to allocate
>    delalloc blocks in writeback, which is always under i_data_sem in
>    write mode. This prevents the delalloc extents being written back
>    from becoming stale if writeback races with truncate.
>  - Add a lock detection in mapping_clear_large_folios().
> Changes since v1:
>  - Introduce seq count for iomap buffered write and writeback to protect
>    against races from extent changes, e.g. truncate, mwrite.
>  - Always allocate unwritten extents for new blocks, drop dioread_lock
>    mode, and make no distinction between dioread_lock and
>    dioread_nolock.
>  - Don't add dirty data range to jinode, drop data=ordered mode, and
>    make no distinction between data=ordered and data=writeback mode.
>  - Postpone updating i_disksize to endio.
>  - Allow splitting extents and use reserved space in endio.
>  - Instead of reimplementing a new delayed mapping helper
>    ext4_iomap_da_map_blocks() for buffered write, try to reuse
>    ext4_da_map_blocks().
>  - Add support for disabling large folio on active inodes.
>  - Support online defragmentation, make file fall back to buffer_head
>    and disable large folio in ext4_move_extents().
>  - Move ext4_nonda_switch() in advance to prevent deadlock in mwrite.
>  - Add dirty_len and pos trace info to trace_iomap_writepage_map().
>  - Update patch 1-6 to v2.
>
> This series only supports ext4 with the default features and mount
> options; it doesn't support inline_data, bigalloc, dax, fs_verity,
> fs_crypt and data=journal mode, ext4 would fall back to buffer_head path

Do you plan to add bigalloc or !extents support as a part 2 patchset?

An ext2 port to iomap has been (vaguely) in the works for a while,
though iirc willy never got the performance to match because iomap
didn't have a mechanism for the caller to tell it "run the IO now even
though you don't have a complete page, because the indirect block is the
next block after the 11th block".

--D

> automatically if you enabled these features/options. Although it has
> many limitations now, it can satisfy the requirements of common cases
> and bring a great performance benefit.
>
> Patch 1-6: this is a preparation series, it changes ext4_map_blocks()
> and ext4_set_iomap() to recognize delayed only extents, I've sent it out
> separately [2].
>
> Patch 7-8: these are two minor iomap changes. The first one is to not
> update i_size and not call iomap_write_failed() in zero_range; the
> second one is for debugging in the iomap writeback path, which I've
> discussed with Christoph [3].
>
> Patch 9-15: this is another preparation series, including some changes
> for delayed extents. Firstly, it factors out buffer_head from
> ext4_da_map_blocks() and makes it support adding multiple blocks at a
> time. Then it makes the unwritten to written extents conversion in endio
> use reserved space, reducing the risk of potential data loss. Finally,
> it introduces a sequence counter for the extent status tree, which is
> useful for iomap buffered write and writeback.
>
> Patch 16-22: Implement buffered IO iomap path for read, write, mmap,
> zero range, truncate and writeback, replacing the current buffer_head
> path. Please look at the following patches for details.
>
> Patch 23-26: Convert to iomap for regular file's buffered IO path,
> except for inline_data, bigalloc, dax, fs_verity, fs_crypt, and
> data=journal mode, and enable large folio. It should be noted that the
> buffered iomap path doesn't support online defrag yet, so we need to
> fall back to buffer_head and disable large folio automatically if the
> user calls EXT4_IOC_MOVE_EXT.
>
> About Tests:
>  - kvm-xfstests in auto mode, and about 3 weeks of stress tests and
>    fault injection tests.
>  - Performance tests below.
>
> Fio tests with psync on my machine with Intel Xeon Gold 6240 CPU
> with 400GB system ram, 200GB ramdisk and 1TB nvme ssd disk.
>
>  == buffer read ==
>
>                  buffer head          iomap with large folio
>  type     bs     IOPS    BW(MiB/s)    IOPS    BW(MiB/s)
>  ----------------------------------------------------
>  hole     4K     565k    2206         811k    3167
>  hole     64K    45.1k   2820         78.1k   4879
>  hole     1M     2744    2744         4890    4891
>  ramdisk  4K     436k    1703         554k    2163
>  ramdisk  64K    29.6k   1848         44.0k   2747
>  ramdisk  1M     1994    1995         2809    2809
>  nvme     4K     306k    1196         324k    1267
>  nvme     64K    19.3k   1208         24.3k   1517
>  nvme     1M     1694    1694         2256    2256
>
>  == buffer write ==
>
>                                             buffer head    ext4_iomap
>  type     Overwrite  Sync  Writeback  bs    IOPS    BW     IOPS    BW
>  -------------------------------------------------------------
>  cache    N          N     N          4K    395k    1544   415k    1621
>  cache    N          N     N          64K   30.8k   1928   80.1k   5005
>  cache    N          N     N          1M    1963    1963   5641    5642
>  cache    Y          N     N          4K    423k    1652   443k    1730
>  cache    Y          N     N          64K   33.0k   2063   80.8k   5051
>  cache    Y          N     N          1M    2103    2103   5588    5589
>  ramdisk  N          N     Y          4K    362k    1416   307k    1198
>  ramdisk  N          N     Y          64K   22.4k   1399   64.8k   4050
>  ramdisk  N          N     Y          1M    1670    1670   4559    4560
>  ramdisk  N          Y     N          4K    9830    38.4   13.5k   52.8
>  ramdisk  N          Y     N          64K   5834    365    10.1k   629
>  ramdisk  N          Y     N          1M    1011    1011   2064    2064
>  ramdisk  Y          N     Y          4K    397k    1550   409k    1598
>  ramdisk  Y          N     Y          64K   29.2k   1827   73.6k   4597
>  ramdisk  Y          N     Y          1M    1837    1837   4985    4985
>  ramdisk  Y          Y     N          4K    173k    675    182k    710
>  ramdisk  Y          Y     N          64K   17.7k   1109   33.7k   2105
>  ramdisk  Y          Y     N          1M    1128    1129   1790    1791
>  nvme     N          N     Y          4K    298k    1164   290k    1134
>  nvme     N          N     Y          64K   21.5k   1343   57.4k   3590
>  nvme     N          N     Y          1M    1308    1308   3664    3664
>  nvme     N          Y     N          4K    10.7k   41.8   12.0k   46.9
>  nvme     N          Y     N          64K   5962    373    8598    537
>  nvme     N          Y     N          1M    676     677    1417    1418
>  nvme     Y          N     Y          4K    366k    1430   373k    1456
>  nvme     Y          N     Y          64K   26.7k   1670   56.8k   3547
>  nvme     Y          N     Y          1M    1745    1746   3586    3586
>  nvme     Y          Y     N          4K    59.0k   230    61.2k   239
>  nvme     Y          Y     N          64K   13.0k   813    21.0k   1311
>  nvme     Y          Y     N          1M    683     683    1368    1369
>
> TODO
>  - Keep on doing stress tests and fixing.
>  - I will rebase and resend my another patch set "ext4: more accurate
>    metadata reservaion for delalloc mount option[4]" later, it's useful
>    for iomap conversion. After this series, I suppose we could totally
>    drop ext4_nonda_switch() and prevent the risk of data loss caused by
>    extents splitting.
>  - Support for more features and mount options in the future.
>
> [1] https://lore.kernel.org/linux-fsdevel/20231207072710.176093-1-hch@lst.de/
> [2] https://lore.kernel.org/linux-ext4/20240105033018.1665752-1-yi.zhang@huaweicloud.com/
> [3] https://lore.kernel.org/linux-fsdevel/20231207150311.GA18830@lst.de/
> [4] https://lore.kernel.org/linux-ext4/20230824092619.1327976-1-yi.zhang@huaweicloud.com/
>
> Thanks,
> Yi.
>
> ---
> v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
> v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/
>
> Zhang Yi (26):
>   ext4: refactor ext4_da_map_blocks()
>   ext4: convert to exclusive lock while inserting delalloc extents
>   ext4: correct the hole length returned by ext4_map_blocks()
>   ext4: add a hole extent entry in cache after punch
>   ext4: make ext4_map_blocks() distinguish delalloc only extent
>   ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map type
>   iomap: don't increase i_size if it's not a write operation
>   iomap: add pos and dirty_len into trace_iomap_writepage_map
>   ext4: allow inserting delalloc extents with multi-blocks
>   ext4: correct delalloc extent length
>   ext4: also mark extent as delalloc if it's been unwritten
>   ext4: factor out bh handles to ext4_da_get_block_prep()
>   ext4: use reserved metadata blocks when splitting extent in endio
>   ext4: factor out ext4_map_{create|query}_blocks()
>   ext4: introduce seq counter for extent entry
>   ext4: add a new iomap aops for regular file's buffered IO path
>   ext4: implement buffered read iomap path
>   ext4: implement buffered write iomap path
>   ext4: implement writeback iomap path
>   ext4: implement mmap iomap path
>   ext4: implement zero_range iomap path
>   ext4: writeback partial blocks before zero range
>   ext4: fall back to buffer_head path for defrag
>   ext4: partially enable iomap for regular file's buffered IO path
>   filemap: support disable large folios on active inode
>   ext4: enable large folio for regular file with iomap buffered IO path
>
>  fs/ext4/ext4.h              |  14 +-
>  fs/ext4/ext4_jbd2.c         |   6 +
>  fs/ext4/ext4_jbd2.h         |   7 +
>  fs/ext4/extents.c           | 149 +++---
>  fs/ext4/extents_status.c    |  39 +-
>  fs/ext4/extents_status.h    |   4 +-
>  fs/ext4/file.c              |  19 +-
>  fs/ext4/ialloc.c            |   5 +
>  fs/ext4/inode.c             | 891 +++++++++++++++++++++++++++--------
>  fs/ext4/move_extent.c       |  35 ++
>  fs/ext4/page-io.c           | 107 +++++
>  fs/ext4/super.c             |   3 +
>  fs/iomap/buffered-io.c      |  30 +-
>  fs/iomap/trace.h            |  43 +-
>  include/linux/pagemap.h     |  14 +
>  include/trace/events/ext4.h |  31 +-
>  mm/readahead.c              |   6 +-
>  17 files changed, 1109 insertions(+), 294 deletions(-)
>
> --
> 2.39.2
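
For readers unfamiliar with the !extents layout behind the indirect-block
remark in the reply above, here is a rough sketch. It assumes the classic
ext2 block map: 12 direct pointers in the inode, followed by one single-,
one double- and one triple-indirect pointer, with 32-bit block numbers.
The names NDIR_BLOCKS, enum map_level and map_level_for() are made up for
this illustration and are not kernel symbols. Under those assumptions,
logical blocks 0-11 are mapped straight from the inode, and block 12 is
the first one whose mapping requires the indirect block to be read first.

```c
#include <stdint.h>

/* Assumed classic ext2 layout: 12 direct block pointers per inode. */
#define NDIR_BLOCKS 12u

/* How many levels of indirection are needed to map a logical block. */
enum map_level {
	MAP_DIRECT = 0,
	MAP_INDIRECT = 1,
	MAP_DOUBLE = 2,
	MAP_TRIPLE = 3,
	MAP_OUT_OF_RANGE = 4,
};

/*
 * Hypothetical helper: classify logical block 'lblk' for a filesystem
 * with the given block size. Each indirect block is assumed to hold
 * block_size / 4 pointers (32-bit block numbers).
 */
static enum map_level map_level_for(uint64_t lblk, uint32_t block_size)
{
	uint64_t ptrs = block_size / sizeof(uint32_t);

	if (lblk < NDIR_BLOCKS)
		return MAP_DIRECT;
	lblk -= NDIR_BLOCKS;

	if (lblk < ptrs)
		return MAP_INDIRECT;
	lblk -= ptrs;

	if (lblk < ptrs * ptrs)
		return MAP_DOUBLE;
	lblk -= ptrs * ptrs;

	if (lblk < ptrs * ptrs * ptrs)
		return MAP_TRIPLE;

	return MAP_OUT_OF_RANGE;
}
```

With 4K blocks, map_level_for(11, 4096) is MAP_DIRECT and
map_level_for(12, 4096) is MAP_INDIRECT, so a sequential read crosses
from "mapping already known" to "must read a metadata block first"
exactly at that boundary. That is presumably the point at which a caller
would want to tell iomap to submit the I/O it has built up so far.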