From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 1 May 2024 18:11:13 +1000
From: Dave Chinner <david@fromorbit.com>
To: Zhang Yi
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, tytso@mit.edu,
	adilger.kernel@dilger.ca, jack@suse.cz, ritesh.list@gmail.com,
	hch@infradead.org, djwong@kernel.org, willy@infradead.org,
	zokeefe@google.com, yi.zhang@huawei.com, chengzhihao1@huawei.com,
	yukuai3@huawei.com, wangkefeng.wang@huawei.com
Subject: Re: [RFC PATCH v4 24/34] ext4: implement buffered write iomap path
References: <20240410142948.2817554-1-yi.zhang@huaweicloud.com>
	<20240410142948.2817554-25-yi.zhang@huaweicloud.com>
In-Reply-To: <20240410142948.2817554-25-yi.zhang@huaweicloud.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Wed, Apr 10, 2024 at 10:29:38PM +0800, Zhang Yi wrote:
> From: Zhang Yi
> 
> Implement the buffered write iomap path: use ext4_da_map_blocks() to
> map delalloc extents, and add ext4_iomap_get_blocks() to allocate
> blocks if delalloc is disabled or free space is about to run out.
> 
> Note that we always allocate unwritten extents for new blocks in the
> iomap write path, which means the allocation type is no longer
> controlled by the dioread_nolock mount option. After that, we can
> postpone the i_disksize update to the writeback path and drop the
> journal handle in the buffered delalloc write path completely.
> 
> Signed-off-by: Zhang Yi
> ---
>  fs/ext4/ext4.h  |   3 +
>  fs/ext4/file.c  |  19 +++++-
>  fs/ext4/inode.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 183 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 05949a8136ae..2bd543c43341 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2970,6 +2970,7 @@ int ext4_walk_page_buffers(handle_t *handle,
>  				     struct buffer_head *bh));
>  int do_journal_get_write_access(handle_t *handle, struct inode *inode,
>  				struct buffer_head *bh);
> +int ext4_nonda_switch(struct super_block *sb);
>  #define FALL_BACK_TO_NONDELALLOC 1
>  #define CONVERT_INLINE_DATA	 2
>  
> @@ -3827,6 +3828,8 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
>  extern const struct iomap_ops ext4_iomap_ops;
>  extern const struct iomap_ops ext4_iomap_overwrite_ops;
>  extern const struct iomap_ops ext4_iomap_report_ops;
> +extern const struct iomap_ops ext4_iomap_buffered_write_ops;
> +extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
>  
>  static inline int ext4_buffer_uptodate(struct buffer_head *bh)
>  {
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 54d6ff22585c..52f37c49572a 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -282,6 +282,20 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  	return count;
>  }
>  
> +static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
> +					 struct iov_iter *from)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	const struct iomap_ops *iomap_ops;
> +
> +	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
> +		iomap_ops = &ext4_iomap_buffered_da_write_ops;
> +	else
> +		iomap_ops = &ext4_iomap_buffered_write_ops;
> +
> +	return iomap_file_buffered_write(iocb, from, iomap_ops);
> +}
> +
>  static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>  					struct iov_iter *from)
>  {
> @@ -296,7 +310,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>  	if (ret <= 0)
>  		goto out;
>  
> -	ret = generic_perform_write(iocb, from);
> +	if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP))
> +		ret = ext4_iomap_buffered_write(iocb, from);
> +	else
> +		ret = generic_perform_write(iocb, from);
>  
>  out:
>  	inode_unlock(inode);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 20eb772f4f62..e825ed16fd60 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2857,7 +2857,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
>  	return ret;
>  }
>  
> -static int ext4_nonda_switch(struct super_block *sb)
> +int ext4_nonda_switch(struct super_block *sb)
>  {
>  	s64 free_clusters, dirty_clusters;
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> @@ -3254,6 +3254,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
>  	return inode->i_state & I_DIRTY_DATASYNC;
>  }
>  
> +static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
> +{
> +	return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
> +}
> +
> +static const struct iomap_folio_ops ext4_iomap_folio_ops = {
> +	.iomap_valid = ext4_iomap_valid,
> +};
> +
>  static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>  			   struct ext4_map_blocks *map, loff_t offset,
>  			   loff_t length, unsigned int flags)
> @@ -3284,6 +3293,9 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>  	    !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>  		iomap->flags |= IOMAP_F_MERGED;
>  
> +	iomap->validity_cookie = READ_ONCE(EXT4_I(inode)->i_es_seq);
> +	iomap->folio_ops = &ext4_iomap_folio_ops;
> +
>  	/*
>  	 * Flags passed to ext4_map_blocks() for direct I/O writes can result
>  	 * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
> @@ -3523,11 +3535,42 @@ const struct iomap_ops ext4_iomap_report_ops = {
>  	.iomap_begin = ext4_iomap_begin_report,
>  };
>  
> -static int ext4_iomap_buffered_io_begin(struct inode *inode, loff_t offset,
> +static int ext4_iomap_get_blocks(struct inode *inode,
> +				 struct ext4_map_blocks *map)
> +{
> +	handle_t *handle;
> +	int ret, needed_blocks;
> +
> +	/*
> +	 * Reserve one extra block for addition to the orphan list in
> +	 * case we allocate blocks but the write fails for some reason.
> +	 */
> +	needed_blocks = ext4_writepage_trans_blocks(inode) + 1;
> +	handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
> +
> +	ret = ext4_map_blocks(handle, inode, map,
> +			      EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
> +	/*
> +	 * Have to stop the journal here since there is a potential
> +	 * deadlock caused by the later balance_dirty_pages(): it might
> +	 * wait on dirty pages to be written back, which might start
> +	 * another handle and wait for this handle to stop.
> +	 */
> +	ext4_journal_stop(handle);
> +
> +	return ret;
> +}
> +
> +#define IOMAP_F_EXT4_DELALLOC		IOMAP_F_PRIVATE
> +
> +static int __ext4_iomap_buffered_io_begin(struct inode *inode, loff_t offset,
>  		loff_t length, unsigned int iomap_flags,
> -		struct iomap *iomap, struct iomap *srcmap)
> +		struct iomap *iomap, struct iomap *srcmap,
> +		bool delalloc)
>  {
> -	int ret;
> +	int ret, retries = 0;
>  	struct ext4_map_blocks map;
>  	u8 blkbits = inode->i_blkbits;
>  
> @@ -3537,20 +3580,133 @@ static int ext4_iomap_buffered_io_begin(struct inode *inode, loff_t offset,
>  		return -EINVAL;
>  	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
>  		return -ERANGE;
> -
> +retry:
>  	/* Calculate the first and last logical blocks respectively. */
>  	map.m_lblk = offset >> blkbits;
>  	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>  			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> +	if (iomap_flags & IOMAP_WRITE) {
> +		if (delalloc)
> +			ret = ext4_da_map_blocks(inode, &map);
> +		else
> +			ret = ext4_iomap_get_blocks(inode, &map);
>  
> -	ret = ext4_map_blocks(NULL, inode, &map, 0);
> +		if (ret == -ENOSPC &&
> +		    ext4_should_retry_alloc(inode->i_sb, &retries))
> +			goto retry;
> +	} else {
> +		ret = ext4_map_blocks(NULL, inode, &map, 0);
> +	}
>  	if (ret < 0)
>  		return ret;
>  
>  	ext4_set_iomap(inode, iomap, &map, offset, length, iomap_flags);
> +	if (delalloc)
> +		iomap->flags |= IOMAP_F_EXT4_DELALLOC;
> +
> +	return 0;
> +}

Why are you implementing both the read and write mapping paths in the
one function? The whole point of having separate ops vectors for read
and write is that it allows a clean separation of the read and write
mapping operations. i.e. there is no need to use "if (write) else
{do read}" code constructs at all. You can even have a different
delalloc mapping function so you don't need "if (delalloc) else
{do nonda}" branches everywhere...
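Completely untested sketch of what I mean, reusing the helpers already
in this patch and leaving out the ENOSPC retry loop and sanity checks
for brevity:

static int ext4_iomap_buffered_read_begin(struct inode *inode,
		loff_t offset, loff_t length, unsigned int flags,
		struct iomap *iomap, struct iomap *srcmap)
{
	struct ext4_map_blocks map;
	u8 blkbits = inode->i_blkbits;
	int ret;

	/* The read side only ever looks up existing mappings. */
	map.m_lblk = offset >> blkbits;
	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
	ret = ext4_map_blocks(NULL, inode, &map, 0);
	if (ret < 0)
		return ret;

	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
	return 0;
}

static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
		loff_t offset, loff_t length, unsigned int flags,
		struct iomap *iomap, struct iomap *srcmap)
{
	struct ext4_map_blocks map;
	u8 blkbits = inode->i_blkbits;
	int ret;

	/* The delalloc write side only reserves blocks, no journalling. */
	map.m_lblk = offset >> blkbits;
	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
	ret = ext4_da_map_blocks(inode, &map);
	if (ret < 0)
		return ret;

	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
	return 0;
}

Then the nonda write path is the only variant that needs the
transaction setup and retry logic, and none of the three mapping
functions needs a mode flag.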
> +
> +static inline int ext4_iomap_buffered_io_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap)
> +{
> +	return __ext4_iomap_buffered_io_begin(inode, offset, length, flags,
> +					      iomap, srcmap, false);
> +}
> +
> +static inline int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> +		loff_t offset, loff_t length, unsigned int flags,
> +		struct iomap *iomap, struct iomap *srcmap)
> +{
> +	return __ext4_iomap_buffered_io_begin(inode, offset, length, flags,
> +					      iomap, srcmap, true);
> +}
> +
> +/*
> + * Drop the stale delayed allocation range from the write failure,
> + * including both start and end blocks. If not, we could leave a range
> + * of delayed extents covered by a clean folio, which could lead to
> + * inaccurate space reservation.
> + */
> +static int ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> +				     loff_t length)
> +{
> +	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> +			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
>  	return 0;
>  }
>  
> +static int ext4_iomap_buffered_write_end(struct inode *inode, loff_t offset,
> +					 loff_t length, ssize_t written,
> +					 unsigned int flags,
> +					 struct iomap *iomap)
> +{
> +	handle_t *handle;
> +	loff_t end;
> +	int ret = 0, ret2;
> +
> +	/* delalloc */
> +	if (iomap->flags & IOMAP_F_EXT4_DELALLOC) {
> +		ret = iomap_file_buffered_write_punch_delalloc(inode, iomap,
> +			offset, length, written, ext4_iomap_punch_delalloc);
> +		if (ret)
> +			ext4_warning(inode->i_sb,
> +				"Failed to clean up delalloc for inode %lu, %d",
> +				inode->i_ino, ret);
> +		return ret;
> +	}

Why are you creating a delalloc extent for the write operation and
then immediately deleting it from the extent tree once the write
operation is done?

Delayed allocation is a state that persists for that file range until
the data is written back, right? But this appears to return the range
to a hole in the extent state tree. What happens when we do a second
write to this same range? Will it do another delayed allocation
because there are no extents over this range?

Also, why do you need IOMAP_F_EXT4_DELALLOC? Isn't a delalloc iomap
set up with iomap->type = IOMAP_DELALLOC? Why can't that be used?
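Completely untested, but assuming ext4_set_iomap() tags delalloc
mappings as IOMAP_DELALLOC, I'd expect the ->iomap_end side to be able
to check the mapping type directly and drop the private flag entirely:

static int ext4_iomap_buffered_write_end(struct inode *inode, loff_t offset,
		loff_t length, ssize_t written, unsigned int flags,
		struct iomap *iomap)
{
	int ret;

	/* Nothing to punch out unless we mapped a delalloc extent. */
	if (iomap->type != IOMAP_DELALLOC)
		return 0;

	ret = iomap_file_buffered_write_punch_delalloc(inode, iomap,
			offset, length, written, ext4_iomap_punch_delalloc);
	if (ret)
		ext4_warning(inode->i_sb,
			     "Failed to clean up delalloc for inode %lu, %d",
			     inode->i_ino, ret);
	return ret;
}

-Dave.
-- 
Dave Chinner
david@fromorbit.com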