From: "Darrick J. Wong"
Date: Mon, 13 Jan 2025 08:39:20 -0800
To: Alistair Popple
Cc: akpm@linux-foundation.org, dan.j.williams@intel.com, linux-mm@kvack.org,
	alison.schofield@intel.com, lina@asahilina.net, zhang.lyra@gmail.com,
	gerald.schaefer@linux.ibm.com, vishal.l.verma@intel.com,
	dave.jiang@intel.com, logang@deltatee.com, bhelgaas@google.com,
	jack@suse.cz, jgg@ziepe.ca, catalin.marinas@arm.com, will@kernel.org,
	mpe@ellerman.id.au, npiggin@gmail.com, dave.hansen@linux.intel.com,
	ira.weiny@intel.com, willy@infradead.org, tytso@mit.edu,
	linmiaohe@huawei.com, david@redhat.com, peterx@redhat.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linuxppc-dev@lists.ozlabs.org,
	nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-xfs@vger.kernel.org, jhubbard@nvidia.com, hch@lst.de,
	david@fromorbit.com, chenhuacai@kernel.org, kernel@xen0n.name,
	loongarch@lists.linux.dev
Subject: Re: [PATCH v6 07/26] fs/dax: Ensure all pages are idle prior to filesystem unmount
Message-ID: <20250113163920.GE1306365@frogsfrogsfrogs>
References: <704662ae360abeb777ed00efc6f8f232a79ae4ff.1736488799.git-series.apopple@nvidia.com>
	<20250110165019.GK6156@frogsfrogsfrogs>
	<20250113024940.GW1306365@frogsfrogsfrogs>

On Mon, Jan 13, 2025 at 04:48:31PM +1100, Alistair Popple wrote:
> On Sun, Jan 12, 2025 at 06:49:40PM -0800, Darrick J. Wong wrote:
> > On Mon, Jan 13, 2025 at 11:57:18AM +1100, Alistair Popple wrote:
> > > On Fri, Jan 10, 2025 at 08:50:19AM -0800, Darrick J. Wong wrote:
> > > > On Fri, Jan 10, 2025 at 05:00:35PM +1100, Alistair Popple wrote:
> > > > > File systems call dax_break_mapping() prior to reallocating file
> > > > > system blocks to ensure the page is not undergoing any DMA or other
> > > > > accesses. Generally this is needed when a file is truncated to ensure
> > > > > that if a block is reallocated nothing is writing to it. However
> > > > > filesystems currently don't call this when an FS DAX inode is evicted.
> > > > >
> > > > > This can cause problems when the file system is unmounted as a page
> > > > > can continue to be under going DMA or other remote access after
> > > > > unmount. This means if the file system is remounted any truncate or
> > > > > other operation which requires the underlying file system block to be
> > > > > freed will not wait for the remote access to complete. Therefore a
> > > > > busy block may be reallocated to a new file leading to corruption.
> > > > >
> > > > > Signed-off-by: Alistair Popple
> > > > >
> > > > > ---
> > > > >
> > > > > Changes for v5:
> > > > >
> > > > >  - Don't wait for pages to be idle in non-DAX mappings
> > > > > ---
> > > > >  fs/dax.c            | 29 +++++++++++++++++++++++++++++
> > > > >  fs/ext4/inode.c     | 32 ++++++++++++++------------------
> > > > >  fs/xfs/xfs_inode.c  |  9 +++++++++
> > > > >  fs/xfs/xfs_inode.h  |  1 +
> > > > >  fs/xfs/xfs_super.c  | 18 ++++++++++++++++++
> > > > >  include/linux/dax.h |  2 ++
> > > > >  6 files changed, 73 insertions(+), 18 deletions(-)
> > > > >
> > > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > > index 7008a73..4e49cc4 100644
> > > > > --- a/fs/dax.c
> > > > > +++ b/fs/dax.c
> > > > > @@ -883,6 +883,14 @@ static int wait_page_idle(struct page *page,
> > > > >  				TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> > > > >  }
> > > > >
> > > > > +static void wait_page_idle_uninterruptible(struct page *page,
> > > > > +					void (cb)(struct inode *),
> > > > > +					struct inode *inode)
> > > > > +{
> > > > > +	___wait_var_event(page, page_ref_count(page) == 1,
> > > > > +			TASK_UNINTERRUPTIBLE, 0, 0, cb(inode));
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * Unmaps the inode and waits for any DMA to complete prior to deleting the
> > > > >   * DAX mapping entries for the range.
> > > > > @@ -911,6 +919,27 @@ int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(dax_break_mapping);
> > > > >
> > > > > +void dax_break_mapping_uninterruptible(struct inode *inode,
> > > > > +				void (cb)(struct inode *))
> > > > > +{
> > > > > +	struct page *page;
> > > > > +
> > > > > +	if (!dax_mapping(inode->i_mapping))
> > > > > +		return;
> > > > > +
> > > > > +	do {
> > > > > +		page = dax_layout_busy_page_range(inode->i_mapping, 0,
> > > > > +						LLONG_MAX);
> > > > > +		if (!page)
> > > > > +			break;
> > > > > +
> > > > > +		wait_page_idle_uninterruptible(page, cb, inode);
> > > > > +	} while (true);
> > > > > +
> > > > > +	dax_delete_mapping_range(inode->i_mapping, 0, LLONG_MAX);
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(dax_break_mapping_uninterruptible);
> > > > > +
> > > > >  /*
> > > > >   * Invalidate DAX entry if it is clean.
> > > > >   */
> > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > > index ee8e83f..fa35161 100644
> > > > > --- a/fs/ext4/inode.c
> > > > > +++ b/fs/ext4/inode.c
> > > > > @@ -163,6 +163,18 @@ int ext4_inode_is_fast_symlink(struct inode *inode)
> > > > >  		(inode->i_size < EXT4_N_BLOCKS * 4);
> > > > >  }
> > > > >
> > > > > +static void ext4_wait_dax_page(struct inode *inode)
> > > > > +{
> > > > > +	filemap_invalidate_unlock(inode->i_mapping);
> > > > > +	schedule();
> > > > > +	filemap_invalidate_lock(inode->i_mapping);
> > > > > +}
> > > > > +
> > > > > +int ext4_break_layouts(struct inode *inode)
> > > > > +{
> > > > > +	return dax_break_mapping_inode(inode, ext4_wait_dax_page);
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * Called at the last iput() if i_nlink is zero.
> > > > >   */
> > > > > @@ -181,6 +193,8 @@ void ext4_evict_inode(struct inode *inode)
> > > > >
> > > > >  	trace_ext4_evict_inode(inode);
> > > > >
> > > > > +	dax_break_mapping_uninterruptible(inode, ext4_wait_dax_page);
> > > > > +
> > > > >  	if (EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL)
> > > > >  		ext4_evict_ea_inode(inode);
> > > > >  	if (inode->i_nlink) {
> > > > > @@ -3902,24 +3916,6 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
> > > > >  	return ret;
> > > > >  }
> > > > >
> > > > > -static void ext4_wait_dax_page(struct inode *inode)
> > > > > -{
> > > > > -	filemap_invalidate_unlock(inode->i_mapping);
> > > > > -	schedule();
> > > > > -	filemap_invalidate_lock(inode->i_mapping);
> > > > > -}
> > > > > -
> > > > > -int ext4_break_layouts(struct inode *inode)
> > > > > -{
> > > > > -	struct page *page;
> > > > > -	int error;
> > > > > -
> > > > > -	if (WARN_ON_ONCE(!rwsem_is_locked(&inode->i_mapping->invalidate_lock)))
> > > > > -		return -EINVAL;
> > > > > -
> > > > > -	return dax_break_mapping_inode(inode, ext4_wait_dax_page);
> > > > > -}
> > > > > -
> > > > >  /*
> > > > >   * ext4_punch_hole: punches a hole in a file by releasing the blocks
> > > > >   * associated with the given offset and length
> > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > > index 4410b42..c7ec5ab 100644
> > > > > --- a/fs/xfs/xfs_inode.c
> > > > > +++ b/fs/xfs/xfs_inode.c
> > > > > @@ -2997,6 +2997,15 @@ xfs_break_dax_layouts(
> > > > >  	return dax_break_mapping_inode(inode, xfs_wait_dax_page);
> > > > >  }
> > > > >
> > > > > +void
> > > > > +xfs_break_dax_layouts_uninterruptible(
> > > > > +	struct inode *inode)
> > > > > +{
> > > > > +	xfs_assert_ilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL);
> > > > > +
> > > > > +	dax_break_mapping_uninterruptible(inode, xfs_wait_dax_page);
> > > > > +}
> > > > > +
> > > > >  int
> > > > >  xfs_break_layouts(
> > > > >  	struct inode *inode,
> > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > > > > index c4f03f6..613797a 100644
> > > > > --- a/fs/xfs/xfs_inode.h
> > > > > +++ b/fs/xfs/xfs_inode.h
> > > > > @@ -594,6 +594,7 @@ xfs_itruncate_extents(
> > > > >  }
> > > > >
> > > > >  int xfs_break_dax_layouts(struct inode *inode);
> > > > > +void xfs_break_dax_layouts_uninterruptible(struct inode *inode);
> > > > >  int xfs_break_layouts(struct inode *inode, uint *iolock,
> > > > >  		enum layout_break_reason reason);
> > > > >
> > > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > > index 8524b9d..73ec060 100644
> > > > > --- a/fs/xfs/xfs_super.c
> > > > > +++ b/fs/xfs/xfs_super.c
> > > > > @@ -751,6 +751,23 @@ xfs_fs_drop_inode(
> > > > >  	return generic_drop_inode(inode);
> > > > >  }
> > > > >
> > > > > +STATIC void
> > > > > +xfs_fs_evict_inode(
> > > > > +	struct inode *inode)
> > > > > +{
> > > > > +	struct xfs_inode *ip = XFS_I(inode);
> > > > > +	uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > > > > +
> > > > > +	if (IS_DAX(inode)) {
> > > > > +		xfs_ilock(ip, iolock);
> > > > > +		xfs_break_dax_layouts_uninterruptible(inode);
> > > > > +		xfs_iunlock(ip, iolock);
> > > >
> > > > If we're evicting the inode, why is it necessary to take i_rwsem and the
> > > > mmap invalidation lock? Shouldn't the evicting thread be the only one
> > > > with access to this inode?
> > >
> > > Hmm, good point. I think you're right. I can easily stop taking
> > > XFS_IOLOCK_EXCL. Not taking XFS_MMAPLOCK_EXCL is slightly more difficult because
> > > xfs_wait_dax_page() expects it to be taken. Do you think it is worth creating a
> > > separate callback (xfs_wait_dax_page_unlocked()?) specifically for this path or
> > > would you be happy with a comment explaining why we take the XFS_MMAPLOCK_EXCL
> > > lock here?
> >
> > There shouldn't be any other threads removing "pages" from i_mapping
> > during eviction, right? If so, I think you can just call schedule()
> > directly from dax_break_mapping_uninterruptble.
>
> Oh right, and I guess you are saying the same would apply to ext4 so no need to
> cycle the filemap lock there either, which I've just noticed is buggy anyway. So
> I can just remove the callback entirely for dax_break_mapping_uninterruptible.

Right. You might want to rename dax_break_layouts_uninterruptible to
make it clearer that it's for evictions and doesn't go through the mmap
invalidation lock.

> > (dax mappings aren't allowed supposed to persist beyond unmount /
> > eviction, just like regular pagecache, right??)
>
> Right they're not *supposed* to, but until at least this patch is applied they
> can ;-)

Yikes!

--D

> - Alistair
>
> > --D
> >
> > > - Alistair
> > >
> > > > --D
> > > >
> > > > > +	}
> > > > > +
> > > > > +	truncate_inode_pages_final(&inode->i_data);
> > > > > +	clear_inode(inode);
> > > > > +}
> > > > > +
> > > > >  static void
> > > > >  xfs_mount_free(
> > > > >  	struct xfs_mount *mp)
> > > > > @@ -1189,6 +1206,7 @@ static const struct super_operations xfs_super_operations = {
> > > > >  	.destroy_inode = xfs_fs_destroy_inode,
> > > > >  	.dirty_inode = xfs_fs_dirty_inode,
> > > > >  	.drop_inode = xfs_fs_drop_inode,
> > > > > +	.evict_inode = xfs_fs_evict_inode,
> > > > >  	.put_super = xfs_fs_put_super,
> > > > >  	.sync_fs = xfs_fs_sync_fs,
> > > > >  	.freeze_fs = xfs_fs_freeze,
> > > > > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > > > > index ef9e02c..7c3773f 100644
> > > > > --- a/include/linux/dax.h
> > > > > +++ b/include/linux/dax.h
> > > > > @@ -274,6 +274,8 @@ static inline int __must_check dax_break_mapping_inode(struct inode *inode,
> > > > >  {
> > > > >  	return dax_break_mapping(inode, 0, LLONG_MAX, cb);
> > > > >  }
> > > > > +void dax_break_mapping_uninterruptible(struct inode *inode,
> > > > > +				void (cb)(struct inode *));
> > > > >  int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> > > > >  				struct inode *dest, loff_t destoff,
> > > > >  				loff_t len, bool *is_same,
> > > > > --
> > > > > git-series 0.9.1
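
For illustration only, and not part of the posted patch: a minimal
userspace sketch of the simplification discussed in this thread, where the
eviction path takes no callback and cycles no lock, and simply yields until
each busy page is idle. Every model_* name is hypothetical (a stand-in, not
a kernel API), and model_schedule() is an assumption that pretends a remote
user drops one page reference each time the evicting thread yields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: none of these types or helpers are kernel APIs. */
struct model_page {
	int refcount;		/* 1 == idle, >1 == a remote user (e.g. DMA) holds it */
};

struct model_mapping {
	struct model_page *pages;
	size_t nr_pages;
	bool is_dax;
	size_t nr_entries;	/* mapping entries still present */
};

/* Stand-in for dax_layout_busy_page_range(): first page with extra refs. */
static struct model_page *model_busy_page(struct model_mapping *m)
{
	for (size_t i = 0; i < m->nr_pages; i++)
		if (m->pages[i].refcount > 1)
			return &m->pages[i];
	return NULL;
}

/* Stand-in for schedule(): assume one remote reference is dropped per yield. */
static void model_schedule(struct model_page *page)
{
	assert(page->refcount > 1);
	page->refcount--;
}

/*
 * Eviction-time break: no callback and no lock cycling, because the
 * evicting thread is assumed to be the only user of the inode.
 * Returns how many times it had to yield.
 */
static unsigned int model_break_mapping_final(struct model_mapping *m)
{
	struct model_page *page;
	unsigned int waits = 0;

	if (!m->is_dax)
		return 0;	/* non-DAX mappings are left untouched */

	while ((page = model_busy_page(m)) != NULL) {
		while (page->refcount != 1) {
			model_schedule(page);
			waits++;
		}
	}

	m->nr_entries = 0;	/* models dax_delete_mapping_range() */
	return waits;
}
```

Calling model_break_mapping_final() on a mapping whose pages hold extra
references yields once per outstanding reference and only then drops the
mapping entries; a non-DAX mapping returns immediately with its entries
intact.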