linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Ming Lei <ming.lei@redhat.com>, Christoph Hellwig <hch@lst.de>,
	Theodore Ts'o <tytso@mit.edu>, Baokun Li <libaokun1@huawei.com>,
	Matthew Wilcox <willy@infradead.org>,
	linux-ext4@vger.kernel.org,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	linux-block@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	Dave Chinner <dchinner@redhat.com>,
	Eric Sandeen <sandeen@redhat.com>, Zhang Yi <yi.zhang@redhat.com>,
	yangerkun <yangerkun@huawei.com>
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages
Date: Tue, 2 May 2023 16:27:09 -0700	[thread overview]
Message-ID: <20230502232709.GD15420@frogsfrogsfrogs> (raw)
In-Reply-To: <20230502223323.GI2155823@dread.disaster.area>

On Wed, May 03, 2023 at 08:33:23AM +1000, Dave Chinner wrote:
> On Tue, May 02, 2023 at 08:35:16AM -0700, Darrick J. Wong wrote:
> > On Tue, May 02, 2023 at 11:35:57AM +1000, Dave Chinner wrote:
> > > On Tue, May 02, 2023 at 08:57:32AM +0800, Ming Lei wrote:
> > > > On Mon, May 01, 2023 at 06:47:44AM +0200, Christoph Hellwig wrote:
> > > > > On Sat, Apr 29, 2023 at 01:10:49PM +0800, Ming Lei wrote:
> > > > > > Not sure if it is needed for non s_bdev
> > > > > 
> > > > > So you don't want to work this at all for btrfs?  Or the XFS log device,
> > > > > or ..
> > > > 
> > > > Basically FS can provide one generic API of shutdown_filesystem() which
> > > > shutdown FS generically, meantime calls each fs's ->shutdown() for
> > > > dealing with fs specific shutdown.
> > > > 
> > > > If there isn't superblock attached for one bdev, can you explain a bit what
> > > > filesystem code can do? Same with block layer bdev.
> > > > 
> > > > The current bio->bi_status together disk_live()(maybe bdev_live() is
> > > > needed) should be enough for FS code to handle non s_bdev.
> > > 
> > > maybe necessary for btrfs, but not for XFS....
> > > > 
> > > > > 
> > > > > > , because FS is over stackable device
> > > > > > directly. Stackable device has its own logic for handling underlying disks dead
> > > > > > or deleted, then decide if its own disk needs to be deleted, such as, it is
> > > > > > fine for raid1 to work from user viewpoint if one underlying disk is deleted.
> > > > > 
> > > > > We still need to propagate the even that device has been removed upwards.
> > > > > Right now some file systems (especially XFS) are good at just propagating
> > > > > it from an I/O error.  And explicity call would be much better.
> > > > 
> > > > It depends on the above question about how FS code handle non s_bdev
> > > > deletion/dead.
> > > 
> > > as XFS doesn't treat the individual devices differently. A
> > > failure on an external log device is just as fatal as a failure on
> > > a single device filesystem with an internal log. ext4 is 
> > > going to consider external journal device removal as fatal, too.
> > > 
> > > As for removal of realtime devices on XFS, all the user data has
> > > gone away, so the filesystem will largely be useless for users and
> > > applications.  At this point, we'll probably want to shut down the
> > > filesystem because we've had an unknown amount of user data loss and
> > > so silently continuing on as if nothing happened is not the right
> > > thing to do.
> > > 
> > > So as long as we can attach the superblock to each block device that
> > > the filesystem opens (regardless of where sb->s_bdev points), device
> > > removal calling sb_force_shutdown(sb, SB_SHUTDOWN_DEVICE_DEAD) will
> > > do what we need. If we need anything different in future, then we
> > > can worry about how to do that in the future.
> > 
> > Shiyang spent a lot of time hooking up pmem failure notifications so
> > that xfs can kill processes that have pmem in their mapping.  I wonder
> > if we could reuse some of that infrastructure here? 
> 
> ISTR that the generic mechanism for "device failure ranges" (I think
> I called the mechanism ->corrupt_range()) that we came up with in
> the first instance for this functionality got shouted down by some
> block layer devs because they saw it as unnecessary complexity to
> push device range failure notifications through block devices up
> to the filesystem.
> 
> The whole point of starting from that point was so that any type of
> block device could report a failure to the filesystem and have the
> filesystem deal with it appropriately:
> 
> This is where we started:
> 
> https://lore.kernel.org/linux-xfs/20201215121414.253660-1-ruansy.fnst@cn.fujitsu.com/
> 
> ".....
> The call trace is like this:
> memory_failure()
>  pgmap->ops->memory_failure()      => pmem_pgmap_memory_failure()
>   gendisk->fops->corrupted_range() => - pmem_corrupted_range()
>                                       - md_blk_corrupted_range()
>    sb->s_ops->currupted_range()    => xfs_fs_corrupted_range()
>     xfs_rmap_query_range()
>      xfs_currupt_helper()
>       * corrupted on metadata
>           try to recover data, call xfs_force_shutdown()
>       * corrupted on file data 
>           try to recover data, call mf_dax_mapping_kill_procs()
> ...."

<nod> I dug up
https://lore.kernel.org/linux-xfs/20210108095614.GB5647@lst.de/

which I interpreted as Christoph asking Shiyang not to make the dax
device code go swerving through the block layer to call
->corrupted_range, since he was trying to separate the two entirely.
I didn't think he was shutting down the idea of block devices being able
to call ->corrupted_range to tell the filesystem that the user's $2
NVME<->STL<->USB bridge caught on fire.

> > That MF_MEM_REMOVE
> > patchset he's been trying to get us to merge would be a good starting
> > point for building something similar for block devices.  AFAICT it does
> > the right thing if you hand it a subrange of the dax device or if you
> > pass it the customary (0, -1ULL) to mean "the entire device".
> 
> *nod*
> 
> That was exactly how I originally envisiaged that whole "bad device
> range" stack being used.
> 
> > The block device version of that could be a lot simpler-- imagine if
> > "echo 0 > /sys/block/fd0/device/delete" resulted in the block layer
> > first sending us a notification that the device is about to be removed.
> > We could then flush the fs and try to freeze it.  After the device
> > actually goes away, the blocy layer would send us a second notification
> > about DEVICE_DEAD and we could shut down the incore filesystem objects.
> 
> *nod*
> 
> But seeing this mechanism has already been shot down by the block
> layer devs, let's be a little less ambitious and just start with
> a simple, pre-existing "kill the filesystem" mechanism. Once we've
> got that in place and working, we can then expand on the error
> handling mechanism to perform notification of on more fine-grained
> storage  errors...

<shrug> Seeing as LSF is next week, I'll ask the room about this when
I'm there.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


  reply	other threads:[~2023-05-02 23:27 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-27  2:20 Ming Lei
2023-04-27  3:58 ` Matthew Wilcox
2023-04-27  4:50   ` Ming Lei
2023-04-27  6:36     ` Baokun Li
2023-04-27  7:33       ` Baokun Li
2023-04-27 10:01       ` Ming Lei
2023-04-27 11:19         ` Baokun Li
2023-04-27 11:27           ` Ming Lei
2023-04-28  1:41             ` Ming Lei
2023-04-28  3:47               ` Baokun Li
2023-04-28  5:47                 ` Theodore Ts'o
2023-04-29  3:16                   ` Ming Lei
2023-04-29  4:40                     ` Christoph Hellwig
2023-04-29  5:10                       ` Ming Lei
2023-05-01  4:47                         ` Christoph Hellwig
2023-05-02  0:57                           ` Ming Lei
2023-05-02  1:35                             ` Dave Chinner
2023-05-02 15:35                               ` Darrick J. Wong
2023-05-02 22:33                                 ` Dave Chinner
2023-05-02 23:27                                   ` Darrick J. Wong [this message]
2023-04-29  4:56                     ` Theodore Ts'o
2023-05-01  2:06                       ` Dave Chinner
2023-05-04  3:09                   ` Baokun Li
2023-04-27 23:33 ` Dave Chinner
2023-04-28  2:56   ` Matthew Wilcox
2023-04-28  5:24     ` Dave Chinner
2023-05-04 15:59 ` Keith Busch
2023-05-04 16:21   ` Matthew Wilcox
2023-05-05  2:06   ` Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230502232709.GD15420@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=adilger.kernel@dilger.ca \
    --cc=akpm@linux-foundation.org \
    --cc=david@fromorbit.com \
    --cc=dchinner@redhat.com \
    --cc=hch@lst.de \
    --cc=libaokun1@huawei.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ming.lei@redhat.com \
    --cc=sandeen@redhat.com \
    --cc=tytso@mit.edu \
    --cc=willy@infradead.org \
    --cc=yangerkun@huawei.com \
    --cc=yi.zhang@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox