From: Dan Williams <dan.j.williams@intel.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jane Chu <jane.chu@oracle.com>,
Christoph Hellwig <hch@infradead.org>,
Shiyang Ruan <ruansy.fnst@fujitsu.com>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linux-xfs <linux-xfs@vger.kernel.org>,
Linux NVDIMM <nvdimm@lists.linux.dev>,
Linux MM <linux-mm@kvack.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
david <david@fromorbit.com>, "Luck, Tony" <tony.luck@intel.com>,
Mauro Carvalho Chehab <mchehab@kernel.org>
Subject: Re: [PATCH v11 1/8] dax: Introduce holder for dax_device
Date: Thu, 7 Apr 2022 18:38:05 -0700
Message-ID: <CAPcyv4g9m13VGq9mFHHhd301jZk-OQC47MGpB9nU=erA0i2ZCg@mail.gmail.com>
In-Reply-To: <20220406203900.GR27690@magnolia>
[ add Mauro and Tony for RAS discussion ]
On Wed, Apr 6, 2022 at 1:39 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Apr 05, 2022 at 06:22:48PM -0700, Dan Williams wrote:
> > On Tue, Apr 5, 2022 at 5:55 PM Jane Chu <jane.chu@oracle.com> wrote:
> > >
> > > On 3/30/2022 9:18 AM, Darrick J. Wong wrote:
> > > > On Wed, Mar 30, 2022 at 08:49:29AM -0700, Christoph Hellwig wrote:
> > > >> On Wed, Mar 30, 2022 at 06:58:21PM +0800, Shiyang Ruan wrote:
> > > >>> As the code I pasted before, pmem driver will subtract its ->data_offset,
> > > >>> which is byte-based. And the filesystem who implements ->notify_failure()
> > > >>> will calculate the offset in unit of byte again.
> > > >>>
> > > >>> So, leave its function signature byte-based, to avoid repeated conversions.
> > > >>
> > > >> I'm actually fine either way, so I'll wait for Dan to comment.
> > > >
> > > > FWIW I'd convinced myself that the reason for using byte units is to
> > > > make it possible to reduce the pmem failure blast radius to subpage
> > > > units... but then I've also been distracted for months. :/
> > > >
> > >
> > > Yes, thanks Darrick! I recall that.
> > > Maybe just add a comment about why byte unit is used?
> >
> > I think we start with page failure notification and then figure out
> > how to get finer grained through the dax interface in follow-on
> > changes. Otherwise, for finer grained error handling support,
> > memory_failure() would also need to be converted to stop upcasting
> > cache-line granularity to page granularity failures. The native MCE
> > notification communicates a 'struct mce' that can be in terms of
> > sub-page bytes, but the memory management implications are all page
> > based. I assume the FS implications are all FS-block-size based?
>
> I wouldn't necessarily make that assumption -- for regular files, the
> user program is in a better position to figure out how to reset the file
> contents.
>
> For fs metadata, it really depends. In principle, if (say) we could get
> byte granularity poison info, we could look up the space usage within
> the block to decide if the poisoned part was actually free space, in
> which case we can correct the problem by (re)zeroing the affected bytes
> to clear the poison.
>
> Obviously, if the blast radius hits the internal space info or something
> that was storing useful data, then you'd have to rebuild the whole block
> (or the whole data structure), but that's not necessarily a given.
tl;dr: dax_holder_notify_failure() != fs->notify_failure()
So I think I see some confusion between what DAX->notify_failure()
needs, what memory_failure() needs, the raw information provided by
the hardware, and the failure granularity the filesystem can make use
of.
DAX and memory_failure() need to make immediate page granularity
decisions. They both need to map out whole pages (in the direct map
and userspace respectively) to prevent future poison consumption, at
least until the poison is repaired.
The event that leads to a page being failed can be triggered by a
hardware error as small as an individual cacheline. While that is
interesting to a filesystem it isn't information that memory_failure()
and DAX can utilize.
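To make the upcast concrete, here is a minimal userspace sketch of
widening a byte-granular error report to the whole pages it touches.
It is illustrative only, not kernel code: the 4 KiB page size and the
names SKETCH_PAGE_SHIFT, struct pfn_range and upcast_to_pages() are
assumptions made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical result type: a run of whole page frames. */
struct pfn_range {
	uint64_t first_pfn;
	uint64_t nr_pages;
};

/*
 * Widen a byte-granular report (for example one 64-byte cacheline)
 * to every page frame that the byte range [phys_addr, phys_addr + len)
 * overlaps. The sub-page position of the error is lost here.
 */
static struct pfn_range upcast_to_pages(uint64_t phys_addr, uint64_t len)
{
	struct pfn_range r;
	uint64_t last;

	assert(len > 0);
	last = phys_addr + len - 1;	/* last poisoned byte */
	r.first_pfn = phys_addr >> SKETCH_PAGE_SHIFT;
	r.nr_pages = (last >> SKETCH_PAGE_SHIFT) - r.first_pfn + 1;
	return r;
}
```

A single poisoned cacheline at 0x1040 therefore fails all of pfn 1,
and a report that straddles a page boundary fails both pages.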
The reason DAX needs to have a callback into filesystem code is to map
the page failure back to all the processes that might have that page
mapped because reflink means that page->mapping is not sufficient to
find all the affected 'struct address_space' instances. So it's more
of an address-translation / "help me kill processes" service than a
general failure notification service.
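As a rough illustration of why one owner pointer per page is not
enough, the sketch below models a toy reverse map in which several
files can reference the same device range. Everything in it is
hypothetical (struct sketch_extent, sketch_find_owners(), the integer
file ids); it is not how XFS or the dax code implements the lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reverse-map record: one file's claim on a device range. */
struct sketch_extent {
	int file_id;		/* stand-in for a 'struct address_space' */
	uint64_t dev_start;	/* device-relative byte offset */
	uint64_t len;		/* length in bytes */
};

/*
 * Collect every file whose extent overlaps the failed device range.
 * With reflink, more than one record can match the same range.
 * Returns the number of matches; at most max_owners ids are stored.
 */
static size_t sketch_find_owners(const struct sketch_extent *map, size_t n,
				 uint64_t fail_start, uint64_t fail_len,
				 int *owners, size_t max_owners)
{
	size_t found = 0;
	size_t i;

	assert(fail_len > 0);
	for (i = 0; i < n; i++) {
		uint64_t ext_end = map[i].dev_start + map[i].len;

		if (map[i].dev_start < fail_start + fail_len &&
		    fail_start < ext_end) {
			if (found < max_owners)
				owners[found] = map[i].file_id;
			found++;
		}
	}
	return found;
}
```

The caller would then walk the returned owners and unmap the page from
each of them, which is the "help me kill processes" part.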
Currently, when a raw hardware event happens, there are mechanisms
such as arch-specific notifier chains (powerpc::mce_register_notifier()
and x86::mce_register_decode_chain()) or other platform firmware code
like ghes_edac_report_mem_error() that uplevel the error to a coarse
page granularity failure, while emitting the fine granularity error
event to userspace.
All of this to say that the interface to ask the fs to do the bottom
half of memory_failure() (walking affected 'struct address_space'
instances and killing processes via mf_dax_kill_procs()) is different
from the general interface to tell the filesystem that memory has gone
bad relative to a device. So if the only caller of the
fs->notify_failure() handler is this code:
+       if (pgmap->ops->memory_failure) {
+               rc = pgmap->ops->memory_failure(pgmap, PFN_PHYS(pfn), PAGE_SIZE,
+                               flags);
...then you'll never get fine-grained reports. So, I still think the
DAX, pgmap and memory_failure() interface should be pfn based. The
interface to the *filesystem* ->notify_failure() can still be
byte-based, but the trigger for that byte-based interface will likely
need to be something driven by another agent. Perhaps something like
rasdaemon in userspace could translate all the arch-specific physical
address events back into device-relative offsets and then call a new
ABI that is serviced by fs->notify_failure() on the backend.
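A rough sketch of that translation step, under stated assumptions:
the agent knows where the device's range starts in physical address
space and how large its leading metadata area (->data_offset) is, and
there is some byte-based entry point to call. struct sketch_pmem,
sketch_notify_fn, sketch_report() and sketch_record() are all
hypothetical names for illustration; no such ABI exists today.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of one pmem device. */
struct sketch_pmem {
	uint64_t phys_start;	/* physical address of the device range */
	uint64_t data_offset;	/* metadata bytes before the data area */
	uint64_t data_size;	/* bytes in the data area */
};

/* Hypothetical byte-based handler, standing in for ->notify_failure(). */
typedef int (*sketch_notify_fn)(uint64_t dev_offset, uint64_t len, void *ctx);

/* Records the last event it was handed; only used to show the result. */
struct sketch_event {
	uint64_t offset;
	uint64_t len;
	int calls;
};

static int sketch_record(uint64_t dev_offset, uint64_t len, void *ctx)
{
	struct sketch_event *ev = ctx;

	ev->offset = dev_offset;
	ev->len = len;
	ev->calls++;
	return 0;
}

/*
 * Translate a physical error range into a device-relative byte range,
 * clamp it to the data area, and forward it at byte granularity.
 * Returns -1 without calling notify if the range misses the data area.
 */
static int sketch_report(const struct sketch_pmem *dev, uint64_t phys,
			 uint64_t len, sketch_notify_fn notify, void *ctx)
{
	uint64_t data_start = dev->phys_start + dev->data_offset;
	uint64_t data_end = data_start + dev->data_size;
	uint64_t start, end;

	assert(len > 0);
	end = phys + len;
	if (end <= data_start || phys >= data_end)
		return -1;
	start = phys < data_start ? data_start : phys;
	if (end > data_end)
		end = data_end;
	return notify(start - data_start, end - start, ctx);
}
```

The point of the sketch is only that the byte-granular offset and
length survive all the way to the handler, which the PAGE_SIZE call
above cannot provide.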