linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Andi Kleen <andi.kleen@intel.com>,
	Wu Fengguang <fengguang.wu@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Tony Luck <tony.luck@intel.com>, Rik van Riel <riel@redhat.com>,
	Jun'ichi Nomura <j-nomura@ce.jp.nec.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
Date: Mon, 27 Aug 2012 18:05:06 -0400	[thread overview]
Message-ID: <1346105106-26033-1-git-send-email-n-horiguchi@ah.jp.nec.com> (raw)
In-Reply-To: <20120826222607.GD19235@dastard>

On Mon, Aug 27, 2012 at 08:26:07AM +1000, Dave Chinner wrote:
> On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote:
> > On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote:
> > > On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote:
> > > > On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote:
> > > > > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
> > > > > > "HWPOISON: report sticky EIO for poisoned file" still has a corner case
> > > > > > where we have possibilities of data lost. This is because in this fix
> > > > > > AS_HWPOISON is cleared when the inode cache is dropped.
> > > ....
> > > > > > --- v3.6-rc1.orig/mm/truncate.c
> > > > > > +++ v3.6-rc1/mm/truncate.c
> > > > > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize)
> > > > > >  
> > > > > >  	oldsize = inode->i_size;
> > > > > >  	i_size_write(inode, newsize);
> > > > > > +	if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize))
> > > > > > +		mapping_clear_hwpoison(inode->i_mapping);
> > > > > 
> > > > > So only a truncate to zero size will clear the poison flag?
> > > > 
> > > > Yes, this is because we only know if the file is affected by hwpoison,
> > > > but not where the hwpoisoned page is in the file. We could remember it,
> > > > but I did not do it for simplicity.
> > > 
> > > Surely the page has flags on it to say it is poisoned? That is,
> > > after truncating the page cache, if the address space is poisoned,
> > > then you can do a pass across the mapping tree checking if there are
> > > any poisoned pages left? Or perhaps adding a new mapping tree tag so
> > > that the poisoned status is automatically determined by the presence
> > > of the poisoned page in the mapping tree?
> > 
> > The answer for the first question is yes. And for the second question,
> > I don't think it's easy because the mapping tree has no reference to
> > the error page (I explain more about this below, please see also it,)
> > and it can cost too much to search poisoned pages over page cache in
> > each request.
> 
> Which is my point about a radix tree tag - that's very efficient.
> 
> > And for the third question, I think we could do this, but to do it
> > we need an additional space (8 bytes) in struct radix_tree_node.
> > Considering that this new tag is not used so frequently, so I'm not
> > sure that it's worth the cost.
> 
> A radix tree node is currently 560 bytes on x86_64, packed 7 to a
> page. i.e. using 3920 bytes. We can add another 8 bytes to it
> without increasing memory usage at all. So, no cost at all.

OK.

> > > > > What happens if it is the last page in the mapping that is poisoned,
> > > > > and we truncate that away? Shouldn't that clear the poisoned bit?
> > > > 
> > > > When we handle the hwpoisoned inode, the error page should already
> > > > be removed from pagecache, so the remaining pages are not related
> > > > to the error and we need not care about them when we consider bit
> > > > clearing.
> > > 
> > > Sorry, I don't follow. What removed the page from the page cache?
> > > The truncate_page_cache() call that follows the above code hunk is
> > > what does that, so I don't see how it can already be removed from
> > > the page cache....
> > 
> > Memory error handler (memory_failure() in mm/memory-failure.c) has
> > removed the error page from the page cache.
> > And truncate_page_cache() that follows this hunk removes all pages
> > belonging to the page cache of the poisoned file (where the error
> > page itself is not included in them.)
> > 
> > Let me explain more to clarify my whole scenario. If a memory error
> > hits on a dirty pagecache, kernel works like below:
> > 
> >   1. handles a MCE interrupt (logging MCE events,)
> >   2. calls memory error handler (doing 3 to 6,)
> >   3. sets PageHWPoison flag on the error page,
> >   4. unmaps all mappings to processes' virtual addresses,
> 
> So nothing in userspace sees the bad page after this.
> 
> >   5. sets AS_HWPOISON on mappings to which the error page belongs
> >   6. invalidates the error page (unlinks it from LRU list and removes
> >      it from pagecache,)
> >   (memory error handler finished)
> 
> Ok, so the moment a memory error is handled, the page has been
> removed from the inode's mapping, and it will never be seen by
> aplications again. It's a transient error....
> 
> >   7. later accesses to the file returns -EIO,
> >   8. AS_HWPOISON is cleared when the file is removed or completely
> >      truncated.
> 
> .... so why do we have to keep an EIO on the inode forever?
> 
> If the page is not dirty, then just tossing it from the cache (as
> is already done) and rereading it from disk next time it is accessed
> removes the need for any error to be reported at all. It's
> effectively a transient error at this point, and as such no errors
> should be visible from userspace.
> 
> If the page is dirty, then it needs to be treated just like any
> other failed page write - the page is invalidated and the address
> space is marked with AS_EIO, and that is reported to the next
> operation that waits on IO on that file (i.e. fsync)
> 
> If you have a second application that reads the files that depends
> on a guarantee of good data, then the first step in that process is
> that application that writes it needs to use fsync to check the data
> was written correctly. That ensures that you only have clean pages
> in the cache before the writer closes the file, and any h/w error
> then devolves to the above transient clean page invalidation case.

Thank you for detailed explanations.
And yes, I understand it's ideal, but many applications choose not to
do that for performance reason.
So I think it's helpful if we can surely report to such applications.

> Hence I fail to see why this type of IO error needs to be sticky.
> The error on the mapping is transient - it is gone as soon as the
> page is removed from the mapping. Hence the error can be dropped as
> soon as it is reported to userspace because the mapping is now error
> free.

It's error free only for the applications which do fsync check in
each write, but not for the applications which don't do.
I think the penalty for the latters (ignore dirty data lost and get
wrong results) is too big to consider it as a reasonable trade-off.

> > You may think it strange that the condition of clearing AS_HWPOISON
> > is checked with file granularity.
> 
> I don't think it is strange, I think it is *wrong*.

OK, we can move to a new tag approach, and we can avoid this.

> > This is because currently userspace
> > applications know the memory errors only with file granularity for
> > simplicity, when they access via read(), write() and/or fsync().
> 
> Trying to report this error to every potential future access
> regardless of whether the error still exists or the access is to the
> poisoned range with magical sticky errors on inode address spaces
> and hacks to memory the reclaim subsystems smells to me like a really
> bad hack to work around applications that use bad data integrity
> practices.
> 
> As such, I think you probably need to rethink the approach you are
> taking to handling this error. The error is transient w.r.t. to the
> mapping and page cache, and needs to be addressed consistently
> compared to other transient IO errors that are reported through the
> mapping....

Agreed. We can handle this error without a controversial flag on the
mapping, in new pagecache tag approach. I'll try that one.

Thanks,
Naoya

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2012-08-27 22:05 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-08-22 15:17 [PATCH 0/3 v2] HWPOISON: improve dirty pagecache error reporting Naoya Horiguchi
2012-08-22 15:17 ` [PATCH 1/3] HWPOISON: fix action_result() to print out dirty/clean Naoya Horiguchi
2012-08-23  9:33   ` Fengguang Wu
2012-08-23 20:31     ` Naoya Horiguchi
2012-08-22 15:17 ` [PATCH 2/3] HWPOISON: report sticky EIO for poisoned file Naoya Horiguchi
2012-08-23  9:22   ` Fengguang Wu
2012-08-23 20:31     ` Naoya Horiguchi
2012-08-22 15:17 ` [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky Naoya Horiguchi
2012-08-23  9:11   ` Fengguang Wu
2012-08-23 20:31     ` Naoya Horiguchi
2012-08-24 21:52       ` Naoya Horiguchi
2012-08-24  1:31   ` Dave Chinner
2012-08-24  2:39     ` Naoya Horiguchi
2012-08-24  4:39       ` Dave Chinner
2012-08-24 17:24         ` Naoya Horiguchi
2012-08-26 22:26           ` Dave Chinner
2012-08-27 22:05             ` Naoya Horiguchi [this message]
2012-08-29  2:59               ` Dave Chinner
2012-08-29  5:32                 ` Jun'ichi Nomura
2012-09-03  0:39                   ` Dave Chinner
2012-08-22 20:22 ` [PATCH 0/3 v2] HWPOISON: improve dirty pagecache error reporting Andi Kleen
2012-08-22 21:14   ` Naoya Horiguchi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1346105106-26033-1-git-send-email-n-horiguchi@ah.jp.nec.com \
    --to=n-horiguchi@ah.jp.nec.com \
    --cc=akpm@linux-foundation.org \
    --cc=andi.kleen@intel.com \
    --cc=david@fromorbit.com \
    --cc=fengguang.wu@intel.com \
    --cc=j-nomura@ce.jp.nec.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=riel@redhat.com \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox