Re: [LSF/MM TOPIC] do we really need PG_error at all?

From: Andreas Dilger <adilger@dilger.ca>
To: Jeff Layton <jlayton@redhat.com>
Cc: NeilBrown <neilb@suse.com>,
	James Bottomley <James.Bottomley@HansenPartnership.com>,
	linux-mm <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	lsf-pc <lsf-pc@lists.linuxfoundation.org>,
	Neil Brown <neilb@suse.de>,
	linux-scsi <linux-scsi@vger.kernel.org>,
	linux-block@vger.kernel.org
Subject: Re: [LSF/MM TOPIC] do we really need PG_error at all?
Date: Mon, 27 Feb 2017 15:51:24 -0700
Message-ID: <DC27F5BA-BCCA-41FF-8D41-7BB99AA4DB26@dilger.ca>
In-Reply-To: <1488208047.2876.6.camel@redhat.com>

On Feb 27, 2017, at 8:07 AM, Jeff Layton <jlayton@redhat.com> wrote:
> 
> On Mon, 2017-02-27 at 11:27 +1100, NeilBrown wrote:
>> On Sun, Feb 26 2017, James Bottomley wrote:
>> 
>>> On Mon, 2017-02-27 at 08:03 +1100, NeilBrown wrote:
>>>> On Sun, Feb 26 2017, James Bottomley wrote:
>>>> 
>>>>> [added linux-scsi and linux-block because this is part of our error
>>>>> handling as well]
>>>>> On Sun, 2017-02-26 at 09:42 -0500, Jeff Layton wrote:
>>>>>> Proposing this as a LSF/MM TOPIC, but it may turn out to be me
>>>>>> just not understanding the semantics here.
>>>>>> 
>>>>>> As I was looking into -ENOSPC handling in cephfs, I noticed that
>>>>>> PG_error is only ever tested in one place [1]
>>>>>> __filemap_fdatawait_range, which does this:
>>>>>> 
>>>>>> 	if (TestClearPageError(page))
>>>>>> 		ret = -EIO;
>>>>>> 
>>>>>> This error code will override any AS_* error that was set in the
>>>>>> mapping. Which makes me wonder...why don't we just set this error
>>>>>> in the mapping and not bother with a per-page flag? Could we
>>>>>> potentially free up a page flag by eliminating this?
>>>>> 
>>>>> Note that currently the AS_* codes are only set for write errors
>>>>> not for reads and we have no mapping error handling at all for swap
>>>>> pages, but I'm sure this is fixable.
>>>> 
>>>> How is a read error different from a failure to set PG_uptodate?
>>>> Does PG_error suppress retries?
>>> 
>>> We don't do any retries in the code above the block layer (or at least
>>> we shouldn't).
>> 
>> I was wondering about what would/should happen if a read request was
>> re-issued for some reason.  Should the error flag on the page cause an
>> immediate failure, or should it try again?
>> If read-ahead sees a read-error on some future page, is it necessary to
>> record the error so subsequent read-aheads don't notice the page is
>> missing and repeatedly try to re-load it?
>> When the application eventually gets to the faulty page, should a read
>> be tried then, or is the read-ahead failure permanent?
>> 
>> 
>> 
>>> 
>>>>> 
>>>>> From the I/O layer point of view we take great pains to try to
>>>>> pinpoint the error exactly to the sector.  We reflect this up by
>>>>> setting the PG_error flag on the page where the error occurred.  If
>>>>> we only set the error on the mapping, we lose that granularity,
>>>>> because the mapping is mostly at the file level (or VMA level for
>>>>> anon pages).
>>>> 
>>>> Are you saying that the IO layer finds the page in the bi_io_vec and
>>>> explicitly sets PG_error,
>>> 
>>> I didn't say anything about the mechanism.  I think the function you're
>>> looking for is fs/mpage.c:mpage_end_io().  Layers below block indicate
>>> the position in the request.  Block maps the position to bio and the
>>> bio completion maps to page.  So the actual granularity seen in the
>>> upper layer depends on how the page to bio mapping is done.
>> 
>> If the block layer is just returning the status at a per-bio level (which
>> makes perfect sense), then this has nothing directly to do with the
>> PG_error flag.
>> 
>> The page cache needs to do something with bi_error, but it isn't
>> immediately clear that it needs to set PG_error.
>> 
>>>> rather than just passing an error indication to bi_end_io ??  That
>>>> would seem to be wrong as the page may not be in the page cache.
>>> 
>>> Usually pages in the mpage_end_io path are pinned, I think.
>>> 
>>>> So I guess I misunderstand you.
>>>> 
>>>>> 
>>>>> So I think the question for filesystem people from us would be do
>>>>> you care about this accuracy?  If it's OK just to know an error
>>>>> occurred somewhere in this file, then perhaps we don't need it.
>>>> 
>>>> I had always assumed that a bio would either succeed or fail, and
>>>> that no finer granularity could be available.
>>> 
>>> It does ... but a bio can be as small as a single page.
>>> 
>>>> I think the question here is: Do filesystems need the pagecache to
>>>> record which pages have seen an IO error?
>>> 
>>> It's not just filesystems.  The partition code uses PageError() ... the
>>> metadata code might as well (those are things with no mapping).  I'm
>>> not saying we can't remove PG_error; I am saying it's not going to be
>>> quite as simple as using the AS_ flags.
>> 
>> The partition code could use PageUptodate().
>> mpage_end_io() calls page_endio() on each page, and on read error that
>> calls:
>> 
>> 			ClearPageUptodate(page);
>> 			SetPageError(page);
>> 
>> are both of these necessary?
>> 
>> fs/buffer.c can use several bios to read a single page.
>> If any one returns an error, PG_error is set.  When all of them have
>> completed, if PG_error is clear, PG_uptodate is then set.
>> This is an opportunistic use of PG_error, rather than an essential use.
>> It could be "fixed", and would need to be fixed if we were to deprecate
>> use of PG_error for read errors.
>> There are probably other usages like this.
>> 
> 
> Ok, I think I get it (somewhat):
> 
> The tricky part there is how to handle the PageError check in
> read_dev_sector if you don't use SetPageError in the result handler.
> 
> If we can count on read_pagecache_sector and read_dax_sector reliably
> returning an error when the page is considered to be in the cache
> (PageUptodate) but had a read error, then that would work. I'm not sure
> how you'd indicate that without something like PG_error though if you
> don't want to retry on every attempt.
> 
> OTOH, if we want to always retry to read in pages that have had read
> errors when someone requests them, then we can simply not set
> PageUptodate when readahead fails.
> 
> To chip away at the edges of this, what may make sense is to get this
> flag out of the writeback code as much as we can. When a write fails and
> SetPageError is called, we should also mark the mapping with an error.
> Then, we should be able to stop overriding the mapping error with -EIO
> in that codepath. Maybe call ClearPageError, or maybe leave it alone
> there?

My thought is that PG_error is definitely useful for applications to get
correct errors back from write()/sync_file_range(), so that they know there
is an error in the data that _they_ wrote. Without it, a thread can receive
an error for data that was written by another thread and, in doing so, clear
that error so that the thread which actually hit the write failure *doesn't*
know about it.

As for stray sync() clearing PG_error from underneath an application, that
shouldn't happen since filemap_fdatawait_keep_errors() doesn't clear errors
and is used by device flushing code (fdatawait_one_bdev(), wait_sb_inodes()).
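
A toy user-space model of the difference, in case it helps. This is only a
sketch and not kernel code: every toy_* name is made up for illustration,
the bools merely stand in for PG_error and an AS_* bit, and
toy_wait_keep_errors() models nothing more than the "doesn't clear errors"
behaviour described above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 8

/* Toy model, NOT kernel code: one error bit per page plus one per file. */
struct toy_mapping {
	bool page_error[TOY_NPAGES];	/* stands in for PG_error */
	bool mapping_error;		/* stands in for an AS_* error bit */
};

/* A write to page idx failed: record it at both granularities. */
static void toy_write_failed(struct toy_mapping *m, size_t idx)
{
	m->page_error[idx] = true;
	m->mapping_error = true;
}

/* Per-page reporting: test-and-clear only the pages in [start, end]. */
static int toy_wait_per_page(struct toy_mapping *m, size_t start, size_t end)
{
	int ret = 0;

	for (size_t i = start; i <= end && i < TOY_NPAGES; i++) {
		if (m->page_error[i]) {
			m->page_error[i] = false;
			ret = -EIO;
		}
	}
	return ret;
}

/* Mapping-only reporting: whoever waits first consumes the error. */
static int toy_wait_mapping_only(struct toy_mapping *m)
{
	if (!m->mapping_error)
		return 0;
	m->mapping_error = false;
	return -EIO;
}

/* Models a sync()-style wait that leaves all error state untouched. */
static void toy_wait_keep_errors(const struct toy_mapping *m)
{
	(void)m;	/* assumption: waits only, clears nothing */
}

/* Thread A wrote pages 0-1, thread B wrote pages 4-5; page 5 failed. */
static int toy_demo(void)
{
	struct toy_mapping m = { 0 };

	toy_write_failed(&m, 5);
	toy_wait_keep_errors(&m);			/* stray sync() */

	/* Per page: A sees success, B sees its own error. */
	assert(toy_wait_per_page(&m, 0, 1) == 0);
	assert(toy_wait_per_page(&m, 4, 5) == -EIO);

	/* Mapping only: A, waiting first, gets B's error and clears it. */
	assert(toy_wait_mapping_only(&m) == -EIO);	/* thread A */
	assert(toy_wait_mapping_only(&m) == 0);		/* thread B */
	return 0;
}
```

Calling toy_demo() from a main() runs both cases. With only the per-file
bit, the result depends on which thread waits first; with the per-page bit,
each thread only sees errors in the range it asked about.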

Cheers, Andreas

>>> 
>>> James
>>> 
>>>> I think that for write errors, there is no value in recording
>>>> block-oriented error status - only file-oriented status.
>>>> For read errors, it might help to avoid indefinite read retries,
>>>> but I don't know the code well enough to be sure if this is an issue.
>>>> 
>>>> NeilBrown
>>>> 
>>>> 
>>>>> 
>>>>> James
>>>>> 
>>>>>> The main argument I could see for keeping it is that removing it
>>>>>> might subtly change the behavior of sync_file_range if you have
>>>>>> tasks syncing different ranges in a file concurrently. I'm not
>>>>>> sure if that would break any guarantees though.
>>>>>> 
>>>>>> Even if we do need it, I think we might need some cleanup here
>>>>>> anyway. A lot of readpage operations end up setting that flag
>>>>>> when they hit an error. Isn't it wrong to return an error on
>>>>>> fsync, just because we had a read error somewhere in the file in
>>>>>> a range that was never dirtied?
>>>>>> 
>>>>>> --
>>>>>> [1]: there is another place in f2fs, but it's more or less
>>>>>> equivalent to the call site in __filemap_fdatawait_range.
>>>>>> 
> 
> --
> Jeff Layton <jlayton@redhat.com>


