linux-mm.kvack.org archive mirror
From: ebiederm@xmission.com (Eric W. Biederman)
To: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Chris Mason <chris.mason@oracle.com>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Martin Schwidefsky <schwidefsky@de.ibm.com>,
	Theodore Ts'o <tytso@mit.edu>,
	stable@kernel.org
Subject: Re: [RFC][PATCH] block: Isolate the buffer cache in it's own mappings.
Date: Sun, 21 Oct 2007 01:09:36 -0600
Message-ID: <m18x5xc5an.fsf@ebiederm.dsl.xmission.com>
In-Reply-To: <200710211536.24722.nickpiggin@yahoo.com.au> (Nick Piggin's message of "Sun, 21 Oct 2007 15:36:24 +1000")

Nick Piggin <nickpiggin@yahoo.com.au> writes:

> On Sunday 21 October 2007 14:53, Eric W. Biederman wrote:
>> Nick Piggin <nickpiggin@yahoo.com.au> writes:
>> > On Saturday 20 October 2007 07:27, Eric W. Biederman wrote:
>> >> Andrew Morton <akpm@linux-foundation.org> writes:
>> >> > I don't think we little angels want to tread here.  There are so many
>> >> > weirdo things out there which will break if we bust the coherence
>> >> > between the fs and /dev/hda1.
>> >>
>> >> We broke coherence between the fs and /dev/hda1 when we introduced
>> >> the page cache years ago,
>> >
>> > Not for metadata. And I wouldn't expect many filesystem analysis
>> > tools to care about data.
>>
>> Well tools like dump certainly weren't happy when we made the change.
>
> Doesn't that give you any suspicion that other tools mightn't
> be happy if we make this change, then?

I read a representative sample of the relevant tools before replying
to Andrew.

>> >> and weird hacky cases like
>> >> unmap_underlying_metadata don't change that.
>> >
>> > unmap_underlying_metadata isn't about raw block device access at
>> > all, though (if you write to the filesystem via the blockdevice
>> > when it isn't expecting it, it's going to blow up regardless).
>>
>> Well my goal with separating things is so that we could decouple two
>> pieces of code that have different usage scenarios, and where
>> supporting both scenarios simultaneously appears to me to needlessly
>> complicate the code.
>>
>> Added to that we could then tune the two pieces of code for their
>> different users.
>
> I don't see too much complication from it. If we can actually
> simplify things or make useful tuning, maybe it will be worth
> doing.

That was my feeling: we could simplify things, the block layer
page cache operations certainly.

I know that filesystems that use the buffer cache, like reiserfs and
JBD, could stop worrying about their buffers becoming mysteriously
dirty.  Beyond that I think there is a lot of opportunity; I just
haven't looked into it much yet.

>> >> Currently only
>> >> metadata is more or less in sync with the contents of /dev/hda1.
>> >
>> > It either is or it isn't, right? And it is, isn't it? (at least
>> > for the common filesystems).
>>
>> ext2 doesn't store directories in the buffer cache.
>  
> Oh that's what you mean. OK, agreed there. But for the filesystems
> and types of metadata that can now expect to have coherency, doing
> this will break that expectation.
>
> Again, I have no opinions either way on whether we should do that
> in the long run. But doing it as a kneejerk response to braindead
> rd.c code is wrong because of what *might* go wrong and we don't
> know about.

The rd.c code would be perfectly valid if nothing were forcing buffer
heads onto its pages.  It is a conflict of expectations.

Regardless, I didn't do it as a kneejerk, and I don't think that
patch should be merged at this time.  I proposed it because, as I
see it, it starts untangling the mess that is the buffer cache.
rd.c was just my entry point into understanding how all of those
pieces work.  I was doing my best to completely explore my options
and what the code was doing before settling on the fix for rd.c.

>> Journaling filesystems and filesystems that do ordered writes
>> game the buffer cache.  Putting in data that should not yet
>> be written to disk.  That gaming is where reiserfs goes BUG
>> and where JBD moves the dirty bit to a different dirty bit.
>
> Filesystems really want better control of writeback, I think.
> This isn't really a consequence of the unified blockdev pagecache
> / metadata buffer cache, it is just that most of the important
> things they do are with metadata.

Yes.

> If they have their own metadata inode, then they'll need to game
> the cache for it, or the writeback code for that inode somehow
> too.

Yes.  Although they will at least get the guarantee that no one
else is dirtying their pages at strange times. 


>> So as far as I can tell what is in the buffer cache is not really
>> in sync with what should be on disk at any given moment except
>> when everything is clean.
>
> Naturally. It is a writeback cache.

It is not so much that, as that the order in which things go into
the cache does not match the order in which the blocks go to disk.
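
To make the ordering point concrete, here is a toy Python model.  It
is only a sketch and not how the kernel's buffer cache works: the
class name, the block numbers, and the assumption that writeback
sorts by block number are all made up for illustration.

```python
from collections import OrderedDict


class ToyWritebackCache:
    """Toy model only; names and policy are illustrative assumptions."""

    def __init__(self):
        self.dirty = OrderedDict()  # block number -> data, in dirtying order
        self.disk_log = []          # order in which blocks reached "disk"

    def write(self, block, data):
        # Dirty a block in the cache; nothing goes to disk yet.
        self.dirty[block] = data

    def flush(self):
        # Assumption: writeback issues blocks sorted by block number,
        # which need not match the order in which they were dirtied.
        for block in sorted(self.dirty):
            self.disk_log.append(block)
        self.dirty.clear()


cache = ToyWritebackCache()
for block in (90, 7, 42):           # order the filesystem dirtied the blocks
    cache.write(block, b"x")
dirtied_order = list(cache.dirty)
cache.flush()

print(dirtied_order)                # [90, 7, 42]
print(cache.disk_log)               # [7, 42, 90]
```

A reader of the cache sees blocks appear in the first order; a reader
of the disk sees them appear in the second.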

>> My suspicion is that actually reading from disk is likely to
>> give a more coherent view of things.  Because there at least
>> we have the writes as they are expected to be seen by fsck
>> to recover the data, and a snapshot there should at least
>> be recoverable.  Whereas a snapshot provides no such guarantees.
>
> ext3 fsck I don't think is supposed to be run under a read/write
> filesystem, so it's going to explode if you do that regardless.

Yes.  I was thinking of dump or something like it here, where
we simply read out the data and try to make some coherent sense
of it.  If we see a version of the metadata that points to things
that have not been finished yet, or that is in the process of being
written to, that could be a problem.

As far as I can tell, code going through the buffer cache
doesn't use little things like the page lock when writing data, so
page cache reads can potentially race with what should
be atomic writes.
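
The kind of race I mean can be sketched with a toy Python example.
This is not kernel code: the "page" is just a dict, the field names
ptr and len are hypothetical, and a generator stands in for a writer
that can be interrupted between two stores.

```python
def unlocked_update(page, new_ptr, new_len):
    """Writer that updates two related fields in separate steps."""
    page["ptr"] = new_ptr
    yield                       # a reader may run here, mid-update
    page["len"] = new_len
    yield


page = {"ptr": 100, "len": 4}            # consistent old state
writer = unlocked_update(page, 200, 8)   # intended new state: (200, 8)

next(writer)                   # writer performs only the first store
snapshot = dict(page)          # reader copies the page with no lock held
for _ in writer:               # writer completes the update
    pass

print(snapshot)                # {'ptr': 200, 'len': 4}
print(page)                    # {'ptr': 200, 'len': 8}
```

The snapshot matches neither the old state nor the new one.  Holding
a lock across both stores, and taking the same lock in the reader,
would rule that interleaving out.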

Eric


