* Re: (reiserfs) Re: More on Re: (reiserfs) Reiserfs and ext2fs (was Re: (reiserfs) Sum Benchmarks (these look typical?))
       [not found]         ` <m11zse6ecw.fsf@flinx.npwt.net>
@ 1998-06-25 11:00           ` Stephen C. Tweedie
  1998-06-26 15:56             ` Eric W. Biederman
  0 siblings, 1 reply; 11+ messages in thread
From: Stephen C. Tweedie @ 1998-06-25 11:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Stephen C. Tweedie, Hans Reiser, Shawn Leas, Reiserfs,
	Ken Tetrick, linux-mm

Hi,

[CC:ed to linux-mm, who also have a great deal of interest in this
stuff.]

On 24 Jun 1998 09:53:03 -0500, ebiederm+eric@npwt.net (Eric
W. Biederman) said:

ST> However, there's a lot of overlap, so I'd like to look at what we can do
ST> with this for 2.3.  In particular, I'd like 2.3's standard file writing
ST> mechanism to work essentially as write-through from the page cache,

> The current system is write-through.  I hope you mean write back.

The current system is write-through from the buffer cache.  The data
is copied into the page cache only if there is already a page mapping
that data.  That is really ugly, using the buffer cache both as an IO
buffer and as a data cache.  THAT is what we need to fix.

The ideal solution IMHO would be something which does write-through
from the page cache to the buffer cache and write-back from the buffer
cache to disk; in other words, when you write to a page, buffers are
generated to map that dirty data (without copying) there and then.
The IO is then left to the buffer cache, as currently happens, but the
buffer is deleted after IO (just like other temporary buffer_heads
behave right now).  That leaves the IO buffering to the buffer cache
and the caching to the page cache, which is the distinction that the
current scheme approaches but does not quite achieve.
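
A minimal sketch of the scheme described above, written as a toy
user-space model in Python rather than kernel code.  Every name in it
(Page, TempBuffer, Model, io_queue) is hypothetical and only stands in
for the page cache, temporary buffer_heads and the disk.  It shows a
write landing in the page, a buffer being generated that shares the
page's memory instead of copying it, and the buffer being dropped once
the IO has been done.

```python
PAGE_SIZE = 4096


class Page:
    """A page-cache page: the only place the file data is cached."""

    def __init__(self):
        self.data = bytearray(PAGE_SIZE)


class TempBuffer:
    """Stand-in for a temporary buffer_head: it maps page data, no copy."""

    def __init__(self, page, block_no):
        self.view = memoryview(page.data)  # shares memory with the page
        self.block_no = block_no


class Model:
    def __init__(self):
        self.page_cache = {}  # (file_id, page_index) -> Page
        self.io_queue = []    # pending TempBuffers, used purely for IO
        self.disk = {}        # block_no -> bytes written so far

    def write(self, file_id, index, offset, payload, block_no):
        if offset + len(payload) > PAGE_SIZE:
            raise ValueError("write crosses a page boundary")
        page = self.page_cache.setdefault((file_id, index), Page())
        page.data[offset:offset + len(payload)] = payload
        # Write-through: the buffer mapping the dirty data is created
        # there and then, pointing at the page itself.
        self.io_queue.append(TempBuffer(page, block_no))

    def flush(self):
        # Write-back: the IO happens later; each buffer is deleted after IO.
        while self.io_queue:
            buf = self.io_queue.pop(0)
            self.disk[buf.block_no] = bytes(buf.view)
            buf.view.release()
```

After a write() the data is in the page cache and queued for IO but not
yet on the toy disk; flush() performs the IO and leaves only the page
cache copy behind.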

> This functionality is essentially what is implemented with brw_page,
> and I have written the generic_page_write that does essentially
> this.  There is no data copying however.  The fun angle is mapped
> pages need to be unmapped (or at least read only mapped) for a write
> to be successful.

Indeed; however, it might be a reasonable compromise to do a copy out
from the page cache to the buffer cache in this situation (we already
have a copy in there, so this would not hurt performance relative to
the current system).  

Doing COW at the page cache level is something we can implement later;
there are other reasons for it to be desirable anyway.  For example,
it lets you convert all read(2) and write(2) requests on whole pages
into mmap()s, transparently, giving automatic zero-copy IO to user
space.

> I should have a working patch this weekend (the code compiles now, I
> just need to make sure it works) and we can discuss it more when that
> has been released.

Excellent.  I look forward to seeing it.

--Stephen

* Re: (reiserfs) Re: More on Re: (reiserfs) Reiserfs and ext2fs (was Re: (reiserfs) Sum Benchmarks (these look typical?))
  1998-06-25 11:00           ` (reiserfs) Re: More on Re: (reiserfs) Reiserfs and ext2fs (was Re: (reiserfs) Sum Benchmarks (these look typical?)) Stephen C. Tweedie
@ 1998-06-26 15:56             ` Eric W. Biederman
  1998-06-29 10:35               ` Stephen C. Tweedie
  0 siblings, 1 reply; 11+ messages in thread
From: Eric W. Biederman @ 1998-06-26 15:56 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Hans Reiser, Shawn Leas, Reiserfs, Ken Tetrick, linux-mm

>>>>> "ST" == Stephen C Tweedie <sct@dcs.ed.ac.uk> writes:

ST> Hi,
ST> [CC:ed to linux-mm, who also have a great deal of interest in this
ST> stuff.]

ST> On 24 Jun 1998 09:53:03 -0500, ebiederm+eric@npwt.net (Eric
ST> W. Biederman) said:

ST> However, there's a lot of overlap, so I'd like to look at what we can do
ST> with this for 2.3.  In particular, I'd like 2.3's standard file writing
ST> mechanism to work essentially as write-through from the page cache,

>> The current system is write-through.  I hope you mean write back.

ST> The current system is write-through from the buffer cache.  The data
ST> is copied into the page cache only if there is already a page mapping
ST> that data.  That is really ugly, using the buffer cache both as an IO
ST> buffer and as a data cache.  THAT is what we need to fix.

You're right.  But if you implement the appropriate routines so you
can use generic_file_write we do a proper write through the page
cache now.

ST> The ideal solution IMHO would be something which does write-through
ST> from the page cache to the buffer cache and write-back from the buffer
ST> cache to disk; in other words, when you write to a page, buffers are
ST> generated to map that dirty data (without copying) there and then.
ST> The IO is then left to the buffer cache, as currently happens, but the
ST> buffer is deleted after IO (just like other temporary buffer_heads
ST> behave right now).  That leaves the IO buffering to the buffer cache
ST> and the caching to the page cache, which is the distinction that the
ST> current scheme approaches but does not quite achieve.

Unless I have missed something write-back from the page cache is
important, because then when you delete a file you haven't written yet
you can completely avoid I/O.   For short lived files this should be a
performance win.

Copying the few pages that are actively engaged in being written into
the buffer cache may not be a bad idea, as it removes the lock from
the page cache page much sooner, and frees it for use again.

>> This functionality is essentially what is implemented with brw_page,
>> and I have written the generic_page_write that does essentially
>> this.  There is no data copying however.  The fun angle is mapped
>> pages need to be unmapped (or at least read only mapped) for a write
>> to be successful.

ST> Indeed; however, it might be a reasonable compromise to do a copy out
ST> from the page cache to the buffer cache in this situation (we already
ST> have a copy in there, so this would not hurt performance relative to
ST> the current system).  

Agreed.  But it takes more work to write.

ST> Doing COW at the page cache level is something we can implement later;
ST> there are other reasons for it to be desirable anyway.  For example,
ST> it lets you convert all read(2) and write(2) requests on whole pages
ST> into mmap()s, transparently, giving automatic zero-copy IO to user
ST> space.

Sounds neat but I wasn't advocating it, in this context.

>> I should have a working patch this weekend (the code compiles now, I
>> just need to make sure it works) and we can discuss it more when that
>> has been released.

ST> Excellent.  I look forward to seeing it.

I need to clean the patch up a bit (I built it on top of a patched
kernel, but I have it working right now!).   I have successfully
performed two simultaneous kernel compiles, which is a pretty good
test for races ;).

Hopefully I'll have a little time this weekend, to make a good patch,
otherwise I'll just release my mess.

Eric

* Re: (reiserfs) Re: More on Re: (reiserfs) Reiserfs and ext2fs (was Re: (reiserfs) Sum Benchmarks (these look typical?))
  1998-06-26 15:56             ` Eric W. Biederman
@ 1998-06-29 10:35               ` Stephen C. Tweedie
  1998-06-29 19:59                 ` Eric W. Biederman
  0 siblings, 1 reply; 11+ messages in thread
From: Stephen C. Tweedie @ 1998-06-29 10:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Stephen C. Tweedie, Hans Reiser, Shawn Leas, Reiserfs,
	Ken Tetrick, linux-mm

Hi,

In article <m1emwcf97d.fsf@flinx.npwt.net>, ebiederm+eric@npwt.net (Eric
W. Biederman) writes:

>>>>>> "ST" == Stephen C Tweedie <sct@dcs.ed.ac.uk> writes:
ST> The ideal solution IMHO would be something which does write-through
ST> from the page cache to the buffer cache and write-back from the buffer
ST> cache to disk; in other words, when you write to a page, buffers are
ST> generated to map that dirty data (without copying) there and then.
ST> The IO is then left to the buffer cache, as currently happens, but the
ST> buffer is deleted after IO (just like other temporary buffer_heads
ST> behave right now).  That leaves the IO buffering to the buffer cache
ST> and the caching to the page cache, which is the distinction that the
ST> current scheme approaches but does not quite achieve.

> Unless I have missed something write-back from the page cache is
> important, because then when you delete a file you haven't written yet
> you can completely avoid I/O.   For short lived files this should be a
> performance win.

We already do bforget() to deal with this in the buffer cache.  Having
the outstanding IO labelled in the buffer cache will not result in
redundant writes in this case.
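
A toy Python model of that point, under loudly stated assumptions: the
BufferCache class, its block-number interface and the unlink() helper
are all hypothetical (the kernel's bforget() operates on a buffer_head,
not a block number).  The sketch only illustrates why forgetting dirty
buffers at delete time means a short-lived file never reaches the disk.

```python
class BufferCache:
    """Toy buffer cache that tracks dirty blocks awaiting write-out."""

    def __init__(self):
        self.dirty = {}   # block_no -> data waiting to be written
        self.writes = 0   # number of blocks actually sent to the "disk"

    def mark_dirty(self, block_no, data):
        self.dirty[block_no] = data

    def bforget(self, block_no):
        # Drop the buffer without writing it: its contents no longer matter.
        self.dirty.pop(block_no, None)

    def sync(self):
        # Write out whatever is still dirty.
        self.writes += len(self.dirty)
        self.dirty.clear()


def unlink(cache, file_blocks):
    """Deleting a file forgets all of its outstanding buffers."""
    for block_no in file_blocks:
        cache.bforget(block_no)
```

With this model, marking three blocks dirty, calling unlink() on them
and then sync() performs no writes at all, while a block belonging to a
file that still exists is written as usual.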

>>> This functionality is essentially what is implemented with brw_page,
>>> and I have written the generic_page_write that does essentially
>>> this.  There is no data copying however.  The fun angle is mapped
>>> pages need to be unmapped (or at least read only mapped) for a write
>>> to be successful.

ST> Indeed; however, it might be a reasonable compromise to do a copy out
ST> from the page cache to the buffer cache in this situation (we already
ST> have a copy in there, so this would not hurt performance relative to
ST> the current system).  

> Agreed.  But it takes more work to write.

On reflection, it's not an issue.  Mapped pages do not have to be
unmapped at all.  We can continue to share between cache and buffers as
long as we want.  Later modifications to the data in the cache page will
update the buffer contents, true, but that's irrelevant as we will still
be writing valid file contents to disk when the IO arrives.  Those
semantics are just fine.

--Stephen

* Re: (reiserfs) Re: More on Re: (reiserfs) Reiserfs and ext2fs (was Re: (reiserfs) Sum Benchmarks (these look typical?))
  1998-06-29 10:35               ` Stephen C. Tweedie
@ 1998-06-29 19:59                 ` Eric W. Biederman
  1998-06-30 16:10                   ` Stephen C. Tweedie
  0 siblings, 1 reply; 11+ messages in thread
From: Eric W. Biederman @ 1998-06-29 19:59 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Hans Reiser, Shawn Leas, Reiserfs, Ken Tetrick, linux-mm

>>>>> "ST" == Stephen C Tweedie <sct@dcs.ed.ac.uk> writes:

ST> Hi,
ST> In article <m1emwcf97d.fsf@flinx.npwt.net>, ebiederm+eric@npwt.net (Eric
ST> W. Biederman) writes:

>> Unless I have missed something write-back from the page cache is
>> important, because then when you delete a file you haven't written yet
>> you can completely avoid I/O.   For short lived files this should be a
>> performance win.

ST> We already do bforget() to deal with this in the buffer cache.  Having
ST> the outstanding IO labelled in the buffer cache will not result in
ST> redundant writes in this case.

That's good to know.  It doesn't surprise me, but I hadn't been through
the code enough to see that one.  I knew about bforget; I just hadn't
seen it used.

>>>> This functionality is essentially what is implemented with brw_page,
>>>> and I have written the generic_page_write that does essentially
>>>> this.  There is no data copying however.  The fun angle is mapped
>>>> pages need to be unmapped (or at least read only mapped) for a write
>>>> to be successful.

ST> Indeed; however, it might be a reasonable compromise to do a copy out
ST> from the page cache to the buffer cache in this situation (we already
ST> have a copy in there, so this would not hurt performance relative to
ST> the current system).  

>> Agreed.  But it takes more work to write.

ST> On reflection, it's not an issue.  Mapped pages do not have to be
ST> unmapped at all.  We can continue to share between cache and buffers as
ST> long as we want.  Later modifications to the data in the cache page will
ST> update the buffer contents, true, but that's irrelevant as we will still
ST> be writing valid file contents to disk when the IO arrives.  Those
ST> semantics are just fine.

There are two problems I see.  

1) A DMA controller actively accessing the same memory the CPU is
accessing could be a problem.  Recall video flicker on old video
cards.

2) More importantly, the CPU writes to the _cache_, and the DMA
controller reads from the RAM.  I don't see any consistency guarantees
there.  We may be able to solve these problems on a per-architecture or
per-device basis, however.

Eric

* Re: (reiserfs) Re: More on Re: (reiserfs) Reiserfs and ext2fs (was Re: (reiserfs) Sum Benchmarks (these look typical?))
  1998-06-29 19:59                 ` Eric W. Biederman
@ 1998-06-30 16:10                   ` Stephen C. Tweedie
  1998-07-01  0:17                     ` Eric W. Biederman
  0 siblings, 1 reply; 11+ messages in thread
From: Stephen C. Tweedie @ 1998-06-30 16:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Stephen C. Tweedie, Hans Reiser, Shawn Leas, Reiserfs,
	Ken Tetrick, linux-mm

Hi,

On 29 Jun 1998 14:59:37 -0500, ebiederm+eric@npwt.net (Eric
W. Biederman) said:

> There are two problems I see.  

> 1) A DMA controller actively accessing the same memory the CPU is
> accessing could be a problem.  Recall video flicker on old video
> cards.

Shouldn't be a problem.

> 2) More importantly, the CPU writes to the _cache_, and the DMA
> controller reads from the RAM.  I don't see any consistency guarantees
> there.  We may be able to solve these problems on a per-architecture or
> per-device basis, however.

Again, not important.  If we ever modify a page which is already being
written out to a device, then we mark that page dirty.  On write, we
mark it clean (but locked) _before_ starting the IO, not after.  So, if
there is ever an overlap of a filesystem/mmap write with an IO to disk,
we will always schedule another IO later to clean the re-dirtied
buffers.
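
The ordering argument can be sketched as a toy Python model.  This is
not kernel code: Page, cpu_store(), write_out() and flush() are
hypothetical names, and the "device" is simulated by reading the page in
two halves so that a store can land in the middle of the transfer.  The
point it illustrates is that clearing the dirty flag before the IO
starts means a racing store re-dirties the page, so a second IO is
always scheduled.

```python
class Page:
    def __init__(self, data):
        self.data = bytearray(data)
        self.dirty = True
        self.locked = False


def cpu_store(page, offset, payload):
    """A filesystem or mmap write: it modifies the page and marks it dirty."""
    page.data[offset:offset + len(payload)] = payload
    page.dirty = True


def write_out(page, disk, block_no, racing_store=None):
    # Mark clean (but locked) BEFORE starting the IO, not after.
    page.dirty = False
    page.locked = True
    half = len(page.data) // 2
    first = bytes(page.data[:half])    # device transfers the first half
    if racing_store is not None:
        racing_store(page)             # a store lands mid-transfer
    second = bytes(page.data[half:])   # device transfers the second half
    disk[block_no] = first + second
    page.locked = False


def flush(page, disk, block_no, racing_store=None):
    """Keep writing until the page stays clean; return the number of IOs."""
    ios = 0
    while page.dirty:
        write_out(page, disk, block_no, racing_store)
        racing_store = None            # only the first IO is raced
        ios += 1
    return ios
```

In this model the first IO may put stale or mixed contents on the toy
disk, but the page is left dirty, so the following IO writes the final
contents.  Had the flag been cleared after the IO instead, the racing
store would have been lost.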

--Stephen

* Re: (reiserfs) Re: More on Re: (reiserfs) Reiserfs and ext2fs (was Re: (reiserfs) Sum Benchmarks (these look typical?))
  1998-06-30 16:10                   ` Stephen C. Tweedie
@ 1998-07-01  0:17                     ` Eric W. Biederman
  1998-07-01  9:12                       ` Stephen C. Tweedie
  0 siblings, 1 reply; 11+ messages in thread
From: Eric W. Biederman @ 1998-07-01  0:17 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Hans Reiser, Shawn Leas, Reiserfs, Ken Tetrick, linux-mm

>>>>> "ST" == Stephen C Tweedie <sct@dcs.ed.ac.uk> writes:

ST> Hi,
ST> On 29 Jun 1998 14:59:37 -0500, ebiederm+eric@npwt.net (Eric
ST> W. Biederman) said:

>> There are two problems I see.  

>> 1) A DMA controller actively accessing the same memory the CPU is
>> accessing could be a problem.  Recall video flicker on old video
>> cards.

ST> Shouldn't be a problem.

When either I trace through the code, or a hardware guy convinces me,
that it is safe to both write to a page, and do DMA from a page
simultaneously I'll believe it.

>> 2) More importantly, the CPU writes to the _cache_, and the DMA
>> controller reads from the RAM.  I don't see any consistency guarantees
>> there.  We may be able to solve these problems on a per-architecture or
>> per-device basis, however.

ST> Again, not important.  If we ever modify a page which is already being
ST> written out to a device, then we mark that page dirty.  On write, we
ST> mark it clean (but locked) _before_ starting the IO, not after.  So, if
ST> there is ever an overlap of a filesystem/mmap write with an IO to disk,
ST> we will always schedule another IO later to clean the re-dirtied
ST> buffers.

Duh.  I wonder what I was thinking...

Anyhow I've implemented the conservative version.  The only
change needed is to change from unmapping pages to removing the dirty
bit, and the basic code stands. 

The most important change needed would be to tell unuse_page it can't
remove a locked page from the page cache.  Either that or I need to
worry about incrementing the count for page writes, which wouldn't be
a bad idea either.
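
Both options can be pictured with a toy Python model.  Everything here
is an assumption made for illustration: the PageCache class and its
begin_write(), end_write() and try_to_evict() methods are hypothetical
and do not mirror the real unuse_page code.  The sketch simply refuses
to drop a page that is locked or that has an extra reference taken for
an in-flight write.

```python
class Page:
    def __init__(self):
        self.count = 1       # the page cache's own reference
        self.locked = False


class PageCache:
    def __init__(self):
        self.pages = {}      # key -> Page

    def add(self, key):
        self.pages[key] = Page()
        return self.pages[key]

    def begin_write(self, key):
        page = self.pages[key]
        page.locked = True
        page.count += 1      # pin the page for the duration of the write
        return page

    def end_write(self, key):
        page = self.pages[key]
        page.locked = False
        page.count -= 1

    def try_to_evict(self, key):
        page = self.pages[key]
        # Either check alone would be enough to protect an in-flight write.
        if page.locked or page.count > 1:
            return False
        del self.pages[key]
        return True
```

While a write is in flight try_to_evict() returns False; once
end_write() has run, the same call succeeds and the page leaves the
cache.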

Eric

* Re: (reiserfs) Re: More on Re: (reiserfs) Reiserfs and ext2fs (was Re: (reiserfs) Sum Benchmarks (these look typical?))
  1998-07-01  0:17                     ` Eric W. Biederman
@ 1998-07-01  9:12                       ` Stephen C. Tweedie
  1998-07-01 12:45                         ` Eric W. Biederman
  1998-07-01 13:11                         ` Eric W. Biederman
  0 siblings, 2 replies; 11+ messages in thread
From: Stephen C. Tweedie @ 1998-07-01  9:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Stephen C. Tweedie, Hans Reiser, Shawn Leas, Reiserfs,
	Ken Tetrick, linux-mm

Hi,

On 30 Jun 1998 19:17:15 -0500, ebiederm+eric@npwt.net (Eric
W. Biederman) said:

> When either I trace through the code, or a hardware guy convinces me,
> that it is safe to both write to a page, and do DMA from a page
> simultaneously I'll believe it.

Read the source code!  We already do this.  If one process or thread
msync()s a mapped file, its dirty pages get written to disk,
independently of any other processes on the same or other CPUs which
may still have the pages mapped and may still be writing to them.  We
don't unmap pages for write; we just mark them non-dirty around all
ptes.

--Stephen

* Re: (reiserfs) Re: More on Re: (reiserfs) Reiserfs and ext2fs (was Re: (reiserfs) Sum Benchmarks (these look typical?))
  1998-07-01  9:12                       ` Stephen C. Tweedie
@ 1998-07-01 12:45                         ` Eric W. Biederman
  1998-07-01 13:11                         ` Eric W. Biederman
  1 sibling, 0 replies; 11+ messages in thread
From: Eric W. Biederman @ 1998-07-01 12:45 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Hans Reiser, Shawn Leas, Reiserfs, Ken Tetrick, linux-mm

>>>>> "ST" == Stephen C Tweedie <sct@redhat.com> writes:

ST> Hi,
ST> On 30 Jun 1998 19:17:15 -0500, ebiederm+eric@npwt.net (Eric
ST> W. Biederman) said:

>> When either I trace through the code, or a hardware guy convinces me,
>> that it is safe to both write to a page, and do DMA from a page
>> simultaneously I'll believe it.

ST> Read the source code!  We already do this.  If one process or thread
ST> msync()s a mapped file, its dirty pages get written to disk,
ST> independently of any other processes on the same or other CPUs which
ST> may still have the pages mapped and may still be writing to them.  We
ST> don't unmap pages for write; we just mark them non-dirty around all
ST> ptes.

Which is fine, but it still (currently) gets copied to the buffer cache.
As the buffer cache leaves the picture...

Eric

* Re: (reiserfs) Re: More on Re: (reiserfs) Reiserfs and ext2fs (was Re: (reiserfs) Sum Benchmarks (these look typical?))
  1998-07-01  9:12                       ` Stephen C. Tweedie
  1998-07-01 12:45                         ` Eric W. Biederman
@ 1998-07-01 13:11                         ` Eric W. Biederman
  1998-07-01 20:07                           ` Stephen C. Tweedie
  1 sibling, 1 reply; 11+ messages in thread
From: Eric W. Biederman @ 1998-07-01 13:11 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Hans Reiser, Shawn Leas, Reiserfs, Ken Tetrick, linux-mm

>>>>> "ST" == Stephen C Tweedie <sct@redhat.com> writes:

ST> Hi,
ST> On 30 Jun 1998 19:17:15 -0500, ebiederm+eric@npwt.net (Eric
ST> W. Biederman) said:

>> When either I trace through the code, or a hardware guy convinces me,
>> that it is safe to both write to a page, and do DMA from a page
>> simultaneously I'll believe it.

ST> Read the source code!  We already do this.  If one process or thread
ST> msync()s a mapped file, its dirty pages get written to disk,
ST> independently of any other processes on the same or other CPUs which
ST> may still have the pages mapped and may still be writing to them.  We
ST> don't unmap pages for write; we just mark them non-dirty around all
ST> ptes.

I just took the time and looked.  

And in buffer.c in get_hash_table, if we are returning a locked buffer,
we always wait on that buffer until it is unlocked.  So to date I
don't see us tempting fate by writing to locked buffers.

It may be harmless, but I haven't seen that yet.
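
The behaviour described above can be pictured with a small Python
sketch.  It is a user-space analogy only: the Buffer class, its
threading.Event and the lookup() function are hypothetical stand-ins,
not the code in buffer.c.  The lookup blocks until the buffer's IO has
completed, so the caller never touches a locked buffer.

```python
import threading


class Buffer:
    """Toy buffer whose lock is released when its IO completes."""

    def __init__(self):
        self._unlocked = threading.Event()
        self._unlocked.set()          # buffers start out unlocked

    def lock(self):
        self._unlocked.clear()

    def unlock(self):
        self._unlocked.set()

    def wait_on_buffer(self):
        self._unlocked.wait()         # sleep until the IO has finished

    @property
    def locked(self):
        return not self._unlocked.is_set()


def lookup(hash_table, block_no):
    """Return the cached buffer, waiting first if it is locked for IO."""
    buf = hash_table.get(block_no)
    if buf is not None:
        buf.wait_on_buffer()
    return buf
```

A caller that looks up a buffer while another thread is still
completing its IO simply sleeps in lookup() and gets the buffer back
only after unlock() has been called.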

Eric

* Re: (reiserfs) Re: More on Re: (reiserfs) Reiserfs and ext2fs (was Re: (reiserfs) Sum Benchmarks (these look typical?))
  1998-07-01 13:11                         ` Eric W. Biederman
@ 1998-07-01 20:07                           ` Stephen C. Tweedie
  1998-07-02 15:17                             ` Eric W. Biederman
  0 siblings, 1 reply; 11+ messages in thread
From: Stephen C. Tweedie @ 1998-07-01 20:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Stephen C. Tweedie, Hans Reiser, Shawn Leas, Reiserfs,
	Ken Tetrick, linux-mm

Hi,

On 01 Jul 1998 08:11:46 -0500, ebiederm+eric@npwt.net (Eric
W. Biederman) said:

ST> Read the source code!  We already do this.  If one process or thread
ST> msync()s a mapped file, its dirty pages get written to disk,
ST> independently of any other processes on the same or other CPUs which
ST> may still have the pages mapped and may still be writing to them.  We
ST> don't unmap pages for write; we just mark them non-dirty around all
ST> ptes.

> I just took the time and looked.  

> And in buffer.c in get_hash_table, if we are returning a locked buffer,
> we always wait on that buffer until it is unlocked.  So to date I
> don't see us tempting fate by writing to locked buffers.

Whoops, yes, we do currently do copies for msync().  It's been too long
since I was digging in that code...

--Stephen

* Re: (reiserfs) Re: More on Re: (reiserfs) Reiserfs and ext2fs (was Re: (reiserfs) Sum Benchmarks (these look typical?))
  1998-07-01 20:07                           ` Stephen C. Tweedie
@ 1998-07-02 15:17                             ` Eric W. Biederman
  0 siblings, 0 replies; 11+ messages in thread
From: Eric W. Biederman @ 1998-07-02 15:17 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Hans Reiser, Shawn Leas, Reiserfs, Ken Tetrick, linux-mm

>>>>> "ST" == Stephen C Tweedie <sct@redhat.com> writes:
>> I just took the time and looked.  

>> And in buffer.c in get_hash_table, if we are returning a locked buffer,
>> we always wait on that buffer until it is unlocked.  So to date I
>> don't see us tempting fate by writing to locked buffers.

ST> Whoops, yes, we do currently do copies for msync().  It's been too long
ST> since I was digging in that code...

Well, I asked on linux-kernel and talked a little bit about this with
Alan Cox.  He figures that if we try to stop something like DMA halfway
through we are in trouble, but otherwise we should be OK.

So for the next round I'll implement the cheap trick of clearing the
dirty bit in the page tables.

Eric

end of thread, other threads:[~1998-07-02 15:36 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <Pine.HPP.3.96.980617035608.29950A-100000@ixion.honeywell.com>
     [not found] ` <199806221138.MAA00852@dax.dcs.ed.ac.uk>
     [not found]   ` <358F4FBE.821B333C@ricochet.net>
     [not found]     ` <m11zsgrvnf.fsf@flinx.npwt.net>
     [not found]       ` <199806241154.MAA03544@dax.dcs.ed.ac.uk>
     [not found]         ` <m11zse6ecw.fsf@flinx.npwt.net>
1998-06-25 11:00           ` (reiserfs) Re: More on Re: (reiserfs) Reiserfs and ext2fs (was Re: (reiserfs) Sum Benchmarks (these look typical?)) Stephen C. Tweedie
1998-06-26 15:56             ` Eric W. Biederman
1998-06-29 10:35               ` Stephen C. Tweedie
1998-06-29 19:59                 ` Eric W. Biederman
1998-06-30 16:10                   ` Stephen C. Tweedie
1998-07-01  0:17                     ` Eric W. Biederman
1998-07-01  9:12                       ` Stephen C. Tweedie
1998-07-01 12:45                         ` Eric W. Biederman
1998-07-01 13:11                         ` Eric W. Biederman
1998-07-01 20:07                           ` Stephen C. Tweedie
1998-07-02 15:17                             ` Eric W. Biederman
