RFC: Re: journal ports for 2.3?

* RFC: Re: journal ports for 2.3?
       [not found] <000c01bf472c$8ad8cb60$8edb1581@isc.rit.edu>
@ 1999-12-21  0:24 ` Stephen C. Tweedie
  1999-12-21 10:18   ` Andrea Arcangeli
  0 siblings, 1 reply; 34+ messages in thread
From: Stephen C. Tweedie @ 1999-12-21  0:24 UTC (permalink / raw)
  To: Chris Mason
  Cc: sct, reiserfs, linux-fsdevel, linux-mm, Andrea Arcangeli,
	Ingo Molnar, Linus Torvalds

Hi,

All comments welcome: this is a first draft outline of what I _think_
Linus is asking for from journaling for mainline kernels.

On Wed, 15 Dec 1999 13:45:22 -0500, Chris Mason
<clmsys@osfmail.isc.rit.edu> said:

> What is your current plan for porting ext3 into 2.3/2.4?  Are you still
> going to be buffer cache based, or do you plan on moving every thing into
> the page cache?

For 2.4 the first release will probably still be in the buffer cache,
but I'm resigned to the fact that Linus won't accept it for a final
merge until it uses an alternative method.

I'd like to talk to you about that if possible.  Right now, it looks
as if the following is the absolute minimum required to make ext3,
reiserfs and any unknown future journaled fs'es work properly in 2.3:

  * Add an extra "async" parameter to super_operations->write_super()
    to distinguish between bdflush and sync()

  * Clean up the rules for allowing the raid5 code to snoop the buffer
    cache: raid5 should consider a buffer locked and transient if it
    has b_count raised

  * The raid resync code needs to be atomic wrt. ll_rw_block()

  * Whatever caching mechanism we use --- page cache or something else
    --- we *must* allow the VM to make callbacks into the filesystem
    to indicate memory pressure.  There are two cases: first, when
    memory gets short, we need to be able to request flush-from-memory
    (including clean pages); secondly, if we detect too many dirty
    buffers, we need to be able to request flush-to-disk (without
    necessarily reclaiming memory, but causing a stall on the calling
    process to act as a throttle on heavy write traffic).

    For the out-of-memory pressure, ideally all we need is a callback on
    the page->mapping address_space.  We have one address space per
    inode, so adding a struct as_operations to the address_space would
    only grow our tables by one pointer per inode, not one pointer per
    page.

    Shrink_mmap() can easily use such a pointer to perform any
    filesystem-specific tearing-down of the page.

    
    The second case is a little more tricky: currently the only
    mechanism we have for write throttling under heavy write load is the
    refile_buffer() checks in buffer.c.  Ideally there should be a
    system-wide upper bound on dirty data: if each different filesystem
    starts to throttle writes at 50% of physical memory then you only
    need two different filesystems to overcommit your memory badly.

    A PG_Dirty flag, a global counter of dirty pages and a system-wide
    dirty memory threshold would be enough to allow ext3 and reiserfs to
    perform their own write throttling in a way which wouldn't fall
    apart if both ext3 and reiserfs were present in the system at the
    same time.  Making the refile_buffer() checks honour that global
    threshold would be trivial.  

    The PG_Dirty flag would also allow for VM callbacks to be made to
    the filesystems if it was determined that we needed the dirty memory
    pages for some other use (as already happens in the buffer cache if
    try_to_free_buffers fails and wakes up bdflush).  Such a callback
    should also be triggered off the address_space.
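
The out-of-memory callback described above could look roughly like the
following userspace sketch. It is only an illustration under stated
assumptions: the callback name release_page, the helper shrink_one and
the demo_* names are hypothetical and are not existing kernel
interfaces; only address_space, as_operations and shrink_mmap() come
from the proposal itself.

```c
#include <stddef.h>

struct page;

/* Hypothetical callback table: one pointer per address_space (per inode). */
struct as_operations {
    /* Return 1 if the filesystem let go of the page, 0 if it must stay. */
    int (*release_page)(struct page *page);
};

struct address_space {
    const struct as_operations *a_ops;  /* NULL: no filesystem-specific state */
};

struct page {
    struct address_space *mapping;
    int fs_private_refs;                /* stand-in for filesystem-owned state */
};

/* What a shrink_mmap()-like scanner might do for one candidate page. */
static int shrink_one(struct page *page)
{
    struct address_space *mapping = page->mapping;

    if (mapping && mapping->a_ops && mapping->a_ops->release_page)
        return mapping->a_ops->release_page(page);
    return 1;   /* nothing filesystem-specific to tear down */
}

/* Example filesystem callback: refuse while the page is still pinned. */
static int demo_release_page(struct page *page)
{
    return page->fs_private_refs == 0;
}

static const struct as_operations demo_ops = { demo_release_page };
```

The cost matches the argument in the text: the page itself grows by
nothing, and each address_space carries a single extra pointer.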

There are lots of other things which would be useful to journaling, such
as the ll_rw_block-level write ordering enforcement and write barrier,
but the above is really the minimum necessary to actually get the things
to _work_ without intruding into the buffer cache and without destroying
the system's performance if journaled transactions are allowed to grow
without VM back-pressure.


Cheers,
 Stephen
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.nl.linux.org/Linux-MM/


* Re: RFC: Re: journal ports for 2.3?
  1999-12-21  0:24 ` RFC: Re: journal ports for 2.3? Stephen C. Tweedie
@ 1999-12-21 10:18   ` Andrea Arcangeli
  1999-12-21 13:21     ` (reiserfs) " Stephen C. Tweedie
  1999-12-22  1:21     ` Benjamin C.R. LaHaise
  0 siblings, 2 replies; 34+ messages in thread
From: Andrea Arcangeli @ 1999-12-21 10:18 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Chris Mason, reiserfs, linux-fsdevel, linux-mm, Ingo Molnar,
	Linus Torvalds

On Tue, 21 Dec 1999, Stephen C. Tweedie wrote:

>    refile_buffer() checks in buffer.c.  Ideally there should be a
>    system-wide upper bound on dirty data: if each different filesystem
>    starts to throttle writes at 50% of physical memory then you only
>    need two different filesystems to overcommit your memory badly.

If all FSes share the dirty list of buffer.c that's not true. All normal
filesystems are using the mark_buffer_dirty() in buffer.c so currently the
40% setting of bdflush is a system-wide number and not a per-fs number.

>    same time.  Making the refile_buffer() checks honour that global
>    threshold would be trivial.  

If both ext3 and reiserfs are using refile_buffer and both are using
balance_dirty in the right places as Linus wants, all seems just fine to
me.

Since 2.3.10 (or thereabouts) I have disagreed with mark_buffer_dirty() not
including the balance_dirty() check (and I already provided patches to fix
that some months ago, IIRC). Last time I checked, ext2 was harmed by this,
and we'll have to add the proper balance_dirty() call in the ext2 mknod path
and check the rest.

I completely agree with changing mark_buffer_dirty() to call balance_dirty()
before returning. But if you add the balance_dirty() calls in all the
right places, all should be _just_ fine as far as I can tell.

Andrea


* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  1999-12-21 10:18   ` Andrea Arcangeli
@ 1999-12-21 13:21     ` Stephen C. Tweedie
  1999-12-21 13:57       ` Andrea Arcangeli
  1999-12-22 23:37       ` Hans Reiser
  1999-12-22  1:21     ` Benjamin C.R. LaHaise
  1 sibling, 2 replies; 34+ messages in thread
From: Stephen C. Tweedie @ 1999-12-21 13:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Chris Mason, reiserfs, linux-fsdevel,
	linux-mm, Ingo Molnar, Linus Torvalds

Hi,

On Tue, 21 Dec 1999 11:18:03 +0100 (CET), Andrea Arcangeli
<andrea@suse.de> said:

> On Tue, 21 Dec 1999, Stephen C. Tweedie wrote:
>> refile_buffer() checks in buffer.c.  Ideally there should be a
>> system-wide upper bound on dirty data: if each different filesystem
>> starts to throttle writes at 50% of physical memory then you only
>> need two different filesystems to overcommit your memory badly.

> If all FSes share the dirty list of buffer.c that's not true.

The entire point of this is that Linus has refused, point blank, to
add the complexity of journaling to the buffer cache.  The journaling
_has_ to be done independently, so we _have_ to have the dirty data
for journal transactions kept outside of the buffer cache.

We cannot use the buffer.c dirty list anyway because bdflush can write
those buffers to disk at any time.  Transactions have to control the
write ordering so we can only feed those writes into the buffer queues
under strict control when we go to commit a transaction.  

> All normal filesystems are using the mark_buffer_dirty() in buffer.c

We're not talking about normal filesystems. :)

> so currently the 40% setting of bdflush is a system-wide number and
> not a per-fs number.

For filesystems that can use that mechanism, sure.  We need to be able
to extend that mechanism so that filesystems with other writeback
mechanisms can use it too.

> If both ext3 and reiserfs are using refile_buffer and both are using
> balance_dirty in the right places as Linus wants, all seems just fine to
> me.

They aren't and they can't.

> I completely agree to change mark_buffer_dirty() to call balance_dirty()
> before returning. 

Agreed.

--Stephen

* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  1999-12-21 13:21     ` (reiserfs) " Stephen C. Tweedie
@ 1999-12-21 13:57       ` Andrea Arcangeli
  1999-12-22  0:28         ` Stephen C. Tweedie
  1999-12-22 23:37       ` Hans Reiser
  1 sibling, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 1999-12-21 13:57 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Chris Mason, reiserfs, linux-fsdevel, linux-mm, Ingo Molnar,
	Linus Torvalds

On Tue, 21 Dec 1999, Stephen C. Tweedie wrote:

>We cannot use the buffer.c dirty list anyway because bdflush can write
>those buffers to disk at any time.  Transactions have to control the

So you are talking about replacing this line:

	dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;

with:

	dirty = (size_buffers_type[BUF_DIRTY]+size_buffers_type[BUF_PINNED]) >> PAGE_SHIFT;

If you don't do that, you don't need _two_ filesystems to generate too many
dirty buffers: you can potentially go OOM with only one journaling
filesystem running. As you talked about a _two_-filesystem case generating
dirty buffers on 100% of memory, I thought you were talking about something
very different from the above one-liner. If that is what you were talking
about, that's fine and I agree of course.
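
To make the difference between the two lines concrete, here is a small
userspace sketch. The array and the BUF_DIRTY/BUF_PINNED names mirror
the lines quoted above, but the enum values, the PAGE_SHIFT of 12 and
the over_limit() helper are assumptions made for illustration, not the
kernel's definitions.

```c
#define PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* Stand-in list indices; the real values in buffer.c may differ. */
enum { BUF_CLEAN, BUF_DIRTY, BUF_PINNED, NR_LIST };

/* Bytes of buffer memory on each list (stand-in for the kernel array). */
static unsigned long size_buffers_type[NR_LIST];

/* Dirty pages when only BUF_DIRTY is counted. */
static unsigned long dirty_pages_old(void)
{
    return size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
}

/* Dirty pages when pinned (journaled) buffers are counted as well. */
static unsigned long dirty_pages_new(void)
{
    return (size_buffers_type[BUF_DIRTY] +
            size_buffers_type[BUF_PINNED]) >> PAGE_SHIFT;
}

/* Hypothetical helper: is `dirty` above `pct` percent of `total_pages`? */
static int over_limit(unsigned long dirty, unsigned long total_pages,
                      unsigned int pct)
{
    return dirty * 100 > total_pages * pct;
}
```

With 30 dirty and 20 pinned pages out of 100, the first form stays under
a 40% limit while the second exceeds it, which is the single-filesystem
OOM risk described above.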

>We're not talking about normal filesystems. :)

By "normal" filesystems I meant filesystems that are _using_
linux/fs/buffer.c.

Andrea


* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  1999-12-21 13:57       ` Andrea Arcangeli
@ 1999-12-22  0:28         ` Stephen C. Tweedie
  1999-12-23 11:51           ` Hans Reiser
  0 siblings, 1 reply; 34+ messages in thread
From: Stephen C. Tweedie @ 1999-12-22  0:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Chris Mason, reiserfs, linux-fsdevel,
	linux-mm, Ingo Molnar, Linus Torvalds

Hi,

On Tue, 21 Dec 1999 14:57:29 +0100 (CET), Andrea Arcangeli
<andrea@suse.de> said:

> So you are talking about replacing this line:
> 	dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
> with:
> 	dirty = (size_buffers_type[BUF_DIRTY]+size_buffers_type[BUF_PINNED]) >> PAGE_SHIFT;

Basically yes, but I was envisaging something slightly different from
the above.

There may well be data which is simply not in the buffer cache at all
but which needs to be accounted for as pinned memory.  A good example
would be if some filesystem wants to implement deferred allocation of
disk blocks: the corresponding pages in the page cache obviously cannot
be flushed to disk without generating extra filesystem activity for the
allocation of disk blocks to pages.  The pages must therefore be pinned,
but as they don't yet have disk mappings we can't assume that they are
in the buffer cache.

So we really need a pinned page threshold which can apply to general
pages, not necessarily to the buffer cache.

There's another issue, though.  BUF_DIRTY buffers do not necessarily
count as pinned in this context: they can always be flushed to disk
without generating any significant new memory allocation pressure.  We
still need to do write-throttling, so we need a threshold on dirty data
for that reason.  However, deferred allocation and transactions actually
have a more subtle and nastier property: you cannot necessarily flush
the pages from memory without first allocating more memory.

In the transaction case this is because you have to allow transactions
which are already in progress to complete before you can commit the
transaction (you cannot commit incomplete transactions because that
would defeat the entire point of a transactional system!).  In the case
of deferred disk block allocation, the problem is that flushing the
dirty data requires extra filesystem operations as we allocate disk
blocks to pages.

In these cases we need to be able to make sure that not only does pinned
memory never exceed a threshold, we also have to ensure that the
*future* allocations required to flush the existing allocated memory can
also be satisfied.  We need to allow filesystems to "reserve" such extra
memory, and we need a system-wide threshold on all such reservations.

The ext3 journaling code already has support for reservations, but
that's currently a per-filesystem parameter.  We still have need for a
global VM reservation to prevent memory starvation if multiple different
filesystems have this behaviour.

Note that what we need here isn't complex: it's no more than exporting
atomic_t counts of the number of dirty and reserved pages in the system
and supporting a maximum threshold on these values via /proc.  The
mechanism for observing these limits can be local to each filesystem, as
long as there is an agreed counter in the VM where they can register
their use of memory.
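
A minimal userspace sketch of such shared counters, using C11 atomics in
place of the kernel's atomic_t. Everything named here (nr_dirty_pages,
nr_reserved_pages, max_pinned_pages, the vm_* helpers) is hypothetical,
and applying one combined limit to dirty plus reserved pages is an
assumption; separate limits would work the same way.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical system-wide counters, in pages. */
static atomic_long nr_dirty_pages;
static atomic_long nr_reserved_pages;

/* Hypothetical tunable; in the proposal it would be set via /proc. */
static long max_pinned_pages = 1000;

/*
 * Try to reserve `nr` pages for the future allocations needed to flush
 * existing data. Fails if dirty + reserved would exceed the limit.
 * The dirty count is only sampled, so the check is approximate when
 * other threads are dirtying pages at the same time.
 */
static bool vm_reserve_pages(long nr)
{
    long old = atomic_load(&nr_reserved_pages);

    for (;;) {
        long dirty = atomic_load(&nr_dirty_pages);

        if (dirty + old + nr > max_pinned_pages)
            return false;
        if (atomic_compare_exchange_weak(&nr_reserved_pages, &old, old + nr))
            return true;
        /* `old` was refreshed by the failed compare-exchange; retry. */
    }
}

static void vm_unreserve_pages(long nr)
{
    atomic_fetch_sub(&nr_reserved_pages, nr);
}

static void vm_account_dirty(long nr)
{
    atomic_fetch_add(&nr_dirty_pages, nr);
}
```

Each filesystem would keep its own policy for what to do when a
reservation fails (commit a transaction, stall the writer, and so on);
only the counters are shared.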

--Stephen

* Re: RFC: Re: journal ports for 2.3?
  1999-12-21 10:18   ` Andrea Arcangeli
  1999-12-21 13:21     ` (reiserfs) " Stephen C. Tweedie
@ 1999-12-22  1:21     ` Benjamin C.R. LaHaise
  1999-12-22 22:19       ` Stephen C. Tweedie
                         ` (2 more replies)
  1 sibling, 3 replies; 34+ messages in thread
From: Benjamin C.R. LaHaise @ 1999-12-22  1:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Chris Mason, reiserfs, linux-fsdevel,
	linux-mm, Ingo Molnar, Linus Torvalds

On Tue, 21 Dec 1999, Andrea Arcangeli wrote:

> On Tue, 21 Dec 1999, Stephen C. Tweedie wrote:
> 
> >    refile_buffer() checks in buffer.c.  Ideally there should be a
> >    system-wide upper bound on dirty data: if each different filesystem
> >    starts to throttle writes at 50% of physical memory then you only
> >    need two different filesystems to overcommit your memory badly.
> 
> If all FSes share the dirty list of buffer.c that's not true. All normal
> filesystems are using the mark_buffer_dirty() in buffer.c so currently the
> 40% setting of bdflush is a system-wide number and not a per-fs number.

The buffer dirty lists are the wrong place to be dealing with this.  We
need a lightweight, fast way of monitoring the system's dirty buffer/page
thresholds -- one that can be called for every write to a page or on the
write faults for cow pages.

> >    same time.  Making the refile_buffer() checks honour that global
> >    threshold would be trivial.  
> 
> If both ext3 and reiserfs are using refile_buffer and both are using
> balance_dirty in the right places as Linus wants, all seems just fine to
> me.
> 
> I disagree since 2.3.10 (or similar) about mark_buffer_dirty not including
> the balance_dirty() check (and I just provided patches to fix that some
> month ago IIRC). Last time I checked ext2 was harmed by this, and we'll
> have to add the proper balance_dirty() in the ext2 mknod path and check
> the rest.

> I completely agree to change mark_buffer_dirty() to call balance_dirty()
> before returning. But if you add the balance_dirty() calls all over the
> right places all should be _just_ fine as far I can tell.

I don't agree, both for the reasons above and because doing a
balance_dirty in mark_buffer_dirty tends to result in stalls in the
*wrong* place, because it tends to stall in the middle of an operation,
not before it has begun.  You end up stalling on metadata operations that
shouldn't stall.  The stall thresholds for data vs metadata have to be
different in order to make the system 'feel' right.  This is easily
accomplished by trying to "allocate" the dirty buffers before you actually
dirty them (by checking if there's enough slack in the dirty buffer
margins before doing the operation).
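
The "allocate before you dirty" idea can be sketched as follows. The
limits, the dirty_kind enum and the dirty_reserve()/dirty_release()
helpers are hypothetical, and the sketch is single-threaded for brevity;
a real version would need atomic counters or locking.

```c
#include <stdbool.h>

enum dirty_kind { DIRTY_DATA, DIRTY_METADATA };

/* Hypothetical limits in pages: metadata may dip further into the margin. */
static long dirty_limit[2] = { 400, 500 };
static long dirty_in_use;   /* pages already dirtied or reserved */

/*
 * Called before an operation begins. A false return means the caller
 * should stall or flush first, so the stall happens before the
 * operation starts rather than in the middle of it.
 */
static bool dirty_reserve(enum dirty_kind kind, long nr)
{
    if (dirty_in_use + nr > dirty_limit[kind])
        return false;
    dirty_in_use += nr;
    return true;
}

/* Called once the buffers have been written back. */
static void dirty_release(long nr)
{
    dirty_in_use -= nr;
}
```

Because the data limit is lower than the metadata limit, a heavy data
writer is throttled first while metadata operations can still proceed.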

		-ben


* Re: RFC: Re: journal ports for 2.3?
  1999-12-22  1:21     ` Benjamin C.R. LaHaise
@ 1999-12-22 22:19       ` Stephen C. Tweedie
  1999-12-22 22:41         ` (reiserfs) " Tan Pong Heng
  1999-12-23 12:02       ` Hans Reiser
  1999-12-27 16:31       ` Andrea Arcangeli
  2 siblings, 1 reply; 34+ messages in thread
From: Stephen C. Tweedie @ 1999-12-22 22:19 UTC (permalink / raw)
  To: Benjamin C.R. LaHaise
  Cc: Andrea Arcangeli, Stephen C. Tweedie, Chris Mason, reiserfs,
	linux-fsdevel, linux-mm, Ingo Molnar, Linus Torvalds

Hi,

On Tue, 21 Dec 1999 20:21:05 -0500 (EST), "Benjamin C.R. LaHaise"
<blah@kvack.org> said:

> The buffer dirty lists are the wrong place to be dealing with this.  We
> need a lightweight, fast way of monitoring the system's dirty buffer/page
> thresholds -- one that can be called for every write to a page or on the
> write faults for cow pages.

Precisely.  The only thing that the core VM needs to export is an atomic
counter for such pages, a wait queue so that processes can wait for
pages to be cleaned, and a function to be called to try to reclaim such
pages.  

Remember, though, that we have three different types of page we need to
deal with.  There are simple used pages, which we need to reclaim in a
component-independent manner when we are using too much memory; then
there are dirty pages which can be flushed to disk at any time; then
there are reserved pages which cannot be flushed to disk without some
extra work.

The first case is simple: we already have the wait queues and reclaim
functions in place, and all we need is an address_space callback to
allow filesystem-specific caches to return pages when shrink_mmap()
wants them.

In the second case (dirty pages), bdflush already does some of the work,
but we need a more generic solution if we want to support dirty data
which is not stored in buffer_heads in a portable manner.

The third case (reserved pages) is the case which doesn't affect any
current code but which will become really important for journaled or
deferred-allocation filesystems.  

--Stephen


* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  1999-12-22 22:19       ` Stephen C. Tweedie
@ 1999-12-22 22:41         ` Tan Pong Heng
  1999-12-23  3:27           ` William J. Earl
  2000-01-06 17:54           ` (reiserfs) Re: RFC: Re: journal ports for 2.3? Stephen C. Tweedie
  0 siblings, 2 replies; 34+ messages in thread
From: Tan Pong Heng @ 1999-12-22 22:41 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Benjamin C.R. LaHaise, Andrea Arcangeli, Chris Mason, reiserfs,
	linux-fsdevel, linux-mm, Ingo Molnar, Linus Torvalds

"Stephen C. Tweedie" wrote:

> Hi,
>
> On Tue, 21 Dec 1999 20:21:05 -0500 (EST), "Benjamin C.R. LaHaise"
> <blah@kvack.org> said:
>
> > The buffer dirty lists are the wrong place to be dealing with this.  We
> > need a lightweight, fast way of monitoring the system's dirty buffer/page
> > thresholds -- one that can be called for every write to a page or on the
> > write faults for cow pages.
>
> Precisely.  The only thing that the core VM needs to export is an atomic
> counter for such pages, a wait queue so that processes can wait for
> pages to be cleaned, and a function to be called to try to reclaim such
> pages.
>
> Remember, though, that we have three different types of page we need to
> deal with.  There are simple used pages, which we need to reclaim in a
> component-independent manner when we are using too much memory; then
> there are dirty pages which can be flushed to disk at any time; then
> there are reserved pages which cannot be flushed to disk without some
> extra work.
>
> The first case is simple: we already have the wait queues and reclaim
> functions in place, and all we need is an address_space callback to
> allow filesystem-specific caches to return pages when shrink_mmap()
> wants them.
>
> In the second case (dirty pages), bdflush already does some of the work,
> but we need a more generic solution if we want to support dirty data
> which is not stored in buffer_heads in a portable manner.
>
> The third case (reserved pages) is the case which doesn't affect any
> current code but which will become really important for journaled or
> deferred-allocation filesystems.
>
> --Stephen

Sorry for intruding, I have been monitoring this thread with interest.

I was thinking that, unless you want to have an FS-specific buffer/page cache,
there is always a gain from a unified cache for all filesystems. I think the
one piece of functionality missing from the 2.3 implementation is the
dependency between the various pages. If you could specify a tree of relations
between the various subsets of the buffers/pages, and the reclaim mechanism
honored it, everything should be fine. Filesystems that do not care about
ordering could simply ignore this capability, and the mechanism could assume
that everything is in one big set that can be reclaimed in any order.

I have not yet given the complexity of implementing such functionality
any thought. But it seems feasible, since you would need to do that
anyway for your FS....
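
One way to picture such a dependency relation is the sketch below. It is
a guess at the idea rather than a design from this thread: the page_set
structure, the must_follow link and the helpers are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: a group of pages that is written out as a unit. */
struct page_set {
    struct page_set *must_follow;  /* set that has to reach disk first, or NULL */
    bool written;
};

/* A set may go to disk once its prerequisite (if any) is already there. */
static bool can_write(const struct page_set *s)
{
    return !s->written &&
           (s->must_follow == NULL || s->must_follow->written);
}

/*
 * One pass of a reclaim loop: write every set whose prerequisite is
 * already on disk. Returns the number of sets written in this pass.
 */
static int writeback_pass(struct page_set **sets, size_t n)
{
    int done = 0;

    for (size_t i = 0; i < n; i++) {
        if (can_write(sets[i])) {
            sets[i]->written = true;   /* stand-in for real I/O */
            done++;
        }
    }
    return done;
}
```

A filesystem that does not care about ordering leaves must_follow NULL
everywhere, and every set becomes reclaimable in any order.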


* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  1999-12-21 13:21     ` (reiserfs) " Stephen C. Tweedie
  1999-12-21 13:57       ` Andrea Arcangeli
@ 1999-12-22 23:37       ` Hans Reiser
  2000-01-06 17:48         ` Stephen C. Tweedie
  1 sibling, 1 reply; 34+ messages in thread
From: Hans Reiser @ 1999-12-22 23:37 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrea Arcangeli, Chris Mason, reiserfs, linux-fsdevel, linux-mm,
	Ingo Molnar, Linus Torvalds

"Stephen C. Tweedie" wrote:

> Hi,
>
> On Tue, 21 Dec 1999 11:18:03 +0100 (CET), Andrea Arcangeli
> <andrea@suse.de> said:
>
> > On Tue, 21 Dec 1999, Stephen C. Tweedie wrote:
> >> refile_buffer() checks in buffer.c.  Ideally there should be a
> >> system-wide upper bound on dirty data: if each different filesystem
> >> starts to throttle writes at 50% of physical memory then you only
> >> need two different filesystems to overcommit your memory badly.
>
> > If all FSes shares the dirty list of buffer.c that's not true.

Stephen's global counter really would make things simpler to code.  I would also
like to see each filesystem able to specify a minimum amount it wants reserved
as clean pages, and have a global minimum that is the sum of all of these
amounts for all mounted filesystems.

>
>
> The entire point of this is that Linus has refused, point blank, to
> add the complexity of journaling to the buffer cache.  The journaling
> _has_ to be done independently, so we _have_ to have the dirty data
> for journal transactions kept outside of the buffer cache.
>
> We cannot use the buffer.c dirty list anyway because bdflush can write
> those buffers to disk at any time.  Transactions have to control the
> write ordering so we can only feed those writes into the buffer queues
> under strict control when we go to commit a transaction.
>
> > All normal filesystems are using the mark_buffer_dirty() in buffer.c
>
> We're not talking about normal filesystems. :)
>
> > so currently the 40% setting of bdflush is a system-wide number and
> > not a per-fs number.
>
> For filesystems that can use that mechanism, sure.  We need to be able
> to extend that mechanism so that filesystems with other writeback
> mechanisms can use it too.
>
> > If both ext3 and reiserfs are using refile_buffer and both are using
> > balance_dirty in the right places as Linus wants, all seems just fine to
> > me.
>
> They aren't and they can't.
>
> > I completely agree to change mark_buffer_dirty() to call balance_dirty()
> > before returning.
>
> Agreed.

How can we use a mark_buffer_dirty that calls balance_dirty in a place where we
cannot call balance_dirty?

>
>
> --Stephen

--
Get Linux (http://www.kernel.org) plus ReiserFS
 (http://devlinux.org/namesys).  If you sell an OS or
internet appliance, buy a port of ReiserFS!  If you
need customizations and industrial grade support, we sell them.




* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  1999-12-22 22:41         ` (reiserfs) " Tan Pong Heng
@ 1999-12-23  3:27           ` William J. Earl
  1999-12-23 15:36             ` Andrea Arcangeli
  2000-01-06 17:54           ` (reiserfs) Re: RFC: Re: journal ports for 2.3? Stephen C. Tweedie
  1 sibling, 1 reply; 34+ messages in thread
From: William J. Earl @ 1999-12-23  3:27 UTC (permalink / raw)
  To: Tan Pong Heng
  Cc: Stephen C. Tweedie, Benjamin C.R. LaHaise, Andrea Arcangeli,
	Chris Mason, reiserfs, linux-fsdevel, linux-mm, Ingo Molnar,
	Linus Torvalds

Tan Pong Heng writes:
...
 > I was thinking that, unless you want to have FS specific buffer/page cache,
 > there is always a gain for a unified cache for all fs. I think the one piece
 > of functionality missing from the 2.3 implementation is the dependency
 > between the various pages. If you could specify a tree relations between
 > the various subset of the buffer/page and the reclaim mechanism honor
 > that everything should be fine. For FS that does not care about ordering,
 > they could simply ignore this capability and the mechanism could assume
 > that everything is in one big set and could be reclaimed in any order.
...

      For the XFS port, we have been working on this, since XFS very much
wants to cluster logically adjacent delayed-allocation (and delayed-write) pages
together to optimize writes.  That is, if someone who wants to write
back a dirty page to disk asks the file system to do so, then the file
system wants to find all nearby pages (nearby in the file, not necessarily
in memory).   The file system looks up the extent in which the page resides,
or allocates an extent if the page is part of a delayed allocation, and
then writes all of the pages in the extent at once.  Given the present
data structures, this is done by probing the page cache for each page
in the extent.  If the page cache were indexed by a per-inode AVL tree
(or other ordered index), then collecting adjacent pages would be cheaper.
Compared to a disk I/O, hash table probes are still relatively low in cost,
but it would be possible to do a bit better with some ordered index.
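
The per-page probing described above can be sketched in userspace as
follows. The direct-mapped array merely stands in for the hashed page
cache, and all names (cached_page, probe, add_dirty, gather_extent) are
hypothetical; this is not the XFS port's code.

```c
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SLOTS 64

struct cached_page {
    unsigned long index;   /* page offset within the file */
    bool present;
    bool dirty;
};

/* Toy per-file cache: direct-mapped, standing in for the hash table. */
static struct cached_page cache[CACHE_SLOTS];

/* Look up one page by file offset; NULL if it is not cached. */
static struct cached_page *probe(unsigned long index)
{
    struct cached_page *p = &cache[index % CACHE_SLOTS];

    return (p->present && p->index == index) ? p : NULL;
}

static void add_dirty(unsigned long index)
{
    struct cached_page *p = &cache[index % CACHE_SLOTS];

    p->index = index;
    p->present = true;
    p->dirty = true;
}

/*
 * Collect the dirty pages of the extent [start, start + len) by probing
 * the cache once per page. Returns how many were gathered into `out`.
 */
static size_t gather_extent(unsigned long start, unsigned long len,
                            struct cached_page **out, size_t max)
{
    size_t n = 0;

    for (unsigned long i = start; i < start + len && n < max; i++) {
        struct cached_page *p = probe(i);

        if (p && p->dirty)
            out[n++] = p;
    }
    return n;
}
```

The loop costs one probe per page in the extent whether or not the page
is cached; an ordered per-inode index would let the walk skip directly
from one cached page to the next.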

* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  1999-12-22  0:28         ` Stephen C. Tweedie
@ 1999-12-23 11:51           ` Hans Reiser
  0 siblings, 0 replies; 34+ messages in thread
From: Hans Reiser @ 1999-12-23 11:51 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrea Arcangeli, Chris Mason, reiserfs, linux-fsdevel, linux-mm,
	Ingo Molnar, Linus Torvalds

Stephen's remarks seem right to me.

Hans



* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  1999-12-22  1:21     ` Benjamin C.R. LaHaise
  1999-12-22 22:19       ` Stephen C. Tweedie
@ 1999-12-23 12:02       ` Hans Reiser
  1999-12-23 15:49         ` Andrea Arcangeli
  1999-12-27 16:31       ` Andrea Arcangeli
  2 siblings, 1 reply; 34+ messages in thread
From: Hans Reiser @ 1999-12-23 12:02 UTC (permalink / raw)
  To: Benjamin C.R. LaHaise
  Cc: Andrea Arcangeli, Stephen C. Tweedie, Chris Mason, reiserfs,
	linux-fsdevel, linux-mm, Ingo Molnar, Linus Torvalds

"Benjamin C.R. LaHaise" wrote:

> > I completely agree to change mark_buffer_dirty() to call balance_dirty()
> > before returning. But if you add the balance_dirty() calls all over the
> > right places all should be _just_ fine as far I can tell.
>
> I don't agree, both for the reasons above and because doing a
> balance_dirty in mark_buffer_dirty tends to result in stalls in the
> *wrong* place, because it tends to stall in the middle of an operation,
> not before it has begun.  You end up stalling on metadata operations that
> shouldn't stall.  The stall thresholds for data vs metadata have to be
> different in order to make the system 'feel' right.  This is easily
> accomplished by trying to "allocate" the dirty buffers before you actually
> dirty them (by checking if there's enough slack in the dirty buffer
> margins before doing the operation).
>
>                 -ben

If reiserfs had good SMP, you could stall it anywhere, and the code could handle
that.  But we don't, and I bet others also don't, and we won't have it for some
time even though we are working on it.

Hans


* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  1999-12-23  3:27           ` William J. Earl
@ 1999-12-23 15:36             ` Andrea Arcangeli
  1999-12-24  5:53               ` afei
  1999-12-26  8:26               ` feiliu
  0 siblings, 2 replies; 34+ messages in thread
From: Andrea Arcangeli @ 1999-12-23 15:36 UTC (permalink / raw)
  To: William J. Earl
  Cc: Tan Pong Heng, Stephen C. Tweedie, Benjamin C.R. LaHaise,
	Chris Mason, reiserfs, linux-fsdevel, linux-mm, Ingo Molnar,
	Linus Torvalds

On Wed, 22 Dec 1999, William J. Earl wrote:

>in the extent.  If the page cache were indexed by a per-inode AVL tree

Some months ago I did some research into putting the pagecache into a
per-inode RB-tree. AVL would be overkill because insert/removal can be the
only operations done on the tree (with cache pollution going on).

Unfortunately, if the inode size gets very large the RB-tree won't scale
:(. With a hash you can say "ok, enlarge the hash 200mbyte and get rid of
the complexity paying with memory", while with an rbtree you always have to
pay O(log(N)) for each query/insert/removal... Chuck's bench
generated nice numbers with the pagecache in the per-inode RB though
(without considering your "ordering" needs of course).

The interesting code should be here (or nearby, just search for the
filename in the ftp area if it's not exactly there):

	ftp://ftp.suse.com/pub/people/andrea/kernel/2.2.6_andrea5.bz2

Andrea

* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  1999-12-23 12:02       ` Hans Reiser
@ 1999-12-23 15:49         ` Andrea Arcangeli
  1999-12-23 16:41           ` Hans Reiser
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 1999-12-23 15:49 UTC (permalink / raw)
  To: Hans Reiser
  Cc: Benjamin C.R. LaHaise, Stephen C. Tweedie, Chris Mason, reiserfs,
	linux-fsdevel, linux-mm, Ingo Molnar, Linus Torvalds

On Thu, 23 Dec 1999, Hans Reiser wrote:

>If reiserfs had good SMP, you could stall it anywhere, and the code
>could handle that.  But we don't, and I bet others also don't, and we
>won't have it for some time even though we are working on it.

I completely understand that we also need an atomic mark_buffer_dirty and
to call buffer_dirty from some other place.

But IMHO there's no good reason to break all the old rock-solid
filesystems like ext2 just because a new feature is needed.

I am not proposing to withhold a way to mark a buffer dirty atomically.
I propose only that we not change the semantics of the function called
`mark_buffer_dirty()', as has happened now.

If you want the atomic version, just call __mark_buffer_dirty() and use
balance_dirty() by hand as soon as you can (after releasing your SMP
locks).

We can trivially replace mark_buffer_dirty() with __mark_buffer_dirty()
with an automated script inside smart/SMP filesystems that want to
continue to use the current 2.3.x semantics of mark_buffer_dirty().
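
A minimal sketch of the calling pattern described above, written as a
toy Python model rather than kernel code. The names DirtyAccounting,
mark_atomic and update_under_lock, and the dirty limit, are assumptions
made for illustration; only the split between a non-blocking "mark"
step and a possibly stalling balance_dirty() step comes from the
discussion.

```python
import threading


class DirtyAccounting:
    """Toy stand-in for the buffer cache's dirty-buffer bookkeeping."""

    def __init__(self, limit):
        self.limit = limit      # hypothetical threshold of dirty buffers
        self.dirty = []         # buffers waiting for writeback
        self.stalls = 0         # how often a caller had to wait for writeback

    def mark_atomic(self, buf):
        # Plays the role of __mark_buffer_dirty(): bookkeeping only, never blocks.
        if not buf.get("dirty"):
            buf["dirty"] = True
            self.dirty.append(buf)

    def balance_dirty(self):
        # Plays the role of balance_dirty(): may stall the caller for writeback.
        if len(self.dirty) > self.limit:
            self.stalls += 1
            for buf in self.dirty:
                buf["dirty"] = False
            self.dirty.clear()


def update_under_lock(acct, lock, bufs):
    """Mark buffers dirty inside a critical section, balance after it."""
    with lock:
        for buf in bufs:
            acct.mark_atomic(buf)   # safe while the lock is held
    acct.balance_dirty()            # the only step that can stall runs unlocked
```

In this model the caller never waits for writeback while holding the
lock, which is the property an SMP filesystem would want from the
atomic variant.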

Andrea

* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  1999-12-23 15:49         ` Andrea Arcangeli
@ 1999-12-23 16:41           ` Hans Reiser
  0 siblings, 0 replies; 34+ messages in thread
From: Hans Reiser @ 1999-12-23 16:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Benjamin C.R. LaHaise, Stephen C. Tweedie, Chris Mason, reiserfs,
	linux-fsdevel, linux-mm, Ingo Molnar, Linus Torvalds

All I'm going to ask is that if mark_buffer_dirty gets changed again, whoever
changes it please let us know this time.....  The last two times it was changed
we weren't informed, and the first time it happened it took a long time to
figure it out.

I think that whether we make __mark_buffer_dirty or mark_buffer_dirty schedule
free is an argument over whether to name a function half-full or half-empty.  I
yield to both sides.

Hans

* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  1999-12-23 15:36             ` Andrea Arcangeli
@ 1999-12-24  5:53               ` afei
  1999-12-26  8:26               ` feiliu
  1 sibling, 0 replies; 34+ messages in thread
From: afei @ 1999-12-24  5:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: William J. Earl, Tan Pong Heng, Stephen C. Tweedie,
	Benjamin C.R. LaHaise, Chris Mason, reiserfs, linux-fsdevel,
	linux-mm, Ingo Molnar, Linus Torvalds

May I ask why the time is O(N*log(N)) instead of O(log(N))? We have this
interesting OS class implementing an AVL-tree-structured directory entry
in the ext2 directory file on disk. I have always thought it is not going
to work out. But the TA and the professor keep telling me the new file
system will be better than ext2 because now we have O(log(N)) time for
search (ok) and insert/removal (???). I really doubt it, but I do not know
where they could be wrong.
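
A small sketch that may help with the complexity question, assuming a
perfectly balanced binary search tree as a stand-in for an AVL or RB
tree (the function names are made up for illustration). It counts how
many nodes a single lookup visits: the worst case grows like log2(N),
and N*log(N) only appears as the total cost of N such operations.

```python
import math


def build_balanced(keys):
    """Build a balanced BST from sorted keys as nested (key, left, right) tuples."""
    if not keys:
        return None
    mid = len(keys) // 2
    return (keys[mid], build_balanced(keys[:mid]), build_balanced(keys[mid + 1:]))


def nodes_visited(node, key):
    """Count the nodes touched while searching for key."""
    visited = 0
    while node is not None:
        visited += 1
        value, left, right = node
        if key == value:
            break
        node = left if key < value else right
    return visited


def worst_case_visits(n):
    """Largest number of nodes any single lookup visits in an n-key tree."""
    tree = build_balanced(list(range(n)))
    return max(nodes_visited(tree, k) for k in range(n))


def height_bound(n):
    """Height of a perfectly balanced tree with n keys: floor(log2(n)) + 1."""
    return math.floor(math.log2(n)) + 1
```

For 4096 keys the deepest lookup visits 13 nodes, which matches
floor(log2(4096)) + 1. Insert and removal in a balanced tree walk one
root-to-leaf path as well, so they stay logarithmic too. For an on-disk
directory each visited node may cost a block read, which is a separate
concern from the asymptotic bound.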

Fei

* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  1999-12-23 15:36             ` Andrea Arcangeli
  1999-12-24  5:53               ` afei
@ 1999-12-26  8:26               ` feiliu
  2000-01-02 22:24                 ` Peter J. Braam
  1 sibling, 1 reply; 34+ messages in thread
From: feiliu @ 1999-12-26  8:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: William J. Earl, Tan Pong Heng, Stephen C. Tweedie,
	Benjamin C.R. LaHaise, Chris Mason, reiserfs, linux-fsdevel,
	linux-mm, Ingo Molnar, Linus Torvalds

May I ask why the time is O(N*log(N)) instead of O(log(N))? We have this
interesting OS class implementing an AVL-tree-structured directory entry
in the ext2 directory file on disk. I have always thought it is not going
to work out. But the TA and the professor keep telling me the new file
system will be better than ext2 because now we have O(log(N)) time for
search (ok) and insert/removal (???). I really doubt it, but I do not know
where they could be wrong.

Besides, how can one join the linux-fsdevel@vger.rutgers.edu mailing list?
I could not find instructions for doing so anywhere.
Fei


 *~~~~~~~~~~~~~~~~~~~~~+_____________________+~~~~~~~~~~~~~~~~~~~*
 *  Email:afei@jhu.edu | WWW:   http://aa.eps.jhu.edu/~feiliu    *
  *  (410)889-9876(H)  | Johns Hopkins Univ. | (410)516-7047(O) *
   *-------------------+_____________________+-----------------*

* Re: RFC: Re: journal ports for 2.3?
  1999-12-22  1:21     ` Benjamin C.R. LaHaise
  1999-12-22 22:19       ` Stephen C. Tweedie
  1999-12-23 12:02       ` Hans Reiser
@ 1999-12-27 16:31       ` Andrea Arcangeli
  2 siblings, 0 replies; 34+ messages in thread
From: Andrea Arcangeli @ 1999-12-27 16:31 UTC (permalink / raw)
  To: Benjamin C.R. LaHaise
  Cc: Stephen C. Tweedie, Chris Mason, reiserfs, linux-fsdevel,
	linux-mm, Ingo Molnar, Linus Torvalds

On Tue, 21 Dec 1999, Benjamin C.R. LaHaise wrote:

>The buffer dirty lists are the wrong place to be dealing with this.  We

The only reason not to use buffer.c is to avoid introducing bugs into
that file.

>need a lightweight, fast way of monitoring the system's dirty buffer/page

The lightweight, fast way is a counter, and we already have it. You only
want to split it into two parts.

Actually I am not completely against splitting it into two parts. Note that
my reply was just to make clear that there is _already_ this "monitoring"
counter. The "need of two filesystems" to exploit the problem showed that
you at least partially misunderstood the current code, so I explained how
things work right now.

>thresholds -- one that can be called for every write to a page or on the
>write faults for cow pages.

The COW write faults really have nothing to do with this. In a COW fault
the old page is clean and can be unmapped, and the copy is anonymous and
can be swapped out, so nothing is unfreeable there.

>I don't agree, both for the reasons above and because doing a
>balance_dirty in mark_buffer_dirty tends to result in stalls in the
>*wrong* place, because it tends to stall in the middle of an operation,

It's always the *wrong* place because every balance_dirty tends to stall
in the middle of an operation.

Try to copy data in 2.3.x and you'll stall in the middle of the
block_*write* stuff. Do you suggest removing the balance_dirty() from
there as well so the code won't stall?

>not before it has begun.  You end up stalling on metadata operations that
>shouldn't stall.  The stall thresholds for data vs metadata have to be

If you don't want to stall there then buy a faster hard disk so all the
metadata writes will be done async.

If you generate 1 gigabyte of dirty data in 1 sec and you only have
10 MB of RAM, then you _must_ stall or you'll go OOM. You choose to go
OOM and that's definitely a very bad design bug.

>different in order to make the system 'feel' right.  This is easily
>accomplished by trying to "allocate" the dirty buffers before you actually
>dirty them (by checking if there's enough slack in the dirty buffer

How can you make a buffer dirty without first allocating it? I'd like to
know.

>margins before doing the operation).

This makes no sense at all to me. Sorry.

The only two reasons for not calling balance_dirty() inside
mark_buffer_dirty() are:

o	it was not possible to call balance_dirty() at mark_buffer_dirty()
	time, because it happened inside an atomic critical section.

o	you are going to mark several buffers dirty at the same time
	(see block_write_full_page for example) and so you want to
	coalesce four balance_dirty() calls into one balance_dirty() to
	improve performance.

Both cases make perfect sense and I sure agree we need to be able to do
that. But (silently) breaking all old fs to do the above two things is very
silly IMHO.
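
The coalescing case can be illustrated with a toy counter model in
Python. It is a sketch only: the Cache class, its limit, and the two
write_page_* helpers are hypothetical names, and four buffers per page
is simply the figure used in the example above.

```python
class Cache:
    """Toy dirty-buffer counter with a balance step that may write back."""

    def __init__(self, limit):
        self.limit = limit          # hypothetical dirty threshold
        self.nr_dirty = 0
        self.balance_calls = 0      # how often the threshold was checked
        self.writebacks = 0         # how often the caller had to stall

    def mark_atomic(self):
        # Non-blocking bookkeeping, in the spirit of __mark_buffer_dirty().
        self.nr_dirty += 1

    def balance_dirty(self):
        self.balance_calls += 1
        if self.nr_dirty > self.limit:
            self.writebacks += 1
            self.nr_dirty = 0


def write_page_per_buffer(cache, buffers_per_page=4):
    """Check the threshold after every buffer (mark and balance together)."""
    for _ in range(buffers_per_page):
        cache.mark_atomic()
        cache.balance_dirty()


def write_page_coalesced(cache, buffers_per_page=4):
    """Mark all buffers of the page first, then check the threshold once."""
    for _ in range(buffers_per_page):
        cache.mark_atomic()
    cache.balance_dirty()
```

Both helpers leave the same number of dirty buffers behind when the
limit is not reached; the coalesced one just performs a single check
per page instead of four.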

I am not writing a new patch because the first one was already rejected
(with explicit commentary) some months ago.

Andrea

* Re: RFC: Re: journal ports for 2.3?
  1999-12-26  8:26               ` feiliu
@ 2000-01-02 22:24                 ` Peter J. Braam
  2000-01-05 13:02                   ` (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it) Hans Reiser
  0 siblings, 1 reply; 34+ messages in thread
From: Peter J. Braam @ 2000-01-02 22:24 UTC (permalink / raw)
  To: Andrea Arcangeli, William J. Earl, Tan Pong Heng,
	Stephen C. Tweedie, Benjamin C.R. LaHaise, Chris Mason, reiserfs,
	linux-fsdevel, linux-mm, Ingo Molnar, Linus Torvalds,
	intermezzo-devel, simmonds

Hi, 

I have one request for the journal API for use by network file systems -
it is a request of a slightly different nature than the ones discussed so
far.

InterMezzo (www.inter-mezzo.org) exploits an existing disk file system as
a cache and wraps around it. (Any disk file system can be used, but so far
only Ext2 has been exploited.)  High availability file systems need update
logs of changes that were made to the cache so that these may be
propagated to peers when they come back online (to support "disconnected
operation").

Requested feature: 
------------------------------------------------------------------------

Stephen's journal API has a tremendously useful feature: it allows nesting
of transactions.   I don't know if Reiser has this (can you tell me
Chris?) but it is _incredibly_ useful.  So: 

- InterMezzo can start a journal transaction
 - execute the underlying Ext3 routine within that transaction 
   (i.e. the Ext3 transaction becomes part of the one started 
    by InterMezzo)
- InterMezzo finishes its routine (e.g. by noting that an update
took place in its update log) and commits or aborts the transaction

-------------------------------------------------------------------------

[So, in particular InterMezzo and Ext3 share the journal transaction log.]

Why is this useful? There are at least two reasons:

 - the InterMezzo update log can be kept in sync with the Ext3 file
system as a cache

 - InterMezzo will soon manage somewhat more metadata (e.g. it may want to
remember a global file identifier, similar to a Coda FID or NFS file
handle) and it can make updates to its metadata atomically with updates
made to Ext3 metadata.

Both of these reasons touch the core architectural decisions of systems
like Coda/AFS/InterMezzo/DCE-DFS -- so there is some historical reason to
be so delighted with what one can do with Stephen's API.

Presently, systems like Coda and AFS have a hell of a time keeping caches
in sync with the metadata and to a large extent Coda's really bad
performance is caused by this (an external transaction system is used in
conjunction with synchronous operations on the disk file system, ouch...).
InterMezzo will start using the kernel journal facility that should be
much lighter weight.

Is this a reasonable thing to ask for? 

- Peter -

* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it)
  2000-01-02 22:24                 ` Peter J. Braam
@ 2000-01-05 13:02                   ` Hans Reiser
  2000-01-05 15:22                     ` Peter J. Braam
  0 siblings, 1 reply; 34+ messages in thread
From: Hans Reiser @ 2000-01-05 13:02 UTC (permalink / raw)
  To: Peter J. Braam
  Cc: Andrea Arcangeli, William J. Earl, Tan Pong Heng,
	Stephen C. Tweedie, Benjamin C.R. LaHaise, Chris Mason, reiserfs,
	linux-fsdevel, linux-mm, Ingo Molnar, Linus Torvalds,
	intermezzo-devel, simmonds

Is nesting really the term you mean to use here, or is joining the term you
mean?

Do you really mean transactions within other transactions?

Exactly what functionality do you need?

Hans

* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my  ISP probably lost it)
  2000-01-05 13:02                   ` (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it) Hans Reiser
@ 2000-01-05 15:22                     ` Peter J. Braam
  2000-01-05 15:37                       ` Tigran Aivazian
                                         ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Peter J. Braam @ 2000-01-05 15:22 UTC (permalink / raw)
  To: Hans Reiser
  Cc: Andrea Arcangeli, William J. Earl, Tan Pong Heng,
	Stephen C. Tweedie, Benjamin C.R. LaHaise, Chris Mason, reiserfs,
	linux-fsdevel, linux-mm, Ingo Molnar, Linus Torvalds,
	intermezzo-devel, simmonds

I think I mean joining.  What I need is:
  
 braam starts trans
   does A
   calls reiser: hans starts
   does B
   hans commits; nothing goes to disk yet
   braam does C
braam commits/aborts ABC now go or don't
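
The sequence above can be sketched as a toy Python model. This is an
illustration under assumptions, not the ext3 or reiserfs journal API:
the Journal class, its depth counter, and the two helper functions are
hypothetical. An inner start/commit pair joins the transaction that is
already open, and only the outermost commit or abort decides whether
A, B and C reach disk.

```python
class Journal:
    """Toy journal in which inner transactions join the outermost one."""

    def __init__(self):
        self.disk = []          # operations that have reached "disk"
        self._pending = []      # operations of the running transaction
        self._depth = 0         # number of start() calls still open
        self._aborted = False

    def start(self):
        self._depth += 1

    def log(self, op):
        if self._depth == 0:
            raise RuntimeError("log() called outside a transaction")
        self._pending.append(op)

    def commit(self):
        self._end(abort=False)

    def abort(self):
        self._end(abort=True)

    def _end(self, abort):
        if self._depth == 0:
            raise RuntimeError("no transaction is open")
        self._aborted = self._aborted or abort
        self._depth -= 1
        if self._depth == 0:
            # Only the outermost end decides the fate of everything logged.
            if not self._aborted:
                self.disk.extend(self._pending)
            self._pending = []
            self._aborted = False


def disk_fs_operation(journal):
    """The underlying filesystem's own routine ("hans starts ... commits")."""
    journal.start()
    journal.log("B")
    journal.commit()            # joins the caller's transaction if one is open


def wrapper_operation(journal, abort=False):
    """The wrapping filesystem's routine ("braam starts ... commits/aborts")."""
    journal.start()
    journal.log("A")
    disk_fs_operation(journal)
    seen_on_disk = list(journal.disk)   # snapshot right after the inner commit
    journal.log("C")
    if abort:
        journal.abort()
    else:
        journal.commit()
    return seen_on_disk
```

Called on its own, disk_fs_operation() commits B immediately; called
from wrapper_operation(), B stays pending until the outer transaction
ends, and an outer abort discards A, B and C together.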


- Peter -

* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my  ISP probably lost it)
  2000-01-05 15:22                     ` Peter J. Braam
@ 2000-01-05 15:37                       ` Tigran Aivazian
  2000-01-06  8:40                         ` Hans Reiser
  2000-01-05 15:50                       ` Chris Mason
  2000-01-06  8:34                       ` (reiserfs) Re: RFC: Re: journal ports for 2.3? (resendingbecause " Hans Reiser
  2 siblings, 1 reply; 34+ messages in thread
From: Tigran Aivazian @ 2000-01-05 15:37 UTC (permalink / raw)
  To: Peter J. Braam
  Cc: Hans Reiser, Andrea Arcangeli, William J. Earl, Tan Pong Heng,
	Stephen C. Tweedie, Benjamin C.R. LaHaise, Chris Mason, reiserfs,
	linux-fsdevel, linux-mm, Ingo Molnar, Linus Torvalds,
	intermezzo-devel, simmonds

On Wed, 5 Jan 2000, Peter J. Braam wrote:
> I think I mean joining.  What I need is:
>   
>  braam starts trans
>    does A
>    calls reiser: hans starts
>    does B
>    hans commits; nothing goes to disk yet
>    braam does C
> braam commits/aborts ABC now go or don't

no, that definitely looks like nesting to me.

Tigran.

* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my  ISP probably lost it)
  2000-01-05 15:22                     ` Peter J. Braam
  2000-01-05 15:37                       ` Tigran Aivazian
@ 2000-01-05 15:50                       ` Chris Mason
  2000-01-06  8:34                       ` (reiserfs) Re: RFC: Re: journal ports for 2.3? (resendingbecause " Hans Reiser
  2 siblings, 0 replies; 34+ messages in thread
From: Chris Mason @ 2000-01-05 15:50 UTC (permalink / raw)
  To: Peter J. Braam
  Cc: Hans Reiser, Andrea Arcangeli, William J. Earl, Tan Pong Heng,
	Stephen C. Tweedie, Benjamin C.R. LaHaise, reiserfs, mason,
	linux-fsdevel, linux-mm, Ingo Molnar, Linus Torvalds,
	intermezzo-devel, simmonds


On Wed, 5 Jan 2000, Peter J. Braam wrote:

> I think I mean joining.  What I need is:
>   
>  braam starts trans
>    does A
>    calls reiser: hans starts
>    does B
>    hans commits; nothing goes to disk yet
>    braam does C
> braam commits/aborts ABC now go or don't
> 
> 
Reiserfs won't do this kind of nesting right now; we also don't have a
transaction abort (aside from crashing the machine).  These can be added
to a future version, but would you mind explaining your transaction needs
in more detail (offline) so I can get a better idea of what you are
looking for?

-chris

* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it)
  2000-01-05 15:22                     ` Peter J. Braam
  2000-01-05 15:37                       ` Tigran Aivazian
  2000-01-05 15:50                       ` Chris Mason
@ 2000-01-06  8:34                       ` Hans Reiser
  2000-01-07  1:25                         ` (reiserfs) Re: RFC: Re: journal ports for 2.3? (resendingbecause my Albert D. Cahalan
  2 siblings, 1 reply; 34+ messages in thread
From: Hans Reiser @ 2000-01-06  8:34 UTC (permalink / raw)
  To: Peter J. Braam
  Cc: Andrea Arcangeli, William J. Earl, Tan Pong Heng,
	Stephen C. Tweedie, Benjamin C.R. LaHaise, Chris Mason, reiserfs,
	linux-fsdevel, linux-mm, Ingo Molnar, Linus Torvalds,
	intermezzo-devel, simmonds

Yes, but not before 2.5.  Chris and I have already discussed that it would be
nice to make the transaction API available to user space, but we haven't done any
work on it, or even specified the user API.  We probably won't even start work on
it for 6 months (unless a sponsor asks for it).  We do think it is a good idea.

Hans

"Peter J. Braam" wrote:

> I think I mean joining.  What I need is:
>
>  braam starts trans
>    does A
>    calls reiser: hans starts
>    does B
>    hans commits; nothing goes to disk yet
>    braam does C
> braam commits/aborts ABC now go or don't
>
> - Peter -
>
> On Wed, 5 Jan 2000, Hans Reiser wrote:
>
> > Is nesting really the term you mean to use here, or is joining the term you
> > mean?
> >
> > Do you really mean transactions within other transactions?
> >
> > Exactly what functionality do you need?
> >
> > Hans
> >
> > "Peter J. Braam" wrote:
> >
> > > Hi,
> > >
> > > I have one request for the journal API for use by network file systems -
> > > it is a request of a slightly different nature than the ones discussed so
> > > far.
> > >
> > > InterMezzo (www.inter-mezzo.org) exploits an existing disk file system as
> > > a cache and wraps around it. (Any disk file system can be used, but so far
> > > only Ext2 has been exploited.)  High availability file systems need update
> > > logs of changes that were made to the cache so that these may be
> > > propagated to peers when they come back online (to support "disconnected
> > > operation").
> > >
> > > Requested feature:
> > > ------------------------------------------------------------------------
> > >
> > > Stephen's journal API has a tremendously useful feature: it allows nesting
> > > of transactions.   I don't know if Reiser has this (can you tell me
> > > Chris?) but it is _incredibly_ useful.  So:
> > >
> > > - InterMezzo can start a journal transaction
> > >  - execute the underlying Ext3 routine within that transaction
> > >    (i.e. the Ext3 transaction becomes part of the one started
> > >     by InterMezzo)
> > > - InterMezzo finishes its routine (e.g. by noting that an update
> > > took place in its update log) and commits or aborts the transaction
> > >
> > > -------------------------------------------------------------------------
> > >
> > > [So, in particular InterMezzo and Ext3 share the journal transaction log.]
> > >
> > > Why is this useful? There are at least two reasons:
> > >
> > >  - the InterMezzo update log can be kept in sync with the Ext3 file
> > > system used as a cache
> > >
> > >  - InterMezzo will soon manage somewhat more metadata (e.g. it may want to
> > > remember a global file identifier, similar to a Coda FID or NFS file
> > > handle) and it can make updates to its metadata atomically with updates
> > > made to Ext3 metadata.
> > >
> > > Both of these reasons touch the core architectural decisions of systems
> > > like Coda/AFS/InterMezzo/DCE-DFS -- so there is some historical reason to
> > > be so delighted with what one can do with Stephen's API.
> > >
> > > Presently, systems like Coda and AFS have a hell of a time keeping caches
> > > in sync with the metadata and to a large extent Coda's really bad
> > > performance is caused by this (an external transaction system is used in
> > > conjunction with synchronous operations on the disk file system, ouch...).
> > > InterMezzo will start using the kernel journal facility that should be
> > > much lighter weight.
> > >
> > > Is this a reasonable thing to ask for?
> > >
> > > - Peter -
> >

* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it)
  2000-01-05 15:37                       ` Tigran Aivazian
@ 2000-01-06  8:40                         ` Hans Reiser
  0 siblings, 0 replies; 34+ messages in thread
From: Hans Reiser @ 2000-01-06  8:40 UTC (permalink / raw)
  To: Tigran Aivazian
  Cc: Peter J. Braam, Andrea Arcangeli, William J. Earl, Tan Pong Heng,
	Stephen C. Tweedie, Benjamin C.R. LaHaise, Chris Mason, reiserfs,
	linux-fsdevel, linux-mm, Ingo Molnar, Linus Torvalds,
	intermezzo-devel, simmonds

Tigran Aivazian wrote:

> On Wed, 5 Jan 2000, Peter J. Braam wrote:
> > I think I mean joining.  What I need is:
> >
> >  braam starts trans
> >    does A
> >    calls reiser: hans starts
> >    does B
> >    hans commits; nothing goes to disk yet
> >    braam does C
> > braam commits/aborts ABC now go or don't
>
> no, that definitely looks like nesting to me.
>
> Tigran.

It looks like joining to me.  If it were nesting, you would be able to commit A
without committing B.

Of course, if there is database literature defining nesting, and there probably
is, then I should be ignored here.
Perhaps the literature defines nesting as equivalent to what I call joining.
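
The A/B/C scenario quoted above can be modelled in a few lines. This is only a
userspace sketch of the "joining" behaviour under discussion, not the ext3 or
reiserfs journal API: struct txn, txn_begin, txn_log, txn_mark_abort and
txn_end are all hypothetical names, and the abort flag is an assumption added
to cover the "commits/aborts" step in the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a joined transaction; not a real kernel structure. */
struct txn {
    int  depth;    /* callers that have joined and not yet finished */
    int  pending;  /* updates logged but not yet durable */
    int  on_disk;  /* updates made durable by an outermost commit */
    bool aborted;  /* set if any participant asked for an abort */
};

/* Start a transaction, or join the one already running. */
static void txn_begin(struct txn *t)
{
    if (t->depth == 0) {
        t->pending = 0;
        t->aborted = false;
    }
    t->depth++;
}

/* Record one update (A, B or C in the example) in the running transaction. */
static void txn_log(struct txn *t)
{
    assert(t->depth > 0);
    t->pending++;
}

/* Ask for the whole joined transaction to be discarded at the outermost end. */
static void txn_mark_abort(struct txn *t)
{
    assert(t->depth > 0);
    t->aborted = true;
}

/*
 * Finish one participant's part.  Only the outermost call decides the fate
 * of the updates; an inner "commit" just drops the join count.
 * Returns true if this call made the updates durable.
 */
static bool txn_end(struct txn *t)
{
    assert(t->depth > 0);
    if (--t->depth > 0)
        return false;           /* inner commit: nothing goes to disk yet */
    if (!t->aborted)
        t->on_disk += t->pending;
    t->pending = 0;
    return !t->aborted;
}
```

In this model the inner txn_end() (the "hans commits" step) returns false and
leaves on_disk untouched; A, B and C become durable together, or not at all,
only when the outer caller ends the transaction.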

Hans


* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  1999-12-22 23:37       ` Hans Reiser
@ 2000-01-06 17:48         ` Stephen C. Tweedie
  2000-01-06 18:20           ` Andrea Arcangeli
  0 siblings, 1 reply; 34+ messages in thread
From: Stephen C. Tweedie @ 2000-01-06 17:48 UTC (permalink / raw)
  To: Hans Reiser
  Cc: Stephen C. Tweedie, Andrea Arcangeli, Chris Mason, reiserfs,
	linux-fsdevel, linux-mm, Ingo Molnar, Linus Torvalds

Hi,

On Thu, 23 Dec 1999 02:37:48 +0300, Hans Reiser <reiser@idiom.com>
said:

>> > I completely agree to change mark_buffer_dirty() to call balance_dirty()
>> > before returning.

> How can we use a mark_buffer_dirty that calls balance_dirty in a
> place where we cannot call balance_dirty?

It shouldn't be impossible: as long as we are protected against
recursive invocations of balance_dirty (which should be easy to
arrange) we should be safe enough, at least if the memory reservation
bits of the VM/fs interaction are working so that the balance_dirty
can guarantee to run to completion.
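
One way to picture the recursion protection is a simple "already balancing"
flag. The following is a userspace sketch under stated assumptions, not the
2.3 buffer code: the model_* functions, the in_balance flag and DIRTY_LIMIT
are hypothetical, and a real kernel would keep such a flag per task rather
than in a global.

```c
#include <assert.h>
#include <stdbool.h>

enum { DIRTY_LIMIT = 4 };       /* hypothetical threshold for the model */

static bool in_balance;         /* true while a balance pass is running */
static int  ndirty;             /* dirty buffers outstanding */
static int  flush_runs;         /* how many balance passes actually ran */

static void model_mark_buffer_dirty(void);

/* Flush when too many buffers are dirty, but never re-enter ourselves. */
static void model_balance_dirty(void)
{
    if (in_balance || ndirty <= DIRTY_LIMIT)
        return;
    in_balance = true;
    flush_runs++;
    /* Flushing may itself dirty a buffer; the guard stops the recursion. */
    model_mark_buffer_dirty();
    ndirty = 0;                 /* pretend the flush wrote everything out */
    in_balance = false;
}

/* Mark one more buffer dirty, then rebalance before returning. */
static void model_mark_buffer_dirty(void)
{
    ndirty++;
    model_balance_dirty();
}
```

The nested call made from inside the flush returns immediately, so a single
pass runs to completion on one stack frame.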

--Stephen

* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  1999-12-22 22:41         ` (reiserfs) " Tan Pong Heng
  1999-12-23  3:27           ` William J. Earl
@ 2000-01-06 17:54           ` Stephen C. Tweedie
  1 sibling, 0 replies; 34+ messages in thread
From: Stephen C. Tweedie @ 2000-01-06 17:54 UTC (permalink / raw)
  To: Tan Pong Heng
  Cc: Stephen C. Tweedie, Benjamin C.R. LaHaise, Andrea Arcangeli,
	Chris Mason, reiserfs, linux-fsdevel, linux-mm, Ingo Molnar,
	Linus Torvalds

Hi,

On Thu, 23 Dec 1999 06:41:44 +0800, Tan Pong Heng
<pongheng@starnet.gov.sg> said:

> I was thinking that, unless you want to have FS specific buffer/page
> cache, there is always a gain from a unified cache for all fs. I think
> the one piece of functionality missing from the 2.3 implementation
> is the dependency between the various pages. If you could specify a
> tree relation between the various subsets of the buffer/page cache and the
> reclaim mechanism honored that, everything should be fine. An FS that
> does not care about ordering could simply ignore this
> capability, and the mechanism could assume that everything is in one
> big set and could be reclaimed in any order.

That just doesn't give you enough power.  The trouble is that there
are IO dependencies which you don't know about until after the first
IO has completed.  For example, in journaling you may be allocating
journal blocks on demand, and you don't know where the journal commit
block will be until you have written most of the rest of the
transaction out.  If you are doing deferred allocation of disk blocks,
then you can't even _start_ the dependent IO trail until you
explicitly tell the filesystem that the flush-to-disk is beginning.

You need a way to let the filesystem know that you want something in
the cache to be written to disk.  You don't want to presume that one
general-purpose ordering mechanism will work.
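
As a rough illustration of why a callback fits better than a precomputed
ordering, here is a userspace sketch in which the cache only signals pressure
and the filesystem decides the IO order itself. Everything in it is
hypothetical (struct model_fs, the flush_to_disk hook, journal_flush,
cache_pressure, the block numbers); it is not the interface proposed for 2.3,
only a model of the idea that the commit block's location is not known until
the rest of the transaction has been written.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_IO = 16 };

/* Records the order in which blocks were submitted for IO. */
struct io_log {
    int    block[MAX_IO];
    size_t n;
};

static void submit(struct io_log *log, int blocknr)
{
    assert(log->n < MAX_IO);
    log->block[log->n++] = blocknr;
}

/* Hypothetical filesystem state with a flush hook supplied by the fs. */
struct model_fs {
    int next_free;  /* next free journal block, allocated on demand */
    int ndirty;     /* dirty buffers in the running transaction */
    void (*flush_to_disk)(struct model_fs *fs, struct io_log *log);
};

/* The fs writes the transaction body first; only then can it place commit. */
static void journal_flush(struct model_fs *fs, struct io_log *log)
{
    while (fs->ndirty > 0) {
        submit(log, fs->next_free++);   /* journal block allocated on demand */
        fs->ndirty--;
    }
    submit(log, fs->next_free++);       /* commit block: location known only now */
}

/* The cache does not order the IO; it just tells the fs to flush. */
static void cache_pressure(struct model_fs *fs, struct io_log *log)
{
    if (fs->flush_to_disk)
        fs->flush_to_disk(fs, log);
}
```

With three dirty buffers and the journal starting at block 100, the model
submits blocks 100 to 102 and then the commit block at 103, a position the
cache could not have been told in advance.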

--Stephen

* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  2000-01-06 17:48         ` Stephen C. Tweedie
@ 2000-01-06 18:20           ` Andrea Arcangeli
  2000-01-06 21:32             ` Hans Reiser
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2000-01-06 18:20 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Hans Reiser, Chris Mason, reiserfs, linux-fsdevel, linux-mm,
	Ingo Molnar, Linus Torvalds

BTW, I thought Hans was talking about places that can't sleep (because of
some not schedule-aware lock) when he said "place that cannot call
balance_dirty()".

On Thu, 6 Jan 2000, Stephen C. Tweedie wrote:

>It shouldn't be impossible: as long as we are protected against
>recursive invocations of balance_dirty (which should be easy to

I am not sure I understand correctly. In case the ll_rw_block layer
produces dirty buffers, we are protected by wakeup_bdflush, which becomes a
no-op when called again from kflushd (wakeup_bdflush does not block, to avoid
bdflush waiting on bdflush :). And in general balance_dirty should never
recurse on the same stack.

>arrange) we should be safe enough, at least if the memory reservation
>bits of the VM/fs interaction are working so that the balance_dirty
>can guarantee to run to completion.

Hmm maybe you are talking about something else...

Andrea


* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  2000-01-06 18:20           ` Andrea Arcangeli
@ 2000-01-06 21:32             ` Hans Reiser
  2000-01-07 11:51               ` Stephen C. Tweedie
  0 siblings, 1 reply; 34+ messages in thread
From: Hans Reiser @ 2000-01-06 21:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Chris Mason, reiserfs, linux-fsdevel,
	linux-mm, Ingo Molnar, Linus Torvalds

Andrea Arcangeli wrote:

> BTW, I thought Hans was talking about places that can't sleep (because of
> some not schedule-aware lock) when he said "place that cannot call
> balance_dirty()".

You were correct.  I think Stephen and I are miscommunicating here.



* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my
  2000-01-06  8:34                       ` (reiserfs) Re: RFC: Re: journal ports for 2.3? (resendingbecause " Hans Reiser
@ 2000-01-07  1:25                         ` Albert D. Cahalan
  2000-01-07 11:37                           ` Stephen C. Tweedie
  0 siblings, 1 reply; 34+ messages in thread
From: Albert D. Cahalan @ 2000-01-07  1:25 UTC (permalink / raw)
  To: Hans Reiser
  Cc: Peter J. Braam, Andrea Arcangeli, William J. Earl, Tan Pong Heng,
	Stephen C. Tweedie, Benjamin C.R. LaHaise, Chris Mason, reiserfs,
	linux-fsdevel, linux-mm, Ingo Molnar, Linus Torvalds,
	intermezzo-devel, simmonds

Hans Reiser writes:

> Yes, but not before 2.5.  Chris and I have already discussed that
> it would be nice to make the transaction API available to user space,
> but we haven't done any work on it, or even specified the user API.

AIX has such an API already. It is good to clone if you can.

This ought to contain the API, but might require some digging:
http://www.rs6000.ibm.com/support/

* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my
  2000-01-07  1:25                         ` (reiserfs) Re: RFC: Re: journal ports for 2.3? (resendingbecause my Albert D. Cahalan
@ 2000-01-07 11:37                           ` Stephen C. Tweedie
  0 siblings, 0 replies; 34+ messages in thread
From: Stephen C. Tweedie @ 2000-01-07 11:37 UTC (permalink / raw)
  To: Albert D. Cahalan
  Cc: Hans Reiser, Peter J. Braam, Andrea Arcangeli, William J. Earl,
	Tan Pong Heng, Stephen C. Tweedie, Benjamin C.R. LaHaise,
	Chris Mason, reiserfs, linux-fsdevel, linux-mm, Ingo Molnar,
	Linus Torvalds, intermezzo-devel, simmonds

Hi,

On Thu, 6 Jan 2000 20:25:38 -0500 (EST), "Albert D. Cahalan"
<acahalan@cs.uml.edu> said:

> AIX has such an API already. It is good to clone if you can.

The AIX API is much more than a simple small-operation atomic
transaction API, isn't it?  The filesystem transactions have many
properties --- no abort, predictable size, short duration --- which make
a journaling engine inappropriate for use in a general purpose
user-visible transaction API.

--Stephen

* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  2000-01-06 21:32             ` Hans Reiser
@ 2000-01-07 11:51               ` Stephen C. Tweedie
  2000-01-07 12:46                 ` Andrea Arcangeli
  2000-01-07 19:59                 ` Hans Reiser
  0 siblings, 2 replies; 34+ messages in thread
From: Stephen C. Tweedie @ 2000-01-07 11:51 UTC (permalink / raw)
  To: Hans Reiser
  Cc: Andrea Arcangeli, Stephen C. Tweedie, Chris Mason, reiserfs,
	linux-fsdevel, linux-mm, Ingo Molnar, Linus Torvalds

Hi,

On Fri, 07 Jan 2000 00:32:48 +0300, Hans Reiser <reiser@idiom.com> said:

> Andrea Arcangeli wrote:
>> BTW, I thought Hans was talking about places that can't sleep (because of
>> some not schedule-aware lock) when he said "place that cannot call
>> balance_dirty()".

> You were correct.  I think Stephen and I are miscommunicating here.

Fine, I was just looking at it from the VFS point of view, not the
specific filesystem.  In the worst case, a filesystem can always simply
defer marking the buffer as dirty until after the locking window has
passed, so there's obviously no fundamental problem with having a
blocking mark_buffer_dirty.  If we want a non-blocking version too, with
the requirement that the filesystem then do a manual rebalance once it
is safe to do so, that will work fine too.
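
The two variants can be sketched as follows. This is a userspace model under
stated assumptions, not the actual buffer code: sk_mark_dirty,
sk_mark_dirty_nonblock, sk_balance_dirty, lock_held and LIMIT are
hypothetical names, and the lock_held flag merely stands in for a lock that
is not schedule-aware.

```c
#include <assert.h>
#include <stdbool.h>

enum { LIMIT = 1 };             /* hypothetical dirty-buffer threshold */

struct model_buffer { bool dirty; };

static int  ndirty;             /* dirty buffers outstanding */
static int  rebalances;         /* balance passes that did work */
static bool lock_held;          /* stands in for a non-schedule-aware lock */

/* May sleep in a real kernel, so it must not run inside the locked window. */
static void sk_balance_dirty(void)
{
    assert(!lock_held);
    if (ndirty > LIMIT) {
        rebalances++;
        ndirty = 0;             /* pretend writeback cleaned the buffers */
    }
}

/* Non-blocking variant: only records the state change, safe under the lock. */
static void sk_mark_dirty_nonblock(struct model_buffer *b)
{
    if (!b->dirty) {
        b->dirty = true;
        ndirty++;
    }
}

/* Blocking variant: mark, then rebalance before returning. */
static void sk_mark_dirty(struct model_buffer *b)
{
    sk_mark_dirty_nonblock(b);
    sk_balance_dirty();
}
```

A filesystem holding such a lock would call the non-blocking variant inside
the window and call sk_balance_dirty() itself once the lock is dropped, which
is the manual rebalance described above.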

--Stephen

* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  2000-01-07 11:51               ` Stephen C. Tweedie
@ 2000-01-07 12:46                 ` Andrea Arcangeli
  2000-01-07 19:59                 ` Hans Reiser
  1 sibling, 0 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2000-01-07 12:46 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Hans Reiser, Chris Mason, reiserfs, linux-fsdevel, linux-mm,
	Ingo Molnar, Linus Torvalds

On Fri, 7 Jan 2000, Stephen C. Tweedie wrote:

>Fine, I was just looking at it from the VFS point of view, not the
>specific filesystem.  In the worst case, a filesystem can always simply
>defer marking the buffer as dirty until after the locking window has
>passed, so there's obviously no fundamental problem with having a
>blocking mark_buffer_dirty.  If we want a non-blocking version too, with
>the requirement that the filesystem then do a manual rebalance once it
>is safe to do so, that will work fine too.

I made the new mark_buffer_dirty blocking and __mark_buffer_dirty
non-blocking while fixing the 2.3.x buffer code.

	ftp://ftp.*.kernel.org/pub/linux/kernel/people/andrea/patches/v2.3/2.3.36pre5/buffer-2.gz

I have been running with the above applied for some days on a 2.3.36-based
kernel on Alpha, and all has worked fine so far under all kinds of load.

Andrea


* Re: (reiserfs) Re: RFC: Re: journal ports for 2.3?
  2000-01-07 11:51               ` Stephen C. Tweedie
  2000-01-07 12:46                 ` Andrea Arcangeli
@ 2000-01-07 19:59                 ` Hans Reiser
  1 sibling, 0 replies; 34+ messages in thread
From: Hans Reiser @ 2000-01-07 19:59 UTC (permalink / raw)
  To: Stephen C. Tweedie
  Cc: Andrea Arcangeli, Chris Mason, reiserfs, linux-fsdevel, linux-mm,
	Ingo Molnar, Linus Torvalds

"Stephen C. Tweedie" wrote:

> Hi,
>
> On Fri, 07 Jan 2000 00:32:48 +0300, Hans Reiser <reiser@idiom.com> said:
>
> > Andrea Arcangeli wrote:
> >> BTW, I thought Hans was talking about places that can't sleep (because of
> >> some not schedule-aware lock) when he said "place that cannot call
> >> balance_dirty()".
>
> > You were correct.  I think Stephen and I are miscommunicating here.
>
> Fine, I was just looking at it from the VFS point of view, not the
> specific filesystem.  In the worst case, a filesystem can always simply
> defer marking the buffer as dirty until after the locking window has
> passed, so there's obviously no fundamental problem with having a
> blocking mark_buffer_dirty.  If we want a non-blocking version too, with
> the requirement that the filesystem then do a manual rebalance once it
> is safe to do so, that will work fine too.
>
> --Stephen

Yes, but then you have to track what you defer.  Code complication.

I just want to leave things as they are until we have time to do SMP right.

When we do SMP right, then a mark_buffer_dirty() which causes schedule is not a
problem.  Let's deal with this in 2.5....

Hans


Thread overview: 34+ messages
     [not found] <000c01bf472c$8ad8cb60$8edb1581@isc.rit.edu>
1999-12-21  0:24 ` RFC: Re: journal ports for 2.3? Stephen C. Tweedie
1999-12-21 10:18   ` Andrea Arcangeli
1999-12-21 13:21     ` (reiserfs) " Stephen C. Tweedie
1999-12-21 13:57       ` Andrea Arcangeli
1999-12-22  0:28         ` Stephen C. Tweedie
1999-12-23 11:51           ` Hans Reiser
1999-12-22 23:37       ` Hans Reiser
2000-01-06 17:48         ` Stephen C. Tweedie
2000-01-06 18:20           ` Andrea Arcangeli
2000-01-06 21:32             ` Hans Reiser
2000-01-07 11:51               ` Stephen C. Tweedie
2000-01-07 12:46                 ` Andrea Arcangeli
2000-01-07 19:59                 ` Hans Reiser
1999-12-22  1:21     ` Benjamin C.R. LaHaise
1999-12-22 22:19       ` Stephen C. Tweedie
1999-12-22 22:41         ` (reiserfs) " Tan Pong Heng
1999-12-23  3:27           ` William J. Earl
1999-12-23 15:36             ` Andrea Arcangeli
1999-12-24  5:53               ` afei
1999-12-26  8:26               ` feiliu
2000-01-02 22:24                 ` Peter J. Braam
2000-01-05 13:02                   ` (reiserfs) Re: RFC: Re: journal ports for 2.3? (resending because my ISP probably lost it) Hans Reiser
2000-01-05 15:22                     ` Peter J. Braam
2000-01-05 15:37                       ` Tigran Aivazian
2000-01-06  8:40                         ` Hans Reiser
2000-01-05 15:50                       ` Chris Mason
2000-01-06  8:34                       ` (reiserfs) Re: RFC: Re: journal ports for 2.3? (resendingbecause " Hans Reiser
2000-01-07  1:25                         ` (reiserfs) Re: RFC: Re: journal ports for 2.3? (resendingbecause my Albert D. Cahalan
2000-01-07 11:37                           ` Stephen C. Tweedie
2000-01-06 17:54           ` (reiserfs) Re: RFC: Re: journal ports for 2.3? Stephen C. Tweedie
1999-12-23 12:02       ` Hans Reiser
1999-12-23 15:49         ` Andrea Arcangeli
1999-12-23 16:41           ` Hans Reiser
1999-12-27 16:31       ` Andrea Arcangeli
