* page locking and error handling
From: Roman Zippel @ 2001-02-15 15:02 UTC
To: linux-mm
Hi,
I'm currently trying to understand exactly how the current page state
handling works, and I have a few problems around the generic_file_write
function. The problems I see are:
- the deadlock mentioned in generic_file_write is not really fixed, it's
just a bit harder to trigger? (a sketch of the scenario I mean follows
below)
- if copy_from_user() fails the page is marked not uptodate. AFAIK this
assumes that the page->buffers are still uptodate, so previous writes
are not lost.
- a concurrent read might outrun a write and so possibly see some new data
from the write mixed with some old data.
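To make the deadlock concrete: if I read the code right, the problem case
is a write() whose source buffer is mmap()ed from the very file (and
possibly the very page) being written. A minimal userspace sketch of that
situation ("testfile" is just a placeholder name; this is an illustration,
not a guaranteed reproducer):

/*
 * Sketch of the suspected deadlock scenario: the source buffer for write()
 * is mmap()ed from the same file (here even the same offset) that
 * generic_file_write() locks.  When copy_from_user() faults on the
 * not-yet-resident source page, the fault handler has to wait for a page
 * lock the writer itself may be holding.  Illustration only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("testfile", O_RDWR | O_CREAT, 0644); /* placeholder */
        char *map;

        if (fd < 0 || ftruncate(fd, 4096) < 0) {
                perror("setup");
                return 1;
        }
        map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        /* write() from the mapping of the same file offset: source and
         * destination are backed by the same page cache page. */
        if (write(fd, map, 4096) < 0)
                perror("write");
        munmap(map, 4096);
        close(fd);
        return 0;
}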
Please correct me if I'm wrong. Anyway, here are some ideas to address
this.
1. We can add a nonblocking version of copy_(from|to)_user, which
returns EAGAIN if it hits a locked page (see the sketch after this
list). Most of the needed code is in the slow path, so it doesn't affect
performance. See arch/m68k/mm/fault.c for how to pass additional info
from the fault handler.
2. We should make the state of a page completely independent of the
underlying mapping mechanism.
- A page shouldn't suddenly become not uptodate just because a read from
user space fails, so we need to _clearly_ define who sets/clears which
bit. Especially in error situations ClearPageUptodate() is called, but
does the page data really become not uptodate?
- This also means the buffer pointer becomes private to the mapping
mechanism, so it can be reused for other caching mechanisms (e.g. nfs
wouldn't have to store its data separately).
3. During a write we always keep at least one page locked and we don't
release the previous page until we have locked the next. This means:
- i_sem is not needed anymore, so multiple writers can access the
file at the same time.
- a read can't outrun a write anymore.
- page locking has to happen completely at the higher layer, and keeping
multiple pages locked would require something like 1).
- this would allow passing multiple pages at once to the mapping
mechanism, as we can easily link several pages together. That is
actually all that is needed/wanted for streaming, with no need for a
heavyweight kiobuf.
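To illustrate 1) and 3) a bit, a per-page write helper could use the
nonblocking copy roughly like this. This is only a sketch: the names
copy_from_user_nonblock() and fault_in_user_pages() are made up, and
highmem, prepare/commit_write and proper error handling are left out.

/* Rough sketch, not real 2.4 code. */
#include <linux/mm.h>
#include <linux/pagemap.h>

/* Hypothetical primitive from 1): like copy_from_user(), but the fault
 * handler returns -EAGAIN instead of sleeping on a locked page. */
long copy_from_user_nonblock(void *to, const void *from, long n);

static long write_one_page(struct page *page, const char *buf,
                           unsigned long offset, unsigned long bytes)
{
        long left;

        for (;;) {
                lock_page(page);
                left = copy_from_user_nonblock(page_address(page) + offset,
                                               buf, bytes);
                if (left != -EAGAIN)
                        break;
                /* The source page is locked (possibly the page we hold):
                 * drop our lock, fault the user buffer in, and retry. */
                UnlockPage(page);
                if (fault_in_user_pages(buf, bytes))    /* made-up helper */
                        return -EFAULT;
        }
        UnlockPage(page);
        return left ? -EFAULT : 0;
}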
This is probably a bit sketchy, but the main idea is to further improve
the page state handling and to remove dependencies/assumptions about the
buffer handling. This would also allow better error handling, e.g. data
for a removed medium could be saved in a temporary file instead of being
thrown away, or one could even keep two media mounted in the same drive.
Another possibility is to use/test several I/O mechanisms at the same
time, or even to make them modular.
Anyway, just some wild ideas :). I'm interested in any comments on
whether the above is needed/desired. As it would mean some quite heavy
changes, I'd like to make sure I'm not missing anything before starting
to hack on it (which won't be too soon, as more important things are
pending...).
bye, Roman
* Re: page locking and error handling
From: Eric W. Biederman @ 2001-02-15 15:33 UTC
To: Roman Zippel; +Cc: linux-mm
Roman Zippel <zippel@fh-brandenburg.de> writes:
> Hi,
>
> I'm currently trying to understand exactly how the current page state
> handling works, and I have a few problems around the generic_file_write
> function. The problems I see are:
> - the deadlock mentioned in generic_file_write is not really fixed, it's
> just a bit harder to trigger?
I don't know; I haven't traced that path.
> - if copy_from_user() fails the page is marked not uptodate. AFAIK this
> assumes that the page->buffers are still uptodate, so previous writes
> are not lost.
If copy_from_user fails, that is undefined behavior, and you have just lost
your previous writes because you ``overwrote'' them.
> - a concurrent read might outrun a write and so possibly see some new data
> from the write mixed with some old data.
Read/write are not atomic, so that is no problem.
>
> Please correct me if I'm wrong. Anyway, here are some ideas to address
> this.
> 1. We can add a nonblocking version of copy_(from|to)_user, which
> returns EAGAIN if it hits a locked page. Most of the needed code is in
> the slow path, so it doesn't affect performance. See
> arch/m68k/mm/fault.c for how to pass additional info from the fault
> handler.
> 2. We should make the state of a page completely independent of the
> underlying mapping mechanism.
> - A page shouldn't suddenly become not uptodate just because a read from
> user space fails, so we need to _clearly_ define who sets/clears which
> bit. Especially in error situations ClearPageUptodate() is called, but
> does the page data really become not uptodate?
Some kind of ``correct'' handling for this should happen so that if
the page is mapped we don't break invariants (like a mapped page
should always be uptodate). But other than that we are fine.
> - This also means the buffer pointer becomes private to the mapping
> mechanism, so it can be reused for other caching mechanisms (e.g. nfs
> wouldn't have to store its data separately).
Possibly. This is a bit of a corner case. It certainly looks good on
paper.
> 3. During a write we always keep at least one page locked and we don't
> release the previous page until we have locked the next. This means:
> - i_sem is not needed anymore, so multiple writers can access the
> file at the same time.
i_sem, I believe, is there to protect the file length and avoid weird
truncation races, as well as allowing things like O_APPEND to work.
I don't see how page-level locking helps with file size changes.
> - a read can't outrun a write anymore.
That isn't a problem. You have to do locking in user space to avoid
that if you want it.
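E.g. something like this (just a sketch; "datafile" is a placeholder
name): an application that really needs a read() not to see a
half-finished write() can serialize with advisory locks itself.

/* Plain user space serialization with an advisory lock; readers would
 * take LOCK_SH around their read().  "datafile" is just a placeholder. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
        int fd = open("datafile", O_RDWR | O_CREAT | O_APPEND, 0644);
        const char rec[] = "one record, written under the lock\n";

        if (fd < 0) {
                perror("open");
                return 1;
        }
        flock(fd, LOCK_EX);
        if (write(fd, rec, strlen(rec)) < 0)
                perror("write");
        flock(fd, LOCK_UN);
        close(fd);
        return 0;
}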
> - page locking has to happen completely at the higher layer, and keeping
> multiple pages locked would require something like 1).
?
> - this would allow passing multiple pages at once to the mapping
> mechanism, as we can easily link several pages together. That is
> actually all that is needed/wanted for streaming, with no need for a
> heavyweight kiobuf.
Hmm. For what you are suggesting, kiobufs aren't that bad. Not that
I'm supporting them, but since you are aiming at the cases they handle
just fine, I won't criticize them either.
>
> This is probably a bit sketchy, but the main idea is to further improve
> the page state handling and to remove dependencies/assumptions about the
> buffer handling. This would also allow better error handling, e.g. data
> for a removed medium could be saved in a temporary file instead of being
> thrown away, or one could even keep two media mounted in the same
> drive.
Create a pseudo block device if you want these kinds of semantics; they
should not be handled directly by the filesystem layer. At least not
unless someone comes up with a design where it just happens to fall out
naturally. (Unlikely.)
> Another possibility is to use/test several I/O mechanisms at the same
> time, or even to make them modular.
> Anyway, just some wild ideas :). I'm interested in any comments on
> whether the above is needed/desired. As it would mean some quite heavy
> changes, I'd like to make sure I'm not missing anything before starting
> to hack on it (which won't be too soon, as more important things are
> pending...).
Have fun reading the code and kibitzing.
Eric
* Re: page locking and error handling
From: Roman Zippel @ 2001-02-15 18:50 UTC
To: Eric W. Biederman; +Cc: linux-mm
Hi,
On 15 Feb 2001, Eric W. Biederman wrote:
> > - if copy_from_user() fails the page is marked not uptodate. AFAIK this
> > assumes that the page->buffers are still uptodate, so previous writes
> > are not lost.
> If copy_from_user fails, that is undefined behavior, and you have just lost
> your previous writes because you ``overwrote'' them.
What about partial writes?
> > 3. During a write we always keep at least one page locked and we don't
> > release the previous page until we have locked the next. This means:
> > - i_sem is not needed anymore, so multiple writers can access the
> > file at the same time.
>
> i_sem, I believe, is there to protect the file length and avoid weird
> truncation races, as well as allowing things like O_APPEND to work.
> I don't see how page-level locking helps with file size changes.
Of course locking of i_size is still needed, but i_sem is not needed for
the writing itself.
> > - this would allow passing multiple pages at once to the mapping
> > mechanism, as we can easily link several pages together. That is
> > actually all that is needed/wanted for streaming, with no need for a
> > heavyweight kiobuf.
>
> Hmm. For what you are suggesting, kiobufs aren't that bad. Not that
> I'm supporting them, but since you are aiming at the cases they handle
> just fine, I won't criticize them either.
A list of pages is more flexible and can be used in more situations than
a kiobuf; a lower layer can of course still use whatever it wants.
> > This is probably a bit sketchy, but the main idea is to further improve
> > the page state handling and to remove dependencies/assumptions about the
> > buffer handling. This would also allow better error handling, e.g. data
> > for a removed medium could be saved in a temporary file instead of being
> > thrown away, or one could even keep two media mounted in the same
> > drive.
>
> Create a pseudo block device if you want these kinds of semantics; they
> should not be handled directly by the filesystem layer. At least not
> unless someone comes up with a design where it just happens to fall out
> naturally. (Unlikely.)
A pseudo block device is probably needed, and it needs a few hooks to
switch over to it, but most of them are in the slow path. It also needs
some userspace support. Anyway, it doesn't require that many design
changes; mostly you only need to change the ClearPageUptodate() calls.
bye, Roman
* Re: page locking and error handling
From: Eric W. Biederman @ 2001-02-15 23:37 UTC
To: Roman Zippel; +Cc: linux-mm
Roman Zippel <zippel@fh-brandenburg.de> writes:
> Hi,
>
> On 15 Feb 2001, Eric W. Biederman wrote:
>
> > > - if copy_from_user() fails the page is marked not uptodate. AFAIK this
> > > assumes that the page->buffers are still uptodate, so previous writes
> > > are not lost.
> > If copy_from_user fails, that is undefined behavior, and you have just lost
> > your previous writes because you ``overwrote'' them.
>
> What about partial writes?
The important thing is that if copy_from_user fails, it is because a
buggy user space app passed a bad memory area. So you have undefined
behavior, and you can do whatever is convenient.
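From user space such a failure just looks like write() returning EFAULT,
or a short count if only part of the buffer is bad. A small sketch
("scratch" is just a placeholder file name):

/* What a failing copy_from_user() looks like from user space: the whole
 * buffer is bad, so write() returns -1 with EFAULT.  If only the tail of
 * the buffer were unmapped, write() would instead return a short count. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        int fd = open("scratch", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        const void *bad = (const void *)1;      /* deliberately bogus pointer */
        ssize_t n;

        if (fd < 0) {
                perror("open");
                return 1;
        }
        n = write(fd, bad, 64);
        if (n < 0)
                printf("write: %s\n", strerror(errno)); /* expect EFAULT */
        else
                printf("wrote %ld bytes\n", (long)n);
        close(fd);
        return 0;
}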
The only case to worry about is how we keep from breaking kernel
invariants. I think it may break an invariant to set an mmapped page as
not uptodate, but I can't see any other problems with the interface.
Eric
* Re: page locking and error handling
From: Marcelo Tosatti @ 2001-02-16 18:20 UTC
To: Roman Zippel; +Cc: linux-mm
On Thu, 15 Feb 2001, Roman Zippel wrote:
> Hi,
<snip>
> - page locking has to happen completely at the higher layer, and keeping
> multiple pages locked would require something like 1).
> - this would allow passing multiple pages at once to the mapping
> mechanism, as we can easily link several pages together. That is
> actually all that is needed/wanted for streaming, with no need for a
> heavyweight kiobuf.
At commit_write(), the buffers of the pages which are being written are
only marked dirty and not necessarily queued for IO. commit_write() will
start writing older dirty buffers with flush_dirty_buffers() if the
system is over a watermark of dirty data, which _may_ write out dirty
buffers from the current write() syscall. (O_SYNC is another story...)
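Roughly, the per-page loop in generic_file_write() looks like this
(heavily simplified paraphrase, not the actual mm/filemap.c code; kmap(),
error handling and i_size updates are left out), just to show where
commit_write() sits:

/* Simplified paraphrase of the 2.4 write path, not the real code. */
while (count) {
        unsigned long offset = pos & (PAGE_CACHE_SIZE - 1);
        unsigned long bytes = PAGE_CACHE_SIZE - offset;
        struct page *page;

        if (bytes > count)
                bytes = count;
        page = grab_cache_page(mapping, pos >> PAGE_CACHE_SHIFT);

        mapping->a_ops->prepare_write(file, page, offset, offset + bytes);
        copy_from_user(page_address(page) + offset, buf, bytes);
        /* Marks the buffers dirty and uptodate; the actual I/O is normally
         * deferred to bdflush/kupdate, unless too much dirty data forces a
         * synchronous flush_dirty_buffers() here (or O_SYNC is set). */
        mapping->a_ops->commit_write(file, page, offset, offset + bytes);

        UnlockPage(page);
        page_cache_release(page);
        buf += bytes; pos += bytes; count -= bytes;
}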