Re: [PATCH] ext2: Use page_mkwrite vma_operations to get mmap write notification.

linux-mm.kvack.org archive mirror

* Re: [PATCH] ext2: Use page_mkwrite vma_operations to get mmap write notification.
       [not found] <1212685513-32237-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
@ 2008-06-05 19:30 ` Andrew Morton
  2008-06-11 15:08   ` Aneesh Kumar K.V
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2008-06-05 19:30 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: cmm, jack, linux-ext4, linux-mm, linux-kernel

On Thu,  5 Jun 2008 22:35:12 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> We would like to get notified when we are doing a write on mmap
> section.  The changes are needed to handle ENOSPC when writing to an
> mmap section of files with holes.
> 

Whoa.  You didn't copy anything like enough mailing lists for a change
of this magnitude.  I added some.

This is a large change in behaviour!

a) applications will now get a synchronous SIGBUS when modifying a
   page over an ENOSPC filesystem.  Whereas previously they could have
   proceeded to completion and then detected the error via an fsync().

   It's going to take more than one skimpy little paragraph to
   justify this, and to demonstrate that it is preferable, and to
   convince us that nothing will break from this user-visible behaviour
   change.

b) we're now doing fs operations (and some I/O) in the pagefault
   code.  This has several implications:

   - performance changes

   - potential for deadlocks when a process takes the fault from
     within a copy_to_user() in, say, mm/filemap.c

   - performing additional memory allocations within that
     copy_to_user().  Possibility that these will reenter the
     filesystem.

And that's just ext2.

For ext3 things are even more complex, because we have the
journal_start/journal_end pair which is effectively another "lock" for
ranking/deadlock purposes.  And now we're taking i_alloc_sem and
lock_page and we're doing ->writepage() and its potential
journal_start(), all potentially within the context of a
copy_to_user().

Now, things become easier because copy_to_user() only happens on the
read() side of things, where we don't hold lock_page() and things are
generally simpler.

But still, this is a high-risk change.  I think we should require a lot
of convincing that issues such as the above have been suitably
considered and addressed, and that the change has had *intense*
testing.
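
For readers who want to see what point (a) means for an application, here is a minimal userspace sketch of catching a synchronous SIGBUS on a store through a shared mapping. It is not part of the patch. The helper names (store_byte_checked, try_store_in_page) are made up for illustration, and the fault is provoked by touching a page past end-of-file as a stand-in for ENOSPC on a hole, since filling a filesystem is not practical in a small example.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf sigbus_env;

static void on_sigbus(int sig)
{
	(void)sig;
	siglongjmp(sigbus_env, 1);	/* unwind out of the faulting store */
}

/* Store one byte through a mapping: 0 on success, -1 if SIGBUS was raised. */
static int store_byte_checked(volatile char *addr, char value)
{
	struct sigaction sa, old;
	int rc = 0;

	sa.sa_handler = on_sigbus;
	sa.sa_flags = 0;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, &old) != 0)
		return -1;

	if (sigsetjmp(sigbus_env, 1) == 0)
		*addr = value;		/* the fault, if any, is taken here */
	else
		rc = -1;		/* we came back via the handler */

	sigaction(SIGBUS, &old, NULL);
	return rc;
}

/*
 * Map two pages over a one-page temporary file and store into page
 * page_index (0 or 1).  Page 1 lies wholly past EOF, which is the
 * stand-in used here for "no space to back this page".
 * Returns the store result, or -2 if the setup itself failed.
 */
int try_store_in_page(int page_index)
{
	char path[] = "/tmp/mkwrite-demo-XXXXXX";
	long psz = sysconf(_SC_PAGESIZE);
	int fd = mkstemp(path);
	char *map;
	int rc;

	if (fd < 0 || psz <= 0)
		return -2;
	unlink(path);
	if (ftruncate(fd, psz) != 0) {
		close(fd);
		return -2;
	}
	map = mmap(NULL, 2 * (size_t)psz, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -2;
	}
	rc = store_byte_checked(map + (size_t)page_index * (size_t)psz, 'x');
	munmap(map, 2 * (size_t)psz);
	close(fd);
	return rc;
}
```

An application that never installs such a handler is simply killed by the signal, which is the user-visible change being questioned above.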

> index 47d88da..cc2e106 100644
> --- a/fs/ext2/ext2.h
> +++ b/fs/ext2/ext2.h
> @@ -136,6 +136,7 @@ extern void ext2_get_inode_flags(struct ext2_inode_info *);
>  int __ext2_write_begin(struct file *file, struct address_space *mapping,
>  		loff_t pos, unsigned len, unsigned flags,
>  		struct page **pagep, void **fsdata);
> +extern int ext2_page_mkwrite(struct vm_area_struct *vma, struct page *page);
>  
>  /* ioctl.c */
>  extern long ext2_ioctl(struct file *, unsigned int, unsigned long);
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index 5f2fa9c..d539dcf 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -18,6 +18,7 @@
>   * 	(jj@sunsite.ms.mff.cuni.cz)
>   */
>  
> +#include <linux/mm.h>
>  #include <linux/time.h>
>  #include "ext2.h"
>  #include "xattr.h"
> @@ -38,6 +39,24 @@ static int ext2_release_file (struct inode * inode, struct file * filp)
>  	return 0;
>  }
>  
> +static struct vm_operations_struct ext2_file_vm_ops = {
> +	.fault		= filemap_fault,
> +	.page_mkwrite   = ext2_page_mkwrite,
> +};
> +
> +static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct address_space *mapping = file->f_mapping;
> +
> +	if (!mapping->a_ops->readpage)
> +		return -ENOEXEC;

this copied-and-pasted test can now be removed.

> +	file_accessed(file);
> +	vma->vm_ops = &ext2_file_vm_ops;
> +	vma->vm_flags |= VM_CAN_NONLINEAR;
> +	return 0;
> +}
> +
> +
>  /*
>   * We have mostly NULL's here: the current defaults are ok for
>   * the ext2 filesystem.
> @@ -52,7 +71,7 @@ static int ext2_release_file (struct inode * inode, struct file * filp)
>  #ifdef CONFIG_COMPAT
>  	.compat_ioctl	= ext2_compat_ioctl,
>  #endif
> -	.mmap		= generic_file_mmap,
> +	.mmap		= ext2_file_mmap,
>  	.open		= generic_file_open,
>  	.release	= ext2_release_file,
>  	.fsync		= ext2_sync_file,
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 384fc0d..d4c5c23 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -1443,3 +1443,8 @@ int ext2_setattr(struct dentry *dentry, struct iattr *iattr)
>  		error = ext2_acl_chmod(inode);
>  	return error;
>  }
> +
> +int ext2_page_mkwrite(struct vm_area_struct *vma, struct page *page)
> +{
> +	return block_page_mkwrite(vma, page, ext2_get_block);
> +}
> -- 
> 1.5.5.1.357.g1af8b.dirty

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH] ext2: Use page_mkwrite vma_operations to get mmap write notification.
  2008-06-05 19:30 ` [PATCH] ext2: Use page_mkwrite vma_operations to get mmap write notification Andrew Morton
@ 2008-06-11 15:08   ` Aneesh Kumar K.V
  2008-06-11 19:07     ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Aneesh Kumar K.V @ 2008-06-11 15:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: cmm, jack, linux-ext4, linux-mm, linux-kernel

On Thu, Jun 05, 2008 at 12:30:45PM -0700, Andrew Morton wrote:
> On Thu,  5 Jun 2008 22:35:12 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> 
> > We would like to get notified when we are doing a write on mmap
> > section.  The changes are needed to handle ENOSPC when writing to an
> > mmap section of files with holes.
> > 
> 
> Whoa.  You didn't copy anything like enough mailing lists for a change
> of this magnitude.  I added some.
> 
> This is a large change in behaviour!
> 
> a) applications will now get a synchronous SIGBUS when modifying a
>    page over an ENOSPC filesystem.  Whereas previously they could have
>    proceeded to completion and then detected the error via an fsync().

Or not detect the error at all if we don't call fsync(), right? Isn't a
synchronous SIGBUS the right behaviour?
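
As an illustration of the "detect it later" model being discussed, here is a small userspace sketch in which the store through the mapping reports nothing and any delayed failure can only surface from msync()/fsync(). The function names flush_mapped_file and write_and_flush_demo are hypothetical, and the temporary file under /tmp is an assumption of the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Flush a shared mapping, then the file; 0 on success, negative errno otherwise. */
int flush_mapped_file(void *map, size_t len, int fd)
{
	if (msync(map, len, MS_SYNC) != 0)
		return -errno;
	if (fsync(fd) != 0)
		return -errno;
	return 0;
}

/* Store msg through a mapping of a fresh temporary file and flush it. */
int write_and_flush_demo(const char *msg)
{
	char path[] = "/tmp/fsync-demo-XXXXXX";
	size_t len = strlen(msg) + 1;
	int fd = mkstemp(path);
	char *map;
	int rc;

	if (fd < 0)
		return -errno;
	unlink(path);
	if (ftruncate(fd, (off_t)len) != 0) {	/* the file is a hole at this point */
		rc = -errno;
		close(fd);
		return rc;
	}
	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		rc = -errno;
		close(fd);
		return rc;
	}
	memcpy(map, msg, len);			/* the store itself reports nothing */
	rc = flush_mapped_file(map, len, fd);	/* a delayed error would show up here */
	munmap(map, len);
	close(fd);
	return rc;
}
```

A program that skips the flush step, or ignores its return value, has no way to learn that the data never reached the disk.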


> 
>    It's going to take more than one skimpy little paragraph to
>    justify this, and to demonstrate that it is preferable, and to
>    convince us that nothing will break from this user-visible behaviour
>    change.
> 
> b) we're now doing fs operations (and some I/O) in the pagefault
>    code.  This has several implications:
> 
>    - performance changes
> 
>    - potential for deadlocks when a process takes the fault from
>      within a copy_to_user() in, say, mm/filemap.c
> 
>    - performing additional memory allocations within that
>      copy_to_user().  Possibility that these will reenter the
>      filesystem.
> 
> And that's just ext2.
> 
> For ext3 things are even more complex, because we have the
> journal_start/journal_end pair which is effectively another "lock" for
> ranking/deadlock purposes.  And now we're taking i_alloc_sem and
> lock_page and we're doing ->writepage() and its potential
> journal_start(), all potentially within the context of a
> copy_to_user().

One of the reasons we would need this in ext3/ext4 is that we cannot
do block allocation in writepage with the recent locking changes.
The locking changes reverse the locking order of journal_start
and page_lock. writepage is called with the page already locked, so
we can't start the new transaction needed for block allocation.

But if we agree that we should not do block allocation in page_mkwrite
we need to add writepages and allocate blocks in writepages.

> 
> Now, things become easier because copy_to_user() only happens on the
> read() side of things, where we don't hold lock_page() and things are
> generally simpler.
> 
> But still, this is a high-risk change.  I think we should require a lot
> of convincing that issues such as the above have been suitably
> considered and addressed, and that the change has had *intense*
> testing.
> 
> > index 47d88da..cc2e106 100644
> > --- a/fs/ext2/ext2.h
> > +++ b/fs/ext2/ext2.h
> > @@ -136,6 +136,7 @@ extern void ext2_get_inode_flags(struct ext2_inode_info *);
> >  int __ext2_write_begin(struct file *file, struct address_space *mapping,
> >  		loff_t pos, unsigned len, unsigned flags,
> >  		struct page **pagep, void **fsdata);
> > +extern int ext2_page_mkwrite(struct vm_area_struct *vma, struct page *page);
> >  
> >  /* ioctl.c */
> >  extern long ext2_ioctl(struct file *, unsigned int, unsigned long);
> > diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> > index 5f2fa9c..d539dcf 100644
> > --- a/fs/ext2/file.c
> > +++ b/fs/ext2/file.c
> > @@ -18,6 +18,7 @@
> >   * 	(jj@sunsite.ms.mff.cuni.cz)
> >   */
> >  
> > +#include <linux/mm.h>
> >  #include <linux/time.h>
> >  #include "ext2.h"
> >  #include "xattr.h"
> > @@ -38,6 +39,24 @@ static int ext2_release_file (struct inode * inode, struct file * filp)
> >  	return 0;
> >  }
> >  
> > +static struct vm_operations_struct ext2_file_vm_ops = {
> > +	.fault		= filemap_fault,
> > +	.page_mkwrite   = ext2_page_mkwrite,
> > +};
> > +
> > +static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +	struct address_space *mapping = file->f_mapping;
> > +
> > +	if (!mapping->a_ops->readpage)
> > +		return -ENOEXEC;
> 
> this copied-and-pasted test can now be removed.
> 
> > +	file_accessed(file);
> > +	vma->vm_ops = &ext2_file_vm_ops;
> > +	vma->vm_flags |= VM_CAN_NONLINEAR;
> > +	return 0;
> > +}
> > +
> > +
> >  /*
> >   * We have mostly NULL's here: the current defaults are ok for
> >   * the ext2 filesystem.
> > @@ -52,7 +71,7 @@ static int ext2_release_file (struct inode * inode, struct file * filp)
> >  #ifdef CONFIG_COMPAT
> >  	.compat_ioctl	= ext2_compat_ioctl,
> >  #endif
> > -	.mmap		= generic_file_mmap,
> > +	.mmap		= ext2_file_mmap,
> >  	.open		= generic_file_open,
> >  	.release	= ext2_release_file,
> >  	.fsync		= ext2_sync_file,
> > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> > index 384fc0d..d4c5c23 100644
> > --- a/fs/ext2/inode.c
> > +++ b/fs/ext2/inode.c
> > @@ -1443,3 +1443,8 @@ int ext2_setattr(struct dentry *dentry, struct iattr *iattr)
> >  		error = ext2_acl_chmod(inode);
> >  	return error;
> >  }
> > +
> > +int ext2_page_mkwrite(struct vm_area_struct *vma, struct page *page)
> > +{
> > +	return block_page_mkwrite(vma, page, ext2_get_block);
> > +}
> > -- 

-aneesh


* Re: [PATCH] ext2: Use page_mkwrite vma_operations to get mmap write notification.
  2008-06-11 15:08   ` Aneesh Kumar K.V
@ 2008-06-11 19:07     ` Andrew Morton
  2008-06-12  4:06       ` Aneesh Kumar K.V
  2008-06-12 16:17       ` Jan Kara
  0 siblings, 2 replies; 7+ messages in thread
From: Andrew Morton @ 2008-06-11 19:07 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: cmm, jack, linux-ext4, linux-mm, linux-kernel

On Wed, 11 Jun 2008 20:38:45 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> On Thu, Jun 05, 2008 at 12:30:45PM -0700, Andrew Morton wrote:
> > On Thu,  5 Jun 2008 22:35:12 +0530
> > "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> > 
> > > We would like to get notified when we are doing a write on mmap
> > > section.  The changes are needed to handle ENOSPC when writing to an
> > > mmap section of files with holes.
> > > 
> > 
> > Whoa.  You didn't copy anything like enough mailing lists for a change
> > of this magnitude.  I added some.
> > 
> > This is a large change in behaviour!
> > 
> > a) applications will now get a synchronous SIGBUS when modifying a
> >    page over an ENOSPC filesystem.  Whereas previously they could have
> >    proceeded to completion and then detected the error via an fsync().
> 
> Or not detect the error at all if we don't call fsync() right ? Isn't a
> synchronous SIGBUS the right behaviour ?
>

Not according to POSIX.  Or at least posix-several-years-ago, when this
last was discussed.  The spec doesn't have much useful to say about any
of this.

It's a significant change in the userspace interface.

> 
> > 
> >    It's going to take more than one skimpy little paragraph to
> >    justify this, and to demonstrate that it is preferable, and to
> >    convince us that nothing will break from this user-visible behaviour
> >    change.
> > 
> > b) we're now doing fs operations (and some I/O) in the pagefault
> >    code.  This has several implications:
> > 
> >    - performance changes
> > 
> >    - potential for deadlocks when a process takes the fault from
> >      within a copy_to_user() in, say, mm/filemap.c
> > 
> >    - performing additional memory allocations within that
> >      copy_to_user().  Possibility that these will reenter the
> >      filesystem.
> > 
> > And that's just ext2.
> > 
> > For ext3 things are even more complex, because we have the
> > journal_start/journal_end pair which is effectively another "lock" for
> > ranking/deadlock purposes.  And now we're taking i_alloc_sem and
> > lock_page and we're doing ->writepage() and its potential
> > journal_start(), all potentially within the context of a
> > copy_to_user().
> 
> One of the reason why we would need this in ext3/ext4 is that we cannot
> do block allocation in the writepage with the recent locking changes.

Perhaps those recent locking changes were wrong.

> The locking changes involve changing the locking order of journal_start
> and page_lock. With writepage we are already called with page_lock and
> we can't start new transaction needed for block allocation.

ext3_write_begin() has journal_start() nesting inside the lock_page().
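
To make the ordering problem concrete, here is a toy userspace checker in the spirit of lock-order validation. It is only a sketch: the two lock classes and every name in it (lock_graph, lock_ctx, ctx_acquire) are invented for the example and are not kernel APIs. It records which class was taken while another was held and flags the first acquisition that contradicts an order seen earlier.

```c
#include <string.h>

enum lock_class { LOCK_PAGE, LOCK_JOURNAL, LOCK_CLASS_MAX };

struct lock_graph {
	/* before[a][b] != 0: some context took b while already holding a */
	unsigned char before[LOCK_CLASS_MAX][LOCK_CLASS_MAX];
};

struct lock_ctx {
	enum lock_class held[LOCK_CLASS_MAX];
	int depth;
};

void lock_graph_init(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/* Returns 0 if the acquisition fits the recorded order, -1 on an inversion. */
int ctx_acquire(struct lock_graph *g, struct lock_ctx *c, enum lock_class cls)
{
	int i;

	if (c->depth >= LOCK_CLASS_MAX)
		return -1;		/* toy limit: at most one hold per class */

	/* Has anyone ever taken one of our held locks while holding cls? */
	for (i = 0; i < c->depth; i++)
		if (g->before[cls][c->held[i]])
			return -1;

	for (i = 0; i < c->depth; i++)
		g->before[c->held[i]][cls] = 1;
	c->held[c->depth++] = cls;
	return 0;
}

/* Drop the most recently acquired lock. */
void ctx_release(struct lock_ctx *c)
{
	if (c->depth > 0)
		c->depth--;
}
```

With one context taking LOCK_PAGE then LOCK_JOURNAL (the write_begin order mentioned above) and another taking LOCK_JOURNAL then LOCK_PAGE, the second context's final ctx_acquire() returns -1.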

> But if we agree that we should not do block allocation in page_mkwrite
> we need to add writepages and allocate blocks in writepages.

I'm not sure what writepages has to do with pagefaults?


* Re: [PATCH] ext2: Use page_mkwrite vma_operations to get mmap write notification.
  2008-06-11 19:07     ` Andrew Morton
@ 2008-06-12  4:06       ` Aneesh Kumar K.V
  2008-06-12 12:22         ` Chris Mason
  2008-06-12 16:17       ` Jan Kara
  1 sibling, 1 reply; 7+ messages in thread
From: Aneesh Kumar K.V @ 2008-06-12  4:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: cmm, jack, linux-ext4, linux-mm, linux-kernel

On Wed, Jun 11, 2008 at 12:07:49PM -0700, Andrew Morton wrote:
> On Wed, 11 Jun 2008 20:38:45 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> 
> > On Thu, Jun 05, 2008 at 12:30:45PM -0700, Andrew Morton wrote:
> > > On Thu,  5 Jun 2008 22:35:12 +0530
> > > "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> > > 
> > > > We would like to get notified when we are doing a write on mmap
> > > > section.  The changes are needed to handle ENOSPC when writing to an
> > > > mmap section of files with holes.
> > > > 
> > > 
> > > Whoa.  You didn't copy anything like enough mailing lists for a change
> > > of this magnitude.  I added some.
> > > 
> > > This is a large change in behaviour!
> > > 
> > > a) applications will now get a synchronous SIGBUS when modifying a
> > >    page over an ENOSPC filesystem.  Whereas previously they could have
> > >    proceeded to completion and then detected the error via an fsync().
> > 
> > Or not detect the error at all if we don't call fsync() right ? Isn't a
> > synchronous SIGBUS the right behaviour ?
> >
> 
> Not according to POSIX.  Or at least posix-several-years-ago, when this
> last was discussed.  The spec doesn't have much useful to say about any
> of this.
> 
> It's a significant change in the userspace interface.
> 
> > 
> > > 
> > >    It's going to take more than one skimpy little paragraph to
> > >    justify this, and to demonstrate that it is preferable, and to
> > >    convince us that nothing will break from this user-visible behaviour
> > >    change.
> > > 
> > > b) we're now doing fs operations (and some I/O) in the pagefault
> > >    code.  This has several implications:
> > > 
> > >    - performance changes
> > > 
> > >    - potential for deadlocks when a process takes the fault from
> > >      within a copy_to_user() in, say, mm/filemap.c
> > > 
> > >    - performing additional memory allocations within that
> > >      copy_to_user().  Possibility that these will reenter the
> > >      filesystem.
> > > 
> > > And that's just ext2.
> > > 
> > > For ext3 things are even more complex, because we have the
> > > journal_start/journal_end pair which is effectively another "lock" for
> > > ranking/deadlock purposes.  And now we're taking i_alloc_sem and
> > > lock_page and we're doing ->writepage() and its potential
> > > journal_start(), all potentially within the context of a
> > > copy_to_user().
> > 
> > One of the reason why we would need this in ext3/ext4 is that we cannot
> > do block allocation in the writepage with the recent locking changes.
> 
> Perhaps those recent locking changes were wrong.
> 
> > The locking changes involve changing the locking order of journal_start
> > and page_lock. With writepage we are already called with page_lock and
> > we can't start new transaction needed for block allocation.
> 
> ext3_write_begin() has journal_start() nesting inside the lock_page().
> 

All of those were changed as part of the lock inversion changes.



> > But if we agree that we should not do block allocation in page_mkwrite
> > we need to add writepages and allocate blocks in writepages.
> 
> I'm not sure what writepages has to do with pagefaults?
> 

The idea is to have ext3/4_writepages. In writepages, start a transaction,
iterate over the pages, take the page lock and do block allocation. With
that change we should be able to avoid block allocation in the
page_mkwrite path. We may still want to do block reservation there.

Something like:

ext4_writepages()
{
	journal_start();
	for_each_page() {
		lock_page();
		if (bh_unmapped()...)
			block_alloc();
		unlock_page();
	}
	journal_stop();
}

ext4_writepage()
{
	for_each_buffer_head() {
		if (bh_unmapped()) {
			/* no allocation here: leave the page dirty for writepages */
			redirty_page();
			unlock_page();
			return;
		}
	}
}
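
The same split can be played through in a small userspace model. This is only a toy under stated assumptions: the types and functions (toy_fs, toy_page, toy_writepages, toy_writepage) are invented for the sketch, page locking is left out, and one block per page is assumed.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
	bool dirty;
	bool mapped;		/* true once a block backs the page */
};

struct toy_fs {
	long free_blocks;
	int journal_starts;	/* how often a transaction was opened */
};

/* writepage stand-in: never allocates; an unmapped page stays dirty. */
int toy_writepage(struct toy_page *pg)
{
	if (!pg->mapped) {
		pg->dirty = true;	/* redirty and give up for now */
		return 0;
	}
	pg->dirty = false;		/* pretend the I/O was submitted */
	return 1;
}

/*
 * writepages stand-in: one transaction around the allocation pass,
 * then plain writepage calls.  Returns the number of pages written.
 */
int toy_writepages(struct toy_fs *fs, struct toy_page *pages, size_t n)
{
	size_t i;
	int written = 0;

	fs->journal_starts++;		/* journal_start() */
	for (i = 0; i < n; i++) {
		if (pages[i].dirty && !pages[i].mapped && fs->free_blocks > 0) {
			fs->free_blocks--;
			pages[i].mapped = true;
		}
	}
	/* journal_stop() would go here */

	for (i = 0; i < n; i++)
		if (pages[i].dirty)
			written += toy_writepage(&pages[i]);
	return written;
}
```

When the pool runs dry, the leftover page simply stays dirty and unmapped, which is exactly the case where the application is never told about ENOSPC at store time.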


* Re: [PATCH] ext2: Use page_mkwrite vma_operations to get mmap write notification.
  2008-06-12  4:06       ` Aneesh Kumar K.V
@ 2008-06-12 12:22         ` Chris Mason
  0 siblings, 0 replies; 7+ messages in thread
From: Chris Mason @ 2008-06-12 12:22 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, cmm, jack, linux-ext4, linux-mm, linux-kernel

On Thu, 2008-06-12 at 09:36 +0530, Aneesh Kumar K.V wrote:
> On Wed, Jun 11, 2008 at 12:07:49PM -0700, Andrew Morton wrote:
> > On Wed, 11 Jun 2008 20:38:45 +0530
> > "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> The idea is to have ext3/4_writepages. In writepages start a transaction
> and iterate over the pages take the lock and do block allocation. With
> that change we should be able to not do block allocation in the
> page_mkwrite path. We may still want to do block reservation there.
> 
> Something like.
> 
> ext4_writepages()
> {
> 	journal_start()
> 	for_each_page()

Even with delayed allocation, the vast majority of the pages won't need
any allocations.  You'll hit delalloc, do a big chunk with the journal
lock held and then do simple writepages that don't need anything
special.

I know the jbd journal_start is cheaper than the reiserfs one is, but it
might not perform well to hold it across the long writepages loop.  At
least reiser saw a good boost when I stopped calling journal_begin in
writepage unless the page really needed allocations.

With the loop you have in mind, it is easy enough to back out and start
the transaction only when required.
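
A rough userspace model of that lazy variant, for illustration only: the names (lazy_page, lazy_stats, lazy_writepages) are hypothetical, free space is assumed to be unlimited, and the page lock that would have to be dropped and retaken around the transaction start is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

struct lazy_page {
	bool dirty;
	bool mapped;
};

struct lazy_stats {
	int journal_starts;	/* transactions opened */
	int allocations;	/* blocks allocated */
};

/* Write back n pages, opening a transaction only if some page needs a block. */
void lazy_writepages(struct lazy_page *pages, size_t n, struct lazy_stats *st)
{
	bool in_txn = false;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!pages[i].dirty)
			continue;
		if (!pages[i].mapped) {
			if (!in_txn) {
				st->journal_starts++;	/* journal_start(), on demand */
				in_txn = true;
			}
			pages[i].mapped = true;
			st->allocations++;
		}
		pages[i].dirty = false;		/* pretend the I/O was submitted */
	}
	/* journal_stop() would go here when in_txn is set */
}
```

A run over pages that are all mapped already leaves journal_starts at zero, which is the common case described above.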

-chris


* Re: [PATCH] ext2: Use page_mkwrite vma_operations to get mmap write notification.
  2008-06-11 19:07     ` Andrew Morton
  2008-06-12  4:06       ` Aneesh Kumar K.V
@ 2008-06-12 16:17       ` Jan Kara
  2008-06-22 22:50         ` Dave Chinner
  1 sibling, 1 reply; 7+ messages in thread
From: Jan Kara @ 2008-06-12 16:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Aneesh Kumar K.V, cmm, linux-ext4, linux-mm, linux-kernel

On Wed 11-06-08 12:07:49, Andrew Morton wrote:
> On Wed, 11 Jun 2008 20:38:45 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> 
> > On Thu, Jun 05, 2008 at 12:30:45PM -0700, Andrew Morton wrote:
> > > On Thu,  5 Jun 2008 22:35:12 +0530
> > > "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> > > 
> > > > We would like to get notified when we are doing a write on mmap
> > > > section.  The changes are needed to handle ENOSPC when writing to an
> > > > mmap section of files with holes.
> > > > 
> > > 
> > > Whoa.  You didn't copy anything like enough mailing lists for a change
> > > of this magnitude.  I added some.
> > > 
> > > This is a large change in behaviour!
> > > 
> > > a) applications will now get a synchronous SIGBUS when modifying a
> > >    page over an ENOSPC filesystem.  Whereas previously they could have
> > >    proceeded to completion and then detected the error via an fsync().
> > 
> > Or not detect the error at all if we don't call fsync() right ? Isn't a
> > synchronous SIGBUS the right behaviour ?
> >
> 
> Not according to POSIX.  Or at least posix-several-years-ago, when this
> last was discussed.  The spec doesn't have much useful to say about any
> of this.
> 
> It's a significant change in the userspace interface.
> 
> > 
> > > 
> > >    It's going to take more than one skimpy little paragraph to
> > >    justify this, and to demonstrate that it is preferable, and to
> > >    convince us that nothing will break from this user-visible behaviour
> > >    change.
> > > 
> > > b) we're now doing fs operations (and some I/O) in the pagefault
> > >    code.  This has several implications:
> > > 
> > >    - performance changes
> > > 
> > >    - potential for deadlocks when a process takes the fault from
> > >      within a copy_to_user() in, say, mm/filemap.c
> > > 
> > >    - performing additional memory allocations within that
> > >      copy_to_user().  Possibility that these will reenter the
> > >      filesystem.
> > > 
> > > And that's just ext2.
> > > 
> > > For ext3 things are even more complex, because we have the
> > > journal_start/journal_end pair which is effectively another "lock" for
> > > ranking/deadlock purposes.  And now we're taking i_alloc_sem and
> > > lock_page and we're doing ->writepage() and its potential
> > > journal_start(), all potentially within the context of a
> > > copy_to_user().
> > 
> > One of the reason why we would need this in ext3/ext4 is that we cannot
> > do block allocation in the writepage with the recent locking changes.
> 
> Perhaps those recent locking changes were wrong.
  Well, the locking changes are those reversing the lock ordering of
transaction start and page lock - we have them in ext4 and Aneesh seems
to be looking into porting them to ext3 (at least the ordered mode rewrite
needs them). I wouldn't say they are wrong in principle.
  It's easier to use page_mkwrite() to allocate blocks so that
later in writepage() we don't have to do block allocation, which needs to
start a transaction (because that means unlocking the page, which quickly
gets nasty to handle properly...).
  BTW: XFS, OCFS2 or GFS2 define page_mkwrite() in this manner so they do
return SIGBUS when you run out of space when writing to mmapped hole. So
it's not like this change is introducing completely new behavior... I can
understand that we might not want to change the behavior for ext2 or ext3
but ext4 is IMO definitely free to choose.
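
  As a toy illustration of allocation at fault time (userspace only, with made-up names space_pool, hole_page and toy_page_mkwrite, and one block per page assumed): the hook either backs the page with a block or fails with -ENOSPC, and it is that failure which the fault path would be expected to turn into SIGBUS.

```c
#include <errno.h>
#include <stdbool.h>

struct space_pool {
	long free_blocks;
};

struct hole_page {
	bool mapped;		/* true once a block backs the page */
};

/*
 * Fault-time hook stand-in, called before a page becomes writable.
 * Returns 0 if the page is (now) backed by a block, -ENOSPC otherwise.
 */
int toy_page_mkwrite(struct space_pool *pool, struct hole_page *pg)
{
	if (pg->mapped)
		return 0;		/* nothing to allocate */
	if (pool->free_blocks <= 0)
		return -ENOSPC;		/* reported now, not at writeback */
	pool->free_blocks--;
	pg->mapped = true;
	return 0;
}
```

Once the hook has succeeded for a page, writeback never has to allocate for it, so no transaction needs to be started under the page lock.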

> > The locking changes involve changing the locking order of journal_start
> > and page_lock. With writepage we are already called with page_lock and
> > we can't start new transaction needed for block allocation.
> 
> ext3_write_begin() has journal_start() nesting inside the lock_page().
> 
> > But if we agree that we should not do block allocation in page_mkwrite
> > we need to add writepages and allocate blocks in writepages.
> 
> I'm not sure what writepages has to do with pagefaults?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH] ext2: Use page_mkwrite vma_operations to get mmap write notification.
  2008-06-12 16:17       ` Jan Kara
@ 2008-06-22 22:50         ` Dave Chinner
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2008-06-22 22:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Aneesh Kumar K.V, cmm, linux-ext4, linux-mm, linux-kernel

On Thu, Jun 12, 2008 at 06:17:06PM +0200, Jan Kara wrote:
>   BTW: XFS, OCFS2 or GFS2 define page_mkwrite() in this manner so they do
> return SIGBUS when you run out of space when writing to mmapped hole. So
> it's not like this change is introducing completely new behavior... I can
> understand that we might not want to change the behavior for ext2 or ext3
> but ext4 is IMO definitely free to choose.

Yup, and it's the only sane behaviour, IMO. Letting the application
continue to oversubscribe filesystem space and then throwing away
the data that can't be written well after the fact (potentially
after the application has gone away) is a horrendously bad failure
mode.

This was one of the main publicised features of ->page_mkwrite() -
that it would allow up front detection of ENOSPC conditions during
mmap writes. I'm extremely surprised to see that this is being
considered undesirable after all this time....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

