* [patch 01/41] mm: revert KERNEL_DS buffered write optimisation
[not found] <20070524052844.860329000@suse.de>
@ 2007-05-25 12:21 ` npiggin
2007-05-25 12:21 ` [patch 02/41] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6 npiggin, Andrew Morton
` (11 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: npiggin @ 2007-05-25 12:21 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management, Neil Brown
[-- Attachment #1: mm-revert-nfsd-writev-opt.patch --]
[-- Type: text/plain, Size: 2247 bytes --]
Revert the patch from Neil Brown to optimise NFSD writev handling.
Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
mm/filemap.c | 32 +++++++++++++-------------------
1 file changed, 13 insertions(+), 19 deletions(-)
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1948,27 +1948,21 @@ generic_file_buffered_write(struct kiocb
/* Limit the size of the copy to the caller's write size */
bytes = min(bytes, count);
- /* We only need to worry about prefaulting when writes are from
- * user-space. NFSd uses vfs_writev with several non-aligned
- * segments in the vector, and limiting to one segment a time is
- * a noticeable performance for re-write
+ /*
+ * Limit the size of the copy to that of the current segment,
+ * because fault_in_pages_readable() doesn't know how to walk
+ * segments.
*/
- if (!segment_eq(get_fs(), KERNEL_DS)) {
- /*
- * Limit the size of the copy to that of the current
- * segment, because fault_in_pages_readable() doesn't
- * know how to walk segments.
- */
- bytes = min(bytes, cur_iov->iov_len - iov_base);
+ bytes = min(bytes, cur_iov->iov_len - iov_base);
+
+ /*
+ * Bring in the user page that we will copy from _first_.
+ * Otherwise there's a nasty deadlock on copying from the
+ * same page as we're writing to, without it being marked
+ * up-to-date.
+ */
+ fault_in_pages_readable(buf, bytes);
- /*
- * Bring in the user page that we will copy from
- * _first_. Otherwise there's a nasty deadlock on
- * copying from the same page as we're writing to,
- * without it being marked up-to-date.
- */
- fault_in_pages_readable(buf, bytes);
- }
page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
if (!page) {
status = -ENOMEM;
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 02/41] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6
[not found] <20070524052844.860329000@suse.de>
2007-05-25 12:21 ` [patch 01/41] mm: revert KERNEL_DS buffered write optimisation npiggin
@ 2007-05-25 12:21 ` npiggin, Andrew Morton
2007-05-25 12:21 ` [patch 03/41] Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83 npiggin, Andrew Morton
` (10 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: npiggin, Andrew Morton @ 2007-05-25 12:21 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management, Andrew Morton
[-- Attachment #1: mm-revert-buffered-write-zero-length-iov.patch --]
[-- Type: text/plain, Size: 1848 bytes --]
This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we
also revert.
Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
mm/filemap.c | 9 +--------
mm/filemap.h | 4 ++--
2 files changed, 3 insertions(+), 10 deletions(-)
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1969,12 +1969,6 @@ generic_file_buffered_write(struct kiocb
break;
}
- if (unlikely(bytes == 0)) {
- status = 0;
- copied = 0;
- goto zero_length_segment;
- }
-
status = a_ops->prepare_write(file, page, offset, offset+bytes);
if (unlikely(status)) {
loff_t isize = i_size_read(inode);
@@ -2004,8 +1998,7 @@ generic_file_buffered_write(struct kiocb
page_cache_release(page);
continue;
}
-zero_length_segment:
- if (likely(copied >= 0)) {
+ if (likely(copied > 0)) {
if (!status)
status = copied;
Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -87,7 +87,7 @@ filemap_set_next_iovec(const struct iove
const struct iovec *iov = *iovp;
size_t base = *basep;
- do {
+ while (bytes) {
int copy = min(bytes, iov->iov_len - base);
bytes -= copy;
@@ -96,7 +96,7 @@ filemap_set_next_iovec(const struct iove
iov++;
base = 0;
}
- } while (bytes);
+ }
*iovp = iov;
*basep = base;
}
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 03/41] Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83
[not found] <20070524052844.860329000@suse.de>
2007-05-25 12:21 ` [patch 01/41] mm: revert KERNEL_DS buffered write optimisation npiggin
2007-05-25 12:21 ` [patch 02/41] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6 npiggin, Andrew Morton
@ 2007-05-25 12:21 ` npiggin, Andrew Morton
2007-05-25 12:21 ` [patch 04/41] mm: clean up buffered write code npiggin, Andrew Morton
` (9 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: npiggin, Andrew Morton @ 2007-05-25 12:21 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management, Andrew Morton
[-- Attachment #1: mm-revert-buffered-write-deadlock-fix.patch --]
[-- Type: text/plain, Size: 3110 bytes --]
This patch fixed the following bug:
When prefaulting in the pages in generic_file_buffered_write(), we only
faulted in the pages for the firts segment of the iovec. If the second of
successive segment described a mmapping of the page into which we're
write()ing, and that page is not up-to-date, the fault handler tries to lock
the already-locked page (to bring it up to date) and deadlocks.
An exploit for this bug is in writev-deadlock-demo.c, in
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.
(These demos assume blocksize < PAGE_CACHE_SIZE).
The problem with this fix is that it takes the kernel back to doing a single
prepare_write()/commit_write() per iovec segment. So in the worst case we'll
run prepare_write+commit_write 1024 times where we previously would have run
it once. The other problem with the fix is that it fix all the locking problems.
<insert numbers obtained via ext3-tools's writev-speed.c here>
And apparently this change killed NFS overwrite performance, because, I
suppose, it talks to the server for each prepare_write+commit_write.
So just back that patch out - we'll be fixing the deadlock by other means.
Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Nick says: also it only ever actually papered over the bug, because after
faulting in the pages, they might be unmapped or reclaimed.
Signed-off-by: Nick Piggin <npiggin@suse.de>
mm/filemap.c | 18 +++++++-----------
1 file changed, 7 insertions(+), 11 deletions(-)
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1939,21 +1939,14 @@ generic_file_buffered_write(struct kiocb
do {
unsigned long index;
unsigned long offset;
+ unsigned long maxlen;
size_t copied;
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
-
- /* Limit the size of the copy to the caller's write size */
- bytes = min(bytes, count);
-
- /*
- * Limit the size of the copy to that of the current segment,
- * because fault_in_pages_readable() doesn't know how to walk
- * segments.
- */
- bytes = min(bytes, cur_iov->iov_len - iov_base);
+ if (bytes > count)
+ bytes = count;
/*
* Bring in the user page that we will copy from _first_.
@@ -1961,7 +1954,10 @@ generic_file_buffered_write(struct kiocb
* same page as we're writing to, without it being marked
* up-to-date.
*/
- fault_in_pages_readable(buf, bytes);
+ maxlen = cur_iov->iov_len - iov_base;
+ if (maxlen > bytes)
+ maxlen = bytes;
+ fault_in_pages_readable(buf, maxlen);
page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
if (!page) {
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 04/41] mm: clean up buffered write code
[not found] <20070524052844.860329000@suse.de>
` (2 preceding siblings ...)
2007-05-25 12:21 ` [patch 03/41] Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83 npiggin, Andrew Morton
@ 2007-05-25 12:21 ` npiggin, Andrew Morton
2007-05-25 12:21 ` [patch 05/41] mm: debug write deadlocks npiggin
` (8 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: npiggin, Andrew Morton @ 2007-05-25 12:21 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management, Andrew Morton
[-- Attachment #1: mm-generic_file_buffered_write-cleanup.patch --]
[-- Type: text/plain, Size: 3605 bytes --]
Rename some variables and fix some types.
Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
mm/filemap.c | 35 ++++++++++++++++++-----------------
1 file changed, 18 insertions(+), 17 deletions(-)
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1912,16 +1912,15 @@ generic_file_buffered_write(struct kiocb
size_t count, ssize_t written)
{
struct file *file = iocb->ki_filp;
- struct address_space * mapping = file->f_mapping;
+ struct address_space *mapping = file->f_mapping;
const struct address_space_operations *a_ops = mapping->a_ops;
struct inode *inode = mapping->host;
long status = 0;
struct page *page;
struct page *cached_page = NULL;
- size_t bytes;
struct pagevec lru_pvec;
const struct iovec *cur_iov = iov; /* current iovec */
- size_t iov_base = 0; /* offset in the current iovec */
+ size_t iov_offset = 0; /* offset in the current iovec */
char __user *buf;
pagevec_init(&lru_pvec, 0);
@@ -1932,31 +1931,33 @@ generic_file_buffered_write(struct kiocb
if (likely(nr_segs == 1))
buf = iov->iov_base + written;
else {
- filemap_set_next_iovec(&cur_iov, &iov_base, written);
- buf = cur_iov->iov_base + iov_base;
+ filemap_set_next_iovec(&cur_iov, &iov_offset, written);
+ buf = cur_iov->iov_base + iov_offset;
}
do {
- unsigned long index;
- unsigned long offset;
- unsigned long maxlen;
- size_t copied;
+ pgoff_t index; /* Pagecache index for current page */
+ unsigned long offset; /* Offset into pagecache page */
+ unsigned long maxlen; /* Bytes remaining in current iovec */
+ size_t bytes; /* Bytes to write to page */
+ size_t copied; /* Bytes copied from user */
- offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+ offset = (pos & (PAGE_CACHE_SIZE - 1));
index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
if (bytes > count)
bytes = count;
+ maxlen = cur_iov->iov_len - iov_offset;
+ if (maxlen > bytes)
+ maxlen = bytes;
+
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
* same page as we're writing to, without it being marked
* up-to-date.
*/
- maxlen = cur_iov->iov_len - iov_base;
- if (maxlen > bytes)
- maxlen = bytes;
fault_in_pages_readable(buf, maxlen);
page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
@@ -1987,7 +1988,7 @@ generic_file_buffered_write(struct kiocb
buf, bytes);
else
copied = filemap_copy_from_user_iovec(page, offset,
- cur_iov, iov_base, bytes);
+ cur_iov, iov_offset, bytes);
flush_dcache_page(page);
status = a_ops->commit_write(file, page, offset, offset+bytes);
if (status == AOP_TRUNCATED_PAGE) {
@@ -2005,12 +2006,12 @@ generic_file_buffered_write(struct kiocb
buf += status;
if (unlikely(nr_segs > 1)) {
filemap_set_next_iovec(&cur_iov,
- &iov_base, status);
+ &iov_offset, status);
if (count)
buf = cur_iov->iov_base +
- iov_base;
+ iov_offset;
} else {
- iov_base += status;
+ iov_offset += status;
}
}
}
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 05/41] mm: debug write deadlocks
[not found] <20070524052844.860329000@suse.de>
` (3 preceding siblings ...)
2007-05-25 12:21 ` [patch 04/41] mm: clean up buffered write code npiggin, Andrew Morton
@ 2007-05-25 12:21 ` npiggin
2007-05-25 12:21 ` [patch 06/41] mm: trim more holes npiggin
` (7 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: npiggin @ 2007-05-25 12:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management
[-- Attachment #1: mm-debug-write-deadlocks.patch --]
[-- Type: text/plain, Size: 1360 bytes --]
Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate the
difficult race where the page may be unmapped before calling copy_from_user.
Makes the race much easier to hit.
This is useful for demonstration and testing purposes, but is removed in a
subsequent patch.
Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
mm/filemap.c | 2 ++
1 file changed, 2 insertions(+)
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1952,6 +1952,7 @@ generic_file_buffered_write(struct kiocb
if (maxlen > bytes)
maxlen = bytes;
+#ifndef CONFIG_DEBUG_VM
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
@@ -1959,6 +1960,7 @@ generic_file_buffered_write(struct kiocb
* up-to-date.
*/
fault_in_pages_readable(buf, maxlen);
+#endif
page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
if (!page) {
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 06/41] mm: trim more holes
[not found] <20070524052844.860329000@suse.de>
` (4 preceding siblings ...)
2007-05-25 12:21 ` [patch 05/41] mm: debug write deadlocks npiggin
@ 2007-05-25 12:21 ` npiggin
2007-05-25 12:21 ` [patch 07/41] mm: buffered write cleanup npiggin
` (6 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: npiggin @ 2007-05-25 12:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management
[-- Attachment #1: mm-trim-more-holes.patch --]
[-- Type: text/plain, Size: 3606 bytes --]
If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then
we may have failed the write operation despite prepare_write having
instantiated blocks past i_size. Fix this, and consolidate the trimming into
one place.
Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
mm/filemap.c | 80 +++++++++++++++++++++++++++++------------------------------
1 file changed, 40 insertions(+), 40 deletions(-)
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1969,22 +1969,9 @@ generic_file_buffered_write(struct kiocb
}
status = a_ops->prepare_write(file, page, offset, offset+bytes);
- if (unlikely(status)) {
- loff_t isize = i_size_read(inode);
+ if (unlikely(status))
+ goto fs_write_aop_error;
- if (status != AOP_TRUNCATED_PAGE)
- unlock_page(page);
- page_cache_release(page);
- if (status == AOP_TRUNCATED_PAGE)
- continue;
- /*
- * prepare_write() may have instantiated a few blocks
- * outside i_size. Trim these off again.
- */
- if (pos + bytes > isize)
- vmtruncate(inode, isize);
- break;
- }
if (likely(nr_segs == 1))
copied = filemap_copy_from_user(page, offset,
buf, bytes);
@@ -1993,40 +1980,53 @@ generic_file_buffered_write(struct kiocb
cur_iov, iov_offset, bytes);
flush_dcache_page(page);
status = a_ops->commit_write(file, page, offset, offset+bytes);
- if (status == AOP_TRUNCATED_PAGE) {
- page_cache_release(page);
- continue;
+ if (unlikely(status < 0 || status == AOP_TRUNCATED_PAGE))
+ goto fs_write_aop_error;
+ if (unlikely(copied != bytes)) {
+ status = -EFAULT;
+ goto fs_write_aop_error;
}
- if (likely(copied > 0)) {
- if (!status)
- status = copied;
+ if (unlikely(status > 0)) /* filesystem did partial write */
+ copied = status;
- if (status >= 0) {
- written += status;
- count -= status;
- pos += status;
- buf += status;
- if (unlikely(nr_segs > 1)) {
- filemap_set_next_iovec(&cur_iov,
- &iov_offset, status);
- if (count)
- buf = cur_iov->iov_base +
- iov_offset;
- } else {
- iov_offset += status;
- }
+ if (likely(copied > 0)) {
+ written += copied;
+ count -= copied;
+ pos += copied;
+ buf += copied;
+ if (unlikely(nr_segs > 1)) {
+ filemap_set_next_iovec(&cur_iov,
+ &iov_offset, copied);
+ if (count)
+ buf = cur_iov->iov_base + iov_offset;
+ } else {
+ iov_offset += copied;
}
}
- if (unlikely(copied != bytes))
- if (status >= 0)
- status = -EFAULT;
unlock_page(page);
mark_page_accessed(page);
page_cache_release(page);
- if (status < 0)
- break;
balance_dirty_pages_ratelimited(mapping);
cond_resched();
+ continue;
+
+fs_write_aop_error:
+ if (status != AOP_TRUNCATED_PAGE)
+ unlock_page(page);
+ page_cache_release(page);
+
+ /*
+ * prepare_write() may have instantiated a few blocks
+ * outside i_size. Trim these off again. Don't need
+ * i_size_read because we hold i_mutex.
+ */
+ if (pos + bytes > inode->i_size)
+ vmtruncate(inode, inode->i_size);
+ if (status == AOP_TRUNCATED_PAGE)
+ continue;
+ else
+ break;
+
} while (count);
*ppos = pos;
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 07/41] mm: buffered write cleanup
[not found] <20070524052844.860329000@suse.de>
` (5 preceding siblings ...)
2007-05-25 12:21 ` [patch 06/41] mm: trim more holes npiggin
@ 2007-05-25 12:21 ` npiggin
2007-05-25 12:21 ` [patch 08/41] mm: write iovec cleanup npiggin
` (5 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: npiggin @ 2007-05-25 12:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management
[-- Attachment #1: mm-buffered-write-cleanup.patch --]
[-- Type: text/plain, Size: 11383 bytes --]
Quite a bit of code is used in maintaining these "cached pages" that are
probably pretty unlikely to get used. It would require a narrow race where
the page is inserted concurrently while this process is allocating a page
in order to create the spare page. Then a multi-page write into an uncached
part of the file, to make use of it.
Next, the buffered write path (and others) uses its own LRU pagevec when it
should be just using the per-CPU LRU pagevec (which will cut down on both data
and code size cacheline footprint). Also, these private LRU pagevecs are
emptied after just a very short time, in contrast with the per-CPU pagevecs
that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
to add the pages to pagecache for a bulk write (in 4K chunks).
[this gets rid of some cond_resched() calls in readahead.c and mpage.c due
to clashes in -mm. What put them there, and why? ]
Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
fs/mpage.c | 10 ---
mm/filemap.c | 145 ++++++++++++++++++++++-----------------------------------
mm/readahead.c | 24 +++------
3 files changed, 66 insertions(+), 113 deletions(-)
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -666,27 +666,22 @@ EXPORT_SYMBOL(find_lock_page);
struct page *find_or_create_page(struct address_space *mapping,
unsigned long index, gfp_t gfp_mask)
{
- struct page *page, *cached_page = NULL;
+ struct page *page;
int err;
repeat:
page = find_lock_page(mapping, index);
if (!page) {
- if (!cached_page) {
- cached_page =
- __page_cache_alloc(gfp_mask);
- if (!cached_page)
- return NULL;
- }
- err = add_to_page_cache_lru(cached_page, mapping,
- index, gfp_mask);
- if (!err) {
- page = cached_page;
- cached_page = NULL;
- } else if (err == -EEXIST)
- goto repeat;
+ page = __page_cache_alloc(gfp_mask);
+ if (!page)
+ return NULL;
+ err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
+ if (unlikely(err)) {
+ page_cache_release(page);
+ page = NULL;
+ if (err == -EEXIST)
+ goto repeat;
+ }
}
- if (cached_page)
- page_cache_release(cached_page);
return page;
}
EXPORT_SYMBOL(find_or_create_page);
@@ -883,11 +878,9 @@ void do_generic_mapping_read(struct addr
unsigned long prev_index;
unsigned int prev_offset;
loff_t isize;
- struct page *cached_page;
int error;
struct file_ra_state ra = *_ra;
- cached_page = NULL;
index = *ppos >> PAGE_CACHE_SHIFT;
next_index = index;
prev_index = ra.prev_index;
@@ -1059,23 +1052,20 @@ no_cached_page:
* Ok, it wasn't cached, so we need to create a new
* page..
*/
- if (!cached_page) {
- cached_page = page_cache_alloc_cold(mapping);
- if (!cached_page) {
- desc->error = -ENOMEM;
- goto out;
- }
+ page = page_cache_alloc_cold(mapping);
+ if (!page) {
+ desc->error = -ENOMEM;
+ goto out;
}
- error = add_to_page_cache_lru(cached_page, mapping,
+ error = add_to_page_cache_lru(page, mapping,
index, GFP_KERNEL);
if (error) {
+ page_cache_release(page);
if (error == -EEXIST)
goto find_page;
desc->error = error;
goto out;
}
- page = cached_page;
- cached_page = NULL;
goto readpage;
}
@@ -1084,8 +1074,6 @@ out:
_ra->prev_index = prev_index;
*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
- if (cached_page)
- page_cache_release(cached_page);
if (filp)
file_accessed(filp);
}
@@ -1573,35 +1561,28 @@ static struct page *__read_cache_page(st
int (*filler)(void *,struct page*),
void *data)
{
- struct page *page, *cached_page = NULL;
+ struct page *page;
int err;
repeat:
page = find_get_page(mapping, index);
if (!page) {
- if (!cached_page) {
- cached_page = page_cache_alloc_cold(mapping);
- if (!cached_page)
- return ERR_PTR(-ENOMEM);
- }
- err = add_to_page_cache_lru(cached_page, mapping,
- index, GFP_KERNEL);
- if (err == -EEXIST)
- goto repeat;
- if (err < 0) {
+ page = page_cache_alloc_cold(mapping);
+ if (!page)
+ return ERR_PTR(-ENOMEM);
+ err = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+ if (unlikely(err)) {
+ page_cache_release(page);
+ if (err == -EEXIST)
+ goto repeat;
/* Presumably ENOMEM for radix tree node */
- page_cache_release(cached_page);
return ERR_PTR(err);
}
- page = cached_page;
- cached_page = NULL;
err = filler(data, page);
if (err < 0) {
page_cache_release(page);
page = ERR_PTR(err);
}
}
- if (cached_page)
- page_cache_release(cached_page);
return page;
}
@@ -1679,40 +1660,6 @@ struct page *read_cache_page(struct addr
EXPORT_SYMBOL(read_cache_page);
/*
- * If the page was newly created, increment its refcount and add it to the
- * caller's lru-buffering pagevec. This function is specifically for
- * generic_file_write().
- */
-static inline struct page *
-__grab_cache_page(struct address_space *mapping, unsigned long index,
- struct page **cached_page, struct pagevec *lru_pvec)
-{
- int err;
- struct page *page;
-repeat:
- page = find_lock_page(mapping, index);
- if (!page) {
- if (!*cached_page) {
- *cached_page = page_cache_alloc(mapping);
- if (!*cached_page)
- return NULL;
- }
- err = add_to_page_cache(*cached_page, mapping,
- index, GFP_KERNEL);
- if (err == -EEXIST)
- goto repeat;
- if (err == 0) {
- page = *cached_page;
- page_cache_get(page);
- if (!pagevec_add(lru_pvec, page))
- __pagevec_lru_add(lru_pvec);
- *cached_page = NULL;
- }
- }
- return page;
-}
-
-/*
* The logic we want is
*
* if suid or (sgid and xgrp)
@@ -1906,6 +1853,33 @@ generic_file_direct_write(struct kiocb *
}
EXPORT_SYMBOL(generic_file_direct_write);
+/*
+ * Find or create a page at the given pagecache position. Return the locked
+ * page. This function is specifically for buffered writes.
+ */
+static struct page *__grab_cache_page(struct address_space *mapping,
+ pgoff_t index)
+{
+ int status;
+ struct page *page;
+repeat:
+ page = find_lock_page(mapping, index);
+ if (likely(page))
+ return page;
+
+ page = page_cache_alloc(mapping);
+ if (!page)
+ return NULL;
+ status = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+ if (unlikely(status)) {
+ page_cache_release(page);
+ if (status == -EEXIST)
+ goto repeat;
+ return NULL;
+ }
+ return page;
+}
+
ssize_t
generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos, loff_t *ppos,
@@ -1916,15 +1890,10 @@ generic_file_buffered_write(struct kiocb
const struct address_space_operations *a_ops = mapping->a_ops;
struct inode *inode = mapping->host;
long status = 0;
- struct page *page;
- struct page *cached_page = NULL;
- struct pagevec lru_pvec;
const struct iovec *cur_iov = iov; /* current iovec */
size_t iov_offset = 0; /* offset in the current iovec */
char __user *buf;
- pagevec_init(&lru_pvec, 0);
-
/*
* handle partial DIO write. Adjust cur_iov if needed.
*/
@@ -1936,6 +1905,7 @@ generic_file_buffered_write(struct kiocb
}
do {
+ struct page *page;
pgoff_t index; /* Pagecache index for current page */
unsigned long offset; /* Offset into pagecache page */
unsigned long maxlen; /* Bytes remaining in current iovec */
@@ -1962,7 +1932,8 @@ generic_file_buffered_write(struct kiocb
fault_in_pages_readable(buf, maxlen);
#endif
- page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
+
+ page = __grab_cache_page(mapping, index);
if (!page) {
status = -ENOMEM;
break;
@@ -2030,9 +2001,6 @@ fs_write_aop_error:
} while (count);
*ppos = pos;
- if (cached_page)
- page_cache_release(cached_page);
-
/*
* For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
*/
@@ -2052,7 +2020,6 @@ fs_write_aop_error:
if (unlikely(file->f_flags & O_DIRECT) && written)
status = filemap_write_and_wait(mapping);
- pagevec_lru_add(&lru_pvec);
return written ? written : status;
}
EXPORT_SYMBOL(generic_file_buffered_write);
Index: linux-2.6/fs/mpage.c
===================================================================
--- linux-2.6.orig/fs/mpage.c
+++ linux-2.6/fs/mpage.c
@@ -387,31 +387,25 @@ mpage_readpages(struct address_space *ma
struct bio *bio = NULL;
unsigned page_idx;
sector_t last_block_in_bio = 0;
- struct pagevec lru_pvec;
struct buffer_head map_bh;
unsigned long first_logical_block = 0;
clear_buffer_mapped(&map_bh);
- pagevec_init(&lru_pvec, 0);
for (page_idx = 0; page_idx < nr_pages; page_idx++) {
struct page *page = list_entry(pages->prev, struct page, lru);
prefetchw(&page->flags);
list_del(&page->lru);
- if (!add_to_page_cache(page, mapping,
+ if (!add_to_page_cache_lru(page, mapping,
page->index, GFP_KERNEL)) {
bio = do_mpage_readpage(bio, page,
nr_pages - page_idx,
&last_block_in_bio, &map_bh,
&first_logical_block,
get_block);
- if (!pagevec_add(&lru_pvec, page))
- __pagevec_lru_add(&lru_pvec);
- } else {
- page_cache_release(page);
}
+ page_cache_release(page);
}
- pagevec_lru_add(&lru_pvec);
BUG_ON(!list_empty(pages));
if (bio)
mpage_bio_submit(READ, bio);
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -65,28 +65,25 @@ int read_cache_pages(struct address_spac
int (*filler)(void *, struct page *), void *data)
{
struct page *page;
- struct pagevec lru_pvec;
int ret = 0;
- pagevec_init(&lru_pvec, 0);
-
while (!list_empty(pages)) {
page = list_to_page(pages);
list_del(&page->lru);
- if (add_to_page_cache(page, mapping, page->index, GFP_KERNEL)) {
+ if (add_to_page_cache_lru(page, mapping,
+ page->index, GFP_KERNEL)) {
page_cache_release(page);
continue;
}
+ page_cache_release(page);
+
ret = filler(data, page);
- if (!pagevec_add(&lru_pvec, page))
- __pagevec_lru_add(&lru_pvec);
- if (ret) {
+ if (unlikely(ret)) {
put_pages_list(pages);
break;
}
task_io_account_read(PAGE_CACHE_SIZE);
}
- pagevec_lru_add(&lru_pvec);
return ret;
}
@@ -96,7 +93,6 @@ static int read_pages(struct address_spa
struct list_head *pages, unsigned nr_pages)
{
unsigned page_idx;
- struct pagevec lru_pvec;
int ret;
if (mapping->a_ops->readpages) {
@@ -106,19 +102,15 @@ static int read_pages(struct address_spa
goto out;
}
- pagevec_init(&lru_pvec, 0);
for (page_idx = 0; page_idx < nr_pages; page_idx++) {
struct page *page = list_to_page(pages);
list_del(&page->lru);
- if (!add_to_page_cache(page, mapping,
+ if (!add_to_page_cache_lru(page, mapping,
page->index, GFP_KERNEL)) {
mapping->a_ops->readpage(filp, page);
- if (!pagevec_add(&lru_pvec, page))
- __pagevec_lru_add(&lru_pvec);
- } else
- page_cache_release(page);
+ }
+ page_cache_release(page);
}
- pagevec_lru_add(&lru_pvec);
ret = 0;
out:
return ret;
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 08/41] mm: write iovec cleanup
[not found] <20070524052844.860329000@suse.de>
` (6 preceding siblings ...)
2007-05-25 12:21 ` [patch 07/41] mm: buffered write cleanup npiggin
@ 2007-05-25 12:21 ` npiggin
2007-05-25 12:21 ` [patch 09/41] mm: fix pagecache write deadlocks npiggin
` (4 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: npiggin @ 2007-05-25 12:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management
[-- Attachment #1: mm-write-iov-cleanup.patch --]
[-- Type: text/plain, Size: 8832 bytes --]
Hide some of the open-coded nr_segs tests into the iovec helpers. This is
all to simplify generic_file_buffered_write, because that gets more complex
in the next patch.
Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
mm/filemap.c | 36 +++++--------------
mm/filemap.h | 104 +++++++++++++++++++++++++++----------------------------
mm/filemap_xip.c | 17 +++-----
3 files changed, 69 insertions(+), 88 deletions(-)
Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -22,82 +22,82 @@ __filemap_copy_from_user_iovec_inatomic(
/*
* Copy as much as we can into the page and return the number of bytes which
- * were sucessfully copied. If a fault is encountered then clear the page
- * out to (offset+bytes) and return the number of bytes which were copied.
- *
- * NOTE: For this to work reliably we really want copy_from_user_inatomic_nocache
- * to *NOT* zero any tail of the buffer that it failed to copy. If it does,
- * and if the following non-atomic copy succeeds, then there is a small window
- * where the target page contains neither the data before the write, nor the
- * data after the write (it contains zero). A read at this time will see
- * data that is inconsistent with any ordering of the read and the write.
- * (This has been detected in practice).
+ * were sucessfully copied. If a fault is encountered then return the number of
+ * bytes which were copied.
*/
static inline size_t
-filemap_copy_from_user(struct page *page, unsigned long offset,
- const char __user *buf, unsigned bytes)
+filemap_copy_from_user_atomic(struct page *page, unsigned long offset,
+ const struct iovec *iov, unsigned long nr_segs,
+ size_t base, size_t bytes)
{
char *kaddr;
- int left;
+ size_t copied;
kaddr = kmap_atomic(page, KM_USER0);
- left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
+ if (likely(nr_segs == 1)) {
+ int left;
+ char __user *buf = iov->iov_base + base;
+ left = __copy_from_user_inatomic_nocache(kaddr + offset,
+ buf, bytes);
+ copied = bytes - left;
+ } else {
+ copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
+ iov, base, bytes);
+ }
kunmap_atomic(kaddr, KM_USER0);
- if (left != 0) {
- /* Do it the slow way */
- kaddr = kmap(page);
- left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
- kunmap(page);
- }
- return bytes - left;
+ return copied;
}
/*
- * This has the same sideeffects and return value as filemap_copy_from_user().
- * The difference is that on a fault we need to memset the remainder of the
- * page (out to offset+bytes), to emulate filemap_copy_from_user()'s
- * single-segment behaviour.
+ * This has the same sideeffects and return value as
+ * filemap_copy_from_user_atomic().
+ * The difference is that it attempts to resolve faults.
*/
static inline size_t
-filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
- const struct iovec *iov, size_t base, size_t bytes)
+filemap_copy_from_user(struct page *page, unsigned long offset,
+ const struct iovec *iov, unsigned long nr_segs,
+ size_t base, size_t bytes)
{
char *kaddr;
size_t copied;
- kaddr = kmap_atomic(page, KM_USER0);
- copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
- base, bytes);
- kunmap_atomic(kaddr, KM_USER0);
- if (copied != bytes) {
- kaddr = kmap(page);
- copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
- base, bytes);
- if (bytes - copied)
- memset(kaddr + offset + copied, 0, bytes - copied);
- kunmap(page);
+ kaddr = kmap(page);
+ if (likely(nr_segs == 1)) {
+ int left;
+ char __user *buf = iov->iov_base + base;
+ left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+ copied = bytes - left;
+ } else {
+ copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
+ iov, base, bytes);
}
+ kunmap(page);
return copied;
}
static inline void
-filemap_set_next_iovec(const struct iovec **iovp, size_t *basep, size_t bytes)
+filemap_set_next_iovec(const struct iovec **iovp, unsigned long nr_segs,
+ size_t *basep, size_t bytes)
{
- const struct iovec *iov = *iovp;
- size_t base = *basep;
-
- while (bytes) {
- int copy = min(bytes, iov->iov_len - base);
-
- bytes -= copy;
- base += copy;
- if (iov->iov_len == base) {
- iov++;
- base = 0;
+ if (likely(nr_segs == 1)) {
+ *basep += bytes;
+ } else {
+ const struct iovec *iov = *iovp;
+ size_t base = *basep;
+
+ while (bytes) {
+ int copy = min(bytes, iov->iov_len - base);
+
+ bytes -= copy;
+ base += copy;
+ if (iov->iov_len == base) {
+ iov++;
+ base = 0;
+ }
}
+ *iovp = iov;
+ *basep = base;
}
- *iovp = iov;
- *basep = base;
}
#endif
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1897,12 +1897,7 @@ generic_file_buffered_write(struct kiocb
/*
* handle partial DIO write. Adjust cur_iov if needed.
*/
- if (likely(nr_segs == 1))
- buf = iov->iov_base + written;
- else {
- filemap_set_next_iovec(&cur_iov, &iov_offset, written);
- buf = cur_iov->iov_base + iov_offset;
- }
+ filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
do {
struct page *page;
@@ -1912,6 +1907,7 @@ generic_file_buffered_write(struct kiocb
size_t bytes; /* Bytes to write to page */
size_t copied; /* Bytes copied from user */
+ buf = cur_iov->iov_base + iov_offset;
offset = (pos & (PAGE_CACHE_SIZE - 1));
index = pos >> PAGE_CACHE_SHIFT;
bytes = PAGE_CACHE_SIZE - offset;
@@ -1943,13 +1939,10 @@ generic_file_buffered_write(struct kiocb
if (unlikely(status))
goto fs_write_aop_error;
- if (likely(nr_segs == 1))
- copied = filemap_copy_from_user(page, offset,
- buf, bytes);
- else
- copied = filemap_copy_from_user_iovec(page, offset,
- cur_iov, iov_offset, bytes);
+ copied = filemap_copy_from_user(page, offset,
+ cur_iov, nr_segs, iov_offset, bytes);
flush_dcache_page(page);
+
status = a_ops->commit_write(file, page, offset, offset+bytes);
if (unlikely(status < 0 || status == AOP_TRUNCATED_PAGE))
goto fs_write_aop_error;
@@ -1960,20 +1953,11 @@ generic_file_buffered_write(struct kiocb
if (unlikely(status > 0)) /* filesystem did partial write */
copied = status;
- if (likely(copied > 0)) {
- written += copied;
- count -= copied;
- pos += copied;
- buf += copied;
- if (unlikely(nr_segs > 1)) {
- filemap_set_next_iovec(&cur_iov,
- &iov_offset, copied);
- if (count)
- buf = cur_iov->iov_base + iov_offset;
- } else {
- iov_offset += copied;
- }
- }
+ written += copied;
+ count -= copied;
+ pos += copied;
+ filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
+
unlock_page(page);
mark_page_accessed(page);
page_cache_release(page);
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -15,7 +15,6 @@
#include <linux/rmap.h>
#include <linux/sched.h>
#include <asm/tlbflush.h>
-#include "filemap.h"
/*
* We do use our own empty page to avoid interference with other users
@@ -319,6 +318,7 @@ __xip_file_write(struct file *filp, cons
unsigned long index;
unsigned long offset;
size_t copied;
+ char *kaddr;
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
index = pos >> PAGE_CACHE_SHIFT;
@@ -326,14 +326,6 @@ __xip_file_write(struct file *filp, cons
if (bytes > count)
bytes = count;
- /*
- * Bring in the user page that we will copy from _first_.
- * Otherwise there's a nasty deadlock on copying from the
- * same page as we're writing to, without it being marked
- * up-to-date.
- */
- fault_in_pages_readable(buf, bytes);
-
page = a_ops->get_xip_page(mapping,
index*(PAGE_SIZE/512), 0);
if (IS_ERR(page) && (PTR_ERR(page) == -ENODATA)) {
@@ -350,8 +342,13 @@ __xip_file_write(struct file *filp, cons
break;
}
- copied = filemap_copy_from_user(page, offset, buf, bytes);
+ fault_in_pages_readable(buf, bytes);
+ kaddr = kmap_atomic(page, KM_USER0);
+ copied = bytes -
+ __copy_from_user_inatomic_nocache(kaddr, buf, bytes);
+ kunmap_atomic(kaddr, KM_USER0);
flush_dcache_page(page);
+
if (likely(copied > 0)) {
status = copied;
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 09/41] mm: fix pagecache write deadlocks
[not found] <20070524052844.860329000@suse.de>
` (7 preceding siblings ...)
2007-05-25 12:21 ` [patch 08/41] mm: write iovec cleanup npiggin
@ 2007-05-25 12:21 ` npiggin
2007-05-25 12:21 ` [patch 10/41] mm: buffered write iterator npiggin
` (3 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: npiggin @ 2007-05-25 12:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management
[-- Attachment #1: mm-pagecache-write-deadlocks.patch --]
[-- Type: text/plain, Size: 9238 bytes --]
Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:
1. generic_buffered_write
2. lock_page
3. prepare_write
4. unlock_page+vmtruncate
5. copy_from_user
6. mmap_sem(r)
7. handle_mm_fault
8. lock_page (filemap_nopage)
9. commit_write
10. unlock_page
a. sys_munmap / sys_mlock / others
b. mmap_sem(w)
c. make_pages_present
d. get_user_pages
e. handle_mm_fault
f. lock_page (filemap_nopage)
2,8 - recursive deadlock if page is same
2,8;2,8 - ABBA deadlock is page is different
2,6;b,f - ABBA deadlock if page is same
The solution is as follows:
1. If we find the destination page is uptodate, continue as normal, but use
atomic usercopies which do not take pagefaults and do not zero the uncopied
tail of the destination. The destination is already uptodate, so we can
commit_write the full length even if there was a partial copy: it does not
matter that the tail was not modified, because if it is dirtied and written
back to disk it will not cause any problems (uptodate *means* that the
destination page is as new or newer than the copy on disk).
1a. The above requires that fault_in_pages_readable correctly returns access
information, because atomic usercopies cannot distinguish between
non-present pages in a readable mapping, from lack of a readable mapping.
2. If we find the destination page is non uptodate, unlock it (this could be
made slightly more optimal), then allocate a temporary page to copy the
source data into. Relock the destination page and continue with the copy.
However, instead of a usercopy (which might take a fault), copy the data
from the pinned temporary page via the kernel address space.
(also, rename maxlen to seglen, because it was confusing)
This increases the CPU/memory copy cost by almost 50% on the affected
workloads. That will be solved by introducing a new set of pagecache write
aops in a subsequent patch.
Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
include/linux/pagemap.h | 11 +++-
mm/filemap.c | 122 ++++++++++++++++++++++++++++++++++++++++--------
2 files changed, 112 insertions(+), 21 deletions(-)
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1900,11 +1900,12 @@ generic_file_buffered_write(struct kiocb
filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
do {
+ struct page *src_page;
struct page *page;
pgoff_t index; /* Pagecache index for current page */
unsigned long offset; /* Offset into pagecache page */
- unsigned long maxlen; /* Bytes remaining in current iovec */
- size_t bytes; /* Bytes to write to page */
+ unsigned long seglen; /* Bytes remaining in current iovec */
+ unsigned long bytes; /* Bytes to write to page */
size_t copied; /* Bytes copied from user */
buf = cur_iov->iov_base + iov_offset;
@@ -1914,20 +1915,30 @@ generic_file_buffered_write(struct kiocb
if (bytes > count)
bytes = count;
- maxlen = cur_iov->iov_len - iov_offset;
- if (maxlen > bytes)
- maxlen = bytes;
+ /*
+ * a non-NULL src_page indicates that we're doing the
+ * copy via get_user_pages and kmap.
+ */
+ src_page = NULL;
+
+ seglen = cur_iov->iov_len - iov_offset;
+ if (seglen > bytes)
+ seglen = bytes;
-#ifndef CONFIG_DEBUG_VM
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
* same page as we're writing to, without it being marked
* up-to-date.
+ *
+ * Not only is this an optimisation, but it is also required
+ * to check that the address is actually valid, when atomic
+ * usercopies are used, below.
*/
- fault_in_pages_readable(buf, maxlen);
-#endif
-
+ if (unlikely(fault_in_pages_readable(buf, seglen))) {
+ status = -EFAULT;
+ break;
+ }
page = __grab_cache_page(mapping, index);
if (!page) {
@@ -1935,32 +1946,104 @@ generic_file_buffered_write(struct kiocb
break;
}
+ /*
+ * non-uptodate pages cannot cope with short copies, and we
+ * cannot take a pagefault with the destination page locked.
+ * So pin the source page to copy it.
+ */
+ if (!PageUptodate(page)) {
+ unlock_page(page);
+
+ src_page = alloc_page(GFP_KERNEL);
+ if (!src_page) {
+ page_cache_release(page);
+ status = -ENOMEM;
+ break;
+ }
+
+ /*
+ * Cannot get_user_pages with a page locked for the
+ * same reason as we can't take a page fault with a
+ * page locked (as explained below).
+ */
+ copied = filemap_copy_from_user(src_page, offset,
+ cur_iov, nr_segs, iov_offset, bytes);
+ if (unlikely(copied == 0)) {
+ status = -EFAULT;
+ page_cache_release(page);
+ page_cache_release(src_page);
+ break;
+ }
+ bytes = copied;
+
+ lock_page(page);
+ /*
+ * Can't handle the page going uptodate here, because
+ * that means we would use non-atomic usercopies, which
+ * zero out the tail of the page, which can cause
+ * zeroes to become transiently visible. We could just
+ * use a non-zeroing copy, but the APIs aren't too
+ * consistent.
+ */
+ if (unlikely(!page->mapping || PageUptodate(page))) {
+ unlock_page(page);
+ page_cache_release(page);
+ page_cache_release(src_page);
+ continue;
+ }
+
+ }
+
status = a_ops->prepare_write(file, page, offset, offset+bytes);
if (unlikely(status))
goto fs_write_aop_error;
- copied = filemap_copy_from_user(page, offset,
+ if (!src_page) {
+ /*
+ * Must not enter the pagefault handler here, because
+ * we hold the page lock, so we might recursively
+ * deadlock on the same lock, or get an ABBA deadlock
+ * against a different lock, or against the mmap_sem
+ * (which nests outside the page lock). So increment
+ * preempt count, and use _atomic usercopies.
+ *
+ * The page is uptodate so we are OK to encounter a
+ * short copy: if unmodified parts of the page are
+ * marked dirty and written out to disk, it doesn't
+ * really matter.
+ */
+ pagefault_disable();
+ copied = filemap_copy_from_user_atomic(page, offset,
cur_iov, nr_segs, iov_offset, bytes);
+ pagefault_enable();
+ } else {
+ void *src, *dst;
+ src = kmap_atomic(src_page, KM_USER0);
+ dst = kmap_atomic(page, KM_USER1);
+ memcpy(dst + offset, src + offset, bytes);
+ kunmap_atomic(dst, KM_USER1);
+ kunmap_atomic(src, KM_USER0);
+ copied = bytes;
+ }
flush_dcache_page(page);
status = a_ops->commit_write(file, page, offset, offset+bytes);
if (unlikely(status < 0 || status == AOP_TRUNCATED_PAGE))
goto fs_write_aop_error;
- if (unlikely(copied != bytes)) {
- status = -EFAULT;
- goto fs_write_aop_error;
- }
if (unlikely(status > 0)) /* filesystem did partial write */
- copied = status;
+ copied = min_t(size_t, copied, status);
+
+ unlock_page(page);
+ mark_page_accessed(page);
+ page_cache_release(page);
+ if (src_page)
+ page_cache_release(src_page);
written += copied;
count -= copied;
pos += copied;
filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
- unlock_page(page);
- mark_page_accessed(page);
- page_cache_release(page);
balance_dirty_pages_ratelimited(mapping);
cond_resched();
continue;
@@ -1969,6 +2052,8 @@ fs_write_aop_error:
if (status != AOP_TRUNCATED_PAGE)
unlock_page(page);
page_cache_release(page);
+ if (src_page)
+ page_cache_release(src_page);
/*
* prepare_write() may have instantiated a few blocks
@@ -1981,7 +2066,6 @@ fs_write_aop_error:
continue;
else
break;
-
} while (count);
*ppos = pos;
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -218,6 +218,9 @@ static inline int fault_in_pages_writeab
{
int ret;
+ if (unlikely(size == 0))
+ return 0;
+
/*
* Writing zeroes into userspace here is OK, because we know that if
* the zero gets there, we'll be overwriting it.
@@ -237,19 +240,23 @@ static inline int fault_in_pages_writeab
return ret;
}
-static inline void fault_in_pages_readable(const char __user *uaddr, int size)
+static inline int fault_in_pages_readable(const char __user *uaddr, int size)
{
volatile char c;
int ret;
+ if (unlikely(size == 0))
+ return 0;
+
ret = __get_user(c, uaddr);
if (ret == 0) {
const char __user *end = uaddr + size - 1;
if (((unsigned long)uaddr & PAGE_MASK) !=
((unsigned long)end & PAGE_MASK))
- __get_user(c, end);
+ ret = __get_user(c, end);
}
+ return ret;
}
#endif /* _LINUX_PAGEMAP_H */
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 10/41] mm: buffered write iterator
[not found] <20070524052844.860329000@suse.de>
` (8 preceding siblings ...)
2007-05-25 12:21 ` [patch 09/41] mm: fix pagecache write deadlocks npiggin
@ 2007-05-25 12:21 ` npiggin
2007-05-25 12:21 ` [patch 11/41] fs: fix data-loss on error npiggin
` (2 subsequent siblings)
12 siblings, 0 replies; 24+ messages in thread
From: npiggin @ 2007-05-25 12:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management
[-- Attachment #1: fs-buffered-write-iterator.patch --]
[-- Type: text/plain, Size: 11230 bytes --]
Add an iterator data structure to operate over an iovec. Add usercopy
operators needed by generic_file_buffered_write, and convert that function
over.
Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
include/linux/fs.h | 33 ++++++++++++
mm/filemap.c | 144 +++++++++++++++++++++++++++++++++++++++++++----------
mm/filemap.h | 103 -------------------------------------
3 files changed, 150 insertions(+), 130 deletions(-)
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -416,6 +416,39 @@ struct page;
struct address_space;
struct writeback_control;
+struct iov_iter {
+ const struct iovec *iov;
+ unsigned long nr_segs;
+ size_t iov_offset;
+ size_t count;
+};
+
+size_t iov_iter_copy_from_user_atomic(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes);
+size_t iov_iter_copy_from_user(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes);
+void iov_iter_advance(struct iov_iter *i, size_t bytes);
+int iov_iter_fault_in_readable(struct iov_iter *i);
+size_t iov_iter_single_seg_count(struct iov_iter *i);
+
+static inline void iov_iter_init(struct iov_iter *i,
+ const struct iovec *iov, unsigned long nr_segs,
+ size_t count, size_t written)
+{
+ i->iov = iov;
+ i->nr_segs = nr_segs;
+ i->iov_offset = 0;
+ i->count = count + written;
+
+ iov_iter_advance(i, written);
+}
+
+static inline size_t iov_iter_count(struct iov_iter *i)
+{
+ return i->count;
+}
+
+
struct address_space_operations {
int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*readpage)(struct file *, struct page *);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -30,7 +30,7 @@
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/cpuset.h>
-#include "filemap.h"
+#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include "internal.h"
/*
@@ -1707,8 +1707,7 @@ int remove_suid(struct dentry *dentry)
}
EXPORT_SYMBOL(remove_suid);
-size_t
-__filemap_copy_from_user_iovec_inatomic(char *vaddr,
+static size_t __iovec_copy_from_user_inatomic(char *vaddr,
const struct iovec *iov, size_t base, size_t bytes)
{
size_t copied = 0, left = 0;
@@ -1731,6 +1730,110 @@ __filemap_copy_from_user_iovec_inatomic(
}
/*
+ * Copy as much as we can into the page and return the number of bytes which
+ * were sucessfully copied. If a fault is encountered then return the number of
+ * bytes which were copied.
+ */
+size_t iov_iter_copy_from_user_atomic(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ char *kaddr;
+ size_t copied;
+
+ BUG_ON(!in_atomic());
+ kaddr = kmap_atomic(page, KM_USER0);
+ if (likely(i->nr_segs == 1)) {
+ int left;
+ char __user *buf = i->iov->iov_base + i->iov_offset;
+ left = __copy_from_user_inatomic_nocache(kaddr + offset,
+ buf, bytes);
+ copied = bytes - left;
+ } else {
+ copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+ i->iov, i->iov_offset, bytes);
+ }
+ kunmap_atomic(kaddr, KM_USER0);
+
+ return copied;
+}
+
+/*
+ * This has the same sideeffects and return value as
+ * iov_iter_copy_from_user_atomic().
+ * The difference is that it attempts to resolve faults.
+ * Page must not be locked.
+ */
+size_t iov_iter_copy_from_user(struct page *page,
+ struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+ char *kaddr;
+ size_t copied;
+
+ kaddr = kmap(page);
+ if (likely(i->nr_segs == 1)) {
+ int left;
+ char __user *buf = i->iov->iov_base + i->iov_offset;
+ left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+ copied = bytes - left;
+ } else {
+ copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+ i->iov, i->iov_offset, bytes);
+ }
+ kunmap(page);
+ return copied;
+}
+
+static void __iov_iter_advance_iov(struct iov_iter *i, size_t bytes)
+{
+ if (likely(i->nr_segs == 1)) {
+ i->iov_offset += bytes;
+ } else {
+ const struct iovec *iov = i->iov;
+ size_t base = i->iov_offset;
+
+ while (bytes) {
+ int copy = min(bytes, iov->iov_len - base);
+
+ bytes -= copy;
+ base += copy;
+ if (iov->iov_len == base) {
+ iov++;
+ base = 0;
+ }
+ }
+ i->iov = iov;
+ i->iov_offset = base;
+ }
+}
+
+void iov_iter_advance(struct iov_iter *i, size_t bytes)
+{
+ BUG_ON(i->count < bytes);
+
+ __iov_iter_advance_iov(i, bytes);
+ i->count -= bytes;
+}
+
+int iov_iter_fault_in_readable(struct iov_iter *i)
+{
+ size_t seglen = min(i->iov->iov_len - i->iov_offset, i->count);
+ char __user *buf = i->iov->iov_base + i->iov_offset;
+ return fault_in_pages_readable(buf, seglen);
+}
+
+/*
+ * Return the count of just the current iov_iter segment.
+ */
+size_t iov_iter_single_seg_count(struct iov_iter *i)
+{
+ const struct iovec *iov = i->iov;
+ if (i->nr_segs == 1)
+ return i->count;
+ else
+ return min(i->count, iov->iov_len - i->iov_offset);
+}
+
+/*
* Performs necessary checks before doing a write
*
* Can adjust writing position or amount of bytes to write.
@@ -1890,30 +1993,22 @@ generic_file_buffered_write(struct kiocb
const struct address_space_operations *a_ops = mapping->a_ops;
struct inode *inode = mapping->host;
long status = 0;
- const struct iovec *cur_iov = iov; /* current iovec */
- size_t iov_offset = 0; /* offset in the current iovec */
- char __user *buf;
+ struct iov_iter i;
- /*
- * handle partial DIO write. Adjust cur_iov if needed.
- */
- filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, written);
+ iov_iter_init(&i, iov, nr_segs, count, written);
do {
struct page *src_page;
struct page *page;
pgoff_t index; /* Pagecache index for current page */
unsigned long offset; /* Offset into pagecache page */
- unsigned long seglen; /* Bytes remaining in current iovec */
unsigned long bytes; /* Bytes to write to page */
size_t copied; /* Bytes copied from user */
- buf = cur_iov->iov_base + iov_offset;
offset = (pos & (PAGE_CACHE_SIZE - 1));
index = pos >> PAGE_CACHE_SHIFT;
- bytes = PAGE_CACHE_SIZE - offset;
- if (bytes > count)
- bytes = count;
+ bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
+ iov_iter_count(&i));
/*
* a non-NULL src_page indicates that we're doing the
@@ -1921,10 +2016,6 @@ generic_file_buffered_write(struct kiocb
*/
src_page = NULL;
- seglen = cur_iov->iov_len - iov_offset;
- if (seglen > bytes)
- seglen = bytes;
-
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
@@ -1935,7 +2026,7 @@ generic_file_buffered_write(struct kiocb
* to check that the address is actually valid, when atomic
* usercopies are used, below.
*/
- if (unlikely(fault_in_pages_readable(buf, seglen))) {
+ if (unlikely(iov_iter_fault_in_readable(&i))) {
status = -EFAULT;
break;
}
@@ -1966,8 +2057,8 @@ generic_file_buffered_write(struct kiocb
* same reason as we can't take a page fault with a
* page locked (as explained below).
*/
- copied = filemap_copy_from_user(src_page, offset,
- cur_iov, nr_segs, iov_offset, bytes);
+ copied = iov_iter_copy_from_user(src_page, &i,
+ offset, bytes);
if (unlikely(copied == 0)) {
status = -EFAULT;
page_cache_release(page);
@@ -2013,8 +2104,8 @@ generic_file_buffered_write(struct kiocb
* really matter.
*/
pagefault_disable();
- copied = filemap_copy_from_user_atomic(page, offset,
- cur_iov, nr_segs, iov_offset, bytes);
+ copied = iov_iter_copy_from_user_atomic(page, &i,
+ offset, bytes);
pagefault_enable();
} else {
void *src, *dst;
@@ -2039,10 +2130,9 @@ generic_file_buffered_write(struct kiocb
if (src_page)
page_cache_release(src_page);
+ iov_iter_advance(&i, copied);
written += copied;
- count -= copied;
pos += copied;
- filemap_set_next_iovec(&cur_iov, nr_segs, &iov_offset, copied);
balance_dirty_pages_ratelimited(mapping);
cond_resched();
@@ -2066,7 +2156,7 @@ fs_write_aop_error:
continue;
else
break;
- } while (count);
+ } while (iov_iter_count(&i));
*ppos = pos;
/*
Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ /dev/null
@@ -1,103 +0,0 @@
-/*
- * linux/mm/filemap.h
- *
- * Copyright (C) 1994-1999 Linus Torvalds
- */
-
-#ifndef __FILEMAP_H
-#define __FILEMAP_H
-
-#include <linux/types.h>
-#include <linux/fs.h>
-#include <linux/mm.h>
-#include <linux/highmem.h>
-#include <linux/uio.h>
-#include <linux/uaccess.h>
-
-size_t
-__filemap_copy_from_user_iovec_inatomic(char *vaddr,
- const struct iovec *iov,
- size_t base,
- size_t bytes);
-
-/*
- * Copy as much as we can into the page and return the number of bytes which
- * were sucessfully copied. If a fault is encountered then return the number of
- * bytes which were copied.
- */
-static inline size_t
-filemap_copy_from_user_atomic(struct page *page, unsigned long offset,
- const struct iovec *iov, unsigned long nr_segs,
- size_t base, size_t bytes)
-{
- char *kaddr;
- size_t copied;
-
- kaddr = kmap_atomic(page, KM_USER0);
- if (likely(nr_segs == 1)) {
- int left;
- char __user *buf = iov->iov_base + base;
- left = __copy_from_user_inatomic_nocache(kaddr + offset,
- buf, bytes);
- copied = bytes - left;
- } else {
- copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
- iov, base, bytes);
- }
- kunmap_atomic(kaddr, KM_USER0);
-
- return copied;
-}
-
-/*
- * This has the same sideeffects and return value as
- * filemap_copy_from_user_atomic().
- * The difference is that it attempts to resolve faults.
- */
-static inline size_t
-filemap_copy_from_user(struct page *page, unsigned long offset,
- const struct iovec *iov, unsigned long nr_segs,
- size_t base, size_t bytes)
-{
- char *kaddr;
- size_t copied;
-
- kaddr = kmap(page);
- if (likely(nr_segs == 1)) {
- int left;
- char __user *buf = iov->iov_base + base;
- left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
- copied = bytes - left;
- } else {
- copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset,
- iov, base, bytes);
- }
- kunmap(page);
- return copied;
-}
-
-static inline void
-filemap_set_next_iovec(const struct iovec **iovp, unsigned long nr_segs,
- size_t *basep, size_t bytes)
-{
- if (likely(nr_segs == 1)) {
- *basep += bytes;
- } else {
- const struct iovec *iov = *iovp;
- size_t base = *basep;
-
- while (bytes) {
- int copy = min(bytes, iov->iov_len - base);
-
- bytes -= copy;
- base += copy;
- if (iov->iov_len == base) {
- iov++;
- base = 0;
- }
- }
- *iovp = iov;
- *basep = base;
- }
-}
-#endif
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 11/41] fs: fix data-loss on error
[not found] <20070524052844.860329000@suse.de>
` (9 preceding siblings ...)
2007-05-25 12:21 ` [patch 10/41] mm: buffered write iterator npiggin
@ 2007-05-25 12:21 ` npiggin
2007-05-25 12:21 ` [patch 12/41] fs: introduce write_begin, write_end, and perform_write aops npiggin
2007-05-25 12:21 ` [patch 13/41] mm: restore KERNEL_DS optimisations npiggin
12 siblings, 0 replies; 24+ messages in thread
From: npiggin @ 2007-05-25 12:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management
[-- Attachment #1: fs-dataloss-stop.patch --]
[-- Type: text/plain, Size: 1340 bytes --]
New buffers against uptodate pages are simply be marked uptodate, while the
buffer_new bit remains set. This causes error-case code to zero out parts
of those buffers because it thinks they contain stale data: wrong, they
are actually uptodate so this is a data loss situation.
Fix this by actually clearning buffer_new and marking the buffer dirty. It
makes sense to always clear buffer_new before setting a buffer uptodate.
Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
fs/buffer.c | 2 ++
1 file changed, 2 insertions(+)
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1816,7 +1816,9 @@ static int __block_prepare_write(struct
unmap_underlying_metadata(bh->b_bdev,
bh->b_blocknr);
if (PageUptodate(page)) {
+ clear_buffer_new(bh);
set_buffer_uptodate(bh);
+ mark_buffer_dirty(bh);
continue;
}
if (block_end > to || block_start < from) {
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 12/41] fs: introduce write_begin, write_end, and perform_write aops
[not found] <20070524052844.860329000@suse.de>
` (10 preceding siblings ...)
2007-05-25 12:21 ` [patch 11/41] fs: fix data-loss on error npiggin
@ 2007-05-25 12:21 ` npiggin
2007-05-31 4:30 ` Andrew Morton
2007-05-25 12:21 ` [patch 13/41] mm: restore KERNEL_DS optimisations npiggin
12 siblings, 1 reply; 24+ messages in thread
From: npiggin @ 2007-05-25 12:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management
[-- Attachment #1: fs-new-write-aops.patch --]
[-- Type: text/plain, Size: 37031 bytes --]
These are intended to replace prepare_write and commit_write with more
flexible alternatives that are also able to avoid the buffered write
deadlock problems efficiently (which prepare_write is unable to do).
Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
API design contributions, code review and fixes.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Documentation/filesystems/Locking | 9 -
Documentation/filesystems/vfs.txt | 45 +++++++
drivers/block/loop.c | 75 ++++-------
fs/buffer.c | 198 ++++++++++++++++++++++++++-----
fs/libfs.c | 44 +++++++
fs/namei.c | 47 +------
fs/revoked_inode.c | 14 ++
fs/splice.c | 70 +----------
include/linux/buffer_head.h | 10 +
include/linux/fs.h | 30 ++++
include/linux/pagemap.h | 2
mm/filemap.c | 238 +++++++++++++++++++++++++++++++++-----
12 files changed, 576 insertions(+), 206 deletions(-)
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -409,6 +409,8 @@ enum positive_aop_returns {
AOP_TRUNCATED_PAGE = 0x80001,
};
+#define AOP_FLAG_UNINTERRUPTIBLE 0x0001 /* will not do a short write */
+
/*
* oh the beauties of C type declarations.
*/
@@ -428,7 +430,7 @@ size_t iov_iter_copy_from_user_atomic(st
size_t iov_iter_copy_from_user(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes);
void iov_iter_advance(struct iov_iter *i, size_t bytes);
-int iov_iter_fault_in_readable(struct iov_iter *i);
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
size_t iov_iter_single_seg_count(struct iov_iter *i);
static inline void iov_iter_init(struct iov_iter *i,
@@ -469,6 +471,14 @@ struct address_space_operations {
*/
int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+
+ int (*write_begin)(struct file *, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata);
+ int (*write_end)(struct file *, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata);
+
/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidatepage) (struct page *, unsigned long);
@@ -483,6 +493,18 @@ struct address_space_operations {
int (*launder_page) (struct page *);
};
+/*
+ * pagecache_write_begin/pagecache_write_end must be used by general code
+ * to write into the pagecache.
+ */
+int pagecache_write_begin(struct file *, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata);
+
+int pagecache_write_end(struct file *, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata);
+
struct backing_dev_info;
struct address_space {
struct inode *host; /* owner: inode, block_device */
@@ -1894,6 +1916,12 @@ extern int simple_prepare_write(struct f
unsigned offset, unsigned to);
extern int simple_commit_write(struct file *file, struct page *page,
unsigned offset, unsigned to);
+extern int simple_write_begin(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata);
+extern int simple_write_end(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata);
extern struct dentry *simple_lookup(struct inode *, struct dentry *, struct nameidata *);
extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1814,11 +1814,10 @@ void iov_iter_advance(struct iov_iter *i
i->count -= bytes;
}
-int iov_iter_fault_in_readable(struct iov_iter *i)
+int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
{
- size_t seglen = min(i->iov->iov_len - i->iov_offset, i->count);
char __user *buf = i->iov->iov_base + i->iov_offset;
- return fault_in_pages_readable(buf, seglen);
+ return fault_in_pages_readable(buf, bytes);
}
/*
@@ -1917,6 +1916,93 @@ inline int generic_write_checks(struct f
}
EXPORT_SYMBOL(generic_write_checks);
+int pagecache_write_begin(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata)
+{
+ const struct address_space_operations *aops = mapping->a_ops;
+
+ if (aops->write_begin) {
+ return aops->write_begin(file, mapping, pos, len, flags,
+ pagep, fsdata);
+ } else {
+ int ret;
+ pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+ unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+ struct inode *inode = mapping->host;
+ struct page *page;
+again:
+ page = __grab_cache_page(mapping, index);
+ *pagep = page;
+ if (!page)
+ return -ENOMEM;
+
+ if (flags & AOP_FLAG_UNINTERRUPTIBLE && !PageUptodate(page)) {
+ /*
+ * There is no way to resolve a short write situation
+ * for a !Uptodate page (except by double copying in
+ * the caller done by generic_perform_write_2copy).
+ *
+ * Instead, we have to bring it uptodate here.
+ */
+ ret = aops->readpage(file, page);
+ page_cache_release(page);
+ if (ret) {
+ if (ret == AOP_TRUNCATED_PAGE)
+ goto again;
+ return ret;
+ }
+ goto again;
+ }
+
+ ret = aops->prepare_write(file, page, offset, offset+len);
+ if (ret) {
+ if (ret != AOP_TRUNCATED_PAGE)
+ unlock_page(page);
+ page_cache_release(page);
+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
+ if (ret == AOP_TRUNCATED_PAGE)
+ goto again;
+ }
+ return ret;
+ }
+}
+EXPORT_SYMBOL(pagecache_write_begin);
+
+int pagecache_write_end(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata)
+{
+ const struct address_space_operations *aops = mapping->a_ops;
+ int ret;
+
+ if (aops->write_begin) {
+ ret = aops->write_end(file, mapping, pos, len, copied,
+ page, fsdata);
+ } else {
+ unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+ struct inode *inode = mapping->host;
+
+ flush_dcache_page(page);
+ ret = aops->commit_write(file, page, offset, offset+len);
+ unlock_page(page);
+ page_cache_release(page);
+ BUG_ON(ret == AOP_TRUNCATED_PAGE); /* can't deal with */
+
+ if (ret < 0) {
+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
+ } else if (ret > 0)
+ ret = min_t(size_t, copied, ret);
+ else
+ ret = copied;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(pagecache_write_end);
+
ssize_t
generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long *nr_segs, loff_t pos, loff_t *ppos,
@@ -1960,8 +2046,7 @@ EXPORT_SYMBOL(generic_file_direct_write)
* Find or create a page at the given pagecache position. Return the locked
* page. This function is specifically for buffered writes.
*/
-static struct page *__grab_cache_page(struct address_space *mapping,
- pgoff_t index)
+struct page *__grab_cache_page(struct address_space *mapping, pgoff_t index)
{
int status;
struct page *page;
@@ -1982,20 +2067,16 @@ repeat:
}
return page;
}
+EXPORT_SYMBOL(__grab_cache_page);
-ssize_t
-generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
- unsigned long nr_segs, loff_t pos, loff_t *ppos,
- size_t count, ssize_t written)
+static ssize_t generic_perform_write_2copy(struct file *file,
+ struct iov_iter *i, loff_t pos)
{
- struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
const struct address_space_operations *a_ops = mapping->a_ops;
- struct inode *inode = mapping->host;
- long status = 0;
- struct iov_iter i;
-
- iov_iter_init(&i, iov, nr_segs, count, written);
+ struct inode *inode = mapping->host;
+ long status = 0;
+ ssize_t written = 0;
do {
struct page *src_page;
@@ -2008,7 +2089,7 @@ generic_file_buffered_write(struct kiocb
offset = (pos & (PAGE_CACHE_SIZE - 1));
index = pos >> PAGE_CACHE_SHIFT;
bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
- iov_iter_count(&i));
+ iov_iter_count(i));
/*
* a non-NULL src_page indicates that we're doing the
@@ -2026,7 +2107,7 @@ generic_file_buffered_write(struct kiocb
* to check that the address is actually valid, when atomic
* usercopies are used, below.
*/
- if (unlikely(iov_iter_fault_in_readable(&i))) {
+ if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
status = -EFAULT;
break;
}
@@ -2057,7 +2138,7 @@ generic_file_buffered_write(struct kiocb
* same reason as we can't take a page fault with a
* page locked (as explained below).
*/
- copied = iov_iter_copy_from_user(src_page, &i,
+ copied = iov_iter_copy_from_user(src_page, i,
offset, bytes);
if (unlikely(copied == 0)) {
status = -EFAULT;
@@ -2082,7 +2163,6 @@ generic_file_buffered_write(struct kiocb
page_cache_release(src_page);
continue;
}
-
}
status = a_ops->prepare_write(file, page, offset, offset+bytes);
@@ -2104,7 +2184,7 @@ generic_file_buffered_write(struct kiocb
* really matter.
*/
pagefault_disable();
- copied = iov_iter_copy_from_user_atomic(page, &i,
+ copied = iov_iter_copy_from_user_atomic(page, i,
offset, bytes);
pagefault_enable();
} else {
@@ -2130,9 +2210,9 @@ generic_file_buffered_write(struct kiocb
if (src_page)
page_cache_release(src_page);
- iov_iter_advance(&i, copied);
- written += copied;
+ iov_iter_advance(i, copied);
pos += copied;
+ written += copied;
balance_dirty_pages_ratelimited(mapping);
cond_resched();
@@ -2156,13 +2236,117 @@ fs_write_aop_error:
continue;
else
break;
- } while (iov_iter_count(&i));
- *ppos = pos;
+ } while (iov_iter_count(i));
+
+ return written ? written : status;
+}
+
+static ssize_t generic_perform_write(struct file *file,
+ struct iov_iter *i, loff_t pos)
+{
+ struct address_space *mapping = file->f_mapping;
+ const struct address_space_operations *a_ops = mapping->a_ops;
+ long status = 0;
+ ssize_t written = 0;
+
+ do {
+ struct page *page;
+ pgoff_t index; /* Pagecache index for current page */
+ unsigned long offset; /* Offset into pagecache page */
+ unsigned long bytes; /* Bytes to write to page */
+ size_t copied; /* Bytes copied from user */
+ void *fsdata;
+
+ offset = (pos & (PAGE_CACHE_SIZE - 1));
+ index = pos >> PAGE_CACHE_SHIFT;
+ bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
+ iov_iter_count(i));
+
+again:
+
+ /*
+ * Bring in the user page that we will copy from _first_.
+ * Otherwise there's a nasty deadlock on copying from the
+ * same page as we're writing to, without it being marked
+ * up-to-date.
+ *
+ * Not only is this an optimisation, but it is also required
+ * to check that the address is actually valid, when atomic
+ * usercopies are used, below.
+ */
+ if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+ status = -EFAULT;
+ break;
+ }
+
+ status = a_ops->write_begin(file, mapping, pos, bytes, 0,
+ &page, &fsdata);
+ if (unlikely(status))
+ break;
+
+ pagefault_disable();
+ copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+ pagefault_enable();
+ flush_dcache_page(page);
+
+ status = a_ops->write_end(file, mapping, pos, bytes, copied,
+ page, fsdata);
+ if (unlikely(status < 0))
+ break;
+ copied = status;
+
+ cond_resched();
+
+ if (unlikely(copied == 0)) {
+ /*
+ * If we were unable to copy any data at all, we must
+ * fall back to a single segment length write.
+ *
+ * If we didn't fallback here, we could livelock
+ * because not all segments in the iov can be copied at
+ * once without a pagefault.
+ */
+ bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
+ iov_iter_single_seg_count(i));
+ goto again;
+ }
+ iov_iter_advance(i, copied);
+ pos += copied;
+ written += copied;
+
+ balance_dirty_pages_ratelimited(mapping);
+
+ } while (iov_iter_count(i));
+
+ return written ? written : status;
+}
+
+ssize_t
+generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long nr_segs, loff_t pos, loff_t *ppos,
+ size_t count, ssize_t written)
+{
+ struct file *file = iocb->ki_filp;
+ struct address_space *mapping = file->f_mapping;
+ const struct address_space_operations *a_ops = mapping->a_ops;
+ struct inode *inode = mapping->host;
+ ssize_t status;
+ struct iov_iter i;
+
+ iov_iter_init(&i, iov, nr_segs, count, written);
+ if (a_ops->write_begin)
+ status = generic_perform_write(file, &i, pos);
+ else
+ status = generic_perform_write_2copy(file, &i, pos);
- /*
- * For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
- */
if (likely(status >= 0)) {
+ written += status;
+ *ppos = pos + status;
+
+ /*
+ * For now, when the user asks for O_SYNC, we'll actually give
+ * O_DSYNC
+ */
if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
if (!a_ops->writepage || !is_sync_kiocb(iocb))
status = generic_osync_inode(inode, mapping,
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1773,6 +1773,48 @@ recover:
goto done;
}
+/*
+ * If a page has any new buffers, zero them out here, and mark them uptodate
+ * and dirty so they'll be written out (in order to prevent uninitialised
+ * block data from leaking). And clear the new bit.
+ */
+void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
+{
+ unsigned int block_start, block_end;
+ struct buffer_head *head, *bh;
+
+ BUG_ON(!PageLocked(page));
+ if (!page_has_buffers(page))
+ return;
+
+ bh = head = page_buffers(page);
+ block_start = 0;
+ do {
+ block_end = block_start + bh->b_size;
+
+ if (buffer_new(bh)) {
+ if (block_end > from && block_start < to) {
+ if (!PageUptodate(page)) {
+ unsigned start, size;
+
+ start = max(from, block_start);
+ size = min(to, block_end) - start;
+
+ zero_user_page(page, start, size, KM_USER0);
+ set_buffer_uptodate(bh);
+ }
+
+ clear_buffer_new(bh);
+ mark_buffer_dirty(bh);
+ }
+ }
+
+ block_start = block_end;
+ bh = bh->b_this_page;
+ } while (bh != head);
+}
+EXPORT_SYMBOL(page_zero_new_buffers);
+
static int __block_prepare_write(struct inode *inode, struct page *page,
unsigned from, unsigned to, get_block_t *get_block)
{
@@ -1857,38 +1899,8 @@ static int __block_prepare_write(struct
if (!buffer_uptodate(*wait_bh))
err = -EIO;
}
- if (!err) {
- bh = head;
- do {
- if (buffer_new(bh))
- clear_buffer_new(bh);
- } while ((bh = bh->b_this_page) != head);
- return 0;
- }
- /* Error case: */
- /*
- * Zero out any newly allocated blocks to avoid exposing stale
- * data. If BH_New is set, we know that the block was newly
- * allocated in the above loop.
- */
- bh = head;
- block_start = 0;
- do {
- block_end = block_start+blocksize;
- if (block_end <= from)
- goto next_bh;
- if (block_start >= to)
- break;
- if (buffer_new(bh)) {
- clear_buffer_new(bh);
- zero_user_page(page, block_start, bh->b_size, KM_USER0);
- set_buffer_uptodate(bh);
- mark_buffer_dirty(bh);
- }
-next_bh:
- block_start = block_end;
- bh = bh->b_this_page;
- } while (bh != head);
+ if (unlikely(err))
+ page_zero_new_buffers(page, from, to);
return err;
}
@@ -1913,6 +1925,7 @@ static int __block_commit_write(struct i
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
}
+ clear_buffer_new(bh);
}
/*
@@ -1927,6 +1940,127 @@ static int __block_commit_write(struct i
}
/*
+ * block_write_begin takes care of the basic task of block allocation and
+ * bringing partial write blocks uptodate first.
+ *
+ * If *pagep is not NULL, then block_write_begin uses the locked page
+ * at *pagep rather than allocating its own. In this case, the page will
+ * not be unlocked or deallocated on failure.
+ */
+int block_write_begin(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata,
+ get_block_t *get_block)
+{
+ struct inode *inode = mapping->host;
+ int status = 0;
+ struct page *page;
+ pgoff_t index;
+ unsigned start, end;
+ int ownpage = 0;
+
+ index = pos >> PAGE_CACHE_SHIFT;
+ start = pos & (PAGE_CACHE_SIZE - 1);
+ end = start + len;
+
+ page = *pagep;
+ if (page == NULL) {
+ ownpage = 1;
+ page = __grab_cache_page(mapping, index);
+ if (!page) {
+ status = -ENOMEM;
+ goto out;
+ }
+ *pagep = page;
+ } else
+ BUG_ON(!PageLocked(page));
+
+ status = __block_prepare_write(inode, page, start, end, get_block);
+ if (unlikely(status)) {
+ ClearPageUptodate(page);
+
+ if (ownpage) {
+ unlock_page(page);
+ page_cache_release(page);
+
+ /*
+ * prepare_write() may have instantiated a few blocks
+ * outside i_size. Trim these off again. Don't need
+ * i_size_read because we hold i_mutex.
+ */
+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
+ }
+ goto out;
+ }
+
+out:
+ return status;
+}
+EXPORT_SYMBOL(block_write_begin);
+
+int block_write_end(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata)
+{
+ struct inode *inode = mapping->host;
+ unsigned start;
+
+ start = pos & (PAGE_CACHE_SIZE - 1);
+
+ if (unlikely(copied < len)) {
+ /*
+ * The buffers that were written will now be uptodate, so we
+ * don't have to worry about a readpage reading them and
+ * overwriting a partial write. However if we have encountered
+ * a short write and only partially written into a buffer, it
+ * will not be marked uptodate, so a readpage might come in and
+ * destroy our partial write.
+ *
+ * Do the simplest thing, and just treat any short write to a
+ * non uptodate page as a zero-length write, and force the
+ * caller to redo the whole thing.
+ */
+ if (!PageUptodate(page))
+ copied = 0;
+
+ page_zero_new_buffers(page, start+copied, start+len);
+ }
+ flush_dcache_page(page);
+
+ /* This could be a short (even 0-length) commit */
+ __block_commit_write(inode, page, start, start+copied);
+
+ return copied;
+}
+EXPORT_SYMBOL(block_write_end);
+
+int generic_write_end(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata)
+{
+ struct inode *inode = mapping->host;
+
+ copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+
+ unlock_page(page);
+ mark_page_accessed(page); /* XXX: put this in caller? */
+ page_cache_release(page);
+
+ /*
+ * No need to use i_size_read() here, the i_size
+ * cannot change under us because we hold i_mutex.
+ */
+ if (pos+copied > inode->i_size) {
+ i_size_write(inode, pos+copied);
+ mark_inode_dirty(inode);
+ }
+
+ return copied;
+}
+EXPORT_SYMBOL(generic_write_end);
+
+/*
* Generic "read page" function for block devices that have the normal
* get_block functionality. This is most of the block device filesystems.
* Reads the page asynchronously --- the unlock_buffer() and
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -203,6 +203,16 @@ void block_invalidatepage(struct page *p
int block_write_full_page(struct page *page, get_block_t *get_block,
struct writeback_control *wbc);
int block_read_full_page(struct page*, get_block_t*);
+int block_write_begin(struct file *, struct address_space *,
+ loff_t, unsigned, unsigned,
+ struct page **, void **, get_block_t*);
+int block_write_end(struct file *, struct address_space *,
+ loff_t, unsigned, unsigned,
+ struct page *, void *);
+int generic_write_end(struct file *, struct address_space *,
+ loff_t, unsigned, unsigned,
+ struct page *, void *);
+void page_zero_new_buffers(struct page *page, unsigned from, unsigned to);
int block_prepare_write(struct page*, unsigned, unsigned, get_block_t*);
int cont_prepare_write(struct page*, unsigned, unsigned, get_block_t*,
loff_t *);
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -96,6 +96,8 @@ unsigned find_get_pages_contig(struct ad
unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
int tag, unsigned int nr_pages, struct page **pages);
+struct page *__grab_cache_page(struct address_space *mapping, pgoff_t index);
+
/*
* Returns locked page at given index in given cache, creating it if needed.
*/
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -351,6 +351,26 @@ int simple_prepare_write(struct file *fi
return 0;
}
+int simple_write_begin(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata)
+{
+ struct page *page;
+ pgoff_t index;
+ unsigned from;
+
+ index = pos >> PAGE_CACHE_SHIFT;
+ from = pos & (PAGE_CACHE_SIZE - 1);
+
+ page = __grab_cache_page(mapping, index);
+ if (!page)
+ return -ENOMEM;
+
+ *pagep = page;
+
+ return simple_prepare_write(file, page, from, from+len);
+}
+
int simple_commit_write(struct file *file, struct page *page,
unsigned from, unsigned to)
{
@@ -369,6 +389,28 @@ int simple_commit_write(struct file *fil
return 0;
}
+int simple_write_end(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata)
+{
+ unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+
+ /* zero the stale part of the page if we did a short copy */
+ if (copied < len) {
+ void *kaddr = kmap_atomic(page, KM_USER0);
+ memset(kaddr + from + copied, 0, len - copied);
+ flush_dcache_page(page);
+ kunmap_atomic(kaddr, KM_USER0);
+ }
+
+ simple_commit_write(file, page, from, from+copied);
+
+ unlock_page(page);
+ page_cache_release(page);
+
+ return copied;
+}
+
/*
* the inodes created here are not hashed. If you use iunique to generate
* unique inode values later for this filesystem, then you must take care
@@ -642,6 +684,8 @@ EXPORT_SYMBOL(dcache_dir_open);
EXPORT_SYMBOL(dcache_readdir);
EXPORT_SYMBOL(generic_read_dir);
EXPORT_SYMBOL(get_sb_pseudo);
+EXPORT_SYMBOL(simple_write_begin);
+EXPORT_SYMBOL(simple_write_end);
EXPORT_SYMBOL(simple_commit_write);
EXPORT_SYMBOL(simple_dir_inode_operations);
EXPORT_SYMBOL(simple_dir_operations);
Index: linux-2.6/drivers/block/loop.c
===================================================================
--- linux-2.6.orig/drivers/block/loop.c
+++ linux-2.6/drivers/block/loop.c
@@ -202,14 +202,13 @@ lo_do_transfer(struct loop_device *lo, i
* do_lo_send_aops - helper for writing data to a loop device
*
* This is the fast version for backing filesystems which implement the address
- * space operations prepare_write and commit_write.
+ * space operations write_begin and write_end.
*/
static int do_lo_send_aops(struct loop_device *lo, struct bio_vec *bvec,
- int bsize, loff_t pos, struct page *page)
+ int bsize, loff_t pos, struct page *unused)
{
struct file *file = lo->lo_backing_file; /* kudos to NFsckingS */
struct address_space *mapping = file->f_mapping;
- const struct address_space_operations *aops = mapping->a_ops;
pgoff_t index;
unsigned offset, bv_offs;
int len, ret;
@@ -221,63 +220,47 @@ static int do_lo_send_aops(struct loop_d
len = bvec->bv_len;
while (len > 0) {
sector_t IV;
- unsigned size;
+ unsigned size, copied;
int transfer_result;
+ struct page *page;
+ void *fsdata;
IV = ((sector_t)index << (PAGE_CACHE_SHIFT - 9))+(offset >> 9);
size = PAGE_CACHE_SIZE - offset;
if (size > len)
size = len;
- page = grab_cache_page(mapping, index);
- if (unlikely(!page))
+
+ ret = pagecache_write_begin(file, mapping, pos, size, 0,
+ &page, &fsdata);
+ if (ret)
goto fail;
- ret = aops->prepare_write(file, page, offset,
- offset + size);
- if (unlikely(ret)) {
- if (ret == AOP_TRUNCATED_PAGE) {
- page_cache_release(page);
- continue;
- }
- goto unlock;
- }
+
transfer_result = lo_do_transfer(lo, WRITE, page, offset,
bvec->bv_page, bv_offs, size, IV);
- if (unlikely(transfer_result)) {
- /*
- * The transfer failed, but we still write the data to
- * keep prepare/commit calls balanced.
- */
- printk(KERN_ERR "loop: transfer error block %llu\n",
- (unsigned long long)index);
- zero_user_page(page, offset, size, KM_USER0);
- }
- flush_dcache_page(page);
- ret = aops->commit_write(file, page, offset,
- offset + size);
- if (unlikely(ret)) {
- if (ret == AOP_TRUNCATED_PAGE) {
- page_cache_release(page);
- continue;
- }
- goto unlock;
- }
+ copied = size;
if (unlikely(transfer_result))
- goto unlock;
- bv_offs += size;
- len -= size;
+ copied = 0;
+
+ ret = pagecache_write_end(file, mapping, pos, size, copied,
+ page, fsdata);
+ if (ret < 0)
+ goto fail;
+ if (ret < copied)
+ copied = ret;
+
+ if (unlikely(transfer_result))
+ goto fail;
+
+ bv_offs += copied;
+ len -= copied;
offset = 0;
index++;
- pos += size;
- unlock_page(page);
- page_cache_release(page);
+ pos += copied;
}
ret = 0;
out:
mutex_unlock(&mapping->host->i_mutex);
return ret;
-unlock:
- unlock_page(page);
- page_cache_release(page);
fail:
ret = -1;
goto out;
@@ -311,7 +294,7 @@ static int __do_lo_send_write(struct fil
* do_lo_send_direct_write - helper for writing data to a loop device
*
* This is the fast, non-transforming version for backing filesystems which do
- * not implement the address space operations prepare_write and commit_write.
+ * not implement the address space operations write_begin and write_end.
* It uses the write file operation which should be present on all writeable
* filesystems.
*/
@@ -330,7 +313,7 @@ static int do_lo_send_direct_write(struc
* do_lo_send_write - helper for writing data to a loop device
*
* This is the slow, transforming version for filesystems which do not
- * implement the address space operations prepare_write and commit_write. It
+ * implement the address space operations write_begin and write_end. It
* uses the write file operation which should be present on all writeable
* filesystems.
*
@@ -762,7 +745,7 @@ static int loop_set_fd(struct loop_devic
*/
if (!file->f_op->sendfile)
goto out_putf;
- if (aops->prepare_write && aops->commit_write)
+ if (aops->prepare_write || aops->write_begin)
lo_flags |= LO_FLAGS_USE_AOPS;
if (!(lo_flags & LO_FLAGS_USE_AOPS) && !file->f_op->write)
lo_flags |= LO_FLAGS_READ_ONLY;
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -2697,53 +2697,30 @@ int __page_symlink(struct inode *inode,
{
struct address_space *mapping = inode->i_mapping;
struct page *page;
+ void *fsdata;
int err;
char *kaddr;
retry:
- err = -ENOMEM;
- page = find_or_create_page(mapping, 0, gfp_mask);
- if (!page)
- goto fail;
- err = mapping->a_ops->prepare_write(NULL, page, 0, len-1);
- if (err == AOP_TRUNCATED_PAGE) {
- page_cache_release(page);
- goto retry;
- }
+ err = pagecache_write_begin(NULL, mapping, 0, PAGE_CACHE_SIZE,
+ AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
if (err)
- goto fail_map;
+ goto fail;
+
kaddr = kmap_atomic(page, KM_USER0);
memcpy(kaddr, symname, len-1);
+ memset(kaddr+len-1, 0, PAGE_CACHE_SIZE-(len-1));
kunmap_atomic(kaddr, KM_USER0);
- err = mapping->a_ops->commit_write(NULL, page, 0, len-1);
- if (err == AOP_TRUNCATED_PAGE) {
- page_cache_release(page);
- goto retry;
- }
- if (err)
- goto fail_map;
- /*
- * Notice that we are _not_ going to block here - end of page is
- * unmapped, so this will only try to map the rest of page, see
- * that it is unmapped (typically even will not look into inode -
- * ->i_size will be enough for everything) and zero it out.
- * OTOH it's obviously correct and should make the page up-to-date.
- */
- if (!PageUptodate(page)) {
- err = mapping->a_ops->readpage(NULL, page);
- if (err != AOP_TRUNCATED_PAGE)
- wait_on_page_locked(page);
- } else {
- unlock_page(page);
- }
- page_cache_release(page);
+
+ err = pagecache_write_end(NULL, mapping, 0, PAGE_CACHE_SIZE, PAGE_CACHE_SIZE,
+ page, fsdata);
if (err < 0)
goto fail;
+ if (err < PAGE_CACHE_SIZE)
+ goto retry;
+
mark_inode_dirty(inode);
return 0;
-fail_map:
- unlock_page(page);
- page_cache_release(page);
fail:
return err;
}
Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c
+++ linux-2.6/fs/splice.c
@@ -554,7 +554,7 @@ static int pipe_to_file(struct pipe_inod
struct address_space *mapping = file->f_mapping;
unsigned int offset, this_len;
struct page *page;
- pgoff_t index;
+ void *fsdata;
int ret;
/*
@@ -564,49 +564,16 @@ static int pipe_to_file(struct pipe_inod
if (unlikely(ret))
return ret;
- index = sd->pos >> PAGE_CACHE_SHIFT;
offset = sd->pos & ~PAGE_CACHE_MASK;
this_len = sd->len;
if (this_len + offset > PAGE_CACHE_SIZE)
this_len = PAGE_CACHE_SIZE - offset;
-find_page:
- page = find_lock_page(mapping, index);
- if (!page) {
- ret = -ENOMEM;
- page = page_cache_alloc_cold(mapping);
- if (unlikely(!page))
- goto out_ret;
-
- /*
- * This will also lock the page
- */
- ret = add_to_page_cache_lru(page, mapping, index,
- GFP_KERNEL);
- if (unlikely(ret))
- goto out;
- }
-
- ret = mapping->a_ops->prepare_write(file, page, offset, offset+this_len);
- if (unlikely(ret)) {
- loff_t isize = i_size_read(mapping->host);
-
- if (ret != AOP_TRUNCATED_PAGE)
- unlock_page(page);
- page_cache_release(page);
- if (ret == AOP_TRUNCATED_PAGE)
- goto find_page;
-
- /*
- * prepare_write() may have instantiated a few blocks
- * outside i_size. Trim these off again.
- */
- if (sd->pos + this_len > isize)
- vmtruncate(mapping->host, isize);
-
- goto out_ret;
- }
+ ret = pagecache_write_begin(file, mapping, sd->pos, sd->len,
+ AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+ if (unlikely(ret))
+ goto out;
if (buf->page != page) {
/*
@@ -616,35 +583,14 @@ find_page:
char *dst = kmap_atomic(page, KM_USER1);
memcpy(dst + offset, src + buf->offset, this_len);
- flush_dcache_page(page);
kunmap_atomic(dst, KM_USER1);
buf->ops->unmap(pipe, buf, src);
}
- ret = mapping->a_ops->commit_write(file, page, offset, offset+this_len);
- if (ret) {
- if (ret == AOP_TRUNCATED_PAGE) {
- page_cache_release(page);
- goto find_page;
- }
- if (ret < 0)
- goto out;
- /*
- * Partial write has happened, so 'ret' already initialized by
- * number of bytes written, Where is nothing we have to do here.
- */
- } else
- ret = this_len;
- /*
- * Return the number of bytes written and mark page as
- * accessed, we are now done!
- */
- mark_page_accessed(page);
- balance_dirty_pages_ratelimited(mapping);
+ ret = pagecache_write_end(file, mapping, sd->pos, sd->len, sd->len, page, fsdata);
+
out:
- page_cache_release(page);
- unlock_page(page);
-out_ret:
+
return ret;
}
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking
+++ linux-2.6/Documentation/filesystems/Locking
@@ -178,15 +178,18 @@ prototypes:
locking rules:
All except set_page_dirty may block
- BKL PageLocked(page)
+ BKL PageLocked(page) i_sem
writepage: no yes, unlocks (see below)
readpage: no yes, unlocks
sync_page: no maybe
writepages: no
set_page_dirty no no
readpages: no
-prepare_write: no yes
-commit_write: no yes
+prepare_write: no yes yes
+commit_write: no yes yes
+write_begin: no locks the page yes
+write_end: no yes, unlocks yes
+perform_write: no n/a yes
bmap: yes
invalidatepage: no yes
releasepage: no yes
Index: linux-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- linux-2.6.orig/Documentation/filesystems/vfs.txt
+++ linux-2.6/Documentation/filesystems/vfs.txt
@@ -534,6 +534,12 @@ struct address_space_operations {
struct list_head *pages, unsigned nr_pages);
int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+ int (*write_begin)(struct file *, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata);
+ int (*write_end)(struct file *, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
int (*invalidatepage) (struct page *, unsigned long);
int (*releasepage) (struct page *, int);
@@ -629,6 +635,45 @@ struct address_space_operations {
operations. It should avoid returning an error if possible -
errors should have been handled by prepare_write.
+ write_begin: This is intended as a replacement for prepare_write. The
+ key differences being that:
+ - it returns a locked page (in *pagep) rather than being
+ given a pre locked page;
+ - it must be able to cope with short writes (where the
+ length passed to write_begin is greater than the number
+ of bytes copied into the page).
+
+ Called by the generic buffered write code to ask the filesystem to
+ prepare to write len bytes at the given offset in the file. The
+ address_space should check that the write will be able to complete,
+ by allocating space if necessary and doing any other internal
+ housekeeping. If the write will update parts of any basic-blocks on
+ storage, then those blocks should be pre-read (if they haven't been
+ read already) so that the updated blocks can be written out properly.
+
+ The filesystem must return the locked pagecache page for the specified
+ offset, in *pagep, for the caller to write into.
+
+ flags is a field for AOP_FLAG_xxx flags, described in
+ include/linux/fs.h.
+
+ A void * may be returned in fsdata, which then gets passed into
+ write_end.
+
+ Returns 0 on success; < 0 on failure (which is the error code), in
+ which case write_end is not called.
+
+ write_end: After a successful write_begin, and data copy, write_end must
+ be called. len is the original len passed to write_begin, and copied
+ is the amount that was able to be copied (copied == len is always true
+ if write_begin was called with the AOP_FLAG_UNINTERRUPTIBLE flag).
+
+ The filesystem must take care of unlocking the page and releasing it
+ refcount, and updating i_size.
+
+ Returns < 0 on failure, otherwise the number of bytes (<= 'copied')
+ that were able to be copied into pagecache.
+
bmap: called by the VFS to map a logical block offset within object to
physical block number. This method is used by the FIBMAP
ioctl and for working with swap-files. To be able to swap to
Index: linux-2.6/fs/revoked_inode.c
===================================================================
--- linux-2.6.orig/fs/revoked_inode.c
+++ linux-2.6/fs/revoked_inode.c
@@ -384,6 +384,20 @@ static int revoked_commit_write(struct f
return -EIO;
}
+static int revoked_write_begin(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata)
+{
+ return -EIO;
+}
+
+static int revoked_write_end(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata)
+{
+ return -EIO;
+}
+
static ssize_t revoked_direct_IO(int rw, struct kiocb *iocb,
const struct iovec *iov, loff_t offset,
unsigned long nr_segs)
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 13/41] mm: restore KERNEL_DS optimisations
[not found] <20070524052844.860329000@suse.de>
` (11 preceding siblings ...)
2007-05-25 12:21 ` [patch 12/41] fs: introduce write_begin, write_end, and perform_write aops npiggin
@ 2007-05-25 12:21 ` npiggin
12 siblings, 0 replies; 24+ messages in thread
From: npiggin @ 2007-05-25 12:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management
[-- Attachment #1: fs-kernel_ds-opt.patch --]
[-- Type: text/plain, Size: 1797 bytes --]
Restore the KERNEL_DS optimisation, especially helpful to the 2copy write
path.
This may be a pretty questionable gain in most cases, especially after the
legacy 2copy write path is removed, but it doesn't cost much.
Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
mm/filemap.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2123,7 +2123,7 @@ static ssize_t generic_perform_write_2co
* cannot take a pagefault with the destination page locked.
* So pin the source page to copy it.
*/
- if (!PageUptodate(page)) {
+ if (!PageUptodate(page) && !segment_eq(get_fs(), KERNEL_DS)) {
unlock_page(page);
src_page = alloc_page(GFP_KERNEL);
@@ -2248,6 +2248,13 @@ static ssize_t generic_perform_write(str
const struct address_space_operations *a_ops = mapping->a_ops;
long status = 0;
ssize_t written = 0;
+ unsigned int flags = 0;
+
+ /*
+ * Copies from kernel address space cannot fail (NFSD is a big user).
+ */
+ if (segment_eq(get_fs(), KERNEL_DS))
+ flags |= AOP_FLAG_UNINTERRUPTIBLE;
do {
struct page *page;
@@ -2279,7 +2286,7 @@ again:
break;
}
- status = a_ops->write_begin(file, mapping, pos, bytes, 0,
+ status = a_ops->write_begin(file, mapping, pos, bytes, flags,
&page, &fsdata);
if (unlikely(status))
break;
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 12/41] fs: introduce write_begin, write_end, and perform_write aops
2007-05-25 12:21 ` [patch 12/41] fs: introduce write_begin, write_end, and perform_write aops npiggin
@ 2007-05-31 4:30 ` Andrew Morton
2007-05-31 4:43 ` Nick Piggin
0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2007-05-31 4:30 UTC (permalink / raw)
To: npiggin; +Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management
On Fri, 25 May 2007 22:21:56 +1000 npiggin@suse.de wrote:
> These are intended to replace prepare_write and commit_write with more
> flexible alternatives that are also able to avoid the buffered write
> deadlock problems efficiently (which prepare_write is unable to do).
It doesn't like LTP's vmsplice01:
------------[ cut here ]------------
kernel BUG at fs/buffer.c:1829!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU: 0
EIP: 0060:[<c01a2938>] Not tainted VLI
EFLAGS: 00010206 (2.6.22-rc3-mm1 #1)
EIP is at __block_prepare_write+0x348/0x360
eax: 4000081d ebx: c176f10c ecx: 000001f0 edx: c176f10c
esi: df877670 edi: deeae3f0 ebp: de89fdc0 esp: de89fd60
ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
Process vmsplice01 (pid: 15759, ti=de89e000 task=c3ac1100 task.ti=de89e000)
Stack: c0140083 de89fd7c 00000046 00000000 c2fffe20 000001f0 c176f10c deeae3f0
c017c982 c01ee9b7 df877668 c01ee9b7 df877670 c345f300 df877684 de89fdcc
c01ee9eb c176f10c de89fdc4 c015c7ee deeae504 c176f10c df877670 deeae3f0
Call Trace:
[<c0103e1a>] show_trace_log_lvl+0x1a/0x30
[<c0103ed8>] show_stack_log_lvl+0xa8/0xe0
[<c01040f9>] show_registers+0x1e9/0x2f0
[<c010430f>] die+0x10f/0x240
[<c01044d1>] do_trap+0x91/0xc0
[<c0104889>] do_invalid_op+0x89/0xa0
[<c03f44ca>] error_code+0x72/0x78
[<c01a29d9>] block_write_begin+0x49/0xd0
[<c01c9a27>] ext3_write_begin+0xb7/0x190
[<c015e77f>] pagecache_write_begin+0x4f/0x150
[<c019f02b>] pipe_to_file+0x8b/0x140
[<c019ea00>] __splice_from_pipe+0x70/0x250
[<c019ec28>] splice_from_pipe+0x48/0x70
[<c019eef4>] generic_file_splice_write+0x54/0x100
[<c019e91f>] do_splice_from+0x5f/0x80
[<c019fd84>] sys_splice+0x164/0x200
[<c0102d8e>] sysenter_past_esp+0x5f/0x99
=======================
INFO: lockdep is turned off.
Code: 49 c0 89 44 24 0c 89 7c 24 08 89 5c 24 04 c7 04 24 ac 2a 49 c0 e8 e9 ff f7 ff e8 b4 21 f6 ff 8b 4d f0 e9 a6 fe ff ff 0f 0b eb fe <0f> 0b eb fe 8d 74 26 00 0f 0b eb fe 0f 0b eb fe 90 8d b4 26 00
EIP: [<c01a2938>] __block_prepare_write+0x348/0x360 SS:ESP 0068:de89fd60
That's
BUG_ON(to > PAGE_CACHE_SIZE);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 12/41] fs: introduce write_begin, write_end, and perform_write aops
2007-05-31 4:30 ` Andrew Morton
@ 2007-05-31 4:43 ` Nick Piggin
2007-05-31 4:52 ` Andrew Morton
0 siblings, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2007-05-31 4:43 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management
On Wed, May 30, 2007 at 09:30:35PM -0700, Andrew Morton wrote:
> On Fri, 25 May 2007 22:21:56 +1000 npiggin@suse.de wrote:
>
> > These are intended to replace prepare_write and commit_write with more
> > flexible alternatives that are also able to avoid the buffered write
> > deadlock problems efficiently (which prepare_write is unable to do).
>
> It doesn't like LTP's vmsplice01:
>
> ------------[ cut here ]------------
> kernel BUG at fs/buffer.c:1829!
> invalid opcode: 0000 [#1]
> SMP
> Modules linked in:
> CPU: 0
> EIP: 0060:[<c01a2938>] Not tainted VLI
> EFLAGS: 00010206 (2.6.22-rc3-mm1 #1)
> EIP is at __block_prepare_write+0x348/0x360
> eax: 4000081d ebx: c176f10c ecx: 000001f0 edx: c176f10c
> esi: df877670 edi: deeae3f0 ebp: de89fdc0 esp: de89fd60
> ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
> Process vmsplice01 (pid: 15759, ti=de89e000 task=c3ac1100 task.ti=de89e000)
> Stack: c0140083 de89fd7c 00000046 00000000 c2fffe20 000001f0 c176f10c deeae3f0
> c017c982 c01ee9b7 df877668 c01ee9b7 df877670 c345f300 df877684 de89fdcc
> c01ee9eb c176f10c de89fdc4 c015c7ee deeae504 c176f10c df877670 deeae3f0
> Call Trace:
> [<c0103e1a>] show_trace_log_lvl+0x1a/0x30
> [<c0103ed8>] show_stack_log_lvl+0xa8/0xe0
> [<c01040f9>] show_registers+0x1e9/0x2f0
> [<c010430f>] die+0x10f/0x240
> [<c01044d1>] do_trap+0x91/0xc0
> [<c0104889>] do_invalid_op+0x89/0xa0
> [<c03f44ca>] error_code+0x72/0x78
> [<c01a29d9>] block_write_begin+0x49/0xd0
> [<c01c9a27>] ext3_write_begin+0xb7/0x190
> [<c015e77f>] pagecache_write_begin+0x4f/0x150
> [<c019f02b>] pipe_to_file+0x8b/0x140
> [<c019ea00>] __splice_from_pipe+0x70/0x250
> [<c019ec28>] splice_from_pipe+0x48/0x70
> [<c019eef4>] generic_file_splice_write+0x54/0x100
> [<c019e91f>] do_splice_from+0x5f/0x80
> [<c019fd84>] sys_splice+0x164/0x200
> [<c0102d8e>] sysenter_past_esp+0x5f/0x99
> =======================
> INFO: lockdep is turned off.
> Code: 49 c0 89 44 24 0c 89 7c 24 08 89 5c 24 04 c7 04 24 ac 2a 49 c0 e8 e9 ff f7 ff e8 b4 21 f6 ff 8b 4d f0 e9 a6 fe ff ff 0f 0b eb fe <0f> 0b eb fe 8d 74 26 00 0f 0b eb fe 0f 0b eb fe 90 8d b4 26 00
> EIP: [<c01a2938>] __block_prepare_write+0x348/0x360 SS:ESP 0068:de89fd60
>
>
> That's
>
> BUG_ON(to > PAGE_CACHE_SIZE);
Thanks. Hmm, sorry I didn't test splice much. Does this fix it?
---
Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c
+++ linux-2.6/fs/splice.c
@@ -570,7 +570,7 @@ static int pipe_to_file(struct pipe_inod
if (this_len + offset > PAGE_CACHE_SIZE)
this_len = PAGE_CACHE_SIZE - offset;
- ret = pagecache_write_begin(file, mapping, sd->pos, sd->len,
+ ret = pagecache_write_begin(file, mapping, sd->pos, this_len,
AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
if (unlikely(ret))
goto out;
@@ -587,7 +587,7 @@ static int pipe_to_file(struct pipe_inod
buf->ops->unmap(pipe, buf, src);
}
- ret = pagecache_write_end(file, mapping, sd->pos, sd->len, sd->len, page, fsdata);
+ ret = pagecache_write_end(file, mapping, sd->pos, this_len, this_len, page, fsdata);
out:
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 12/41] fs: introduce write_begin, write_end, and perform_write aops
2007-05-31 4:43 ` Nick Piggin
@ 2007-05-31 4:52 ` Andrew Morton
2007-05-31 4:57 ` Nick Piggin
0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2007-05-31 4:52 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management
On Thu, 31 May 2007 06:43:27 +0200 Nick Piggin <npiggin@suse.de> wrote:
> > INFO: lockdep is turned off.
> > Code: 49 c0 89 44 24 0c 89 7c 24 08 89 5c 24 04 c7 04 24 ac 2a 49 c0 e8 e9 ff f7 ff e8 b4 21 f6 ff 8b 4d f0 e9 a6 fe ff ff 0f 0b eb fe <0f> 0b eb fe 8d 74 26 00 0f 0b eb fe 0f 0b eb fe 90 8d b4 26 00
> > EIP: [<c01a2938>] __block_prepare_write+0x348/0x360 SS:ESP 0068:de89fd60
> >
> >
> > That's
> >
> > BUG_ON(to > PAGE_CACHE_SIZE);
>
>
> Thanks. Hmm, sorry I didn't test splice much. Does this fix it?
Don't know - I shelved the patches.
Given the great pile of build errors, I think we need the next rev.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 12/41] fs: introduce write_begin, write_end, and perform_write aops
2007-05-31 4:52 ` Andrew Morton
@ 2007-05-31 4:57 ` Nick Piggin
2007-05-31 5:11 ` Andrew Morton
0 siblings, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2007-05-31 4:57 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management
On Wed, May 30, 2007 at 09:52:31PM -0700, Andrew Morton wrote:
> On Thu, 31 May 2007 06:43:27 +0200 Nick Piggin <npiggin@suse.de> wrote:
>
> > > INFO: lockdep is turned off.
> > > Code: 49 c0 89 44 24 0c 89 7c 24 08 89 5c 24 04 c7 04 24 ac 2a 49 c0 e8 e9 ff f7 ff e8 b4 21 f6 ff 8b 4d f0 e9 a6 fe ff ff 0f 0b eb fe <0f> 0b eb fe 8d 74 26 00 0f 0b eb fe 0f 0b eb fe 90 8d b4 26 00
> > > EIP: [<c01a2938>] __block_prepare_write+0x348/0x360 SS:ESP 0068:de89fd60
> > >
> > >
> > > That's
> > >
> > > BUG_ON(to > PAGE_CACHE_SIZE);
> >
> >
> > Thanks. Hmm, sorry I didn't test splice much. Does this fix it?
>
> Don't know - I shelved the patches.
Oh, that didn't last long :P
> Given the great pile of build errors, I think we need the next rev.
I was working on bringing some of the others uptodate (hopefully
before you did another release). There is not much point in doing
that if they don't get merged because the patches just break again.
Were there build errors in any core code or converted filesystems?
AFAIKS it was just in reiser4 and a couple of the "cont" filesystems
that didn't get converted yet.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 12/41] fs: introduce write_begin, write_end, and perform_write aops
2007-05-31 4:57 ` Nick Piggin
@ 2007-05-31 5:11 ` Andrew Morton
2007-05-31 5:15 ` Nick Piggin
0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2007-05-31 5:11 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management
On Thu, 31 May 2007 06:57:54 +0200 Nick Piggin <npiggin@suse.de> wrote:
> > Don't know - I shelved the patches.
>
> Oh, that didn't last long :P
I have a heap of other stuff to get out the door. If I have to
do just two bisects then it's 4AM and I give up then I have to repull
everything and we're back to square one.
Fortunately, I didn't need to do a bisect this time. That's unusual.
>
> > Given the great pile of build errors, I think we need the next rev.
>
> I was working on bringing some of the others uptodate (hopefully
> before you did another release). There is not much point in doing
> that if they don't get merged because the patches just break again.
There's not much point in sending build-busting patches either. Lots
of people run allmodconfig.
My sympathy for broken patches is limited - you should see what happens
over here ;)
I can do you a rollup with those patches reinstated after I've done rc3-mm1
if you like.
> Were there build errors in any core code or converted filesystems?
> AFAIKS it was just in reiser4 and a couple of the "cont" filesystems
> that didn't get converted yet.
The _cont filesystems, reiser4 and that revoke warning.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 12/41] fs: introduce write_begin, write_end, and perform_write aops
2007-05-31 5:11 ` Andrew Morton
@ 2007-05-31 5:15 ` Nick Piggin
2007-05-31 7:05 ` Andrew Morton
0 siblings, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2007-05-31 5:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management
On Wed, May 30, 2007 at 10:11:21PM -0700, Andrew Morton wrote:
> On Thu, 31 May 2007 06:57:54 +0200 Nick Piggin <npiggin@suse.de> wrote:
>
> > > Don't know - I shelved the patches.
> >
> > Oh, that didn't last long :P
>
> I have a heap of other stuff to get out the door. If I have to
> do just two bisects then it's 4AM and I give up then I have to repull
> everything and we're back to square one.
>
> Fortunately, I didn't need to do a bisect this time. That's unusual.
>
> >
> > > Given the great pile of build errors, I think we need the next rev.
> >
> > I was working on bringing some of the others uptodate (hopefully
> > before you did another release). There is not much point in doing
> > that if they don't get merged because the patches just break again.
>
> There's not much point in sending build-busting patches either. Lots
> of people run allmodconfig.
>
> My sympathy for broken patches is limited - you should see what happens
> over here ;)
>
> I can do you a rollup with those patches reinstated after I've done rc3-mm1
> if you like.
OK, if you are planning to get a release out now, that's fair
enough.
If you can send that rollup, it would be good. I could try getting
everything to compile and do some more testing on it too.
> > Were there build errors in any core code or converted filesystems?
> > AFAIKS it was just in reiser4 and a couple of the "cont" filesystems
> > that didn't get converted yet.
>
> The _cont filesystems, reiser4 and that revoke warning.
OK, thanks.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 12/41] fs: introduce write_begin, write_end, and perform_write aops
2007-05-31 5:15 ` Nick Piggin
@ 2007-05-31 7:05 ` Andrew Morton
2007-06-01 1:18 ` Nick Piggin
0 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2007-05-31 7:05 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management
On Thu, 31 May 2007 07:15:39 +0200 Nick Piggin <npiggin@suse.de> wrote:
> If you can send that rollup, it would be good. I could try getting
> everything to compile and do some more testing on it too.
Single patch against 2.6.22-rc3: http://userweb.kernel.org/~akpm/np.gz
broken-out: ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/broken-out-2007-05-30-09-30.tar.gz
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 12/41] fs: introduce write_begin, write_end, and perform_write aops
2007-05-31 7:05 ` Andrew Morton
@ 2007-06-01 1:18 ` Nick Piggin
0 siblings, 0 replies; 24+ messages in thread
From: Nick Piggin @ 2007-06-01 1:18 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, Mark Fasheh, Linux Memory Management
On Thu, May 31, 2007 at 12:05:39AM -0700, Andrew Morton wrote:
> On Thu, 31 May 2007 07:15:39 +0200 Nick Piggin <npiggin@suse.de> wrote:
>
> > If you can send that rollup, it would be good. I could try getting
> > everything to compile and do some more testing on it too.
>
>
> Single patch against 2.6.22-rc3: http://userweb.kernel.org/~akpm/np.gz
>
> broken-out: ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/broken-out-2007-05-30-09-30.tar.gz
Thanks, I'll get onto it.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 02/41] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6
2007-05-14 19:06 ` Dave Jones
@ 2007-05-14 22:45 ` Nick Piggin
0 siblings, 0 replies; 24+ messages in thread
From: Nick Piggin @ 2007-05-14 22:45 UTC (permalink / raw)
To: Dave Jones
Cc: Andrew Morton, Andrew Morton, linux-fsdevel, Linux Memory Management
On Mon, May 14, 2007 at 03:06:35PM -0400, Dave Jones wrote:
> On Mon, May 14, 2007 at 04:06:21PM +1000, npiggin@suse.de wrote:
> > This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we
> > also revert.
>
> changes like this play havoc with git-bisect. If you must revert stuff
> before patching new code in, revert it all in a single diff.
It's all going to still boot and run...
But I guess there is no harm in merging them, however I'll send them
to Andrew singularly and he can merge them if he likes.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [patch 02/41] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6
2007-05-14 6:06 ` [patch 02/41] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6 npiggin, Andrew Morton
@ 2007-05-14 19:06 ` Dave Jones
2007-05-14 22:45 ` Nick Piggin
0 siblings, 1 reply; 24+ messages in thread
From: Dave Jones @ 2007-05-14 19:06 UTC (permalink / raw)
To: npiggin, Andrew Morton
Cc: Andrew Morton, linux-fsdevel, Linux Memory Management
On Mon, May 14, 2007 at 04:06:21PM +1000, npiggin@suse.de wrote:
> This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we
> also revert.
changes like this play havoc with git-bisect. If you must revert stuff
before patching new code in, revert it all in a single diff.
Dave
--
http://www.codemonkey.org.uk
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* [patch 02/41] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6
[not found] <20070514060619.689648000@wotan.suse.de>
@ 2007-05-14 6:06 ` npiggin, Andrew Morton
2007-05-14 19:06 ` Dave Jones
0 siblings, 1 reply; 24+ messages in thread
From: npiggin, Andrew Morton @ 2007-05-14 6:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, Linux Memory Management, Andrew Morton
[-- Attachment #1: mm-revert-buffered-write-zero-length-iov.patch --]
[-- Type: text/plain, Size: 1848 bytes --]
This was a bugfix against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we
also revert.
Cc: Linux Memory Management <linux-mm@kvack.org>
Cc: Linux Filesystems <linux-fsdevel@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
mm/filemap.c | 9 +--------
mm/filemap.h | 4 ++--
2 files changed, 3 insertions(+), 10 deletions(-)
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1957,12 +1957,6 @@ generic_file_buffered_write(struct kiocb
break;
}
- if (unlikely(bytes == 0)) {
- status = 0;
- copied = 0;
- goto zero_length_segment;
- }
-
status = a_ops->prepare_write(file, page, offset, offset+bytes);
if (unlikely(status)) {
loff_t isize = i_size_read(inode);
@@ -1992,8 +1986,7 @@ generic_file_buffered_write(struct kiocb
page_cache_release(page);
continue;
}
-zero_length_segment:
- if (likely(copied >= 0)) {
+ if (likely(copied > 0)) {
if (!status)
status = copied;
Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -87,7 +87,7 @@ filemap_set_next_iovec(const struct iove
const struct iovec *iov = *iovp;
size_t base = *basep;
- do {
+ while (bytes) {
int copy = min(bytes, iov->iov_len - base);
bytes -= copy;
@@ -96,7 +96,7 @@ filemap_set_next_iovec(const struct iove
iov++;
base = 0;
}
- } while (bytes);
+ }
*iovp = iov;
*basep = base;
}
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2007-06-01 1:18 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20070524052844.860329000@suse.de>
2007-05-25 12:21 ` [patch 01/41] mm: revert KERNEL_DS buffered write optimisation npiggin
2007-05-25 12:21 ` [patch 02/41] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6 npiggin, Andrew Morton
2007-05-25 12:21 ` [patch 03/41] Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83 npiggin, Andrew Morton
2007-05-25 12:21 ` [patch 04/41] mm: clean up buffered write code npiggin, Andrew Morton
2007-05-25 12:21 ` [patch 05/41] mm: debug write deadlocks npiggin
2007-05-25 12:21 ` [patch 06/41] mm: trim more holes npiggin
2007-05-25 12:21 ` [patch 07/41] mm: buffered write cleanup npiggin
2007-05-25 12:21 ` [patch 08/41] mm: write iovec cleanup npiggin
2007-05-25 12:21 ` [patch 09/41] mm: fix pagecache write deadlocks npiggin
2007-05-25 12:21 ` [patch 10/41] mm: buffered write iterator npiggin
2007-05-25 12:21 ` [patch 11/41] fs: fix data-loss on error npiggin
2007-05-25 12:21 ` [patch 12/41] fs: introduce write_begin, write_end, and perform_write aops npiggin
2007-05-31 4:30 ` Andrew Morton
2007-05-31 4:43 ` Nick Piggin
2007-05-31 4:52 ` Andrew Morton
2007-05-31 4:57 ` Nick Piggin
2007-05-31 5:11 ` Andrew Morton
2007-05-31 5:15 ` Nick Piggin
2007-05-31 7:05 ` Andrew Morton
2007-06-01 1:18 ` Nick Piggin
2007-05-25 12:21 ` [patch 13/41] mm: restore KERNEL_DS optimisations npiggin
[not found] <20070514060619.689648000@wotan.suse.de>
2007-05-14 6:06 ` [patch 02/41] Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6 npiggin, Andrew Morton
2007-05-14 19:06 ` Dave Jones
2007-05-14 22:45 ` Nick Piggin
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox