linux-mm.kvack.org archive mirror
From: Andrew Morton <akpm@osdl.org>
To: Nick Piggin <nickpiggin@yahoo.com.au>,
	Nick Piggin <npiggin@suse.de>,
	Linux Memory Management <linux-mm@kvack.org>,
	Linux Kernel <linux-kernel@vger.kernel.org>
Subject: [patch 6/6] fix pagecache write deadlocks
Date: Tue, 10 Oct 2006 23:18:08 -0700
Message-ID: <20061010231808.2d90bc5f.akpm@osdl.org>
In-Reply-To: <20061010231514.c1da7355.akpm@osdl.org>

This is half-written and won't work.

The idea is to modify the core write() code so that it won't take a pagefault
while holding a lock on the pagecache page.

- Instead of copy_from_user(), use inc_preempt_count() and
  copy_from_user_inatomic().

- If the copy_from_user_inatomic() hits a pagefault, it'll return a short
  copy.

  - So zero out the remainder of the pagecache page (the uncopied bit).

    - but only if the page is not uptodate.

  - commit_write()

  - unlock_page()

  - adjust various pointers and counters

  - go back and try to fault the page in again, redo the lock_page,
    prepare_write, copy_from_user_inatomic(), etc.

  - After a certain number of retries, someone is being silly: give up.
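
The retry sequence above can be modelled in userspace.  What follows is only a
sketch under stated assumptions, not the kernel code: `sim_user_buf`,
`sim_copy_inatomic()`, `sim_fault_in()`, `sim_write_to_page()` and the retry
limit `SIM_MAX_RETRIES` are all hypothetical names invented for illustration,
and the lock_page/prepare_write/commit_write/unlock_page steps appear only as
comments.  A "pagefault" is modelled as the atomic copy stopping at the first
non-resident byte and returning a short copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_MAX_RETRIES 4	/* hypothetical; the patch does not pick a limit */

/* Hypothetical model of a user buffer: only part of it is resident. */
struct sim_user_buf {
	const char *data;
	size_t resident;	/* bytes copyable without a pagefault */
	size_t valid;		/* bytes that can ever be faulted in */
};

/*
 * Models an atomic copy: stops at the first non-resident byte.
 * Returns the number of bytes NOT copied.
 */
static size_t sim_copy_inatomic(char *dst, const struct sim_user_buf *src,
				size_t src_off, size_t n)
{
	size_t avail = src->resident > src_off ? src->resident - src_off : 0;
	size_t ok = n < avail ? n : avail;

	memcpy(dst, src->data + src_off, ok);
	return n - ok;
}

/* Models faulting the user page in while no page lock is held. */
static void sim_fault_in(struct sim_user_buf *src, size_t upto)
{
	size_t target = upto < src->valid ? upto : src->valid;

	if (src->resident < target)
		src->resident = target;
}

/* Returns the number of bytes written, or -1 after too many retries. */
static long sim_write_to_page(char *page, int uptodate, size_t offset,
			      struct sim_user_buf *src, size_t bytes)
{
	size_t done = 0;
	int retries = 0;

	assert(src->resident <= src->valid);
	while (done < bytes) {
		size_t want = bytes - done;
		size_t left, copied;

		/* lock_page() and prepare_write() would go here */
		left = sim_copy_inatomic(page + offset + done, src, done, want);
		copied = want - left;
		/* short copy: zero the uncopied bit, unless page is uptodate */
		if (left && !uptodate)
			memset(page + offset + done + copied, 0, left);
		/* commit_write() and unlock_page() would go here */

		done += copied;		/* adjust pointers and counters */
		if (left) {
			if (++retries > SIM_MAX_RETRIES)
				return -1;	/* someone is being silly */
			sim_fault_in(src, done + left);
		}
	}
	return (long)done;
}
```

In this model the retry counter is never reset when a pass makes progress.
Whether the real code should reset it is a design question the sketch does not
try to answer.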


Now, the design objective here isn't just to fix the deadlock.  It's to be
able to copy multiple iovec segments into the pagecache page within a single
prepare_write/commit_write pair.  But to do that, we'll need to prefault them.

That could get complex.  Walk across the segments, touching each user page
until we reach the point where we see that this iovec segment doesn't fall
into the target page.
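
As a rough illustration of that walk, here is a userspace sketch, not kernel
code.  `touch_range()` and `prefault_for_page()` are hypothetical helpers, the
4096-byte `SKETCH_PAGE_SIZE` is an assumption, and "touching" a user page is
modelled as a volatile read of one byte in each page the range overlaps.  The
walk stops once the bytes destined for the target page are covered, so later
segments are left alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for the sketch */

/* Read one byte from every page that [p, p + len) overlaps. */
static void touch_range(const char *p, size_t len)
{
	const volatile char *vp = p;
	size_t i = 0;

	while (i < len) {
		(void)vp[i];
		/* advance to the start of the next page */
		i += SKETCH_PAGE_SIZE -
		     ((uintptr_t)(p + i) & (SKETCH_PAGE_SIZE - 1));
	}
}

/*
 * Walk the iovec starting `base` bytes into iov[0], touching user memory
 * until `bytes` bytes (the part that lands in the target page) are covered
 * or the segments run out.  Returns the number of bytes covered.
 */
static size_t prefault_for_page(const struct iovec *iov, size_t nr_segs,
				size_t base, size_t bytes)
{
	size_t covered = 0;
	size_t seg;

	assert(nr_segs == 0 || base <= iov[0].iov_len);
	for (seg = 0; seg < nr_segs && covered < bytes; seg++) {
		size_t avail = iov[seg].iov_len - base;
		size_t n = avail < bytes - covered ? avail : bytes - covered;

		touch_range((const char *)iov[seg].iov_base + base, n);
		covered += n;
		base = 0;	/* only the first segment starts mid-way */
	}
	return covered;
}
```

Zero-length segments fall out naturally here: they contribute nothing to
`covered` and the walk moves on to the next segment.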

Alternatively, only prefault the *present* iovec segment.  The code as
designed will handle pagefaults against the user's pages quite happily.  But
is it efficient?  Needs thought.

(I think we will end up with quite a bit of dead code as a result of this
exercise - some of the fancy user-copying inlines.  Needs checking when the
dust has settled.)


Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 mm/filemap.c |    7 ++--
 mm/filemap.h |   69 ++++++++++++++++++++++++++++++++++---------------
 2 files changed, 53 insertions(+), 23 deletions(-)

diff -puN mm/filemap.c~fix-pagecache-write-deadlocks mm/filemap.c
--- a/mm/filemap.c~fix-pagecache-write-deadlocks
+++ a/mm/filemap.c
@@ -2133,11 +2133,12 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 		if (likely(nr_segs == 1))
-			copied = filemap_copy_from_user(page, offset,
+			copied = filemap_copy_from_user_atomic(page, offset,
 							buf, bytes);
 		else
-			copied = filemap_copy_from_user_iovec(page, offset,
-						cur_iov, iov_offset, bytes);
+			copied = filemap_copy_from_user_iovec_atomic(page,
+						offset, cur_iov, iov_offset,
+						bytes);
 		flush_dcache_page(page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (status == AOP_TRUNCATED_PAGE) {
diff -puN mm/filemap.h~fix-pagecache-write-deadlocks mm/filemap.h
--- a/mm/filemap.h~fix-pagecache-write-deadlocks
+++ a/mm/filemap.h
@@ -22,19 +22,19 @@ __filemap_copy_from_user_iovec_inatomic(
 
 /*
  * Copy as much as we can into the page and return the number of bytes which
- * were sucessfully copied.  If a fault is encountered then clear the page
- * out to (offset+bytes) and return the number of bytes which were copied.
+ * were sucessfully copied.  If a fault is encountered then return the number of
+ * bytes which were copied.
  *
- * NOTE: For this to work reliably we really want copy_from_user_inatomic_nocache
- * to *NOT* zero any tail of the buffer that it failed to copy.  If it does,
- * and if the following non-atomic copy succeeds, then there is a small window
- * where the target page contains neither the data before the write, nor the
- * data after the write (it contains zero).  A read at this time will see
- * data that is inconsistent with any ordering of the read and the write.
- * (This has been detected in practice).
+ * NOTE: For this to work reliably we really want
+ * copy_from_user_inatomic_nocache to *NOT* zero any tail of the buffer that it
+ * failed to copy.  If it does, and if the following non-atomic copy succeeds,
+ * then there is a small window where the target page contains neither the data
+ * before the write, nor the data after the write (it contains zero).  A read at
+ * this time will see data that is inconsistent with any ordering of the read
+ * and the write.  (This has been detected in practice).
  */
 static inline size_t
-filemap_copy_from_user(struct page *page, unsigned long offset,
+filemap_copy_from_user_atomic(struct page *page, unsigned long offset,
 			const char __user *buf, unsigned bytes)
 {
 	char *kaddr;
@@ -53,14 +53,28 @@ filemap_copy_from_user(struct page *page
 	return bytes - left;
 }
 
+static inline size_t
+filemap_copy_from_user_nonatomic(struct page *page, unsigned long offset,
+			const char __user *buf, unsigned bytes)
+{
+	int left;
+	char *kaddr;
+
+	kaddr = kmap(page);
+	left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+	kunmap(page);
+	return bytes - left;
+}
+
 /*
- * This has the same sideeffects and return value as filemap_copy_from_user().
+ * This has the same sideeffects and return value as
+ * filemap_copy_from_user_atomic().
  * The difference is that on a fault we need to memset the remainder of the
  * page (out to offset+bytes), to emulate filemap_copy_from_user()'s
  * single-segment behaviour.
  */
 static inline size_t
-filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
+filemap_copy_from_user_iovec_atomic(struct page *page, unsigned long offset,
 			const struct iovec *iov, size_t base, size_t bytes)
 {
 	char *kaddr;
@@ -70,14 +84,29 @@ filemap_copy_from_user_iovec(struct page
 	copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
 							 base, bytes);
 	kunmap_atomic(kaddr, KM_USER0);
-	if (copied != bytes) {
-		kaddr = kmap(page);
-		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
-								 base, bytes);
-		if (bytes - copied)
-			memset(kaddr + offset + copied, 0, bytes - copied);
-		kunmap(page);
-	}
+	return copied;
+}
+
+/*
+ * This has the same sideeffects and return value as
+ * filemap_copy_from_user_nonatomic().
+ * The difference is that on a fault we need to memset the remainder of the
+ * page (out to offset+bytes), to emulate filemap_copy_from_user_nonatomic()'s
+ * single-segment behaviour.
+ */
+static inline size_t
+filemap_copy_from_user_iovec_nonatomic(struct page *page, unsigned long offset,
+			const struct iovec *iov, size_t base, size_t bytes)
+{
+	char *kaddr;
+	size_t copied;
+
+	kaddr = kmap(page);
+	copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
+							 base, bytes);
+	if (bytes - copied)
+		memset(kaddr + offset + copied, 0, bytes - copied);
+	kunmap(page);
 	return copied;
 }
 
_
