* [patch] splice mmap_sem deadlock
@ 2007-09-28 16:00 Nick Piggin
2007-09-28 17:31 ` Jens Axboe
0 siblings, 1 reply; 21+ messages in thread
From: Nick Piggin @ 2007-09-28 16:00 UTC (permalink / raw)
To: Andrew Morton, Jens Axboe, Linux Memory Management List
I'm fairly sure this is the right patch... but the explicit comment has me
thinking I missed something? (there is also a down_write->fault deadlock
in the splice code in -mm, however when talking with Jens about that code,
we might have an idea for a different approach using preexisting vmas
rather than setting them up with splice -- so this patch is just for mainline)
mmap_sem cannot be taken recursively for read, due to the FIFO nature of the
rwsem, and the presence of possible write lockers.
process A process B
down_read(mmap_sem); [1]
get_user(); down_write(mmap_sem); [2]
-> page fault
down_read(mmap_sem); [3]
[1] will never be released until [3] can be taken and released, however:
[2] blocks on [1]; [3] blocks on [2].
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/fs/splice.c
===================================================================
--- linux-2.6.orig/fs/splice.c
+++ linux-2.6/fs/splice.c
@@ -1534,12 +1534,6 @@ static int get_iovec_page_array(const st
{
int buffers = 0, error = 0;
- /*
- * It's ok to take the mmap_sem for reading, even
- * across a "get_user()".
- */
- down_read(¤t->mm->mmap_sem);
-
while (nr_vecs) {
unsigned long off, npages;
void __user *base;
@@ -1583,9 +1577,11 @@ static int get_iovec_page_array(const st
if (npages > PIPE_BUFFERS - buffers)
npages = PIPE_BUFFERS - buffers;
+ down_read(¤t->mm->mmap_sem);
error = get_user_pages(current, current->mm,
(unsigned long) base, npages, 0, 0,
&pages[buffers], NULL);
+ up_read(¤t->mm->mmap_sem);
if (unlikely(error <= 0))
break;
@@ -1624,8 +1620,6 @@ static int get_iovec_page_array(const st
iov++;
}
- up_read(¤t->mm->mmap_sem);
-
if (buffers)
return buffers;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 21+ messages in thread* Re: [patch] splice mmap_sem deadlock 2007-09-28 16:00 [patch] splice mmap_sem deadlock Nick Piggin @ 2007-09-28 17:31 ` Jens Axboe 2007-09-28 18:10 ` Linus Torvalds 0 siblings, 1 reply; 21+ messages in thread From: Jens Axboe @ 2007-09-28 17:31 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, Linux Memory Management List, Linus Torvalds On Fri, Sep 28 2007, Nick Piggin wrote: > I'm fairly sure this is the right patch... but the explicit comment has me > thinking I missed something? (there is also a down_write->fault deadlock > in the splice code in -mm, however when talking with Jens about that code, > we might have an idea for a different approach using preexisting vmas > rather than setting them up with splice -- so this patch is just for mainline) > > > mmap_sem cannot be taken recursively for read, due to the FIFO nature of the > rwsem, and the presence of possible write lockers. > > process A process B > down_read(mmap_sem); [1] > get_user(); down_write(mmap_sem); [2] > -> page fault > down_read(mmap_sem); [3] > > [1] will never be released until [3] can be taken and released, however: > [2] blocks on [1]; [3] blocks on [2]. It does looks suspicious. It was actually Linus who originally suggested this approach and wrote that comment - Linus? > Signed-off-by: Nick Piggin <npiggin@suse.de> > > --- > Index: linux-2.6/fs/splice.c > =================================================================== > --- linux-2.6.orig/fs/splice.c > +++ linux-2.6/fs/splice.c > @@ -1534,12 +1534,6 @@ static int get_iovec_page_array(const st > { > int buffers = 0, error = 0; > > - /* > - * It's ok to take the mmap_sem for reading, even > - * across a "get_user()". > - */ > - down_read(¤t->mm->mmap_sem); > - > while (nr_vecs) { > unsigned long off, npages; > void __user *base; > @@ -1583,9 +1577,11 @@ static int get_iovec_page_array(const st > if (npages > PIPE_BUFFERS - buffers) > npages = PIPE_BUFFERS - buffers; > > + down_read(¤t->mm->mmap_sem); > error = get_user_pages(current, current->mm, > (unsigned long) base, npages, 0, 0, > &pages[buffers], NULL); > + up_read(¤t->mm->mmap_sem); > > if (unlikely(error <= 0)) > break; > @@ -1624,8 +1620,6 @@ static int get_iovec_page_array(const st > iov++; > } > > - up_read(¤t->mm->mmap_sem); > - > if (buffers) > return buffers; > -- Jens Axboe -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-09-28 17:31 ` Jens Axboe @ 2007-09-28 18:10 ` Linus Torvalds 2007-09-28 18:15 ` Jens Axboe 2007-09-29 13:08 ` Nick Piggin 0 siblings, 2 replies; 21+ messages in thread From: Linus Torvalds @ 2007-09-28 18:10 UTC (permalink / raw) To: Jens Axboe; +Cc: Nick Piggin, Andrew Morton, Linux Memory Management List On Fri, 28 Sep 2007, Jens Axboe wrote: > > It does looks suspicious. It was actually Linus who originally suggested > this approach and wrote that comment - Linus? Well, it used to be true, long time ago. It probably was still true in the original splice patch back way back when. But yeah, rwsemaphores broke that (very useful) trick in the name of fairness ;( Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-09-28 18:10 ` Linus Torvalds @ 2007-09-28 18:15 ` Jens Axboe 2007-09-28 18:23 ` Linus Torvalds 2007-09-29 13:08 ` Nick Piggin 1 sibling, 1 reply; 21+ messages in thread From: Jens Axboe @ 2007-09-28 18:15 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nick Piggin, Andrew Morton, Linux Memory Management List On Fri, Sep 28 2007, Linus Torvalds wrote: > > > On Fri, 28 Sep 2007, Jens Axboe wrote: > > > > It does looks suspicious. It was actually Linus who originally suggested > > this approach and wrote that comment - Linus? > > Well, it used to be true, long time ago. It probably was still true in the > original splice patch back way back when. But yeah, rwsemaphores broke > that (very useful) trick in the name of fairness ;( It actually looks like it was buggy from day 1 there, unfortunately. The code is from april 2006 and used down_read() even then. So can you apply Nicks patch, add my Acked-by: Jens Axboe <jens.axboe@oracle.com> if you want. -- Jens Axboe -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-09-28 18:15 ` Jens Axboe @ 2007-09-28 18:23 ` Linus Torvalds 2007-09-28 19:30 ` Jens Axboe 0 siblings, 1 reply; 21+ messages in thread From: Linus Torvalds @ 2007-09-28 18:23 UTC (permalink / raw) To: Jens Axboe; +Cc: Nick Piggin, Andrew Morton, Linux Memory Management List On Fri, 28 Sep 2007, Jens Axboe wrote: > > It actually looks like it was buggy from day 1 there, unfortunately. The > code is from april 2006 and used down_read() even then. I was thinking of my *original* patch from way back when. But that one didn't actually do any of that stuff so no, it wasn't from there. > So can you apply Nicks patch I don't even have it, I only have a quoted-corrupted version of it. I wasn't originally cc'd. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-09-28 18:23 ` Linus Torvalds @ 2007-09-28 19:30 ` Jens Axboe 2007-09-28 20:02 ` Linus Torvalds 0 siblings, 1 reply; 21+ messages in thread From: Jens Axboe @ 2007-09-28 19:30 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nick Piggin, Andrew Morton, Linux Memory Management List On Fri, Sep 28 2007, Linus Torvalds wrote: > > > On Fri, 28 Sep 2007, Jens Axboe wrote: > > > > It actually looks like it was buggy from day 1 there, unfortunately. The > > code is from april 2006 and used down_read() even then. > > I was thinking of my *original* patch from way back when. But that one > didn't actually do any of that stuff so no, it wasn't from there. Right, vmsplice() wasn't part of the original stuff. > > So can you apply Nicks patch > > I don't even have it, I only have a quoted-corrupted version of it. I > wasn't originally cc'd. Hmm, part of me doesn't like this patch, since we now end up beating on mmap_sem for each part of the vec. It's fine for a stable patch, but how about - prefaulting the iovec - using __get_user() - only dropping/regrabbing the lock if we have to fault ? -- Jens Axboe -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-09-28 19:30 ` Jens Axboe @ 2007-09-28 20:02 ` Linus Torvalds 2007-09-28 20:08 ` Linus Torvalds 2007-09-29 13:10 ` Nick Piggin 0 siblings, 2 replies; 21+ messages in thread From: Linus Torvalds @ 2007-09-28 20:02 UTC (permalink / raw) To: Jens Axboe; +Cc: Nick Piggin, Andrew Morton, Linux Memory Management List On Fri, 28 Sep 2007, Jens Axboe wrote: > > Hmm, part of me doesn't like this patch, since we now end up beating on > mmap_sem for each part of the vec. It's fine for a stable patch, but how > about > > - prefaulting the iovec > - using __get_user() > - only dropping/regrabbing the lock if we have to fault "__get_user()" doesn't help any. But we should do the same thing we do for generic_file_write(), or whatever - probe it while in an atomic region. So something like the appended might work. Untested. Linus --- fs/splice.c | 32 +++++++++++++++++++++----------- 1 files changed, 21 insertions(+), 11 deletions(-) diff --git a/fs/splice.c b/fs/splice.c index c010a72..07e880e 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1236,31 +1236,41 @@ static int get_iovec_page_array(const struct iovec __user *iov, { int buffers = 0, error = 0; - /* - * It's ok to take the mmap_sem for reading, even - * across a "get_user()". - */ down_read(¤t->mm->mmap_sem); while (nr_vecs) { unsigned long off, npages; + struct iovec entry; void __user *base; size_t len; int i; /* - * Get user address base and length for this iovec. + * We do not want to recursively take the mmap_sem semaphore + * on a page fault, since that could deadlock with a writer + * that comes in in the middle. So disable pagefaults, and + * do it the slow way if the copy fails.. */ - error = get_user(base, &iov->iov_base); - if (unlikely(error)) - break; - error = get_user(len, &iov->iov_len); - if (unlikely(error)) - break; + pagefault_disable(); + i = __copy_from_user_inatomic(&entry, iov, sizeof(entry)); + pagefault_enable(); + + if (unlikely(i)) { + up_read(¤t->mm->mmap_sem); + i = copy_from_user(&entry, iov, sizeof(entry)); + down_read(¤t->mm->mmap_sem); + error = -EFAULT; + if (i) + break; + } + + len = entry.iov_len; + base = entry.iov_base; /* * Sanity check this iovec. 0 read succeeds. */ + error = 0; if (unlikely(!len)) break; error = -EFAULT; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-09-28 20:02 ` Linus Torvalds @ 2007-09-28 20:08 ` Linus Torvalds 2007-09-29 6:37 ` Jens Axboe 2007-10-01 12:03 ` Jens Axboe 2007-09-29 13:10 ` Nick Piggin 1 sibling, 2 replies; 21+ messages in thread From: Linus Torvalds @ 2007-09-28 20:08 UTC (permalink / raw) To: Jens Axboe; +Cc: Nick Piggin, Andrew Morton, Linux Memory Management List On Fri, 28 Sep 2007, Linus Torvalds wrote: > > So something like the appended might work. Untested. Btw, it migth be cleaner to separate out this thing as a function of it's own, ie something like /* * Do a copy-from-user while holding the mmap_semaphore for reading */ int copy_from_user_mmap_sem(void *dst, const void __user *src, size_t n) { int partial; pagefault_disable(); partial = __copy_from_user_inatomic(dst, src, n); pagefault_enable(); if (!partial) return 0; up_read(¤t->mm->mmap_sem); partial = copy_from_user(dst, src, n); down_read(¤t->mm->mmap_sem); return partial ? -EFAULT : 0; } in case anybody else needs it. And even if nobody else does, making it a static inline function in fs/splice.c would at least separate out this thing from the core functionality, and just help keep things clear. Wanna test that thing? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-09-28 20:08 ` Linus Torvalds @ 2007-09-29 6:37 ` Jens Axboe 2007-10-01 12:03 ` Jens Axboe 1 sibling, 0 replies; 21+ messages in thread From: Jens Axboe @ 2007-09-29 6:37 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nick Piggin, Andrew Morton, Linux Memory Management List On Fri, Sep 28 2007, Linus Torvalds wrote: > > > On Fri, 28 Sep 2007, Linus Torvalds wrote: > > > > So something like the appended might work. Untested. > > Btw, it migth be cleaner to separate out this thing as a function of it's > own, ie something like > > /* > * Do a copy-from-user while holding the mmap_semaphore for reading > */ > int copy_from_user_mmap_sem(void *dst, const void __user *src, size_t n) > { > int partial; > > pagefault_disable(); > partial = __copy_from_user_inatomic(dst, src, n); > pagefault_enable(); > > if (!partial) > return 0; > up_read(¤t->mm->mmap_sem); > partial = copy_from_user(dst, src, n); > down_read(¤t->mm->mmap_sem); > > return partial ? -EFAULT : 0; > } > > in case anybody else needs it. And even if nobody else does, making it a > static inline function in fs/splice.c would at least separate out this > thing from the core functionality, and just help keep things clear. > > Wanna test that thing? Sure thing, I like this approach a lot more! I'll whip up a combo of this and the previous and give it some testing. Don't expect a tested patch before monday though, weekend fully booked... -- Jens Axboe -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-09-28 20:08 ` Linus Torvalds 2007-09-29 6:37 ` Jens Axboe @ 2007-10-01 12:03 ` Jens Axboe 2007-10-01 15:11 ` Linus Torvalds 1 sibling, 1 reply; 21+ messages in thread From: Jens Axboe @ 2007-10-01 12:03 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nick Piggin, Andrew Morton, Linux Memory Management List On Fri, Sep 28 2007, Linus Torvalds wrote: > > > On Fri, 28 Sep 2007, Linus Torvalds wrote: > > > > So something like the appended might work. Untested. > > Btw, it migth be cleaner to separate out this thing as a function of it's > own, ie something like > > /* > * Do a copy-from-user while holding the mmap_semaphore for reading > */ > int copy_from_user_mmap_sem(void *dst, const void __user *src, size_t n) > { > int partial; > > pagefault_disable(); > partial = __copy_from_user_inatomic(dst, src, n); > pagefault_enable(); > > if (!partial) > return 0; > up_read(¤t->mm->mmap_sem); > partial = copy_from_user(dst, src, n); > down_read(¤t->mm->mmap_sem); > > return partial ? -EFAULT : 0; > } > > in case anybody else needs it. And even if nobody else does, making it a > static inline function in fs/splice.c would at least separate out this > thing from the core functionality, and just help keep things clear. > > Wanna test that thing? OK, this is what I tested. It works for me. diff --git a/fs/splice.c b/fs/splice.c index c010a72..49b8107 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1224,6 +1224,33 @@ static long do_splice(struct file *in, loff_t __user *off_in, } /* + * Do a copy-from-user while holding the mmap_semaphore for reading. If we + * have to fault the user page in, we must drop the mmap_sem to avoid a + * deadlock in the page fault handling (it wants to grab mmap_sem too, but for + * writing). This assumes that we will very rarely hit the partial != 0 path, + * or this will not be a win. + */ +static int copy_from_user_mmap_sem(void *dst, const void __user *src, size_t n) +{ + int partial; + + pagefault_disable(); + partial = __copy_from_user_inatomic(dst, src, n); + pagefault_enable(); + + /* + * Didn't copy everything, drop the mmap_sem and do a faulting copy + */ + if (unlikely(partial)) { + up_read(¤t->mm->mmap_sem); + partial = copy_from_user(dst, src, n); + down_read(¤t->mm->mmap_sem); + } + + return partial; +} + +/* * Map an iov into an array of pages and offset/length tupples. With the * partial_page structure, we can map several non-contiguous ranges into * our ones pages[] map instead of splitting that operation into pieces. @@ -1236,31 +1263,26 @@ static int get_iovec_page_array(const struct iovec __user *iov, { int buffers = 0, error = 0; - /* - * It's ok to take the mmap_sem for reading, even - * across a "get_user()". - */ down_read(¤t->mm->mmap_sem); while (nr_vecs) { unsigned long off, npages; + struct iovec entry; void __user *base; size_t len; int i; - /* - * Get user address base and length for this iovec. - */ - error = get_user(base, &iov->iov_base); - if (unlikely(error)) - break; - error = get_user(len, &iov->iov_len); - if (unlikely(error)) + error = -EFAULT; + if (copy_from_user_mmap_sem(&entry, iov, sizeof(entry))) break; + base = entry.iov_base; + len = entry.iov_len; + /* * Sanity check this iovec. 0 read succeeds. */ + error = 0; if (unlikely(!len)) break; error = -EFAULT; -- Jens Axboe -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-10-01 12:03 ` Jens Axboe @ 2007-10-01 15:11 ` Linus Torvalds 2007-10-01 15:45 ` Balbir Singh 2007-10-01 17:33 ` Jens Axboe 0 siblings, 2 replies; 21+ messages in thread From: Linus Torvalds @ 2007-10-01 15:11 UTC (permalink / raw) To: Jens Axboe; +Cc: Nick Piggin, Andrew Morton, Linux Memory Management List The comment is wrong. On Mon, 1 Oct 2007, Jens Axboe wrote: > > /* > + * Do a copy-from-user while holding the mmap_semaphore for reading. If we > + * have to fault the user page in, we must drop the mmap_sem to avoid a > + * deadlock in the page fault handling (it wants to grab mmap_sem too, but for > + * writing). This assumes that we will very rarely hit the partial != 0 path, > + * or this will not be a win. > + */ Page faulting only grabs it for reading, and having a page fault happen is not problematic in itself. Readers *do* nest. What is problematic is: thread#1 thread#2 get_iovec_page_array down_read() .. everything ok so far .. mmap() down_write() .. correctly blocks on the reader .. .. everything ok so far .. .. pagefault .. down_read() .. fairness code now blocks on the waiting writer! .. .. oops. We're deadlocked .. So the problem is that while readers do nest nicely, they only do so if no potential writers can possibly exist (which of course never happens: an rwlock with no writers is a no-op ;). Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-10-01 15:11 ` Linus Torvalds @ 2007-10-01 15:45 ` Balbir Singh 2007-10-01 16:11 ` Linus Torvalds 2007-10-01 17:33 ` Jens Axboe 1 sibling, 1 reply; 21+ messages in thread From: Balbir Singh @ 2007-10-01 15:45 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Nick Piggin, Andrew Morton, Linux Memory Management List Linus Torvalds wrote: > The comment is wrong. > > On Mon, 1 Oct 2007, Jens Axboe wrote: >> >> /* >> + * Do a copy-from-user while holding the mmap_semaphore for reading. If we >> + * have to fault the user page in, we must drop the mmap_sem to avoid a >> + * deadlock in the page fault handling (it wants to grab mmap_sem too, but for >> + * writing). This assumes that we will very rarely hit the partial != 0 path, >> + * or this will not be a win. >> + */ > > Page faulting only grabs it for reading, and having a page fault happen is > not problematic in itself. Readers *do* nest. > > What is problematic is: > > thread#1 thread#2 > > get_iovec_page_array > down_read() > .. everything ok so far .. > mmap() > down_write() > .. correctly blocks on the reader .. > .. everything ok so far .. > > .. pagefault .. > down_read() > .. fairness code now blocks on the waiting writer! .. > .. oops. We're deadlocked .. > > So the problem is that while readers do nest nicely, they only do so if no > potential writers can possibly exist (which of course never happens: an > rwlock with no writers is a no-op ;). Sounds very similar to the problems we had with CPU hotplug earlier. It's a rwlock locking anti-pattern. I know that recursive locks have been frowned upon earlier, but I wonder if there is a case here. Of-course recursive locks would not be *fair*. The other solution of passing down lock ownership information is a pain. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-10-01 15:45 ` Balbir Singh @ 2007-10-01 16:11 ` Linus Torvalds 2007-10-01 18:19 ` Balbir Singh 0 siblings, 1 reply; 21+ messages in thread From: Linus Torvalds @ 2007-10-01 16:11 UTC (permalink / raw) To: Balbir Singh Cc: Jens Axboe, Nick Piggin, Andrew Morton, Linux Memory Management List On Mon, 1 Oct 2007, Balbir Singh wrote: > > Sounds very similar to the problems we had with CPU hotplug earlier. > It's a rwlock locking anti-pattern. I know that recursive locks > have been frowned upon earlier, but I wonder if there is a case here. > Of-course recursive locks would not be *fair*. The problem with recursive locks is that they are inevitably done wrong. For example, the "natural" way to do them is to just save the process ID or something like that. Which is utter crap. Yet, people do it *every* single time (yes, I've done it too, I admit). The thing is, "recursive" doesn't mean "same CPU" or "same process" or "same thread" or anything like that. It means "same *dependency-chain*". With the very real implication that you literally have to pass the "lock instance" (whether that is a cookie or anything else) around, and thus really generate the proper chain. For example, in CPU hotplug, the dependency chain really did end up moving between different execution contexts, iirc (eg from process context into kernel workqueues). So we could add some kind of recursive interface that maintained a list of ownership or whatever, but the fact remains that after 16 years, we still haven't really needed it, except for code that is so ugly and broken that pretty much everybody really feels it should be rewritten (and generally for *other* reasons) anyway. So I'm not categorically against nesting, but I'm certainly down on it, and I think it's almost always done wrong. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-10-01 16:11 ` Linus Torvalds @ 2007-10-01 18:19 ` Balbir Singh 0 siblings, 0 replies; 21+ messages in thread From: Balbir Singh @ 2007-10-01 18:19 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Nick Piggin, Andrew Morton, Linux Memory Management List Linus Torvalds wrote: > > On Mon, 1 Oct 2007, Balbir Singh wrote: >> Sounds very similar to the problems we had with CPU hotplug earlier. >> It's a rwlock locking anti-pattern. I know that recursive locks >> have been frowned upon earlier, but I wonder if there is a case here. >> Of-course recursive locks would not be *fair*. > > The problem with recursive locks is that they are inevitably done wrong. > > For example, the "natural" way to do them is to just save the process ID > or something like that. Which is utter crap. Yet, people do it *every* > single time (yes, I've done it too, I admit). > Yes, I've done that too, I guess we learn what's bad by doing it. > The thing is, "recursive" doesn't mean "same CPU" or "same process" or > "same thread" or anything like that. It means "same *dependency-chain*". > With the very real implication that you literally have to pass the "lock > instance" (whether that is a cookie or anything else) around, and thus > really generate the proper chain. > > For example, in CPU hotplug, the dependency chain really did end up moving > between different execution contexts, iirc (eg from process context into > kernel workqueues). > I agree with whatever you've said so far. The original intention of every lock is to protect data not code. In the example mentioned before and in the case of CPU hotplug, what we intend to do, is to prevent the writer from causing a deadlock. We create a deadlock, as a side-effect of the locking is fair. I guess what we might need is a variant of unfair locks. Tracking owners is one way of eliminating deadlocks that occur, specially since the reader-writer lock does not allow readers to provide fair locking. I think this is where RCU excels, it allows readers/writers to proceed almost independently. > So we could add some kind of recursive interface that maintained a list of > ownership or whatever, but the fact remains that after 16 years, we still > haven't really needed it, except for code that is so ugly and broken that > pretty much everybody really feels it should be rewritten (and generally > for *other* reasons) anyway. > Recursive mutexes are meant for a special purpose. Uncontrolled use of recursive locks would be really bad. > So I'm not categorically against nesting, but I'm certainly down on it, > and I think it's almost always done wrong. > Yes, recursive locking needs to be designed with care and usage reviewed with a lot of attention. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-10-01 15:11 ` Linus Torvalds 2007-10-01 15:45 ` Balbir Singh @ 2007-10-01 17:33 ` Jens Axboe 1 sibling, 0 replies; 21+ messages in thread From: Jens Axboe @ 2007-10-01 17:33 UTC (permalink / raw) To: Linus Torvalds; +Cc: Nick Piggin, Andrew Morton, Linux Memory Management List On Mon, Oct 01 2007, Linus Torvalds wrote: > > The comment is wrong. > > On Mon, 1 Oct 2007, Jens Axboe wrote: > > > > /* > > + * Do a copy-from-user while holding the mmap_semaphore for reading. If we > > + * have to fault the user page in, we must drop the mmap_sem to avoid a > > + * deadlock in the page fault handling (it wants to grab mmap_sem too, but for > > + * writing). This assumes that we will very rarely hit the partial != 0 path, > > + * or this will not be a win. > > + */ > > Page faulting only grabs it for reading, and having a page fault happen is > not problematic in itself. Readers *do* nest. > > What is problematic is: > > thread#1 thread#2 > > get_iovec_page_array > down_read() > .. everything ok so far .. > mmap() > down_write() > .. correctly blocks on the reader .. > .. everything ok so far .. > > .. pagefault .. > down_read() > .. fairness code now blocks on the waiting writer! .. > .. oops. We're deadlocked .. > > So the problem is that while readers do nest nicely, they only do so if no > potential writers can possibly exist (which of course never happens: an > rwlock with no writers is a no-op ;). Ah, I didn't read the explanation well enough it seems. Better? diff --git a/fs/splice.c b/fs/splice.c index c010a72..e95a362 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1224,6 +1224,33 @@ static long do_splice(struct file *in, loff_t __user *off_in, } /* + * Do a copy-from-user while holding the mmap_semaphore for reading, in a + * manner safe from deadlocking with simultaneous mmap() (grabbing mmap_sem + * for writing) and page faulting on the user memory pointed to by src. + * This assumes that we will very rarely hit the partial != 0 path, or this + * will not be a win. + */ +static int copy_from_user_mmap_sem(void *dst, const void __user *src, size_t n) +{ + int partial; + + pagefault_disable(); + partial = __copy_from_user_inatomic(dst, src, n); + pagefault_enable(); + + /* + * Didn't copy everything, drop the mmap_sem and do a faulting copy + */ + if (unlikely(partial)) { + up_read(¤t->mm->mmap_sem); + partial = copy_from_user(dst, src, n); + down_read(¤t->mm->mmap_sem); + } + + return partial; +} + +/* * Map an iov into an array of pages and offset/length tupples. With the * partial_page structure, we can map several non-contiguous ranges into * our ones pages[] map instead of splitting that operation into pieces. @@ -1236,31 +1263,26 @@ static int get_iovec_page_array(const struct iovec __user *iov, { int buffers = 0, error = 0; - /* - * It's ok to take the mmap_sem for reading, even - * across a "get_user()". - */ down_read(¤t->mm->mmap_sem); while (nr_vecs) { unsigned long off, npages; + struct iovec entry; void __user *base; size_t len; int i; - /* - * Get user address base and length for this iovec. - */ - error = get_user(base, &iov->iov_base); - if (unlikely(error)) - break; - error = get_user(len, &iov->iov_len); - if (unlikely(error)) + error = -EFAULT; + if (copy_from_user_mmap_sem(&entry, iov, sizeof(entry))) break; + base = entry.iov_base; + len = entry.iov_len; + /* * Sanity check this iovec. 0 read succeeds. */ + error = 0; if (unlikely(!len)) break; error = -EFAULT; -- Jens Axboe -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-09-28 20:02 ` Linus Torvalds 2007-09-28 20:08 ` Linus Torvalds @ 2007-09-29 13:10 ` Nick Piggin 2007-09-30 6:46 ` Jens Axboe 1 sibling, 1 reply; 21+ messages in thread From: Nick Piggin @ 2007-09-29 13:10 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jens Axboe, Andrew Morton, Linux Memory Management List On Fri, Sep 28, 2007 at 01:02:50PM -0700, Linus Torvalds wrote: > > > On Fri, 28 Sep 2007, Jens Axboe wrote: > > > > Hmm, part of me doesn't like this patch, since we now end up beating on > > mmap_sem for each part of the vec. It's fine for a stable patch, but how > > about > > > > - prefaulting the iovec > > - using __get_user() > > - only dropping/regrabbing the lock if we have to fault > > "__get_user()" doesn't help any. But we should do the same thing we do for > generic_file_write(), or whatever - probe it while in an atomic region. > > So something like the appended might work. Untested. I got an idea for getting rid of mmap_sem from here completely. Which is why I was looking at these callers in the first place. It would be really convenient and help me play with the idea if mmap_sem is wrapped closely around get_user_pages where possible... If you're really worried about mmap_sem batching here, can you just avoid this complexity and do all the get_user()s up-front, before taking mmap_sem at all? You only have to save PIPE_BUFFERS number of them. > > Linus > --- > fs/splice.c | 32 +++++++++++++++++++++----------- > 1 files changed, 21 insertions(+), 11 deletions(-) > > diff --git a/fs/splice.c b/fs/splice.c > index c010a72..07e880e 100644 > --- a/fs/splice.c > +++ b/fs/splice.c > @@ -1236,31 +1236,41 @@ static int get_iovec_page_array(const struct iovec __user *iov, > { > int buffers = 0, error = 0; > > - /* > - * It's ok to take the mmap_sem for reading, even > - * across a "get_user()". > - */ > down_read(¤t->mm->mmap_sem); > > while (nr_vecs) { > unsigned long off, npages; > + struct iovec entry; > void __user *base; > size_t len; > int i; > > /* > - * Get user address base and length for this iovec. > + * We do not want to recursively take the mmap_sem semaphore > + * on a page fault, since that could deadlock with a writer > + * that comes in in the middle. So disable pagefaults, and > + * do it the slow way if the copy fails.. > */ > - error = get_user(base, &iov->iov_base); > - if (unlikely(error)) > - break; > - error = get_user(len, &iov->iov_len); > - if (unlikely(error)) > - break; > + pagefault_disable(); > + i = __copy_from_user_inatomic(&entry, iov, sizeof(entry)); > + pagefault_enable(); > + > + if (unlikely(i)) { > + up_read(¤t->mm->mmap_sem); > + i = copy_from_user(&entry, iov, sizeof(entry)); > + down_read(¤t->mm->mmap_sem); > + error = -EFAULT; > + if (i) > + break; > + } > + > + len = entry.iov_len; > + base = entry.iov_base; > > /* > * Sanity check this iovec. 0 read succeeds. > */ > + error = 0; > if (unlikely(!len)) > break; > error = -EFAULT; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-09-29 13:10 ` Nick Piggin @ 2007-09-30 6:46 ` Jens Axboe 2007-09-30 12:07 ` Nick Piggin 0 siblings, 1 reply; 21+ messages in thread From: Jens Axboe @ 2007-09-30 6:46 UTC (permalink / raw) To: Nick Piggin; +Cc: Linus Torvalds, Andrew Morton, Linux Memory Management List On Sat, Sep 29 2007, Nick Piggin wrote: > On Fri, Sep 28, 2007 at 01:02:50PM -0700, Linus Torvalds wrote: > > > > > > On Fri, 28 Sep 2007, Jens Axboe wrote: > > > > > > Hmm, part of me doesn't like this patch, since we now end up beating on > > > mmap_sem for each part of the vec. It's fine for a stable patch, but how > > > about > > > > > > - prefaulting the iovec > > > - using __get_user() > > > - only dropping/regrabbing the lock if we have to fault > > > > "__get_user()" doesn't help any. But we should do the same thing we do for > > generic_file_write(), or whatever - probe it while in an atomic region. > > > > So something like the appended might work. Untested. > > I got an idea for getting rid of mmap_sem from here completely. Which > is why I was looking at these callers in the first place. > > It would be really convenient and help me play with the idea if mmap_sem > is wrapped closely around get_user_pages where possible... Well, move it back there in your first patch? Not a big deal, surely :-) > If you're really worried about mmap_sem batching here, can you just > avoid this complexity and do all the get_user()s up-front, before taking > mmap_sem at all? You only have to save PIPE_BUFFERS number of > them. Sure, that is easily doable at the cost of some stack. I have other patches that grow PIPE_BUFFERS dynamically in the pipeline, so I'd prefer not to since that'll then turn into a dynamic allocation. -- Jens Axboe -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-09-30 6:46 ` Jens Axboe @ 2007-09-30 12:07 ` Nick Piggin 2007-09-30 20:05 ` Jens Axboe 0 siblings, 1 reply; 21+ messages in thread From: Nick Piggin @ 2007-09-30 12:07 UTC (permalink / raw) To: Jens Axboe; +Cc: Linus Torvalds, Andrew Morton, Linux Memory Management List On Sun, Sep 30, 2007 at 08:46:46AM +0200, Jens Axboe wrote: > On Sat, Sep 29 2007, Nick Piggin wrote: > > On Fri, Sep 28, 2007 at 01:02:50PM -0700, Linus Torvalds wrote: > > > > > > > > > On Fri, 28 Sep 2007, Jens Axboe wrote: > > > > > > > > Hmm, part of me doesn't like this patch, since we now end up beating on > > > > mmap_sem for each part of the vec. It's fine for a stable patch, but how > > > > about > > > > > > > > - prefaulting the iovec > > > > - using __get_user() > > > > - only dropping/regrabbing the lock if we have to fault > > > > > > "__get_user()" doesn't help any. But we should do the same thing we do for > > > generic_file_write(), or whatever - probe it while in an atomic region. > > > > > > So something like the appended might work. Untested. > > > > I got an idea for getting rid of mmap_sem from here completely. Which > > is why I was looking at these callers in the first place. > > > > It would be really convenient and help me play with the idea if mmap_sem > > is wrapped closely around get_user_pages where possible... > > Well, move it back there in your first patch? Not a big deal, surely :-) > > > If you're really worried about mmap_sem batching here, can you just > > avoid this complexity and do all the get_user()s up-front, before taking > > mmap_sem at all? You only have to save PIPE_BUFFERS number of > > them. > > Sure, that is easily doable at the cost of some stack. I have other > patches that grow PIPE_BUFFERS dynamically in the pipeline, so I'd > prefer not to since that'll then turn into a dynamic allocation. You already have much more PIPE_BUFFERS stuff on the stack. If it gets much bigger, you should dynamically allocate all this anyway, no? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-09-30 12:07 ` Nick Piggin @ 2007-09-30 20:05 ` Jens Axboe 2007-09-30 20:12 ` Nick Piggin 0 siblings, 1 reply; 21+ messages in thread From: Jens Axboe @ 2007-09-30 20:05 UTC (permalink / raw) To: Nick Piggin; +Cc: Linus Torvalds, Andrew Morton, Linux Memory Management List On Sun, Sep 30 2007, Nick Piggin wrote: > On Sun, Sep 30, 2007 at 08:46:46AM +0200, Jens Axboe wrote: > > On Sat, Sep 29 2007, Nick Piggin wrote: > > > On Fri, Sep 28, 2007 at 01:02:50PM -0700, Linus Torvalds wrote: > > > > > > > > > > > > On Fri, 28 Sep 2007, Jens Axboe wrote: > > > > > > > > > > Hmm, part of me doesn't like this patch, since we now end up beating on > > > > > mmap_sem for each part of the vec. It's fine for a stable patch, but how > > > > > about > > > > > > > > > > - prefaulting the iovec > > > > > - using __get_user() > > > > > - only dropping/regrabbing the lock if we have to fault > > > > > > > > "__get_user()" doesn't help any. But we should do the same thing we do for > > > > generic_file_write(), or whatever - probe it while in an atomic region. > > > > > > > > So something like the appended might work. Untested. > > > > > > I got an idea for getting rid of mmap_sem from here completely. Which > > > is why I was looking at these callers in the first place. > > > > > > It would be really convenient and help me play with the idea if mmap_sem > > > is wrapped closely around get_user_pages where possible... > > > > Well, move it back there in your first patch? Not a big deal, surely :-) > > > > > If you're really worried about mmap_sem batching here, can you just > > > avoid this complexity and do all the get_user()s up-front, before taking > > > mmap_sem at all? You only have to save PIPE_BUFFERS number of > > > them. > > > > Sure, that is easily doable at the cost of some stack. I have other > > patches that grow PIPE_BUFFERS dynamically in the pipeline, so I'd > > prefer not to since that'll then turn into a dynamic allocation. > > You already have much more PIPE_BUFFERS stuff on the stack. If it > gets much bigger, you should dynamically allocate all this anyway, no? Yep, but then it's one more item that has to be dynamically allocated. -- Jens Axboe -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-09-30 20:05 ` Jens Axboe @ 2007-09-30 20:12 ` Nick Piggin 0 siblings, 0 replies; 21+ messages in thread From: Nick Piggin @ 2007-09-30 20:12 UTC (permalink / raw) To: Jens Axboe; +Cc: Linus Torvalds, Andrew Morton, Linux Memory Management List On Sun, Sep 30, 2007 at 10:05:52PM +0200, Jens Axboe wrote: > On Sun, Sep 30 2007, Nick Piggin wrote: > > > > You already have much more PIPE_BUFFERS stuff on the stack. If it > > gets much bigger, you should dynamically allocate all this anyway, no? > > Yep, but then it's one more item that has to be dynamically allocated. Just have a struct for a tuple of itmes for each pipe buf to allocate and you'll never notice ;) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [patch] splice mmap_sem deadlock 2007-09-28 18:10 ` Linus Torvalds 2007-09-28 18:15 ` Jens Axboe @ 2007-09-29 13:08 ` Nick Piggin 1 sibling, 0 replies; 21+ messages in thread From: Nick Piggin @ 2007-09-29 13:08 UTC (permalink / raw) To: Linus Torvalds; +Cc: Jens Axboe, Andrew Morton, Linux Memory Management List On Fri, Sep 28, 2007 at 11:10:34AM -0700, Linus Torvalds wrote: > > > On Fri, 28 Sep 2007, Jens Axboe wrote: > > > > It does looks suspicious. It was actually Linus who originally suggested > > this approach and wrote that comment - Linus? > > Well, it used to be true, long time ago. It probably was still true in the > original splice patch back way back when. But yeah, rwsemaphores broke > that (very useful) trick in the name of fairness ;( You might be thinking about rwlocks as well, where it still is true AFAIK. The rwsem fairness I think is pretty critical for it's most prominent consumer (ie. mmap_sem), because read sides can be very long and frequent with a lot of concurrency (it would be easy to starve the write side). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2007-10-01 18:20 UTC | newest] Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2007-09-28 16:00 [patch] splice mmap_sem deadlock Nick Piggin 2007-09-28 17:31 ` Jens Axboe 2007-09-28 18:10 ` Linus Torvalds 2007-09-28 18:15 ` Jens Axboe 2007-09-28 18:23 ` Linus Torvalds 2007-09-28 19:30 ` Jens Axboe 2007-09-28 20:02 ` Linus Torvalds 2007-09-28 20:08 ` Linus Torvalds 2007-09-29 6:37 ` Jens Axboe 2007-10-01 12:03 ` Jens Axboe 2007-10-01 15:11 ` Linus Torvalds 2007-10-01 15:45 ` Balbir Singh 2007-10-01 16:11 ` Linus Torvalds 2007-10-01 18:19 ` Balbir Singh 2007-10-01 17:33 ` Jens Axboe 2007-09-29 13:10 ` Nick Piggin 2007-09-30 6:46 ` Jens Axboe 2007-09-30 12:07 ` Nick Piggin 2007-09-30 20:05 ` Jens Axboe 2007-09-30 20:12 ` Nick Piggin 2007-09-29 13:08 ` Nick Piggin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox