* [rfc] buffered write deadlock fix
@ 2006-10-13 16:43 Nick Piggin
From: Nick Piggin @ 2006-10-13 16:43 UTC
To: Linux Memory Management
Cc: Neil Brown, Andrew Morton, Anton Altaparmakov, Chris Mason,
Linux Kernel, Nick Piggin
The following set of patches attempts to fix the buffered write
locking problems.

While looking at this deadlock, it became apparent that there are
several others which are equally bad or worse. It would be very
good to fix these.

I ceased to be a mere admirer of this problem when it stopped my
pagefault vs invalidate race fix from being merged!
Review and comments would be very welcome. Test it only if you don't
value your data. I realise all filesystem developers are busy
solving the 10TB fsck problem right now, but please take a
minute to look at the fs/ changes, and also to ensure your
filesystem's prepare_write and commit_write handlers aren't broken
(a sketch of the kind of bug to look for is included below).
Sorry for the shotgun mail. It is your fault for ever being
mentioned in the same email as the buffered write deadlock ;)
Thanks,
Nick
--
SuSE Labs
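
For anyone auditing their handlers: the pattern to look for is roughly
the following (a minimal, hypothetical sketch, not taken from any real
filesystem):

	/*
	 * Hypothetical example of the prepare_write bug discussed in
	 * this series: the page is marked uptodate before the caller
	 * has copied the user data into [from, to).
	 */
	static int broken_prepare_write(struct file *file, struct page *page,
					unsigned from, unsigned to)
	{
		void *kaddr = kmap_atomic(page, KM_USER0);

		/* zero only the parts the write will not cover */
		memset(kaddr, 0, from);
		memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
		kunmap_atomic(kaddr, KM_USER0);

		/*
		 * BUG: [from, to) still contains stale data here, and
		 * readers do not take the page lock, so once the copy
		 * from userspace can fail (or simply hasn't happened
		 * yet) they can observe uninitialised data.
		 */
		SetPageUptodate(page);
		return 0;
	}

The safe pattern is to defer SetPageUptodate() to commit_write(), after
the copy from userspace has actually been done.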
* [patch 1/6] mm: revert "generic_file_buffered_write(): handle zero length iovec segments"
From: Nick Piggin, Andrew Morton @ 2006-10-13 16:44 UTC
To: Linux Memory Management
Cc: Neil Brown, Andrew Morton, Anton Altaparmakov, Chris Mason,
    Linux Kernel, Nick Piggin

Revert 81b0c8713385ce1b1b9058e916edcf9561ad76d6. This was a bugfix
against 6527c2bdf1f833cc18e8f42bd97973d583e4aa83, which we also revert.

Signed-off-by: Andrew Morton <akpm@osdl.org>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1912,12 +1912,6 @@ generic_file_buffered_write(struct kiocb
 			break;
 		}
 
-		if (unlikely(bytes == 0)) {
-			status = 0;
-			copied = 0;
-			goto zero_length_segment;
-		}
-
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status)) {
 			loff_t isize = i_size_read(inode);
@@ -1947,8 +1941,7 @@ generic_file_buffered_write(struct kiocb
 			page_cache_release(page);
 			continue;
 		}
-zero_length_segment:
-		if (likely(copied >= 0)) {
+		if (likely(copied > 0)) {
 			if (!status)
 				status = copied;
 
Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -87,7 +87,7 @@ filemap_set_next_iovec(const struct iove
 	const struct iovec *iov = *iovp;
 	size_t base = *basep;
 
-	do {
+	while (bytes) {
 		int copy = min(bytes, iov->iov_len - base);
 
 		bytes -= copy;
@@ -96,7 +96,7 @@ filemap_set_next_iovec(const struct iove
 			iov++;
 			base = 0;
 		}
-	} while (bytes);
+	}
 	*iovp = iov;
 	*basep = base;
 }
* [patch 2/6] mm: revert "generic_file_buffered_write(): deadlock on vectored write"
From: Nick Piggin, Andrew Morton @ 2006-10-13 16:44 UTC
To: Linux Memory Management
Cc: Neil Brown, Andrew Morton, Anton Altaparmakov, Chris Mason,
    Linux Kernel, Nick Piggin

Revert 6527c2bdf1f833cc18e8f42bd97973d583e4aa83.

This patch fixed the following bug:

  When prefaulting in the pages in generic_file_buffered_write(), we only
  faulted in the pages for the first segment of the iovec.  If a second or
  subsequent segment described an mmapping of the page into which we're
  write()ing, and that page is not up-to-date, the fault handler tries to
  lock the already-locked page (to bring it up to date) and deadlocks.

  An exploit for this bug is in writev-deadlock-demo.c, in
  http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.
  (These demos assume blocksize < PAGE_CACHE_SIZE).

The problem with this fix is that it takes the kernel back to doing a
single prepare_write()/commit_write() per iovec segment.  So in the worst
case we'll run prepare_write+commit_write 1024 times where we previously
would have run it once.  The other problem with the fix is that it didn't
fix all the locking problems.

<insert numbers obtained via ext3-tools's writev-speed.c here>

And apparently this change killed NFS overwrite performance, because, I
suppose, it talks to the server for each prepare_write+commit_write.

So just back that patch out - we'll be fixing the deadlock by other means.

Signed-off-by: Andrew Morton <akpm@osdl.org>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1882,21 +1882,14 @@ generic_file_buffered_write(struct kiocb
 	do {
 		unsigned long index;
 		unsigned long offset;
+		unsigned long maxlen;
 		size_t copied;
 
 		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
-
-		/* Limit the size of the copy to the caller's write size */
-		bytes = min(bytes, count);
-
-		/*
-		 * Limit the size of the copy to that of the current segment,
-		 * because fault_in_pages_readable() doesn't know how to walk
-		 * segments.
-		 */
-		bytes = min(bytes, cur_iov->iov_len - iov_base);
+		if (bytes > count)
+			bytes = count;
 
 		/*
 		 * Bring in the user page that we will copy from _first_.
@@ -1904,7 +1897,10 @@ generic_file_buffered_write(struct kiocb
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		fault_in_pages_readable(buf, bytes);
+		maxlen = cur_iov->iov_len - iov_base;
+		if (maxlen > bytes)
+			maxlen = bytes;
+		fault_in_pages_readable(buf, maxlen);
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
 		if (!page) {
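
For illustration only: fault_in_pages_readable() touches just the first
and last byte of a single user buffer, which is why it cannot help with
later segments. A hypothetical variant that walked a whole iovec might
look like this (a sketch, never part of the kernel):

	/*
	 * Hypothetical sketch: prefault every segment of an iovec,
	 * not just the current one. Assumes the usual helper from
	 * <linux/pagemap.h>.
	 */
	static void fault_in_iovec_readable(const struct iovec *iov,
					    unsigned long nr_segs)
	{
		unsigned long seg;

		for (seg = 0; seg < nr_segs; seg++)
			fault_in_pages_readable(
				(const char __user *)iov[seg].iov_base,
				iov[seg].iov_len);
	}

Of course, a prefaulted page can be reclaimed or unmapped again before
the copy runs, so this would only shrink the race window, not close it;
that is why the series ends up using atomic usercopies instead.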
* [patch 3/6] mm: generic_file_buffered_write cleanup
From: Nick Piggin, Andrew Morton @ 2006-10-13 16:44 UTC
To: Linux Memory Management
Cc: Neil Brown, Andrew Morton, Anton Altaparmakov, Chris Mason,
    Linux Kernel, Nick Piggin

Signed-off-by: Andrew Morton <akpm@osdl.org>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1855,16 +1855,15 @@ generic_file_buffered_write(struct kiocb
 		size_t count, ssize_t written)
 {
 	struct file *file = iocb->ki_filp;
-	struct address_space * mapping = file->f_mapping;
+	struct address_space *mapping = file->f_mapping;
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	struct inode *inode = mapping->host;
 	long status = 0;
 	struct page *page;
 	struct page *cached_page = NULL;
-	size_t bytes;
 	struct pagevec lru_pvec;
 	const struct iovec *cur_iov = iov;	/* current iovec */
-	size_t iov_base = 0;			/* offset in the current iovec */
+	size_t iov_offset = 0;			/* offset in the current iovec */
 	char __user *buf;
 
 	pagevec_init(&lru_pvec, 0);
@@ -1875,31 +1874,33 @@ generic_file_buffered_write(struct kiocb
 	if (likely(nr_segs == 1))
 		buf = iov->iov_base + written;
 	else {
-		filemap_set_next_iovec(&cur_iov, &iov_base, written);
-		buf = cur_iov->iov_base + iov_base;
+		filemap_set_next_iovec(&cur_iov, &iov_offset, written);
+		buf = cur_iov->iov_base + iov_offset;
 	}
 
 	do {
-		unsigned long index;
-		unsigned long offset;
-		unsigned long maxlen;
-		size_t copied;
+		pgoff_t index;		/* Pagecache index for current page */
+		unsigned long offset;	/* Offset into pagecache page */
+		unsigned long maxlen;	/* Bytes remaining in current iovec */
+		size_t bytes;		/* Bytes to write to page */
+		size_t copied;		/* Bytes copied from user */
 
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
+		offset = (pos & (PAGE_CACHE_SIZE - 1));
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
 		if (bytes > count)
 			bytes = count;
 
+		maxlen = cur_iov->iov_len - iov_offset;
+		if (maxlen > bytes)
+			maxlen = bytes;
+
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
		 */
-		maxlen = cur_iov->iov_len - iov_base;
-		if (maxlen > bytes)
-			maxlen = bytes;
 		fault_in_pages_readable(buf, maxlen);
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
@@ -1930,7 +1931,7 @@ generic_file_buffered_write(struct kiocb
 					buf, bytes);
 		else
 			copied = filemap_copy_from_user_iovec(page, offset,
-						cur_iov, iov_base, bytes);
+						cur_iov, iov_offset, bytes);
 		flush_dcache_page(page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (status == AOP_TRUNCATED_PAGE) {
@@ -1948,12 +1949,12 @@ generic_file_buffered_write(struct kiocb
 				buf += status;
 				if (unlikely(nr_segs > 1)) {
 					filemap_set_next_iovec(&cur_iov,
-							&iov_base, status);
+							&iov_offset, status);
 					if (count)
 						buf = cur_iov->iov_base +
-							iov_base;
+							iov_offset;
 				} else {
-					iov_base += status;
+					iov_offset += status;
 				}
 			}
 		}
* [patch 4/6] mm: comment mmap_sem / lock_page lockorder
From: Nick Piggin @ 2006-10-13 16:44 UTC
To: Linux Memory Management
Cc: Neil Brown, Andrew Morton, Anton Altaparmakov, Chris Mason,
    Linux Kernel, Nick Piggin

Add a few more examples to the mmap_sem / lock_page ordering.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -73,7 +73,7 @@ generic_file_direct_IO(int rw, struct ki
  *    ->mapping->tree_lock	(arch-dependent flush_dcache_mmap_lock)
  *
  *  ->mmap_sem
- *    ->lock_page		(access_process_vm)
+ *    ->lock_page		(page fault, sys_mmap, access_process_vm)
  *
  *  ->mmap_sem
  *    ->i_mutex			(msync)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -29,7 +29,7 @@
  *       taken together; in truncation, i_mutex is taken outermost.
  *
  * mm->mmap_sem
- *   page->flags PG_locked (lock_page)
+ *   page->flags PG_locked (lock_page, eg from pagefault)
  *     mapping->i_mmap_lock
  *       anon_vma->lock
  *         mm->page_table_lock or pte_lock
* [patch 5/6] mm: debug write deadlocks
From: Nick Piggin @ 2006-10-13 16:44 UTC
To: Linux Memory Management
Cc: Neil Brown, Andrew Morton, Anton Altaparmakov, Chris Mason,
    Linux Kernel, Nick Piggin

Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate
the difficult race where the page may be unmapped before calling
copy_from_user. Makes the race much easier to hit.

This probably needn't go upstream.

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1895,6 +1895,7 @@ generic_file_buffered_write(struct kiocb
 		if (maxlen > bytes)
 			maxlen = bytes;
 
+#ifndef CONFIG_DEBUG_VM
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
@@ -1902,6 +1903,7 @@ generic_file_buffered_write(struct kiocb
 		 * up-to-date.
 		 */
 		fault_in_pages_readable(buf, maxlen);
+#endif
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
 		if (!page) {
* [patch 6/6] mm: fix pagecache write deadlocks
From: Nick Piggin, Andrew Morton @ 2006-10-13 16:44 UTC
To: Linux Memory Management
Cc: Neil Brown, Andrew Morton, Anton Altaparmakov, Chris Mason,
    Linux Kernel, Nick Piggin

From: Andrew Morton <akpm@osdl.org> and Nick Piggin <npiggin@suse.de>

The idea is to modify the core write() code so that it won't take a
pagefault while holding a lock on the pagecache page. There are a number
of different deadlocks possible if we try to do such a thing:

1. generic_buffered_write
2. lock_page
3. prepare_write
4. unlock_page+vmtruncate
5. copy_from_user
6. mmap_sem(r)
7. handle_mm_fault
8. lock_page (filemap_nopage)
9. commit_write
10. unlock_page

b. sys_munmap / sys_mlock / others
c. mmap_sem(w)
d. make_pages_present
e. get_user_pages
f. handle_mm_fault
g. lock_page (filemap_nopage)

2,8 - recursive deadlock if page is same
2,8;2,7 - ABBA deadlock if page is different
2,6;c,g - ABBA deadlock if page is same

- Instead of copy_from_user(), use inc_preempt_count() and
  copy_from_user_inatomic().

- If the copy_from_user_inatomic() hits a pagefault, it'll return a short
  copy.

- If the page was not uptodate, we cannot commit the write, because the
  uncopied bit could have uninitialised data. Commit a zero length copy,
  which should do the right thing (ie. not set the page uptodate).

- If the page was uptodate, commit the copied portion so we make some
  progress.

- unlock_page()

- Go back and try to fault the page in again, redo the lock_page,
  prepare_write, copy_from_user_inatomic(), etc.

- Now, if we have a non uptodate page and we keep faulting on a 2nd or
  later iovec, this gives a deadlock, because fault_in_pages_readable()
  only faults in the first iovec. To fix this situation, if we fault on
  a !uptodate page, make the next iteration only attempt to copy a
  single iovec.

- This also showed up a number of buggy prepare_write / commit_write
  implementations that were setting the page uptodate in the
  prepare_write side: bad! this allows uninitialised data to be read.
  Fix these.

(Also, rename maxlen to seglen, because it was confusing.)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1881,7 +1881,7 @@ generic_file_buffered_write(struct kiocb
 	do {
 		pgoff_t index;		/* Pagecache index for current page */
 		unsigned long offset;	/* Offset into pagecache page */
-		unsigned long maxlen;	/* Bytes remaining in current iovec */
+		unsigned long seglen;	/* Bytes remaining in current iovec */
 		size_t bytes;		/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
 
@@ -1891,18 +1891,16 @@ generic_file_buffered_write(struct kiocb
 		if (bytes > count)
 			bytes = count;
 
-		maxlen = cur_iov->iov_len - iov_offset;
-		if (maxlen > bytes)
-			maxlen = bytes;
+		seglen = min(cur_iov->iov_len - iov_offset, bytes);
 
+retry_noprogress:
 #ifndef CONFIG_DEBUG_VM
 		/*
-		 * Bring in the user page that we will copy from _first_.
-		 * Otherwise there's a nasty deadlock on copying from the
-		 * same page as we're writing to, without it being marked
-		 * up-to-date.
+		 * Bring in the user page that we will copy from _first_, this
+		 * minimises the chance we have to break into the slowpath
+		 * below.
		 */
-		fault_in_pages_readable(buf, maxlen);
+		fault_in_pages_readable(buf, seglen);
 #endif
 
 		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
@@ -1913,8 +1911,6 @@ generic_file_buffered_write(struct kiocb
 
 		status = a_ops->prepare_write(file, page, offset, offset+bytes);
 		if (unlikely(status)) {
-			loff_t isize = i_size_read(inode);
-
 			if (status != AOP_TRUNCATED_PAGE)
 				unlock_page(page);
 			page_cache_release(page);
@@ -1922,52 +1918,86 @@ generic_file_buffered_write(struct kiocb
 				continue;
 			/*
 			 * prepare_write() may have instantiated a few blocks
-			 * outside i_size.  Trim these off again.
+			 * outside i_size.  Trim these off again. Don't need
+			 * i_size_read because we hold i_mutex.
 			 */
-			if (pos + bytes > isize)
-				vmtruncate(inode, isize);
+			if (pos + bytes > inode->i_size)
+				vmtruncate(inode, inode->i_size);
 			break;
 		}
+
+		/*
+		 * Must not enter the pagefault handler here, because we hold
+		 * the page lock, so we might recursively deadlock on the same
+		 * lock, or get an ABBA deadlock against a different lock, or
+		 * against the mmap_sem (which nests outside the page lock).
+		 * So increment preempt count, and use _atomic usercopies.
+		 */
+		inc_preempt_count();
 		if (likely(nr_segs == 1))
-			copied = filemap_copy_from_user(page, offset,
+			copied = filemap_copy_from_user_atomic(page, offset,
 							buf, bytes);
 		else
-			copied = filemap_copy_from_user_iovec(page, offset,
-						cur_iov, iov_offset, bytes);
+			copied = filemap_copy_from_user_iovec_atomic(page,
+						offset, cur_iov, iov_offset,
+						bytes);
+		dec_preempt_count();
+
+		if (!PageUptodate(page)) {
+			/*
+			 * If the page is not uptodate, we cannot allow a
+			 * partial commit_write, because that might expose
+			 * uninitialised data.
+			 *
+			 * We will enter the single-segment path below, which
+			 * should get the filesystem to bring the page
+			 * uptodate for us next time.
+			 */
+			if (unlikely(copied != bytes))
+				copied = 0;
+		}
+
 		flush_dcache_page(page);
-		status = a_ops->commit_write(file, page, offset, offset+bytes);
+		status = a_ops->commit_write(file, page, offset, offset+copied);
 		if (status == AOP_TRUNCATED_PAGE) {
 			page_cache_release(page);
 			continue;
 		}
-		if (likely(copied > 0)) {
-			if (!status)
-				status = copied;
-
-			if (status >= 0) {
-				written += status;
-				count -= status;
-				pos += status;
-				buf += status;
-				if (unlikely(nr_segs > 1)) {
-					filemap_set_next_iovec(&cur_iov,
-							&iov_offset, status);
-					if (count)
-						buf = cur_iov->iov_base +
-							iov_offset;
-				} else {
-					iov_offset += status;
-				}
-			}
-		}
-		if (unlikely(copied != bytes))
-			if (status >= 0)
-				status = -EFAULT;
 		unlock_page(page);
 		mark_page_accessed(page);
 		page_cache_release(page);
+		if (status < 0)
 			break;
+		if (likely(copied > 0)) {
+			written += copied;
+			count -= copied;
+			pos += copied;
+			buf += copied;
+			if (unlikely(nr_segs > 1)) {
+				filemap_set_next_iovec(&cur_iov,
+						&iov_offset, copied);
+				if (count)
+					buf = cur_iov->iov_base + iov_offset;
+			} else {
+				iov_offset += copied;
+			}
+		} else {
+#ifdef CONFIG_DEBUG_VM
+			fault_in_pages_readable(buf, bytes);
+#endif
+			/*
			 * OK, we took a fault without making progress. Fall
			 * back to trying just a single segment next time.
			 * This is important to prevent deadlock if a full
			 * page write was not causing the page to be brought
			 * uptodate. A smaller write will tend to bring it
			 * uptodate in ->prepare_write, and thus we have a
			 * chance of making some progress.
			 */
+			bytes = seglen;
+			goto retry_noprogress;
+		}
 		balance_dirty_pages_ratelimited(mapping);
 		cond_resched();
 	} while (count);
Index: linux-2.6/mm/filemap.h
===================================================================
--- linux-2.6.orig/mm/filemap.h
+++ linux-2.6/mm/filemap.h
@@ -22,19 +22,19 @@ __filemap_copy_from_user_iovec_inatomic(
 
 /*
  * Copy as much as we can into the page and return the number of bytes which
- * were sucessfully copied.  If a fault is encountered then clear the page
- * out to (offset+bytes) and return the number of bytes which were copied.
+ * were successfully copied.  If a fault is encountered then return the number
+ * of bytes which were copied.
  *
- * NOTE: For this to work reliably we really want copy_from_user_inatomic_nocache
- * to *NOT* zero any tail of the buffer that it failed to copy.  If it does,
- * and if the following non-atomic copy succeeds, then there is a small window
- * where the target page contains neither the data before the write, nor the
- * data after the write (it contains zero).  A read at this time will see
- * data that is inconsistent with any ordering of the read and the write.
- * (This has been detected in practice).
+ * NOTE: For this to work reliably we really want
+ * copy_from_user_inatomic_nocache to *NOT* zero any tail of the buffer that it
+ * failed to copy.  If it does, and if the following non-atomic copy succeeds,
+ * then there is a small window where the target page contains neither the data
+ * before the write, nor the data after the write (it contains zero).  A read
+ * at this time will see data that is inconsistent with any ordering of the
+ * read and the write.  (This has been detected in practice).
 */
 static inline size_t
-filemap_copy_from_user(struct page *page, unsigned long offset,
+filemap_copy_from_user_atomic(struct page *page, unsigned long offset,
 			const char __user *buf, unsigned bytes)
 {
 	char *kaddr;
@@ -44,23 +44,32 @@ filemap_copy_from_user(struct page
 	left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
 	kunmap_atomic(kaddr, KM_USER0);
 
-	if (left != 0) {
-		/* Do it the slow way */
-		kaddr = kmap(page);
-		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
-		kunmap(page);
-	}
+	return bytes - left;
+}
+
+static inline size_t
+filemap_copy_from_user_nonatomic(struct page *page, unsigned long offset,
+			const char __user *buf, unsigned bytes)
+{
+	char *kaddr;
+	int left;
+
+	kaddr = kmap(page);
+	left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+	kunmap(page);
+
 	return bytes - left;
 }
 
 /*
- * This has the same sideeffects and return value as filemap_copy_from_user().
+ * This has the same sideeffects and return value as
+ * filemap_copy_from_user_atomic().
  * The difference is that on a fault we need to memset the remainder of the
  * page (out to offset+bytes), to emulate filemap_copy_from_user()'s
  * single-segment behaviour.
 */
 static inline size_t
-filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
+filemap_copy_from_user_iovec_atomic(struct page *page, unsigned long offset,
 			const struct iovec *iov, size_t base, size_t bytes)
 {
 	char *kaddr;
@@ -70,14 +79,27 @@ filemap_copy_from_user_iovec(struct page
 	copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
 							 base, bytes);
 	kunmap_atomic(kaddr, KM_USER0);
-	if (copied != bytes) {
-		kaddr = kmap(page);
-		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
-								 base, bytes);
-		if (bytes - copied)
-			memset(kaddr + offset + copied, 0, bytes - copied);
-		kunmap(page);
-	}
+	return copied;
+}
+
+/*
+ * This has the same sideeffects and return value as
+ * filemap_copy_from_user_nonatomic().
+ * The difference is that on a fault we need to memset the remainder of the
+ * page (out to offset+bytes), to emulate filemap_copy_from_user_nonatomic()'s
+ * single-segment behaviour.
+ */
+static inline size_t
+filemap_copy_from_user_iovec_nonatomic(struct page *page, unsigned long offset,
+			const struct iovec *iov, size_t base, size_t bytes)
+{
+	char *kaddr;
+	size_t copied;
+
+	kaddr = kmap(page);
+	copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
+							 base, bytes);
+	kunmap(page);
 	return copied;
 }
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -327,32 +327,35 @@ int simple_readpage(struct file *file, s
 int simple_prepare_write(struct file *file, struct page *page,
 			unsigned from, unsigned to)
 {
-	if (!PageUptodate(page)) {
-		if (to - from != PAGE_CACHE_SIZE) {
-			void *kaddr = kmap_atomic(page, KM_USER0);
-			memset(kaddr, 0, from);
-			memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
-			flush_dcache_page(page);
-			kunmap_atomic(kaddr, KM_USER0);
-		}
+	if (PageUptodate(page))
+		return 0;
+
+	if (to - from != PAGE_CACHE_SIZE) {
+		clear_highpage(page);
+		flush_dcache_page(page);
 		SetPageUptodate(page);
 	}
+
 	return 0;
 }
 
 int simple_commit_write(struct file *file, struct page *page,
-		unsigned offset, unsigned to)
+		unsigned from, unsigned to)
 {
-	struct inode *inode = page->mapping->host;
-	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
-
-	/*
-	 * No need to use i_size_read() here, the i_size
-	 * cannot change under us because we hold the i_mutex.
-	 */
-	if (pos > inode->i_size)
-		i_size_write(inode, pos);
-	set_page_dirty(page);
+	if (to > from) {
+		struct inode *inode = page->mapping->host;
+		loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
+		if (to - from == PAGE_CACHE_SIZE)
+			SetPageUptodate(page);
+		/*
+		 * No need to use i_size_read() here, the i_size
+		 * cannot change under us because we hold the i_mutex.
+		 */
+		if (pos > inode->i_size)
+			i_size_write(inode, pos);
+		set_page_dirty(page);
+	}
 	return 0;
 }
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1856,6 +1856,9 @@ static int __block_commit_write(struct i
 	unsigned blocksize;
 	struct buffer_head *bh, *head;
 
+	if (from == to)
+		return 0;
+
 	blocksize = 1 << inode->i_blkbits;
 
 	for(bh = head = page_buffers(page), block_start = 0;
@@ -2317,17 +2320,6 @@ int nobh_prepare_write(struct page *page
 
 	if (is_mapped_to_disk)
 		SetPageMappedToDisk(page);
-	SetPageUptodate(page);
-
-	/*
-	 * Setting the page dirty here isn't necessary for the prepare_write
-	 * function - commit_write will do that.  But if/when this function is
-	 * used within the pagefault handler to ensure that all mmapped pages
-	 * have backing space in the filesystem, we will need to dirty the page
-	 * if its contents were altered.
-	 */
-	if (dirtied_it)
-		set_page_dirty(page);
 
 	return 0;
 
@@ -2337,15 +2329,6 @@ failed:
 			free_buffer_head(read_bh[i]);
 	}
 
-	/*
-	 * Error recovery is pretty slack.  Clear the page and mark it dirty
-	 * so we'll later zero out any blocks which _were_ allocated.
-	 */
-	kaddr = kmap_atomic(page, KM_USER0);
-	memset(kaddr, 0, PAGE_CACHE_SIZE);
-	kunmap_atomic(kaddr, KM_USER0);
-	SetPageUptodate(page);
-	set_page_dirty(page);
 	return ret;
 }
 EXPORT_SYMBOL(nobh_prepare_write);
@@ -2353,13 +2336,16 @@ EXPORT_SYMBOL(nobh_prepare_write);
 int nobh_commit_write(struct file *file, struct page *page,
 		unsigned from, unsigned to)
 {
-	struct inode *inode = page->mapping->host;
-	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	if (to > from) {
+		struct inode *inode = page->mapping->host;
+		loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
 
-	set_page_dirty(page);
-	if (pos > inode->i_size) {
-		i_size_write(inode, pos);
-		mark_inode_dirty(inode);
+		SetPageUptodate(page);
+		set_page_dirty(page);
+		if (pos > inode->i_size) {
+			i_size_write(inode, pos);
+			mark_inode_dirty(inode);
+		}
 	}
 	return 0;
 }
@@ -2450,6 +2436,7 @@ int nobh_truncate_page(struct address_sp
 		memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
 		flush_dcache_page(page);
 		kunmap_atomic(kaddr, KM_USER0);
+		SetPageUptodate(page);
 		set_page_dirty(page);
 	}
 	unlock_page(page);
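
Stripped of the surrounding detail, the copy step above reduces to the
following shape (an illustrative sketch of the logic only, not the
actual patched function):

	/*
	 * Sketch of the no-deadlock copy step: disable pagefaults
	 * around the usercopy, commit only what was actually copied,
	 * and retry with a single segment if no progress was made.
	 */
	while (count) {
		/* lock_page() + a_ops->prepare_write() done here ... */

		inc_preempt_count();	/* pagefault handler now bails out */
		copied = filemap_copy_from_user_atomic(page, offset,
						       buf, bytes);
		dec_preempt_count();

		if (!PageUptodate(page) && copied != bytes)
			copied = 0;	/* never expose uninitialised data */

		status = a_ops->commit_write(file, page, offset,
					     offset + copied);
		unlock_page(page);	/* faulting is safe again */

		if (copied > 0) {
			/* advance buf/pos/count by 'copied' ... */
		} else {
			fault_in_pages_readable(buf, bytes);
			bytes = seglen;	/* fall back to one segment */
		}
	}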
* Re: [patch 6/6] mm: fix pagecache write deadlocks
From: Andrew Morton @ 2006-10-13 22:14 UTC
To: Nick Piggin
Cc: Linux Memory Management, Neil Brown, Anton Altaparmakov,
    Chris Mason, Linux Kernel

On Fri, 13 Oct 2006 18:44:52 +0200 (CEST)
Nick Piggin <npiggin@suse.de> wrote:

> From: Andrew Morton <akpm@osdl.org> and Nick Piggin <npiggin@suse.de>
>
> The idea is to modify the core write() code so that it won't take a
> pagefault while holding a lock on the pagecache page. There are a number
> of different deadlocks possible if we try to do such a thing:
>
> 1. generic_buffered_write
> 2. lock_page
> 3. prepare_write
> 4. unlock_page+vmtruncate
> 5. copy_from_user
> 6. mmap_sem(r)
> 7. handle_mm_fault
> 8. lock_page (filemap_nopage)
> 9. commit_write
> 10. unlock_page
>
> b. sys_munmap / sys_mlock / others
> c. mmap_sem(w)
> d. make_pages_present
> e. get_user_pages
> f. handle_mm_fault
> g. lock_page (filemap_nopage)
>
> 2,8 - recursive deadlock if page is same
> 2,8;2,7 - ABBA deadlock if page is different
> 2,6;c,g - ABBA deadlock if page is same
>
> - Instead of copy_from_user(), use inc_preempt_count() and
>   copy_from_user_inatomic().
>
> - If the copy_from_user_inatomic() hits a pagefault, it'll return a short
>   copy.
>
> - If the page was not uptodate, we cannot commit the write, because the
>   uncopied bit could have uninitialised data. Commit a zero length copy,
>   which should do the right thing (ie. not set the page uptodate).
>
> - If the page was uptodate, commit the copied portion so we make some
>   progress.
>
> - unlock_page()
>
> - Go back and try to fault the page in again, redo the lock_page,
>   prepare_write, copy_from_user_inatomic(), etc.
>
> - This also showed up a number of buggy prepare_write / commit_write
>   implementations that were setting the page uptodate in the
>   prepare_write side: bad! this allows uninitialised data to be read.
>   Fix these.

Well.  It's non-buggy under the current protocol because the page remains
locked throughout.  This patch would make these ->prepare_write()
implementations buggy.

> +#ifdef CONFIG_DEBUG_VM
> +			fault_in_pages_readable(buf, bytes);
> +#endif

I'll need to remember to take that out later on.  Or reorder the patches,
I guess.

>  int simple_commit_write(struct file *file, struct page *page,
> -		unsigned offset, unsigned to)
> +		unsigned from, unsigned to)
>  {
> -	struct inode *inode = page->mapping->host;
> -	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
> -
> -	/*
> -	 * No need to use i_size_read() here, the i_size
> -	 * cannot change under us because we hold the i_mutex.
> -	 */
> -	if (pos > inode->i_size)
> -		i_size_write(inode, pos);
> -	set_page_dirty(page);
> +	if (to > from) {
> +		struct inode *inode = page->mapping->host;
> +		loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
> +
> +		if (to - from == PAGE_CACHE_SIZE)
> +			SetPageUptodate(page);

This SetPageUptodate() can go away?
> @@ -2317,17 +2320,6 @@ int nobh_prepare_write(struct page *page
>
>  	if (is_mapped_to_disk)
>  		SetPageMappedToDisk(page);
> -	SetPageUptodate(page);
> -
> -	/*
> -	 * Setting the page dirty here isn't necessary for the prepare_write
> -	 * function - commit_write will do that.  But if/when this function is
> -	 * used within the pagefault handler to ensure that all mmapped pages
> -	 * have backing space in the filesystem, we will need to dirty the page
> -	 * if its contents were altered.
> -	 */
> -	if (dirtied_it)
> -		set_page_dirty(page);
>
>  	return 0;

Local variable `dirtied_it' can go away now.

Or can it?  We've modified the page's contents.  If the copy_from_user
gets a fault and we do a zero-length ->commit_write(), nobody ends up
dirtying the page.

> @@ -2450,6 +2436,7 @@ int nobh_truncate_page(struct address_sp
>  		memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
>  		flush_dcache_page(page);
>  		kunmap_atomic(kaddr, KM_USER0);
> +		SetPageUptodate(page);
>  		set_page_dirty(page);
>  	}
>  	unlock_page(page);

I've already forgotten why this was added.  Comment, please ;)
* Re: [patch 6/6] mm: fix pagecache write deadlocks
From: Nick Piggin @ 2006-10-14 4:19 UTC
To: Andrew Morton
Cc: Linux Memory Management, Neil Brown, Anton Altaparmakov,
    Chris Mason, Linux Kernel

On Fri, Oct 13, 2006 at 03:14:57PM -0700, Andrew Morton wrote:
> On Fri, 13 Oct 2006 18:44:52 +0200 (CEST)
> Nick Piggin <npiggin@suse.de> wrote:
> >
> > - This also showed up a number of buggy prepare_write / commit_write
> >   implementations that were setting the page uptodate in the
> >   prepare_write side: bad! this allows uninitialised data to be read.
> >   Fix these.
>
> Well.  It's non-buggy under the current protocol because the page remains
> locked throughout.  This patch would make these ->prepare_write()
> implementations buggy.

But if it becomes uptodate, then do_generic_mapping_read can read it
without locking it (and so can filemap_nopage at present, although it
looks like that's going to take the page lock soon).

> > +#ifdef CONFIG_DEBUG_VM
> > +			fault_in_pages_readable(buf, bytes);
> > +#endif
>
> I'll need to remember to take that out later on.  Or reorder the patches, I
> guess.
>
> >  int simple_commit_write(struct file *file, struct page *page,
> > -		unsigned offset, unsigned to)
> > +		unsigned from, unsigned to)
> >  {
> > -	struct inode *inode = page->mapping->host;
> > -	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
> > -
> > -	/*
> > -	 * No need to use i_size_read() here, the i_size
> > -	 * cannot change under us because we hold the i_mutex.
> > -	 */
> > -	if (pos > inode->i_size)
> > -		i_size_write(inode, pos);
> > -	set_page_dirty(page);
> > +	if (to > from) {
> > +		struct inode *inode = page->mapping->host;
> > +		loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
> > +
> > +		if (to - from == PAGE_CACHE_SIZE)
> > +			SetPageUptodate(page);
>
> This SetPageUptodate() can go away?

It is needed because the prepare_write does not try to bring the page
uptodate if it is a full page write. (I was considering another patch to
just remove that completely, as it might be overkill for something like
sysfs, but it demonstrates the principle quite nicely).

> > @@ -2317,17 +2320,6 @@ int nobh_prepare_write(struct page *page
> >
> >  	if (is_mapped_to_disk)
> >  		SetPageMappedToDisk(page);
> > -	SetPageUptodate(page);
> > -
> > -	/*
> > -	 * Setting the page dirty here isn't necessary for the prepare_write
> > -	 * function - commit_write will do that.  But if/when this function is
> > -	 * used within the pagefault handler to ensure that all mmapped pages
> > -	 * have backing space in the filesystem, we will need to dirty the page
> > -	 * if its contents were altered.
> > -	 */
> > -	if (dirtied_it)
> > -		set_page_dirty(page);
> >
> >  	return 0;
>
> Local variable `dirtied_it' can go away now.
>
> Or can it?  We've modified the page's contents.  If the copy_from_user
> gets a fault and we do a zero-length ->commit_write(), nobody ends up
> dirtying the page.

But only if the page is not uptodate. Otherwise we won't modify its
contents (no need to).

> > @@ -2450,6 +2436,7 @@ int nobh_truncate_page(struct address_sp
> >  		memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
> >  		flush_dcache_page(page);
> >  		kunmap_atomic(kaddr, KM_USER0);
> > +		SetPageUptodate(page);
> >  		set_page_dirty(page);
> >  	}
> >  	unlock_page(page);
>
> I've already forgotten why this was added.
> Comment, please ;)

Well, nobh_prepare_write no longer sets it uptodate, so we need to if
we're going to set_page_dirty. OTOH, why does truncate_page need to
zero the pagecache anyway? I wonder if we couldn't delete this whole
function? (not in this patchset!)
* Re: [patch 6/6] mm: fix pagecache write deadlocks
From: Nick Piggin @ 2006-10-14 4:30 UTC
To: Andrew Morton
Cc: Linux Memory Management, Neil Brown, Anton Altaparmakov,
    Chris Mason, Linux Kernel

On Sat, Oct 14, 2006 at 06:19:27AM +0200, Nick Piggin wrote:
> On Fri, Oct 13, 2006 at 03:14:57PM -0700, Andrew Morton wrote:
> > On Fri, 13 Oct 2006 18:44:52 +0200 (CEST)
> > Nick Piggin <npiggin@suse.de> wrote:
> > >
> > > - This also showed up a number of buggy prepare_write / commit_write
> > >   implementations that were setting the page uptodate in the
> > >   prepare_write side: bad! this allows uninitialised data to be read.
> > >   Fix these.
> >
> > Well.  It's non-buggy under the current protocol because the page remains
> > locked throughout.  This patch would make these ->prepare_write()
> > implementations buggy.
>
> But if it becomes uptodate, then do_generic_mapping_read can read it
> without locking it (and so can filemap_nopage at present, although it
> looks like that's going to take the page lock soon).

So the simple_prepare_write bug is an uninitialised data leak. If you
read the part of the file which is about to be written to (and thus does
not get memset()ed), you can read junk. I was able to trigger this with
a simple test on ramfs.
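
A reproducer along the lines described would be tiny; roughly the
following user-space sketch (hypothetical: the path, sizes, and loop
structure are illustrative, and it assumes a ramfs mount on a pre-fix
kernel):

	/*
	 * Hypothetical sketch of the ramfs test: one process keeps
	 * rewriting the middle of a page while another reads the same
	 * region; on a buggy kernel the reader can observe stale data
	 * in the window between SetPageUptodate() and the usercopy.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/mnt/ramfs/leak",
			      O_RDWR | O_CREAT | O_TRUNC, 0600);
		char pat[256], buf[256];

		memset(pat, 'A', sizeof(pat));
		if (fork() == 0) {
			for (;;) {	/* writer: recreate the page so
					 * it starts out !uptodate */
				ftruncate(fd, 0);
				pwrite(fd, pat, sizeof(pat), 1024);
			}
		}
		for (;;) {		/* reader: sample that region */
			if (pread(fd, buf, sizeof(buf), 1024)
					== (ssize_t)sizeof(buf) &&
			    buf[0] != 0 && buf[0] != 'A') {
				printf("leak: 0x%02x\n",
				       (unsigned char)buf[0]);
				return 1;
			}
		}
	}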
* Re: [patch 6/6] mm: fix pagecache write deadlocks
From: Peter Zijlstra @ 2006-10-15 11:35 UTC
To: Nick Piggin
Cc: Andrew Morton, Linux Memory Management, Neil Brown,
    Anton Altaparmakov, Chris Mason, Linux Kernel

On Sat, 2006-10-14 at 06:19 +0200, Nick Piggin wrote:
> On Fri, Oct 13, 2006 at 03:14:57PM -0700, Andrew Morton wrote:
> > On Fri, 13 Oct 2006 18:44:52 +0200 (CEST)
> > Nick Piggin <npiggin@suse.de> wrote:
> > > @@ -2450,6 +2436,7 @@ int nobh_truncate_page(struct address_sp
> > >  		memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
> > >  		flush_dcache_page(page);
> > >  		kunmap_atomic(kaddr, KM_USER0);
> > > +		SetPageUptodate(page);
> > >  		set_page_dirty(page);
> > >  	}
> > >  	unlock_page(page);
> >
> > I've already forgotten why this was added.  Comment, please ;)
>
> Well, nobh_prepare_write no longer sets it uptodate, so we need to if
> we're going to set_page_dirty. OTOH, why does truncate_page need to
> zero the pagecache anyway? I wonder if we couldn't delete this whole
> function? (not in this patchset!)

It zeros the tail end of the page so we don't leak old data?
* Re: [patch 6/6] mm: fix pagecache write deadlocks
From: Nick Piggin @ 2006-10-14 5:04 UTC
To: Linux Memory Management
Cc: Neil Brown, Andrew Morton, Anton Altaparmakov, Chris Mason,
    Linux Kernel

On Fri, Oct 13, 2006 at 06:44:52PM +0200, Nick Piggin wrote:
> From: Andrew Morton <akpm@osdl.org> and Nick Piggin <npiggin@suse.de>
>
> The idea is to modify the core write() code so that it won't take a
> pagefault while holding a lock on the pagecache page. There are a number
> of different deadlocks possible if we try to do such a thing:

Here is a patch to improve the comment a little. This is a pretty tricky
situation so we must be clear as to why it works.

--

Comment was not entirely clear about why we must eliminate all other
possibilities.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1946,12 +1946,19 @@ retry_noprogress:
 		if (!PageUptodate(page)) {
 			/*
 			 * If the page is not uptodate, we cannot allow a
-			 * partial commit_write, because that might expose
-			 * uninitialised data.
+			 * partial commit_write because when we unlock the
+			 * page below, someone else might bring it uptodate
+			 * and we lose our write. We cannot allow a full
+			 * commit_write, because that exposes uninitialised
+			 * data. We cannot zero the rest of the file and do
+			 * a full commit_write because that exposes transient
+			 * zeroes.
 			 *
-			 * We will enter the single-segment path below, which
-			 * should get the filesystem to bring the page
-			 * uptodate for us next time.
+			 * Abort the operation entirely with a zero length
+			 * commit_write. Retry. We will enter the
+			 * single-segment path below, which should get the
+			 * filesystem to bring the page uptodate for us next
+			 * time.
 			 */
 			if (unlikely(copied != bytes))
 				copied = 0;
* Re: [patch 6/6] mm: fix pagecache write deadlocks
From: Peter Zijlstra @ 2006-10-15 11:37 UTC
To: Nick Piggin
Cc: Linux Memory Management, Neil Brown, Anton Altaparmakov,
    Chris Mason, Linux Kernel, Andrew Morton

On Fri, 2006-10-13 at 18:44 +0200, Andrew Morton wrote:
> The idea is to modify the core write() code so that it won't take a
> pagefault while holding a lock on the pagecache page. There are a number
> of different deadlocks possible if we try to do such a thing:
>
> 1. generic_buffered_write
> 2. lock_page
> 3. prepare_write
> 4. unlock_page+vmtruncate
> 5. copy_from_user
> 6. mmap_sem(r)
> 7. handle_mm_fault
> 8. lock_page (filemap_nopage)
> 9. commit_write
> 10. unlock_page
>
> b. sys_munmap / sys_mlock / others
> c. mmap_sem(w)
> d. make_pages_present
> e. get_user_pages
> f. handle_mm_fault
> g. lock_page (filemap_nopage)
>
> 2,8 - recursive deadlock if page is same
> 2,8;2,7 - ABBA deadlock if page is different

2,8;2,8 I think you mean

> 2,6;c,g - ABBA deadlock if page is same

> +
> +		/*
> +		 * Must not enter the pagefault handler here, because we hold
> +		 * the page lock, so we might recursively deadlock on the same
> +		 * lock, or get an ABBA deadlock against a different lock, or
> +		 * against the mmap_sem (which nests outside the page lock).
> +		 * So increment preempt count, and use _atomic usercopies.
> +		 */
> +		inc_preempt_count();
>  		if (likely(nr_segs == 1))
> -			copied = filemap_copy_from_user(page, offset,
> +			copied = filemap_copy_from_user_atomic(page, offset,
>  							buf, bytes);
>  		else
> -			copied = filemap_copy_from_user_iovec(page, offset,
> -						cur_iov, iov_offset, bytes);
> +			copied = filemap_copy_from_user_iovec_atomic(page,
> +						offset, cur_iov, iov_offset,
> +						bytes);
> +		dec_preempt_count();
> +

Why use raw {inc,dec}_preempt_count() and not
preempt_{disable,enable}()? Is the compiler barrier not needed here? And
do we really want to avoid the preempt_check_resched()?

> Index: linux-2.6/mm/filemap.h
> ===================================================================
> --- linux-2.6.orig/mm/filemap.h
> +++ linux-2.6/mm/filemap.h
> @@ -22,19 +22,19 @@ __filemap_copy_from_user_iovec_inatomic(
>
>  /*
>   * Copy as much as we can into the page and return the number of bytes which
> - * were successfully copied.  If a fault is encountered then clear the page
> - * out to (offset+bytes) and return the number of bytes which were copied.
> + * were successfully copied.  If a fault is encountered then return the number
> + * of bytes which were copied.
>   *
> - * NOTE: For this to work reliably we really want copy_from_user_inatomic_nocache
> - * to *NOT* zero any tail of the buffer that it failed to copy.  If it does,
> - * and if the following non-atomic copy succeeds, then there is a small window
> - * where the target page contains neither the data before the write, nor the
> - * data after the write (it contains zero).  A read at this time will see
> - * data that is inconsistent with any ordering of the read and the write.
> - * (This has been detected in practice).
> + * NOTE: For this to work reliably we really want
> + * copy_from_user_inatomic_nocache to *NOT* zero any tail of the buffer that it
> + * failed to copy.
> + * If it does, and if the following non-atomic copy succeeds,
> + * then there is a small window where the target page contains neither the data
> + * before the write, nor the data after the write (it contains zero).  A read
> + * at this time will see data that is inconsistent with any ordering of the
> + * read and the write.  (This has been detected in practice).
>  */
>  static inline size_t
> -filemap_copy_from_user(struct page *page, unsigned long offset,
> +filemap_copy_from_user_atomic(struct page *page, unsigned long offset,
>  			const char __user *buf, unsigned bytes)
>  {
>  	char *kaddr;
> @@ -44,23 +44,32 @@ filemap_copy_from_user(struct page
>  	left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
>  	kunmap_atomic(kaddr, KM_USER0);
>
> -	if (left != 0) {
> -		/* Do it the slow way */
> -		kaddr = kmap(page);
> -		left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
> -		kunmap(page);
> -	}
> +	return bytes - left;
> +}
> +
> +static inline size_t
> +filemap_copy_from_user_nonatomic(struct page *page, unsigned long offset,
> +			const char __user *buf, unsigned bytes)
> +{
> +	char *kaddr;
> +	int left;
> +
> +	kaddr = kmap(page);
> +	left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
> +	kunmap(page);
> +
> 	return bytes - left;
>  }
>
>  /*
> - * This has the same sideeffects and return value as filemap_copy_from_user().
> + * This has the same sideeffects and return value as
> + * filemap_copy_from_user_atomic().
>   * The difference is that on a fault we need to memset the remainder of the
>   * page (out to offset+bytes), to emulate filemap_copy_from_user()'s
>   * single-segment behaviour.
>   */
>  static inline size_t
> -filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
> +filemap_copy_from_user_iovec_atomic(struct page *page, unsigned long offset,
>  			const struct iovec *iov, size_t base, size_t bytes)
>  {
>  	char *kaddr;
> @@ -70,14 +79,27 @@ filemap_copy_from_user_iovec(struct page
>  	copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
>  							 base, bytes);
>  	kunmap_atomic(kaddr, KM_USER0);
> -	if (copied != bytes) {
> -		kaddr = kmap(page);
> -		copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
> -								 base, bytes);
> -		if (bytes - copied)
> -			memset(kaddr + offset + copied, 0, bytes - copied);
> -		kunmap(page);
> -	}
> +	return copied;
> +}
> +
> +/*
> + * This has the same sideeffects and return value as
> + * filemap_copy_from_user_nonatomic().
> + * The difference is that on a fault we need to memset the remainder of the
> + * page (out to offset+bytes), to emulate filemap_copy_from_user_nonatomic()'s
> + * single-segment behaviour.
> + */
> +static inline size_t
> +filemap_copy_from_user_iovec_nonatomic(struct page *page, unsigned long offset,
> +			const struct iovec *iov, size_t base, size_t bytes)
> +{
> +	char *kaddr;
> +	size_t copied;
> +
> +	kaddr = kmap(page);
> +	copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
> +							 base, bytes);
> +	kunmap(page);
> 	return copied;
>  }
>

Why create the _nonatomic versions? There are no users.
* Re: [patch 6/6] mm: fix pagecache write deadlocks
From: Nick Piggin @ 2006-10-15 11:56 UTC
To: Peter Zijlstra
Cc: Linux Memory Management, Neil Brown, Anton Altaparmakov,
    Chris Mason, Linux Kernel, Andrew Morton

On Sun, Oct 15, 2006 at 01:37:10PM +0200, Peter Zijlstra wrote:
> On Fri, 2006-10-13 at 18:44 +0200, Andrew Morton wrote:
> > The idea is to modify the core write() code so that it won't take a
> > pagefault while holding a lock on the pagecache page. There are a number
> > of different deadlocks possible if we try to do such a thing:
> >
> > 1. generic_buffered_write
> > 2. lock_page
> > 3. prepare_write
> > 4. unlock_page+vmtruncate
> > 5. copy_from_user
> > 6. mmap_sem(r)
> > 7. handle_mm_fault
> > 8. lock_page (filemap_nopage)
> > 9. commit_write
> > 10. unlock_page
> >
> > b. sys_munmap / sys_mlock / others
> > c. mmap_sem(w)
> > d. make_pages_present
> > e. get_user_pages
> > f. handle_mm_fault
> > g. lock_page (filemap_nopage)
> >
> > 2,8 - recursive deadlock if page is same
> > 2,8;2,7 - ABBA deadlock if page is different
>
> 2,8;2,8 I think you mean

Right. I've asked akpm to make a note of it (I don't think I can send
a meta-patch ;))

> > +		/*
> > +		 * Must not enter the pagefault handler here, because we hold
> > +		 * the page lock, so we might recursively deadlock on the same
> > +		 * lock, or get an ABBA deadlock against a different lock, or
> > +		 * against the mmap_sem (which nests outside the page lock).
> > +		 * So increment preempt count, and use _atomic usercopies.
> > +		 */
> > +		inc_preempt_count();
> >  		if (likely(nr_segs == 1))
> > -			copied = filemap_copy_from_user(page, offset,
> > +			copied = filemap_copy_from_user_atomic(page, offset,
> >  							buf, bytes);
> >  		else
> > -			copied = filemap_copy_from_user_iovec(page, offset,
> > -						cur_iov, iov_offset, bytes);
> > +			copied = filemap_copy_from_user_iovec_atomic(page,
> > +						offset, cur_iov, iov_offset,
> > +						bytes);
> > +		dec_preempt_count();
> > +
>
> Why use raw {inc,dec}_preempt_count() and not
> preempt_{disable,enable}()? Is the compiler barrier not needed here? And
> do we really want to avoid the preempt_check_resched()?

Counter to intuition, we actually don't mind being preempted here,
but we do mind entering the (core) pagefault handler. Incrementing
the preempt count causes the arch specific handler to bail out early
before it takes any locks.

Clear as mud? Wrapping it in a better name might be an improvement?
Or wrapping it into the copy*user_atomic functions themselves (which
is AFAIK the only place we use it).

> > Index: linux-2.6/mm/filemap.h
> > ===================================================================
> > --- linux-2.6.orig/mm/filemap.h
> > +++ linux-2.6/mm/filemap.h
> > @@ -22,19 +22,19 @@ __filemap_copy_from_user_iovec_inatomic(
> > +/*
> > + * This has the same sideeffects and return value as
> > + * filemap_copy_from_user_nonatomic().
> > + * The difference is that on a fault we need to memset the remainder of the
> > + * page (out to offset+bytes), to emulate filemap_copy_from_user_nonatomic()'s
> > + * single-segment behaviour.
> > + */
> > +static inline size_t
> > +filemap_copy_from_user_iovec_nonatomic(struct page *page, unsigned long offset,
> > +			const struct iovec *iov, size_t base, size_t bytes)
> > +{
> > +	char *kaddr;
> > +	size_t copied;
> > +
> > +	kaddr = kmap(page);
> > +	copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov,
> > +							 base, bytes);
> > +	kunmap(page);
> > 	return copied;
> > }
>
> Why create the _nonatomic versions? There are no users.

This was leftover from Andrew's patch... maybe filemap_xip wants it and
I've forgotten about it?
* Re: [patch 6/6] mm: fix pagecache write deadlocks
From: Peter Zijlstra @ 2006-10-15 13:51 UTC
To: Nick Piggin
Cc: Linux Memory Management, Neil Brown, Anton Altaparmakov,
    Chris Mason, Linux Kernel, Andrew Morton

> > > +		/*
> > > +		 * Must not enter the pagefault handler here, because we hold
> > > +		 * the page lock, so we might recursively deadlock on the same
> > > +		 * lock, or get an ABBA deadlock against a different lock, or
> > > +		 * against the mmap_sem (which nests outside the page lock).
> > > +		 * So increment preempt count, and use _atomic usercopies.
> > > +		 */
> > > +		inc_preempt_count();
> > >  		if (likely(nr_segs == 1))
> > > -			copied = filemap_copy_from_user(page, offset,
> > > +			copied = filemap_copy_from_user_atomic(page, offset,
> > >  							buf, bytes);
> > >  		else
> > > -			copied = filemap_copy_from_user_iovec(page, offset,
> > > -						cur_iov, iov_offset, bytes);
> > > +			copied = filemap_copy_from_user_iovec_atomic(page,
> > > +						offset, cur_iov, iov_offset,
> > > +						bytes);
> > > +		dec_preempt_count();
> > > +
> >
> > Why use raw {inc,dec}_preempt_count() and not
> > preempt_{disable,enable}()? Is the compiler barrier not needed here? And
> > do we really want to avoid the preempt_check_resched()?
>
> Counter to intuition, we actually don't mind being preempted here,
> but we do mind entering the (core) pagefault handler. Incrementing
> the preempt count causes the arch specific handler to bail out early
> before it takes any locks.
>
> Clear as mud? Wrapping it in a better name might be an improvement?
> Or wrapping it into the copy*user_atomic functions themselves (which
> is AFAIK the only place we use it).

Right, but since you do inc the preempt_count you do disable preemption,
might as well check TIF_NEED_RESCHED when enabling preemption again.

Sticking it in the atomic copy functions does make sense to me.
* Re: SPAM: Re: [patch 6/6] mm: fix pagecache write deadlocks 2006-10-15 13:51 ` Peter Zijlstra @ 2006-10-15 14:19 ` Nick Piggin 2006-10-15 15:47 ` Peter Zijlstra 0 siblings, 1 reply; 23+ messages in thread From: Nick Piggin @ 2006-10-15 14:19 UTC (permalink / raw) To: Peter Zijlstra Cc: Linux Memory Management, Neil Brown, Anton Altaparmakov, Chris Mason, Linux Kernel, Andrew Morton On Sun, Oct 15, 2006 at 03:51:09PM +0200, Peter Zijlstra wrote: > > > > > > > Why use raw {inc,dec}_preempt_count() and not > > > preempt_{disable,enable}()? Is the compiler barrier not needed here? And > > > do we really want to avoid the preempt_check_resched()? > > > > Counter to intuition, we actually don't mind being preempted here, > > but we do mind entering the (core) pagefault handler. Incrementing > > the preempt count causes the arch specific handler to bail out early > > before it takes any locks. > > > > Clear as mud? Wrapping it in a better name might be an improvement? > > Or wrapping it into the copy*user_atomic functions themselves (which > > is AFAIK the only place we use it). > > Right, but since you do inc the preempt_count you do disable preemption, > might as well check TIF_NEED_RESCHED when enabling preemption again. Yeah, you are right about that. Unfortunately there isn't a good way to do this at the moment... well we could disable preempt around the section, but that would be silly for a PREEMPT kernel. And we should really decouple it from preempt entirely, in case we ever want to check for it some other way in the pagefault handler. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
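The early bail-out Nick describes lives in each architecture's fault handler. Paraphrasing the i386 do_page_fault() of that era (a sketch, not an exact quote):

        struct mm_struct *mm = tsk->mm;

        /*
         * If we're in an interrupt, have no user context, or are running
         * in an atomic region (raised preempt count), we must not take
         * the fault: fall through to the exception-fixup path without
         * ever touching mmap_sem or lock_page.
         */
        if (in_atomic() || !mm)
                goto bad_area_nosemaphore;

        down_read(&mm->mmap_sem);
        /* ... normal fault handling under mmap_sem ... */

This is why a usercopy performed with the preempt count raised terminates with a short count rather than recursing into locks that generic_file_buffered_write() already holds.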
* Re: SPAM: Re: [patch 6/6] mm: fix pagecache write deadlocks 2006-10-15 14:19 ` SPAM: " Nick Piggin @ 2006-10-15 15:47 ` Peter Zijlstra 2006-10-15 15:57 ` RRe: " Nick Piggin 0 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2006-10-15 15:47 UTC (permalink / raw) To: Nick Piggin Cc: Linux Memory Management, Neil Brown, Anton Altaparmakov, Chris Mason, Linux Kernel, Andrew Morton, ralf, David Howells On Sun, 2006-10-15 at 16:19 +0200, Nick Piggin wrote: > On Sun, Oct 15, 2006 at 03:51:09PM +0200, Peter Zijlstra wrote: > > > > > > > > > > Why use raw {inc,dec}_preempt_count() and not > > > > preempt_{disable,enable}()? Is the compiler barrier not needed here? And > > > > do we really want to avoid the preempt_check_resched()? > > > > > > Counter to intuition, we actually don't mind being preempted here, > > > but we do mind entering the (core) pagefault handler. Incrementing > > > the preempt count causes the arch specific handler to bail out early > > > before it takes any locks. > > > > > > Clear as mud? Wrapping it in a better name might be an improvement? > > > Or wrapping it into the copy*user_atomic functions themselves (which > > > is AFAIK the only place we use it). > > > > Right, but since you do inc the preempt_count you do disable preemption, > > might as well check TIF_NEED_RESCHED when enabling preemption again. > > Yeah, you are right about that. Unfortunately there isn't a good > way to do this at the moment... well we could disable preempt > around the section, but that would be silly for a PREEMPT kernel. > > And we should really decouple it from preempt entirely, in case we > ever want to check for it some other way in the pagefault handler. How about we make sure all kmap_atomic implementations behave properly and make in_atomic true. 
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/asm-frv/highmem.h | 5 +++-- include/asm-mips/highmem.h | 10 ++++++++-- include/linux/highmem.h | 9 ++++++--- mm/filemap.c | 4 +--- 4 files changed, 18 insertions(+), 10 deletions(-) Index: linux-2.6/include/asm-frv/highmem.h =================================================================== --- linux-2.6.orig/include/asm-frv/highmem.h 2006-07-17 22:30:51.000000000 +0200 +++ linux-2.6/include/asm-frv/highmem.h 2006-10-15 17:32:02.000000000 +0200 @@ -115,7 +115,7 @@ static inline void *kmap_atomic(struct p { unsigned long paddr; - preempt_disable(); + inc_preempt_count(); paddr = page_to_phys(page); switch (type) { @@ -170,7 +170,8 @@ static inline void kunmap_atomic(void *k default: BUG(); } - preempt_enable(); + dec_preempt_count(); + preempt_check_resched(); } #endif /* !__ASSEMBLY__ */ Index: linux-2.6/include/asm-mips/highmem.h =================================================================== --- linux-2.6.orig/include/asm-mips/highmem.h 2006-07-17 22:30:56.000000000 +0200 +++ linux-2.6/include/asm-mips/highmem.h 2006-10-15 17:36:49.000000000 +0200 @@ -70,11 +70,17 @@ static inline void *kmap(struct page *pa static inline void *kmap_atomic(struct page *page, enum km_type type) { + inc_preempt_count(); return page_address(page); } -static inline void kunmap_atomic(void *kvaddr, enum km_type type) { } -#define kmap_atomic_pfn(pfn, idx) page_address(pfn_to_page(pfn)) +static inline void kunmap_atomic(void *kvaddr, enum km_type type) +{ + dec_preempt_count(); + preempt_check_resched(); +} + +#define kmap_atomic_pfn(pfn, idx) kmap_atomic(pfn_to_page(pfn), (idx)) #define kmap_atomic_to_page(ptr) virt_to_page(ptr) Index: linux-2.6/include/linux/highmem.h =================================================================== --- linux-2.6.orig/include/linux/highmem.h 2006-10-07 18:47:32.000000000 +0200 +++ linux-2.6/include/linux/highmem.h 2006-10-15 17:32:44.000000000 +0200 @@ -3,6 +3,7 @@ #include <linux/fs.h> #include <linux/mm.h> +#include <linux/preempt.h> #include <asm/cacheflush.h> @@ -41,9 +42,11 @@ static inline void *kmap(struct page *pa #define kunmap(page) do { (void) (page); } while (0) -#define kmap_atomic(page, idx) page_address(page) -#define kunmap_atomic(addr, idx) do { } while (0) -#define kmap_atomic_pfn(pfn, idx) page_address(pfn_to_page(pfn)) +#define kmap_atomic(page, idx) \ + ({ inc_preempt_count(); page_address(page); }) +#define kunmap_atomic(addr, idx) \ + do { dec_preempt_count(); preempt_check_resched(); } while (0) +#define kmap_atomic_pfn(pfn, idx) kmap_atomic(pfn_to_page(pfn), (idx)) #define kmap_atomic_to_page(ptr) virt_to_page(ptr) #endif Index: linux-2.6/mm/filemap.c =================================================================== --- linux-2.6.orig/mm/filemap.c 2006-10-15 17:24:41.000000000 +0200 +++ linux-2.6/mm/filemap.c 2006-10-15 17:40:19.000000000 +0200 @@ -2140,9 +2140,8 @@ retry_noprogress: * the page lock, so we might recursively deadlock on the same * lock, or get an ABBA deadlock against a different lock, or * against the mmap_sem (which nests outside the page lock). - * So increment preempt count, and use _atomic usercopies. + * So use _atomic usercopies. 
*/ - inc_preempt_count(); if (likely(nr_segs == 1)) copied = filemap_copy_from_user_atomic(page, offset, buf, bytes); @@ -2150,7 +2149,6 @@ retry_noprogress: copied = filemap_copy_from_user_iovec_atomic(page, offset, cur_iov, iov_offset, bytes); - dec_preempt_count(); if (!PageUptodate(page)) { /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
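If every kmap_atomic() implementation raises the preempt count as in the patch above, the write path no longer needs its explicit inc/dec pair: the atomic kmap itself marks the region atomic. Under that assumption the copy helper reduces to something like the following fragment (a sketch, not the patch text; names as in the quoted mm/filemap.h code):

        kaddr = kmap_atomic(page, KM_USER0);    /* now implies in_atomic() */
        left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
        kunmap_atomic(kaddr, KM_USER0);         /* dec count + resched check */
        copied = bytes - left;

Note that the fault suppression then becomes an implicit side effect of the mapping primitive rather than an explicit request by the caller.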
* RRe: [patch 6/6] mm: fix pagecache write deadlocks 2006-10-15 15:47 ` Peter Zijlstra @ 2006-10-15 15:57 ` Nick Piggin 2006-10-15 16:13 ` Peter Zijlstra 0 siblings, 1 reply; 23+ messages in thread From: Nick Piggin @ 2006-10-15 15:57 UTC (permalink / raw) To: Peter Zijlstra Cc: Linux Memory Management, Neil Brown, Anton Altaparmakov, Chris Mason, Linux Kernel, Andrew Morton, ralf, David Howells On Sun, Oct 15, 2006 at 05:47:03PM +0200, Peter Zijlstra wrote: > > > > And we should really decouple it from preempt entirely, in case we > > ever want to check for it some other way in the pagefault handler. > > How about we make sure all kmap_atomic implementations behave properly > and make in_atomic true. Hmm, but you may not be doing a copy*user within the kmap. And you may want an atomic copy*user not within a kmap (maybe). I think it really would be more logical to do it in wrapper functions pagefault_disable()/pagefault_enable()? ;) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RRe: [patch 6/6] mm: fix pagecache write deadlocks 2006-10-15 15:57 ` RRe: " Nick Piggin @ 2006-10-15 16:13 ` Peter Zijlstra 2006-10-16 15:24 ` pagefault_disable (was Re: [patch 6/6] mm: fix pagecache write deadlocks) Nick Piggin 0 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2006-10-15 16:13 UTC (permalink / raw) To: Nick Piggin Cc: Linux Memory Management, Neil Brown, Anton Altaparmakov, Chris Mason, Linux Kernel, Andrew Morton, ralf, David Howells On Sun, 2006-10-15 at 17:57 +0200, Nick Piggin wrote: > On Sun, Oct 15, 2006 at 05:47:03PM +0200, Peter Zijlstra wrote: > > > > > > And we should really decouple it from preempt entirely, in case we > > > ever want to check for it some other way in the pagefault handler. > > > > How about we make sure all kmap_atomic implementations behave properly > > and make in_atomic true. > > Hmm, but you may not be doing a copy*user within the kmap. And you may > want an atomic copy*user not within a kmap (maybe). > > I think it really would be more logical to do it in a wrapper function > pagefault_disable() pagefault_enable()? ;) I did that one first, but then noticed that most non trivial kmap_atomic implementations already did the inc_preempt_count()/dec_preempt_count() thing (except frv which did preempt_disable()/preempt_enable() ?) Anyway, here goes: Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- mm/filemap.c | 4 +--- mm/filemap.h | 20 ++++++++++++++++++++ 2 files changed, 21 insertions(+), 3 deletions(-) Index: linux-2.6/mm/filemap.c =================================================================== --- linux-2.6.orig/mm/filemap.c 2006-10-14 20:20:20.000000000 +0200 +++ linux-2.6/mm/filemap.c 2006-10-15 17:16:59.000000000 +0200 @@ -2140,9 +2140,8 @@ retry_noprogress: * the page lock, so we might recursively deadlock on the same * lock, or get an ABBA deadlock against a different lock, or * against the mmap_sem (which nests outside the page lock). - * So increment preempt count, and use _atomic usercopies. + * So use _atomic usercopies. */ - inc_preempt_count(); if (likely(nr_segs == 1)) copied = filemap_copy_from_user_atomic(page, offset, buf, bytes); @@ -2150,7 +2149,6 @@ retry_noprogress: copied = filemap_copy_from_user_iovec_atomic(page, offset, cur_iov, iov_offset, bytes); - dec_preempt_count(); if (!PageUptodate(page)) { /* Index: linux-2.6/mm/filemap.h =================================================================== --- linux-2.6.orig/mm/filemap.h 2006-10-14 20:20:20.000000000 +0200 +++ linux-2.6/mm/filemap.h 2006-10-15 17:17:45.000000000 +0200 @@ -21,6 +21,22 @@ __filemap_copy_from_user_iovec_inatomic( size_t bytes); /* + * By increasing the preempt_count we make sure the arch preempt + * handler bails out early, before taking any locks, so that the copy + * operation gets terminated early. + */ +pagefault_static inline void disable(void) +{ + inc_preempt_count(); +} + +pagefault_static inline void enable(void) +{ + dec_preempt_count(); + preempt_check_resched(); +} + +/* * Copy as much as we can into the page and return the number of bytes which * were sucessfully copied. If a fault is encountered then return the number of * bytes which were copied. 
@@ -40,9 +56,11 @@ filemap_copy_from_user_atomic(struct pag char *kaddr; int left; + pagefault_disable(); kaddr = kmap_atomic(page, KM_USER0); left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes); kunmap_atomic(kaddr, KM_USER0); + pagefault_enable(); return bytes - left; } @@ -75,10 +93,12 @@ filemap_copy_from_user_iovec_atomic(stru char *kaddr; size_t copied; + pagefault_disable(); kaddr = kmap_atomic(page, KM_USER0); copied = __filemap_copy_from_user_iovec_inatomic(kaddr + offset, iov, base, bytes); kunmap_atomic(kaddr, KM_USER0); + pagefault_enable(); return copied; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* pagefault_disable (was Re: [patch 6/6] mm: fix pagecache write deadlocks) 2006-10-15 16:13 ` Peter Zijlstra @ 2006-10-16 15:24 ` Nick Piggin 2006-10-16 16:05 ` Peter Zijlstra 0 siblings, 1 reply; 23+ messages in thread From: Nick Piggin @ 2006-10-16 15:24 UTC (permalink / raw) To: Peter Zijlstra Cc: Nick Piggin, Linux Memory Management, Linux Kernel, Andrew Morton, David Howells, Ingo Molnar (trimming cc list) Peter Zijlstra wrote: > On Sun, 2006-10-15 at 17:57 +0200, Nick Piggin wrote: >>Hmm, but you may not be doing a copy*user within the kmap. And you may >>want an atomic copy*user not within a kmap (maybe). >> >>I think it really would be more logical to do it in wrapper functions >>pagefault_disable()/pagefault_enable()? ;) > > > I did that one first, but then noticed that most non trivial kmap_atomic > implementations already did the inc_preempt_count()/dec_preempt_count() > thing (except frv which did preempt_disable()/preempt_enable() ?) > > Anyway, here goes: > > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> I think this is a good approach. The missed preempt checks could easily have been causing scheduling delays because the usercopy can take up a lot of kernel time. I don't know that the function should go in filemap.h... uaccess.h seems more appropriate, and I had thought pagefault_disable() should be called directly from within the copy_*_user_inatomic functions themselves, not the filemap helper. Also, the rest of the kernel tree (mainly uaccess and futexes) should be converted ;) > Index: linux-2.6/mm/filemap.h > =================================================================== > --- linux-2.6.orig/mm/filemap.h 2006-10-14 20:20:20.000000000 +0200 > +++ linux-2.6/mm/filemap.h 2006-10-15 17:17:45.000000000 +0200 > @@ -21,6 +21,22 @@ __filemap_copy_from_user_iovec_inatomic( > size_t bytes); > > /* > + * By increasing the preempt_count we make sure the arch preempt > + * handler bails out early, before taking any locks, so that the copy > + * operation gets terminated early. > + */ > +pagefault_static inline void disable(void) > +{ > + inc_preempt_count(); > +} > + > +pagefault_static inline void enable(void) > +{ > + dec_preempt_count(); > + preempt_check_resched(); > +} Interesting prototype ;) -- SUSE Labs, Novell Inc. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
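A sketch of the placement Nick suggests, with the disable/enable folded into the inatomic copy helper itself rather than the mm/filemap.h wrappers. The outer wrapper name here is hypothetical; __copy_from_user_inatomic is the raw arch copy and returns the number of bytes left uncopied:

/* hypothetical uaccess.h-level wrapper, per Nick's suggestion */
static inline unsigned long
copy_from_user_inatomic(void *to, const void __user *from, unsigned long n)
{
        unsigned long left;

        pagefault_disable();
        left = __copy_from_user_inatomic(to, from, n);
        pagefault_enable();
        return left;
}

Callers such as the filemap helpers and the futex code would then get the pagefault suppression for free instead of open-coding it at every site.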
* Re: pagefault_disable (was Re: [patch 6/6] mm: fix pagecache write deadlocks) 2006-10-16 15:24 ` pagefault_disable (was Re: [patch 6/6] mm: fix pagecache write deadlocks) Nick Piggin @ 2006-10-16 16:05 ` Peter Zijlstra 2006-10-16 16:12 ` Nick Piggin 0 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2006-10-16 16:05 UTC (permalink / raw) To: Nick Piggin Cc: Linux Memory Management, Linux Kernel, Andrew Morton, David Howells, Ingo Molnar On Tue, 2006-10-17 at 01:24 +1000, Nick Piggin wrote: > (trimming cc list) > > Peter Zijlstra wrote: > > On Sun, 2006-10-15 at 17:57 +0200, Nick Piggin wrote: > > >>Hmm, but you may not be doing a copy*user within the kmap. And you may > >>want an atomic copy*user not within a kmap (maybe). > >> > >>I think it really would be more logical to do it in a wrapper function > >>pagefault_disable() pagefault_enable()? ;) > > > > > > I did that one first, but then noticed that most non trivial kmap_atomic > > implementations already did the inc_preempt_count()/dec_preempt_count() > > thing (except frv which did preempt_disable()/preempt_enable() ?) > > > > Anyway, here goes: > > > > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > > I think this is a good approach. The missed preempt checks could easily > have been causing scheduling delays because the usercopy can take up a > lot of kernel time. > > I don't know that the function should go in filemap.h... uaccess.h seems > more appropriate, and had thought the pagefault_disable() be calle > directly from within the copy_*_user_inatomic functions themselves, not > the filemap helper. > > Also, the rest of the kernel tree (mainly uaccess and futexes) should be > converted ;) Yeah, lotsa places to touch. > > Index: linux-2.6/mm/filemap.h > > =================================================================== > > --- linux-2.6.orig/mm/filemap.h 2006-10-14 20:20:20.000000000 +0200 > > +++ linux-2.6/mm/filemap.h 2006-10-15 17:17:45.000000000 +0200 > > @@ -21,6 +21,22 @@ __filemap_copy_from_user_iovec_inatomic( > > size_t bytes); > > > > /* > > + * By increasing the preempt_count we make sure the arch preempt > > + * handler bails out early, before taking any locks, so that the copy > > + * operation gets terminated early. > > + */ > > +pagefault_static inline void disable(void) > > +{ > > + inc_preempt_count(); I think we also need a barrier(); here. We need to make sure the preempt count is written to memory before we hit the fault handler. > > +} > > + > > +pagefault_static inline void enable(void) > > +{ > > + dec_preempt_count(); > > + preempt_check_resched(); > > +} > > Interesting prototype ;) Bah, sed magic gone awry ;-( -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: pagefault_disable (was Re: [patch 6/6] mm: fix pagecache write deadlocks) 2006-10-16 16:05 ` Peter Zijlstra @ 2006-10-16 16:12 ` Nick Piggin 0 siblings, 0 replies; 23+ messages in thread From: Nick Piggin @ 2006-10-16 16:12 UTC (permalink / raw) To: Peter Zijlstra Cc: Linux Memory Management, Linux Kernel, Andrew Morton, David Howells, Ingo Molnar Peter Zijlstra wrote: > On Tue, 2006-10-17 at 01:24 +1000, Nick Piggin wrote: >>Also, the rest of the kernel tree (mainly uaccess and futexes) should be >>converted ;) > > > Yeah, lotsa places to touch. > > >>>Index: linux-2.6/mm/filemap.h >>>=================================================================== >>>--- linux-2.6.orig/mm/filemap.h 2006-10-14 20:20:20.000000000 +0200 >>>+++ linux-2.6/mm/filemap.h 2006-10-15 17:17:45.000000000 +0200 >>>@@ -21,6 +21,22 @@ __filemap_copy_from_user_iovec_inatomic( >>> size_t bytes); >>> >>> /* >>>+ * By increasing the preempt_count we make sure the arch preempt >>>+ * handler bails out early, before taking any locks, so that the copy >>>+ * operation gets terminated early. >>>+ */ >>>+pagefault_static inline void disable(void) >>>+{ >>>+ inc_preempt_count(); > > > I think we also need a barrier(); here. We need to make sure the preempt > count is written to memory before we hit the fault handler. It will come from this thread, but I guess the fault is not an event the compiler can forsee, so indeed it might optimise this into the wrong place. Perhaps not with any copy*user implementation we have, but at least in theory... >>>+pagefault_static inline void enable(void) >>>+{ >>>+ dec_preempt_count(); >>>+ preempt_check_resched(); >>>+} You'll want barriers before and after the dec_preempt_count, for similar reasons. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
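Folding the corrected names and the barrier points raised above into one place gives roughly the following (a sketch of where the discussion appears to be heading, living in a generic uaccess header rather than mm/filemap.h):

static inline void pagefault_disable(void)
{
        inc_preempt_count();
        /*
         * Make sure the raised count is visible before any usercopy
         * that might fault.
         */
        barrier();
}

static inline void pagefault_enable(void)
{
        /*
         * Keep the usercopy inside the protected region ...
         */
        barrier();
        dec_preempt_count();
        /*
         * ... and make sure the count is lowered before rescheduling.
         */
        barrier();
        preempt_check_resched();
}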
* Re: [patch 6/6] mm: fix pagecache write deadlocks 2006-10-13 16:44 ` [patch 6/6] mm: fix pagecache " Nick Piggin, Andrew Morton ` (2 preceding siblings ...) 2006-10-15 11:37 ` Peter Zijlstra @ 2006-10-18 14:25 ` Chris Mason 3 siblings, 0 replies; 23+ messages in thread From: Chris Mason @ 2006-10-18 14:25 UTC (permalink / raw) To: Nick Piggin Cc: Linux Memory Management, Neil Brown, Andrew Morton, Anton Altaparmakov, Linux Kernel > Index: linux-2.6/fs/buffer.c > =================================================================== > --- linux-2.6.orig/fs/buffer.c > +++ linux-2.6/fs/buffer.c > @@ -1856,6 +1856,9 @@ static int __block_commit_write(struct i > unsigned blocksize; > struct buffer_head *bh, *head; > > + if (from == to) > + return 0; > + > blocksize = 1 << inode->i_blkbits; reiserfs v3 copied the __block_commit_write logic for checking for a partially updated page, so reiserfs_commit_page will have to be updated to handle from==to. Right now it will set the page up to date. I also used a prepare/commit pair where from==to as a way to trigger tail conversions in the lilo ioctl. I'll fix both for you and make a patch. -chris -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
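The reiserfs side of the fix Chris describes would presumably mirror the quoted fs/buffer.c hunk. A sketch follows; the signature is modeled on 2.6-era fs/reiserfs/inode.c, and the guard itself is hypothetical until Chris' patch appears:

static int reiserfs_commit_page(struct inode *inode, struct page *page,
                                unsigned from, unsigned to)
{
        if (from == to)         /* aborted copy: nothing was written, */
                return 0;       /* so don't mark buffers/page up to date */

        /* ... existing per-buffer commit logic ... */
}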
end of thread. Thread overview: 23+ messages: 2006-10-13 16:43 [rfc] buffered write deadlock fix Nick Piggin 2006-10-13 16:44 ` [patch 1/6] mm: revert "generic_file_buffered_write(): handle zero length iovec segments" Nick Piggin, Andrew Morton 2006-10-13 16:44 ` [patch 2/6] mm: revert "generic_file_buffered_write(): deadlock on vectored write" Nick Piggin, Andrew Morton 2006-10-13 16:44 ` [patch 3/6] mm: generic_file_buffered_write cleanup Nick Piggin, Andrew Morton 2006-10-13 16:44 ` [patch 4/6] mm: comment mmap_sem / lock_page lockorder Nick Piggin 2006-10-13 16:44 ` [patch 5/6] mm: debug write deadlocks Nick Piggin 2006-10-13 16:44 ` [patch 6/6] mm: fix pagecache " Nick Piggin, Andrew Morton 2006-10-13 22:14 ` Andrew Morton 2006-10-14 4:19 ` Nick Piggin 2006-10-14 4:30 ` Nick Piggin 2006-10-15 11:35 ` Peter Zijlstra 2006-10-14 5:04 ` Nick Piggin 2006-10-15 11:37 ` Peter Zijlstra 2006-10-15 11:56 ` Nick Piggin 2006-10-15 13:51 ` Peter Zijlstra 2006-10-15 14:19 ` SPAM: " Nick Piggin 2006-10-15 15:47 ` Peter Zijlstra 2006-10-15 15:57 ` RRe: " Nick Piggin 2006-10-15 16:13 ` Peter Zijlstra 2006-10-16 15:24 ` pagefault_disable (was Re: [patch 6/6] mm: fix pagecache write deadlocks) Nick Piggin 2006-10-16 16:05 ` Peter Zijlstra 2006-10-16 16:12 ` Nick Piggin 2006-10-18 14:25 ` [patch 6/6] mm: fix pagecache write deadlocks Chris Mason