* get_user_pages() with write=1 and force=1 gets read-only pages. @ 2005-07-30 20:53 Robin Holt 2005-07-30 22:13 ` Hugh Dickins 0 siblings, 1 reply; 72+ messages in thread
From: Robin Holt @ 2005-07-30 20:53 UTC (permalink / raw)
To: linux-mm

I am chasing a bug which I think I understand, but would like some confirmation.

I believe I have two processes calling get_user_pages at approximately the same time. One is calling with write=0, the other with write=1 and force=1. The vma has vm_ops->nopage set to filemap_nopage.

Both faulters get to the point in do_no_page of being ready to insert the pte. The first one to get the mm->page_table_lock must be the reader. The readable pte gets inserted and results in the writer detecting the pte and returning VM_FAULT_MINOR.

Upon return, the writer then does 'lookup_write = write && !force;' and then calls follow_page without having the write flag set.

Am I on the right track with this? Is the correct fix to not just pass in the write flag untouched?

I believe the change was made by Roland McGrath, but I don't see an email address for him.

Thanks, Robin Holt

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org

^ permalink raw reply [flat|nested] 72+ messages in thread
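The downgrade Robin describes can be sketched as a userspace model (illustrative only, not kernel code; the struct and function names here are invented): after the racing read fault returns VM_FAULT_MINOR, get_user_pages recomputes lookup_write = write && !force, so a *forced* write lookup no longer insists on a writable pte.

```c
#include <stdbool.h>

/* Hypothetical model of a pte as far as this lookup cares. */
struct pte_model { bool present; bool writable; };

/* follow_page analogue: succeeds if the pte satisfies the lookup. */
bool follow_page_model(struct pte_model p, bool need_write)
{
    return p.present && (!need_write || p.writable);
}

/* The post-fault lookup described in the report. */
bool gup_accepts(struct pte_model pte_after_fault, bool write, bool force)
{
    bool lookup_write = write && !force;    /* the downgrade in question */
    return follow_page_model(pte_after_fault, lookup_write);
}

/* Convenience wrapper taking plain ints, for quick checks. */
int gup_accepts_vals(int present, int writable, int write, int force)
{
    struct pte_model p = { present != 0, writable != 0 };
    return gup_accepts(p, write != 0, force != 0);
}
```

With force=1 the recomputed lookup accepts the read-only pte the racing reader installed, which is exactly the reported bug; with force=0 the same pte is rejected and the fault retried.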
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-30 20:53 get_user_pages() with write=1 and force=1 gets read-only pages Robin Holt @ 2005-07-30 22:13 ` Hugh Dickins 2005-07-31 1:52 ` Nick Piggin 0 siblings, 1 reply; 72+ messages in thread From: Hugh Dickins @ 2005-07-30 22:13 UTC (permalink / raw) To: Robin Holt; +Cc: Roland McGrath, linux-mm On Sat, 30 Jul 2005, Robin Holt wrote: > I am chasing a bug which I think I understand, but would like some > confirmation. > > I believe I have two processes calling get_user_pages at approximately > the same time. One is calling with write=0. The other with write=1 > and force=1. The vma has the vm_ops->nopage set to filemap_nopage. > > Both faulters get to the point in do_no_page of being ready to insert > the pte. The first one to get the mm->page_table_lock must be the reader. > The readable pte gets inserted and results in the writer detecting the > pte and returning VM_FAULT_MINOR. > > Upon return, the writer the does 'lookup_write = write && !force;' > and then calls follow_page without having the write flag set. > > Am I on the right track with this? I do believe you are. Twice I've inserted fault code to cope with that "surely no longer have a shared page we shouldn't write" assumption, but I think you've just demonstrated that it's inherently unsafe. Certainly goes against the traditional grain of fault handlers, which can just try again when in doubt - as in the pte_same checks you've observed. > Is the correct fix to not just pass in the write flag untouched? I don't understand you there. Suspect you're confusing me with that "not", which perhaps expresses hesitancy, but shouldn't be there? But the correct fix would not be to pass in the write flag untouched: it's trying to avoid an endless loop of finding the pte not writable when ptrace is modifying a page which the user is currently protected against writing to (setting a breakpoint in readonly text, perhaps?). 
get_user_pages is hard! I don't know the right answer offhand, but thank you for posing a good question.

> I believe the change was made by Roland McGrath, but I don't see an
> email address for him.

I've CC'ed roland@redhat.com

Hugh
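Hugh's endless-loop concern can be made concrete with a bounded userspace sketch (hypothetical names; the real loop has no iteration cap, which is the point): if get_user_pages simply kept demanding a writable pte, then for a page the user is protected against writing, the fault handler can never satisfy it.

```c
#include <stdbool.h>

/* Model: each iteration runs the fault handler, then retries the
 * lookup.  On a read-only vma (ptrace poking read-only text) the pte
 * never becomes writable.  Returns the iteration count on success,
 * or -1 where the real, uncapped loop would spin forever. */
int lookup_write_iterations(bool pte_can_become_writable, int max_tries)
{
    for (int i = 1; i <= max_tries; i++) {
        /* fault handler runs here; a read-only vma stays read-only */
        if (pte_can_become_writable)
            return i;       /* follow_page(write=1) now succeeds */
    }
    return -1;              /* would livelock in the real code */
}
```

This is why the write flag cannot simply be passed through untouched: the force case exists precisely so the loop can terminate on mappings the user cannot write.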
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-30 22:13 ` Hugh Dickins @ 2005-07-31 1:52 ` Nick Piggin 2005-07-31 10:52 ` Robin Holt 0 siblings, 1 reply; 72+ messages in thread From: Nick Piggin @ 2005-07-31 1:52 UTC (permalink / raw) To: Hugh Dickins; +Cc: Robin Holt, Roland McGrath, linux-mm [-- Attachment #1: Type: text/plain, Size: 261 bytes --] Hugh Dickins wrote: > get_user_pages is hard! I don't know the right answer offhand, > but thank you for posing a good question. > Detect the racing fault perhaps, and retry until we're sure that a write fault has gone through? -- SUSE Labs, Novell Inc. [-- Attachment #2: mm-gup-fix.patch --] [-- Type: text/plain, Size: 3774 bytes --] Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h 2005-07-28 19:04:34.000000000 +1000 +++ linux-2.6/include/linux/mm.h 2005-07-31 11:40:24.000000000 +1000 @@ -625,6 +625,7 @@ static inline int page_mapped(struct pag * Used to decide whether a process gets delivered SIGBUS or * just gets major/minor fault counters bumped up. */ +#define VM_FAULT_RACE (-2) #define VM_FAULT_OOM (-1) #define VM_FAULT_SIGBUS 0 #define VM_FAULT_MINOR 1 Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c 2005-07-28 19:04:37.000000000 +1000 +++ linux-2.6/mm/memory.c 2005-07-31 11:49:35.000000000 +1000 @@ -964,6 +964,14 @@ int get_user_pages(struct task_struct *t return i ? i : -EFAULT; case VM_FAULT_OOM: return i ? i : -ENOMEM; + case VM_FAULT_RACE: + /* + * Someone else got there first. + * Must retry before we can assume + * that we have actually performed + * the write fault (below). 
+ */ + continue; default: BUG(); } @@ -1224,6 +1232,7 @@ static int do_wp_page(struct mm_struct * struct page *old_page, *new_page; unsigned long pfn = pte_pfn(pte); pte_t entry; + int ret; if (unlikely(!pfn_valid(pfn))) { /* @@ -1280,7 +1289,9 @@ static int do_wp_page(struct mm_struct * */ spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); + ret = VM_FAULT_RACE; if (likely(pte_same(*page_table, pte))) { + ret = VM_FAULT_MINOR; if (PageAnon(old_page)) dec_mm_counter(mm, anon_rss); if (PageReserved(old_page)) @@ -1299,7 +1310,7 @@ static int do_wp_page(struct mm_struct * page_cache_release(new_page); page_cache_release(old_page); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return ret; no_new_page: page_cache_release(old_page); @@ -1654,7 +1665,7 @@ static int do_swap_page(struct mm_struct if (likely(pte_same(*page_table, orig_pte))) ret = VM_FAULT_OOM; else - ret = VM_FAULT_MINOR; + ret = VM_FAULT_RACE; pte_unmap(page_table); spin_unlock(&mm->page_table_lock); goto out; @@ -1676,7 +1687,7 @@ static int do_swap_page(struct mm_struct spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); if (unlikely(!pte_same(*page_table, orig_pte))) { - ret = VM_FAULT_MINOR; + ret = VM_FAULT_RACE; goto out_nomap; } @@ -1737,6 +1748,7 @@ do_anonymous_page(struct mm_struct *mm, { pte_t entry; struct page * page = ZERO_PAGE(addr); + int ret = VM_FAULT_MINOR; /* Read-only mapping of ZERO_PAGE. 
*/ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); @@ -1760,6 +1772,7 @@ do_anonymous_page(struct mm_struct *mm, pte_unmap(page_table); page_cache_release(page); spin_unlock(&mm->page_table_lock); + ret = VM_FAULT_RACE; goto out; } inc_mm_counter(mm, rss); @@ -1779,7 +1792,7 @@ do_anonymous_page(struct mm_struct *mm, lazy_mmu_prot_update(entry); spin_unlock(&mm->page_table_lock); out: - return VM_FAULT_MINOR; + return ret; no_mem: return VM_FAULT_OOM; } @@ -1897,6 +1910,7 @@ retry: pte_unmap(page_table); page_cache_release(new_page); spin_unlock(&mm->page_table_lock); + ret = VM_FAULT_RACE; goto out; } Index: linux-2.6/arch/i386/mm/fault.c =================================================================== --- linux-2.6.orig/arch/i386/mm/fault.c 2005-07-28 19:03:48.000000000 +1000 +++ linux-2.6/arch/i386/mm/fault.c 2005-07-31 11:47:48.000000000 +1000 @@ -351,6 +351,8 @@ good_area: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } ^ permalink raw reply [flat|nested] 72+ messages in thread
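A userspace model of the retry the patch introduces (assumed names, not the kernel's actual types): handle_mm_fault reports VM_FAULT_RACE when the pte_same check fails under the page_table_lock, meaning someone else installed a pte first, and get_user_pages loops and faults again rather than assuming its own write fault completed.

```c
/* Mirror of the new return-code space from the patch. */
enum vmf { VMF_RACE = -2, VMF_OOM = -1, VMF_SIGBUS = 0, VMF_MINOR = 1, VMF_MAJOR = 2 };

/* Fault model: the first 'races_left' attempts lose the pte_same
 * check and report VMF_RACE; after that the fault completes. */
enum vmf fault_model(int *races_left)
{
    if (*races_left > 0) {
        (*races_left)--;
        return VMF_RACE;
    }
    return VMF_MINOR;
}

/* gup loop model: how many faults until one actually completed. */
int gup_faults_until_done(int races)
{
    int attempts = 0;
    enum vmf r;
    do {
        attempts++;
        r = fault_model(&races);
    } while (r == VMF_RACE);    /* the 'continue' in the patch */
    return attempts;
}
```

Because each racing fault makes forward progress for *someone*, the retry sequence is finite even though this caller may iterate more than once.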
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-31 1:52 ` Nick Piggin @ 2005-07-31 10:52 ` Robin Holt 2005-07-31 11:07 ` Nick Piggin 0 siblings, 1 reply; 72+ messages in thread
From: Robin Holt @ 2005-07-31 10:52 UTC (permalink / raw)
To: Nick Piggin; +Cc: Hugh Dickins, Robin Holt, Roland McGrath, linux-mm

Should there be a check to ensure we don't return VM_FAULT_RACE when the pte which was inserted is exactly the same one we would have inserted? Could we generalize that more to the point of only returning VM_FAULT_RACE when write access was requested but the racing pte was not writable?

Most of the test cases I have thrown at this have had the writer faulting first, which did not result in problems. I would hate to slow things down if not necessary. I am unaware of more issues than the one I have been tripping over.

Thanks, Robin Holt
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-31 10:52 ` Robin Holt @ 2005-07-31 11:07 ` Nick Piggin 2005-07-31 11:30 ` Robin Holt 0 siblings, 1 reply; 72+ messages in thread From: Nick Piggin @ 2005-07-31 11:07 UTC (permalink / raw) To: Robin Holt; +Cc: Hugh Dickins, Roland McGrath, linux-mm Robin Holt wrote: > Should there be a check to ensure we don't return VM_FAULT_RACE when the > pte which was inserted is exactly the same one we would have inserted? That would slow down the do_xxx_fault fastpaths, though. Considering VM_FAULT_RACE will only make any difference to get_user_pages (ie. not the page fault fastpath), and only then in rare cases of a racing fault on the same pte, I don't think the extra test would be worthwhile. > Could we generalize that more to the point of only returning VM_FAULT_RACE > when write access was requested but the racing pte was not writable? > I guess get_user_pages could be changed to retry on VM_FAULT_RACE only if it is attempting write access... is that worthwhile? I guess so... > Most of the test cases I have thrown at this have gotten the writer > faulting first which did not result in problems. I would hate to slow > things down if not necessary. I am unaware of more issues than the one > I have been tripping. > I think the VM_FAULT_RACE patch as-is should be fairly unintrusive to the page fault fastpaths. I think weighing down get_user_pages is preferable to putting logic in the general fault path - though I don't think there should be too much overhead introduced even there... Do you think the patch (or at least, the idea) looks like a likely solution to your problem? Obviously the !i386 architecture specific parts still need to be filled in... Thanks, Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-31 11:07 ` Nick Piggin @ 2005-07-31 11:30 ` Robin Holt 2005-07-31 11:39 ` Robin Holt 2005-07-31 12:09 ` Robin Holt 0 siblings, 2 replies; 72+ messages in thread From: Robin Holt @ 2005-07-31 11:30 UTC (permalink / raw) To: Nick Piggin; +Cc: Robin Holt, Hugh Dickins, Roland McGrath, linux-mm On Sun, Jul 31, 2005 at 09:07:24PM +1000, Nick Piggin wrote: > Robin Holt wrote: > >Should there be a check to ensure we don't return VM_FAULT_RACE when the > >pte which was inserted is exactly the same one we would have inserted? > > That would slow down the do_xxx_fault fastpaths, though. > > Considering VM_FAULT_RACE will only make any difference to get_user_pages > (ie. not the page fault fastpath), and only then in rare cases of a racing > fault on the same pte, I don't think the extra test would be worthwhile. > > >Could we generalize that more to the point of only returning VM_FAULT_RACE > >when write access was requested but the racing pte was not writable? > > > > I guess get_user_pages could be changed to retry on VM_FAULT_RACE only if > it is attempting write access... is that worthwhile? I guess so... > > >Most of the test cases I have thrown at this have gotten the writer > >faulting first which did not result in problems. I would hate to slow > >things down if not necessary. I am unaware of more issues than the one > >I have been tripping. > > > > I think the VM_FAULT_RACE patch as-is should be fairly unintrusive to the > page fault fastpaths. I think weighing down get_user_pages is preferable to > putting logic in the general fault path - though I don't think there should > be too much overhead introduced even there... > > Do you think the patch (or at least, the idea) looks like a likely solution > to your problem? Obviously the !i386 architecture specific parts still need > to be filled in... The patch works for me. 
What I was thinking didn't seem that much heavier than what is already being done. I guess a patch against your patch might be a better illustration: This is on top of your patch: Index: linux/mm/memory.c =================================================================== --- linux.orig/mm/memory.c 2005-07-31 05:39:24.161826311 -0500 +++ linux/mm/memory.c 2005-07-31 06:26:33.687274327 -0500 @@ -1768,17 +1768,17 @@ do_anonymous_page(struct mm_struct *mm, spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, addr); + entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, + vma->vm_page_prot)), vma); if (!pte_none(*page_table)) { + if (!pte_same(*page_table, entry)) + ret = VM_FAULT_RACE; pte_unmap(page_table); page_cache_release(page); spin_unlock(&mm->page_table_lock); - ret = VM_FAULT_RACE; goto out; } inc_mm_counter(mm, rss); - entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, - vma->vm_page_prot)), - vma); lru_cache_add_active(page); SetPageReferenced(page); page_add_anon_rmap(page, vma, addr); @@ -1879,6 +1879,10 @@ retry: } page_table = pte_offset_map(pmd, address); + entry = mk_pte(new_page, vma->vm_page_prot); + if (write_access) + entry = maybe_mkwrite(pte_mkdirty(entry), vma); + /* * This silly early PAGE_DIRTY setting removes a race * due to the bad i386 page protection. But it's valid @@ -1895,9 +1899,6 @@ retry: inc_mm_counter(mm, rss); flush_icache_page(vma, new_page); - entry = mk_pte(new_page, vma->vm_page_prot); - if (write_access) - entry = maybe_mkwrite(pte_mkdirty(entry), vma); set_pte_at(mm, address, page_table, entry); if (anon) { lru_cache_add_active(new_page); @@ -1906,11 +1907,12 @@ retry: page_add_file_rmap(new_page); pte_unmap(page_table); } else { + if (!pte_same(*page_table, entry)) + ret=VM_FAULT_RACE; /* One of our sibling threads was faster, back out. 
*/ pte_unmap(page_table); page_cache_release(new_page); spin_unlock(&mm->page_table_lock); - ret = VM_FAULT_RACE; goto out; }

In both cases, we have read the value from the pte immediately before this, so all the processor infrastructure is already in place and the second read should be extremely quick. In truth, the compiler should eliminate the second load, but I cannot guarantee that.

What do you think?

Robin
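The refinement Robin is proposing — compute the would-be pte first, and only report a race when the entry that beat us actually differs from ours — can be modeled in userspace like this (hypothetical types and names):

```c
#include <stdbool.h>

/* Minimal stand-in for the fields pte_same would compare. */
struct pte_model { unsigned long pfn; bool writable; };

bool pte_same_model(struct pte_model a, struct pte_model b)
{
    return a.pfn == b.pfn && a.writable == b.writable;
}

/* Returns 1 for VM_FAULT_MINOR, -2 for VM_FAULT_RACE: if a sibling
 * installed exactly the entry we would have, the fault effectively
 * completed and no retry is needed; a differing entry (e.g. a
 * read-only one) forces the caller to retry. */
int backout_result(struct pte_model installed_by_racer, struct pte_model ours)
{
    if (pte_same_model(installed_by_racer, ours))
        return 1;
    return -2;
}

/* Convenience wrapper taking plain ints, for quick checks. */
int backout_result_vals(unsigned long racer_pfn, int racer_w,
                        unsigned long our_pfn, int our_w)
{
    struct pte_model a = { racer_pfn, racer_w != 0 };
    struct pte_model b = { our_pfn, our_w != 0 };
    return backout_result(a, b);
}
```

The design trade-off under discussion is whether this extra comparison in the fault backout path pays for the retries it avoids in get_user_pages.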
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-31 11:30 ` Robin Holt @ 2005-07-31 11:39 ` Robin Holt 2005-07-31 12:09 ` Robin Holt 1 sibling, 0 replies; 72+ messages in thread From: Robin Holt @ 2005-07-31 11:39 UTC (permalink / raw) To: Robin Holt; +Cc: Nick Piggin, Hugh Dickins, Roland McGrath, linux-mm After actually thinking, I realize the do_anonymous_page path changes are pointless. Please ignore those. Thanks, Robin > Index: linux/mm/memory.c > =================================================================== > --- linux.orig/mm/memory.c 2005-07-31 05:39:24.161826311 -0500 > +++ linux/mm/memory.c 2005-07-31 06:26:33.687274327 -0500 > @@ -1768,17 +1768,17 @@ do_anonymous_page(struct mm_struct *mm, > spin_lock(&mm->page_table_lock); > page_table = pte_offset_map(pmd, addr); > > + entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, > + vma->vm_page_prot)), vma); > if (!pte_none(*page_table)) { > + if (!pte_same(*page_table, entry)) > + ret = VM_FAULT_RACE; > pte_unmap(page_table); > page_cache_release(page); > spin_unlock(&mm->page_table_lock); > - ret = VM_FAULT_RACE; > goto out; > } > inc_mm_counter(mm, rss); > - entry = maybe_mkwrite(pte_mkdirty(mk_pte(page, > - vma->vm_page_prot)), > - vma); > lru_cache_add_active(page); > SetPageReferenced(page); > page_add_anon_rmap(page, vma, addr); > @@ -1879,6 +1879,10 @@ retry: > } > page_table = pte_offset_map(pmd, address); > > + entry = mk_pte(new_page, vma->vm_page_prot); > + if (write_access) > + entry = maybe_mkwrite(pte_mkdirty(entry), vma); > + > /* > * This silly early PAGE_DIRTY setting removes a race > * due to the bad i386 page protection. But it's valid -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-31 11:30 ` Robin Holt 2005-07-31 11:39 ` Robin Holt @ 2005-07-31 12:09 ` Robin Holt 2005-07-31 22:27 ` Nick Piggin 1 sibling, 1 reply; 72+ messages in thread
From: Robin Holt @ 2005-07-31 12:09 UTC (permalink / raw)
To: Robin Holt; +Cc: Nick Piggin, Hugh Dickins, Roland McGrath, linux-mm

Just for good measure, I added some counters to do_no_page and tweaked my earlier patch a little. I made the check look more like:

	atomic64_inc(&do_no_page_collisions);
	if (write_access && !pte_write(*page_table)) {
		ret = VM_FAULT_RACE;
		atomic64_inc(&do_no_page_asked_write_got_read);
	}

After running the system for a while, I looked at the counters: of the 1162 collisions I had, write_got_read only incremented 4 times.

I am running a customer test program. I don't really know what it does, but I believe it is dealing with a large preinitialized data block; they DIO-write data from a file into various areas of the block at the same time as other threads read through the pages looking for work to do. During a normal run, without their application, I have not seen do_no_page_asked_write_got_read increment in more than 1/2 hour of run time.

So far, I think the case for setting VM_FAULT_RACE only when there is a conflict over writability seems strong.

Thanks, Robin
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-31 12:09 ` Robin Holt @ 2005-07-31 22:27 ` Nick Piggin 2005-08-01 3:22 ` Roland McGrath 0 siblings, 1 reply; 72+ messages in thread
From: Nick Piggin @ 2005-07-31 22:27 UTC (permalink / raw)
To: Robin Holt; +Cc: Hugh Dickins, Roland McGrath, linux-mm

[-- Attachment #1: Type: text/plain, Size: 430 bytes --]

Robin Holt wrote:
> So far, I think the case for setting VM_FAULT_RACE only when there
> is a conflict for writable seems strong.

But that's 1162 collisions out of how many faults? And only on the get_user_pages faults does the return value really make a difference.

How about the following? This should catch some cases. You still miss the case where you're racing with another writer though.

-- SUSE Labs, Novell Inc.

[-- Attachment #2: mm-gup-write-fix.patch --]
[-- Type: text/plain, Size: 469 bytes --]

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2005-07-31 11:49:35.000000000 +1000
+++ linux-2.6/mm/memory.c	2005-08-01 08:21:21.000000000 +1000
@@ -971,7 +971,9 @@ int get_user_pages(struct task_struct *t
 				 * that we have actually performed
 				 * the write fault (below).
 				 */
-				continue;
+				if (write)
+					continue;
+				break;
 			default:
 				BUG();
 			}
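The effect of the two-liner — on VM_FAULT_RACE, retry only when write access was wanted — can be sketched as a userspace model (made-up names; the return values -2 and 1 stand in for VM_FAULT_RACE and VM_FAULT_MINOR):

```c
/* Model of the gup loop after the tweak: a racing *read* fault that
 * installed some valid pte is good enough for a read lookup, but a
 * write lookup must fault again until its own fault went through.
 * 'racing_faults' is the number of attempts that lose the race. */
int gup_attempts(int write, int racing_faults)
{
    int attempts = 0;
    for (;;) {
        attempts++;
        /* each racing fault makes handle_mm_fault report VM_FAULT_RACE */
        int r = (racing_faults > 0) ? -2 : 1;
        if (racing_faults > 0)
            racing_faults--;
        if (r == -2 && write)
            continue;   /* must be sure our write fault completed */
        break;          /* a reader can use whatever pte the racer installed */
    }
    return attempts;
}
```

This keeps the extra iterations confined to the rare write-vs-racer case, which is the only one where the stale pte is actually dangerous.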
* Re: get_user_pages() with write=1 and force=1 gets read-only pages. 2005-07-31 22:27 ` Nick Piggin @ 2005-08-01 3:22 ` Roland McGrath 2005-08-01 8:21 ` [patch 2.6.13-rc4] fix get_user_pages bug Nick Piggin 0 siblings, 1 reply; 72+ messages in thread
From: Roland McGrath @ 2005-08-01 3:22 UTC (permalink / raw)
To: Nick Piggin; +Cc: Robin Holt, Hugh Dickins, linux-mm

The basic style of this fix seems appropriate to me. I really don't think it matters if get_user_pages does extra iterations of the lookup or fault path in the race situations. The unnecessary ones will always be short anyway.

Thanks, Roland
* [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 3:22 ` Roland McGrath @ 2005-08-01 8:21 ` Nick Piggin 2005-08-01 9:19 ` Ingo Molnar ` (2 more replies) 0 siblings, 3 replies; 72+ messages in thread From: Nick Piggin @ 2005-08-01 8:21 UTC (permalink / raw) To: Robin Holt, Andrew Morton, Linus Torvalds Cc: Roland McGrath, Hugh Dickins, linux-mm, linux-kernel [-- Attachment #1: Type: text/plain, Size: 500 bytes --] Hi, Not sure if this should be fixed for 2.6.13. It can result in pagecache corruption: so I guess that answers my own question. This was tested by Robin and appears to solve the problem. Roland had a quick look and thought the basic idea was sound. I'd like to get a couple more acks before going forward, and in particular Robin was contemplating possible efficiency improvements (although efficiency can wait on correctness). Feedback please, anyone. Thanks, Nick -- SUSE Labs, Novell Inc. [-- Attachment #2: mm-gup-fix.patch --] [-- Type: text/plain, Size: 8794 bytes --] When get_user_pages for write access races with another process in the page fault handler that has established the pte for read access, handle_mm_fault in get_user_pages will return VM_FAULT_MINOR even if it hasn't made the page correctly writeable (for example, broken COW). Thus the assumption that get_user_pages has a writeable page at the mapping after handle_mm_fault returns is incorrect. Fix this by reporting a raced (uncompleted) fault and retrying the lookup and fault in get_user_pages before making the assumption that we have a writeable page. Great work by Robin Holt <holt@sgi.com> to debug the problem. 
Signed-off-by: Nick Piggin <npiggin@suse.de> Index: linux-2.6/arch/i386/mm/fault.c =================================================================== --- linux-2.6.orig/arch/i386/mm/fault.c +++ linux-2.6/arch/i386/mm/fault.c @@ -351,6 +351,8 @@ good_area: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h +++ linux-2.6/include/linux/mm.h @@ -625,6 +625,7 @@ static inline int page_mapped(struct pag * Used to decide whether a process gets delivered SIGBUS or * just gets major/minor fault counters bumped up. */ +#define VM_FAULT_RACE (-2) #define VM_FAULT_OOM (-1) #define VM_FAULT_SIGBUS 0 #define VM_FAULT_MINOR 1 Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -969,6 +969,16 @@ int get_user_pages(struct task_struct *t return i ? i : -EFAULT; case VM_FAULT_OOM: return i ? i : -ENOMEM; + case VM_FAULT_RACE: + /* + * Someone else got there first. + * Must retry before we can assume + * that we have actually performed + * the write fault (below). 
+ */ + if (write) + continue; + break; default: BUG(); } @@ -1229,6 +1239,7 @@ static int do_wp_page(struct mm_struct * struct page *old_page, *new_page; unsigned long pfn = pte_pfn(pte); pte_t entry; + int ret; if (unlikely(!pfn_valid(pfn))) { /* @@ -1285,7 +1296,9 @@ static int do_wp_page(struct mm_struct * */ spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); + ret = VM_FAULT_RACE; if (likely(pte_same(*page_table, pte))) { + ret = VM_FAULT_MINOR; if (PageAnon(old_page)) dec_mm_counter(mm, anon_rss); if (PageReserved(old_page)) @@ -1304,7 +1317,7 @@ static int do_wp_page(struct mm_struct * page_cache_release(new_page); page_cache_release(old_page); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return ret; no_new_page: page_cache_release(old_page); @@ -1659,7 +1672,7 @@ static int do_swap_page(struct mm_struct if (likely(pte_same(*page_table, orig_pte))) ret = VM_FAULT_OOM; else - ret = VM_FAULT_MINOR; + ret = VM_FAULT_RACE; pte_unmap(page_table); spin_unlock(&mm->page_table_lock); goto out; @@ -1681,7 +1694,7 @@ static int do_swap_page(struct mm_struct spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); if (unlikely(!pte_same(*page_table, orig_pte))) { - ret = VM_FAULT_MINOR; + ret = VM_FAULT_RACE; goto out_nomap; } @@ -1742,6 +1755,7 @@ do_anonymous_page(struct mm_struct *mm, { pte_t entry; struct page * page = ZERO_PAGE(addr); + int ret = VM_FAULT_MINOR; /* Read-only mapping of ZERO_PAGE. 
*/ entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot)); @@ -1765,6 +1779,7 @@ do_anonymous_page(struct mm_struct *mm, pte_unmap(page_table); page_cache_release(page); spin_unlock(&mm->page_table_lock); + ret = VM_FAULT_RACE; goto out; } inc_mm_counter(mm, rss); @@ -1784,7 +1799,7 @@ do_anonymous_page(struct mm_struct *mm, lazy_mmu_prot_update(entry); spin_unlock(&mm->page_table_lock); out: - return VM_FAULT_MINOR; + return ret; no_mem: return VM_FAULT_OOM; } @@ -1902,6 +1917,7 @@ retry: pte_unmap(page_table); page_cache_release(new_page); spin_unlock(&mm->page_table_lock); + ret = VM_FAULT_RACE; goto out; } Index: linux-2.6/arch/alpha/mm/fault.c =================================================================== --- linux-2.6.orig/arch/alpha/mm/fault.c +++ linux-2.6/arch/alpha/mm/fault.c @@ -162,6 +162,8 @@ do_page_fault(unsigned long address, uns goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/arm/mm/fault.c =================================================================== --- linux-2.6.orig/arch/arm/mm/fault.c +++ linux-2.6/arch/arm/mm/fault.c @@ -195,6 +195,7 @@ survive: case VM_FAULT_MINOR: tsk->min_flt++; case VM_FAULT_SIGBUS: + case VM_FAULT_RACE: return fault; } Index: linux-2.6/arch/ia64/mm/fault.c =================================================================== --- linux-2.6.orig/arch/ia64/mm/fault.c +++ linux-2.6/arch/ia64/mm/fault.c @@ -164,6 +164,8 @@ ia64_do_page_fault (unsigned long addres goto bad_area; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/m32r/mm/fault.c =================================================================== --- linux-2.6.orig/arch/m32r/mm/fault.c +++ linux-2.6/arch/m32r/mm/fault.c @@ -234,6 +234,8 @@ survive: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/mips/mm/fault.c 
=================================================================== --- linux-2.6.orig/arch/mips/mm/fault.c +++ linux-2.6/arch/mips/mm/fault.c @@ -109,6 +109,8 @@ survive: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/ppc/mm/fault.c =================================================================== --- linux-2.6.orig/arch/ppc/mm/fault.c +++ linux-2.6/arch/ppc/mm/fault.c @@ -259,6 +259,8 @@ good_area: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/ppc64/mm/fault.c =================================================================== --- linux-2.6.orig/arch/ppc64/mm/fault.c +++ linux-2.6/arch/ppc64/mm/fault.c @@ -234,6 +234,8 @@ good_area: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/s390/mm/fault.c =================================================================== --- linux-2.6.orig/arch/s390/mm/fault.c +++ linux-2.6/arch/s390/mm/fault.c @@ -260,6 +260,8 @@ survive: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/sh/mm/fault.c =================================================================== --- linux-2.6.orig/arch/sh/mm/fault.c +++ linux-2.6/arch/sh/mm/fault.c @@ -101,6 +101,8 @@ survive: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/sparc/mm/fault.c =================================================================== --- linux-2.6.orig/arch/sparc/mm/fault.c +++ linux-2.6/arch/sparc/mm/fault.c @@ -302,8 +302,8 @@ good_area: current->maj_flt++; break; case VM_FAULT_MINOR: - default: current->min_flt++; + case VM_FAULT_RACE: break; } up_read(&mm->mmap_sem); Index: linux-2.6/arch/sparc64/mm/fault.c =================================================================== --- 
linux-2.6.orig/arch/sparc64/mm/fault.c +++ linux-2.6/arch/sparc64/mm/fault.c @@ -454,6 +454,8 @@ good_area: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/um/kernel/trap_kern.c =================================================================== --- linux-2.6.orig/arch/um/kernel/trap_kern.c +++ linux-2.6/arch/um/kernel/trap_kern.c @@ -76,6 +76,8 @@ int handle_page_fault(unsigned long addr case VM_FAULT_OOM: err = -ENOMEM; goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } Index: linux-2.6/arch/xtensa/mm/fault.c =================================================================== --- linux-2.6.orig/arch/xtensa/mm/fault.c +++ linux-2.6/arch/xtensa/mm/fault.c @@ -113,6 +113,8 @@ survive: goto do_sigbus; case VM_FAULT_OOM: goto out_of_memory; + case VM_FAULT_RACE: + break; default: BUG(); } ^ permalink raw reply [flat|nested] 72+ messages in thread
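The arch-side hunks above all make the same change; as a userspace sketch (assumed names), VM_FAULT_RACE becomes a survivable outcome that bumps no fault counter and raises no signal — the racing fault was handled by someone else, so the handler just returns and lets the access retry.

```c
enum vmf { VMF_RACE = -2, VMF_OOM = -1, VMF_SIGBUS = 0, VMF_MINOR = 1, VMF_MAJOR = 2 };

struct task_counts { int min_flt; int maj_flt; };

/* Model of the arch fault-return switch.  Returns 0 on a survivable
 * fault, negative for the SIGBUS/OOM paths. */
int fault_dispatch(enum vmf r, struct task_counts *t)
{
    switch (r) {
    case VMF_MINOR:  t->min_flt++; break;
    case VMF_MAJOR:  t->maj_flt++; break;
    case VMF_RACE:   break;             /* no counter, no signal: just retry */
    case VMF_SIGBUS: return -1;
    case VMF_OOM:    return -2;
    }
    return 0;
}

/* Helper for quick checks: minor-fault count after one dispatch. */
int min_flt_after(enum vmf r)
{
    struct task_counts t = { 0, 0 };
    fault_dispatch(r, &t);
    return t.min_flt;
}
```

Without these hunks, a VM_FAULT_RACE return would fall into the `default: BUG();` arm of each architecture's switch, which is why every fault handler needs the new case.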
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 8:21 ` [patch 2.6.13-rc4] fix get_user_pages bug Nick Piggin @ 2005-08-01 9:19 ` Ingo Molnar 2005-08-01 9:27 ` Nick Piggin 2005-08-01 15:42 ` Linus Torvalds 2005-08-01 20:03 ` Hugh Dickins 2 siblings, 1 reply; 72+ messages in thread
From: Ingo Molnar @ 2005-08-01 9:19 UTC (permalink / raw)
To: Nick Piggin
Cc: Robin Holt, Andrew Morton, Linus Torvalds, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Hi,
>
> Not sure if this should be fixed for 2.6.13. It can result in
> pagecache corruption: so I guess that answers my own question.
>
> This was tested by Robin and appears to solve the problem. Roland had
> a quick look and thought the basic idea was sound. I'd like to get a
> couple more acks before going forward, and in particular Robin was
> contemplating possible efficiency improvements (although efficiency
> can wait on correctness).
>
> Feedback please, anyone.

It looks good to me, but wouldn't it be simpler (in terms of patch and architecture impact) to always retry the follow_page() in get_user_pages() in case of a minor fault? The sequence of minor faults must always be finite, so it's a safe solution, and it should solve the race too. The extra overhead of an additional follow_page() is small.

Especially with 2.6.13 being close, this looks like the safest bet; any improvement to this (i.e. your patch) can then be done in 2.6.14.

	Ingo

When get_user_pages for write access races with another process in the page fault handler that has established the pte for read access, handle_mm_fault in get_user_pages will return VM_FAULT_MINOR even if it hasn't made the page correctly writeable (for example, broken COW). Thus the assumption that get_user_pages has a writeable page at the mapping after handle_mm_fault returns is incorrect. Fix this by retrying the lookup and fault in get_user_pages before making the assumption that we have a writeable page.
Great work by Robin Holt <holt@sgi.com> to debug the problem.

Originally-from: Nick Piggin <npiggin@suse.de>

Simplified the patch into a two-liner:

Signed-off-by: Ingo Molnar <mingo@elte.hu>

 mm/memory.c |    7 +++++++
 1 files changed, 7 insertions(+)

Index: linux-sched-curr/mm/memory.c
===================================================================
--- linux-sched-curr.orig/mm/memory.c
+++ linux-sched-curr/mm/memory.c
@@ -961,6 +961,13 @@ int get_user_pages(struct task_struct *t
 		switch (handle_mm_fault(mm,vma,start,write)) {
 		case VM_FAULT_MINOR:
 			tsk->min_flt++;
+			/*
+			 * We might have raced with a readonly
+			 * pagefault, retry to make sure we
+			 * got write access:
+			 */
+			if (write)
+				continue;
 			break;
 		case VM_FAULT_MAJOR:
 			tsk->maj_flt++;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01  9:19 ` Ingo Molnar
@ 2005-08-01  9:27 ` Nick Piggin
  2005-08-01 10:15 ` Ingo Molnar
  0 siblings, 1 reply; 72+ messages in thread

From: Nick Piggin @ 2005-08-01 9:27 UTC
To: Ingo Molnar
Cc: Robin Holt, Andrew Morton, Linus Torvalds, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel

Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > Feedback please, anyone.
>
> it looks good to me, but wouldn't it be simpler (in terms of patch and
> architecture impact) to always retry the follow_page() in
> get_user_pages(), in case of a minor fault? The sequence of minor faults

I believe this can break some things. Hugh posted an example in his
recent post to linux-mm (ptrace setting a breakpoint in read-only
text). I think?

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01  9:27 ` Nick Piggin
@ 2005-08-01 10:15 ` Ingo Molnar
  2005-08-01 10:57 ` Nick Piggin
  0 siblings, 1 reply; 72+ messages in thread

From: Ingo Molnar @ 2005-08-01 10:15 UTC
To: Nick Piggin
Cc: Robin Holt, Andrew Morton, Linus Torvalds, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Ingo Molnar wrote:
> > * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > > Feedback please, anyone.
> >
> > it looks good to me, but wouldn't it be simpler (in terms of patch and
> > architecture impact) to always retry the follow_page() in
> > get_user_pages(), in case of a minor fault? The sequence of minor faults
>
> I believe this can break some things. Hugh posted an example in his
> recent post to linux-mm (ptrace setting a breakpoint in read-only
> text). I think?

Hugh's posting said:

 "it's trying to avoid an endless loop of finding the pte not writable
  when ptrace is modifying a page which the user is currently protected
  against writing to (setting a breakpoint in readonly text, perhaps?)"

i'm wondering, why should that case generate an infinite fault? The
first write access should copy the shared-library page into a private
page and map it into the task's MM, writable. If this make-writable
operation races with a read access then we return a minor fault and the
page is still readonly, but retrying the write should then break up the
COW protection and generate a writable page, and a subsequent
follow_page() success. If the page cannot be made writable, shouldn't
the vma flags reflect this fact by not having the VM_MAYWRITE flag, and
hence get_user_pages() should have returned with -EFAULT earlier?

in other words, can a named MAP_PRIVATE vma with VM_MAYWRITE set ever
be non-COW-break-able and thus have the potential to induce an infinite
loop?

	Ingo
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 10:15 ` Ingo Molnar
@ 2005-08-01 10:57 ` Nick Piggin
  2005-08-01 19:43 ` Hugh Dickins
  0 siblings, 1 reply; 72+ messages in thread

From: Nick Piggin @ 2005-08-01 10:57 UTC
To: Ingo Molnar
Cc: Robin Holt, Andrew Morton, Linus Torvalds, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel

Ingo Molnar wrote:
>
> Hugh's posting said:
>
> "it's trying to avoid an endless loop of finding the pte not writable
> when ptrace is modifying a page which the user is currently protected
> against writing to (setting a breakpoint in readonly text, perhaps?)"
>
> i'm wondering, why should that case generate an infinite fault? The
> first write access should copy the shared-library page into a private
> page and map it into the task's MM, writable. If this make-writable

It will be mapped readonly.

> operation races with a read access then we return a minor fault and the
> page is still readonly, but retrying the write should then break up the
> COW protection and generate a writable page, and a subsequent
> follow_page() success. If the page cannot be made writable, shouldn't the
> vma flags reflect this fact by not having the VM_MAYWRITE flag, and
> hence get_user_pages() should have returned with -EFAULT earlier?

If it cannot be written to, then yes. If it can be written to
but is mapped readonly then you have the problem.

Aside, that brings up an interesting question - why should readonly
mappings of writeable files (with VM_MAYWRITE set) disallow ptrace
write access while readonly mappings of readonly files do not? Or am I
horribly confused?

--
SUSE Labs, Novell Inc.
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 10:57 ` Nick Piggin
@ 2005-08-01 19:43 ` Hugh Dickins
  2005-08-01 20:08 ` Linus Torvalds
  0 siblings, 1 reply; 72+ messages in thread

From: Hugh Dickins @ 2005-08-01 19:43 UTC
To: Nick Piggin
Cc: Ingo Molnar, Robin Holt, Andrew Morton, Linus Torvalds, Roland McGrath, linux-mm, linux-kernel

On Mon, 1 Aug 2005, Nick Piggin wrote:
> Ingo Molnar wrote:
> > Hugh's posting said:
> >
> > "it's trying to avoid an endless loop of finding the pte not writable
> > when ptrace is modifying a page which the user is currently protected
> > against writing to (setting a breakpoint in readonly text, perhaps?)"
> >
> > i'm wondering, why should that case generate an infinite fault? The first
> > write access should copy the shared-library page into a private page and
> > map it into the task's MM, writable. If this make-writable
>
> It will be mapped readonly.

Yes. Sorry to leave you so long pondering over my words!

> > operation races with a read access then we return a minor fault and the
> > page is still readonly, but retrying the write should then break up the
> > COW protection and generate a writable page, and a subsequent
> > follow_page() success. If the page cannot be made writable, shouldn't the
> > vma flags reflect this fact by not having the VM_MAYWRITE flag, and hence
> > get_user_pages() should have returned with -EFAULT earlier?
>
> If it cannot be written to, then yes. If it can be written to
> but is mapped readonly then you have the problem.

Yes. The problem case is the one where "maybe_mkwrite" finds !VM_WRITE
and so doesn't set writable (but as Linus has observed, dirty is set).

I'm no expert on that case, was just guessing its use; I think Robin
determined that Roland is the expert on it. But what's very clear is
that it's intentional, allowing ptrace to write where the user cannot
currently write.

> Aside, that brings up an interesting question - why should readonly
> mappings of writeable files (with VM_MAYWRITE set) disallow ptrace
> write access while readonly mappings of readonly files not? Or am I
> horribly confused?

Either you or I. You'll have to spell that out to me in more detail,
I don't see it that way.

What I do see and am slightly disturbed by, is that do_wp_page on a
shared maywrite but not currently writable area will go the break-cow
route in mainline, and has done so forever; whereas my page_mkwrite
fixes in -mm inadvertently change that to go the reuse route. I think
my inadvertent change is correct, and the current behaviour (putting
anonymous pages into a shared vma) is surprising, and may have bad
consequences (would at least get the overcommit accounting wrong).

Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 19:43 ` Hugh Dickins
@ 2005-08-01 20:08 ` Linus Torvalds
  2005-08-01 21:06 ` Hugh Dickins
  0 siblings, 1 reply; 72+ messages in thread

From: Linus Torvalds @ 2005-08-01 20:08 UTC
To: Hugh Dickins
Cc: Nick Piggin, Ingo Molnar, Robin Holt, Andrew Morton, Roland McGrath, linux-mm, linux-kernel, Martin Schwidefsky

On Mon, 1 Aug 2005, Hugh Dickins wrote:
>
> > Aside, that brings up an interesting question - why should readonly
> > mappings of writeable files (with VM_MAYWRITE set) disallow ptrace
> > write access while readonly mappings of readonly files not? Or am I
> > horribly confused?
>
> Either you or I. You'll have to spell that out to me in more detail,
> I don't see it that way.

We have always just done a COW if it's read-only - even if it's shared.

The point being that if a process did a read-only mapping, and a
tracer wants to modify memory, the tracer is always allowed to do so, but
it's _not_ going to write anything back to the filesystem. Writing
something back to an executable just because the user happened to mmap it
with MAP_SHARED (but read-only) _and_ the user had the right to write to
that fd is _not_ ok.

So VM_MAYWRITE is totally immaterial. We _will_not_write_ (and must not
do so) to the backing store through ptrace unless it was literally a
writable mapping (in which case VM_WRITE will be set, and the page table
should be marked writable in the first place).

So we have two choices:

 - not allow the write at all in ptrace (which I think we did at some
   point)

   This ends up being really inconvenient, and people seem to really
   expect to be able to write to readonly areas in debuggers. And doing
   "MAP_SHARED, PROT_READ" seems to be a common thing (Linux has
   supported that pretty much since day #1 when mmap was supported - long
   before writable shared mappings were supported, Linux accepted
   MAP_SHARED + PROT_READ not just because we could, but because Unix
   apps do use it).

or

 - turn a shared read-only page into a private page on ptrace write

   This is what we've been doing. It's strange, and it _does_ change
   semantics (it's not shared any more, so the debugger writing to it
   means that now you don't see changes to that file by others), so it's
   clearly not "correct" either, but it's certainly a million times
   better than writing out breakpoints to shared files..

At some point (for the longest time), when a debugger was used to modify
a read-only page, we also made it writable to the user, which was much
easier from a VM standpoint. Now we have this "maybe_mkwrite()" thing,
which is part of the reason for this particular problem.

Using the dirty flag for a "page is _really_ writable" is admittedly kind
of hacky, but it does have the advantage of working even when the -real-
write bit isn't set due to "maybe_mkwrite()". If it forces the s390
people to add some more hacks for their strange VM, so be it..

[ Btw, on a totally unrelated note: anybody who is a git user and looks
  for when this maybe_mkwrite() thing happened, just doing

	git-whatchanged -p -Smaybe_mkwrite mm/memory.c

  in the bkcvs conversion pinpoints it immediately. Very useful git trick
  in case you ever have that kind of question. ]

I added Martin Schwidefsky to the Cc: explicitly, so that he can ping
whoever in the s390 team needs to figure out what the right thing is for
s390 and the dirty bit semantic change. Thanks for pointing it out.

		Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 20:08 ` Linus Torvalds
@ 2005-08-01 21:06 ` Hugh Dickins
  2005-08-01 21:51 ` Linus Torvalds
  0 siblings, 1 reply; 72+ messages in thread

From: Hugh Dickins @ 2005-08-01 21:06 UTC
To: Linus Torvalds
Cc: Nick Piggin, Ingo Molnar, Robin Holt, Andrew Morton, Roland McGrath, linux-mm, linux-kernel, Martin Schwidefsky

On Mon, 1 Aug 2005, Linus Torvalds wrote:
> On Mon, 1 Aug 2005, Hugh Dickins wrote:
> >
> > > Aside, that brings up an interesting question - why should readonly
> > > mappings of writeable files (with VM_MAYWRITE set) disallow ptrace
> > > write access while readonly mappings of readonly files not? Or am I
> > > horribly confused?
> >
> > Either you or I. You'll have to spell that out to me in more detail,
> > I don't see it that way.
>
> We have always just done a COW if it's read-only - even if it's shared.
>
> The point being that if a process did a read-only mapping, and a
> tracer wants to modify memory, the tracer is always allowed to do so, but
> it's _not_ going to write anything back to the filesystem. Writing
> something back to an executable just because the user happened to mmap it
> with MAP_SHARED (but read-only) _and_ the user had the right to write to
> that fd is _not_ ok.

I'll need to think that through, but not right now. It's a surprise
to me, and it's likely to surprise the current kernel too. I'd prefer
to say that if the executable was mapped shared from a writable fd,
then the tracer will write back to it; but you're clearly against that.
We may need to split the vma to handle your semantics correctly.

> Using the dirty flag for a "page is _really_ writable" is admittedly kind
> of hacky, but it does have the advantage of working even when the -real-
> write bit isn't set due to "maybe_mkwrite()". If it forces the s390 people
> to add some more hacks for their strange VM, so be it..

I'll concentrate on this for now. s390 used to have the same semantics
as everyone else; they made a conscious choice to diverge, and keep the
dirty bit in a separate array indexed by struct page, with a
page_test_and_clear_dirty macro they use instead of clearing pte_dirty.
It works all right, and I don't think it will prevent us communicating
between maybe_mkwrite and get_user_pages by a pte flag that would be
the usual pte_dirty on every other architecture. Just need to be
careful to handle the right one right.

Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 21:06 ` Hugh Dickins
@ 2005-08-01 21:51 ` Linus Torvalds
  2005-08-01 22:01 ` Linus Torvalds
  0 siblings, 1 reply; 72+ messages in thread

From: Linus Torvalds @ 2005-08-01 21:51 UTC
To: Hugh Dickins
Cc: Nick Piggin, Ingo Molnar, Robin Holt, Andrew Morton, Roland McGrath, linux-mm, linux-kernel, Martin Schwidefsky

On Mon, 1 Aug 2005, Hugh Dickins wrote:
> >
> > We have always just done a COW if it's read-only - even if it's shared.
> >
> > The point being that if a process did a read-only mapping, and a
> > tracer wants to modify memory, the tracer is always allowed to do so, but
> > it's _not_ going to write anything back to the filesystem. Writing
> > something back to an executable just because the user happened to mmap it
> > with MAP_SHARED (but read-only) _and_ the user had the right to write to
> > that fd is _not_ ok.
>
> I'll need to think that through, but not right now. It's a surprise
> to me, and it's likely to surprise the current kernel too.

Well, even if you did the write-back if VM_MAYWRITE is set, you'd still
have the case of having MAP_SHARED, PROT_READ _without_ VM_MAYWRITE being
set, and I'd expect that to actually be the common one (since you'd
normally use O_RDONLY to open a fd that you only want to map for reading).

And as mentioned, MAP_SHARED+PROT_READ does actually happen in real life.
Just do a google search on "MAP_SHARED PROT_READ -PROT_WRITE" and you'll
get tons of hits. For good reason too - because MAP_PRIVATE isn't actually
coherent on several old UNIXes.

So you'd still have to convert such a case to a COW mapping, so it's not
like you can avoid it.

Of course, if VM_MAYWRITE is not set, you could just convert it silently
to a MAP_PRIVATE at the VM level (that's literally what we used to do,
back when we didn't support writable shared mappings at all, all those
years ago), so at least now the COW behaviour would match the vma_flags.

> I'd prefer to say that if the executable was mapped shared from a writable fd,
> then the tracer will write back to it; but you're clearly against that.

Absolutely. I can just see somebody mapping an executable MAP_SHARED and
PROT_READ, and something as simple as doing a breakpoint while debugging
causing system-wide trouble. I really don't think that's acceptable.

And I'm not making it up - add PROT_EXEC to the google search above, and
watch it being done exactly that way. Several of the hits mention shared
libraries too. I strongly suspect that almost all cases will be opened
with O_RDONLY, but still..

		Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 21:51 ` Linus Torvalds
@ 2005-08-01 22:01 ` Linus Torvalds
  2005-08-02 12:01 ` Martin Schwidefsky
  0 siblings, 1 reply; 72+ messages in thread

From: Linus Torvalds @ 2005-08-01 22:01 UTC
To: Hugh Dickins
Cc: Nick Piggin, Ingo Molnar, Robin Holt, Andrew Morton, Roland McGrath, linux-mm, linux-kernel, Martin Schwidefsky

On Mon, 1 Aug 2005, Linus Torvalds wrote:
>
> Of course, if VM_MAYWRITE is not set, you could just convert it silently
> to a MAP_PRIVATE at the VM level (that's literally what we used to do,
> back when we didn't support writable shared mappings at all, all those
> years ago), so at least now the COW behaviour would match the vma_flags.

Heh. I just checked. We still do exactly that:

	if (!(file->f_mode & FMODE_WRITE))
		vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

some code never dies ;)

However, we still set the VM_MAYSHARE bit, and that's the one that
mm/rmap.c checks for some reason. I don't see quite why - VM_MAYSHARE
doesn't actually ever do anything else than make sure that we try to
allocate a mremap() mapping in a cache-coherent space, I think (ie it's a
total no-op on any sane architecture, and as far as rmap is concerned on
all of them).

		Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 22:01 ` Linus Torvalds
@ 2005-08-02 12:01 ` Martin Schwidefsky
  2005-08-02 12:26 ` Hugh Dickins
  2005-08-02 15:30 ` Linus Torvalds
  1 sibling, 2 replies; 72+ messages in thread

From: Martin Schwidefsky @ 2005-08-02 12:01 UTC
To: Linus Torvalds
Cc: Andrew Morton, Robin Holt, Hugh Dickins, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath

> > Any chance you can change the __follow_page test to account for
> > writeable clean ptes? Something like
> >
> > 	if (write && !pte_dirty(pte) && !pte_write(pte))
> > 		goto out;
> >
> > And then you would re-add the set_page_dirty logic further on.
>
> Hmm.. That should be possible. I wanted to do the simplest possible code
> sequence, but yeah, I guess there's nothing wrong with allowing the code
> to dirty the page.
>
> Somebody want to send me a proper patch? Also, I haven't actually heard
> from whoever actually noticed the problem in the first place (Robin?)
> whether the fix does fix it. It "obviously does", but testing is always
> good ;)

Why do we require the !pte_dirty(pte) check? I don't get it. If a
writeable clean pte is just fine then why do we check the dirty bit at
all? Doesn't pte_dirty() imply pte_write()?

With the additional !pte_write(pte) check (and if I haven't overlooked
something which is not unlikely) s390 should work fine even without the
software-dirty bit hack.

blue skies,
  Martin

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-02 12:01 ` Martin Schwidefsky
@ 2005-08-02 12:26 ` Hugh Dickins
  2005-08-02 12:28 ` Nick Piggin
  2005-08-02 15:19 ` Martin Schwidefsky
  1 sibling, 2 replies; 72+ messages in thread

From: Hugh Dickins @ 2005-08-02 12:26 UTC
To: Martin Schwidefsky
Cc: Linus Torvalds, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath

On Tue, 2 Aug 2005, Martin Schwidefsky wrote:
>
> Why do we require the !pte_dirty(pte) check? I don't get it. If a writeable
> clean pte is just fine then why do we check the dirty bit at all? Doesn't
> pte_dirty() imply pte_write()?

Not quite. This is all about the peculiar ptrace case, which sets "force"
to get_user_pages, and ends up handled by the little maybe_mkwrite function:
we sometimes allow ptrace to modify the page while the user does not
have write access to it via the pte.

Robin discovered a race which proves it's unsafe for get_user_pages to
reset its lookup_write flag (another stage in this peculiar path) after
a single try; Nick proposed a patch which adds another VM_ return code
which each arch would need to handle; Linus looked for something simpler
and hit upon checking pte_dirty rather than pte_write (and removing
the then unnecessary lookup_write flag).

Linus' changes are in the 2.6.13-rc5 mm/memory.c,
but that leaves s390 broken at present.

> With the additional !pte_write(pte) check (and if I haven't overlooked
> something which is not unlikely) s390 should work fine even without the
> software-dirty bit hack.

I agree the pte_write check ought to go back in next to the pte_dirty
check, and that will leave s390 handling most uses of get_user_pages
correctly, but still failing to handle the peculiar case of ptrace
modifying a page to which the user does not currently have write access
(e.g. setting a breakpoint in readonly text).

Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-02 12:26 ` Hugh Dickins
@ 2005-08-02 12:28 ` Nick Piggin
  0 siblings, 0 replies; 72+ messages in thread

From: Nick Piggin @ 2005-08-02 12:28 UTC
To: Hugh Dickins
Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath

[-- Attachment #1: Type: text/plain, Size: 683 bytes --]

Hugh Dickins wrote:
> On Tue, 2 Aug 2005, Martin Schwidefsky wrote:
> > With the additional !pte_write(pte) check (and if I haven't overlooked
> > something which is not unlikely) s390 should work fine even without the
> > software-dirty bit hack.
>
> I agree the pte_write check ought to go back in next to the pte_dirty
> check, and that will leave s390 handling most uses of get_user_pages
> correctly, but still failing to handle the peculiar case of strace
> modifying a page to which the user does not currently have write access
> (e.g. setting a breakpoint in readonly text).

Oh, here is the patch I sent Linus and forgot to CC everyone else.

--
SUSE Labs, Novell Inc.

[-- Attachment #2: mm-opt-gup.patch --]
[-- Type: text/plain, Size: 1025 bytes --]

Allow __follow_page to succeed when encountering a clean, writeable pte.
Requires reintroduction of the direct page dirtying. Means
get_user_pages doesn't have to drop the page_table_lock and enter the
page fault handler for every clean, writeable pte it encounters (when
being called for write).

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -811,15 +811,18 @@ static struct page *__follow_page(struct
 	pte = *ptep;
 	pte_unmap(ptep);
 	if (pte_present(pte)) {
-		if (write && !pte_dirty(pte))
+		if (write && !pte_write(pte) && !pte_dirty(pte))
 			goto out;
 		if (read && !pte_read(pte))
 			goto out;
 		pfn = pte_pfn(pte);
 		if (pfn_valid(pfn)) {
 			page = pfn_to_page(pfn);
-			if (accessed)
+			if (accessed) {
+				if (write && !pte_dirty(pte) && !PageDirty(page))
+					set_page_dirty(page);
 				mark_page_accessed(page);
+			}
 			return page;
 		}
 	}
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-02 12:26 ` Hugh Dickins
  2005-08-02 12:28 ` Nick Piggin
@ 2005-08-02 15:19 ` Martin Schwidefsky
  1 sibling, 0 replies; 72+ messages in thread

From: Martin Schwidefsky @ 2005-08-02 15:19 UTC
To: Hugh Dickins
Cc: Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath, Linus Torvalds

Hugh Dickins <hugh@veritas.com> wrote on 08/02/2005 02:26:09 PM:
> On Tue, 2 Aug 2005, Martin Schwidefsky wrote:
> >
> > Why do we require the !pte_dirty(pte) check? I don't get it. If a writeable
> > clean pte is just fine then why do we check the dirty bit at all? Doesn't
> > pte_dirty() imply pte_write()?
>
> Not quite. This is all about the peculiar ptrace case, which sets "force"
> to get_user_pages, and ends up handled by the little maybe_mkwrite function:
> we sometimes allow ptrace to modify the page while the user does not have
> write access to it via the pte.

I slowly begin to understand. You want to do a copy on write of the page
you want to change with ptrace, but without making the pte a writable
pte. To indicate the fact that the page has been COWed, the pte dirty
bit is misused.

> Robin discovered a race which proves it's unsafe for get_user_pages to
> reset its lookup_write flag (another stage in this peculiar path) after
> a single try, Nick proposed a patch which adds another VM_ return code
> which each arch would need to handle, Linus looked for something simpler
> and hit upon checking pte_dirty rather than pte_write (and removing
> the then unnecessary lookup_write flag).

Yes, I've seen it on lkml.

> Linus' changes are in the 2.6.13-rc5 mm/memory.c,
> but that leaves s390 broken at present.

That is currently my problem which I'm trying to solve. With Nick's
latest patch, we are more or less back to the previous situation in
regard to the checks in __follow_page as far as s390 is concerned.
pte_write() is 0 for write access via access_process_vm on a read-only
vma. pte_dirty() is always 0, but since get_user_pages doesn't fall
back on lookup_write = 0 anymore, we have an endless loop again. Four
letter words ..

> > With the additional !pte_write(pte) check (and if I haven't overlooked
> > something which is not unlikely) s390 should work fine even without the
> > software-dirty bit hack.
>
> I agree the pte_write check ought to go back in next to the pte_dirty
> check, and that will leave s390 handling most uses of get_user_pages
> correctly, but still failing to handle the peculiar case of strace
> modifying a page to which the user does not currently have write access
> (e.g. setting a breakpoint in readonly text).

The straight forward fix is to introduce a software dirty bit that can
get misused for this special case. Not that I like it, but it seems the
simplest solution.

blue skies,
  Martin

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-02 12:01 ` Martin Schwidefsky
  2005-08-02 12:26 ` Hugh Dickins
@ 2005-08-02 15:30 ` Linus Torvalds
  2005-08-02 16:03 ` Hugh Dickins
  2005-08-02 16:44 ` Martin Schwidefsky
  1 sibling, 2 replies; 72+ messages in thread

From: Linus Torvalds @ 2005-08-02 15:30 UTC
To: Martin Schwidefsky
Cc: Andrew Morton, Robin Holt, Hugh Dickins, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath

On Tue, 2 Aug 2005, Martin Schwidefsky wrote:
>
> Why do we require the !pte_dirty(pte) check? I don't get it. If a writeable
> clean pte is just fine then why do we check the dirty bit at all? Doesn't
> pte_dirty() imply pte_write()?

A _non_writable and clean pte is _also_ fine sometimes. But only if we
have broken COW and marked it dirty.

> With the additional !pte_write(pte) check (and if I haven't overlooked
> something which is not unlikely) s390 should work fine even without the
> software-dirty bit hack.

No it won't. It will just loop forever in a tight loop if somebody tries
to put a breakpoint on a read-only location.

On the other hand, this being s390, maybe nobody cares?

		Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-02 15:30 ` Linus Torvalds
@ 2005-08-02 16:03 ` Hugh Dickins
  2005-08-02 16:25 ` Linus Torvalds
  1 sibling, 1 reply; 72+ messages in thread

From: Hugh Dickins @ 2005-08-02 16:03 UTC
To: Linus Torvalds
Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath

On Tue, 2 Aug 2005, Linus Torvalds wrote:
>
> On the other hand, this being s390, maybe nobody cares?

You have a cruel streak.

But have I just realized a non-s390 problem with your pte_dirty
technique? The ptep_set_wrprotect in fork's copy_one_pte.

That's specifically write-protecting the pte to force COW, but leaving
the dirty bit: so now get_user_pages will skip COW-ing it (in all write
cases, not just the peculiar ptrace force one). We really do need to
COW those in get_user_pages, don't we? And although we could handle it
by clearing dirty and doing a set_page_dirty, it's a path we don't want
to clutter further.

(Yes, there's another issue of a fork occurring while the pages are
already under get_user_pages, which is a significant issue for
InfiniBand; but I see that as a separate kind of race, which we can
reasonably disregard in this discussion - though it will need proper
attention, perhaps through Michael Tsirkin's PROT_DONTCOPY patch.)

The simple answer to Robin's pagetable update race is to say that
anyone using ptrace should be prepared for a pt race ;)

Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 16:03 ` Hugh Dickins @ 2005-08-02 16:25 ` Linus Torvalds 2005-08-02 17:02 ` Linus Torvalds 2005-08-02 17:21 ` Hugh Dickins 0 siblings, 2 replies; 72+ messages in thread From: Linus Torvalds @ 2005-08-02 16:25 UTC (permalink / raw) To: Hugh Dickins Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath On Tue, 2 Aug 2005, Hugh Dickins wrote: > > But have I just realized a non-s390 problem with your pte_dirty > technique? The ptep_set_wrprotect in fork's copy_one_pte. > > That's specifically write-protecting the pte to force COW, but leaving > the dirty bit: so now get_user_pages will skip COW-ing it (in all write > cases, not just the peculiar ptrace force one). Damn, you're right. We could obviously move the dirty bit from the page tables to the "struct page" in fork() (that may have other advantages: we're scanning the dang thing anyway, after all) to avoid that special case, but yes, that's nasty. One of the reasons I _liked_ the pte_dirty() test is that there's the reverse case: a mapping that used to be writable, and got dirtied (and COW'ed as necessary), and then was mprotected back, and the new test would happily write to it _without_ having to do any extra work. Which in that case is correct. But yeah, fork() does something special. In fact, that brings up another race altogether: a thread that does a fork() at the same time as get_user_pages() will have the exact same issues. Even with the old code. Simply because we test the permissions on the page long before we actually do the real access (ie it may be dirty and writable when we get it, but by the time the write happens, it might have become COW-shared). Now, that's probably not worth worrying about, but it's kind of interesting. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 16:25 ` Linus Torvalds @ 2005-08-02 17:02 ` Linus Torvalds 2005-08-02 17:27 ` Hugh Dickins 2005-08-02 17:21 ` Hugh Dickins 1 sibling, 1 reply; 72+ messages in thread From: Linus Torvalds @ 2005-08-02 17:02 UTC (permalink / raw) To: Hugh Dickins Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath On Tue, 2 Aug 2005, Linus Torvalds wrote: > > In fact, that brings up another race altogether: a thread that does a > fork() at the same time [...] You don't even need that, actually. There's another race by which the write could have gotten lost both with the new code _and_ the old code. Since we will have dropped the page table lock when calling handle_mm_fault() (which will just re-get the lock and then drop it again) _and_ since we don't actually mark the page dirty if it was writable, it's entirely possible that the VM scanner comes in and just drops the page from the page tables. Now, that doesn't sound so bad, but what we have then is a page that is marked dirty in the "struct page", but hasn't been actually dirtied yet. It could get written out and marked clean (can anybody say "preemptible kernel"?) before we ever actually do the write to the page. The thing is, we should always set the dirty bit either atomically with the access (normal "CPU sets the dirty bit on write") _or_ we should set it after the write (having kept a reference to the page). Or does anybody see anything that protects us here? Now, I don't think we can fix that race (which is probably pretty much impossible to hit in practice) in the 2.6.13 timeframe. Maybe I'll have to just accept the horrid "VM_FAULT_RACE" patch. I don't much like it, but.. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 17:02 ` Linus Torvalds @ 2005-08-02 17:27 ` Hugh Dickins 0 siblings, 0 replies; 72+ messages in thread From: Hugh Dickins @ 2005-08-02 17:27 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath On Tue, 2 Aug 2005, Linus Torvalds wrote: > > Since we will have dropped the page table lock when calling > handle_mm_fault() (which will just re-get the lock and then drop it > again) _and_ since we don't actually mark the page dirty if it was > writable, it's entirely possible that the VM scanner comes in and just > drops the page from the page tables. > > Now, that doesn't sound so bad, but what we have then is a page that is > marked dirty in the "struct page", but hasn't been actually dirtied yet. > It could get written out and marked clean (can anybody say "preemptible > kernel"?) before we ever actually do the write to the page. > > The thing is, we should always set the dirty bit either atomically with > the access (normal "CPU sets the dirty bit on write") _or_ we should set > it after the write (having kept a reference to the page). > > Or does anybody see anything that protects us here? > > Now, I don't think we can fix that race (which is probably pretty much > impossible to hit in practice) in the 2.6.13 timeframe. I believe this particular race has been recognized since day one of get_user_pages, and we've always demanded that the caller must do a SetPageDirty (I should probably say set_page_dirty) before freeing the pages held for writing. Which is why I was a bit puzzled to see that prior set_page_dirty in __follow_page, which Andrew identified as for s390. > Maybe I'll have to just accept the horrid "VM_FAULT_RACE" patch. I don't > much like it, but.. I've not yet reached a conclusion on that, need to think more about doing mkclean in copy_one_pte. 
Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 16:25 ` Linus Torvalds 2005-08-02 17:02 ` Linus Torvalds @ 2005-08-02 17:21 ` Hugh Dickins 2005-08-02 18:47 ` Linus Torvalds 1 sibling, 1 reply; 72+ messages in thread From: Hugh Dickins @ 2005-08-02 17:21 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath On Tue, 2 Aug 2005, Linus Torvalds wrote: > On Tue, 2 Aug 2005, Hugh Dickins wrote: > > > > But have I just realized a non-s390 problem with your pte_dirty > > technique? The ptep_set_wrprotect in fork's copy_one_pte. > > > > That's specifically write-protecting the pte to force COW, but leaving > > the dirty bit: so now get_user_pages will skip COW-ing it (in all write > > cases, not just the peculiar ptrace force one). > > Damn, you're right. We could obviously move the dirty bit from the page > tables to the "struct page" in fork() (that may have other advantages: > we're scanning the dang thing anyway, after all) to avoid that special > case, but yes, that's nasty. It might not be so bad. It's going to access the struct page anyway. And clearing dirty from parent and child at fork time could save two set_page_dirtys at exit time. But I'm not sure that we could batch the dirty bit clearing into one TLB flush like we do the write protection. > In fact, that brings up another race altogether: a thread that does a > fork() at the same time as get_user_pages() will have the exact same > issues. Even with the old code. Simply because we test the permissions on > the page long before we actually do the real access (ie it may be dirty > and writable when we get it, but by the time the write happens, it might > have become COW-shared). > > Now, that's probably not worth worrying about, but it's kind of > interesting.
Not worth worrying about in this context: it's one aspect of the InfiniBand (RDMA) issue I was referring to, to be addressed another time. Or is it even possible? We do require the caller of get_user_pages to down_read(&mm->mmap_sem), and fork parent has down_write(&mm->mmap_sem). Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 17:21 ` Hugh Dickins @ 2005-08-02 18:47 ` Linus Torvalds 2005-08-02 19:20 ` Hugh Dickins 0 siblings, 1 reply; 72+ messages in thread From: Linus Torvalds @ 2005-08-02 18:47 UTC (permalink / raw) To: Hugh Dickins Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath On Tue, 2 Aug 2005, Hugh Dickins wrote: > > It might not be so bad. It's going to access the struct page anyway. > And clearing dirty from parent and child at fork time could save two > set_page_dirtys at exit time. But I'm not sure that we could batch the > dirty bit clearing into one TLB flush like we do the write protection. Yes, good point. If the thing is still marked dirty in the TLB, some other thread might be writing to the page after we've cleared dirty but before we've flushed the TLB - causing the new dirty bit to be lost. I think. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 18:47 ` Linus Torvalds @ 2005-08-02 19:20 ` Hugh Dickins 2005-08-02 19:54 ` Linus Torvalds 0 siblings, 1 reply; 72+ messages in thread From: Hugh Dickins @ 2005-08-02 19:20 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath On Tue, 2 Aug 2005, Linus Torvalds wrote: > On Tue, 2 Aug 2005, Hugh Dickins wrote: > > > > It might not be so bad. It's going to access the struct page anyway. > > And clearing dirty from parent and child at fork time could save two > > set_page_dirtys at exit time. But I'm not sure that we could batch the > > dirty bit clearing into one TLB flush like we do the write protection. > > Yes, good point. If the thing is still marked dirty in the TLB, some other > thread might be writing to the page after we've cleared dirty but before > we've flushed the TLB - causing the new dirty bit to be lost. I think. Would that matter? Yes, if vmscan sneaked in at some point while page_table_lock is dropped, and wrote away the page with the earlier data. But I was worrying about the reverse case, that we clear dirty, then another thread sets it again before we emerge from copy_page_range, so it gets left behind granting get_user_pages write permission. Hmm, that implies that the other thread doesn't yet see wrprotect (because we've not yet flushed TLB), which probably implies it would still see dirty set: and so not set it again, so not a possible case. But that's a precarious, processor-dependent argument: I don't think it's safe to rely on, and your reverse case is already a problem. I don't believe there's a safe efficient way we could batch clearing dirty there. We could make a second pass of the whole mm after the TLB flush has asserted the wrprotects; but that won't win friends.
I'm thinking of reverting to the old __follow_page, setting write_access -1 in the get_user_pages special case (to avoid change to all the arches, in some of which write_access is a boolean, in some a bitflag, but in none -1), and in that write_access -1 case passing back the special code to say do_wp_page has done its full job. Combining your and Nick's and Andrew's ideas, and letting Martin off the hook. Or would that disgust you too much? (We could give -1 a pretty name ;) Working on it now, but my brain in an even lower power state than ever. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 19:20 ` Hugh Dickins @ 2005-08-02 19:54 ` Linus Torvalds 2005-08-02 20:55 ` Hugh Dickins 0 siblings, 1 reply; 72+ messages in thread From: Linus Torvalds @ 2005-08-02 19:54 UTC (permalink / raw) To: Hugh Dickins Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath On Tue, 2 Aug 2005, Hugh Dickins wrote: > > > > Yes, good point. If the thing is still marked dirty in the TLB, some other > > thread might be writing to the page after we've cleared dirty but before > > we've flushed the TLB - causing the new dirty bit to be lost. I think. > > Would that matter? Yes, if vmscan sneaked in at some point while > page_table_lock is dropped, and wrote away the page with the earlier data. Right. > But I was worrying about the reverse case, that we clear dirty, then > another thread sets it again before we emerge from copy_page_range, > so it gets left behind granting get_user_pages write permission. Hmm.. At least x86 won't do that - the dirty bits are updated with an atomic read-modify-write sequence that only sets the dirty bit. We won't get a writable page somehow. But the lost dirty bit is nasty. > I don't believe there's a safe efficient way we could batch clearing > dirty there. Well, there is one really cheap one: look at how many users the VM has. The thing is, fork() from a threaded environment is simply not done, so we could easily have a "slow and careful mode" for the thread case, and nobody would probably ever care. Whether it's worth it, I dunno. It might actually speed up the fork/exit cases, so it might be worth looking at for that reason.
> I'm thinking of reverting to the old __follow_page, setting write_access > -1 in the get_user_pages special case (to avoid change to all the arches, > in some of which write_access is a boolean, in some a bitflag, but in > none -1), and in that write_access -1 case passing back the special > code to say do_wp_page has done its full job. Combining your and > Nick's and Andrew's ideas, and letting Martin off the hook. > Or would that disgust you too much? (We could give -1 a pretty name ;) Go for it, I think whatever we do won't be wonderfully pretty. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 19:54 ` Linus Torvalds @ 2005-08-02 20:55 ` Hugh Dickins 2005-08-03 10:24 ` Nick Piggin 2005-08-03 10:24 ` Martin Schwidefsky 0 siblings, 2 replies; 72+ messages in thread From: Hugh Dickins @ 2005-08-02 20:55 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath On Tue, 2 Aug 2005, Linus Torvalds wrote: > > Go for it, I think whatever we do won't be wonderfully pretty. Here we are: get_user_pages quite untested, let alone the racy case, but I think it should work. Please all hack it around as you see fit, I'll check mail when I get home, but won't be very responsive... Checking pte_dirty instead of pte_write in __follow_page is problematic for s390, and for copy_one_pte which leaves dirty when clearing write. So revert __follow_page to check pte_write as before, and let do_wp_page pass back a special code VM_FAULT_WRITE to say it has done its full job (whereas VM_FAULT_MINOR when it backs out on race): once get_user_pages receives this value, it no longer requires pte_write in __follow_page. But most callers of handle_mm_fault, in the various architectures, have switch statements which do not expect this new case. To avoid changing them all in a hurry, only pass back VM_FAULT_WRITE when write_access arg says VM_FAULT_WRITE_EXPECTED - chosen as -1 since some arches pass write_access as a boolean, some as a bitflag, but none as -1. Yes, we do have a call to do_wp_page from do_swap_page, but no need to change that: in rare case it's needed, another do_wp_page will follow. 
Signed-off-by: Hugh Dickins <hugh@veritas.com> --- 2.6.13-rc5/include/linux/mm.h 2005-08-02 12:07:14.000000000 +0100 +++ linux/include/linux/mm.h 2005-08-02 21:14:58.000000000 +0100 @@ -629,6 +629,9 @@ static inline int page_mapped(struct pag #define VM_FAULT_SIGBUS 0 #define VM_FAULT_MINOR 1 #define VM_FAULT_MAJOR 2 +#define VM_FAULT_WRITE 3 /* special case for get_user_pages */ + +#define VM_FAULT_WRITE_EXPECTED (-1) /* only for get_user_pages */ #define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK) --- 2.6.13-rc5/mm/memory.c 2005-08-02 12:07:23.000000000 +0100 +++ linux/mm/memory.c 2005-08-02 21:14:26.000000000 +0100 @@ -811,15 +811,18 @@ static struct page *__follow_page(struct pte = *ptep; pte_unmap(ptep); if (pte_present(pte)) { - if (write && !pte_dirty(pte)) + if (write && !pte_write(pte)) goto out; if (read && !pte_read(pte)) goto out; pfn = pte_pfn(pte); if (pfn_valid(pfn)) { page = pfn_to_page(pfn); - if (accessed) + if (accessed) { + if (write && !pte_dirty(pte) &&!PageDirty(page)) + set_page_dirty(page); mark_page_accessed(page); + } return page; } } @@ -941,10 +944,11 @@ int get_user_pages(struct task_struct *t } spin_lock(&mm->page_table_lock); do { + int write_access = write? VM_FAULT_WRITE_EXPECTED: 0; struct page *page; cond_resched_lock(&mm->page_table_lock); - while (!(page = follow_page(mm, start, write))) { + while (!(page = follow_page(mm, start, write_access))) { /* * Shortcut for anonymous pages. 
We don't want * to force the creation of pages tables for @@ -957,7 +961,16 @@ int get_user_pages(struct task_struct *t break; } spin_unlock(&mm->page_table_lock); - switch (handle_mm_fault(mm,vma,start,write)) { + switch (handle_mm_fault(mm, vma, start, + write_access)) { + case VM_FAULT_WRITE: + /* + * do_wp_page has broken COW when + * necessary, even if maybe_mkwrite + * decided not to set pte_write + */ + write_access = 0; + /* FALLTHRU */ case VM_FAULT_MINOR: tsk->min_flt++; break; @@ -1220,6 +1233,7 @@ static int do_wp_page(struct mm_struct * struct page *old_page, *new_page; unsigned long pfn = pte_pfn(pte); pte_t entry; + int ret; if (unlikely(!pfn_valid(pfn))) { /* @@ -1247,7 +1261,7 @@ static int do_wp_page(struct mm_struct * lazy_mmu_prot_update(entry); pte_unmap(page_table); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return VM_FAULT_WRITE; } } pte_unmap(page_table); @@ -1274,6 +1288,7 @@ static int do_wp_page(struct mm_struct * /* * Re-check the pte - we dropped the lock */ + ret = VM_FAULT_MINOR; spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); if (likely(pte_same(*page_table, pte))) { @@ -1290,12 +1305,13 @@ static int do_wp_page(struct mm_struct * /* Free the old page.. 
*/ new_page = old_page; + ret = VM_FAULT_WRITE; } pte_unmap(page_table); page_cache_release(new_page); page_cache_release(old_page); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return ret; no_new_page: page_cache_release(old_page); @@ -1985,9 +2001,13 @@ static inline int handle_pte_fault(struc } if (write_access) { - if (!pte_write(entry)) - return do_wp_page(mm, vma, address, pte, pmd, entry); - + if (!pte_write(entry)) { + int ret = do_wp_page(mm, vma, address, pte, pmd, entry); + if (ret == VM_FAULT_WRITE && + write_access != VM_FAULT_WRITE_EXPECTED) + ret = VM_FAULT_MINOR; + return ret; + } entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 20:55 ` Hugh Dickins @ 2005-08-03 10:24 ` Nick Piggin 2005-08-03 11:47 ` Hugh Dickins 2005-08-03 16:12 ` Linus Torvalds 2005-08-03 10:24 ` Martin Schwidefsky 1 sibling, 2 replies; 72+ messages in thread From: Nick Piggin @ 2005-08-03 10:24 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath [-- Attachment #1: Type: text/plain, Size: 687 bytes --] Hugh Dickins wrote: > On Tue, 2 Aug 2005, Linus Torvalds wrote: > >>Go for it, I think whatever we do won't be wonderfully pretty. > > > Here we are: get_user_pages quite untested, let alone the racy case, > but I think it should work. Please all hack it around as you see fit, > I'll check mail when I get home, but won't be very responsive... > Seems OK to me. I don't know why you think handle_mm_fault can't be inline, but if it can be, then I have a modification attached that removes the condition - any good? Oh, it gets rid of the -1 for VM_FAULT_OOM. Doesn't seem like there is a good reason for it, but might that break out of tree drivers? -- SUSE Labs, Novell Inc. [-- Attachment #2: mm-gup-hugh.patch --] [-- Type: text/plain, Size: 6183 bytes --] Checking pte_dirty instead of pte_write in __follow_page is problematic for s390, and for copy_one_pte which leaves dirty when clearing write. So revert __follow_page to check pte_write as before, and let do_wp_page pass back a special code VM_FAULT_WRITE to say it has done its full job (whereas VM_FAULT_MINOR when it backs out on race): once get_user_pages receives this value, it no longer requires pte_write in __follow_page. But most callers of handle_mm_fault, in the various architectures, have switch statements which do not expect this new case. 
To avoid changing them all in a hurry, only pass back VM_FAULT_WRITE when write_access arg says VM_FAULT_WRITE_EXPECTED - chosen as -1 since some arches pass write_access as a boolean, some as a bitflag, but none as -1. Yes, we do have a call to do_wp_page from do_swap_page, but no need to change that: in rare case it's needed, another do_wp_page will follow. Signed-off-by: Hugh Dickins <hugh@veritas.com> Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h +++ linux-2.6/include/linux/mm.h @@ -625,10 +625,16 @@ static inline int page_mapped(struct pag * Used to decide whether a process gets delivered SIGBUS or * just gets major/minor fault counters bumped up. */ -#define VM_FAULT_OOM (-1) -#define VM_FAULT_SIGBUS 0 -#define VM_FAULT_MINOR 1 -#define VM_FAULT_MAJOR 2 +#define VM_FAULT_OOM 0x00 +#define VM_FAULT_SIGBUS 0x01 +#define VM_FAULT_MINOR 0x02 +#define VM_FAULT_MAJOR 0x03 + +/* + * Special case for get_user_pages. + * Must be in a distinct bit from the above VM_FAULT_ flags. 
+ */ +#define VM_FAULT_WRITE 0x10 #define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK) @@ -704,7 +710,13 @@ extern pte_t *FASTCALL(pte_alloc_kernel( extern pte_t *FASTCALL(pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)); extern int install_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot); extern int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long pgoff, pgprot_t prot); -extern int handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, unsigned long address, int write_access); +extern int __handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, unsigned long address, int write_access); + +static inline int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int write_access) +{ + return __handle_mm_fault(mm, vma, address, write_access) & (~VM_FAULT_WRITE); +} + extern int make_pages_present(unsigned long addr, unsigned long end); extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write); void install_arg_page(struct vm_area_struct *, struct page *, unsigned long); Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -811,15 +811,18 @@ static struct page *__follow_page(struct pte = *ptep; pte_unmap(ptep); if (pte_present(pte)) { - if (write && !pte_dirty(pte)) + if (write && !pte_write(pte)) goto out; if (read && !pte_read(pte)) goto out; pfn = pte_pfn(pte); if (pfn_valid(pfn)) { page = pfn_to_page(pfn); - if (accessed) + if (accessed) { + if (write && !pte_dirty(pte) &&!PageDirty(page)) + set_page_dirty(page); mark_page_accessed(page); + } return page; } } @@ -941,10 +944,11 @@ int get_user_pages(struct task_struct *t } spin_lock(&mm->page_table_lock); do { + int write_access = write; struct page *page; 
cond_resched_lock(&mm->page_table_lock); - while (!(page = follow_page(mm, start, write))) { + while (!(page = follow_page(mm, start, write_access))) { /* * Shortcut for anonymous pages. We don't want * to force the creation of pages tables for @@ -957,7 +961,16 @@ int get_user_pages(struct task_struct *t break; } spin_unlock(&mm->page_table_lock); - switch (handle_mm_fault(mm,vma,start,write)) { + switch (__handle_mm_fault(mm, vma, start, + write_access)) { + case VM_FAULT_WRITE: + /* + * do_wp_page has broken COW when + * necessary, even if maybe_mkwrite + * decided not to set pte_write + */ + write_access = 0; + /* FALLTHRU */ case VM_FAULT_MINOR: tsk->min_flt++; break; @@ -1220,6 +1233,7 @@ static int do_wp_page(struct mm_struct * struct page *old_page, *new_page; unsigned long pfn = pte_pfn(pte); pte_t entry; + int ret; if (unlikely(!pfn_valid(pfn))) { /* @@ -1247,7 +1261,7 @@ static int do_wp_page(struct mm_struct * lazy_mmu_prot_update(entry); pte_unmap(page_table); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return VM_FAULT_MINOR|VM_FAULT_WRITE; } } pte_unmap(page_table); @@ -1274,6 +1288,7 @@ static int do_wp_page(struct mm_struct * /* * Re-check the pte - we dropped the lock */ + ret = VM_FAULT_MINOR; spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); if (likely(pte_same(*page_table, pte))) { @@ -1290,12 +1305,13 @@ static int do_wp_page(struct mm_struct * /* Free the old page.. 
*/ new_page = old_page; + ret |= VM_FAULT_WRITE; } pte_unmap(page_table); page_cache_release(new_page); page_cache_release(old_page); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return ret; no_new_page: page_cache_release(old_page); @@ -1987,7 +2003,6 @@ static inline int handle_pte_fault(struc if (write_access) { if (!pte_write(entry)) return do_wp_page(mm, vma, address, pte, pmd, entry); - entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); @@ -2002,7 +2017,7 @@ static inline int handle_pte_fault(struc /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma, +int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma, unsigned long address, int write_access) { pgd_t *pgd; ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 10:24 ` Nick Piggin @ 2005-08-03 11:47 ` Hugh Dickins 2005-08-03 12:13 ` Nick Piggin 2005-08-03 16:12 ` Linus Torvalds 1 sibling, 1 reply; 72+ messages in thread From: Hugh Dickins @ 2005-08-03 11:47 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath On Wed, 3 Aug 2005, Nick Piggin wrote: > Hugh Dickins wrote: > > > > Here we are: get_user_pages quite untested, let alone the racy case, > > but I think it should work. Please all hack it around as you see fit, > > I'll check mail when I get home, but won't be very responsive... > > Seems OK to me. I don't know why you think handle_mm_fault can't > be inline, but if it can be, then I have a modification attached > that removes the condition - any good? Stupidity was the reason I thought handle_mm_fault couldn't be inline: I was picturing it static inline within mm/memory.c, failed to make the great intellectual leap you've achieved by moving it to include/linux/mm.h. > Oh, it gets rid of the -1 for VM_FAULT_OOM. Doesn't seem like there > is a good reason for it, but might that break out of tree drivers? No, I don't think it would break anything: it's just an historic oddity, used to be -1 for failure, and only got given a name recently, I think when wli added the proper major/minor counting. Your version of the patch looks less hacky to me (not requiring VM_FAULT_WRITE_EXPECTED arg), though we could perfectly well remove that at leisure by adding VM_FAULT_WRITE case into all the arches in 2.6.14 (which might be preferable to leaving the __inline obscurity?). I don't mind either way, but since you've not yet found an actual error in mine, I'd prefer you to make yours a tidyup patch on top, Signed-off-by your own good self, and let Linus decide whether he wants to apply yours on top or not. 
Or perhaps the decision rests for the moment with Robin, whether he gets his customer to test yours or mine - whichever is tested is the one which should go in. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 11:47 ` Hugh Dickins @ 2005-08-03 12:13 ` Nick Piggin 0 siblings, 0 replies; 72+ messages in thread From: Nick Piggin @ 2005-08-03 12:13 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath Hugh Dickins wrote: > Stupidity was the reason I thought handle_mm_fault couldn't be inline: > I was picturing it static inline within mm/memory.c, failed to make the > great intellectual leap you've achieved by moving it to include/linux/mm.h. > Well it was one of my finer moments, so don't be too hard on yourself. > > No, I don't think it would break anything: it's just an historic oddity, > used to be -1 for failure, and only got given a name recently, I think > when wli added the proper major/minor counting. > > Your version of the patch looks less hacky to me (not requiring > VM_FAULT_WRITE_EXPECTED arg), though we could perfectly well remove > that at leisure by adding VM_FAULT_WRITE case into all the arches in > 2.6.14 (which might be preferable to leaving the __inline obscurity?). > Well depends on what they want I suppose. Does it even make sense to expose VM_FAULT_WRITE to arch code if it will just fall through to VM_FAULT_MINOR? With my earlier VM_FAULT_RACE thing, you can squint and say that's a different case to VM_FAULT_MINOR - and accordingly not increment the minor fault count. After all, minor faults *seem* to track count of modifications made to the pte entry minus major faults, rather than the number of times the hardware traps. I say this because we increment the number in get_user_pages too. But... > I don't mind either way, but since you've not yet found an actual > error in mine, I'd prefer you to make yours a tidyup patch on top, > Signed-off-by your own good self, and let Linus decide whether he > wants to apply yours on top or not.
Or perhaps the decision rests > for the moment with Robin, whether he gets his customer to test > yours or mine - whichever is tested is the one which should go in. > I agree that for the moment we probably just want something that works. I'd be just as happy to go with your patch as is for now. Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com
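Nick's question above — whether arch code should ever see VM_FAULT_WRITE — is what the inline-wrapper arrangement answers: only core mm calls __handle_mm_fault and sees the bit, while the handle_mm_fault that the arches call masks it off. A minimal userspace sketch of that split (function names model the ones in the thread; the constant values are illustrative, not copied from any particular mm.h):

```c
#include <assert.h>

#define VM_FAULT_MINOR 0x01
#define VM_FAULT_WRITE 0x04  /* flag bit OR'd into the classic codes */

/* Model of the core entry point: may return MINOR with the WRITE bit
 * OR'd in, meaning do_wp_page broke COW on behalf of the caller. */
int __handle_mm_fault_model(int write_access)
{
    int ret = VM_FAULT_MINOR;
    if (write_access)
        ret |= VM_FAULT_WRITE;
    return ret;
}

/* Model of what the arch fault handlers call: the flag never escapes
 * to them, so their existing switch statements keep seeing only the
 * classic codes. */
static inline int handle_mm_fault_model(int write_access)
{
    return __handle_mm_fault_model(write_access) & ~VM_FAULT_WRITE;
}
```

Only get_user_pages needs the extra information, so hiding the bit behind the wrapper avoids touching every architecture in one go.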
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 10:24 ` Nick Piggin 2005-08-03 11:47 ` Hugh Dickins @ 2005-08-03 16:12 ` Linus Torvalds 2005-08-03 16:39 ` Linus Torvalds ` (3 more replies) 1 sibling, 4 replies; 72+ messages in thread From: Linus Torvalds @ 2005-08-03 16:12 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath On Wed, 3 Aug 2005, Nick Piggin wrote: > > Oh, it gets rid of the -1 for VM_FAULT_OOM. Doesn't seem like there > is a good reason for it, but might that break out of tree drivers? Ok, I applied this because it was reasonably pretty and I liked the approach. It seems buggy, though, since it was using "switch ()" to test the bits (wrongly, afaik), and I'm going to apply the appended on top of it. Holler quickly if you disagreee.. Linus ---- diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -949,6 +949,8 @@ int get_user_pages(struct task_struct *t cond_resched_lock(&mm->page_table_lock); while (!(page = follow_page(mm, start, write_access))) { + int ret; + /* * Shortcut for anonymous pages. We don't want * to force the creation of pages tables for @@ -961,16 +963,18 @@ int get_user_pages(struct task_struct *t break; } spin_unlock(&mm->page_table_lock); - switch (__handle_mm_fault(mm, vma, start, - write_access)) { - case VM_FAULT_WRITE: - /* - * do_wp_page has broken COW when - * necessary, even if maybe_mkwrite - * decided not to set pte_write - */ + ret = __handle_mm_fault(mm, vma, start, write_access); + + /* + * The VM_FAULT_WRITE bit tells us that do_wp_page has + * broken COW when necessary, even if maybe_mkwrite + * decided not to set pte_write. We can thus safely do + * subsequent page lookups as if they were reads. 
+ */ + if (ret & VM_FAULT_WRITE) write_access = 0; - /* FALLTHRU */ + + switch (ret & ~VM_FAULT_WRITE) { case VM_FAULT_MINOR: tsk->min_flt++; break;
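Linus's fix-up above can be illustrated with a standalone sketch (userspace C; the constant values are illustrative and VM_FAULT_WRITE is modelled as a flag bit, per the patch's comment): a plain switch on the raw return value would miss VM_FAULT_MINOR | VM_FAULT_WRITE entirely, since that combined value matches no case label, so the bit must be peeled off before dispatching.

```c
#include <assert.h>

#define VM_FAULT_SIGBUS 0x00
#define VM_FAULT_MINOR  0x01
#define VM_FAULT_MAJOR  0x02
#define VM_FAULT_OOM    0x03
#define VM_FAULT_WRITE  0x04  /* flag bit: do_wp_page has broken COW */

/* Dispatch the way the fixed get_user_pages() does: strip the WRITE
 * bit first, then switch on the remaining classic code.  Returns 0 on
 * success, -1 for SIGBUS, -2 for OOM. */
int account_fault(int ret, int *write_access, int *min_flt, int *maj_flt)
{
    if (ret & VM_FAULT_WRITE)
        *write_access = 0;   /* subsequent lookups may safely be reads */

    switch (ret & ~VM_FAULT_WRITE) {
    case VM_FAULT_MINOR:  (*min_flt)++; return 0;
    case VM_FAULT_MAJOR:  (*maj_flt)++; return 0;
    case VM_FAULT_SIGBUS: return -1;
    default:              return -2;   /* OOM */
    }
}
```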
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 16:12 ` Linus Torvalds @ 2005-08-03 16:39 ` Linus Torvalds 2005-08-03 16:42 ` Linus Torvalds 2005-08-03 17:12 ` Hugh Dickins ` (2 subsequent siblings) 3 siblings, 1 reply; 72+ messages in thread From: Linus Torvalds @ 2005-08-03 16:39 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath On Wed, 3 Aug 2005, Linus Torvalds wrote: > > Ok, I applied this because it was reasonably pretty and I liked the > approach. It seems buggy, though, since it was using "switch ()" to test > the bits (wrongly, afaik), and I'm going to apply the appended on top of > it. Holler quickly if you disagree.. Another problem I just thought of. What about PROT_NONE pages? Afaik, they have _exactly_ the same issue on the read side as a PROT_READ page has on the write side: pte_read() may fail on them. What makes us not get into an infinite loop there? Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 16:39 ` Linus Torvalds @ 2005-08-03 16:42 ` Linus Torvalds 0 siblings, 0 replies; 72+ messages in thread From: Linus Torvalds @ 2005-08-03 16:42 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath On Wed, 3 Aug 2005, Linus Torvalds wrote: > > What makes us not get into an infinite loop there? Ahh, never mind, I didn't notice that we never set the "read" thing at all, so ptrace will never care if it's readable or not. No problem. Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 16:12 ` Linus Torvalds 2005-08-03 16:39 ` Linus Torvalds @ 2005-08-03 17:12 ` Hugh Dickins 2005-08-03 23:03 ` Nick Piggin 2005-08-04 14:14 ` Alexander Nyberg 3 siblings, 0 replies; 72+ messages in thread From: Hugh Dickins @ 2005-08-03 17:12 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath On Wed, 3 Aug 2005, Linus Torvalds wrote: > > Ok, I applied this because it was reasonably pretty and I liked the > approach. It seems buggy, though, since it was using "switch ()" to test > the bits (wrongly, afaik), and I'm going to apply the appended on top of > it. Holler quickly if you disagree.. Looks right to me. Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 16:12 ` Linus Torvalds 2005-08-03 16:39 ` Linus Torvalds 2005-08-03 17:12 ` Hugh Dickins @ 2005-08-03 23:03 ` Nick Piggin 2005-08-04 14:14 ` Alexander Nyberg 3 siblings, 0 replies; 72+ messages in thread From: Nick Piggin @ 2005-08-03 23:03 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath Linus Torvalds wrote: > > On Wed, 3 Aug 2005, Nick Piggin wrote: > >>Oh, it gets rid of the -1 for VM_FAULT_OOM. Doesn't seem like there >>is a good reason for it, but might that break out of tree drivers? > > > Ok, I applied this because it was reasonably pretty and I liked the > approach. It seems buggy, though, since it was using "switch ()" to test > the bits (wrongly, afaik), Oops, thanks. > and I'm going to apply the appended on top of > it. Holler quickly if you disagree.. > No that looks fine. Should really be credited to Hugh... well I guess everyone had some input into it though (Andrew, Hugh, you, me). It probably doesn't matter too much. Thanks everyone. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 16:12 ` Linus Torvalds ` (2 preceding siblings ...) 2005-08-03 23:03 ` Nick Piggin @ 2005-08-04 14:14 ` Alexander Nyberg 2005-08-04 14:30 ` Nick Piggin 3 siblings, 1 reply; 72+ messages in thread From: Alexander Nyberg @ 2005-08-04 14:14 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath, Andi Kleen On Wed, Aug 03, 2005 at 09:12:37AM -0700 Linus Torvalds wrote: > > > On Wed, 3 Aug 2005, Nick Piggin wrote: > > > > Oh, it gets rid of the -1 for VM_FAULT_OOM. Doesn't seem like there > > is a good reason for it, but might that break out of tree drivers? > > Ok, I applied this because it was reasonably pretty and I liked the > approach. It seems buggy, though, since it was using "switch ()" to test > the bits (wrongly, afaik), and I'm going to apply the appended on top of > it. Holler quickly if you disagreee.. > x86_64 had hardcoded the VM_ numbers so it broke down when the numbers were changed. Signed-off-by: Alexander Nyberg <alexn@telia.com> Index: linux-2.6/arch/x86_64/mm/fault.c =================================================================== --- linux-2.6.orig/arch/x86_64/mm/fault.c 2005-07-31 18:10:20.000000000 +0200 +++ linux-2.6/arch/x86_64/mm/fault.c 2005-08-04 16:04:59.000000000 +0200 @@ -439,13 +439,13 @@ * the fault. */ switch (handle_mm_fault(mm, vma, address, write)) { - case 1: + case VM_FAULT_MINOR: tsk->min_flt++; break; - case 2: + case VM_FAULT_MAJOR: tsk->maj_flt++; break; - case 0: + case VM_FAULT_SIGBUS: goto do_sigbus; default: goto out_of_memory; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-04 14:14 ` Alexander Nyberg @ 2005-08-04 14:30 ` Nick Piggin 2005-08-04 15:00 ` Alexander Nyberg 2005-08-04 16:29 ` Russell King 0 siblings, 2 replies; 72+ messages in thread From: Nick Piggin @ 2005-08-04 14:30 UTC (permalink / raw) To: Alexander Nyberg Cc: Linus Torvalds, Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath, Andi Kleen Alexander Nyberg wrote: > On Wed, Aug 03, 2005 at 09:12:37AM -0700 Linus Torvalds wrote: > > >> >>Ok, I applied this because it was reasonably pretty and I liked the >>approach. It seems buggy, though, since it was using "switch ()" to test >>the bits (wrongly, afaik), and I'm going to apply the appended on top of >>it. Holler quickly if you disagree.. >> > > > x86_64 had hardcoded the VM_ numbers so it broke down when the numbers > were changed. > Ugh, sorry I should have audited this but I really wasn't expecting it (famous last words). Hasn't been a good week for me. parisc, cris, m68k, frv, sh64, arm26 are also broken. Would you mind resending a patch that fixes them all? Thanks, Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-04 14:30 ` Nick Piggin @ 2005-08-04 15:00 ` Alexander Nyberg 2005-08-04 15:35 ` Hugh Dickins 2005-08-04 15:36 ` Linus Torvalds 2005-08-04 16:29 ` Russell King 1 sibling, 2 replies; 72+ messages in thread From: Alexander Nyberg @ 2005-08-04 15:00 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath, Andi Kleen > > > >x86_64 had hardcoded the VM_ numbers so it broke down when the numbers > >were changed. > > > > Ugh, sorry I should have audited this but I really wasn't expecting > it (famous last words). Hasn't been a good week for me. Hardcoding is evil so it's good it gets cleaned up anyway. > parisc, cris, m68k, frv, sh64, arm26 are also broken. > Would you mind resending a patch that fixes them all? > Remove the hardcoding in return value checking of handle_mm_fault() Signed-off-by: Alexander Nyberg <alexn@telia.com> arm26/mm/fault.c | 6 +++--- cris/mm/fault.c | 6 +++--- frv/mm/fault.c | 6 +++--- m68k/mm/fault.c | 6 +++--- parisc/mm/fault.c | 6 +++--- sh64/mm/fault.c | 6 +++--- x86_64/mm/fault.c | 6 +++--- 7 files changed, 21 insertions(+), 21 deletions(-) Index: linux-2.6/arch/x86_64/mm/fault.c =================================================================== --- linux-2.6.orig/arch/x86_64/mm/fault.c 2005-07-31 18:10:20.000000000 +0200 +++ linux-2.6/arch/x86_64/mm/fault.c 2005-08-04 16:04:59.000000000 +0200 @@ -439,13 +439,13 @@ * the fault. 
*/ switch (handle_mm_fault(mm, vma, address, write)) { - case 1: + case VM_FAULT_MINOR: tsk->min_flt++; break; - case 2: + case VM_FAULT_MAJOR: tsk->maj_flt++; break; - case 0: + case VM_FAULT_SIGBUS: goto do_sigbus; default: goto out_of_memory; Index: linux-2.6/arch/cris/mm/fault.c =================================================================== --- linux-2.6.orig/arch/cris/mm/fault.c 2005-07-31 18:10:02.000000000 +0200 +++ linux-2.6/arch/cris/mm/fault.c 2005-08-04 16:40:56.000000000 +0200 @@ -284,13 +284,13 @@ */ switch (handle_mm_fault(mm, vma, address, writeaccess & 1)) { - case 1: + case VM_FAULT_MINOR: tsk->min_flt++; break; - case 2: + case VM_FAULT_MAJOR: tsk->maj_flt++; break; - case 0: + case VM_FAULT_SIGBUS: goto do_sigbus; default: goto out_of_memory; Index: linux-2.6/arch/m68k/mm/fault.c =================================================================== --- linux-2.6.orig/arch/m68k/mm/fault.c 2005-07-31 18:10:05.000000000 +0200 +++ linux-2.6/arch/m68k/mm/fault.c 2005-08-04 16:42:05.000000000 +0200 @@ -160,13 +160,13 @@ printk("handle_mm_fault returns %d\n",fault); #endif switch (fault) { - case 1: + case VM_FAULT_MINOR: current->min_flt++; break; - case 2: + case VM_FAULT_MAJOR: current->maj_flt++; break; - case 0: + case VM_FAULT_SIGBUS: goto bus_err; default: goto out_of_memory; Index: linux-2.6/arch/parisc/mm/fault.c =================================================================== --- linux-2.6.orig/arch/parisc/mm/fault.c 2005-07-31 18:10:11.000000000 +0200 +++ linux-2.6/arch/parisc/mm/fault.c 2005-08-04 16:41:18.000000000 +0200 @@ -178,13 +178,13 @@ */ switch (handle_mm_fault(mm, vma, address, (acc_type & VM_WRITE) != 0)) { - case 1: + case VM_FAULT_MINOR: ++current->min_flt; break; - case 2: + case VM_FAULT_MAJOR: ++current->maj_flt; break; - case 0: + case VM_FAULT_SIGBUS: /* * We ran out of memory, or some other thing happened * to us that made us unable to handle the page fault Index: linux-2.6/arch/arm26/mm/fault.c 
=================================================================== --- linux-2.6.orig/arch/arm26/mm/fault.c 2005-07-31 18:10:00.000000000 +0200 +++ linux-2.6/arch/arm26/mm/fault.c 2005-08-04 16:46:18.000000000 +0200 @@ -176,12 +176,12 @@ * Handle the "normal" cases first - successful and sigbus */ switch (fault) { - case 2: + case VM_FAULT_MAJOR: tsk->maj_flt++; return fault; - case 1: + case VM_FAULT_MINOR: tsk->min_flt++; - case 0: + case VM_FAULT_SIGBUS: return fault; } Index: linux-2.6/arch/frv/mm/fault.c =================================================================== --- linux-2.6.orig/arch/frv/mm/fault.c 2005-07-31 18:10:03.000000000 +0200 +++ linux-2.6/arch/frv/mm/fault.c 2005-08-04 16:44:02.000000000 +0200 @@ -163,13 +163,13 @@ * the fault. */ switch (handle_mm_fault(mm, vma, ear0, write)) { - case 1: + case VM_FAULT_MINOR: current->min_flt++; break; - case 2: + case VM_FAULT_MAJOR: current->maj_flt++; break; - case 0: + case VM_FAULT_SIGBUS: goto do_sigbus; default: goto out_of_memory; Index: linux-2.6/arch/sh64/mm/fault.c =================================================================== --- linux-2.6.orig/arch/sh64/mm/fault.c 2005-07-31 18:10:16.000000000 +0200 +++ linux-2.6/arch/sh64/mm/fault.c 2005-08-04 16:44:58.000000000 +0200 @@ -223,13 +223,13 @@ */ survive: switch (handle_mm_fault(mm, vma, address, writeaccess)) { - case 1: + case VM_FAULT_MINOR: tsk->min_flt++; break; - case 2: + case VM_FAULT_MAJOR: tsk->maj_flt++; break; - case 0: + case VM_FAULT_SIGBUS: goto do_sigbus; default: goto out_of_memory; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-04 15:00 ` Alexander Nyberg @ 2005-08-04 15:35 ` Hugh Dickins 2005-08-04 16:32 ` Russell King 2005-08-04 15:36 ` Linus Torvalds 1 sibling, 1 reply; 72+ messages in thread From: Hugh Dickins @ 2005-08-04 15:35 UTC (permalink / raw) To: Alexander Nyberg, Linus Torvalds Cc: Nick Piggin, Russell King, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath, Andi Kleen On Thu, 4 Aug 2005, Alexander Nyberg wrote: > > Hardcoding is evil so it's good it gets cleaned up anyway. > > > parisc, cris, m68k, frv, sh64, arm26 are also broken. > > Would you mind resending a patch that fixes them all? > > Remove the hardcoding in return value checking of handle_mm_fault() Your patch looks right to me, and bless you for catching this. But it does get into changing lots of arches, which we were trying to avoid at this moment. Well, that's up to Linus. And it does miss arm, the only arch which actually needs changing right now, if we simply restore the original values which Nick shifted - although arm references the VM_FAULT_ codes in some places, it also uses "> 0". arm26 looks at first as if it needs changing too, but a closer look shows it's remapping the faults and is okay - agreed? I suggest for now the patch below, which does need to be applied for the arm case, and makes applying your good cleanup less urgent. Restore VM_FAULT_SIGBUS, VM_FAULT_MINOR and VM_FAULT_MAJOR to their original values, so that arches which have them hardcoded will still work before they're cleaned up. And correct arm to use the VM_FAULT_ codes throughout, not assuming MINOR and MAJOR are the only ones > 0. 
Signed-off-by: Hugh Dickins <hugh@veritas.com> --- 2.6.13-rc5-git2/arch/arm/mm/fault.c 2005-08-02 12:06:28.000000000 +0100 +++ linux/arch/arm/mm/fault.c 2005-08-04 16:06:57.000000000 +0100 @@ -240,8 +240,11 @@ do_page_fault(unsigned long addr, unsign /* * Handle the "normal" case first */ - if (fault > 0) + switch (fault) { + case VM_FAULT_MINOR: + case VM_FAULT_MAJOR: return 0; + } /* * If we are in kernel mode at this point, we @@ -261,7 +264,7 @@ do_page_fault(unsigned long addr, unsign do_exit(SIGKILL); return 0; - case 0: + case VM_FAULT_SIGBUS: /* * We had some memory, but were unable to * successfully fix up this page fault. --- 2.6.13-rc5-git2/include/linux/mm.h 2005-08-04 15:20:20.000000000 +0100 +++ linux/include/linux/mm.h 2005-08-04 15:52:34.000000000 +0100 @@ -625,10 +625,10 @@ static inline int page_mapped(struct pag * Used to decide whether a process gets delivered SIGBUS or * just gets major/minor fault counters bumped up. */ -#define VM_FAULT_OOM 0x00 -#define VM_FAULT_SIGBUS 0x01 -#define VM_FAULT_MINOR 0x02 -#define VM_FAULT_MAJOR 0x03 +#define VM_FAULT_SIGBUS 0x00 +#define VM_FAULT_MINOR 0x01 +#define VM_FAULT_MAJOR 0x02 +#define VM_FAULT_OOM 0x03 /* * Special case for get_user_pages. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
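The compatibility constraint Hugh's patch relies on — the classic codes keeping their old numeric values until every arch is converted away from hardcoding — could be pinned down at compile time. A hypothetical safeguard, not part of the posted patch (shown with C11 _Static_assert, which 2.6-era kernels did not have; they would have used a BUILD_BUG_ON-style trick instead):

```c
#include <assert.h>

/* Values as restored by Hugh's patch to include/linux/mm.h. */
#define VM_FAULT_SIGBUS 0x00
#define VM_FAULT_MINOR  0x01
#define VM_FAULT_MAJOR  0x02
#define VM_FAULT_OOM    0x03

/* Arches that still hardcode 0/1/2 keep working only while these hold;
 * moving OOM to the end is safe because no arch hardcoded it (it used
 * to be the "default:" / -1 case). */
_Static_assert(VM_FAULT_SIGBUS == 0, "hardcoded arches expect SIGBUS == 0");
_Static_assert(VM_FAULT_MINOR == 1, "hardcoded arches expect MINOR == 1");
_Static_assert(VM_FAULT_MAJOR == 2, "hardcoded arches expect MAJOR == 2");
```

Such a check would have turned Alexander's runtime breakage into a build failure the moment the constants were shifted.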
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-04 15:35 ` Hugh Dickins @ 2005-08-04 16:32 ` Russell King 0 siblings, 0 replies; 72+ messages in thread From: Russell King @ 2005-08-04 16:32 UTC (permalink / raw) To: Hugh Dickins Cc: Alexander Nyberg, Linus Torvalds, Nick Piggin, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath, Andi Kleen On Thu, Aug 04, 2005 at 04:35:06PM +0100, Hugh Dickins wrote: > And it does miss arm, the only arch which actually needs changing > right now, if we simply restore the original values which Nick shifted > - although arm references the VM_FAULT_ codes in some places, it also > uses "> 0". arm26 looks at first as if it needs changing too, but > a closer look shows it's remapping the faults and is okay - agreed? Your patch doesn't look right. Firstly, I'd rather stay away from switch() if at all possible - past experience has shown that it generates inherently poor code on ARM. Whether that's still true or not I've no idea, but I don't particularly want to find out at the moment. > Restore VM_FAULT_SIGBUS, VM_FAULT_MINOR and VM_FAULT_MAJOR to their > original values, so that arches which have them hardcoded will still > work before they're cleaned up. And correct arm to use the VM_FAULT_ > codes throughout, not assuming MINOR and MAJOR are the only ones > 0. And the above rules this out. As I say, I fixed ARM this morning, so changing these constants will break it again. Let's just wait for things to stabilise instead of trying to race with architecture maintainers... -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: 2.6 Serial core
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-04 15:00 ` Alexander Nyberg 2005-08-04 15:35 ` Hugh Dickins @ 2005-08-04 15:36 ` Linus Torvalds 1 sibling, 0 replies; 72+ messages in thread From: Linus Torvalds @ 2005-08-04 15:36 UTC (permalink / raw) To: Alexander Nyberg Cc: Nick Piggin, Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath, Andi Kleen On Thu, 4 Aug 2005, Alexander Nyberg wrote: > > Hardcoding is evil so it's good it gets cleaned up anyway. Yes. > > parisc, cris, m68k, frv, sh64, arm26 are also broken. Would you mind > > resending a patch that fixes them all? > > Remove the hardcoding in return value checking of handle_mm_fault() I only saw this one when I had already done it myself. Your arm26 conversion was only partial, btw. Notice how it returns the fault number (with a few extensions of its own) and processes it further? I think I got that right. Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-04 14:30 ` Nick Piggin 2005-08-04 15:00 ` Alexander Nyberg @ 2005-08-04 16:29 ` Russell King 1 sibling, 0 replies; 72+ messages in thread From: Russell King @ 2005-08-04 16:29 UTC (permalink / raw) To: Nick Piggin Cc: Alexander Nyberg, Linus Torvalds, Hugh Dickins, Martin Schwidefsky, Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Roland McGrath, Andi Kleen On Fri, Aug 05, 2005 at 12:30:07AM +1000, Nick Piggin wrote: > Alexander Nyberg wrote: > > On Wed, Aug 03, 2005 at 09:12:37AM -0700 Linus Torvalds wrote: > > > > > >> > >>Ok, I applied this because it was reasonably pretty and I liked the > >>approach. It seems buggy, though, since it was using "switch ()" to test > >>the bits (wrongly, afaik), and I'm going to apply the appended on top of > >>it. Holler quickly if you disagree.. > >> > > > > > > x86_64 had hardcoded the VM_ numbers so it broke down when the numbers > > were changed. > > > > Ugh, sorry I should have audited this but I really wasn't expecting > it (famous last words). Hasn't been a good week for me. > > parisc, cris, m68k, frv, sh64, arm26 are also broken. > Would you mind resending a patch that fixes them all? ARM as well - fix is pending Linus pulling my tree... -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: 2.6 Serial core
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 20:55 ` Hugh Dickins 2005-08-03 10:24 ` Nick Piggin @ 2005-08-03 10:24 ` Martin Schwidefsky 2005-08-03 11:57 ` Hugh Dickins 1 sibling, 1 reply; 72+ messages in thread From: Martin Schwidefsky @ 2005-08-03 10:24 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath, Linus Torvalds Hugh Dickins <hugh@veritas.com> wrote on 08/02/2005 10:55:31 PM: > > Go for it, I think whatever we do won't be wonderfully pretty. > > Here we are: get_user_pages quite untested, let alone the racy case, > but I think it should work. Please all hack it around as you see fit, > I'll check mail when I get home, but won't be very responsive... Ahh, just tested it and everything seems to work (even for s390).. I'm happy :-) blue skies, Martin Martin Schwidefsky Linux for zSeries Development & Services IBM Deutschland Entwicklung GmbH
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 10:24 ` Martin Schwidefsky @ 2005-08-03 11:57 ` Hugh Dickins 0 siblings, 0 replies; 72+ messages in thread From: Hugh Dickins @ 2005-08-03 11:57 UTC (permalink / raw) To: Martin Schwidefsky Cc: Andrew Morton, Robin Holt, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath, Linus Torvalds On Wed, 3 Aug 2005, Martin Schwidefsky wrote: > Hugh Dickins <hugh@veritas.com> wrote on 08/02/2005 10:55:31 PM: > > > > Here we are: get_user_pages quite untested, let alone the racy case, > > Ahh, just tested it and everything seems to work (even for s390).. > I'm happy :-) Thanks for testing, Martin. Your happiness is my bliss. Whether we go with Nick's mods on mine or not, I think you can now safely assume we've given up demanding a sudden change at the s390 end. Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-02 15:30 ` Linus Torvalds 2005-08-02 16:03 ` Hugh Dickins @ 2005-08-02 16:44 ` Martin Schwidefsky 1 sibling, 0 replies; 72+ messages in thread From: Martin Schwidefsky @ 2005-08-02 16:44 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Robin Holt, Hugh Dickins, linux-kernel, linux-mm, Ingo Molnar, Nick Piggin, Roland McGrath Linus Torvalds <torvalds@osdl.org> wrote on 08/02/2005 05:30:37 PM: > > With the additional !pte_write(pte) check (and if I haven't overlooked > > something which is not unlikely) s390 should work fine even without the > > software-dirty bit hack. > > No it won't. It will just loop forever in a tight loop if somebody tries > to put a breakpoint on a read-only location. Yes, I have realized that as well now after staring at the code a little bit longer. That maybe_mkwrite is really tricky. > On the other hand, this being s390, maybe nobody cares? Some will care. At least I do. I've tested the latest git with gdb and it will indeed loop forever if I try to write to a read-only vma. blue skies, Martin Martin Schwidefsky Linux for zSeries Development & Services IBM Deutschland Entwicklung GmbH
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 8:21 ` [patch 2.6.13-rc4] fix get_user_pages bug Nick Piggin 2005-08-01 9:19 ` Ingo Molnar @ 2005-08-01 15:42 ` Linus Torvalds 2005-08-01 18:18 ` Linus Torvalds ` (3 more replies) 2005-08-01 20:03 ` Hugh Dickins 2 siblings, 4 replies; 72+ messages in thread From: Linus Torvalds @ 2005-08-01 15:42 UTC (permalink / raw) To: Nick Piggin Cc: Robin Holt, Andrew Morton, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel On Mon, 1 Aug 2005, Nick Piggin wrote: > > Not sure if this should be fixed for 2.6.13. It can result in > pagecache corruption: so I guess that answers my own question. Hell no. This patch is clearly untested and must _not_ be applied: + case VM_FAULT_RACE: + /* + * Someone else got there first. + * Must retry before we can assume + * that we have actually performed + * the write fault (below). + */ + if (write) + continue; + break; that "continue" will continue without the spinlock held, and now do follow_page() will run without page_table_lock, _and_ it will release the spinlock once more afterwards, so if somebody else is racing on this, we might remove the spinlock for them too. Don't do it. Instead, I'd suggest changing the logic for "lookup_write". Make it require that the page table entry is _dirty_ (not writable), and then remove the line that says: lookup_write = write && !force; and you're now done. A successful mm fault for write _should_ always have marked the PTE dirty (and yes, part of testing this would be to verify that this is true - but since architectures that don't have HW dirty bits depend on this anyway, I'm pretty sure it _is_ true). Ie something like the below (which is totally untested, obviously, but I think conceptually is a lot more correct, and obviously a lot simpler). 
Linus ---- diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -811,18 +811,15 @@ static struct page *__follow_page(struct pte = *ptep; pte_unmap(ptep); if (pte_present(pte)) { - if (write && !pte_write(pte)) + if (write && !pte_dirty(pte)) goto out; if (read && !pte_read(pte)) goto out; pfn = pte_pfn(pte); if (pfn_valid(pfn)) { page = pfn_to_page(pfn); - if (accessed) { - if (write && !pte_dirty(pte) &&!PageDirty(page)) - set_page_dirty(page); + if (accessed) mark_page_accessed(page); - } return page; } } @@ -972,14 +969,6 @@ int get_user_pages(struct task_struct *t default: BUG(); } - /* - * Now that we have performed a write fault - * and surely no longer have a shared page we - * shouldn't write, we shouldn't ignore an - * unwritable page in the page table if - * we are forcing write access. - */ - lookup_write = write && !force; spin_lock(&mm->page_table_lock); } if (pages) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
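The invariant Linus is leaning on in that patch can be modelled in a few lines (a userspace sketch with made-up pte flag bits; in the kernel these are the arch's pte_present/pte_write/pte_dirty tests): a completed write fault always dirties the pte, even when maybe_mkwrite() leaves it read-only, so requiring the dirty bit instead of the write bit lets the forced-write lookup succeed without looping.

```c
#include <assert.h>

#define PTE_PRESENT 0x1
#define PTE_WRITE   0x2
#define PTE_DIRTY   0x4

/* Model of a completed write fault: do_wp_page breaks COW and dirties
 * the pte, but may leave it !write when the vma is read-only (the
 * ptrace force-write case). */
unsigned write_fault(unsigned pte, int vma_writable)
{
    pte |= PTE_PRESENT | PTE_DIRTY;
    if (vma_writable)
        pte |= PTE_WRITE;
    return pte;
}

/* Linus's proposed __follow_page() test for a write lookup: dirty,
 * not writable. */
int follow_ok_for_write(unsigned pte)
{
    return (pte & PTE_PRESENT) && (pte & PTE_DIRTY);
}
```

With the old pte_write test, the ptrace case would fault, find the pte still not writable, and fault again forever; the dirty-bit test terminates after a single fault. (The flaws Hugh raises later — fork leaving ptes dirty but write-protected, and s390's dirty handling — are about the converse direction: a dirty pte is not always one whose COW has been broken.)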
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 15:42 ` Linus Torvalds @ 2005-08-01 18:18 ` Linus Torvalds 2005-08-03 8:24 ` Robin Holt 0 siblings, 1 reply; 72+ messages in thread From: Linus Torvalds @ 2005-08-01 18:18 UTC (permalink / raw) To: Nick Piggin Cc: Robin Holt, Andrew Morton, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel On Mon, 1 Aug 2005, Linus Torvalds wrote: > > Ie something like the below (which is totally untested, obviously, but I > think conceptually is a lot more correct, and obviously a lot simpler). I've tested it, and thought more about it, and I can't see any fault with the approach. In fact, I like it more. So it's checked in now (in a further simplified way, since the thing made "lookup_write" always be the same as just "write"). Can somebody who saw the problem in the first place please verify? Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 18:18 ` Linus Torvalds @ 2005-08-03 8:24 ` Robin Holt 2005-08-03 11:31 ` Hugh Dickins 0 siblings, 1 reply; 72+ messages in thread From: Robin Holt @ 2005-08-03 8:24 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Robin Holt, Andrew Morton, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel On Mon, Aug 01, 2005 at 11:18:42AM -0700, Linus Torvalds wrote: > On Mon, 1 Aug 2005, Linus Torvalds wrote: > > > > Ie something like the below (which is totally untested, obviously, but I > > think conceptually is a lot more correct, and obviously a lot simpler). > > I've tested it, and thought more about it, and I can't see any fault with > the approach. In fact, I like it more. So it's checked in now (in a > further simplified way, since the thing made "lookup_write" always be the > same as just "write"). > > Can somebody who saw the problem in the first place please verify? Unfortunately, I cannot get the user test to run against anything but the SLES9 SP2 kernel. I took the commit 4ceb5db9757aaeadcf8fbbf97d76bd42aa4df0d6 and applied that diff to the SUSE kernel. It does fix the problem the customer reported. Thanks, Robin Holt
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 8:24 ` Robin Holt @ 2005-08-03 11:31 ` Hugh Dickins 2005-08-04 11:48 ` Robin Holt 0 siblings, 1 reply; 72+ messages in thread From: Hugh Dickins @ 2005-08-03 11:31 UTC (permalink / raw) To: Robin Holt Cc: Linus Torvalds, Nick Piggin, Andrew Morton, Roland McGrath, linux-mm, linux-kernel On Wed, 3 Aug 2005, Robin Holt wrote: > On Mon, Aug 01, 2005 at 11:18:42AM -0700, Linus Torvalds wrote: > > > > Can somebody who saw the problem in the first place please verify? > > Unfortunately, I can not get the user test to run against anything but the > SLES9 SP2 kernel. I took the commit 4ceb5db9757aaeadcf8fbbf97d76bd42aa4df0d6 > and applied that diff to the SUSE kernel. It does fix the problem the > customer reported. Thanks for getting that tried, Robin, but I'm afraid you hadn't got far enough down your mailbox. There's a couple of flaws with Linus' patch above, one with respect to fork and one with respect to s390, so we're now intending to apply my patch below on top of Linus' (I see Nick has now come up with some refinement to mine, test his if you prefer it). Sorry to mess you back and forth, but 4ceb...f0d6 itself won't do, so we do rather need your customer to retest with the patch below too. Thanks, Hugh Checking pte_dirty instead of pte_write in __follow_page is problematic for s390, and for copy_one_pte which leaves dirty when clearing write. So revert __follow_page to check pte_write as before, and let do_wp_page pass back a special code VM_FAULT_WRITE to say it has done its full job (whereas VM_FAULT_MINOR when it backs out on race): once get_user_pages receives this value, it no longer requires pte_write in __follow_page. But most callers of handle_mm_fault, in the various architectures, have switch statements which do not expect this new case. 
To avoid changing them all in a hurry, only pass back VM_FAULT_WRITE when write_access arg says VM_FAULT_WRITE_EXPECTED - chosen as -1 since some arches pass write_access as a boolean, some as a bitflag, but none as -1. Yes, we do have a call to do_wp_page from do_swap_page, but no need to change that: in rare case it's needed, another do_wp_page will follow. Signed-off-by: Hugh Dickins <hugh@veritas.com> --- 2.6.13-rc5/include/linux/mm.h 2005-08-02 12:07:14.000000000 +0100 +++ linux/include/linux/mm.h 2005-08-02 21:14:58.000000000 +0100 @@ -629,6 +629,9 @@ static inline int page_mapped(struct pag #define VM_FAULT_SIGBUS 0 #define VM_FAULT_MINOR 1 #define VM_FAULT_MAJOR 2 +#define VM_FAULT_WRITE 3 /* special case for get_user_pages */ + +#define VM_FAULT_WRITE_EXPECTED (-1) /* only for get_user_pages */ #define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK) --- 2.6.13-rc5/mm/memory.c 2005-08-02 12:07:23.000000000 +0100 +++ linux/mm/memory.c 2005-08-02 21:14:26.000000000 +0100 @@ -811,15 +811,18 @@ static struct page *__follow_page(struct pte = *ptep; pte_unmap(ptep); if (pte_present(pte)) { - if (write && !pte_dirty(pte)) + if (write && !pte_write(pte)) goto out; if (read && !pte_read(pte)) goto out; pfn = pte_pfn(pte); if (pfn_valid(pfn)) { page = pfn_to_page(pfn); - if (accessed) + if (accessed) { + if (write && !pte_dirty(pte) &&!PageDirty(page)) + set_page_dirty(page); mark_page_accessed(page); + } return page; } } @@ -941,10 +944,11 @@ int get_user_pages(struct task_struct *t } spin_lock(&mm->page_table_lock); do { + int write_access = write? VM_FAULT_WRITE_EXPECTED: 0; struct page *page; cond_resched_lock(&mm->page_table_lock); - while (!(page = follow_page(mm, start, write))) { + while (!(page = follow_page(mm, start, write_access))) { /* * Shortcut for anonymous pages. 
We don't want * to force the creation of pages tables for @@ -957,7 +961,16 @@ int get_user_pages(struct task_struct *t break; } spin_unlock(&mm->page_table_lock); - switch (handle_mm_fault(mm,vma,start,write)) { + switch (handle_mm_fault(mm, vma, start, + write_access)) { + case VM_FAULT_WRITE: + /* + * do_wp_page has broken COW when + * necessary, even if maybe_mkwrite + * decided not to set pte_write + */ + write_access = 0; + /* FALLTHRU */ case VM_FAULT_MINOR: tsk->min_flt++; break; @@ -1220,6 +1233,7 @@ static int do_wp_page(struct mm_struct * struct page *old_page, *new_page; unsigned long pfn = pte_pfn(pte); pte_t entry; + int ret; if (unlikely(!pfn_valid(pfn))) { /* @@ -1247,7 +1261,7 @@ static int do_wp_page(struct mm_struct * lazy_mmu_prot_update(entry); pte_unmap(page_table); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return VM_FAULT_WRITE; } } pte_unmap(page_table); @@ -1274,6 +1288,7 @@ static int do_wp_page(struct mm_struct * /* * Re-check the pte - we dropped the lock */ + ret = VM_FAULT_MINOR; spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); if (likely(pte_same(*page_table, pte))) { @@ -1290,12 +1305,13 @@ static int do_wp_page(struct mm_struct * /* Free the old page.. 
*/ new_page = old_page; + ret = VM_FAULT_WRITE; } pte_unmap(page_table); page_cache_release(new_page); page_cache_release(old_page); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return ret; no_new_page: page_cache_release(old_page); @@ -1985,9 +2001,13 @@ static inline int handle_pte_fault(struc } if (write_access) { - if (!pte_write(entry)) - return do_wp_page(mm, vma, address, pte, pmd, entry); - + if (!pte_write(entry)) { + int ret = do_wp_page(mm, vma, address, pte, pmd, entry); + if (ret == VM_FAULT_WRITE && + write_access != VM_FAULT_WRITE_EXPECTED) + ret = VM_FAULT_MINOR; + return ret; + } entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry);
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-03 11:31 ` Hugh Dickins @ 2005-08-04 11:48 ` Robin Holt 2005-08-04 13:04 ` Hugh Dickins 0 siblings, 1 reply; 72+ messages in thread From: Robin Holt @ 2005-08-04 11:48 UTC (permalink / raw) To: Hugh Dickins Cc: Robin Holt, Linus Torvalds, Nick Piggin, Andrew Morton, Roland McGrath, linux-mm, linux-kernel On Wed, Aug 03, 2005 at 12:31:34PM +0100, Hugh Dickins wrote: > On Wed, 3 Aug 2005, Robin Holt wrote: > > On Mon, Aug 01, 2005 at 11:18:42AM -0700, Linus Torvalds wrote: > > > > > > Can somebody who saw the problem in the first place please verify? OK. I took the three commits: 4ceb5db9757aaeadcf8fbbf97d76bd42aa4df0d6 f33ea7f404e592e4563b12101b7a4d17da6558d7 a68d2ebc1581a3aec57bd032651e013fa609f530 I backported them to the SuSE SLES9SP2 kernel. I will add them at the end so you can tell me if I messed things up. I was then able to run the customer test application multiple times without issue. Before the fix, we had never achieved three consecutive runs that did not trip the fault. After the change, it has been in excess of thirty. I would say this has fixed the problem. Did I miss anything which needs to be tested?
Thanks, Robin Holt ----- Patches against SLES9SP2 7.191 kernel ----- Adapted for SLES9 SP2 kernel from X-Git-Url: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4ceb5db9757aaeadcf8fbbf97d76bd42aa4df0d6 Index: linux/mm/memory.c =================================================================== --- linux.orig/mm/memory.c 2005-08-03 16:18:31.553626601 -0500 +++ linux/mm/memory.c 2005-08-03 16:24:08.905247901 -0500 @@ -674,7 +674,7 @@ follow_page(struct mm_struct *mm, unsign pte = *ptep; pte_unmap(ptep); if (pte_present(pte)) { - if (write && !pte_write(pte)) + if (write && !pte_dirty(pte)) goto out; if (write && !pte_dirty(pte)) { struct page *page = pte_page(pte); @@ -814,8 +814,7 @@ int get_user_pages(struct task_struct *t spin_lock(&mm->page_table_lock); do { struct page *map; - int lookup_write = write; - while (!(map = follow_page(mm, start, lookup_write))) { + while (!(map = follow_page(mm, start, write))) { /* * Shortcut for anonymous pages. We don't want * to force the creation of pages tables for @@ -823,8 +822,7 @@ int get_user_pages(struct task_struct *t * nobody touched so far. This is important * for doing a core dump for these mappings. */ - if (!lookup_write && - untouched_anonymous_page(mm,vma,start)) { + if (!write && untouched_anonymous_page(mm,vma,start)) { map = ZERO_PAGE(start); break; } @@ -843,14 +841,6 @@ int get_user_pages(struct task_struct *t default: BUG(); } - /* - * Now that we have performed a write fault - * and surely no longer have a shared page we - * shouldn't write, we shouldn't ignore an - * unwritable page in the page table if - * we are forcing write access. 
- */ - lookup_write = write && !force; spin_lock(&mm->page_table_lock); } if (pages) { Adapted for SLES9 SP2 kernel from X-Git-Url: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f33ea7f404e592e4563b12101b7a4d17da6558d7 Index: linux/include/linux/mm.h =================================================================== --- linux.orig/include/linux/mm.h 2005-08-03 16:22:55.905794491 -0500 +++ linux/include/linux/mm.h 2005-08-03 16:24:48.123938110 -0500 @@ -651,10 +651,16 @@ static inline int page_mapped(struct pag * Used to decide whether a process gets delivered SIGBUS or * just gets major/minor fault counters bumped up. */ -#define VM_FAULT_OOM (-1) -#define VM_FAULT_SIGBUS 0 -#define VM_FAULT_MINOR 1 -#define VM_FAULT_MAJOR 2 +#define VM_FAULT_OOM 0x00 +#define VM_FAULT_SIGBUS 0x01 +#define VM_FAULT_MINOR 0x02 +#define VM_FAULT_MAJOR 0x03 + +/* + * Special case for get_user_pages. + * Must be in a distinct bit from the above VM_FAULT_ flags. + */ +#define VM_FAULT_WRITE 0x10 #define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK) @@ -704,7 +710,13 @@ extern pte_t *FASTCALL(pte_alloc_kernel( extern pte_t *FASTCALL(pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)); extern int install_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot); extern int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long pgoff, pgprot_t prot); -extern int handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, unsigned long address, int write_access); +extern int __handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma, unsigned long address, int write_access); + +static inline int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int write_access) +{ + return __handle_mm_fault(mm, vma, address, write_access) & (~VM_FAULT_WRITE); +} + extern void 
make_pages_present(unsigned long addr, unsigned long end); extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write); void put_dirty_page(struct task_struct *tsk, struct page *page, Index: linux/mm/memory.c =================================================================== --- linux.orig/mm/memory.c 2005-08-03 16:24:08.905247901 -0500 +++ linux/mm/memory.c 2005-08-03 16:29:01.901967868 -0500 @@ -674,7 +674,7 @@ follow_page(struct mm_struct *mm, unsign pte = *ptep; pte_unmap(ptep); if (pte_present(pte)) { - if (write && !pte_dirty(pte)) + if (write && !pte_write(pte)) goto out; if (write && !pte_dirty(pte)) { struct page *page = pte_page(pte); @@ -684,8 +684,9 @@ follow_page(struct mm_struct *mm, unsign pfn = pte_pfn(pte); if (pfn_valid(pfn)) { struct page *page = pfn_to_page(pfn); - - mark_page_accessed(page); + if (write && !pte_dirty(pte) &&!PageDirty(page)) + set_page_dirty(page); + mark_page_accessed(page); return page; } } @@ -813,8 +814,9 @@ int get_user_pages(struct task_struct *t } spin_lock(&mm->page_table_lock); do { + int write_access = write; struct page *map; - while (!(map = follow_page(mm, start, write))) { + while (!(map = follow_page(mm, start, write_access))) { /* * Shortcut for anonymous pages. 
We don't want * to force the creation of pages tables for @@ -827,7 +829,16 @@ int get_user_pages(struct task_struct *t break; } spin_unlock(&mm->page_table_lock); - switch (handle_mm_fault(mm,vma,start,write)) { + switch (__handle_mm_fault(mm, vma, start, + write_access)) { + case VM_FAULT_WRITE: + /* + * do_wp_page has broken COW when + * necessary, even if maybe_mkwrite + * decided not to set pte_write + */ + write_access = 0; + /* FALLTHRU */ case VM_FAULT_MINOR: tsk->min_flt++; break; @@ -1086,6 +1097,7 @@ static int do_wp_page(struct mm_struct * struct page *old_page, *new_page; unsigned long pfn = pte_pfn(pte); pte_t entry; + int ret; if (unlikely(!pfn_valid(pfn))) { /* @@ -1113,7 +1125,7 @@ static int do_wp_page(struct mm_struct * lazy_mmu_prot_update(entry); pte_unmap(page_table); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return VM_FAULT_MINOR|VM_FAULT_WRITE; } } pte_unmap(page_table); @@ -1135,6 +1147,7 @@ static int do_wp_page(struct mm_struct * /* * Re-check the pte - we dropped the lock */ + ret = VM_FAULT_MINOR; spin_lock(&mm->page_table_lock); page_table = pte_offset_map(pmd, address); if (likely(pte_same(*page_table, pte))) { @@ -1153,12 +1166,13 @@ static int do_wp_page(struct mm_struct * /* Free the old page.. 
*/ new_page = old_page; + ret |= VM_FAULT_WRITE; } pte_unmap(page_table); page_cache_release(new_page); page_cache_release(old_page); spin_unlock(&mm->page_table_lock); - return VM_FAULT_MINOR; + return ret; no_new_page: page_cache_release(old_page); @@ -1753,7 +1767,6 @@ static inline int handle_pte_fault(struc if (write_access) { if (!pte_write(entry)) return do_wp_page(mm, vma, address, pte, pmd, entry); - entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); @@ -1777,7 +1790,7 @@ int __attribute__((weak)) arch_hugetlb_f /* * By the time we get here, we already hold the mm semaphore */ -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma, +int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct * vma, unsigned long address, int write_access) { pgd_t *pgd; Adapted for SLES9 SP2 kernel from X-Git-Url: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a68d2ebc1581a3aec57bd032651e013fa609f530 Index: linux/mm/memory.c =================================================================== --- linux.orig/mm/memory.c 2005-08-03 16:29:13.789434124 -0500 +++ linux/mm/memory.c 2005-08-03 16:29:13.838257298 -0500 @@ -817,6 +817,8 @@ int get_user_pages(struct task_struct *t int write_access = write; struct page *map; while (!(map = follow_page(mm, start, write_access))) { + int ret; + /* * Shortcut for anonymous pages. We don't want * to force the creation of pages tables for @@ -829,16 +831,18 @@ int get_user_pages(struct task_struct *t break; } spin_unlock(&mm->page_table_lock); - switch (__handle_mm_fault(mm, vma, start, - write_access)) { - case VM_FAULT_WRITE: - /* - * do_wp_page has broken COW when - * necessary, even if maybe_mkwrite - * decided not to set pte_write - */ + ret = __handle_mm_fault(mm, vma, start, write_access); + + /* + * The VM_FAULT_WRITE bit tells us that do_wp_page has + * broken COW when necessary, even if maybe_mkwrite + * decided not to set pte_write. 
We can thus safely do + * subsequent page lookups as if they were reads. + */ + if (ret & VM_FAULT_WRITE) write_access = 0; - /* FALLTHRU */ + + switch (ret & ~VM_FAULT_WRITE) { case VM_FAULT_MINOR: tsk->min_flt++; break; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-04 11:48 ` Robin Holt @ 2005-08-04 13:04 ` Hugh Dickins 0 siblings, 0 replies; 72+ messages in thread From: Hugh Dickins @ 2005-08-04 13:04 UTC (permalink / raw) To: Robin Holt Cc: Linus Torvalds, Nick Piggin, Andrew Morton, Roland McGrath, linux-mm, linux-kernel On Thu, 4 Aug 2005, Robin Holt wrote: > On Wed, Aug 03, 2005 at 12:31:34PM +0100, Hugh Dickins wrote: > > On Wed, 3 Aug 2005, Robin Holt wrote: > > > On Mon, Aug 01, 2005 at 11:18:42AM -0700, Linus Torvalds wrote: > > > > > > > > Can somebody who saw the problem in the first place please verify? > > OK. I took the three commits: > 4ceb5db9757aaeadcf8fbbf97d76bd42aa4df0d6 > f33ea7f404e592e4563b12101b7a4d17da6558d7 > a68d2ebc1581a3aec57bd032651e013fa609f530 > > I backported them to the SuSE SLES9SP2 kernel. I will add them at > the end so you can tell me if I messed things up. I was then able > to run the customer test application multiple times without issue. > Before the fix, we had never achieved three consecutive runs that did > not trip the fault. After the change, it has been in excess of thirty. > I would say this has fixed the problem. Did I miss anything which > needs to be tested? Great, thanks for the testing, the set of patches you tested is correct. (I think there is one minor anomaly in your patch backport, which has no effect on the validity of your testing: you've probably ended up with two separate calls to set_page_dirty in your follow_page, because that moved between 2.6.5 and 2.6.13. It doesn't matter, but you might want to tidy that up one way or the other if you're passing your patch on further.) Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 15:42 ` Linus Torvalds 2005-08-01 18:18 ` Linus Torvalds @ 2005-08-01 19:29 ` Hugh Dickins 2005-08-01 19:48 ` Linus Torvalds 2005-08-01 19:57 ` Andrew Morton 2005-08-02 0:14 ` Nick Piggin 2005-08-02 1:27 ` Nick Piggin 3 siblings, 2 replies; 72+ messages in thread From: Hugh Dickins @ 2005-08-01 19:29 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Robin Holt, Andrew Morton, Roland McGrath, Martin Schwidefsky, linux-mm, linux-kernel On Mon, 1 Aug 2005, Linus Torvalds wrote: > > that "continue" will continue without the spinlock held, and now do Yes, I was at last about to reply on that point and others. I'll make those comments in a separate mail to Nick and all. > Instead, I'd suggest changing the logic for "lookup_write". Make it > require that the page table entry is _dirty_ (not writable), and then Attractive, I very much wanted to do that rather than change all the arches, but I think s390 rules it out: its pte_mkdirty does nothing, its pte_dirty just says no. Whether your patch suits all other uses of (__)follow_page I've not investigated (and I don't see how you can go without the set_page_dirty if it was necessary before); but at present see no alternative to something like Nick's patch, though I'd much prefer smaller. Or should we change s390 to set a flag in the pte just for this purpose? Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 19:29 ` Hugh Dickins @ 2005-08-01 19:48 ` Linus Torvalds 2005-08-02 8:07 ` Martin Schwidefsky 2005-08-01 19:57 ` Andrew Morton 1 sibling, 1 reply; 72+ messages in thread From: Linus Torvalds @ 2005-08-01 19:48 UTC (permalink / raw) To: Hugh Dickins Cc: Nick Piggin, Robin Holt, Andrew Morton, Roland McGrath, Martin Schwidefsky, linux-mm, linux-kernel On Mon, 1 Aug 2005, Hugh Dickins wrote: > > Attractive, I very much wanted to do that rather than change all the > arches, but I think s390 rules it out: its pte_mkdirty does nothing, > its pte_dirty just says no. How does s390 work at all? > Or should we change s390 to set a flag in the pte just for this purpose? If the choice is between a broken and ugly implementation for everybody else, then hell yes. Even if it's a purely sw bit that nothing else actually cares about.. I hope they have an extra bit around somewhere. Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 19:48 ` Linus Torvalds @ 2005-08-02 8:07 ` Martin Schwidefsky 0 siblings, 0 replies; 72+ messages in thread From: Martin Schwidefsky @ 2005-08-02 8:07 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Robin Holt, Hugh Dickins, linux-kernel, linux-mm, Nick Piggin, Roland McGrath Linus Torvalds <torvalds@osdl.org> wrote on 08/01/2005 09:48:40 PM: > > Attractive, I very much wanted to do that rather than change all the > > arches, but I think s390 rules it out: its pte_mkdirty does nothing, > > its pte_dirty just says no. > > How does s390 work at all? The big difference between s390 and your standard architecture is that s390 keeps the dirty and reference bits in the storage key. That is per physical page and not per mapping. The primitive pte_dirty() just doesn't make any sense for s390. A pte never contains any information about dirty/reference state of a page. The "page" itself contains it, you access the information with some instructions (sske, iske & rrbe) which get the page frame address as parameter. > > Or should we change s390 to set a flag in the pte just for this purpose? > > If the choice is between a broken and ugly implementation for everybody > else, then hell yes. Even if it's a purely sw bit that nothing else > actually cares about.. I hope they have an extra bit around somewhere. Urg, depending on the pte type there are no bits available. For valid ptes there are some bits we could use but it wouldn't be nice. blue skies, Martin Martin Schwidefsky Linux for zSeries Development & Services IBM Deutschland Entwicklung GmbH
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 19:29 ` Hugh Dickins 2005-08-01 19:48 ` Linus Torvalds @ 2005-08-01 19:57 ` Andrew Morton 2005-08-01 20:16 ` Linus Torvalds 1 sibling, 1 reply; 72+ messages in thread From: Andrew Morton @ 2005-08-01 19:57 UTC (permalink / raw) To: Hugh Dickins Cc: torvalds, nickpiggin, holt, roland, schwidefsky, linux-mm, linux-kernel Hugh Dickins <hugh@veritas.com> wrote: > > On Mon, 1 Aug 2005, Linus Torvalds wrote: > > > > that "continue" will continue without the spinlock held, and now do > > Yes, I was at last about to reply on that point and others. > I'll make those comments in a separate mail to Nick and all. > > > Instead, I'd suggest changing the logic for "lookup_write". Make it > > require that the page table entry is _dirty_ (not writable), and then > > Attractive, I very much wanted to do that rather than change all the > arches, but I think s390 rules it out: its pte_mkdirty does nothing, > its pte_dirty just says no. > > Whether your patch suits all other uses of (__)follow_page I've not > investigated (and I don't see how you can go without the set_page_dirty > if it was necessary before); That was introduced 19 months ago by the s390 guys (see patch below). I don't really see why Martin decided to mark the page software-dirty at that stage. It's a nice thing to do from the VM dirty-memory accounting POV, but I don't see that it's essential. > but at present see no alternative to > something like Nick's patch, though I'd much prefer smaller. > > Or should we change s390 to set a flag in the pte just for this purpose? That would be a good approach IMO, if possible. From: Martin Schwidefsky <schwidefsky@de.ibm.com> Fix endless loop in get_user_pages() on s390. It happens only on s/390 because pte_dirty always returns 0. For all other architectures this is an optimization. In the case of "write && !pte_dirty(pte)" follow_page() returns NULL. 
On all architectures except s390 handle_pte_fault() will then create a pte with pte_dirty(pte)==1 because write_access==1. In the following, second call to follow_page() all is fine. With the physical dirty bit patch pte_dirty() is always 0 for s/390 because the dirty bit doesn't live in the pte. --- mm/memory.c | 21 +++++++++++++-------- 1 files changed, 13 insertions(+), 8 deletions(-) diff -puN mm/memory.c~s390-16-follow_page-lockup-fix mm/memory.c --- 25/mm/memory.c~s390-16-follow_page-lockup-fix 2004-01-18 22:36:00.000000000 -0800 +++ 25-akpm/mm/memory.c 2004-01-18 22:36:00.000000000 -0800 @@ -651,14 +651,19 @@ follow_page(struct mm_struct *mm, unsign pte = *ptep; pte_unmap(ptep); if (pte_present(pte)) { - if (!write || (pte_write(pte) && pte_dirty(pte))) { - pfn = pte_pfn(pte); - if (pfn_valid(pfn)) { - struct page *page = pfn_to_page(pfn); - - mark_page_accessed(page); - return page; - } + if (write && !pte_write(pte)) + goto out; + if (write && !pte_dirty(pte)) { + struct page *page = pte_page(pte); + if (!PageDirty(page)) + set_page_dirty(page); + } + pfn = pte_pfn(pte); + if (pfn_valid(pfn)) { + struct page *page = pfn_to_page(pfn); + + mark_page_accessed(page); + return page; } } _
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 19:57 ` Andrew Morton @ 2005-08-01 20:16 ` Linus Torvalds 0 siblings, 0 replies; 72+ messages in thread From: Linus Torvalds @ 2005-08-01 20:16 UTC (permalink / raw) To: Andrew Morton Cc: Hugh Dickins, nickpiggin, holt, roland, schwidefsky, linux-mm, linux-kernel On Mon, 1 Aug 2005, Andrew Morton wrote: > > That was introduced 19 months ago by the s390 guys (see patch below). This really is a very broken patch, btw. > + if (write && !pte_write(pte)) > + goto out; > + if (write && !pte_dirty(pte)) { > + struct page *page = pte_page(pte); > + if (!PageDirty(page)) > + set_page_dirty(page); > + } > + pfn = pte_pfn(pte); > + if (pfn_valid(pfn)) { > + struct page *page = pfn_to_page(pfn); > + > + mark_page_accessed(page); > + return page; Note how it doesn't do any "pfn_valid()" stuff for the dirty bit setting, so it will set random bits in memory if the pte points to some IO page. Maybe that doesn't happen on s390, but.. Anyway, if the s390 people just have a sw-writable bit in their page table layout, I bet they can fix their problem by just having a "sw dirty" bit, and then make "pte_mkdirty()" set that bit. Nobody else will care, but ptrace will then just work correctly for them too. Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug 2005-08-01 15:42 ` Linus Torvalds 2005-08-01 18:18 ` Linus Torvalds 2005-08-01 19:29 ` Hugh Dickins @ 2005-08-02 0:14 ` Nick Piggin 2005-08-02 1:27 ` Nick Piggin 3 siblings, 0 replies; 72+ messages in thread From: Nick Piggin @ 2005-08-02 0:14 UTC (permalink / raw) To: Linus Torvalds Cc: Robin Holt, Andrew Morton, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel Linus Torvalds wrote: > >On Mon, 1 Aug 2005, Nick Piggin wrote: > >>Not sure if this should be fixed for 2.6.13. It can result in >>pagecache corruption: so I guess that answers my own question. >> > >Hell no. > >This patch is clearly untested and must _not_ be applied: > > Yes, I meant that the problem should be fixed, not that the patch should be applied straight away. I wanted to get discussion going ASAP. Looks like it worked :) I'll catch up on it now.
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 15:42 ` Linus Torvalds
  ` (2 preceding siblings ...)
  2005-08-02  0:14 ` Nick Piggin
@ 2005-08-02  1:27 ` Nick Piggin
  2005-08-02  3:45 ` Linus Torvalds
  3 siblings, 1 reply; 72+ messages in thread
From: Nick Piggin @ 2005-08-02  1:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Robin Holt, Andrew Morton, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel

Linus Torvalds wrote:
>
> Instead, I'd suggest changing the logic for "lookup_write". Make it
> require that the page table entry is _dirty_ (not writable), and then
> remove the line that says:
>
>	lookup_write = write && !force;
>
> and you're now done. A successful mm fault for write _should_ always have
> marked the PTE dirty (and yes, part of testing this would be to verify
> that this is true - but since architectures that don't have HW dirty
> bits depend on this anyway, I'm pretty sure it _is_ true).
>
> Ie something like the below (which is totally untested, obviously, but I
> think conceptually is a lot more correct, and obviously a lot simpler).

Surely this introduces integrity problems when `force` is not set?
Security holes? Perhaps not, but I wouldn't guarantee it.

However: I like your idea. And getting rid of the lookup_write logic is
a good thing. I don't much like that it changes the semantics of
follow_page for write on a readonly pte, and that is where your problem
is introduced.

I think to go down this route you'd at least need a follow_page check
that is distinct from 'write'. 'writeable', maybe. Then, having a
'writeable' flag lets you neatly "comment" your idea of what might
constitute a writeable pte, as opposed to the current code which
basically looks like black magic to a reader, and gives no indication
of how it satisfies the get_user_pages requirements.

A minor issue: I don't much like the proliferation of __follow_page
flags either. Why not make __follow_page take a bitmask, and have it
used directly by get_user_pages, which would also allow removal of the
'write' argument from follow_page.

I would cook you some patches, but I'm not in front of the source tree.
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-02  1:27 ` Nick Piggin
@ 2005-08-02  3:45 ` Linus Torvalds
  2005-08-02  4:25 ` Nick Piggin
  0 siblings, 1 reply; 72+ messages in thread
From: Linus Torvalds @ 2005-08-02  3:45 UTC (permalink / raw)
To: Nick Piggin
Cc: Robin Holt, Andrew Morton, Roland McGrath, Hugh Dickins, linux-mm, linux-kernel

On Tue, 2 Aug 2005, Nick Piggin wrote:
>
> Surely this introduces integrity problems when `force` is not set?

"force" changes how we test the vma->vm_flags, and that was always its
meaning from a security standpoint (and that hasn't changed).

The old code had this "lookup_write = write && !force;" thing because
there it used "force" to _clear_ the write bit test, and that was what
caused the race in the first place - next time around we would accept a
non-writable page, even if it hadn't actually gotten COW'ed.

So no, the patch doesn't introduce integrity problems by ignoring "force".
Quite the reverse - it _removes_ the integrity problems by ignoring it
there. That's kind of the whole point.

		Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-02  3:45 ` Linus Torvalds
@ 2005-08-02  4:25 ` Nick Piggin
  2005-08-02  4:35 ` Linus Torvalds
  0 siblings, 1 reply; 72+ messages in thread
From: Nick Piggin @ 2005-08-02  4:25 UTC (permalink / raw)
To: Linus Torvalds
Cc: Robin Holt, Andrew Morton, Roland McGrath, Hugh Dickins, linux-mm, lkml

On Mon, 2005-08-01 at 20:45 -0700, Linus Torvalds wrote:
>
> On Tue, 2 Aug 2005, Nick Piggin wrote:
> >
> > Surely this introduces integrity problems when `force` is not set?
>
> "force" changes how we test the vma->vm_flags, that was always the
> meaning from a security standpoint (and that hasn't changed).

Of course, this test catches the problem I had in mind.

> The old code had this "lookup_write = write && !force;" thing because
> there it used "force" to _clear_ the write bit test, and that was what
> caused the race in the first place - next time around we would accept a
> non-writable page, even if it hadn't actually gotten COW'ed.
>
> So no, the patch doesn't introduce integrity problems by ignoring "force".
> Quite the reverse - it _removes_ the integrity problems by ignoring it
> there. That's kind of the whole point.

OK, I'm convinced.

One last thing - your fix might have a non-trivial overhead in terms of
spinlocks and simply entering the high-level page fault handler when
dealing with clean, writeable ptes for write. Any chance you can change
the __follow_page test to account for writeable clean ptes? Something
like:

	if (write && !pte_dirty(pte) && !pte_write(pte))
		goto out;

And then you would re-add the set_page_dirty logic further on.

Not that I know what Robin's customer is doing exactly, but it seems
like something you can optimise easily enough.

Nick

--
SUSE Labs, Novell Inc.
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-02  4:25 ` Nick Piggin
@ 2005-08-02  4:35 ` Linus Torvalds
  0 siblings, 0 replies; 72+ messages in thread
From: Linus Torvalds @ 2005-08-02  4:35 UTC (permalink / raw)
To: Nick Piggin
Cc: Robin Holt, Andrew Morton, Roland McGrath, Hugh Dickins, linux-mm, lkml

On Tue, 2 Aug 2005, Nick Piggin wrote:
>
> Any chance you can change the __follow_page test to account for
> writeable clean ptes? Something like
>
>	if (write && !pte_dirty(pte) && !pte_write(pte))
>		goto out;
>
> And then you would re-add the set_page_dirty logic further on.

Hmm.. That should be possible. I wanted to do the simplest possible code
sequence, but yeah, I guess there's nothing wrong with allowing the code
to dirty the page.

Somebody want to send me a proper patch? Also, I haven't actually heard
from whoever actually noticed the problem in the first place (Robin?)
whether the fix does fix it. It "obviously does", but testing is always
good ;)

		Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01  8:21 ` [patch 2.6.13-rc4] fix get_user_pages bug Nick Piggin
  2005-08-01  9:19 ` Ingo Molnar
  2005-08-01 15:42 ` Linus Torvalds
@ 2005-08-01 20:03 ` Hugh Dickins
  2005-08-01 20:12 ` Andrew Morton
  2 siblings, 1 reply; 72+ messages in thread
From: Hugh Dickins @ 2005-08-01 20:03 UTC (permalink / raw)
To: Nick Piggin
Cc: Robin Holt, Andrew Morton, Linus Torvalds, Ingo Molnar, Roland McGrath, linux-mm, linux-kernel

On Mon, 1 Aug 2005, Nick Piggin wrote:
>
> This was tested by Robin and appears to solve the problem. Roland
> had a quick look and thought the basic idea was sound. I'd like to
> get a couple more acks before going forward, and in particular
> Robin was contemplating possible efficiency improvements (although
> efficiency can wait on correctness).

I'd much prefer a solution that doesn't invade all the arches, but I
don't think Linus' pte_dirty method will work on s390 (unless we change
s390 in a less obvious way than your patch), so it looks like we do
need something like yours. Comments:

There are currently 21 architectures, but so far your patch only
updates 14 of them?

Personally (rather like Robin) I'd have preferred a return code more
directed to the issue at hand than your VM_FAULT_RACE. Not for
efficiency reasons - I think you're right that's not a real issue - but
more to document the peculiar case we're addressing. Perhaps a code
that says do_wp_page has gone all the way through, even though it
hasn't set the writable bit. That would require less change in
mm/memory.c, but I've no strong reason to argue that you change your
approach in that way. Perhaps others prefer the race case singled out,
to gather statistics on it.

Assuming we stick with your VM_FAULT_RACE...

Could we define VM_FAULT_RACE as 3 rather than -2? I think there's some
historical reason why VM_FAULT_OOM is -1, but I see no cause to extend
the range in that strange direction.

VM_FAULT_RACE is a particular subcase of VM_FAULT_MINOR: throughout the
arches I think they should just be adjacent cases of the switch
statement, both doing the min_flt++.

Your continue in get_user_pages skips page_table_lock, as Linus noted.

The do_wp_page call from do_swap_page needs to be adjusted, to return
VM_FAULT_RACE if that's what it returned.

If VM_FAULT_RACE is really recording races, then the bottom return
value from handle_pte_fault ought to be VM_FAULT_RACE rather than
VM_FAULT_MINOR.

Hugh
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 20:03 ` Hugh Dickins
@ 2005-08-01 20:12 ` Andrew Morton
  2005-08-01 20:26 ` Linus Torvalds
  2005-08-01 20:51 ` Hugh Dickins
  0 siblings, 2 replies; 72+ messages in thread
From: Andrew Morton @ 2005-08-01 20:12 UTC (permalink / raw)
To: Hugh Dickins
Cc: nickpiggin, holt, torvalds, mingo, roland, linux-mm, linux-kernel

Hugh Dickins <hugh@veritas.com> wrote:
>
> There are currently 21 architectures,
> but so far your patch only updates 14 of them?

We could just do:

	static inline int handle_mm_fault(...)
	{
		int ret = __handle_mm_fault(...);

		if (unlikely(ret == VM_FAULT_RACE))
			ret = VM_FAULT_MINOR;
		return ret;
	}

because VM_FAULT_RACE is some internal private thing.

It does add another test-n-branch to the pagefault path though.
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 20:12 ` Andrew Morton
@ 2005-08-01 20:26 ` Linus Torvalds
  2005-08-01 20:51 ` Hugh Dickins
  1 sibling, 0 replies; 72+ messages in thread
From: Linus Torvalds @ 2005-08-01 20:26 UTC (permalink / raw)
To: Andrew Morton
Cc: Hugh Dickins, nickpiggin, holt, mingo, roland, linux-mm, linux-kernel

On Mon, 1 Aug 2005, Andrew Morton wrote:
>
> We could just do:
>
>	static inline int handle_mm_fault(...)
>	{
>		int ret = __handle_mm_fault(...);
>
>		if (unlikely(ret == VM_FAULT_RACE))
>			ret = VM_FAULT_MINOR;

The reason I really dislike this whole VM_FAULT_RACE thing is that
there's literally just one user that cares, and that user is such a
special case anyway that we're _much_ better off fixing it in that user
instead.

The dirty bit thing is truly trivial, and is a generic VM feature. The
fact that s390 does strange things is immaterial: I bet that s390 can
be fixed much more easily than the suggested VM_FAULT_RACE patch, and
quite frankly, bringing it semantically closer to the rest of the
architectures is a _good_ thing regardless.

		Linus
* Re: [patch 2.6.13-rc4] fix get_user_pages bug
  2005-08-01 20:12 ` Andrew Morton
  2005-08-01 20:26 ` Linus Torvalds
@ 2005-08-01 20:51 ` Hugh Dickins
  1 sibling, 0 replies; 72+ messages in thread
From: Hugh Dickins @ 2005-08-01 20:51 UTC (permalink / raw)
To: Andrew Morton
Cc: nickpiggin, holt, torvalds, mingo, roland, linux-mm, linux-kernel

On Mon, 1 Aug 2005, Andrew Morton wrote:
>
> static inline int handle_mm_fault(...)
> {
>	int ret = __handle_mm_fault(...);
>
>	if (unlikely(ret == VM_FAULT_RACE))
>		ret = VM_FAULT_MINOR;
>	return ret;
> }
>
> because VM_FAULT_RACE is some internal private thing.
>
> It does add another test-n-branch to the pagefault path though.

Good idea, at least to avoid changing all arches at this moment; though
I don't think handle_mm_fault itself can be static inline.

But let's set this VM_FAULT_RACE approach aside for now: I think we're
agreed that the pte_dirty-with-mods-to-s390 route is more attractive,
so I'll now try to find fault with that approach.

Hugh
end of thread, other threads:[~2005-08-04 16:32 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-07-30 20:53 get_user_pages() with write=1 and force=1 gets read-only pages Robin Holt
2005-07-30 22:13 ` Hugh Dickins
2005-07-31  1:52 ` Nick Piggin
2005-07-31 10:52 ` Robin Holt
2005-07-31 11:07 ` Nick Piggin
2005-07-31 11:30 ` Robin Holt
2005-07-31 11:39 ` Robin Holt
2005-07-31 12:09 ` Robin Holt
2005-07-31 22:27 ` Nick Piggin
2005-08-01  3:22 ` Roland McGrath
2005-08-01  8:21 ` [patch 2.6.13-rc4] fix get_user_pages bug Nick Piggin
2005-08-01  9:19 ` Ingo Molnar
2005-08-01  9:27 ` Nick Piggin
2005-08-01 10:15 ` Ingo Molnar
2005-08-01 10:57 ` Nick Piggin
2005-08-01 19:43 ` Hugh Dickins
2005-08-01 20:08 ` Linus Torvalds
2005-08-01 21:06 ` Hugh Dickins
2005-08-01 21:51 ` Linus Torvalds
2005-08-01 22:01 ` Linus Torvalds
2005-08-02 12:01 ` Martin Schwidefsky
2005-08-02 12:26 ` Hugh Dickins
2005-08-02 12:28 ` Nick Piggin
2005-08-02 15:19 ` Martin Schwidefsky
2005-08-02 15:30 ` Linus Torvalds
2005-08-02 16:03 ` Hugh Dickins
2005-08-02 16:25 ` Linus Torvalds
2005-08-02 17:02 ` Linus Torvalds
2005-08-02 17:27 ` Hugh Dickins
2005-08-02 17:21 ` Hugh Dickins
2005-08-02 18:47 ` Linus Torvalds
2005-08-02 19:20 ` Hugh Dickins
2005-08-02 19:54 ` Linus Torvalds
2005-08-02 20:55 ` Hugh Dickins
2005-08-03 10:24 ` Nick Piggin
2005-08-03 11:47 ` Hugh Dickins
2005-08-03 12:13 ` Nick Piggin
2005-08-03 16:12 ` Linus Torvalds
2005-08-03 16:39 ` Linus Torvalds
2005-08-03 16:42 ` Linus Torvalds
2005-08-03 17:12 ` Hugh Dickins
2005-08-03 23:03 ` Nick Piggin
2005-08-04 14:14 ` Alexander Nyberg
2005-08-04 14:30 ` Nick Piggin
2005-08-04 15:00 ` Alexander Nyberg
2005-08-04 15:35 ` Hugh Dickins
2005-08-04 16:32 ` Russell King
2005-08-04 15:36 ` Linus Torvalds
2005-08-04 16:29 ` Russell King
2005-08-03 10:24 ` Martin Schwidefsky
2005-08-03 11:57 ` Hugh Dickins
2005-08-02 16:44 ` Martin Schwidefsky
2005-08-01 15:42 ` Linus Torvalds
2005-08-01 18:18 ` Linus Torvalds
2005-08-03  8:24 ` Robin Holt
2005-08-03 11:31 ` Hugh Dickins
2005-08-04 11:48 ` Robin Holt
2005-08-04 13:04 ` Hugh Dickins
2005-08-01 19:29 ` Hugh Dickins
2005-08-01 19:48 ` Linus Torvalds
2005-08-02  8:07 ` Martin Schwidefsky
2005-08-01 19:57 ` Andrew Morton
2005-08-01 20:16 ` Linus Torvalds
2005-08-02  0:14 ` Nick Piggin
2005-08-02  1:27 ` Nick Piggin
2005-08-02  3:45 ` Linus Torvalds
2005-08-02  4:25 ` Nick Piggin
2005-08-02  4:35 ` Linus Torvalds
2005-08-01 20:03 ` Hugh Dickins
2005-08-01 20:12 ` Andrew Morton
2005-08-01 20:26 ` Linus Torvalds
2005-08-01 20:51 ` Hugh Dickins