* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE [not found] ` <1085373839.14969.42.camel@gaston> @ 2004-05-24 5:10 ` Linus Torvalds 2004-05-24 5:34 ` Benjamin Herrenschmidt ` (2 more replies) 0 siblings, 3 replies; 71+ messages in thread From: Linus Torvalds @ 2004-05-24 5:10 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm On Mon, 24 May 2004, Benjamin Herrenschmidt wrote: > > No, it doesn't. It tests it, if !present, it goes to do_no_page, > do_file_page or do_swap_page, but the,i if present, it does: > > entry = pte_mkyoung(entry); > ptep_establish(vma, address, pte, entry); > update_mmu_cache(vma, address, entry); > pte_unmap(pte); > spin_unlock(&mm->page_table_lock); > return VM_FAULT_MINOR; > > Which is, I think, the software PAGE_ACCESSED path used on some archs. > > (ptep_establish does set_pte) Ahh.. That's a bug, methinks. The reason it's a bug is that if you do this, you can lose the dirty bit being written on some other CPU asynchronously. In other words, I think it's pretty much always a bug to do a "set_pte()" with an existing pte in place, exactly because you lose information. You are trying to cover up the bug in ppc64-specific code, but I think that what you found is actually a (really really) unlikely race condition that can have serious consequences. Or am I missing something else? [ grep grep grep ] Looks like "break_cow()" and "do_wp_page()" are safe, but only because they always sets the dirty bit, and any other bits end up being pretty much "don't care if we miss an accessed bit update" or something. Hmm. Maybe I'm wrong. If this really is buggy, it's been buggy this way basically forever. That code is _not_ new, it's some of the oldes code in the whole VM since the original three-level code rewrite. I think. Of course, back then SMP wasn't an issue, and this seems to have survived all the SMP fixes. Who else has been working on the page tables that could verify this for me? Ingo? Ben LaHaise? I forget who even worked on this, because it's so long ago we went through all the atomicity issues with the page table updates on SMP. There may be some reason that I'm overlooking that explains why I'm full of sh*t. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-24 5:10 ` [PATCH] ppc64: Fix possible race with set_pte on a present PTE Linus Torvalds @ 2004-05-24 5:34 ` Benjamin Herrenschmidt 2004-05-24 5:38 ` Benjamin Herrenschmidt 2004-05-24 7:39 ` Ingo Molnar 2004-05-25 3:43 ` Andrea Arcangeli 2 siblings, 1 reply; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-24 5:34 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm > Ahh.. That's a bug, methinks. > > The reason it's a bug is that if you do this, you can lose the dirty bit > being written on some other CPU asynchronously. Hrm... right indeed. > In other words, I think it's pretty much always a bug to do a "set_pte()" > with an existing pte in place, exactly because you lose information. You > are trying to cover up the bug in ppc64-specific code, but I think that > what you found is actually a (really really) unlikely race condition that > can have serious consequences. > > Or am I missing something else? Well, the original scenario triggering that from userland is, imho, so broken, that we may just not care losing that dirty bit ... Oh well :) Anyway, apply my patch. If pte is not present, this will have no effect, if it is, it makes sure we never leave a stale HPTE in the hash, which is fatal in far worse ways. > [ grep grep grep ] > > Looks like "break_cow()" and "do_wp_page()" are safe, but only because > they always sets the dirty bit, and any other bits end up being pretty > much "don't care if we miss an accessed bit update" or something. > > Hmm. Maybe I'm wrong. If this really is buggy, it's been buggy this way > basically forever. That code is _not_ new, it's some of the oldes code in > the whole VM since the original three-level code rewrite. I think. Of > course, back then SMP wasn't an issue, and this seems to have survived all > the SMP fixes. > > Who else has been working on the page tables that could verify this for > me? Ingo? Ben LaHaise? I forget who even worked on this, because it's so > long ago we went through all the atomicity issues with the page table > updates on SMP. There may be some reason that I'm overlooking that > explains why I'm full of sh*t. Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-24 5:34 ` Benjamin Herrenschmidt @ 2004-05-24 5:38 ` Benjamin Herrenschmidt 2004-05-24 5:52 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-24 5:38 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm > Well, the original scenario triggering that from userland is, imho, so > broken, that we may just not care losing that dirty bit ... Oh well :) > Anyway, apply my patch. If pte is not present, this will have no effect, > if it is, it makes sure we never leave a stale HPTE in the hash, which > is fatal in far worse ways. Hrm... Or maybe I should just do in set_pte something like BUG_ON(pte_present(ptep)) That would make me sleep better ;) Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-24 5:38 ` Benjamin Herrenschmidt @ 2004-05-24 5:52 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-24 5:52 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm On Mon, 2004-05-24 at 15:38, Benjamin Herrenschmidt wrote: > > Well, the original scenario triggering that from userland is, imho, so > > broken, that we may just not care losing that dirty bit ... Oh well :) > > Anyway, apply my patch. If pte is not present, this will have no effect, > > if it is, it makes sure we never leave a stale HPTE in the hash, which > > is fatal in far worse ways. > > Hrm... Or maybe I should just do in set_pte something like > > BUG_ON(pte_present(ptep)) > > That would make me sleep better ;) ... And would not work in the case you mentioned where we set it with the dirty bit set... for write protect, we have a separate function now, though. But I'm a bit paranoid with those PTE manipulations, so I think it would be better to play it the safe way and keep my original patch. Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-24 5:10 ` [PATCH] ppc64: Fix possible race with set_pte on a present PTE Linus Torvalds 2004-05-24 5:34 ` Benjamin Herrenschmidt @ 2004-05-24 7:39 ` Ingo Molnar 2004-05-24 5:39 ` Benjamin Herrenschmidt 2004-05-25 3:43 ` Andrea Arcangeli 2 siblings, 1 reply; 71+ messages in thread From: Ingo Molnar @ 2004-05-24 7:39 UTC (permalink / raw) To: Linus Torvalds Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ben LaHaise, linux-mm * Linus Torvalds <torvalds@osdl.org> wrote: > Who else has been working on the page tables that could verify this > for me? Ingo? Ben LaHaise? I forget who even worked on this, because > it's so long ago we went through all the atomicity issues with the > page table updates on SMP. There may be some reason that I'm > overlooking that explains why I'm full of sh*t. Ben's the master of atomic dirty pte updates! :) Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-24 7:39 ` Ingo Molnar @ 2004-05-24 5:39 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-24 5:39 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Andrew Morton, Linux Kernel list, Ben LaHaise, linux-mm On Mon, 2004-05-24 at 17:39, Ingo Molnar wrote: > * Linus Torvalds <torvalds@osdl.org> wrote: > > > Who else has been working on the page tables that could verify this > > for me? Ingo? Ben LaHaise? I forget who even worked on this, because > > it's so long ago we went through all the atomicity issues with the > > page table updates on SMP. There may be some reason that I'm > > overlooking that explains why I'm full of sh*t. > > Ben's the master of atomic dirty pte updates! :) Precise which Ben :) Well, in this case it's obvious but still.. :) Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-24 5:10 ` [PATCH] ppc64: Fix possible race with set_pte on a present PTE Linus Torvalds 2004-05-24 5:34 ` Benjamin Herrenschmidt 2004-05-24 7:39 ` Ingo Molnar @ 2004-05-25 3:43 ` Andrea Arcangeli 2004-05-25 4:00 ` Linus Torvalds 2 siblings, 1 reply; 71+ messages in thread From: Andrea Arcangeli @ 2004-05-25 3:43 UTC (permalink / raw) To: Linus Torvalds Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm On Sun, May 23, 2004 at 10:10:16PM -0700, Linus Torvalds wrote: > what you found is actually a (really really) unlikely race condition that > can have serious consequences. agreed, it could in theory lose the dirty bit. > course, back then SMP wasn't an issue, and this seems to have survived all > the SMP fixes. looks like to trigger this two threads needs to page fault on the same not-present MAP_SHARED page at the same time and the second thread must be a read-fault. The first thread will establish the pte writeable and dirty, then the first thread releases the page_table_lock, then the VM takes the page_table_lock (from kswapd or similar) then the VM transfers the dirty bit from pte to page and it even starts the and completes the I/O before the the read-fault has a chance to execute. Then after the I/O is completed the second thread finally takes the page_table_lock and reads the pte. Then the first thread writes to the page from userspace marking the pte dirty , and finally the second thread completes the page_fault running pte_establish and losing the dirty bit on the page. Maybe it can trigger even without I/O in between, not sure, certainly it can happen in the above scenario, but the I/O window is quite huge for not being able to take a spinlock before the I/O is started. The below patch should fix it, the only problem is that it can screwup some arch that might use page-faults to keep track of the accessed bit, I think for istance ia64 might do that, not sure if it's using it in linux though. anyways the below should be the optimal fix for x86 and x86-64 at least and the other archs will not corrupt memory anymore too (at worst they will live lock in userspace, and the livelock should be solvable gracefully with sigkill since it's an userspace one). warning, patch is absolutely untested. Index: mm/memory.c =================================================================== RCS file: /home/andrea/crypto/cvs/linux-2.5/mm/memory.c,v retrieving revision 1.154 diff -u -p -r1.154 memory.c --- mm/memory.c 20 Apr 2004 14:22:03 -0000 1.154 +++ mm/memory.c 25 May 2004 03:04:48 -0000 @@ -1665,10 +1665,30 @@ static inline int handle_pte_fault(struc return do_wp_page(mm, vma, address, pte, pmd, entry); entry = pte_mkdirty(entry); + + /* + * We can overwrite the pte not atomically only if this is a + * write fault or we can lose the dirty bit. The dirty and the + * accessed bit are the only two things that can change form + * under us even if we hold the page_table_lock, they're + * set by hardware while other threads runs in userspace. + * + * This could cause an infinite loop with a read-fault if some + * arch tries to keep track of the accessed bit with a kernel + * page fault (but keeping track of the accessed bit with + * kernel exceptions sounds very inefficient anyways). + * If some arch infinite loops, they will have to implement + * some sort of atomic ptep_stablish where they atomically + * compare and swap, then we'll change the common code here + * to learn how to atomic swap the pte. + * + * x86 and x86-64 are sure optimal without atomic ops here and + * with this simple code. + */ + entry = pte_mkyoung(entry); + ptep_establish(vma, address, pte, entry); + update_mmu_cache(vma, address, entry); } - entry = pte_mkyoung(entry); - ptep_establish(vma, address, pte, entry); - update_mmu_cache(vma, address, entry); pte_unmap(pte); spin_unlock(&mm->page_table_lock); return VM_FAULT_MINOR; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 3:43 ` Andrea Arcangeli @ 2004-05-25 4:00 ` Linus Torvalds 2004-05-25 4:17 ` Benjamin Herrenschmidt ` (2 more replies) 0 siblings, 3 replies; 71+ messages in thread From: Linus Torvalds @ 2004-05-25 4:00 UTC (permalink / raw) To: Andrea Arcangeli Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Tue, 25 May 2004, Andrea Arcangeli wrote: > > The below patch should fix it, the only problem is that it can screwup > some arch that might use page-faults to keep track of the accessed bit, Indeed. At least alpha does this - that's where this code came from. SO this will cause infinite page faults on alpha and any other "accessed bit in software" architectures. Not good. I suspect we should just make a "ptep_set_bits()" inline function that _atomically_ does "set the dirty/accessed bits". On x86, it would be a simple asm("lock ; orl %1,%0" :"m" (*ptep) :"r" (entry)); and similarly on most other architectures it should be quite easy to do the equivalent. You can always do it with a simple compare-and-exchange loop, something any SMP-capable architecture should have. Of course, arguably we can actually optimize this by "knowing" that it is safe to set the dirty bit, so then we don't even need an atomic operation, we just need one atomic write. So we only actually need the atomic op for the accessed bit case, and if we make the write-case be totally separate.. Anybody willing to write up a patch for a few architectures? Is there any architecture out there that would have a problem with this? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 4:00 ` Linus Torvalds @ 2004-05-25 4:17 ` Benjamin Herrenschmidt 2004-05-25 4:37 ` Andrea Arcangeli 2004-05-25 4:20 ` Andrea Arcangeli 2004-05-25 11:44 ` Matthew Wilcox 2 siblings, 1 reply; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-25 4:17 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group > and similarly on most other architectures it should be quite easy to do > the equivalent. You can always do it with a simple compare-and-exchange > loop, something any SMP-capable architecture should have. > > Of course, arguably we can actually optimize this by "knowing" that it is > safe to set the dirty bit, so then we don't even need an atomic operation, > we just need one atomic write. So we only actually need the atomic op for > the accessed bit case, and if we make the write-case be totally separate.. Looks good ! That gives us a guarantee that set_pte is never ever called on a present PTE (thus letting set_pte be non-atomic) and we can safely BUG_ON(pte_present(*ptep)) in it, right ? Note that having different set_dirty and set_accessed may be useful for some archs, thouh I agree a single atomic operation is enough on ppc too, I also want to make sure nobody ever gets the idea of using that for anything but those 2 bits... Well, that's a matter of taste, go for what you prefer. ppc64 version of this would look like static inline unsigned long ptep_set_bits(pte_t *p, unsigned long set) { unsigned long old, tmp; __asm__ __volatile__( "1: ldarx %0,0,%3\n\ or %1,%0,%4 \n\ stdcx. %1,0,%3 \n\ bne- 1b" : "=&r" (old), "=&r" (tmp), "=m" (*p) : "r" (p), "r" (clr), "m" (*p) : "cc" ); return old; } ppc32 would be: #define ptep_set_bits(p, bits) pte_update(p, 0, bits) > Anybody willing to write up a patch for a few architectures? Is there any > architecture out there that would have a problem with this? > > Linus > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- Benjamin Herrenschmidt <benh@kernel.crashing.org> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 4:17 ` Benjamin Herrenschmidt @ 2004-05-25 4:37 ` Andrea Arcangeli 2004-05-25 4:40 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 71+ messages in thread From: Andrea Arcangeli @ 2004-05-25 4:37 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Linus Torvalds, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Tue, May 25, 2004 at 02:17:41PM +1000, Benjamin Herrenschmidt wrote: > on a present PTE (thus letting set_pte be non-atomic) and we can safely > BUG_ON(pte_present(*ptep)) in it, right ? set_pte is used even to mark the pte non present, so you can forget about using BUG_ON(pte_present(*ptep)) anywhere in set_pte regardless of how we fix this race (see mm/objrmap.c:unmap_pte_page()). If you want to trap for it you should add a set_pte_present and use it at least in objrmap.c during the paging. > ppc64 version of this would look like > > static inline unsigned long ptep_set_bits(pte_t *p, unsigned long set) > { > unsigned long old, tmp; > > __asm__ __volatile__( > "1: ldarx %0,0,%3\n\ > or %1,%0,%4 \n\ > stdcx. %1,0,%3 \n\ > bne- 1b" > : "=&r" (old), "=&r" (tmp), "=m" (*p) > : "r" (p), "r" (clr), "m" (*p) > : "cc" ); > return old; > } > > ppc32 would be: > > #define ptep_set_bits(p, bits) pte_update(p, 0, bits) unless you are generating page faults if the young bit is clear, this will only slowdown compared to my simpler approch. However if some arch is using page faults to set the young bit in hardware (not in software), then slowing micro-down x86 and others might be an option to share all the common code, but we could easily avoid smp locking in x86 and alpha by threating the young-bit-page-fault archs differently too. Would be nice to hear from the ia64 folks what they're doing w.r.t. to the young bit, I think ia64 is the only one providing the young bit with an hardware page fault. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 4:37 ` Andrea Arcangeli @ 2004-05-25 4:40 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-25 4:40 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linus Torvalds, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Tue, 2004-05-25 at 14:37, Andrea Arcangeli wrote: > On Tue, May 25, 2004 at 02:17:41PM +1000, Benjamin Herrenschmidt wrote: > > on a present PTE (thus letting set_pte be non-atomic) and we can safely > > BUG_ON(pte_present(*ptep)) in it, right ? > > set_pte is used even to mark the pte non present, so you can forget > about using BUG_ON(pte_present(*ptep)) anywhere in set_pte regardless of > how we fix this race (see mm/objrmap.c:unmap_pte_page()). If you want to > trap for it you should add a set_pte_present and use it at least in > objrmap.c during the paging. Isn't this the work of pte_clear ? It's quite important to be very careful about such PTE manipulations on ppc & ppc64 or we can wreck everything by losting the hash state bits in there. > unless you are generating page faults if the young bit is clear, this > will only slowdown compared to my simpler approch. > > However if some arch is using page faults to set the young bit in > hardware (not in software), then slowing micro-down x86 and others might > be an option to share all the common code, but we could easily avoid > smp locking in x86 and alpha by threating the young-bit-page-fault archs > differently too. > > Would be nice to hear from the ia64 folks what they're doing w.r.t. to > the young bit, I think ia64 is the only one providing the young bit with > an hardware page fault. -- Benjamin Herrenschmidt <benh@kernel.crashing.org> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 4:00 ` Linus Torvalds 2004-05-25 4:17 ` Benjamin Herrenschmidt @ 2004-05-25 4:20 ` Andrea Arcangeli 2004-05-25 4:39 ` Linus Torvalds 2004-05-25 4:43 ` David Mosberger 2004-05-25 11:44 ` Matthew Wilcox 2 siblings, 2 replies; 71+ messages in thread From: Andrea Arcangeli @ 2004-05-25 4:20 UTC (permalink / raw) To: Linus Torvalds Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Mon, May 24, 2004 at 09:00:02PM -0700, Linus Torvalds wrote: > > > On Tue, 25 May 2004, Andrea Arcangeli wrote: > > > > The below patch should fix it, the only problem is that it can screwup > > some arch that might use page-faults to keep track of the accessed bit, > > Indeed. At least alpha does this - that's where this code came from. SO > this will cause infinite page faults on alpha and any other "accessed bit > in software" architectures. as you say the alpha has no accessed bit at all in hardware, so it cannot generate page faults. "accessed bit in software" is fine with my fix. the only architecture that has the accessed bit in _hardware_ via page faults I know is ia64, but I don't know if it has a mode to set it without page faults and how it is implementing the accessed bit in linux. Everything else either has it in hardware trasparently set by the cpu (like x86) in the pte, or in software (like alpha). Either ways it's fine. > I suspect we should just make a "ptep_set_bits()" inline function that > _atomically_ does "set the dirty/accessed bits". On x86, it would be a > simple > > asm("lock ; orl %1,%0" > :"m" (*ptep) > :"r" (entry)); > > and similarly on most other architectures it should be quite easy to do > the equivalent. You can always do it with a simple compare-and-exchange > loop, something any SMP-capable architecture should have. > > Of course, arguably we can actually optimize this by "knowing" that it is > safe to set the dirty bit, so then we don't even need an atomic operation, > we just need one atomic write. So we only actually need the atomic op for > the accessed bit case, and if we make the write-case be totally separate.. on x86 there's no point to use atomic ops in the write_access path and there's no point to do anything in the read path. The only issue are the archs that may generate page faults for young bit clear and for those we should simply add a: ptep_atomic_set_young(pte) right before the pte_unmap, the tlb flush is not needed of course (they are generating active page faults so they will re-check). But before adding the above operation, I'd wait to hear if any architecture is really using _hardware_ page faults for the accessed bit in linux. Then it also depends if this is the only buggy place or if there are others, but certainly this one doesn't need atomic ops on x86* and alpha (that's the only two I'm 100% sure about ;). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 4:20 ` Andrea Arcangeli @ 2004-05-25 4:39 ` Linus Torvalds 2004-05-25 4:44 ` Linus Torvalds 2004-05-25 4:50 ` Andrea Arcangeli 2004-05-25 4:43 ` David Mosberger 1 sibling, 2 replies; 71+ messages in thread From: Linus Torvalds @ 2004-05-25 4:39 UTC (permalink / raw) To: Andrea Arcangeli Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Tue, 25 May 2004, Andrea Arcangeli wrote: > On Mon, May 24, 2004 at 09:00:02PM -0700, Linus Torvalds wrote: > > > > > > On Tue, 25 May 2004, Andrea Arcangeli wrote: > > > > > > The below patch should fix it, the only problem is that it can screwup > > > some arch that might use page-faults to keep track of the accessed bit, > > > > Indeed. At least alpha does this - that's where this code came from. SO > > this will cause infinite page faults on alpha and any other "accessed bit > > in software" architectures. > > as you say the alpha has no accessed bit at all in hardware, so > it cannot generate page faults. It _does_ generate page faults. We do the accessed bit by clearing the "user readable" thing (or something. I forget the exact details, and I'm too lazy to check it out). And a page won't be _really_ readable until it has been marked young. If you don't mark it young, you'll get infinite page faults. That's how we do the accessed bit. > "accessed bit in software" is fine with my fix. NO IT IS NOT. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 4:39 ` Linus Torvalds @ 2004-05-25 4:44 ` Linus Torvalds 2004-05-25 4:59 ` Andrea Arcangeli 2004-05-25 4:50 ` Andrea Arcangeli 1 sibling, 1 reply; 71+ messages in thread From: Linus Torvalds @ 2004-05-25 4:44 UTC (permalink / raw) To: Andrea Arcangeli Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Mon, 24 May 2004, Linus Torvalds wrote: > > We do the accessed bit by clearing the "user readable" thing (or > something. I forget the exact details, and I'm too lazy to check it out). Yup. Lookie here: #define __ACCESS_BITS (_PAGE_ACCESSED | _PAGE_KRE | _PAGE_URE) extern inline pte_t pte_mkold(pte_t pte) { pte_val(pte) &= ~(__ACCESS_BITS); return pte; } Notice how an "old" pte won't be readable. Then, when we take the page fault, we'll do extern inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= __ACCESS_BITS; return pte; } and now the pte is readable again. In other words, we absolutely _have_ to do the "pte_mkyoung()" part in the page fault, or an "old" pte will never become readable again (unless it's accessed with a write rather than a read, which will then happen to make it young again). I'm not quite senile yet. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 4:44 ` Linus Torvalds @ 2004-05-25 4:59 ` Andrea Arcangeli 2004-05-25 5:09 ` Andrea Arcangeli 0 siblings, 1 reply; 71+ messages in thread From: Andrea Arcangeli @ 2004-05-25 4:59 UTC (permalink / raw) To: Linus Torvalds Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Mon, May 24, 2004 at 09:44:08PM -0700, Linus Torvalds wrote: > > > On Mon, 24 May 2004, Linus Torvalds wrote: > > > > We do the accessed bit by clearing the "user readable" thing (or > > something. I forget the exact details, and I'm too lazy to check it out). > > Yup. Lookie here: > > #define __ACCESS_BITS (_PAGE_ACCESSED | _PAGE_KRE | _PAGE_URE) > extern inline pte_t pte_mkold(pte_t pte) { pte_val(pte) &= ~(__ACCESS_BITS); return pte; } > > Notice how an "old" pte won't be readable. Then, when we take the page > fault, we'll do > > extern inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= __ACCESS_BITS; return pte; } > > and now the pte is readable again. > > In other words, we absolutely _have_ to do the "pte_mkyoung()" part in the > page fault, or an "old" pte will never become readable again (unless it's > accessed with a write rather than a read, which will then happen to make > it young again). > > I'm not quite senile yet. I see, sorry I was wrong. I misread this code a long time ago and I noticed how the young bit works only now. Infact I always wondered if the young bit was useful at all. So it was possible to emulate it after all. However I wonder what happens for PROT_WRITE? How can you make a mapping only writeable if the mk_young marks it readable? That's why I misread it without even imagining it was setting the readable bit at the same time of the young bit. so while ia64 may not even need to set the young bit, alpha needs it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 4:59 ` Andrea Arcangeli @ 2004-05-25 5:09 ` Andrea Arcangeli 0 siblings, 0 replies; 71+ messages in thread From: Andrea Arcangeli @ 2004-05-25 5:09 UTC (permalink / raw) To: Linus Torvalds Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group > all. However I wonder what happens for PROT_WRITE? How can you make a I understood now how it works with PROT_WRITE too, it's not FOR but URE being tweaked together with ACCESSED. This has been a very big misread I did when I was doing alpha stuff some year ago. that's why I was so confident it was only setting it during the first page fault and never clearing it again. Sounds good that it can be emulated fully, I thought it wasn't even feasible at all. thanks a lot for pointing out this huge mistake. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 4:39 ` Linus Torvalds 2004-05-25 4:44 ` Linus Torvalds @ 2004-05-25 4:50 ` Andrea Arcangeli 2004-05-25 4:59 ` Linus Torvalds 1 sibling, 1 reply; 71+ messages in thread From: Andrea Arcangeli @ 2004-05-25 4:50 UTC (permalink / raw) To: Linus Torvalds Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Mon, May 24, 2004 at 09:39:05PM -0700, Linus Torvalds wrote: > > > On Tue, 25 May 2004, Andrea Arcangeli wrote: > > > On Mon, May 24, 2004 at 09:00:02PM -0700, Linus Torvalds wrote: > > > > > > > > > On Tue, 25 May 2004, Andrea Arcangeli wrote: > > > > > > > > The below patch should fix it, the only problem is that it can screwup > > > > some arch that might use page-faults to keep track of the accessed bit, > > > > > > Indeed. At least alpha does this - that's where this code came from. SO > > > this will cause infinite page faults on alpha and any other "accessed bit > > > in software" architectures. > > > > as you say the alpha has no accessed bit at all in hardware, so > > it cannot generate page faults. > > It _does_ generate page faults. > > We do the accessed bit by clearing the "user readable" thing (or > something. I forget the exact details, and I'm too lazy to check it out). > And a page won't be _really_ readable until it has been marked young. > > If you don't mark it young, you'll get infinite page faults. > > That's how we do the accessed bit. no, that's not what I remeber from alpha, alpha always sets the young bit as soon as it sets the pte from non-null to something. No matter what kind of pte is that (readonly/writeonly/whatever). Then this bit is cleared by the VM but nothing ever happens again. The accessed bit is quite useless infact. > > "accessed bit in software" is fine with my fix. > > NO IT IS NOT. It definitely is, there's no way the architecture can loop on a bit that doesn't exist as far as the hardware is concerned. see the code: /* .. and these are ours ... */ #define _PAGE_DIRTY 0x20000 #define _PAGE_ACCESSED 0x40000 #define _PAGE_FILE 0x80000 /* set:pagecache, unset:swap */ extern inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } extern inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= __ACCESS_BITS; return pte; } So alpha will work fine, there's absolutely no way pte_mkyoung can avoid the next page fault, period. furthermore there is no way to keep the accessed bit coherent without hardware support, I tried that while I was working on the alpha some year ago but there's no way. The accessed bit in software is only useful to give a second chance to pages paged in recently (if useful at all). There are two ways to do the hardware support, one is what x86 does, that is to set the bit without a page fault that enters kernel. The other way is an option that ia64 has to generate even the page fault. My fix works fine for x86* and alpha. It may not work for ia64 if it's using hardware page faults to set the young bit. comments on this detail from the ia64 developers would be welcome. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 4:50 ` Andrea Arcangeli @ 2004-05-25 4:59 ` Linus Torvalds 0 siblings, 0 replies; 71+ messages in thread From: Linus Torvalds @ 2004-05-25 4:59 UTC (permalink / raw) To: Andrea Arcangeli Cc: Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Tue, 25 May 2004, Andrea Arcangeli wrote: > > no, that's not what I remeber from alpha, alpha always sets the young > bit as soon as it sets the pte from non-null to something. Yes. However, whtn the page is _aged_ later, the young bits will be cleared. And _that_ is when you will now start getting infinite page faults, because with your patch the young bits will never be set again on a normal read. See what I'm saying? Your patch literally leaves the page table alone on pure reads. Which is not acceptable, since the page being marked old was what caused the read-fault in the first place. But yes, pages will be young by default, so you won't ever actually _see_ this behaviour until you start having memory pressure and VM reclaim starts trying to age the things. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 4:20 ` Andrea Arcangeli 2004-05-25 4:39 ` Linus Torvalds @ 2004-05-25 4:43 ` David Mosberger 2004-05-25 4:53 ` Andrea Arcangeli 1 sibling, 1 reply; 71+ messages in thread From: David Mosberger @ 2004-05-25 4:43 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linus Torvalds, Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group >>>>> On Tue, 25 May 2004 06:20:54 +0200, Andrea Arcangeli <andrea@suse.de> said: Andrea> the only architecture that has the accessed bit in Andrea> _hardware_ via page faults I know is ia64, but I don't know Andrea> if it has a mode to set it without page faults No, it doesn't. Andrea> and how it is implementing the accessed bit in linux. If the "accessed" or "dirty" bits are zero, accessing/writing the page will cause a fault which will be handled in a low-level fault handler. The Linux version of these handlers simply turn on the respective bit. See daccess_bit(), iaccess_bit(), and dirty_bit() in arch/ia64/kernel/ivt.S. Note: I'm on travel and haven't seen the context of this discussion and don't expect to have time to think about this until I return on Thursday. So if you don't hear from me, it's not because I'm ignoring you... ;-) --david -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 4:43 ` David Mosberger @ 2004-05-25 4:53 ` Andrea Arcangeli 2004-05-27 21:56 ` David Mosberger 0 siblings, 1 reply; 71+ messages in thread From: Andrea Arcangeli @ 2004-05-25 4:53 UTC (permalink / raw) To: davidm Cc: Linus Torvalds, Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Mon, May 24, 2004 at 09:43:00PM -0700, David Mosberger wrote: > >>>>> On Tue, 25 May 2004 06:20:54 +0200, Andrea Arcangeli <andrea@suse.de> said: > > Andrea> the only architecture that has the accessed bit in > Andrea> _hardware_ via page faults I know is ia64, but I don't know > Andrea> if it has a mode to set it without page faults > > No, it doesn't. > > Andrea> and how it is implementing the accessed bit in linux. > > If the "accessed" or "dirty" bits are zero, accessing/writing the > page will cause a fault which will be handled in a low-level > fault handler. The Linux version of these handlers simply turn > on the respective bit. See daccess_bit(), iaccess_bit(), and dirty_bit() > in arch/ia64/kernel/ivt.S. so you mean, this is being set in the arch section before ever reaching handle_mm_fault? in such case my fix should work fine for ia64 too. > Note: I'm on travel and haven't seen the context of this discussion > and don't expect to have time to think about this until I return on > Thursday. So if you don't hear from me, it's not because I'm ignoring > you... ;-) take your time ;) thanks a lot for the above hints about those ivt.S functions (though I don't speak ia64 asm very well ;) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 4:53 ` Andrea Arcangeli @ 2004-05-27 21:56 ` David Mosberger 2004-05-27 22:00 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 71+ messages in thread From: David Mosberger @ 2004-05-27 21:56 UTC (permalink / raw) To: Andrea Arcangeli Cc: davidm, Linus Torvalds, Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group >>>>> On Tue, 25 May 2004 06:53:22 +0200, Andrea Arcangeli <andrea@suse.de> said: >> If the "accessed" or "dirty" bits are zero, accessing/writing the >> page will cause a fault which will be handled in a low-level >> fault handler. The Linux version of these handlers simply turn >> on the respective bit. See daccess_bit(), iaccess_bit(), and dirty_bit() >> in arch/ia64/kernel/ivt.S. Andrea> so you mean, this is being set in the arch section before Andrea> ever reaching handle_mm_fault? Correct. The low-level fault handlers set the ACCESSED/DIRTY bits with an atomic compare-and-exchange (on SMP). They don't (normally) bubble up all the way to the Linux page-fault handler. --david -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-27 21:56 ` David Mosberger @ 2004-05-27 22:00 ` Benjamin Herrenschmidt 2004-05-27 22:12 ` David Mosberger 0 siblings, 1 reply; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-27 22:00 UTC (permalink / raw) To: davidm Cc: Andrea Arcangeli, Linus Torvalds, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Fri, 2004-05-28 at 07:56, David Mosberger wrote: > >>>>> On Tue, 25 May 2004 06:53:22 +0200, Andrea Arcangeli <andrea@suse.de> said: > > >> If the "accessed" or "dirty" bits are zero, accessing/writing the > >> page will cause a fault which will be handled in a low-level > >> fault handler. The Linux version of these handlers simply turn > >> on the respective bit. See daccess_bit(), iaccess_bit(), and dirty_bit() > >> in arch/ia64/kernel/ivt.S. > > Andrea> so you mean, this is being set in the arch section before > Andrea> ever reaching handle_mm_fault? > > Correct. The low-level fault handlers set the ACCESSED/DIRTY bits > with an atomic compare-and-exchange (on SMP). They don't (normally) > bubble up all the way to the Linux page-fault handler. Same for PPC, but we still have a race where we can reach that code, see my initial mail (typically, because your low level code, for obvious reasons, doesn't take the mm semaphore, thus the page may have been mapped right after you decide to go to do_page_fault and before it takes the mm sem). Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-27 22:00 ` Benjamin Herrenschmidt @ 2004-05-27 22:12 ` David Mosberger 0 siblings, 0 replies; 71+ messages in thread From: David Mosberger @ 2004-05-27 22:12 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: davidm, Andrea Arcangeli, Linus Torvalds, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group >>>>> On Fri, 28 May 2004 08:00:54 +1000, Benjamin Herrenschmidt <benh@kernel.crashing.org> said: Benjamin> Same for PPC, but we still have a race where we can reach Benjamin> that code, see my initial mail (typically, because your Benjamin> low level code, for obvious reasons, doesn't take the mm Benjamin> semaphore, thus the page may have been mapped right after Benjamin> you decide to go to do_page_fault and before it takes the Benjamin> mm sem). Yes; I was only explaining how this works on ia64. --david -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 4:00 ` Linus Torvalds 2004-05-25 4:17 ` Benjamin Herrenschmidt 2004-05-25 4:20 ` Andrea Arcangeli @ 2004-05-25 11:44 ` Matthew Wilcox 2004-05-25 14:48 ` Linus Torvalds 2 siblings, 1 reply; 71+ messages in thread From: Matthew Wilcox @ 2004-05-25 11:44 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Mon, May 24, 2004 at 09:00:02PM -0700, Linus Torvalds wrote: > I suspect we should just make a "ptep_set_bits()" inline function that > _atomically_ does "set the dirty/accessed bits". On x86, it would be a > simple > > asm("lock ; orl %1,%0" > :"m" (*ptep) > :"r" (entry)); > > and similarly on most other architectures it should be quite easy to do > the equivalent. You can always do it with a simple compare-and-exchange > loop, something any SMP-capable architecture should have. ... but PA doesn't. Just load-and-clear-word (and its 64-bit equivalent in 64-bit mode). And that word has to be 16-byte aligned. What race are we protecting against? If it's like xchg() and we only need to protect against a racing xchg() and not a reader, we can just reuse the global array of hashed spinlocks we have for that. > Of course, arguably we can actually optimize this by "knowing" that it is > safe to set the dirty bit, so then we don't even need an atomic operation, > we just need one atomic write. So we only actually need the atomic op for > the accessed bit case, and if we make the write-case be totally separate.. Ah, atomic writes we can do. That's easy. I think all Linux architectures support atomic writes to naturally aligned addresses, don't they? > Anybody willing to write up a patch for a few architectures? Is there any > architecture out there that would have a problem with this? > > Linus -- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." -- Mark Twain -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 11:44 ` Matthew Wilcox @ 2004-05-25 14:48 ` Linus Torvalds 2004-05-25 15:35 ` Keith M Wesolowski 2004-05-25 21:27 ` Andrea Arcangeli 0 siblings, 2 replies; 71+ messages in thread From: Linus Torvalds @ 2004-05-25 14:48 UTC (permalink / raw) To: Matthew Wilcox Cc: Andrea Arcangeli, Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Tue, 25 May 2004, Matthew Wilcox wrote: > On Mon, May 24, 2004 at 09:00:02PM -0700, Linus Torvalds wrote: > > I suspect we should just make a "ptep_set_bits()" inline function that > > _atomically_ does "set the dirty/accessed bits". On x86, it would be a > > simple > > > > asm("lock ; orl %1,%0" > > :"m" (*ptep) > > :"r" (entry)); > > > > and similarly on most other architectures it should be quite easy to do > > the equivalent. You can always do it with a simple compare-and-exchange > > loop, something any SMP-capable architecture should have. > > ... but PA doesn't. Just load-and-clear-word (and its 64-bit equivalent > in 64-bit mode). And that word has to be 16-byte aligned. Wow. And this architecture claims to support SMP? > What race are we protecting against? If it's like xchg() and we only > need to protect against a racing xchg() and not a reader, we can just > reuse the global array of hashed spinlocks we have for that. The race is: - one CPU sets the dirty bit (possibly with a hardware walker, but I guess on PA it's probably done in sw) - the other CPU sets the accessed bit in sw as part of the "handle_pte_fault()" processing. Right now we set the accessed bit with a simple "ptep_establish()", which will use "set_pte()", which is just a regular write. So setting the accessed bit will basically be a nonatomic sequence of - read pte entry - entry = pte_mkyoung(entry) - set_pte(entry) which is all done under the mm->page_table_lock, but which does NOT protect against any hardware page-table walkers or any asynchronous sw walkers (if anybody does them). Basically, the suggestion is to replace the "set_pte()" with something that is safe against anything else that updates the page tables (whether software or hardware). If only core kernel code does that, then you should already be fine, since the page-table spinlock should already be held by all updaters. NOTE! One really easy approach would be to say that we never mix software updates of the accessed bit with hw updates, and just have a rule that if the architecture does accessed-bit updates in hardware (and can thus race with us doing them in software _despite_ the fact that we hold the page table lock), then we just don't do the update at all. We'd just pass in a flag to "ptep_establish()" to tell it whether we changed the dirty bit or not. It would be "write_access" in handle_pte_fault(), and 1 in the other two cases. > Ah, atomic writes we can do. That's easy. I think all Linux architectures > support atomic writes to naturally aligned addresses, don't they? Yes. You'd really have to work at it _not_ to support them ;) However, the atomic write case only helps in the case when we update _all_ the bits that hw walkers can update, Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 14:48 ` Linus Torvalds @ 2004-05-25 15:35 ` Keith M Wesolowski 2004-05-25 16:19 ` Linus Torvalds 2004-05-25 21:27 ` Andrea Arcangeli 1 sibling, 1 reply; 71+ messages in thread From: Keith M Wesolowski @ 2004-05-25 15:35 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Andrea Arcangeli, Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Tue, May 25, 2004 at 07:48:24AM -0700, Linus Torvalds wrote: > > > the equivalent. You can always do it with a simple compare-and-exchange > > > loop, something any SMP-capable architecture should have. > > The race is: > - one CPU sets the dirty bit (possibly with a hardware walker, but I > guess on PA it's probably done in sw) > - the other CPU sets the accessed bit in sw as part of the > "handle_pte_fault()" processing. > > Right now we set the accessed bit with a simple "ptep_establish()", which > will use "set_pte()", which is just a regular write. So setting the > accessed bit will basically be a nonatomic sequence of > > - read pte entry > - entry = pte_mkyoung(entry) > - set_pte(entry) > > which is all done under the mm->page_table_lock, but which does NOT > protect against any hardware page-table walkers or any asynchronous sw > walkers (if anybody does them). Some sparc32 CPUs are also vulnerable to this race; in fact the supersparc manual describes it specifically and even outlines the compare-exchange loop using our rotten swap instruction. In our case, the race is with a hardware walker. -- Keith M Wesolowski -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 15:35 ` Keith M Wesolowski @ 2004-05-25 16:19 ` Linus Torvalds 2004-05-25 17:25 ` David S. Miller 0 siblings, 1 reply; 71+ messages in thread From: Linus Torvalds @ 2004-05-25 16:19 UTC (permalink / raw) To: Keith M Wesolowski Cc: Matthew Wilcox, Andrea Arcangeli, Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Tue, 25 May 2004, Keith M Wesolowski wrote: > > Some sparc32 CPUs are also vulnerable to this race; in fact the > supersparc manual describes it specifically and even outlines the > compare-exchange loop using our rotten swap instruction. In our case, > the race is with a hardware walker. Yes, but the sparc32 page tables are not the same as the linux kernel page tables, so in your case it's a different path and a different page table. Only the shared case really matters (ie things that do hw/microcode walk of a page table _tree_ not a hash). Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 16:19 ` Linus Torvalds @ 2004-05-25 17:25 ` David S. Miller 2004-05-25 17:49 ` Linus Torvalds 0 siblings, 1 reply; 71+ messages in thread From: David S. Miller @ 2004-05-25 17:25 UTC (permalink / raw) To: Linus Torvalds Cc: wesolows, willy, andrea, benh, akpm, linux-kernel, mingo, bcrl, linux-mm, linux-arch On Tue, 25 May 2004 09:19:52 -0700 (PDT) Linus Torvalds <torvalds@osdl.org> wrote: > On Tue, 25 May 2004, Keith M Wesolowski wrote: > > > > Some sparc32 CPUs are also vulnerable to this race; in fact the > > supersparc manual describes it specifically and even outlines the > > compare-exchange loop using our rotten swap instruction. In our case, > > the race is with a hardware walker. > > Yes, but the sparc32 page tables are not the same as the linux kernel page > tables, so in your case it's a different path and a different page table. > Only the shared case really matters (ie things that do hw/microcode walk > of a page table _tree_ not a hash). Not true on 32-bit Sparc sun4m systems, it's exactly like i386 except the hardware is stupid and we only have an atomic swap instruction. :-) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 17:25 ` David S. Miller @ 2004-05-25 17:49 ` Linus Torvalds 2004-05-25 17:54 ` David S. Miller 0 siblings, 1 reply; 71+ messages in thread From: Linus Torvalds @ 2004-05-25 17:49 UTC (permalink / raw) To: David S. Miller Cc: wesolows, willy, andrea, benh, akpm, linux-kernel, mingo, bcrl, linux-mm, linux-arch On Tue, 25 May 2004, David S. Miller wrote: > > Not true on 32-bit Sparc sun4m systems, it's exactly like i386 except > the hardware is stupid and we only have an atomic swap instruction. > :-) Ouch. Ok. My second suggestion ("don't bother with accessed bit on hw that does it on its own") should work fine there too, I guess. So what I can tell, the fix is really something like this (this does both x86 and ppc64 just to show how two different approaches would handle it, but I have literally _tested_ neither). What do people think? Ben, does this fix your problem? Does my ppc64 assembly code look sane? Shamelessly stolen from the function above it, and it seems to compile ;) Linus ----- ===== include/asm-generic/pgtable.h 1.3 vs edited ===== --- 1.3/include/asm-generic/pgtable.h Sun Jan 18 22:35:59 2004 +++ edited/include/asm-generic/pgtable.h Tue May 25 10:39:38 2004 @@ -10,9 +10,9 @@ * * We hold the mm semaphore for reading and vma->vm_mm->page_table_lock */ -#define ptep_establish(__vma, __address, __ptep, __entry) \ +#define ptep_establish(__vma, __address, __ptep, __entry, __dirty) \ do { \ - set_pte(__ptep, __entry); \ + ptep_update_dirty_accessed(__ptep, __entry, __dirty); \ flush_tlb_page(__vma, __address); \ } while (0) #endif ===== include/asm-i386/pgtable.h 1.44 vs edited ===== --- 1.44/include/asm-i386/pgtable.h Sat May 22 14:56:24 2004 +++ edited/include/asm-i386/pgtable.h Tue May 25 10:39:56 2004 @@ -225,6 +225,12 @@ static inline void ptep_set_wrprotect(pte_t *ptep) { clear_bit(_PAGE_BIT_RW, &ptep->pte_low); } static inline void ptep_mkdirty(pte_t *ptep) { set_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); } +static inline void ptep_update_dirty_accessed(pte_t *ptep, pte_t entry, int dirty) +{ + if (dirty) + set_pte(ptep, entry); +} + /* * Macro to mark a page protection value as "uncacheable". On processors which do not support * it, this is a no-op. ===== include/asm-ppc64/pgtable.h 1.33 vs edited ===== --- 1.33/include/asm-ppc64/pgtable.h Sat May 22 14:56:24 2004 +++ edited/include/asm-ppc64/pgtable.h Tue May 25 10:45:58 2004 @@ -306,6 +306,21 @@ return old; } +static inline void ptep_update_dirty_accessed(pte_t *p, pte_t entry, int dirty) +{ + unsigned long old, tmp; + + __asm__ __volatile__( + "1: ldarx %0,0,%3\n\ + or %1,%0,%4 \n\ + stdcx. %1,0,%3 \n\ + bne- 1b" + : "=&r" (old), "=&r" (tmp), "=m" (*p) + : "r" (p), "r" (pte_val(entry)), "m" (*p) + : "cc" ); +} + + /* PTE updating functions */ extern void hpte_update(pte_t *ptep, unsigned long pte, int wrprot); ===== mm/memory.c 1.176 vs edited ===== --- 1.176/mm/memory.c Sat May 22 14:56:30 2004 +++ edited/mm/memory.c Tue May 25 10:37:19 2004 @@ -1004,7 +1004,7 @@ flush_cache_page(vma, address); entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)), vma); - ptep_establish(vma, address, page_table, entry); + ptep_establish(vma, address, page_table, entry, 1); update_mmu_cache(vma, address, entry); } @@ -1056,7 +1056,7 @@ flush_cache_page(vma, address); entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)), vma); - ptep_establish(vma, address, page_table, entry); + ptep_establish(vma, address, page_table, entry, 1); update_mmu_cache(vma, address, entry); pte_unmap(page_table); spin_unlock(&mm->page_table_lock); @@ -1646,7 +1646,7 @@ entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); - ptep_establish(vma, address, pte, entry); + ptep_establish(vma, address, pte, entry, write_access); update_mmu_cache(vma, address, entry); pte_unmap(pte); spin_unlock(&mm->page_table_lock); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 17:49 ` Linus Torvalds @ 2004-05-25 17:54 ` David S. Miller 2004-05-25 18:05 ` Linus Torvalds 0 siblings, 1 reply; 71+ messages in thread From: David S. Miller @ 2004-05-25 17:54 UTC (permalink / raw) To: Linus Torvalds Cc: wesolows, willy, andrea, benh, akpm, linux-kernel, mingo, bcrl, linux-mm, linux-arch On Tue, 25 May 2004 10:49:21 -0700 (PDT) Linus Torvalds <torvalds@osdl.org> wrote: > So what I can tell, the fix is really something like this (this does both > x86 and ppc64 just to show how two different approaches would handle it, > but I have literally _tested_ neither). > > What do people think? So on sparc32 sun4m we'd implement ptep_update_dirty_accessed() with some kind of loop using the swap instruction? That's in fact what I've always wanted, someway to easily integrate the usage of such a loop so that we could handle this problem on such systems. Keith? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 17:54 ` David S. Miller @ 2004-05-25 18:05 ` Linus Torvalds 2004-05-25 20:30 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 71+ messages in thread From: Linus Torvalds @ 2004-05-25 18:05 UTC (permalink / raw) To: David S. Miller Cc: wesolows, willy, andrea, benh, akpm, linux-kernel, mingo, bcrl, linux-mm, linux-arch On Tue, 25 May 2004, David S. Miller wrote: > On Tue, 25 May 2004 10:49:21 -0700 (PDT) > Linus Torvalds <torvalds@osdl.org> wrote: > > > So what I can tell, the fix is really something like this (this does both > > x86 and ppc64 just to show how two different approaches would handle it, > > but I have literally _tested_ neither). > > > > What do people think? > > So on sparc32 sun4m we'd implement ptep_update_dirty_accessed() with > some kind of loop using the swap instruction? Yes. Except that if everybody else uses atomic updates (including the hw walkers), _and_ "dirty" is true, then you can optimize that case to just to an atomic write (since we don't care what the previous contents were, and everybody else is guaranteed to honor the fact that we set all the bits. (And an independent optimization is obviously to not do the store at all if it is already has the new value, although that _should_ be the rare case, since if that was true I don't see why you got a page fault in the first place). So _if_ such an atomic loop is fundamentally expensive for some reason, it should be perfectly ok to do if (dirty) { one atomic write with all the bits set; } else { cmpxchg until successful; } Oh - btw - my suggested patch was totally broken for ppc64, because that "ptep_update_dirty_accessed()" thing obviously also needs to that damn hpte_update() crud etc. BenH - I'm leaving that ppc64 code to somebody knows what the hell he is doing. Ie you or Anton or something. Ok? I can act as a collector the different architecture things for that "ptep_update_dirty_accessed()" function. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 18:05 ` Linus Torvalds @ 2004-05-25 20:30 ` Linus Torvalds 2004-05-25 20:35 ` David S. Miller 2004-05-25 21:40 ` Benjamin Herrenschmidt 2 siblings, 0 replies; 71+ messages in thread From: Linus Torvalds @ 2004-05-25 20:30 UTC (permalink / raw) To: David S. Miller Cc: wesolows, willy, andrea, benh, akpm, linux-kernel, mingo, bcrl, linux-mm, linux-arch On Tue, 25 May 2004, Linus Torvalds wrote: > > BenH - I'm leaving that ppc64 code to somebody knows what the hell he is > doing. Ie you or Anton or something. Ok? I can act as a collector the > different architecture things for that "ptep_update_dirty_accessed()" > function. Following up to myself. I just committed a couple of trivial changesets that allows any architecture to re-define its own "ptep_update_dirty_accessed()" method. The default one (if none is defined by the architecture) is just #ifndef ptep_update_dirty_accessed #define ptep_update_dirty_accessed(__ptep, __entry, __dirty) set_pte(__ptep, __entry) #endif ie no change in behaviour. As an example of an alternate strategy, this is the one I committed for x86: #define ptep_update_dirty_accessed(__ptep, __entry, __dirty) \ do { \ if (__dirty) set_pte(__ptep, __entry); \ } while (0) which is valid if the architecture updates its own accessed bits. I just realized that for x86 the _clever_ way of doing this (for highmem machines) is actually to only update the low word, which makes for much better code for the PAE case (and still does exactle the same for the non-PAE case): #define ptep_update_dirty_accessed(__ptep, __entry, __dirty) \ do { \ if (__dirty) (__ptep)->pte_low = (__entry).pte_low; \ } while (0) but I haven't actually tested this. Anybody willing to test the x86 PAE optimization? In the meantime, other architectures can now fix their dirty/accessed bit setting any way they damn well please. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 18:05 ` Linus Torvalds 2004-05-25 20:30 ` Linus Torvalds @ 2004-05-25 20:35 ` David S. Miller 2004-05-25 20:49 ` Linus Torvalds 2004-05-25 21:40 ` Benjamin Herrenschmidt 2 siblings, 1 reply; 71+ messages in thread From: David S. Miller @ 2004-05-25 20:35 UTC (permalink / raw) To: Linus Torvalds, wesolows Cc: willy, andrea, benh, akpm, linux-kernel, mingo, bcrl, linux-mm, linux-arch On Tue, 25 May 2004 11:05:09 -0700 (PDT) Linus Torvalds <torvalds@osdl.org> wrote: > On Tue, 25 May 2004, David S. Miller wrote: > > So on sparc32 sun4m we'd implement ptep_update_dirty_accessed() with > > some kind of loop using the swap instruction? > > Yes. Except that if everybody else uses atomic updates (including the hw > walkers), _and_ "dirty" is true, then you can optimize that case to just > to an atomic write (since we don't care what the previous contents were, > and everybody else is guaranteed to honor the fact that we set all the > bits. > > (And an independent optimization is obviously to not do the store at all > if it is already has the new value, although that _should_ be the rare > case, since if that was true I don't see why you got a page fault in the > first place). > > So _if_ such an atomic loop is fundamentally expensive for some reason, it > should be perfectly ok to do > > if (dirty) { > one atomic write with all the bits set; > } else { > cmpxchg until successful; > } Hmmm, do you understand how broken the sparc hardware is? :-) Seriously, the issue is that the MMU writes back access/dirty bits asynchronously, does not do a relookup when it writes these bits back into the PTE (like x86 and others do) it actually stores away the PTE physical address and writes into the PTE using that, and finally as previously mentioned we lack a cmpxchg we only have raw SWAP. Keith W. can verify this, he has my old SuperSPARC manual which explains all of this stuff. Keith you might want to quote that little "atomic PTE update sequence" piece of code that's in the manual for Linus. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 20:35 ` David S. Miller @ 2004-05-25 20:49 ` Linus Torvalds 2004-05-25 20:57 ` David S. Miller 2004-05-26 6:20 ` Keith M Wesolowski 0 siblings, 2 replies; 71+ messages in thread From: Linus Torvalds @ 2004-05-25 20:49 UTC (permalink / raw) To: David S. Miller Cc: wesolows, willy, andrea, benh, akpm, linux-kernel, mingo, bcrl, linux-mm, linux-arch On Tue, 25 May 2004, David S. Miller wrote: > > Hmmm, do you understand how broken the sparc hardware is? :-) I seem to always repress that part. > Seriously, the issue is that the MMU writes back access/dirty bits > asynchronously, does not do a relookup when it writes these bits back > into the PTE (like x86 and others do) it actually stores away the PTE > physical address and writes into the PTE using that, and finally as > previously mentioned we lack a cmpxchg we only have raw SWAP. Ok. Still, that doesn't sound too bad. I assume that the pte write has to be atomic anyway (ie it doesn't walk the page tables, but it clearly _has_ to do an atomic "read-modify-update" or just the _hardware_ updates might race against each other and one CPU loses the dirty bit when another CPU writes back the accessed bit). So I really think it should be safe to just do a regular write when you update both bits, because you know that no other CPU will at least _clear_ any bits. Hmm? So it sounds like the SWAP loop basically ends up being just something like val = pte_value(entry); for (;;) { oldval = SWAP(ptep, val); /* * If we wrote a value that had the same or more bits set * than the old value, we're ok... */ if (!(oldval & ~val)) break; /* * ..otherwise we need to write the "or" of all bits and * try again. */ val |= oldval; } Right? Note that the reason we can do the "dirty and accessed bit both set" case with a simple write is exactly because that's already the "maximal" bits anybody can write to the thing, so we don't need to loop, we can just write it directly. I realize that this isn't as simple as the x86 solution (or most everything else, for that matter ;), but it's by no means totally unreasonable. Or are there even _more_ gotchas that I've missed? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 20:49 ` Linus Torvalds @ 2004-05-25 20:57 ` David S. Miller 2004-05-26 6:20 ` Keith M Wesolowski 1 sibling, 0 replies; 71+ messages in thread From: David S. Miller @ 2004-05-25 20:57 UTC (permalink / raw) To: Linus Torvalds Cc: wesolows, willy, andrea, benh, akpm, linux-kernel, mingo, bcrl, linux-mm, linux-arch On Tue, 25 May 2004 13:49:57 -0700 (PDT) Linus Torvalds <torvalds@osdl.org> wrote: > So it sounds like the SWAP loop basically ends up being just something > like ... > Right? Note that the reason we can do the "dirty and accessed bit both > set" case with a simple write is exactly because that's already the > "maximal" bits anybody can write to the thing, so we don't need to loop, > we can just write it directly. I believe so. So it's still possible for the mmu to write something with less bits, ie. say we're adding dirty, we write dirty+accessed but the TLB has the pre-dirty PTE and writes that with just the accessed bit set. And this is exactly what you code is trying to handle. Perfect. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 20:49 ` Linus Torvalds 2004-05-25 20:57 ` David S. Miller @ 2004-05-26 6:20 ` Keith M Wesolowski 1 sibling, 0 replies; 71+ messages in thread From: Keith M Wesolowski @ 2004-05-26 6:20 UTC (permalink / raw) To: Linus Torvalds Cc: David S. Miller, willy, andrea, benh, akpm, linux-kernel, mingo, bcrl, linux-mm On Tue, May 25, 2004 at 01:49:57PM -0700, Linus Torvalds wrote: > Ok. Still, that doesn't sound too bad. I assume that the pte write has to > be atomic anyway (ie it doesn't walk the page tables, but it clearly _has_ > to do an atomic "read-modify-update" or just the _hardware_ updates might > race against each other and one CPU loses the dirty bit when another CPU > writes back the accessed bit). Correct, we don't have to worry about that. The various chips either (a) lock the bus during the entire tablewalk, or (b) set the accessed bit atomically and use a normal write when setting both bits. > So I really think it should be safe to just do a regular write when you > update both bits, because you know that no other CPU will at least _clear_ > any bits. Hmm? Yes, and this is how our hardware at least works. > So it sounds like the SWAP loop basically ends up being just something > like > > val = pte_value(entry); > for (;;) { > oldval = SWAP(ptep, val); > /* > * If we wrote a value that had the same or more bits set > * than the old value, we're ok... > */ > if (!(oldval & ~val)) > break; > /* > * ..otherwise we need to write the "or" of all bits and > * try again. > */ > val |= oldval; > } > > Right? Note that the reason we can do the "dirty and accessed bit both > set" case with a simple write is exactly because that's already the > "maximal" bits anybody can write to the thing, so we don't need to loop, > we can just write it directly. The documentation actually provides a more general implementation for any kind of pte update. In this case it would logically reduce to your outline above. The actual outline in the manual is: val = pte_val(entry); for(;;) { /* Set the pte to 0 to discourage others */ oldval = 0; SWAP(ptep, oldval); /* Flush this entry from all MMUs */ flush_tlb_entry(ptep); /* Gather the existing bits */ val |= oldval; /* Did another MMU monkey with it since? */ if (!*ptep) break; } set_pte(ptep, entry); which seems far more complicated than necessary just to set these 1 or 2 bits. -- Keith M Wesolowski -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 18:05 ` Linus Torvalds 2004-05-25 20:30 ` Linus Torvalds 2004-05-25 20:35 ` David S. Miller @ 2004-05-25 21:40 ` Benjamin Herrenschmidt 2004-05-25 21:54 ` Linus Torvalds 2004-05-25 22:09 ` Linus Torvalds 2 siblings, 2 replies; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-25 21:40 UTC (permalink / raw) To: Linus Torvalds Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list > Oh - btw - my suggested patch was totally broken for ppc64, because that > "ptep_update_dirty_accessed()" thing obviously also needs to that damn > hpte_update() crud etc. > > BenH - I'm leaving that ppc64 code to somebody knows what the hell he is > doing. Ie you or Anton or something. Ok? I can act as a collector the > different architecture things for that "ptep_update_dirty_accessed()" > function. Well, just setting one of those 2 bits doesn't require a hash table invalidate as long as nothing else changes. We do dirty by mapping r/o in the hash table, and accessed on hash faults (our clear_young triggers a flush). So just setting those bits in the linux PTE without touching the hash table is fine, we'll just possibly take an extra fault on the next write or access, but that might not be much slower than going to the hash update the permissions directly The original problem I have with set_pte is that our current implementation of set_pte will overwrite the entire PTE, possibly losing the bits that indicate that there is a copy in the hash and its index in the hash bucket. So if set_pte is called on a PTE that is present, we must flush it properly first as we will lose track of the hash one when overriding. hpte_update() will simply add the old PTE to a batch which is then flushed by either the mmu gather batch end, or by a call to flush_tlb_* Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 21:40 ` Benjamin Herrenschmidt @ 2004-05-25 21:54 ` Linus Torvalds 2004-05-25 22:00 ` Linus Torvalds 2004-05-25 22:05 ` [PATCH] " Benjamin Herrenschmidt 2004-05-25 22:09 ` Linus Torvalds 1 sibling, 2 replies; 71+ messages in thread From: Linus Torvalds @ 2004-05-25 21:54 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 26 May 2004, Benjamin Herrenschmidt wrote: > > Well, just setting one of those 2 bits doesn't require a hash table > invalidate as long as nothing else changes. Ok. And nothing ever writes to the SW page tables outside the page table lock, right? So on ppc64, we could just do #define ptep_update_dirty_accessed(ptep, entry, dirty) \ *(ptep) = (entry) and be done with it. No? I'm not going to do it without a big ack from you. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 21:54 ` Linus Torvalds @ 2004-05-25 22:00 ` Linus Torvalds 2004-05-25 22:07 ` Benjamin Herrenschmidt 2004-05-25 22:05 ` [PATCH] " Benjamin Herrenschmidt 1 sibling, 1 reply; 71+ messages in thread From: Linus Torvalds @ 2004-05-25 22:00 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Tue, 25 May 2004, Linus Torvalds wrote: > > > On Wed, 26 May 2004, Benjamin Herrenschmidt wrote: > > > > Well, just setting one of those 2 bits doesn't require a hash table > > invalidate as long as nothing else changes. > > Ok. And nothing ever writes to the SW page tables outside the page table > lock, right? So on ppc64, we could just do > > #define ptep_update_dirty_accessed(ptep, entry, dirty) \ > *(ptep) = (entry) > > and be done with it. No? No. I'm an idiot. We do need to keep the "hashed" bit atomically, so it would need to be something on the order of static inline void ptep_update_dirty_accessed(pte_t *ptep, pte_t entry, int dirty) { unsigned long bits = pte_value(entry) & (_PAGE_DIRTY | _PAGE_ACCESSED); unsigned long tmp; __asm__ __volatile__( "1: lwarx %0,0,%3\n\ or %0,%2,%0\n\ stwcx. %0,0,%3\n\ bne- 1b" :"=&r" (tmp), "=m" (*ptep) :"r" (bits), "r" (ptep) :"cc"); } /* Make asm-generic/pgtable.h know about it.. */ #define ptep_update_dirty_accessed ptep_update_dirty_accessed or whatever. Anyway, I don't feel competent enough to commit it, so.. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 22:00 ` Linus Torvalds @ 2004-05-25 22:07 ` Benjamin Herrenschmidt 2004-05-25 22:14 ` Linus Torvalds 0 siblings, 1 reply; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-25 22:07 UTC (permalink / raw) To: Linus Torvalds Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list > static inline void ptep_update_dirty_accessed(pte_t *ptep, pte_t entry, int dirty) > { > unsigned long bits = pte_value(entry) & (_PAGE_DIRTY | _PAGE_ACCESSED); > unsigned long tmp; > > __asm__ __volatile__( > "1: lwarx %0,0,%3\n\ > or %0,%2,%0\n\ > stwcx. %0,0,%3\n\ > bne- 1b" > :"=&r" (tmp), "=m" (*ptep) > :"r" (bits), "r" (ptep) > :"cc"); > } > > /* Make asm-generic/pgtable.h know about it.. */ > #define ptep_update_dirty_accessed ptep_update_dirty_accessed That looks good on a first look, I need to re-read the whole discussion since yesterday through to make sure I didn't miss anything. Note that I'd rather call the function ptep_set_* than ptep_update_* to make clear that it can only ever be used to _set_ those bits. Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 22:07 ` Benjamin Herrenschmidt @ 2004-05-25 22:14 ` Linus Torvalds 2004-05-26 0:21 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 71+ messages in thread From: Linus Torvalds @ 2004-05-25 22:14 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 26 May 2004, Benjamin Herrenschmidt wrote: > > Note that I'd rather call the function ptep_set_* than ptep_update_* to > make clear that it can only ever be used to _set_ those bits. Good point. Too late. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 22:14 ` Linus Torvalds @ 2004-05-26 0:21 ` Benjamin Herrenschmidt 2004-05-26 0:50 ` Linus Torvalds 0 siblings, 1 reply; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-26 0:21 UTC (permalink / raw) To: Linus Torvalds Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 2004-05-26 at 08:14, Linus Torvalds wrote: > On Wed, 26 May 2004, Benjamin Herrenschmidt wrote: > > > > Note that I'd rather call the function ptep_set_* than ptep_update_* to > > make clear that it can only ever be used to _set_ those bits. > > Good point. > > Too late. Heh, I can still send a patch "fixing" it if you want ;) Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-26 0:21 ` Benjamin Herrenschmidt @ 2004-05-26 0:50 ` Linus Torvalds 2004-05-26 3:25 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 71+ messages in thread From: Linus Torvalds @ 2004-05-26 0:50 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 26 May 2004, Benjamin Herrenschmidt wrote: > > Heh, I can still send a patch "fixing" it if you want ;) If you include a tested version of the ppc64 fix, I'd likely apply that. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-26 0:50 ` Linus Torvalds @ 2004-05-26 3:25 ` Benjamin Herrenschmidt 2004-05-26 4:08 ` Linus Torvalds 0 siblings, 1 reply; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-26 3:25 UTC (permalink / raw) To: Linus Torvalds Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 2004-05-26 at 10:50, Linus Torvalds wrote: > On Wed, 26 May 2004, Benjamin Herrenschmidt wrote: > > > > Heh, I can still send a patch "fixing" it if you want ;) > > If you include a tested version of the ppc64 fix, I'd likely apply that. Ok, that doesn't work. Trigger BUG_ON in rmap.c:421 I think we are using ptep_establish for more than just setting those 2 bits (like for setting up the new PTE in break_cow and such while the current implementationL/definition is more like just setting those bits and nothing else.... If we end up doing more than setting those bits, we need to flush the PTE completely from the hash etc.... Anyway, here's the non working patch, including the rename. Ben. ===== include/asm-generic/pgtable.h 1.5 vs edited ===== --- 1.5/include/asm-generic/pgtable.h 2004-05-26 06:04:54 +10:00 +++ edited/include/asm-generic/pgtable.h 2004-05-26 10:58:17 +10:00 @@ -3,8 +3,8 @@ #ifndef __HAVE_ARCH_PTEP_ESTABLISH -#ifndef ptep_update_dirty_accessed -#define ptep_update_dirty_accessed(__ptep, __entry, __dirty) set_pte(__ptep, __entry) +#ifndef ptep_set_dirty_accessed +#define ptep_set_dirty_accessed(__ptep, __entry, __dirty) set_pte(__ptep, __entry) #endif /* @@ -17,7 +17,7 @@ */ #define ptep_establish(__vma, __address, __ptep, __entry, __dirty) \ do { \ - ptep_update_dirty_accessed(__ptep, __entry, __dirty); \ + ptep_set_dirty_accessed(__ptep, __entry, __dirty); \ flush_tlb_page(__vma, __address); \ } while (0) #endif ===== include/asm-i386/pgtable.h 1.45 vs edited ===== --- 1.45/include/asm-i386/pgtable.h 2004-05-26 06:04:54 +10:00 +++ edited/include/asm-i386/pgtable.h 2004-05-26 10:59:12 +10:00 @@ -325,7 +325,7 @@ * bit at the same time. */ #define update_mmu_cache(vma,address,pte) do { } while (0) -#define ptep_update_dirty_accessed(__ptep, __entry, __dirty) \ +#define ptep_set_dirty_accessed(__ptep, __entry, __dirty) \ do { \ if (__dirty) set_pte(__ptep, __entry); \ } while (0) ===== include/asm-ppc64/pgtable.h 1.33 vs edited ===== --- 1.33/include/asm-ppc64/pgtable.h 2004-05-23 07:56:24 +10:00 +++ edited/include/asm-ppc64/pgtable.h 2004-05-26 13:09:59 +10:00 @@ -306,7 +306,10 @@ return old; } -/* PTE updating functions */ +/* PTE updating functions, this function puts the PTE in the + * batch, doesn't actually triggers the hash flush immediately, + * you need to call flush_tlb_pending() to do that. + */ extern void hpte_update(pte_t *ptep, unsigned long pte, int wrprot); static inline int ptep_test_and_clear_young(pte_t *ptep) @@ -318,7 +321,7 @@ old = pte_update(ptep, _PAGE_ACCESSED); if (old & _PAGE_HASHPTE) { hpte_update(ptep, old, 0); - flush_tlb_pending(); /* XXX generic code doesn't flush */ + flush_tlb_pending(); } return (old & _PAGE_ACCESSED) != 0; } @@ -396,10 +399,36 @@ */ static inline void set_pte(pte_t *ptep, pte_t pte) { - if (pte_present(*ptep)) + if (pte_present(*ptep)) { pte_clear(ptep); + flush_tlb_pending(); + } *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS; } + +/* Set the dirty and/or accessed bits atomically in a linux PTE, this + * function doesn't need to flush the hash entry + */ +static inline void ptep_set_dirty_accessed(pte_t *ptep, pte_t entry, int dirty) +{ + unsigned long bits = pte_val(entry) & (_PAGE_DIRTY | _PAGE_ACCESSED); + unsigned long tmp; + + WARN_ON(!pte_present(*ptep)); + + __asm__ __volatile__( + "1: ldarx %0,0,%3\n\ + or %0,%2,%0\n\ + stdcx. %0,0,%3\n\ + bne- 1b" + :"=&r" (tmp), "=m" (*ptep) + :"r" (bits), "r" (ptep) + :"cc"); +} + +/* Make asm-generic/pgtable.h know about it.. */ +#define ptep_set_dirty_accessed ptep_set_dirty_accessed + /* * Macro to mark a page protection value as "uncacheable". -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-26 3:25 ` Benjamin Herrenschmidt @ 2004-05-26 4:08 ` Linus Torvalds 2004-05-26 4:12 ` Benjamin Herrenschmidt 2004-05-26 4:46 ` Benjamin Herrenschmidt 0 siblings, 2 replies; 71+ messages in thread From: Linus Torvalds @ 2004-05-26 4:08 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 26 May 2004, Benjamin Herrenschmidt wrote: > > I think we are using ptep_establish for more than just setting those > 2 bits (like for setting up the new PTE in break_cow and such while > the current implementationL/definition is more like just setting > those bits and nothing else.... You're right. We do use it on the do_wp_page() path, and there we actually use a whole new page in the "break_cow()" case. That case is in fact fundamentally different from the other ones. So we should probably break up the "ptep_establish()" into its two pieces, since the callers don't actually want to do the same thing. One really wants to do a "clear old one, set a totally new one", and the two other places want to actually update just the dirty and accessed bits. In fact, the only non-generic user of "ptep_establish()" (s390) didn't want to use the generic version exactly because of this very conceptual bug. It uses "ptep_clear_flush()" for the replacement case, which actually makes sense. So does it work if you do this appended patch first? This is a real cleanup, and I think it will allow us to get rid of the s390-specific code in ptep_establish(). Along with hopefully fixing your problem too. After this, we should be able to have a BUG() in "set_pte()" if the entry wasn't clear before (assuming the arch doesn't use set_pte() for the dirty updates etc). Linus --- ===== mm/memory.c 1.177 vs edited ===== --- 1.177/mm/memory.c Tue May 25 12:37:09 2004 +++ edited/mm/memory.c Tue May 25 21:04:49 2004 @@ -1004,7 +1004,10 @@ flush_cache_page(vma, address); entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)), vma); - ptep_establish(vma, address, page_table, entry, 1); + + /* Get rid of the old entry, replace with new */ + ptep_clear_flush(vma, address, page_table); + set_pte(page_table, entry); update_mmu_cache(vma, address, entry); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-26 4:08 ` Linus Torvalds @ 2004-05-26 4:12 ` Benjamin Herrenschmidt 2004-05-26 4:18 ` Benjamin Herrenschmidt 2004-05-26 4:28 ` Linus Torvalds 2004-05-26 4:46 ` Benjamin Herrenschmidt 1 sibling, 2 replies; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-26 4:12 UTC (permalink / raw) To: Linus Torvalds Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 2004-05-26 at 14:08, Linus Torvalds wrote: > You're right. We do use it on the do_wp_page() path, and there we actually > use a whole new page in the "break_cow()" case. That case is in fact > fundamentally different from the other ones. > > So we should probably break up the "ptep_establish()" into its two pieces, > since the callers don't actually want to do the same thing. One really > wants to do a "clear old one, set a totally new one", and the two other > places want to actually update just the dirty and accessed bits. The first one could still be called "pte_establish" ... I mean, it makes little sense to continue calling "pte_establish" something that just does set one of those 2 bits... And the flush done by pte_establish in this case is useless on ppc... so I'd rather kill pte_establish completely instead and define it's arch (or generic) impl. of ptep_set_dirty_accessed() responsibility to do the TLB flush if necessary, no ? (well, a call to it on ppc isn't expensive if we didn't add anything to the batch anyway...) > In fact, the only non-generic user of "ptep_establish()" (s390) didn't > want to use the generic version exactly because of this very conceptual > bug. It uses "ptep_clear_flush()" for the replacement case, which actually > makes sense. > > So does it work if you do this appended patch first? This is a real > cleanup, and I think it will allow us to get rid of the s390-specific code > in ptep_establish(). Along with hopefully fixing your problem too. > > After this, we should be able to have a BUG() in "set_pte()" if the entry > wasn't clear before (assuming the arch doesn't use set_pte() for the dirty > updates etc). Ok, I'll give it a spin. > Linus > > --- > ===== mm/memory.c 1.177 vs edited ===== > --- 1.177/mm/memory.c Tue May 25 12:37:09 2004 > +++ edited/mm/memory.c Tue May 25 21:04:49 2004 > @@ -1004,7 +1004,10 @@ > flush_cache_page(vma, address); > entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)), > vma); > - ptep_establish(vma, address, page_table, entry, 1); > + > + /* Get rid of the old entry, replace with new */ > + ptep_clear_flush(vma, address, page_table); > + set_pte(page_table, entry); > update_mmu_cache(vma, address, entry); > } > -- Benjamin Herrenschmidt <benh@kernel.crashing.org> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-26 4:12 ` Benjamin Herrenschmidt @ 2004-05-26 4:18 ` Benjamin Herrenschmidt 2004-05-26 4:50 ` Linus Torvalds 2004-05-26 4:28 ` Linus Torvalds 1 sibling, 1 reply; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-26 4:18 UTC (permalink / raw) To: Linus Torvalds Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 2004-05-26 at 14:12, Benjamin Herrenschmidt wrote: > On Wed, 2004-05-26 at 14:08, Linus Torvalds wrote: > > > You're right. We do use it on the do_wp_page() path, and there we actually > > use a whole new page in the "break_cow()" case. That case is in fact > > fundamentally different from the other ones. > > > > So we should probably break up the "ptep_establish()" into its two pieces, > > since the callers don't actually want to do the same thing. One really > > wants to do a "clear old one, set a totally new one", and the two other > > places want to actually update just the dirty and accessed bits. Hrm... Still dies, some kind of loop it seems, I'll have a look Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-26 4:18 ` Benjamin Herrenschmidt @ 2004-05-26 4:50 ` Linus Torvalds 2004-05-26 4:49 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 71+ messages in thread From: Linus Torvalds @ 2004-05-26 4:50 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 26 May 2004, Benjamin Herrenschmidt wrote: > > Hrm... Still dies, some kind of loop it seems, I'll have a look Are you sure the it's not just taking infinite page fault because we keep reloading the old value from the hash tables? That "hash fault" thing still doesn't convince me. Why should the hash-refill fastpath ever look at the software page tables? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-26 4:50 ` Linus Torvalds @ 2004-05-26 4:49 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-26 4:49 UTC (permalink / raw) To: Linus Torvalds Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 2004-05-26 at 14:50, Linus Torvalds wrote: > On Wed, 26 May 2004, Benjamin Herrenschmidt wrote: > > > > Hrm... Still dies, some kind of loop it seems, I'll have a look > > Are you sure the it's not just taking infinite page fault because we keep > reloading the old value from the hash tables? That "hash fault" thing > still doesn't convince me. Why should the hash-refill fastpath ever look > at the software page tables? Where do you think the hash PTE is filled from ? :) We even use one of the linux PTE bits as a lock bit during the hash refill (the PAGE_BUSY bit) to avoid a race where we can end up filling more than one hash entry from the same linux PTE. Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-26 4:12 ` Benjamin Herrenschmidt 2004-05-26 4:18 ` Benjamin Herrenschmidt @ 2004-05-26 4:28 ` Linus Torvalds 1 sibling, 0 replies; 71+ messages in thread From: Linus Torvalds @ 2004-05-26 4:28 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 26 May 2004, Benjamin Herrenschmidt wrote: > > The first one could still be called "ptep_establish" Yes. And the other places should be called "ptep_set_dirty_access()" (and you might verify that those bits are the only ones that change, if you want to make sure). Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-26 4:08 ` Linus Torvalds 2004-05-26 4:12 ` Benjamin Herrenschmidt @ 2004-05-26 4:46 ` Benjamin Herrenschmidt 2004-05-26 4:54 ` Linus Torvalds 1 sibling, 1 reply; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-26 4:46 UTC (permalink / raw) To: Linus Torvalds Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list Ok, your patch just missed the do_wp_page() case which needs the same medication as break_cow(), which leaves us with only one caller of ptep_establish... the one which just sets those 2 bits and shouldn't be named ptep establish at all ;) What do we do ? I'd rather have ptep_establish do the ptep_clear_flush & set_pte, and the later just do the bit flip, but then, the arch (and possibly generic) implementation of that bit flip must also do the TLB flush when necessary (it's not on ppc). Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-26 4:46 ` Benjamin Herrenschmidt @ 2004-05-26 4:54 ` Linus Torvalds 2004-05-26 4:55 ` Benjamin Herrenschmidt ` (2 more replies) 0 siblings, 3 replies; 71+ messages in thread From: Linus Torvalds @ 2004-05-26 4:54 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 26 May 2004, Benjamin Herrenschmidt wrote: > > Ok, your patch just missed the do_wp_page() case which needs the same > medication as break_cow(), which leaves us with only one caller of > ptep_establish... the one which just sets those 2 bits and shouldn't > be named ptep establish at all ;) Actually, the do_wp_page() one doesn't change the page, but it potentially changes _three_ bits: accessed, dirty _and_ writable. But the fix on ppc64 should be to add the writable bit to the mask of things, and rename the function to reflect that fact. > I'd rather have ptep_establish do the ptep_clear_flush & set_pte, and > the later just do the bit flip, but then, the arch (and possibly > generic) implementation of that bit flip must also do the TLB flush > when necessary (it's not on ppc). I agree on the renaming, but please don't use the ptep_clear_flush() in do_wp_page(), because it really isn't changing the virtual mapping, it's only a (slightly) extended case of the same old "set another bit". Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-26 4:54 ` Linus Torvalds @ 2004-05-26 4:55 ` Benjamin Herrenschmidt 2004-05-26 5:41 ` Benjamin Herrenschmidt 2004-05-26 5:59 ` [PATCH] (signoff) " Benjamin Herrenschmidt 2 siblings, 0 replies; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-26 4:55 UTC (permalink / raw) To: Linus Torvalds Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 2004-05-26 at 14:54, Linus Torvalds wrote: > On Wed, 26 May 2004, Benjamin Herrenschmidt wrote: > > > > Ok, your patch just missed the do_wp_page() case which needs the same > > medication as break_cow(), which leaves us with only one caller of > > ptep_establish... the one which just sets those 2 bits and shouldn't > > be named ptep establish at all ;) > > Actually, the do_wp_page() one doesn't change the page, but it potentially > changes _three_ bits: accessed, dirty _and_ writable. Hrm... that would work if we only ever set writable, so yes. > But the fix on ppc64 should be to add the writable bit to the mask of > things, and rename the function to reflect that fact. > > > I'd rather have ptep_establish do the ptep_clear_flush & set_pte, and > > the later just do the bit flip, but then, the arch (and possibly > > generic) implementation of that bit flip must also do the TLB flush > > when necessary (it's not on ppc). > > I agree on the renaming, but please don't use the ptep_clear_flush() in > do_wp_page(), because it really isn't changing the virtual mapping, it's > only a (slightly) extended case of the same old "set another bit". Ok, I just wasn't sure what this function was all about. I'll cook a new patch including all the changes we discussed, may take some time, but hopefully you'll get it today. Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-26 4:54 ` Linus Torvalds 2004-05-26 4:55 ` Benjamin Herrenschmidt @ 2004-05-26 5:41 ` Benjamin Herrenschmidt 2004-05-26 5:59 ` [PATCH] (signoff) " Benjamin Herrenschmidt 2 siblings, 0 replies; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-26 5:41 UTC (permalink / raw) To: Linus Torvalds Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list Ok, here it is. I blasted ptep_establish completely and the macro thing and changed the function to ptep_set_access_flags() which becomes also responsible for the flush if necesary. Overall, I feel it's more consistent this way with the other stuff in there. I did the ppc64 impl, the x86 one (hope I got it right). I still need to do ppc32 and I suppose s390 must be fixed now that ptep_estabish is gone but I'll leave that to someone who understand something about these things ;) Here it is, boots on g5 : ===== include/asm-generic/pgtable.h 1.5 vs edited ===== --- 1.5/include/asm-generic/pgtable.h 2004-05-26 06:04:54 +10:00 +++ edited/include/asm-generic/pgtable.h 2004-05-26 15:37:02 +10:00 @@ -1,24 +1,11 @@ #ifndef _ASM_GENERIC_PGTABLE_H #define _ASM_GENERIC_PGTABLE_H -#ifndef __HAVE_ARCH_PTEP_ESTABLISH - -#ifndef ptep_update_dirty_accessed -#define ptep_update_dirty_accessed(__ptep, __entry, __dirty) set_pte(__ptep, __entry) -#endif - -/* - * Establish a new mapping: - * - flush the old one - * - update the page tables - * - inform the TLB about the new one - * - * We hold the mm semaphore for reading and vma->vm_mm->page_table_lock - */ -#define ptep_establish(__vma, __address, __ptep, __entry, __dirty) \ -do { \ - ptep_update_dirty_accessed(__ptep, __entry, __dirty); \ - flush_tlb_page(__vma, __address); \ +#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS +#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \ +do { \ + set_pte(__ptep, __entry); \ + flush_tlb_page(__vma, __address); \ } while (0) #endif ===== include/asm-i386/pgtable.h 1.45 vs edited ===== --- 1.45/include/asm-i386/pgtable.h 2004-05-26 06:04:54 +10:00 +++ edited/include/asm-i386/pgtable.h 2004-05-26 15:34:57 +10:00 @@ -325,9 +325,13 @@ * bit at the same time. */ #define update_mmu_cache(vma,address,pte) do { } while (0) -#define ptep_update_dirty_accessed(__ptep, __entry, __dirty) \ - do { \ - if (__dirty) set_pte(__ptep, __entry); \ +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS +#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \ + do { \ + if (__dirty) { \ + set_pte(__ptep, __entry); \ + flush_tlb_page(__vma, __address); \ + } \ } while (0) /* Encode and de-code a swap entry */ ===== include/asm-ppc64/pgtable.h 1.33 vs edited ===== --- 1.33/include/asm-ppc64/pgtable.h 2004-05-23 07:56:24 +10:00 +++ edited/include/asm-ppc64/pgtable.h 2004-05-26 15:32:38 +10:00 @@ -306,7 +306,10 @@ return old; } -/* PTE updating functions */ +/* PTE updating functions, this function puts the PTE in the + * batch, doesn't actually triggers the hash flush immediately, + * you need to call flush_tlb_pending() to do that. + */ extern void hpte_update(pte_t *ptep, unsigned long pte, int wrprot); static inline int ptep_test_and_clear_young(pte_t *ptep) @@ -318,7 +321,7 @@ old = pte_update(ptep, _PAGE_ACCESSED); if (old & _PAGE_HASHPTE) { hpte_update(ptep, old, 0); - flush_tlb_pending(); /* XXX generic code doesn't flush */ + flush_tlb_pending(); } return (old & _PAGE_ACCESSED) != 0; } @@ -396,10 +399,34 @@ */ static inline void set_pte(pte_t *ptep, pte_t pte) { - if (pte_present(*ptep)) + if (pte_present(*ptep)) { pte_clear(ptep); + flush_tlb_pending(); + } *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS; } + +/* Set the dirty and/or accessed bits atomically in a linux PTE, this + * function doesn't need to flush the hash entry + */ +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS +static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry, int dirty) +{ + unsigned long bits = pte_val(entry) & + (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW); + unsigned long tmp; + + __asm__ __volatile__( + "1: ldarx %0,0,%3\n\ + or %0,%2,%0\n\ + stdcx. %0,0,%3\n\ + bne- 1b" + :"=&r" (tmp), "=m" (*ptep) + :"r" (bits), "r" (ptep) + :"cc"); +} +#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \ + __ptep_set_access_flags(__ptep, __entry, __dirty) /* * Macro to mark a page protection value as "uncacheable". ===== mm/memory.c 1.177 vs edited ===== --- 1.177/mm/memory.c 2004-05-26 05:37:09 +10:00 +++ edited/mm/memory.c 2004-05-26 15:35:40 +10:00 @@ -1004,7 +1004,11 @@ flush_cache_page(vma, address); entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)), vma); - ptep_establish(vma, address, page_table, entry, 1); + + /* Get rid of the old entry, replace with new */ + ptep_clear_flush(vma, address, page_table); + set_pte(page_table, entry); + update_mmu_cache(vma, address, entry); } @@ -1056,7 +1060,7 @@ flush_cache_page(vma, address); entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)), vma); - ptep_establish(vma, address, page_table, entry, 1); + ptep_set_access_flags(vma, address, page_table, entry, 1); update_mmu_cache(vma, address, entry); pte_unmap(page_table); spin_unlock(&mm->page_table_lock); @@ -1646,7 +1650,7 @@ entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); - ptep_establish(vma, address, pte, entry, write_access); + ptep_set_access_flags(vma, address, pte, entry, write_access); update_mmu_cache(vma, address, entry); pte_unmap(pte); spin_unlock(&mm->page_table_lock); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] (signoff) ppc64: Fix possible race with set_pte on a present PTE 2004-05-26 4:54 ` Linus Torvalds 2004-05-26 4:55 ` Benjamin Herrenschmidt 2004-05-26 5:41 ` Benjamin Herrenschmidt @ 2004-05-26 5:59 ` Benjamin Herrenschmidt 2004-05-26 6:55 ` Benjamin Herrenschmidt 2 siblings, 1 reply; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-26 5:59 UTC (permalink / raw) To: Linus Torvalds Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list (Same with signoff :) Ok, here it is. I blasted ptep_establish completely and the macro thing and changed the function to ptep_set_access_flags() which becomes also responsible for the flush if necesary. Overall, I feel it's more consistent this way with the other stuff in there. I did the ppc64 impl, the x86 one (hope I got it right). I still need to do ppc32 and I suppose s390 must be fixed now that ptep_estabish is gone but I'll leave that to someone who understand something about these things ;) Here it is, boots on g5 : Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- 1.5/include/asm-generic/pgtable.h 2004-05-26 06:04:54 +10:00 +++ edited/include/asm-generic/pgtable.h 2004-05-26 15:37:02 +10:00 @@ -1,24 +1,11 @@ #ifndef _ASM_GENERIC_PGTABLE_H #define _ASM_GENERIC_PGTABLE_H -#ifndef __HAVE_ARCH_PTEP_ESTABLISH - -#ifndef ptep_update_dirty_accessed -#define ptep_update_dirty_accessed(__ptep, __entry, __dirty) set_pte(__ptep, __entry) -#endif - -/* - * Establish a new mapping: - * - flush the old one - * - update the page tables - * - inform the TLB about the new one - * - * We hold the mm semaphore for reading and vma->vm_mm->page_table_lock - */ -#define ptep_establish(__vma, __address, __ptep, __entry, __dirty) \ -do { \ - ptep_update_dirty_accessed(__ptep, __entry, __dirty); \ - flush_tlb_page(__vma, __address); \ +#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS +#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \ +do { \ + set_pte(__ptep, __entry); \ + flush_tlb_page(__vma, __address); \ } while (0) #endif ===== include/asm-i386/pgtable.h 1.45 vs edited ===== --- 1.45/include/asm-i386/pgtable.h 2004-05-26 06:04:54 +10:00 +++ edited/include/asm-i386/pgtable.h 2004-05-26 15:34:57 +10:00 @@ -325,9 +325,13 @@ * bit at the same time. */ #define update_mmu_cache(vma,address,pte) do { } while (0) -#define ptep_update_dirty_accessed(__ptep, __entry, __dirty) \ - do { \ - if (__dirty) set_pte(__ptep, __entry); \ +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS +#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \ + do { \ + if (__dirty) { \ + set_pte(__ptep, __entry); \ + flush_tlb_page(__vma, __address); \ + } \ } while (0) /* Encode and de-code a swap entry */ ===== include/asm-ppc64/pgtable.h 1.33 vs edited ===== --- 1.33/include/asm-ppc64/pgtable.h 2004-05-23 07:56:24 +10:00 +++ edited/include/asm-ppc64/pgtable.h 2004-05-26 15:32:38 +10:00 @@ -306,7 +306,10 @@ return old; } -/* PTE updating functions */ +/* PTE updating functions, this function puts the PTE in the + * batch, doesn't actually triggers the hash flush immediately, + * you need to call flush_tlb_pending() to do that. + */ extern void hpte_update(pte_t *ptep, unsigned long pte, int wrprot); static inline int ptep_test_and_clear_young(pte_t *ptep) @@ -318,7 +321,7 @@ old = pte_update(ptep, _PAGE_ACCESSED); if (old & _PAGE_HASHPTE) { hpte_update(ptep, old, 0); - flush_tlb_pending(); /* XXX generic code doesn't flush */ + flush_tlb_pending(); } return (old & _PAGE_ACCESSED) != 0; } @@ -396,10 +399,34 @@ */ static inline void set_pte(pte_t *ptep, pte_t pte) { - if (pte_present(*ptep)) + if (pte_present(*ptep)) { pte_clear(ptep); + flush_tlb_pending(); + } *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS; } + +/* Set the dirty and/or accessed bits atomically in a linux PTE, this + * function doesn't need to flush the hash entry + */ +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS +static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry, int dirty) +{ + unsigned long bits = pte_val(entry) & + (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW); + unsigned long tmp; + + __asm__ __volatile__( + "1: ldarx %0,0,%3\n\ + or %0,%2,%0\n\ + stdcx. %0,0,%3\n\ + bne- 1b" + :"=&r" (tmp), "=m" (*ptep) + :"r" (bits), "r" (ptep) + :"cc"); +} +#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \ + __ptep_set_access_flags(__ptep, __entry, __dirty) /* * Macro to mark a page protection value as "uncacheable". ===== mm/memory.c 1.177 vs edited ===== --- 1.177/mm/memory.c 2004-05-26 05:37:09 +10:00 +++ edited/mm/memory.c 2004-05-26 15:35:40 +10:00 @@ -1004,7 +1004,11 @@ flush_cache_page(vma, address); entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)), vma); - ptep_establish(vma, address, page_table, entry, 1); + + /* Get rid of the old entry, replace with new */ + ptep_clear_flush(vma, address, page_table); + set_pte(page_table, entry); + update_mmu_cache(vma, address, entry); } @@ -1056,7 +1060,7 @@ flush_cache_page(vma, address); entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)), vma); - ptep_establish(vma, address, page_table, entry, 1); + ptep_set_access_flags(vma, address, page_table, entry, 1); update_mmu_cache(vma, address, entry); pte_unmap(page_table); spin_unlock(&mm->page_table_lock); @@ -1646,7 +1650,7 @@ entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); - ptep_establish(vma, address, pte, entry, write_access); + ptep_set_access_flags(vma, address, pte, entry, write_access); update_mmu_cache(vma, address, entry); pte_unmap(pte); spin_unlock(&mm->page_table_lock); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] (signoff) ppc64: Fix possible race with set_pte on a present PTE 2004-05-26 5:59 ` [PATCH] (signoff) " Benjamin Herrenschmidt @ 2004-05-26 6:55 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-26 6:55 UTC (permalink / raw) To: Linus Torvalds Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 2004-05-26 at 15:59, Benjamin Herrenschmidt wrote: > (Same with signoff :) > > Ok, here it is. I blasted ptep_establish completely and the macro > thing and changed the function to ptep_set_access_flags() which > becomes also responsible for the flush if necesary. Ok, ppc64 implementation is bogus, I just completely forgot that we have to also atomically wait for busy to be cleared. (Or we would race with the hash refill which does set busy atomically at the beginning of the opertation, but later updates the linux PTE non atomically & clears it). Brown paper bag since I'm the one who implemented the hash refill this way in the first place =P Here's a fixed version: Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- 1.5/include/asm-generic/pgtable.h 2004-05-26 06:04:54 +10:00 +++ edited/include/asm-generic/pgtable.h 2004-05-26 15:37:02 +10:00 @@ -1,24 +1,11 @@ #ifndef _ASM_GENERIC_PGTABLE_H #define _ASM_GENERIC_PGTABLE_H -#ifndef __HAVE_ARCH_PTEP_ESTABLISH - -#ifndef ptep_update_dirty_accessed -#define ptep_update_dirty_accessed(__ptep, __entry, __dirty) set_pte(__ptep, __entry) -#endif - -/* - * Establish a new mapping: - * - flush the old one - * - update the page tables - * - inform the TLB about the new one - * - * We hold the mm semaphore for reading and vma->vm_mm->page_table_lock - */ -#define ptep_establish(__vma, __address, __ptep, __entry, __dirty) \ -do { \ - ptep_update_dirty_accessed(__ptep, __entry, __dirty); \ - flush_tlb_page(__vma, __address); \ +#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS +#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \ +do { \ + set_pte(__ptep, __entry); \ + flush_tlb_page(__vma, __address); \ } while (0) #endif ===== include/asm-i386/pgtable.h 1.45 vs edited ===== --- 1.45/include/asm-i386/pgtable.h 2004-05-26 06:04:54 +10:00 +++ edited/include/asm-i386/pgtable.h 2004-05-26 15:34:57 +10:00 @@ -325,9 +325,13 @@ * bit at the same time. */ #define update_mmu_cache(vma,address,pte) do { } while (0) -#define ptep_update_dirty_accessed(__ptep, __entry, __dirty) \ - do { \ - if (__dirty) set_pte(__ptep, __entry); \ +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS +#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \ + do { \ + if (__dirty) { \ + set_pte(__ptep, __entry); \ + flush_tlb_page(__vma, __address); \ + } \ } while (0) /* Encode and de-code a swap entry */ ===== include/asm-ppc64/pgtable.h 1.33 vs edited ===== --- 1.33/include/asm-ppc64/pgtable.h 2004-05-23 07:56:24 +10:00 +++ edited/include/asm-ppc64/pgtable.h 2004-05-26 16:47:05 +10:00 @@ -306,7 +306,10 @@ return old; } -/* PTE updating functions */ +/* PTE updating functions, this function puts the PTE in the + * batch, doesn't actually triggers the hash flush immediately, + * you need to call flush_tlb_pending() to do that. + */ extern void hpte_update(pte_t *ptep, unsigned long pte, int wrprot); static inline int ptep_test_and_clear_young(pte_t *ptep) @@ -318,7 +321,7 @@ old = pte_update(ptep, _PAGE_ACCESSED); if (old & _PAGE_HASHPTE) { hpte_update(ptep, old, 0); - flush_tlb_pending(); /* XXX generic code doesn't flush */ + flush_tlb_pending(); } return (old & _PAGE_ACCESSED) != 0; } @@ -396,10 +399,36 @@ */ static inline void set_pte(pte_t *ptep, pte_t pte) { - if (pte_present(*ptep)) + if (pte_present(*ptep)) { pte_clear(ptep); + flush_tlb_pending(); + } *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS; } + +/* Set the dirty and/or accessed bits atomically in a linux PTE, this + * function doesn't need to flush the hash entry + */ +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS +static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry, int dirty) +{ + unsigned long bits = pte_val(entry) & + (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW); + unsigned long old, tmp; + + __asm__ __volatile__( + "1: ldarx %0,0,%4\n\ + andi. %1,%0,%6\n\ + bne- 1b \n\ + or %0,%3,%0\n\ + stdcx. %0,0,%4\n\ + bne- 1b" + :"=&r" (old), "=&r" (tmp), "=m" (*ptep) + :"r" (bits), "r" (ptep), "m" (ptep), "i" (_PAGE_BUSY) + :"cc"); +} +#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \ + __ptep_set_access_flags(__ptep, __entry, __dirty) /* * Macro to mark a page protection value as "uncacheable". ===== mm/memory.c 1.177 vs edited ===== --- 1.177/mm/memory.c 2004-05-26 05:37:09 +10:00 +++ edited/mm/memory.c 2004-05-26 15:35:40 +10:00 @@ -1004,7 +1004,11 @@ flush_cache_page(vma, address); entry = maybe_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot)), vma); - ptep_establish(vma, address, page_table, entry, 1); + + /* Get rid of the old entry, replace with new */ + ptep_clear_flush(vma, address, page_table); + set_pte(page_table, entry); + update_mmu_cache(vma, address, entry); } @@ -1056,7 +1060,7 @@ flush_cache_page(vma, address); entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)), vma); - ptep_establish(vma, address, page_table, entry, 1); + ptep_set_access_flags(vma, address, page_table, entry, 1); update_mmu_cache(vma, address, entry); pte_unmap(page_table); spin_unlock(&mm->page_table_lock); @@ -1646,7 +1650,7 @@ entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); - ptep_establish(vma, address, pte, entry, write_access); + ptep_set_access_flags(vma, address, pte, entry, write_access); update_mmu_cache(vma, address, entry); pte_unmap(pte); spin_unlock(&mm->page_table_lock); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 21:54 ` Linus Torvalds 2004-05-25 22:00 ` Linus Torvalds @ 2004-05-25 22:05 ` Benjamin Herrenschmidt 1 sibling, 0 replies; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-25 22:05 UTC (permalink / raw) To: Linus Torvalds Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 2004-05-26 at 07:54, Linus Torvalds wrote: > On Wed, 26 May 2004, Benjamin Herrenschmidt wrote: > > > > Well, just setting one of those 2 bits doesn't require a hash table > > invalidate as long as nothing else changes. > > Ok. And nothing ever writes to the SW page tables outside the page table > lock, right? So on ppc64, we could just do > > #define ptep_update_dirty_accessed(ptep, entry, dirty) \ > *(ptep) = (entry) > > and be done with it. No? > > I'm not going to do it without a big ack from you. No. The hash fault path will update the PTE dirty/accessed on a hash miss exception without holding the page table lock (acts a bit like a HW TLB as far as linux is concerned). That's why it needs to be atomic. Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 21:40 ` Benjamin Herrenschmidt 2004-05-25 21:54 ` Linus Torvalds @ 2004-05-25 22:09 ` Linus Torvalds 2004-05-25 22:19 ` Benjamin Herrenschmidt 1 sibling, 1 reply; 71+ messages in thread From: Linus Torvalds @ 2004-05-25 22:09 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 26 May 2004, Benjamin Herrenschmidt wrote: > > Well, just setting one of those 2 bits doesn't require a hash table > invalidate as long as nothing else changes. I'm starting to doubt this, because: > We do dirty by mapping r/o in the hash table, and accessed on hash > faults (our clear_young triggers a flush). So just setting those bits > in the linux PTE without touching the hash table is fine, we'll just > possibly take an extra fault on the next write or access, but that > might not be much slower than going to the hash update the permissions > directly. But if we don't update the hash tables, how will the TLB entry _ever_ say that the page is writable? So we won't take just _one_ extra fault on the next write, we'll _keep_ taking them, since the hash tables will continue to claim that the page is read-only, even if the linux sw page tables say it is writable. So I think the code needs to invalidate the hash after having updated the pte. No? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 22:09 ` Linus Torvalds @ 2004-05-25 22:19 ` Benjamin Herrenschmidt 2004-05-25 22:24 ` Linus Torvalds 0 siblings, 1 reply; 71+ messages in thread From: Benjamin Herrenschmidt @ 2004-05-25 22:19 UTC (permalink / raw) To: Linus Torvalds Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 2004-05-26 at 08:09, Linus Torvalds wrote: > But if we don't update the hash tables, how will the TLB entry _ever_ say > that the page is writable? So we won't take just _one_ extra fault on the > next write, we'll _keep_ taking them, since the hash tables will continue > to claim that the page is read-only, even if the linux sw page tables say > it is writable. > > So I think the code needs to invalidate the hash after having updated the > pte. No? No, we'll take a hash fault, not a page fault. The hash fault is an asm fast path, which in this case, will update the HPTE RW permission when the PTE has PAGE_RW (and will set PAGE_DIRTY again, but that's fine). Ben. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 22:19 ` Benjamin Herrenschmidt @ 2004-05-25 22:24 ` Linus Torvalds 0 siblings, 0 replies; 71+ messages in thread From: Linus Torvalds @ 2004-05-25 22:24 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: David S. Miller, wesolows, willy, Andrea Arcangeli, Andrew Morton, Linux Kernel list, mingo, bcrl, linux-mm, Linux Arch list On Wed, 26 May 2004, Benjamin Herrenschmidt wrote: > > > > So I think the code needs to invalidate the hash after having updated the > > pte. No? > > No, we'll take a hash fault, not a page fault. The hash fault is an asm > fast path, which in this case, will update the HPTE RW permission when > the PTE has PAGE_RW (and will set PAGE_DIRTY again, but that's fine). Ahh. Goodie. Then the "simple" atomic bitset probably works. Assuming I translated the asm/atomic.h stuff correctly. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 14:48 ` Linus Torvalds 2004-05-25 15:35 ` Keith M Wesolowski @ 2004-05-25 21:27 ` Andrea Arcangeli 2004-05-25 21:43 ` Linus Torvalds 2004-05-25 21:44 ` Andrea Arcangeli 1 sibling, 2 replies; 71+ messages in thread From: Andrea Arcangeli @ 2004-05-25 21:27 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Tue, May 25, 2004 at 07:48:24AM -0700, Linus Torvalds wrote: > We'd just pass in a flag to "ptep_establish()" to tell it whether we > changed the dirty bit or not. It would be "write_access" in > handle_pte_fault(), and 1 in the other two cases. I want to show you another approch. This effectively make the end of handle_pte_fault a noop for x86 and x86-64 and it implements two operations ptep_handle_[young|dirty]_page_fault(ptep) that allows architectures with a page-faulting dirty and/or young bits to resolve the page fault inside their own architecture (since the alpha architecture generates a page fault for both, there's not even need of atomic instructions anywhere). The only brainer case is an architecture generating a page fault for the young bit and not for the dirty bit. In such case they would need to use something like this: entry = ptep_get_and_clear(pte); set_pte(pte, pte_modify(entry, newprot)); Again no atomic instructions. Though with an atomic instruction like a set_mask64(__DIRTY_BITS, &ptep->pte) such an architecture may run faster than the above, but the above is guaranteed to work as far as mprotect works too. I believe using ptep_establish in handle_mm_fault makes little sense, to make the most obvious example there's no need of flushing the tlb in handle_mm_fault to resolve young or dirty page faults. Not a big deal for x86 and x86-64 that reaches that path only if a race happens, but on alpha we shouldn't flush the tlb. If some weird architecture really need to flush the tlb they still can inside ptep_handle_[young|dirty]_page_fault. As for set_pte, there's quite a lot of legitimate stuff using it it on present ptes, even not dirty ones (see zap_pte_* too). The last issue is ptep_establish, we're flushing the pte in do_wp_page inside ptep_establish again for no good reason. Those suprious tlb flushes may even trigger IPIs (this time in x86 smp too even with processes), so I'd really like to remove the explicit flush in do_wp_page, however this will likely break s390 but I don't understand s390 so I'll leave it broken for now (at least to show you this alternative and to hear comments if it's as broken as the previous one). Really my basic point for going this direction is that it makes little sense to me to use a tlb-flushing ptep_establish in both handle_mm_fault and do_wp_page, where no tlb flush is required in x86, x86-64 and alpha, so I do see a window for improving performance. Yeah, I'm assuming all the cpus are smart enough to automatically re-read the pte from memory if the information they have in the tlb asks for a page-fault. I also added a trap for non-dirty pte in ptep_establish but it's quite superflous since there's only 1 ptep_establish caller left. patch works for me on x86 and it should work on alpha too this time ;). The really scary thing about this patch is the s390 ptep_establish. Index: linux-2.5/include/asm-alpha/pgtable.h =================================================================== RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-alpha/pgtable.h,v retrieving revision 1.22 diff -u -p -r1.22 pgtable.h --- linux-2.5/include/asm-alpha/pgtable.h 23 May 2004 05:02:25 -0000 1.22 +++ linux-2.5/include/asm-alpha/pgtable.h 25 May 2004 20:34:11 -0000 @@ -267,6 +267,28 @@ extern inline pte_t pte_mkexec(pte_t pte extern inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |= __DIRTY_BITS; return pte; } extern inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= __ACCESS_BITS; return pte; } +static inline void ptep_handle_young_page_fault(pte_t *ptep) +{ + /* + * WARNING: this is safe only because the dirty bit + * cannot change trasparently in hardware in the pte. + * If the dirty bit in the pte would be set trasparently + * by the CPU this should be implemented with set_bit() + * or with a new set_mask64() implementing an atomic "or". + */ + set_pte(ptep, pte_mkyoung(*ptep)); +} + +static inline void ptep_handle_dirty_page_fault(pte_t *ptep) +{ + /* + * This can run without atomic instructions even if + * the young bit is set in hardware by the architecture. + * Losing the young bit is not important. + */ + set_pte(ptep, pte_mkdirty(*ptep)); +} + #define PAGE_DIR_OFFSET(tsk,address) pgd_offset((tsk),(address)) /* to find an entry in a kernel page-table-directory */ Index: linux-2.5/include/asm-generic/pgtable.h =================================================================== RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-generic/pgtable.h,v retrieving revision 1.4 diff -u -p -r1.4 pgtable.h --- linux-2.5/include/asm-generic/pgtable.h 19 Jan 2004 18:43:03 -0000 1.4 +++ linux-2.5/include/asm-generic/pgtable.h 25 May 2004 20:38:39 -0000 @@ -12,6 +12,7 @@ */ #define ptep_establish(__vma, __address, __ptep, __entry) \ do { \ + BUG_ON(!pte_dirty(__entry)); \ set_pte(__ptep, __entry); \ flush_tlb_page(__vma, __address); \ } while (0) Index: linux-2.5/include/asm-i386/pgtable.h =================================================================== RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-i386/pgtable.h,v retrieving revision 1.42 diff -u -p -r1.42 pgtable.h --- linux-2.5/include/asm-i386/pgtable.h 23 May 2004 05:02:25 -0000 1.42 +++ linux-2.5/include/asm-i386/pgtable.h 25 May 2004 20:27:00 -0000 @@ -220,7 +220,19 @@ static inline pte_t pte_mkdirty(pte_t pt static inline pte_t pte_mkyoung(pte_t pte) { (pte).pte_low |= _PAGE_ACCESSED; return pte; } static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte_low |= _PAGE_RW; return pte; } +/* + * This is used only to set the dirty bit during a "dirty" page fault. + * Nothing to do on x86 since it's set by the hardware transparently + * and it cannot generate page faults. + */ +#define ptep_handle_dirty_page_fault(ptep) do { } while (0) static inline int ptep_test_and_clear_dirty(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); } +/* + * This is used only to set the young bit during a "young" page fault. + * Nothing to do on x86 since it's set by the hardware transparently + * and it cannot generate page faults. + */ +#define ptep_handle_young_page_fault(ptep) do { } while (0) static inline int ptep_test_and_clear_young(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte_low); } static inline void ptep_set_wrprotect(pte_t *ptep) { clear_bit(_PAGE_BIT_RW, &ptep->pte_low); } static inline void ptep_mkdirty(pte_t *ptep) { set_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); } Index: linux-2.5/include/asm-x86_64/pgtable.h =================================================================== RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-x86_64/pgtable.h,v retrieving revision 1.27 diff -u -p -r1.27 pgtable.h --- linux-2.5/include/asm-x86_64/pgtable.h 23 May 2004 05:02:25 -0000 1.27 +++ linux-2.5/include/asm-x86_64/pgtable.h 25 May 2004 20:27:28 -0000 @@ -262,7 +262,19 @@ extern inline pte_t pte_mkexec(pte_t pte extern inline pte_t pte_mkdirty(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_DIRTY)); return pte; } extern inline pte_t pte_mkyoung(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_ACCESSED)); return pte; } extern inline pte_t pte_mkwrite(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_RW)); return pte; } +/* + * This is used only to set the dirty bit during a "dirty" page fault. + * Nothing to do on x86-64 since it's set by the hardware transparently + * and it cannot generate page faults. + */ +#define ptep_handle_dirty_page_fault(ptep) do { } while (0) static inline int ptep_test_and_clear_dirty(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_DIRTY, ptep); } +/* + * This is used only to set the young bit during a "young" page fault. + * Nothing to do on x86-64 since it's set by the hardware transparently + * and it cannot generate page faults. + */ +#define ptep_handle_young_page_fault(ptep) do { } while(0) static inline int ptep_test_and_clear_young(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_ACCESSED, ptep); } static inline void ptep_set_wrprotect(pte_t *ptep) { clear_bit(_PAGE_BIT_RW, ptep); } static inline void ptep_mkdirty(pte_t *ptep) { set_bit(_PAGE_BIT_DIRTY, ptep); } Index: linux-2.5/mm/memory.c =================================================================== RCS file: /home/andrea/crypto/cvs/linux-2.5/mm/memory.c,v retrieving revision 1.169 diff -u -p -r1.169 memory.c --- linux-2.5/mm/memory.c 23 May 2004 05:10:11 -0000 1.169 +++ linux-2.5/mm/memory.c 25 May 2004 20:28:23 -0000 @@ -1643,10 +1643,9 @@ static inline int handle_pte_fault(struc if (!pte_write(entry)) return do_wp_page(mm, vma, address, pte, pmd, entry); - entry = pte_mkdirty(entry); + ptep_handle_dirty_page_fault(pte); } - entry = pte_mkyoung(entry); - ptep_establish(vma, address, pte, entry); + ptep_handle_young_page_fault(pte); update_mmu_cache(vma, address, entry); pte_unmap(pte); spin_unlock(&mm->page_table_lock); Signed-off-by: Andrea Arcangeli <andrea@suse.de> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 21:27 ` Andrea Arcangeli @ 2004-05-25 21:43 ` Linus Torvalds 2004-05-25 21:55 ` Andrea Arcangeli 2004-05-25 21:44 ` Andrea Arcangeli 1 sibling, 1 reply; 71+ messages in thread From: Linus Torvalds @ 2004-05-25 21:43 UTC (permalink / raw) To: Andrea Arcangeli Cc: Matthew Wilcox, Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Tue, 25 May 2004, Andrea Arcangeli wrote: > > entry = ptep_get_and_clear(pte); > set_pte(pte, pte_modify(entry, newprot)); > > Again no atomic instructions. Well, actually, that "ptep_get_and_clear()" is actually an atomic instruction. Or at least it had _better_ be. > I believe using ptep_establish in handle_mm_fault makes little sense, > to make the most obvious example there's no need of flushing the tlb in > handle_mm_fault to resolve young or dirty page faults. Not a big deal > for x86 and x86-64 that reaches that path only if a race happens, but on > alpha we shouldn't flush the tlb. If some weird architecture really need > to flush the tlb they still can inside > ptep_handle_[young|dirty]_page_fault. Actually, especially on alpha we _do_ need to flush the TLB. Think about it: the TLB clearly contains the right virt->physical mapping, but the TLB contains the wrong "control bits". If we don't flush the TLB, the TLB will _continue_ to contain the wrong control bits. And as you saw earlier, if those bits are wrong, you get some really nasty behaviour, like infinite page faults. If the TLB still says that the page is non-readable, even though _memory_ says it is readable, it doesn't much matter that we updated the page tables correctly in memory. The CPU will use the TLB. So that TLB flush actually _is_ fundamental. Arguably we could remove it from x86. On the other hand, arguably it doesn't actually matter on x86, so.. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 21:43 ` Linus Torvalds @ 2004-05-25 21:55 ` Andrea Arcangeli 2004-05-25 22:01 ` Linus Torvalds 0 siblings, 1 reply; 71+ messages in thread From: Andrea Arcangeli @ 2004-05-25 21:55 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Tue, May 25, 2004 at 02:43:46PM -0700, Linus Torvalds wrote: > > > On Tue, 25 May 2004, Andrea Arcangeli wrote: > > > > entry = ptep_get_and_clear(pte); > > set_pte(pte, pte_modify(entry, newprot)); > > > > Again no atomic instructions. > > Well, actually, that "ptep_get_and_clear()" is actually an atomic > instruction. Or at least it had _better_ be. sure, I really meant no "new" atomic instruction. > > I believe using ptep_establish in handle_mm_fault makes little sense, > > to make the most obvious example there's no need of flushing the tlb in > > handle_mm_fault to resolve young or dirty page faults. Not a big deal > > for x86 and x86-64 that reaches that path only if a race happens, but on > > alpha we shouldn't flush the tlb. If some weird architecture really need > > to flush the tlb they still can inside > > ptep_handle_[young|dirty]_page_fault. > > Actually, especially on alpha we _do_ need to flush the TLB. > > Think about it: the TLB clearly contains the right virt->physical mapping, > but the TLB contains the wrong "control bits". > > If we don't flush the TLB, the TLB will _continue_ to contain the wrong > control bits. I expected the pal code to re-read the pte if the control bits asked for page fault, like it must happen if the control bits are set to non-present. This latter this must be true or linux wouldn't run at all on alpha. We certainly don't flush the tlb after marking the page from non-present to present, example in do_anonymous_page. Anyways if the pal code behaves like that specifically with the KWE/UWE/KRE/URE and not with the PAGE_VALID bit, I obviously agree have to flush the tlb. I just didn't expect it, though I admit I couldn't easily find specs about it. > And as you saw earlier, if those bits are wrong, you get some really nasty > behaviour, like infinite page faults. If the TLB still says that the page > is non-readable, even though _memory_ says it is readable, it doesn't much > matter that we updated the page tables correctly in memory. The CPU will > use the TLB. > > So that TLB flush actually _is_ fundamental. > > Arguably we could remove it from x86. On the other hand, arguably it > doesn't actually matter on x86, so.. it doesn't matter in handle_mm_fault but it does matter in do_wp_page. since we've to mess with ptep_establish to fix this race, it would be nice to remove such flush_tlb_page from do_wp_page. Or is the x86 tlb also not refilling the pte automatically if the control bits asks for page-fault? In mprotect obviously we've to flush since we can upgrade and downgrade the control bits, but in do_wp_page we only ugprade, so there should be no need of tlb flush. I'll try to find some documentation about this to be sure, at least for x86 it should be easy to find. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 21:55 ` Andrea Arcangeli @ 2004-05-25 22:01 ` Linus Torvalds 2004-05-25 22:18 ` Ivan Kokshaysky 0 siblings, 1 reply; 71+ messages in thread From: Linus Torvalds @ 2004-05-25 22:01 UTC (permalink / raw) To: Andrea Arcangeli Cc: Matthew Wilcox, Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Tue, 25 May 2004, Andrea Arcangeli wrote: > > I expected the pal code to re-read the pte if the control bits asked for > page fault, like it must happen if the control bits are set to > non-present. That may or may not be true. I _think_ it wasn't true. > This latter this must be true or linux wouldn't run at all > on alpha. A "not-present" fault is a totally different fault from a "protection fault". Only the not-present fault ends up walking the page tables, if I remember correctly. The PAL-code sources are out there somewhere, so I guess this should be easy to check if I wasn't so lazy. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 22:01 ` Linus Torvalds @ 2004-05-25 22:18 ` Ivan Kokshaysky 2004-05-25 22:42 ` Andrea Arcangeli 0 siblings, 1 reply; 71+ messages in thread From: Ivan Kokshaysky @ 2004-05-25 22:18 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Matthew Wilcox, Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Tue, May 25, 2004 at 03:01:55PM -0700, Linus Torvalds wrote: > A "not-present" fault is a totally different fault from a "protection > fault". Only the not-present fault ends up walking the page tables, if I > remember correctly. Precisely. The architecture reference manual says: "Additionally, when the software changes any part (except the software field) of a *valid* PTE, it must also execute a tbi instruction." Ivan. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 22:18 ` Ivan Kokshaysky @ 2004-05-25 22:42 ` Andrea Arcangeli 2004-05-26 2:26 ` Linus Torvalds 0 siblings, 1 reply; 71+ messages in thread From: Andrea Arcangeli @ 2004-05-25 22:42 UTC (permalink / raw) To: Ivan Kokshaysky Cc: Linus Torvalds, Matthew Wilcox, Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Wed, May 26, 2004 at 02:18:45AM +0400, Ivan Kokshaysky wrote: > On Tue, May 25, 2004 at 03:01:55PM -0700, Linus Torvalds wrote: > > A "not-present" fault is a totally different fault from a "protection > > fault". Only the not-present fault ends up walking the page tables, if I > > remember correctly. > > Precisely. The architecture reference manual says: > "Additionally, when the software changes any part (except the software > field) of a *valid* PTE, it must also execute a tbi instruction." thanks for checking. after various searching on the x86 docs I found: Whenever a page-directory or page-table entry is changed (including when the present flag is set to zero), the operating-system must immediately invalidate the corresponding entry in the TLB so that it can be updated the next time the entry is referenced. according to the above we'd need to flush the tlb even in do_anonymous_page on x86, or am I reading it wrong? We're not really doing that, is that a bug? I'd be very surprised if we overlooked x86 wasting some time in some page fault loop, I guess it works like the alpha in practice even if the specs tells us we've to flush. anyways to make things work right with my approch I'd need to flush the tlb after the handle_*_page_fault operations (they could return 1 if a flush is required before returning from the page fault) and I should resurrect pte_establish in do_wp_page. but then I certainly agree leaving ptep_establish in handle_mm_fault is fine if we've to flush the tlb anyways, so I'm not going to update my patch unless anybody prefers it for any other reason I don't see. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 22:42 ` Andrea Arcangeli @ 2004-05-26 2:26 ` Linus Torvalds 2004-05-26 7:06 ` Andrea Arcangeli 0 siblings, 1 reply; 71+ messages in thread From: Linus Torvalds @ 2004-05-26 2:26 UTC (permalink / raw) To: Andrea Arcangeli Cc: Ivan Kokshaysky, Matthew Wilcox, Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Wed, 26 May 2004, Andrea Arcangeli wrote: > > after various searching on the x86 docs I found: > > Whenever a page-directory or page-table entry is changed (including when > the present flag is set to zero), the operating-system must immediately > invalidate the corresponding entry in the TLB so that it can be updated > the next time the entry is referenced. > > according to the above we'd need to flush the tlb even in > do_anonymous_page on x86, or am I reading it wrong? You're reading it wrong. The "including when the present flag is set to zero" part does not mean that the present flag was zero _before_, it means "is being set to zero" as in "having been non-zero before that". Anytime the P flag was clear _before_, we don't need to invalidate, because non-present entries are not cached in the TLB. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-26 2:26 ` Linus Torvalds @ 2004-05-26 7:06 ` Andrea Arcangeli 0 siblings, 0 replies; 71+ messages in thread From: Andrea Arcangeli @ 2004-05-26 7:06 UTC (permalink / raw) To: Linus Torvalds Cc: Ivan Kokshaysky, Matthew Wilcox, Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group On Tue, May 25, 2004 at 07:26:21PM -0700, Linus Torvalds wrote: > You're reading it wrong. > > The "including when the present flag is set to zero" part does not mean > that the present flag was zero _before_, it means "is being set to zero" > as in "having been non-zero before that". "having been non-zero before that" makes a lot more sense indeed, the wording in the specs wasn't the best IMHO. Interestingly the ptep_establish at the end of handle_pte_fault would have hidden any double fault completely, nobody but a tracer would have noticed that, but it made very little sense that non-present entries can be cached. It's all clear now thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE 2004-05-25 21:27 ` Andrea Arcangeli 2004-05-25 21:43 ` Linus Torvalds @ 2004-05-25 21:44 ` Andrea Arcangeli 1 sibling, 0 replies; 71+ messages in thread From: Andrea Arcangeli @ 2004-05-25 21:44 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Benjamin Herrenschmidt, Andrew Morton, Linux Kernel list, Ingo Molnar, Ben LaHaise, linux-mm, Architectures Group in the previous email I attached a slightly older version of the patch that didn't optimize the spurious tlb flush away from do_wp_page yet, this is the latest version I tested: Index: linux-2.5/include/asm-alpha/pgtable.h =================================================================== RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-alpha/pgtable.h,v retrieving revision 1.22 diff -u -p -r1.22 pgtable.h --- linux-2.5/include/asm-alpha/pgtable.h 23 May 2004 05:02:25 -0000 1.22 +++ linux-2.5/include/asm-alpha/pgtable.h 25 May 2004 20:34:11 -0000 @@ -267,6 +267,28 @@ extern inline pte_t pte_mkexec(pte_t pte extern inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |= __DIRTY_BITS; return pte; } extern inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= __ACCESS_BITS; return pte; } +static inline void ptep_handle_young_page_fault(pte_t *ptep) +{ + /* + * WARNING: this is safe only because the dirty bit + * cannot change trasparently in hardware in the pte. + * If the dirty bit in the pte would be set trasparently + * by the CPU this should be implemented with set_bit() + * or with a new set_mask64() implementing an atomic "or". + */ + set_pte(ptep, pte_mkyoung(*ptep)); +} + +static inline void ptep_handle_dirty_page_fault(pte_t *ptep) +{ + /* + * This can run without atomic instructions even if + * the young bit is set in hardware by the architecture. + * Losing the young bit is not important. + */ + set_pte(ptep, pte_mkdirty(*ptep)); +} + #define PAGE_DIR_OFFSET(tsk,address) pgd_offset((tsk),(address)) /* to find an entry in a kernel page-table-directory */ Index: linux-2.5/include/asm-generic/pgtable.h =================================================================== RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-generic/pgtable.h,v retrieving revision 1.4 diff -u -p -r1.4 pgtable.h --- linux-2.5/include/asm-generic/pgtable.h 19 Jan 2004 18:43:03 -0000 1.4 +++ linux-2.5/include/asm-generic/pgtable.h 25 May 2004 20:38:39 -0000 @@ -12,6 +12,7 @@ */ #define ptep_establish(__vma, __address, __ptep, __entry) \ do { \ + BUG_ON(!pte_dirty(__entry)); \ set_pte(__ptep, __entry); \ flush_tlb_page(__vma, __address); \ } while (0) Index: linux-2.5/include/asm-i386/pgtable.h =================================================================== RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-i386/pgtable.h,v retrieving revision 1.42 diff -u -p -r1.42 pgtable.h --- linux-2.5/include/asm-i386/pgtable.h 23 May 2004 05:02:25 -0000 1.42 +++ linux-2.5/include/asm-i386/pgtable.h 25 May 2004 20:27:00 -0000 @@ -220,7 +220,19 @@ static inline pte_t pte_mkdirty(pte_t pt static inline pte_t pte_mkyoung(pte_t pte) { (pte).pte_low |= _PAGE_ACCESSED; return pte; } static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte_low |= _PAGE_RW; return pte; } +/* + * This is used only to set the dirty bit during a "dirty" page fault. + * Nothing to do on x86 since it's set by the hardware transparently + * and it cannot generate page faults. + */ +#define ptep_handle_dirty_page_fault(ptep) do { } while (0) static inline int ptep_test_and_clear_dirty(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); } +/* + * This is used only to set the young bit during a "young" page fault. + * Nothing to do on x86 since it's set by the hardware transparently + * and it cannot generate page faults. + */ +#define ptep_handle_young_page_fault(ptep) do { } while (0) static inline int ptep_test_and_clear_young(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte_low); } static inline void ptep_set_wrprotect(pte_t *ptep) { clear_bit(_PAGE_BIT_RW, &ptep->pte_low); } static inline void ptep_mkdirty(pte_t *ptep) { set_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); } Index: linux-2.5/include/asm-x86_64/pgtable.h =================================================================== RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-x86_64/pgtable.h,v retrieving revision 1.27 diff -u -p -r1.27 pgtable.h --- linux-2.5/include/asm-x86_64/pgtable.h 23 May 2004 05:02:25 -0000 1.27 +++ linux-2.5/include/asm-x86_64/pgtable.h 25 May 2004 20:27:28 -0000 @@ -262,7 +262,19 @@ extern inline pte_t pte_mkexec(pte_t pte extern inline pte_t pte_mkdirty(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_DIRTY)); return pte; } extern inline pte_t pte_mkyoung(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_ACCESSED)); return pte; } extern inline pte_t pte_mkwrite(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_RW)); return pte; } +/* + * This is used only to set the dirty bit during a "dirty" page fault. + * Nothing to do on x86-64 since it's set by the hardware transparently + * and it cannot generate page faults. + */ +#define ptep_handle_dirty_page_fault(ptep) do { } while (0) static inline int ptep_test_and_clear_dirty(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_DIRTY, ptep); } +/* + * This is used only to set the young bit during a "young" page fault. + * Nothing to do on x86-64 since it's set by the hardware transparently + * and it cannot generate page faults. + */ +#define ptep_handle_young_page_fault(ptep) do { } while(0) static inline int ptep_test_and_clear_young(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_ACCESSED, ptep); } static inline void ptep_set_wrprotect(pte_t *ptep) { clear_bit(_PAGE_BIT_RW, ptep); } static inline void ptep_mkdirty(pte_t *ptep) { set_bit(_PAGE_BIT_DIRTY, ptep); } Index: linux-2.5/mm/memory.c =================================================================== RCS file: /home/andrea/crypto/cvs/linux-2.5/mm/memory.c,v retrieving revision 1.169 diff -u -p -r1.169 memory.c --- linux-2.5/mm/memory.c 23 May 2004 05:10:11 -0000 1.169 +++ linux-2.5/mm/memory.c 25 May 2004 21:20:11 -0000 @@ -1056,7 +1056,7 @@ static int do_wp_page(struct mm_struct * flush_cache_page(vma, address); entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)), vma); - ptep_establish(vma, address, page_table, entry); + set_pte(page_table, entry); update_mmu_cache(vma, address, entry); pte_unmap(page_table); spin_unlock(&mm->page_table_lock); @@ -1643,10 +1643,9 @@ static inline int handle_pte_fault(struc if (!pte_write(entry)) return do_wp_page(mm, vma, address, pte, pmd, entry); - entry = pte_mkdirty(entry); + ptep_handle_dirty_page_fault(pte); } - entry = pte_mkyoung(entry); - ptep_establish(vma, address, pte, entry); + ptep_handle_young_page_fault(pte); update_mmu_cache(vma, address, entry); pte_unmap(pte); spin_unlock(&mm->page_table_lock); Signed-off-by: Andrea Arcangeli <andrea@suse.de> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE @ 2004-06-01 12:04 Martin Schwidefsky 0 siblings, 0 replies; 71+ messages in thread From: Martin Schwidefsky @ 2004-06-01 12:04 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andrew Morton, Ben LaHaise, Benjamin Herrenschmidt, Architectures Group, Linux Kernel list, linux-mm, Ingo Molnar, Linus Torvalds, Matthew Wilcox > The last issue is ptep_establish, we're flushing the pte in do_wp_page > inside ptep_establish again for no good reason. Those suprious tlb > flushes may even trigger IPIs (this time in x86 smp too even with > processes), so I'd really like to remove the explicit flush in > do_wp_page, however this will likely break s390 but I don't understand > s390 so I'll leave it broken for now (at least to show you this > alternative and to hear comments if it's as broken as the previous one). No, this shouldn't break s390 in any way, removing superfluous tlb flushes will benefit s390 just like any other architecture. > The really scary thing about this patch is the s390 ptep_establish. The s390 version of ptep_establish isn't scary at all, it's just an optimization. s390 can use the generic set_pte & flush_tlb_page sequence for ptep_establish without a problem but there is a better way to do it. We use the ipte instruction because it only flushes the tlb entries for a single page and not all of them. Don't worry too much about breaking s390, if you do I will complain. blue skies, Martin Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH Schonaicherstr. 220, D-71032 Boblingen, Telefon: 49 - (0)7031 - 16-2247 E-Mail: schwidefsky@de.ibm.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a> ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE
@ 2004-06-01 12:10 Martin Schwidefsky
0 siblings, 0 replies; 71+ messages in thread
From: Martin Schwidefsky @ 2004-06-01 12:10 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Andrew Morton, Andrea Arcangeli, bcrl, David S. Miller,
Linux Arch list, Linux Kernel list, linux-mm, mingo,
Linus Torvalds, wesolows, willy
> I did the ppc64 impl, the x86 one (hope I got it right). I still need to
> do ppc32 and I suppose s390 must be fixed now that ptep_estabish is gone
> but I'll leave that to someone who understand something about these
things ;)
At the moment I can't access linux.bkbits.net so I can't test anything but
as far as I can tell s390 should just work as is. ptep_establish is gone
but it has been replaced by correct sequences: ptep_clear_flush & set_pte
and set_pte & flush_tlb_page. The second sequence can be optimized to a
ptep_clear_flush & set_pte if the _PAGE_RO bit has changed. Apart from
that s390 is perfectly fine with the change.
blue skies,
Martin
Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schonaicherstr. 220, D-71032 Boblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>
^ permalink raw reply [flat|nested] 71+ messages in threadend of thread, other threads:[~2004-06-01 12:10 UTC | newest]
Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <1085369393.15315.28.camel@gaston>
[not found] ` <Pine.LNX.4.58.0405232046210.25502@ppc970.osdl.org>
[not found] ` <1085371988.15281.38.camel@gaston>
[not found] ` <Pine.LNX.4.58.0405232134480.25502@ppc970.osdl.org>
[not found] ` <1085373839.14969.42.camel@gaston>
2004-05-24 5:10 ` [PATCH] ppc64: Fix possible race with set_pte on a present PTE Linus Torvalds
2004-05-24 5:34 ` Benjamin Herrenschmidt
2004-05-24 5:38 ` Benjamin Herrenschmidt
2004-05-24 5:52 ` Benjamin Herrenschmidt
2004-05-24 7:39 ` Ingo Molnar
2004-05-24 5:39 ` Benjamin Herrenschmidt
2004-05-25 3:43 ` Andrea Arcangeli
2004-05-25 4:00 ` Linus Torvalds
2004-05-25 4:17 ` Benjamin Herrenschmidt
2004-05-25 4:37 ` Andrea Arcangeli
2004-05-25 4:40 ` Benjamin Herrenschmidt
2004-05-25 4:20 ` Andrea Arcangeli
2004-05-25 4:39 ` Linus Torvalds
2004-05-25 4:44 ` Linus Torvalds
2004-05-25 4:59 ` Andrea Arcangeli
2004-05-25 5:09 ` Andrea Arcangeli
2004-05-25 4:50 ` Andrea Arcangeli
2004-05-25 4:59 ` Linus Torvalds
2004-05-25 4:43 ` David Mosberger
2004-05-25 4:53 ` Andrea Arcangeli
2004-05-27 21:56 ` David Mosberger
2004-05-27 22:00 ` Benjamin Herrenschmidt
2004-05-27 22:12 ` David Mosberger
2004-05-25 11:44 ` Matthew Wilcox
2004-05-25 14:48 ` Linus Torvalds
2004-05-25 15:35 ` Keith M Wesolowski
2004-05-25 16:19 ` Linus Torvalds
2004-05-25 17:25 ` David S. Miller
2004-05-25 17:49 ` Linus Torvalds
2004-05-25 17:54 ` David S. Miller
2004-05-25 18:05 ` Linus Torvalds
2004-05-25 20:30 ` Linus Torvalds
2004-05-25 20:35 ` David S. Miller
2004-05-25 20:49 ` Linus Torvalds
2004-05-25 20:57 ` David S. Miller
2004-05-26 6:20 ` Keith M Wesolowski
2004-05-25 21:40 ` Benjamin Herrenschmidt
2004-05-25 21:54 ` Linus Torvalds
2004-05-25 22:00 ` Linus Torvalds
2004-05-25 22:07 ` Benjamin Herrenschmidt
2004-05-25 22:14 ` Linus Torvalds
2004-05-26 0:21 ` Benjamin Herrenschmidt
2004-05-26 0:50 ` Linus Torvalds
2004-05-26 3:25 ` Benjamin Herrenschmidt
2004-05-26 4:08 ` Linus Torvalds
2004-05-26 4:12 ` Benjamin Herrenschmidt
2004-05-26 4:18 ` Benjamin Herrenschmidt
2004-05-26 4:50 ` Linus Torvalds
2004-05-26 4:49 ` Benjamin Herrenschmidt
2004-05-26 4:28 ` Linus Torvalds
2004-05-26 4:46 ` Benjamin Herrenschmidt
2004-05-26 4:54 ` Linus Torvalds
2004-05-26 4:55 ` Benjamin Herrenschmidt
2004-05-26 5:41 ` Benjamin Herrenschmidt
2004-05-26 5:59 ` [PATCH] (signoff) " Benjamin Herrenschmidt
2004-05-26 6:55 ` Benjamin Herrenschmidt
2004-05-25 22:05 ` [PATCH] " Benjamin Herrenschmidt
2004-05-25 22:09 ` Linus Torvalds
2004-05-25 22:19 ` Benjamin Herrenschmidt
2004-05-25 22:24 ` Linus Torvalds
2004-05-25 21:27 ` Andrea Arcangeli
2004-05-25 21:43 ` Linus Torvalds
2004-05-25 21:55 ` Andrea Arcangeli
2004-05-25 22:01 ` Linus Torvalds
2004-05-25 22:18 ` Ivan Kokshaysky
2004-05-25 22:42 ` Andrea Arcangeli
2004-05-26 2:26 ` Linus Torvalds
2004-05-26 7:06 ` Andrea Arcangeli
2004-05-25 21:44 ` Andrea Arcangeli
2004-06-01 12:04 Martin Schwidefsky
2004-06-01 12:10 Martin Schwidefsky
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox