From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Tue, 25 May 2004 23:27:20 +0200 From: Andrea Arcangeli Subject: Re: [PATCH] ppc64: Fix possible race with set_pte on a present PTE Message-ID: <20040525212720.GG29378@dualathlon.random> References: <1085369393.15315.28.camel@gaston> <1085371988.15281.38.camel@gaston> <1085373839.14969.42.camel@gaston> <20040525034326.GT29378@dualathlon.random> <20040525114437.GC29154@parcelfarce.linux.theplanet.co.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org Return-Path: To: Linus Torvalds Cc: Matthew Wilcox , Benjamin Herrenschmidt , Andrew Morton , Linux Kernel list , Ingo Molnar , Ben LaHaise , linux-mm@kvack.org, Architectures Group List-ID: On Tue, May 25, 2004 at 07:48:24AM -0700, Linus Torvalds wrote: > We'd just pass in a flag to "ptep_establish()" to tell it whether we > changed the dirty bit or not. It would be "write_access" in > handle_pte_fault(), and 1 in the other two cases. I want to show you another approch. This effectively make the end of handle_pte_fault a noop for x86 and x86-64 and it implements two operations ptep_handle_[young|dirty]_page_fault(ptep) that allows architectures with a page-faulting dirty and/or young bits to resolve the page fault inside their own architecture (since the alpha architecture generates a page fault for both, there's not even need of atomic instructions anywhere). The only brainer case is an architecture generating a page fault for the young bit and not for the dirty bit. In such case they would need to use something like this: entry = ptep_get_and_clear(pte); set_pte(pte, pte_modify(entry, newprot)); Again no atomic instructions. Though with an atomic instruction like a set_mask64(__DIRTY_BITS, &ptep->pte) such an architecture may run faster than the above, but the above is guaranteed to work as far as mprotect works too. I believe using ptep_establish in handle_mm_fault makes little sense, to make the most obvious example there's no need of flushing the tlb in handle_mm_fault to resolve young or dirty page faults. Not a big deal for x86 and x86-64 that reaches that path only if a race happens, but on alpha we shouldn't flush the tlb. If some weird architecture really need to flush the tlb they still can inside ptep_handle_[young|dirty]_page_fault. As for set_pte, there's quite a lot of legitimate stuff using it it on present ptes, even not dirty ones (see zap_pte_* too). The last issue is ptep_establish, we're flushing the pte in do_wp_page inside ptep_establish again for no good reason. Those suprious tlb flushes may even trigger IPIs (this time in x86 smp too even with processes), so I'd really like to remove the explicit flush in do_wp_page, however this will likely break s390 but I don't understand s390 so I'll leave it broken for now (at least to show you this alternative and to hear comments if it's as broken as the previous one). Really my basic point for going this direction is that it makes little sense to me to use a tlb-flushing ptep_establish in both handle_mm_fault and do_wp_page, where no tlb flush is required in x86, x86-64 and alpha, so I do see a window for improving performance. Yeah, I'm assuming all the cpus are smart enough to automatically re-read the pte from memory if the information they have in the tlb asks for a page-fault. I also added a trap for non-dirty pte in ptep_establish but it's quite superflous since there's only 1 ptep_establish caller left. patch works for me on x86 and it should work on alpha too this time ;). The really scary thing about this patch is the s390 ptep_establish. Index: linux-2.5/include/asm-alpha/pgtable.h =================================================================== RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-alpha/pgtable.h,v retrieving revision 1.22 diff -u -p -r1.22 pgtable.h --- linux-2.5/include/asm-alpha/pgtable.h 23 May 2004 05:02:25 -0000 1.22 +++ linux-2.5/include/asm-alpha/pgtable.h 25 May 2004 20:34:11 -0000 @@ -267,6 +267,28 @@ extern inline pte_t pte_mkexec(pte_t pte extern inline pte_t pte_mkdirty(pte_t pte) { pte_val(pte) |= __DIRTY_BITS; return pte; } extern inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= __ACCESS_BITS; return pte; } +static inline void ptep_handle_young_page_fault(pte_t *ptep) +{ + /* + * WARNING: this is safe only because the dirty bit + * cannot change trasparently in hardware in the pte. + * If the dirty bit in the pte would be set trasparently + * by the CPU this should be implemented with set_bit() + * or with a new set_mask64() implementing an atomic "or". + */ + set_pte(ptep, pte_mkyoung(*ptep)); +} + +static inline void ptep_handle_dirty_page_fault(pte_t *ptep) +{ + /* + * This can run without atomic instructions even if + * the young bit is set in hardware by the architecture. + * Losing the young bit is not important. + */ + set_pte(ptep, pte_mkdirty(*ptep)); +} + #define PAGE_DIR_OFFSET(tsk,address) pgd_offset((tsk),(address)) /* to find an entry in a kernel page-table-directory */ Index: linux-2.5/include/asm-generic/pgtable.h =================================================================== RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-generic/pgtable.h,v retrieving revision 1.4 diff -u -p -r1.4 pgtable.h --- linux-2.5/include/asm-generic/pgtable.h 19 Jan 2004 18:43:03 -0000 1.4 +++ linux-2.5/include/asm-generic/pgtable.h 25 May 2004 20:38:39 -0000 @@ -12,6 +12,7 @@ */ #define ptep_establish(__vma, __address, __ptep, __entry) \ do { \ + BUG_ON(!pte_dirty(__entry)); \ set_pte(__ptep, __entry); \ flush_tlb_page(__vma, __address); \ } while (0) Index: linux-2.5/include/asm-i386/pgtable.h =================================================================== RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-i386/pgtable.h,v retrieving revision 1.42 diff -u -p -r1.42 pgtable.h --- linux-2.5/include/asm-i386/pgtable.h 23 May 2004 05:02:25 -0000 1.42 +++ linux-2.5/include/asm-i386/pgtable.h 25 May 2004 20:27:00 -0000 @@ -220,7 +220,19 @@ static inline pte_t pte_mkdirty(pte_t pt static inline pte_t pte_mkyoung(pte_t pte) { (pte).pte_low |= _PAGE_ACCESSED; return pte; } static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte_low |= _PAGE_RW; return pte; } +/* + * This is used only to set the dirty bit during a "dirty" page fault. + * Nothing to do on x86 since it's set by the hardware transparently + * and it cannot generate page faults. + */ +#define ptep_handle_dirty_page_fault(ptep) do { } while (0) static inline int ptep_test_and_clear_dirty(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); } +/* + * This is used only to set the young bit during a "young" page fault. + * Nothing to do on x86 since it's set by the hardware transparently + * and it cannot generate page faults. + */ +#define ptep_handle_young_page_fault(ptep) do { } while (0) static inline int ptep_test_and_clear_young(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte_low); } static inline void ptep_set_wrprotect(pte_t *ptep) { clear_bit(_PAGE_BIT_RW, &ptep->pte_low); } static inline void ptep_mkdirty(pte_t *ptep) { set_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); } Index: linux-2.5/include/asm-x86_64/pgtable.h =================================================================== RCS file: /home/andrea/crypto/cvs/linux-2.5/include/asm-x86_64/pgtable.h,v retrieving revision 1.27 diff -u -p -r1.27 pgtable.h --- linux-2.5/include/asm-x86_64/pgtable.h 23 May 2004 05:02:25 -0000 1.27 +++ linux-2.5/include/asm-x86_64/pgtable.h 25 May 2004 20:27:28 -0000 @@ -262,7 +262,19 @@ extern inline pte_t pte_mkexec(pte_t pte extern inline pte_t pte_mkdirty(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_DIRTY)); return pte; } extern inline pte_t pte_mkyoung(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_ACCESSED)); return pte; } extern inline pte_t pte_mkwrite(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_RW)); return pte; } +/* + * This is used only to set the dirty bit during a "dirty" page fault. + * Nothing to do on x86-64 since it's set by the hardware transparently + * and it cannot generate page faults. + */ +#define ptep_handle_dirty_page_fault(ptep) do { } while (0) static inline int ptep_test_and_clear_dirty(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_DIRTY, ptep); } +/* + * This is used only to set the young bit during a "young" page fault. + * Nothing to do on x86-64 since it's set by the hardware transparently + * and it cannot generate page faults. + */ +#define ptep_handle_young_page_fault(ptep) do { } while(0) static inline int ptep_test_and_clear_young(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_ACCESSED, ptep); } static inline void ptep_set_wrprotect(pte_t *ptep) { clear_bit(_PAGE_BIT_RW, ptep); } static inline void ptep_mkdirty(pte_t *ptep) { set_bit(_PAGE_BIT_DIRTY, ptep); } Index: linux-2.5/mm/memory.c =================================================================== RCS file: /home/andrea/crypto/cvs/linux-2.5/mm/memory.c,v retrieving revision 1.169 diff -u -p -r1.169 memory.c --- linux-2.5/mm/memory.c 23 May 2004 05:10:11 -0000 1.169 +++ linux-2.5/mm/memory.c 25 May 2004 20:28:23 -0000 @@ -1643,10 +1643,9 @@ static inline int handle_pte_fault(struc if (!pte_write(entry)) return do_wp_page(mm, vma, address, pte, pmd, entry); - entry = pte_mkdirty(entry); + ptep_handle_dirty_page_fault(pte); } - entry = pte_mkyoung(entry); - ptep_establish(vma, address, pte, entry); + ptep_handle_young_page_fault(pte); update_mmu_cache(vma, address, entry); pte_unmap(pte); spin_unlock(&mm->page_table_lock); Signed-off-by: Andrea Arcangeli -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: aart@kvack.org