* Re: Updated 2.4 TODO List
From: Rik van Riel @ 2000-10-10 20:53 UTC
To: tytso; +Cc: linux-kernel, linux-mm

On Mon, 9 Oct 2000 tytso@mit.edu wrote:

> 2. Capable Of Corrupting Your FS/data
>
>      * Non-atomic page-map operations can cause loss of dirty bit on
>        pages (sct, alan)

Is anybody looking into fixing this bug?

> 9. To Do
>
>      * mm->rss is modified in some places without holding the
>        page_table_lock (sct)

Probably not a show-stopper, but we're looking for volunteers
to fix this one anyway ;)

>      * VM: Out of Memory handling {CRITICAL}

Seems to work now, except for the fact that it is possible
to end up with a heavily thrashing system that /just/ didn't
run out of memory and doesn't get anything killed.

Then again, you can end up with a heavily thrashing system
where you can't get anything done without running out of
swap anyway ... the proper fix for this is probably some
form of thrashing control...

>      * VM: Fix the highmem deadlock, where the swapper cannot create low
>        memory bounce buffers OR swap out low memory because it has
>        consumed all resources {CRITICAL} (old bug, already reported in
>        2.4.0test6)

Haven't been able to reproduce it on my 1GB test machine,
but it might still be there. Can anyone confirm if this
bug is still present?

>      * VM: page->mapping->flush() callback in page_launder() for easier
>        integration with journaling filesystem and maybe the network
>        filesystems

Possibly a 2.5 issue, or something to merge later in 2.4,
since we don't have journaling filesystems in the kernel
anyway. I guess we'll want it for the network filesystems
though.

But this is a fairly simple thing to integrate:
1) have an appropriate function in the filesystems
2) insert function pointer in the right struct
3) call the function from vmscan.c::page_launder()

>      * VM: maybe rebalance the swapper a bit... we do page aging now so
>        maybe refill_inactive_scan() / shm_swap() and swap_out() need to
>        be rebalanced a bit

I'll try to look into this (3 days to go before I have to
leave for Miami) and see how things can be improved here.

> 11. To Check
>
>      * VFS?VM - mmap/write deadlock (demo code seems to show lock is
>        there)

Does anyone have the demo code at hand so we can verify
if this still happens?

>      * Stressing the VM (IOPS SPEC SFS) with HIGHMEM turned on can hang
>        system (linux-2.4.0test5, Ying Chen, Rik van Riel)

Ditto. Can this still be reproduced with the latest VM
or was it simply a side effect of something else in the
VM that got fixed recently?

(the highmem code itself looks ok so the bug might well
have been caused by a side effect of something else)

> 12. Probably Post 2.4
>
>      * address_space needs a VM pressure/flush callback (Ingo)

[duplicate item?]

We may want this to better support the journaling filesystems
in 2.4 .... but I agree that it should probably be post 2.4.0.

>      * VM: physical->virtual reverse mapping, so we can do much better
>        page aging with less CPU usage spikes
>      * VM: better IO clustering for swap (and filesystem) IO
>      * VM: move all the global VM variables, lists, etc. into the pgdat
>        struct for better NUMA scalability
>      * VM: (maybe) some QoS things, as far as they are major improvements
>        with minor intrusion

These 4 seem /definite/ 2.5 issues, though I hope to have
them (except maybe QoS?) ready in a patch before 2.5.0 is
split off.

>      * VM: thrashing control, maybe process suspension with some forced
>        swapping ?
>      * VM: include Ben LaHaise's code, which moves readahead to the VMA
>        level, this way we can do streaming swap IO, complete with
>        drop_behind()

These two are fairly simple and may well be done in the
next few weeks.

If no bug reports about the current 2.4 VM pop up, I'll
probably look into some of the issues above...

FYI, my personal VM TODO list:
- see if refill_inactive_scan(), swapout_shm(), swap_out(),
  etc... need rebalancing
- anti-thrashing code (if no hidden nasties are present)
- better IO clustering + readahead at VMA level

AFAIK Juan Quintela is already looking into the ->flush()
callback for journaling filesystems.

And one more TODO item:

     * pinned page reservation system for journaling filesystems

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
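The three integration steps Rik lists for the ->flush() callback can be sketched in plain C. This is only a userspace illustration under stated assumptions: `struct mapping_ops`, `struct mapping`, `demo_fs_flush()` and `launder_one_page()` are hypothetical stand-ins, not the kernel's `address_space` structures or the real `page_launder()`.

```c
#include <assert.h>
#include <stddef.h>

struct page;

/* Hypothetical stand-in for a per-mapping operations table. */
struct mapping_ops {
	int (*flush)(struct page *page);	/* step 2: pointer in the struct */
};

struct mapping {
	const struct mapping_ops *ops;
};

struct page {
	struct mapping *mapping;
	int dirty;
};

/* Step 1: a function supplied by the (hypothetical) filesystem. */
static int demo_fs_flush(struct page *page)
{
	page->dirty = 0;	/* pretend the data reached stable storage */
	return 0;
}

static const struct mapping_ops demo_fs_ops = { .flush = demo_fs_flush };

/*
 * Step 3: the laundering path calls the callback when one is present.
 * Returns 1 if the callback ran and succeeded, 0 otherwise.
 */
static int launder_one_page(struct page *page)
{
	assert(page != NULL);
	if (!page->dirty)
		return 0;
	if (page->mapping && page->mapping->ops && page->mapping->ops->flush)
		return page->mapping->ops->flush(page) == 0;
	return 0;	/* no callback: leave the page for the default path */
}
```

A page whose mapping provides no callback is left untouched, which is roughly why such a hook can be added without changing behaviour for filesystems that do not use it.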
* 2.4.0test9 vm: disappointing streaming i/o under load
From: Chris Evans @ 2000-10-11 0:06 UTC
To: Rik van Riel; +Cc: linux-kernel, linux-mm

Hi,

Finally got round to checking out 2.4.0test9.

Unfortunately, 2.4.0test9 exhibits poor streaming i/o performance when
under a bit of memory pressure.

The test is this: boot with mem=32M, log onto GNOME and start xmms playing
a big .wav ripped from a CD (this requires 100-200k read i/o per second).

Then, I start and then kill netscape. I then started a find / and started
gnumeric firing up at the same time.

Results
=======

2.2 RH7.0: the music skipped maybe twice briefly during the test.

2.4.0test9: music stuttered repeatedly while netscape started. Worse, when
firing up gnumeric with the find / on the go, there were big pauses in
sound output. One pause was over 5 seconds!!!

So not so hot. Could this perhaps be related to the drop_behind magic
penalizing streaming i/o pages too much? Perhaps the greater age on the
i/o pages means that when there is a little memory pressure, they are
getting thrown out of the page cache before the app (xmms) gets a chance
to use them!

Might it be useful for me to try pre10-1? I note it has more "balancing
fixes".

Cheers
Chris
* Re: 2.4.0test9 vm: disappointing streaming i/o under load
From: Eric Lowe @ 2000-10-11 11:38 UTC
To: Chris Evans; +Cc: linux-mm

Hello,

> Finally got round to checking out 2.4.0test9.
>
> Unfortunately, 2.4.0test9 exhibits poor streaming i/o performance when
> under a bit of memory pressure.
>
> The test is this: boot with mem=32M, log onto GNOME and start xmms playing
> a big .wav ripped from a CD (this requires 100-200k read i/o per second).
>
> Then, I start then kill netscape. I then started a find / and started
> gnumeric firing up at the same time.

Would you try setting /proc/sys/vm/page-cluster to 8 or 16 and let
me know the results? I think one _part_ of the problem is that
when the swapper isn't aggressive enough, it causes too much disk
thrashing which gets in the way of normal I/O... my experience
has been that with modern disks with 512K+ cache you have to
write in 64K clusters to get optimum throughput.

Eric
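As a rough check on the numbers in this message, here is a minimal sketch. It assumes that page-cluster is a base-2 exponent, so that one cluster is 2^page-cluster pages (which is how the 2.4-era sysctl documentation describes it, as far as I know), and it assumes 4 KB pages as on i386. The helper names `cluster_bytes()` and `page_cluster_for()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes covered by one cluster, assuming a cluster is 2^page_cluster pages. */
static size_t cluster_bytes(unsigned int page_cluster, size_t page_size)
{
	assert(page_cluster < 32);	/* keep the shift well defined */
	return ((size_t)1 << page_cluster) * page_size;
}

/* Smallest exponent whose cluster covers at least `bytes`. */
static unsigned int page_cluster_for(size_t bytes, size_t page_size)
{
	unsigned int n = 0;

	while (cluster_bytes(n, page_size) < bytes)
		n++;
	return n;
}
```

Under those assumptions a 64 KB cluster corresponds to a page-cluster value of 4 (`page_cluster_for(65536, 4096)`), while a value of 8 already means 1 MB per cluster. If the sysctl is instead read as a plain page count, the arithmetic differs, so treat this as an interpretation rather than a statement about what Eric intended.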
* Re: 2.4.0test9 vm: disappointing streaming i/o under load
From: Chris Evans @ 2000-10-11 20:59 UTC
To: Eric Lowe; +Cc: linux-mm

On Wed, 11 Oct 2000, Eric Lowe wrote:

> > Unfortunately, 2.4.0test9 exhibits poor streaming i/o performance when
> > under a bit of memory pressure.

[...]

> Would you try setting /proc/sys/vm/page-cluster to 8 or 16 and let
> me know the results? I think one _part_ of the problem is that
> when the swapper isn't aggressive enough, it causes too much disk
> thrashing which gets in the way of normal I/O... my experience
> has been that with modern disks with 512K+ cache you have to
> write in 64K clusters to get optimum throughput.

Raising the cluster size didn't seem to do much apart from generally slow
down interactive response. Lowering it, however, seemed to make playback
less jittery. I guess that's to be expected; faulting in large chunks of
sequential i/o won't help much when under memory pressure because the
pages will get thrown out again before they get a chance to be
used. Especially with drop_behind.

Rik, what do you think?

Cheers
Chris
* Re: 2.4.0test9 vm: disappointing streaming i/o under load
From: Roger Larsson @ 2000-10-11 22:10 UTC
To: Chris Evans; +Cc: linux-mm

Hi,

(you do have DMA enabled...)

I have tested throughput - new kernels are rather good.

I have also tested latency stuff in test9 - I have not
seen anything as bad as your results.
But my audio apps run with high priority...

To be able to determine the cause,
try to renice your audio daemon (and audio clients)
  renice -10 <pid>

Did it become better?

/RogerL

Chris Evans wrote:
>
> On Wed, 11 Oct 2000, Eric Lowe wrote:
>
> > > Unfortunately, 2.4.0test9 exhibits poor streaming i/o performance when
> > > under a bit of memory pressure.
>
> [...]
>
> > Would you try setting /proc/sys/vm/page-cluster to 8 or 16 and let
> > me know the results? I think one _part_ of the problem is that
> > when the swapper isn't aggressive enough, it causes too much disk
> > thrashing which gets in the way of normal I/O... my experience
> > has been that with modern disks with 512K+ cache you have to
> > write in 64K clusters to get optimum throughput.
>
> Raising the cluster size didn't seem to do much apart from generally slow
> down interactive response. Lowering it, however, seemed to make playback
> less jittery. I guess that's to be expected; faulting in large chunks of
> sequential i/o won't help much when under memory pressure because the
> pages will get thrown out again before they get a chance to be
> used. Especially with drop_behind.
>
> Rik what do you think.
>
> Cheers
> Chris

--
Home page:
  http://www.norran.net/nra02596/
* Re: 2.4.0test9 vm: disappointing streaming i/o under load
From: Chris Evans @ 2000-10-11 22:46 UTC
To: Roger Larsson; +Cc: linux-mm

On Thu, 12 Oct 2000, Roger Larsson wrote:

> Hi,
>
> (you do have DMA enabled...)

Oh yes (discovering that in fact my chipset is only UDMA33 in the
process).

> I have tested throughput - new kernels are rather good.

I don't doubt it. I'll try and post some numbers on this later.

> I have also tested latency stuff in test9 - I have not
> seen anything as bad as your results.
> But my audio apps run with high priority...
>
> To be able to determine the cause,
> try to renice your audio daemon (and audio clients)
>   renice -10 <pid>
>
> Did it become better?

Not noticeably :-(

Perhaps I'm just asking too much, booting with mem=32M. No point testing
the new VM with a 128MB desktop, though; it wouldn't break a sweat!

2.2 (RH7.0 kernel) does skip less, though, and the duration of skip is
less.

Perhaps the two kernels have different elevator settings?

Cheers
Chris
* Re: 2.4.0test9 vm: disappointing streaming i/o under load
From: Rik van Riel @ 2000-10-13 16:57 UTC
To: Chris Evans; +Cc: Roger Larsson, linux-mm

On Wed, 11 Oct 2000, Chris Evans wrote:

> Perhaps I'm just asking too much, booting with mem=32M. No point
> testing the new VM with a 128MB desktop, though; it wouldn't
> break a sweat!
>
> 2.2 (RH7.0 kernel) does skip less, though, and the duration of
> skip is less.
>
> Perhaps the two kernels have different elevator settings?

That too. And you just -might- be catching a boundary
condition of the drop-behind code (if the audio isn't
kept mapped by any of the processes, but is left to sit
in a file which is write()n to by one process and is
read() by the other).

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
       -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/
* Re: Updated 2.4 TODO List
From: tytso @ 2000-10-11 18:38 UTC
To: riel; +Cc: linux-kernel, linux-mm

   > 2. Capable Of Corrupting Your FS/data
   >
   >      * Non-atomic page-map operations can cause loss of dirty bit on
   >        pages (sct, alan)

   Is anybody looking into fixing this bug ?

According to sct (who's sitting next to me in my hotel room at ALS) Ben
LaHaise has a bugfix for this, but it hasn't been merged.

   >      * VM: Fix the highmem deadlock, where the swapper cannot create low
   >        memory bounce buffers OR swap out low memory because it has
   >        consumed all resources {CRITICAL} (old bug, already reported in
   >        2.4.0test6)

   Haven't been able to reproduce it on my 1GB test machine,
   but it might still be there. Can anyone confirm if this
   bug is still present ?

Note: all of the issues on the TODO list with the "VM:" prefix are from a
VM todo list you posted a week or two ago; so I'm assuming that you know
more about those issues than I do..... (feel free to send me an updated
list and I'll merge it into the 2.4 TODO list.)

						- Ted
* [RFC] atomic pte updates for x86 smp
From: Ben LaHaise @ 2000-10-11 23:52 UTC
To: torvalds, tytso; +Cc: linux-kernel, linux-mm

On Wed, 11 Oct 2000 tytso@mit.edu wrote:

> > 2. Capable Of Corrupting Your FS/data
> >
> >      * Non-atomic page-map operations can cause loss of dirty bit on
> >        pages (sct, alan)
>
>    Is anybody looking into fixing this bug ?
>
> According to sct (who's sitting next to me in my hotel room at ALS) Ben
> LaHaise has a bugfix for this, but it hasn't been merged.

Here's an updated version of the patch that doesn't do the funky RISC-like
dirty bit updates. It doesn't incur the additional overhead of page
faults on dirty, which actually happens a lot on SHM attaches
(during Oracle runs this is quite noticeable due to their use of
hundreds of MB of SHM). Ted: Note that there are a couple of other SMP
races that still need fixing: list them under VM threading bug under SMP
(different bug).

-ben # v2.4.0-test10-1-smp_pte_fix.diff diff -ur v2.4.0-test10-pre1/include/asm-i386/pgtable-2level.h work-v2.4.0-test10-pre1/include/asm-i386/pgtable-2level.h --- v2.4.0-test10-pre1/include/asm-i386/pgtable-2level.h Fri Dec 3 14:12:23 1999 +++ work-v2.4.0-test10-pre1/include/asm-i386/pgtable-2level.h Wed Oct 11 16:08:08 2000 @@ -55,4 +55,7 @@ return (pmd_t *) dir; } +#define __HAVE_ARCH_pte_xchg_clear +#define pte_xchg_clear(xp) __pte(xchg(&(xp)->pte, 0)) + #endif /* _I386_PGTABLE_2LEVEL_H */ diff -ur v2.4.0-test10-pre1/include/asm-i386/pgtable-3level.h work-v2.4.0-test10-pre1/include/asm-i386/pgtable-3level.h --- v2.4.0-test10-pre1/include/asm-i386/pgtable-3level.h Mon Dec 6 19:19:13 1999 +++ work-v2.4.0-test10-pre1/include/asm-i386/pgtable-3level.h Wed Oct 11 16:14:40 2000 @@ -76,4 +76,17 @@ #define pmd_offset(dir, address) ((pmd_t *) pgd_page(*(dir)) + \ __pmd_offset(address)) +#define __HAVE_ARCH_pte_xchg_clear +extern inline pte_t pte_xchg_clear(pte_t *ptep) +{ + long long res = pte_val(*ptep); +__asm__ __volatile__ ( + "1: cmpxchg8b (%1); + jnz 1b" + : "=A" (res) + :"D"(ptep), "0" (res), "b"(0), "c"(0) + : "memory"); + return (pte_t){ res }; +} + #endif /* _I386_PGTABLE_3LEVEL_H */ diff -ur v2.4.0-test10-pre1/include/asm-i386/pgtable.h work-v2.4.0-test10-pre1/include/asm-i386/pgtable.h --- v2.4.0-test10-pre1/include/asm-i386/pgtable.h Mon Oct 2 14:06:43 2000 +++ work-v2.4.0-test10-pre1/include/asm-i386/pgtable.h Wed Oct 11 17:44:04 2000 @@ -17,6 +17,10 @@ #include <asm/fixmap.h> #include <linux/threads.h> +#ifndef _I386_BITOPS_H +#include <asm/bitops.h> +#endif + extern pgd_t swapper_pg_dir[1024]; extern void paging_init(void); @@ -145,6 +149,16 @@ * the page directory entry points directly to a 4MB-aligned block of * memory. 
*/ +#define _PAGE_BIT_PRESENT 0 +#define _PAGE_BIT_RW 1 +#define _PAGE_BIT_USER 2 +#define _PAGE_BIT_PWT 3 +#define _PAGE_BIT_PCD 4 +#define _PAGE_BIT_ACCESSED 5 +#define _PAGE_BIT_DIRTY 6 +#define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page, Pentium+, if present.. */ +#define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */ + #define _PAGE_PRESENT 0x001 #define _PAGE_RW 0x002 #define _PAGE_USER 0x004 @@ -234,6 +248,24 @@ #define pte_none(x) (!pte_val(x)) #define pte_present(x) (pte_val(x) & (_PAGE_PRESENT | _PAGE_PROTNONE)) #define pte_clear(xp) do { set_pte(xp, __pte(0)); } while (0) + +#define __HAVE_ARCH_pte_test_and_clear_dirty +static inline int pte_test_and_clear_dirty(pte_t *page_table, pte_t pte) +{ + return test_and_clear_bit(_PAGE_BIT_DIRTY, page_table); +} + +#define __HAVE_ARCH_pte_test_and_clear_young +static inline int pte_test_and_clear_young(pte_t *page_table, pte_t pte) +{ + return test_and_clear_bit(_PAGE_BIT_ACCESSED, page_table); +} + +#define __HAVE_ARCH_atomic_pte_wrprotect +static inline void atomic_pte_wrprotect(pte_t *page_table, pte_t old_pte) +{ + clear_bit(_PAGE_BIT_RW, page_table); +} #define pmd_none(x) (!pmd_val(x)) #define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT) diff -ur v2.4.0-test10-pre1/include/linux/mm.h work-v2.4.0-test10-pre1/include/linux/mm.h --- v2.4.0-test10-pre1/include/linux/mm.h Tue Oct 3 13:40:38 2000 +++ work-v2.4.0-test10-pre1/include/linux/mm.h Wed Oct 11 17:44:38 2000 @@ -532,6 +532,42 @@ #define vmlist_modify_lock(mm) vmlist_access_lock(mm) #define vmlist_modify_unlock(mm) vmlist_access_unlock(mm) +#ifndef __HAVE_ARCH_pte_test_and_clear_young +static inline int pte_test_and_clear_young(pte_t *page_table, pte_t pte) +{ + if (!pte_young(pte)) + return 0; + set_pte(page_table, pte_mkold(pte)); + return 1; +} +#endif + +#ifndef __HAVE_ARCH_pte_test_and_clear_dirty +static inline int pte_test_and_clear_dirty(pte_t *page_table, pte_t pte) +{ + if (!pte_dirty(pte)) + return 0; + set_pte(page_table, pte_mkclean(pte)); + 
return 1; +} +#endif + +#ifndef __HAVE_ARCH_pte_xchg_clear +static pte_t pte_xchg_clear(pte_t *page_table) +{ + pte_t pte = *page_table; + pte_clear(page_table); + return pte; +} +#endif + +#ifndef __HAVE_ARCH_atomic_pte_wrprotect +static inline void atomic_pte_wrprotect(pte_t *page_table, pte_t old_pte) +{ + set_pte(page_table, pte_wrprotect(old_pte)); +} +#endif + #endif /* __KERNEL__ */ #endif diff -ur v2.4.0-test10-pre1/mm/filemap.c work-v2.4.0-test10-pre1/mm/filemap.c --- v2.4.0-test10-pre1/mm/filemap.c Tue Oct 3 13:40:38 2000 +++ work-v2.4.0-test10-pre1/mm/filemap.c Wed Oct 11 18:26:35 2000 @@ -1475,39 +1475,47 @@ return retval; } +/* Called with mm->page_table_lock held to protect against other + * threads/the swapper from ripping pte's out from under us. + */ static inline int filemap_sync_pte(pte_t * ptep, struct vm_area_struct *vma, unsigned long address, unsigned int flags) { unsigned long pgoff; - pte_t pte = *ptep; + pte_t pte; struct page *page; int error; + pte = *ptep; + if (!(flags & MS_INVALIDATE)) { if (!pte_present(pte)) - return 0; - if (!pte_dirty(pte)) - return 0; + goto out; + if (!pte_test_and_clear_dirty(ptep, pte)) + goto out; flush_page_to_ram(pte_page(pte)); flush_cache_page(vma, address); - set_pte(ptep, pte_mkclean(pte)); flush_tlb_page(vma, address); page = pte_page(pte); page_cache_get(page); } else { if (pte_none(pte)) - return 0; + goto out; flush_cache_page(vma, address); - pte_clear(ptep); + + pte = pte_xchg_clear(ptep); flush_tlb_page(vma, address); + if (!pte_present(pte)) { + spin_unlock(&vma->vm_mm->page_table_lock); swap_free(pte_to_swp_entry(pte)); - return 0; + spin_lock(&vma->vm_mm->page_table_lock); + goto out; } page = pte_page(pte); if (!pte_dirty(pte) || flags == MS_INVALIDATE) { page_cache_free(page); - return 0; + goto out; } } pgoff = (address - vma->vm_start) >> PAGE_CACHE_SHIFT; @@ -1516,11 +1524,18 @@ printk("weirdness: pgoff=%lu index=%lu address=%lu vm_start=%lu vm_pgoff=%lu\n", pgoff, page->index, address, 
vma->vm_start, vma->vm_pgoff); } + + spin_unlock(&vma->vm_mm->page_table_lock); lock_page(page); error = filemap_write_page(vma->vm_file, page, 1); UnlockPage(page); page_cache_free(page); + + spin_lock(&vma->vm_mm->page_table_lock); return error; + +out: + return 0; } static inline int filemap_sync_pte_range(pmd_t * pmd, @@ -1590,6 +1605,11 @@ unsigned long end = address + size; int error = 0; + /* Aquire the lock early; it may be possible to avoid dropping + * and reaquiring it repeatedly. + */ + spin_lock(&vma->vm_mm->page_table_lock); + dir = pgd_offset(vma->vm_mm, address); flush_cache_range(vma->vm_mm, end - size, end); if (address >= end) @@ -1600,6 +1620,9 @@ dir++; } while (address && (address < end)); flush_tlb_range(vma->vm_mm, end - size, end); + + spin_unlock(&vma->vm_mm->page_table_lock); + return error; } diff -ur v2.4.0-test10-pre1/mm/highmem.c work-v2.4.0-test10-pre1/mm/highmem.c --- v2.4.0-test10-pre1/mm/highmem.c Tue Oct 10 16:57:31 2000 +++ work-v2.4.0-test10-pre1/mm/highmem.c Tue Oct 10 18:13:44 2000 @@ -130,10 +130,10 @@ if (pkmap_count[i] != 1) continue; pkmap_count[i] = 0; - pte = pkmap_page_table[i]; + //pte = pkmap_page_table[i]; pte_clear(pkmap_page_table+i); + pte = pte_xchg_clear(pkmap_page_table+i); if (pte_none(pte)) BUG(); - pte_clear(pkmap_page_table+i); page = pte_page(pte); page->virtual = NULL; } diff -ur v2.4.0-test10-pre1/mm/memory.c work-v2.4.0-test10-pre1/mm/memory.c --- v2.4.0-test10-pre1/mm/memory.c Tue Oct 3 13:40:38 2000 +++ work-v2.4.0-test10-pre1/mm/memory.c Wed Oct 11 18:30:17 2000 @@ -215,30 +215,30 @@ /* copy_one_pte */ if (pte_none(pte)) - goto cont_copy_pte_range; + goto cont_copy_pte_range_noset; if (!pte_present(pte)) { swap_duplicate(pte_to_swp_entry(pte)); - set_pte(dst_pte, pte); goto cont_copy_pte_range; } ptepage = pte_page(pte); if ((!VALID_PAGE(ptepage)) || - PageReserved(ptepage)) { - set_pte(dst_pte, pte); + PageReserved(ptepage)) goto cont_copy_pte_range; - } + /* If it's a COW mapping, write protect it 
both in the parent and the child */ if (cow) { - pte = pte_wrprotect(pte); - set_pte(src_pte, pte); + atomic_pte_wrprotect(src_pte, pte); + pte = *src_pte; } + /* If it's a shared mapping, mark it clean in the child */ if (vma->vm_flags & VM_SHARED) pte = pte_mkclean(pte); - set_pte(dst_pte, pte_mkold(pte)); + pte = pte_mkold(pte); get_page(ptepage); - -cont_copy_pte_range: address += PAGE_SIZE; + +cont_copy_pte_range: set_pte(dst_pte, pte); +cont_copy_pte_range_noset: address += PAGE_SIZE; if (address >= end) goto out; src_pte++; @@ -306,10 +306,9 @@ pte_t page; if (!size) break; - page = *pte; + page = pte_xchg_clear(pte); pte++; size--; - pte_clear(pte-1); if (pte_none(page)) continue; freed += free_pte(page); @@ -712,8 +711,8 @@ end = PMD_SIZE; do { struct page *page; - pte_t oldpage = *pte; - pte_clear(pte); + pte_t oldpage; + oldpage = pte_xchg_clear(pte); page = virt_to_page(__va(phys_addr)); if ((!VALID_PAGE(page)) || PageReserved(page)) @@ -746,6 +745,7 @@ return 0; } +/* Note: this is only safe if the mm semaphore is held when called. */ int remap_page_range(unsigned long from, unsigned long phys_addr, unsigned long size, pgprot_t prot) { int error = 0; diff -ur v2.4.0-test10-pre1/mm/mremap.c work-v2.4.0-test10-pre1/mm/mremap.c --- v2.4.0-test10-pre1/mm/mremap.c Tue Oct 3 13:40:38 2000 +++ work-v2.4.0-test10-pre1/mm/mremap.c Wed Oct 11 02:38:41 2000 @@ -63,14 +63,14 @@ pte_t pte; spin_lock(&mm->page_table_lock); - pte = *src; + pte = pte_xchg_clear(src); if (!pte_none(pte)) { - error++; - if (dst) { - pte_clear(src); - set_pte(dst, pte); - error--; + if (!dst) { + /* No dest? We must put it back. 
*/ + dst = src; + error++; } + set_pte(dst, pte); } spin_unlock(&mm->page_table_lock); return error; diff -ur v2.4.0-test10-pre1/mm/vmalloc.c work-v2.4.0-test10-pre1/mm/vmalloc.c --- v2.4.0-test10-pre1/mm/vmalloc.c Tue Oct 3 13:40:38 2000 +++ work-v2.4.0-test10-pre1/mm/vmalloc.c Wed Oct 11 16:38:21 2000 @@ -34,14 +34,15 @@ if (end > PMD_SIZE) end = PMD_SIZE; do { - pte_t page = *pte; - pte_clear(pte); + pte_t page; + page = pte_xchg_clear(pte); address += PAGE_SIZE; pte++; if (pte_none(page)) continue; if (pte_present(page)) { struct page *ptpage = pte_page(page); + /* FIXME: i am an ugly little race condition */ if (VALID_PAGE(ptpage) && (!PageReserved(ptpage))) __free_page(ptpage); continue; diff -ur v2.4.0-test10-pre1/mm/vmscan.c work-v2.4.0-test10-pre1/mm/vmscan.c --- v2.4.0-test10-pre1/mm/vmscan.c Tue Oct 10 16:57:31 2000 +++ work-v2.4.0-test10-pre1/mm/vmscan.c Wed Oct 11 18:17:17 2000 @@ -55,8 +55,7 @@ onlist = PageActive(page); /* Don't look at this pte if it's been accessed recently. */ - if (pte_young(pte)) { - set_pte(page_table, pte_mkold(pte)); + if (pte_test_and_clear_young(page_table, pte)) { if (onlist) { /* * Transfer the "accessed" bit from the page @@ -99,6 +98,10 @@ if (PageSwapCache(page)) { entry.val = page->index; swap_duplicate(entry); + if (pte_dirty(pte)) + BUG(); + if (pte_write(pte)) + BUG(); set_pte(page_table, swp_entry_to_pte(entry)); drop_pte: UnlockPage(page); @@ -109,6 +112,13 @@ goto out_failed; } + /* From this point on, the odds are that we're going to + * nuke this pte, so read and clear the pte. This hook + * is needed on CPUs which update the accessed and dirty + * bits in hardware. + */ + pte = pte_xchg_clear(page_table); + /* * Is it a clean page? Then it must be recoverable * by just paging it in again, and we can just drop @@ -124,7 +134,6 @@ */ if (!pte_dirty(pte)) { flush_cache_page(vma, address); - pte_clear(page_table); goto drop_pte; } @@ -134,7 +143,7 @@ * locks etc. 
*/ if (!(gfp_mask & __GFP_IO)) - goto out_unlock; + goto out_unlock_restore; /* * Don't do any of the expensive stuff if @@ -143,7 +152,7 @@ if (page->zone->free_pages + page->zone->inactive_clean_pages + page->zone->inactive_dirty_pages > page->zone->pages_high + inactive_target) - goto out_unlock; + goto out_unlock_restore; /* * Ok, it's really dirty. That means that @@ -169,7 +178,7 @@ int error; struct file *file = vma->vm_file; if (file) get_file(file); - pte_clear(page_table); + mm->rss--; flush_tlb_page(vma, address); vmlist_access_unlock(mm); @@ -191,10 +200,12 @@ */ entry = get_swap_page(); if (!entry.val) - goto out_unlock; /* No swap space left */ + goto out_unlock_restore; /* No swap space left */ - if (!(page = prepare_highmem_swapout(page))) + if (!(page = prepare_highmem_swapout(page))) { + set_pte(page_table, pte); goto out_swap_free; + } swap_duplicate(entry); /* One for the process, one for the swap cache */ @@ -218,7 +229,8 @@ swap_free(entry); out_failed: return 0; -out_unlock: +out_unlock_restore: + set_pte(page_table, pte); UnlockPage(page); return 0; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 20+ messages in thread
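The difference between the generic fallbacks in the patch (read a snapshot, then `set_pte()` a modified copy) and the i386 bit-operation versions can be illustrated in userspace with C11 atomics. This is a sketch under stated assumptions, not the kernel code: `demo_pte_t` and the `DEMO_*` names are hypothetical, a 32-bit word stands in for a 2-level pte, and the bit positions are copied from the `_PAGE_BIT_*` definitions in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Bit positions follow _PAGE_BIT_PRESENT/ACCESSED/DIRTY in the patch. */
#define DEMO_PAGE_PRESENT	(UINT32_C(1) << 0)
#define DEMO_PAGE_ACCESSED	(UINT32_C(1) << 5)
#define DEMO_PAGE_DIRTY		(UINT32_C(1) << 6)

typedef _Atomic uint32_t demo_pte_t;	/* hypothetical userspace "pte" */

/*
 * Non-atomic pattern: modify a previously read snapshot and store it
 * back. Any bit set in the entry after the snapshot was taken (for
 * example a dirty bit set by another CPU's MMU) is overwritten.
 */
static int racy_test_and_clear_young(demo_pte_t *ptep, uint32_t snapshot)
{
	if (!(snapshot & DEMO_PAGE_ACCESSED))
		return 0;
	atomic_store(ptep, snapshot & ~DEMO_PAGE_ACCESSED);
	return 1;
}

/*
 * Atomic pattern: clear only the accessed bit with one read-modify-write,
 * leaving every other bit (including a concurrently set dirty bit) alone.
 */
static int atomic_test_and_clear_young(demo_pte_t *ptep)
{
	uint32_t old = atomic_fetch_and(ptep, ~DEMO_PAGE_ACCESSED);

	return (old & DEMO_PAGE_ACCESSED) != 0;
}
```

If a dirty bit is set between taking the snapshot and calling `racy_test_and_clear_young()`, the stored value no longer contains it, which is the "loss of dirty bit" from the TODO list. The atomic variant preserves it because it never writes back stale bits.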
* Re: [RFC] atomic pte updates for x86 smp
From: Linus Torvalds @ 2000-10-12 0:09 UTC
To: Ben LaHaise; +Cc: tytso, linux-kernel, linux-mm

On Wed, 11 Oct 2000, Ben LaHaise wrote:
>
> Here's an updated version of the patch that doesn't do the funky RISC like
> dirty bit updates. It doesn't incur the additional overhead of page
> faults on dirty, which actually happens a lot on SHM attaches
> (during Oracle runs this is quite noticeable due to their use of
> hundreds of MB of SHM).

I much preferred the dirty fault version.

What does "quite noticeable" mean? Does it mean that you can see page
faults (no big deal), or does it mean that you can actually measure the
performance degradation objectively?

Also, this version doesn't seem to fix the bug.

> diff -ur v2.4.0-test10-pre1/mm/vmscan.c work-v2.4.0-test10-pre1/mm/vmscan.c
> --- v2.4.0-test10-pre1/mm/vmscan.c Tue Oct 10 16:57:31 2000
> +++ work-v2.4.0-test10-pre1/mm/vmscan.c Wed Oct 11 18:17:17 2000
> @@ -134,7 +143,7 @@
>           * locks etc.
>           */
>          if (!(gfp_mask & __GFP_IO))
> -                goto out_unlock;
> +                goto out_unlock_restore;
>
>          /*
>           * Don't do any of the expensive stuff if
> @@ -143,7 +152,7 @@
>          if (page->zone->free_pages + page->zone->inactive_clean_pages
>                                  + page->zone->inactive_dirty_pages >
>                page->zone->pages_high + inactive_target)
> -                goto out_unlock;
> +                goto out_unlock_restore;
>
>          /*
>           * Ok, it's really dirty. That means that

Both of the above paths can cause the dirty bit to be dropped again, as
far as I can see.

In fact, you seem to have _added_ those drops in this patch. What's up?

I'm not going to apply a patch that I don't see will even fix the problem
at this point.

I _will_ apply the "exception on dirty" version, if you remove the SMP
special case (ie you do it unconditionally). At least that one I believe
really fixes the problem.

		Linus
* Re: [RFC] atomic pte updates for x86 smp 2000-10-12 0:09 ` Linus Torvalds @ 2000-10-12 4:03 ` Benjamin C.R. LaHaise 2000-10-12 4:06 ` David S. Miller 2000-10-12 6:42 ` Linus Torvalds 0 siblings, 2 replies; 20+ messages in thread From: Benjamin C.R. LaHaise @ 2000-10-12 4:03 UTC (permalink / raw) To: Linus Torvalds; +Cc: tytso, linux-kernel, linux-mm Hello Linus, On Wed, 11 Oct 2000, Linus Torvalds wrote: > I much prefered the dirty fault version. > What does "quite noticeable" mean? Does it mean that you can see page > faults (no big deal), or does it mean that you can actually measure the > performance degradation objectively? It's a factor of 4 difference in execution time on the filemap rewrite test on a 1GB file (including all those cache misses that should have dwarfed the page fault handler). Moving the writable test and mkdirty early on in the page fault handler made no measurable difference in execution time; the bulk of the overhead appears to be in handling the page fault itself. > Also, this version doesn't seem to fix the bug. ... > Both of the above paths can cause the dirty bit to be dropped again, as > far as I can see. Note the fragment above those portions of the patch where the pte_xchg_clear is done on the page table: this results in a page fault for any other cpu that looks at the pte while it is unavailable. > In fact, you seem to have _added_ those drops in this patch. What's up? It's safe because of how x86s hardware works when it encounters the cleared pte. According to one of the manuals I've got here (the old 386 book is the only one that states it outright, sigh), the access and dirty bits are updated with a locked memory cycle only if the entry is marked present. If you want test code demonstrating that x86 does a reread of the pte on a dirty fault, I'll gladly share it. > I'm not going to apply a patch that I don't see will even fix the problem > at this point. 
> > I _will_ apply the "exception on dirty" version, if you remove the SMP > special case (ie you do it unconditionally). At least that one I believe > really fixes the problem. I'd rather not lose the use of a hardware feature that makes a difference during the most important time: when the system is under heavy load and the page table scanner is active. If there's a way the atomic updates can be cleaned up acceptably, then I want to do so. Cheers, -ben -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 20+ messages in thread
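[Editorial illustration, not part of the original message.] The argument above can be made concrete with a small single-threaded user-space model. It is only a sketch under stated assumptions: the function names (hw_write, racy_wrprotect, xchg_clear_wrprotect) are hypothetical and are not taken from the patch, the "other CPU" is simulated by a plain function call at the point where the race window would be, and the only hardware behaviour modelled is the one claimed in the message, namely that the dirty bit is set only while the entry is marked present.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define PTE_PRESENT 0x001u   /* x86 P bit */
#define PTE_RW      0x002u   /* x86 R/W bit */
#define PTE_DIRTY   0x040u   /* x86 D bit */

/*
 * Hypothetical model of another CPU writing through this pte.
 * Returns 1 if the write completes (setting the dirty bit), or 0 if
 * the entry is not present or not writable, i.e. the CPU would fault.
 */
static int hw_write(_Atomic uint32_t *pte)
{
    uint32_t v = atomic_load(pte);

    if (!(v & PTE_PRESENT) || !(v & PTE_RW))
        return 0;
    atomic_fetch_or(pte, PTE_DIRTY);
    return 1;
}

/*
 * Racy write-protect: read the pte, let the other CPU run in the
 * window, then store back a value computed from the stale read.
 */
static void racy_wrprotect(_Atomic uint32_t *pte, int *other_wrote)
{
    uint32_t old = atomic_load(pte);

    *other_wrote = hw_write(pte);          /* race window */
    atomic_store(pte, old & ~PTE_RW);      /* overwrites the dirty bit */
}

/*
 * Exchange-and-clear write-protect: the entry is atomically replaced
 * by zero first, so the other CPU cannot set the dirty bit in the
 * window; it would take a page fault and retry later instead.
 */
static void xchg_clear_wrprotect(_Atomic uint32_t *pte, int *other_wrote)
{
    uint32_t old = atomic_exchange(pte, 0);

    *other_wrote = hw_write(pte);          /* faults: entry not present */
    atomic_store(pte, old & ~PTE_RW);
}
```

In this model a dirty bit counts as lost when the other CPU's write completed but the final pte is clean. The racy variant produces exactly that state, while the exchange-and-clear variant turns the concurrent write into a fault, so nothing is lost.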
* Re: [RFC] atomic pte updates for x86 smp 2000-10-12 4:03 ` Benjamin C.R. LaHaise @ 2000-10-12 4:06 ` David S. Miller 2000-10-12 4:31 ` Cort Dougan 2000-10-12 4:37 ` Benjamin C.R. LaHaise 2000-10-12 6:42 ` Linus Torvalds 1 sibling, 2 replies; 20+ messages in thread From: David S. Miller @ 2000-10-12 4:06 UTC (permalink / raw) To: blah; +Cc: torvalds, tytso, linux-kernel, linux-mm > It's safe because of how x86s hardware works What about other platforms? Later, David S. Miller davem@redhat.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] atomic pte updates for x86 smp 2000-10-12 4:06 ` David S. Miller @ 2000-10-12 4:31 ` Cort Dougan 2000-10-12 4:37 ` Benjamin C.R. LaHaise 1 sibling, 0 replies; 20+ messages in thread From: Cort Dougan @ 2000-10-12 4:31 UTC (permalink / raw) To: David S. Miller; +Cc: blah, torvalds, tytso, linux-kernel, linux-mm } Date: Thu, 12 Oct 2000 00:03:31 -0400 (EDT) } From: "Benjamin C.R. LaHaise" <blah@kvack.org> } } It's safe because of how x86s hardware works } } What about other platforms? On the PPCs that don't do a hardware walk we do a normal write to the hash table (with a spinlock). On the hardware walk PPCs I'm told this is done with a lwarx/stwcx pair (conditional load/store on exclusive access). Any comments on how this would affect PPC? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] atomic pte updates for x86 smp 2000-10-12 4:06 ` David S. Miller 2000-10-12 4:31 ` Cort Dougan @ 2000-10-12 4:37 ` Benjamin C.R. LaHaise 1 sibling, 0 replies; 20+ messages in thread From: Benjamin C.R. LaHaise @ 2000-10-12 4:37 UTC (permalink / raw) To: David S. Miller; +Cc: torvalds, tytso, linux-kernel, linux-mm On Wed, 11 Oct 2000, David S. Miller wrote: > It's safe because of how x86s hardware works > > What about other platforms? If atomic ops don't work, then software dirty bits are still an option (read as: it shouldn't break RISC CPUs). -ben -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 20+ messages in thread
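[Editorial illustration, not part of the original message.] "Software dirty bits" here refers to the approach where a clean page is mapped without write permission, so the first write traps and the fault handler records the page as dirty. The following user-space model is a minimal sketch of that idea only. All names are hypothetical, PTE_SW_DIRTY is an assumed software-defined bit chosen for illustration, and no real architecture's fault path is reproduced.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT  0x001u
#define PTE_RW       0x002u
#define PTE_SW_DIRTY 0x200u  /* assumption: a bit the hardware ignores */

enum fault_result { FAULT_FIXED, FAULT_NOT_HANDLED };

/* Clean pages lose write permission so that the next write faults. */
static uint32_t model_mk_clean(uint32_t pte)
{
    return pte & ~(PTE_RW | PTE_SW_DIRTY);
}

/*
 * Hypothetical write-fault handler: if the mapping itself allows
 * writes, mark the page dirty in software and grant write access.
 * Anything else is left to other fault paths (not modelled here).
 */
static enum fault_result model_write_fault(uint32_t *pte, int vma_writable)
{
    if (!(*pte & PTE_PRESENT) || !vma_writable)
        return FAULT_NOT_HANDLED;

    *pte |= PTE_RW | PTE_SW_DIRTY;
    return FAULT_FIXED;
}
```

Because the dirty bit is only ever set by the kernel's own fault handler, no hardware update can race with the page table scanner; the cost is one extra fault per clean page that gets written, which is the overhead discussed earlier in the thread.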
* Re: [RFC] atomic pte updates for x86 smp 2000-10-12 4:03 ` Benjamin C.R. LaHaise 2000-10-12 4:06 ` David S. Miller @ 2000-10-12 6:42 ` Linus Torvalds 2000-10-12 8:13 ` Ingo Molnar 2000-10-12 15:10 ` Benjamin C.R. LaHaise 1 sibling, 2 replies; 20+ messages in thread From: Linus Torvalds @ 2000-10-12 6:42 UTC (permalink / raw) To: Benjamin C.R. LaHaise; +Cc: tytso, linux-kernel, linux-mm, MOLNAR Ingo On Thu, 12 Oct 2000, Benjamin C.R. LaHaise wrote: > > Note the fragment above those portions of the patch where the > pte_xchg_clear is done on the page table: this results in a page fault > for any other cpu that looks at the pte while it is unavailable. Ok, I see.. Hmm.. That's a singularly ugly interface, though - it all looks very x86-specific. Things like "pte_xchg_clear()" look just a bit too obviously like the name only makes sense due to the x86 implementation. So I'd like to change the naming to be more about the design and less about the implementation.. (It also doesn't make sense to me that you call the "clear the write bit" thing "atomic_pte_wrprotect()", but you call the "clear the dirty bit" "pte_test_and_clear_dirty()" - why not the same naming scheme for the two things?). I also have this suspicion that if this was done right, we should be able to clean up the 64-bit atomic stuff for the x86 PAE case - which does a cmpxchg8b right now on PAE entries exactly because of atomicity reasons. With your patch as it stands now, we'd end up basically always doing two of them. And looking at the patch I get this nagging feeling that if this was really done right, we could get rid of that PAE special case for set_pte(), because the issue with atomic updates on PAE really boils down to pretty much the same thing as the issue of one atomic bit. (Instead of doing an atomic 64-bit memory write, we would be doing the atomic "pte_xchg_clear()" followed by two _non_atomic 32-bit writes where the second write would set the present bit. 
Although maybe the erratum about the PAE pgd entry not honoring the P bit correctly makes this be unworkable). Ingo? I'd really like you to take a long look at this patch for sanity, especially wrt PAE. After this patch, are there any cases where we do a "set_pte()" where the PTE wasn't clear before? That might be a good sanity-test to add, just to make sure. And I'd really like to speed up the PAE set_pte() - as far as I can tell both set_pte and set_pmd really should be safe without the atomic 64-bit crap with your changes. Why do I care? Basically, I'd be a lot happier about this patch if it also solves another problem - if the "lost dirty bits" patch automagically also solves the "64-bit atomic PTE" issue for the PAE case, then I will just feel a lot happier about the fact that the solution is not just a specific hack for handling "dirty", but a real change that makes conceptual sense for two unrelated problems. Because this, as always, is my final test for a "GoodDesign(tm)" patch: if it solves just one problem it's a bug-fix, but if it solves two problems it is the "RightThing(tm)" to do. And bug-fixes are a dime a dozen. Good design is something to be admired. What do you say, Ben? Do you think your approach really would solve the PAE atomicity issue too, or am I just expecting too much? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 20+ messages in thread
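[Editorial illustration, not part of the original message.] The PAE idea described above (an atomic clear of the word holding the present bit, followed by two ordinary 32-bit writes with the present bit written last) can be sketched as a user-space model. This is not kernel code: struct pae_pte and the model_* functions are hypothetical names, C11 atomics stand in for the x86 instructions, and the assert in model_set_pte() is the "PTE must be clear before set_pte()" sanity test suggested in the message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define PTE_PRESENT 0x001u   /* x86 P bit, lives in the low word */
#define PTE_DIRTY   0x040u   /* x86 D bit, lives in the low word */

/* Hypothetical model of a 64-bit PAE pte as two 32-bit halves. */
struct pae_pte {
    _Atomic uint32_t low;    /* flags, including the present bit */
    _Atomic uint32_t high;   /* upper physical address bits */
};

/*
 * Atomically take the low word (and with it the present bit). Once
 * the entry is not present, the high word can be read and cleared
 * with ordinary accesses.
 */
static uint64_t model_pte_get_and_clear(struct pae_pte *pte)
{
    uint32_t low = atomic_exchange(&pte->low, 0);
    uint32_t high = atomic_load_explicit(&pte->high, memory_order_relaxed);

    atomic_store_explicit(&pte->high, 0, memory_order_relaxed);
    return ((uint64_t)high << 32) | low;
}

/*
 * Install a new entry with two 32-bit writes. The high word goes
 * first; the low word, which carries the present bit, goes last, so
 * a present entry is never seen with a stale high word. The release
 * store models the ordering that x86 provides for plain stores.
 */
static void model_set_pte(struct pae_pte *pte, uint64_t val)
{
    assert(atomic_load(&pte->low) == 0);   /* sanity: pte was clear */

    atomic_store_explicit(&pte->high, (uint32_t)(val >> 32),
                          memory_order_relaxed);
    atomic_store_explicit(&pte->low, (uint32_t)val, memory_order_release);
}
```

The design point is that only the word containing the present bit ever needs an atomic read-modify-write; everything else happens while the entry is invisible to the hardware walker, which is why a 64-bit atomic operation would no longer be required for ptes in this scheme.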
* Re: [RFC] atomic pte updates for x86 smp 2000-10-12 6:42 ` Linus Torvalds @ 2000-10-12 8:13 ` Ingo Molnar 2000-10-12 8:56 ` David S. Miller 2000-10-12 15:10 ` Benjamin C.R. LaHaise 1 sibling, 1 reply; 20+ messages in thread From: Ingo Molnar @ 2000-10-12 8:13 UTC (permalink / raw) To: Linus Torvalds Cc: Benjamin C.R. LaHaise, Theodore Y. Ts'o, linux-kernel, MM mailing list On Wed, 11 Oct 2000, Linus Torvalds wrote: > (Instead of doing an atomic 64-bit memory write, we would be doing the > atomic "pte_xchg_clear()" followed by two _non_atomic 32-bit writes where > the second write would set the present bit. Although maybe the erratum > about the PAE pgd entry not honoring the P bit correctly makes this be > unworkable). > > Ingo? I'd really like you to take a long look at this patch for sanity, > especially wrt PAE. the PAE pgd 'anomaly' should not affect this case, because we never clear either user-space pgds or user-space pmds in PAE mode. Unless we start swapping pagetables i don't think this will ever happen in the future. The PAE anomaly only affects the four top-level pgds, so even if we started swapping pagetables, we'll never have to swap the pgds themselves. i completely agree with the need to clean the pte-setting atomicity interface up. And getting rid of cmpxchg8b will be a definite performance (and GCC-optimization) improvement. > After this patch, are there any cases where we do a "set_pte()" where > the PTE wasn't clear before? That might be a good sanity-test to add, > just to make sure. And I'd really like to speed up the PAE set_pte() - > as far as I can tell both set_pte and set_pmd really should be safe > without the atomic 64-bit crap with your changes. yep, the two 32-bit writes idea is very nice - this should be safe - and there isn't even any need for any barriers (except optimization barrier), given that writes are strongly ordered on x86. 
my gut feeling is that all these things will only benefit PAE support, and the risk of those changes is low, none of those should bite us in the future, design-wise. And it's also a nice speedup. And after this we could finally get rid of the 'unsigned long long' as well and just define two 32-bit fields in pte. Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] atomic pte updates for x86 smp 2000-10-12 8:13 ` Ingo Molnar @ 2000-10-12 8:56 ` David S. Miller 2000-10-12 10:05 ` Ingo Molnar 0 siblings, 1 reply; 20+ messages in thread From: David S. Miller @ 2000-10-12 8:56 UTC (permalink / raw) To: mingo; +Cc: torvalds, blah, tytso, linux-kernel, linux-mm the PAE pgd 'anomaly' should not affect this case, because we never clear neither user-space pgds, nor user-space pmds in PAE mode Eh? munmap() --> clear_page_tables() --> free_one_pgd() --> pgd_clear Later, David S. Miller davem@redhat.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] atomic pte updates for x86 smp 2000-10-12 8:56 ` David S. Miller @ 2000-10-12 10:05 ` Ingo Molnar 2000-10-12 11:10 ` Ingo Molnar 0 siblings, 1 reply; 20+ messages in thread From: Ingo Molnar @ 2000-10-12 10:05 UTC (permalink / raw) To: David S. Miller Cc: Linus Torvalds, blah, Theodore Y. Ts'o, linux-kernel, MM mailing list On Thu, 12 Oct 2000, David S. Miller wrote: > clear neither user-space pgds, nor user-space pmds in PAE mode > > Eh? > > munmap() --> clear_page_tables() --> free_one_pgd() --> pgd_clear you are right, i was focused too much on the swapping case. I don't think munmap() is a problem in the PAE case. pgd_clear() should stay a 64-bit operation (like in Ben's patch) because we could get a legitimate TLB flush between two 32-bit writes. (the 4 pgd entries are otherwise cached in the CPU core, only TLB flushes reload them.) Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] atomic pte updates for x86 smp 2000-10-12 10:05 ` Ingo Molnar @ 2000-10-12 11:10 ` Ingo Molnar 0 siblings, 0 replies; 20+ messages in thread From: Ingo Molnar @ 2000-10-12 11:10 UTC (permalink / raw) To: David S. Miller Cc: Linus Torvalds, blah, Theodore Y. Ts'o, linux-kernel, MM mailing list On Thu, 12 Oct 2000, Ingo Molnar wrote: > [...] pgd_clear() should stay a 64-bit operation [...] even this isn't strictly necessary - pgds and pmds are allocated in 'low memory', and thus a simple 32-bit write to the lower 32 bits of the pgd entry is enough to clear a PAE pgd. But it still must be a special case due to the pgd present-bit restriction. Ingo -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC] atomic pte updates for x86 smp 2000-10-12 6:42 ` Linus Torvalds 2000-10-12 8:13 ` Ingo Molnar @ 2000-10-12 15:10 ` Benjamin C.R. LaHaise 1 sibling, 0 replies; 20+ messages in thread From: Benjamin C.R. LaHaise @ 2000-10-12 15:10 UTC (permalink / raw) To: Linus Torvalds; +Cc: tytso, linux-kernel, linux-mm, MOLNAR Ingo On Wed, 11 Oct 2000, Linus Torvalds wrote: > > On Thu, 12 Oct 2000, Benjamin C.R. LaHaise wrote: > > > > Note the fragment above those portions of the patch where the > > pte_xchg_clear is done on the page table: this results in a page fault > > for any other cpu that looks at the pte while it is unavailable. > > Ok, I see.. > > Hmm.. That's a singularly ugly interface, though - it all looks very > x86-specific. Things like "pte_xchg_clear()" look just a bit too obviously > like the name only makes sense due to the x86 implementation. So I'd like > to change the naming to be more about the design and less about the > implementation.. How about pte_get_and_clear? > (It also doesn't make sense to me that you call the "clear the write bit" > thing "atomic_pte_wrprotect()", but you call the "clear the dirty bit" > "pte_test_and_clear_dirty()" - why not the same naming scheme for the two > things?). *nod* > I also have this suspicion that if this was done right, we should be able > to clean up the 64-bit atomic stuff for the x86 PAE case - which does a > cmpxchg8b right now on PAE entries exactly because of atomicity reasons. > > With your patch as it stands now, we'd end up basically always doing two > of them. > > And looking at the patch I get this nagging feeling that if this was > really done right, we could get rid of that PAE special case for > set_pte(), because the issue with atomic updates on PAE really boils down > to pretty much the same thing as the issue of one atomic bit. 
> (Instead of doing an atomic 64-bit memory write, we would be doing the > atomic "pte_xchg_clear()" followed by two _non_atomic 32-bit writes where > the second write would set the present bit. Although maybe the erratum > about the PAE pgd entry not honoring the P bit correctly makes this be > unworkable). As Ingo pointed out, this is only a problem for the pgd; we're safe so long as atomic operations are used on the present bit for pte's. I think we can completely eliminate the cmpxchg8b for ptes by using xchg on the low byte containing the P bit and non atomic ops on the high byte. This should be much better! ... > What do you say, Ben? Do you think your approach really would solve the > PAE atomicity issue too, or am I just expecting too much? These are good ideas. I'll go back and rework the patch for PAE stuff and see what kind of results turn out. -ben -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2000-10-13 16:57 UTC | newest]
Thread overview: 20+ messages
[not found] <200010090419.e994JQT09775@trampoline.thunk.org>
2000-10-10 20:53 ` Updated 2.4 TODO List Rik van Riel
2000-10-11 0:06 ` 2.4.0test9 vm: disappointing streaming i/o under load Chris Evans
2000-10-11 11:38 ` Eric Lowe
2000-10-11 20:59 ` Chris Evans
2000-10-11 22:10 ` Roger Larsson
2000-10-11 22:46 ` Chris Evans
2000-10-13 16:57 ` Rik van Riel
2000-10-11 18:38 ` Updated 2.4 TODO List tytso
2000-10-11 23:52 ` [RFC] atomic pte updates for x86 smp Ben LaHaise
2000-10-12 0:09 ` Linus Torvalds
2000-10-12 4:03 ` Benjamin C.R. LaHaise
2000-10-12 4:06 ` David S. Miller
2000-10-12 4:31 ` Cort Dougan
2000-10-12 4:37 ` Benjamin C.R. LaHaise
2000-10-12 6:42 ` Linus Torvalds
2000-10-12 8:13 ` Ingo Molnar
2000-10-12 8:56 ` David S. Miller
2000-10-12 10:05 ` Ingo Molnar
2000-10-12 11:10 ` Ingo Molnar
2000-10-12 15:10 ` Benjamin C.R. LaHaise