* 2.5.52-mm2
@ 2002-12-19  5:53 Andrew Morton
  2002-12-19  8:54 ` 2.5.52-mm2 William Lee Irwin III
  2002-12-19  9:41 ` 2.5.52-mm2 Andrew Morton
  0 siblings, 2 replies; 11+ messages in thread
From: Andrew Morton @ 2002-12-19 5:53 UTC (permalink / raw)
To: lkml, linux-mm

url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.52/2.5.52-mm2/

. Big reorganisation of shared pagetable code.  It is a cleanup, and there
  should be no functional changes.  The diff is considerably easier to read
  now.

  In this patchset, shared pagetables are configurable again, and the
  default is "off".  This is because the intent is that pagetable sharing
  always be enabled (on ia32 at least).  But we want it to work when it is
  disabled too.  So in this -mm, pagetable sharing is disabled.  Henceforth
  it will be enabled.  Make sense?

. Added Bill Irwin's patches, to get them some additional testing.

. The per-cpu kmalloc infrastructure.

. Another update of the patch management scripts is at
  http://www.zip.com.au/~akpm/linux/patches/patch-scripts-0.9/
  No great changes here; various fixes and tweaks.

Changes since 2.5.52-mm1:

+shpte-reorg.patch
  The shared pagetable patch reorganisation.

+shpte-reorg-fixes.patch
  Make it work with CONFIG_SHAREPTE=n

-lockless-current_kernel_time.patch
  Dropped for now, because it is ia32-only and it is time to get some
  non-ia32 testing done.

+block-allocator-doc.patch
  Some commentary.

+ext2-rename-vars.patch
  Make ext2_new_block() understandable

+remove-memshared.patch
  Remove /proc/meminfo:MemShared

+bin2bcd.patch
  Code consolidation/cleanup

+log_buf_size.patch
  Configurable printk buffer size.

+semtimedop-update.patch
  Wire up semtimedop() for 32-bit ia32 apps on ia64.
+nfs-kmap_atomic.patch
  Use kmap_atomic in NFS

+ext3-bh-dirty-race.patch
  Fix a rare BUG in ext3

+unalign-radix-tree-nodes.patch
  Space saving for radix_tree_nodes

+htlb-0.patch
+htlb-1.patch
+htlb-2.patch
+htlb-3.patch
  hugetlbpage updates

+kmalloc_percpu.patch
  per-cpu kmalloc infrastructure

+kmalloc_percpu-rtcache.patch
+kmalloc_percpu-mibs-1.patch
+kmalloc_percpu-mibs-2.patch
+kmalloc_percpu-mibs-3.patch
  Applications thereof

+wli-01_numaq_io.patch
+wli-02_do_sak.patch
+wli-03_proc_super.patch
+wli-04_cap_set_pg.patch
+wli-06_uml_get_task.patch
+wli-07_numaq_mem_map.patch
+wli-08_numaq_pgdat.patch
+wli-09_has_stopped_jobs.patch
+wli-10_inode_wait.patch
+wli-11_pgd_ctor.patch
+wli-12_pidhash_size.patch
+wli-13_rmap_nrpte.patch
  Bill's stuff.

All 78 patches:

linus.patch
  cset-1.883.3.60-to-1.900.txt.gz

kgdb.patch

sync_fs-deadlock-fix.patch
  sync_fs deadlock fix

shrink_list-dirty-page-race.patch
  fix a page dirtying race in vmscan.c

slab-poisoning.patch
  more informative slab poisoning

nommu-generic_file_readonly_mmap.patch
  Add generic_file_readonly_mmap() for nommu

dio-return-partial-result.patch

aio-direct-io-infrastructure.patch
  AIO support for raw/O_DIRECT

deferred-bio-dirtying.patch
  bio dirtying infrastructure

aio-direct-io.patch
  AIO support for raw/O_DIRECT

aio-dio-debug.patch

dio-reduce-context-switch-rate.patch
  Reduced wakeup rate in direct-io code

cputimes_stat.patch
  Restore per-cpu time accounting, with a config option

reduce-random-context-switch-rate.patch
  Reduce context switch rate due to the random driver

inlines-net.patch

rbtree-iosched.patch
  rbtree-based IO scheduler

deadsched-fix.patch
  deadline scheduler fix

quota-smp-locks.patch
  Subject: [PATCH] Quota SMP locks

shpte-ng.patch
  pagetable sharing for ia32

shpte-nonlinear.patch
  shpte: support nonlinear mappings and clean up clear_share_range()

shpte-reorg.patch

shpte-reorg-fixes.patch
  shared pagetable reorg fixes

shpte-always-on.patch
  Force CONFIG_SHAREPTE=y for ia32

ptrace-flush.patch
  Subject: [PATCH] ptrace on 2.5.44

buffer-debug.patch
  buffer.c debugging

misc.patch
  misc fixes

warn-null-wakeup.patch

pentium-II.patch
  Pentium-II support bits

rcu-stats.patch
  RCU statistics reporting

auto-unplug.patch
  self-unplugging request queues

less-unplugging.patch
  Remove most of the blk_run_queues() calls

ext3-fsync-speedup.patch
  Clean up ext3_sync_file()

remove-PF_NOWARN.patch
  Remove PF_NOWARN

scheduler-tunables.patch
  scheduler tunables

blocking-kswapd.patch
  Give kswapd writeback higher priority than pdflush

block-allocator-doc.patch
  ext2/3 commentary and cleanup

spread-find_group_other.patch
  ext2/3: better starting group for S_ISREG files

ext3-alloc-spread.patch
  ext3: smarter block allocation startup

ext2-alloc-spread.patch
  ext2: smarter block allocation startup

ext2-rename-vars.patch
  rename locals in ext2_new_block()

ext3-use-after-free.patch
  ext3 use-after-free bugfix

dio-always-kmalloc.patch
  direct-io: dynamically allocate struct dio

file-nr-doc-fix.patch
  Docs: fix explanation of file-nr

set_page_dirty_lock.patch
  fix set_page_dirty vs truncate&free races

remove-memshared.patch
  Remove /proc/meminfo:MemShared

bin2bcd.patch
  BIN_TO_BCD consolidation

log_buf_size.patch
  move LOG_BUF_SIZE to header/config

semtimedop-update.patch
  Enable semtimedop for ia64 32-bit emulation.
nfs-kmap_atomic.patch
  use kmap_atomic instead of kmap in NFS client

ext3-bh-dirty-race.patch
  ext3: fix buffer dirtying

unalign-radix-tree-nodes.patch
  don't cacheline-align radix_tree_nodes

htlb-0.patch
  hugetlb bugfixes

htlb-1.patch
  hugetlb: report shared memory attachment counts

htlb-2.patch
  hugetlb: fix MAP_FIXED handling

htlb-3.patch
  hugetlbfs: set inode->i_size

kmalloc_percpu.patch
  kmalloc_percpu -- stripped down version

kmalloc_percpu-rtcache.patch
  Make rt_cache_stat use kmalloc_percpu

kmalloc_percpu-mibs-1.patch
  Change Networking mibs to use kmalloc_percpu -- 1/3

kmalloc_percpu-mibs-2.patch
  Change Networking mibs to use kmalloc_percpu -- 2/3

kmalloc_percpu-mibs-3.patch
  Change Networking mibs to use kmalloc_percpu -- 3/3

wli-01_numaq_io.patch
  (undescribed patch)

wli-02_do_sak.patch
  (undescribed patch)

wli-03_proc_super.patch
  (undescribed patch)

wli-04_cap_set_pg.patch
  (undescribed patch)

wli-06_uml_get_task.patch
  (undescribed patch)

wli-07_numaq_mem_map.patch
  (undescribed patch)

wli-08_numaq_pgdat.patch
  (undescribed patch)

wli-09_has_stopped_jobs.patch
  (undescribed patch)

wli-10_inode_wait.patch
  (undescribed patch)

wli-11_pgd_ctor.patch
  (undescribed patch)

wli-12_pidhash_size.patch
  (undescribed patch)

wli-13_rmap_nrpte.patch
  (undescribed patch)

dcache_rcu-2.patch
dcache_rcu-2-2.5.51.patch
dcache_rcu-3.patch
dcache_rcu-3-2.5.51.patch

page-walk-api.patch
page-walk-scsi.patch

page-walk-api-update.patch
  pagewalk API update

gup-check-valid.patch
  valid page test in get_user_pages()

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: 2.5.52-mm2
  2002-12-19  5:53 2.5.52-mm2 Andrew Morton
@ 2002-12-19  8:54 ` William Lee Irwin III
  2002-12-19  9:28   ` 2.5.52-mm2 William Lee Irwin III
  2002-12-19  9:41 ` 2.5.52-mm2 Andrew Morton
  1 sibling, 1 reply; 11+ messages in thread
From: William Lee Irwin III @ 2002-12-19 8:54 UTC (permalink / raw)
To: Andrew Morton; +Cc: lkml, linux-mm

On Wed, Dec 18, 2002 at 09:53:18PM -0800, Andrew Morton wrote:
> url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.52/2.5.52-mm2/

Kernel compile on ramfs, shpte off, overcommit on (probably more like a
stress test for shpte):

c0116810 187174    0.528821  pfn_to_nid
c01168b8 192912    0.545032  x86_profile_hook
c0163c0c 267362    0.755375  d_lookup
c014e530 286920    0.810632  get_empty_filp
c01b0b18 287959    0.813567  __copy_from_user
c01320cc 307597    0.86905   find_get_page
c011a6b0 331219    0.935789  scheduler_tick
c0140890 342427    0.967455  vm_enough_memory
c013fc60 353705    0.999319  handle_mm_fault
c014eb49 358710    1.01346   .text.lock.file_table
c011a1f8 379521    1.07226   load_balance
c01b0ab0 840162    2.3737    __copy_to_user
c013f5a0 1040429   2.93951   do_anonymous_page
c01358c8 1056576   2.98513   __get_page_state
c014406c 1260931   3.56249   page_remove_rmap
c0143e68 1265355   3.57499   page_add_rmap
c0106f38 21236731  59.9999   poll_idle

shpte on will follow.

Bill
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: 2.5.52-mm2
  2002-12-19  8:54 ` 2.5.52-mm2 William Lee Irwin III
@ 2002-12-19  9:28   ` William Lee Irwin III
  2002-12-19 10:12     ` 2.5.52-mm2 William Lee Irwin III
  0 siblings, 1 reply; 11+ messages in thread
From: William Lee Irwin III @ 2002-12-19 9:28 UTC (permalink / raw)
To: Andrew Morton, lkml, linux-mm

On Wed, Dec 18, 2002 at 09:53:18PM -0800, Andrew Morton wrote:
>> url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.52/2.5.52-mm2/

On Thu, Dec 19, 2002 at 12:54:26AM -0800, William Lee Irwin III wrote:
> Kernel compile on ramfs, shpte off, overcommit on (probably more like a
> stress test for shpte):

With shpte on:

c0135790 123025    0.33807   nr_free_pages
c015021c 134621    0.369936  __fput
c0119974 144602    0.397364  kmap_atomic
c01b23d0 146899    0.403676  __copy_user_intel
c014c050 159002    0.436935  pte_unshare
c0122c28 171802    0.472109  current_kernel_time
c01fbd9c 172897    0.475118  sync_buffer
c0116870 187066    0.514054  pfn_to_nid
c0115390 199497    0.548214  smp_apic_timer_interrupt
c013f820 236468    0.64981   do_no_page
c0116918 260314    0.715338  x86_profile_hook
c016566c 274151    0.753362  d_lookup
c01b2578 280625    0.771152  __copy_from_user
c013212c 323752    0.889665  find_get_page
c0140800 338288    0.929609  vm_enough_memory
c013fbd0 355384    0.976589  handle_mm_fault
c014ff90 358839    0.986083  get_empty_filp
c011a710 363642    0.999282  scheduler_tick
c011a258 507320    1.39411   load_balance
c01505a9 664873    1.82706   .text.lock.file_table
c01b2510 834467    2.2931    __copy_to_user
c0135928 1051119   2.88846   __get_page_state
c013f400 1062731   2.92037   do_anonymous_page
c0143fa0 1273498   3.49955   page_add_rmap
c014419c 1386659   3.81051   page_remove_rmap
c0106f38 21099307  57.9805   poll_idle

Bill
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: 2.5.52-mm2
  2002-12-19  9:28 ` 2.5.52-mm2 William Lee Irwin III
@ 2002-12-19 10:12   ` William Lee Irwin III
  2002-12-19 10:31     ` 2.5.52-mm2 Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: William Lee Irwin III @ 2002-12-19 10:12 UTC (permalink / raw)
To: Andrew Morton, lkml, linux-mm

On Wed, Dec 18, 2002 at 09:53:18PM -0800, Andrew Morton wrote:
>>> url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.52/2.5.52-mm2/

On Thu, Dec 19, 2002 at 12:54:26AM -0800, William Lee Irwin III wrote:
>> Kernel compile on ramfs, shpte off, overcommit on (probably more like a
>> stress test for shpte):

On Thu, Dec 19, 2002 at 01:28:53AM -0800, William Lee Irwin III wrote:
> With shpte on:

With the following patch:

c013d4d4 94944     0.310788  zap_pte_range
c01355d0 104773    0.342962  nr_free_pages
c014f65c 107566    0.352105  __fput
c01b1750 112055    0.366799  __copy_user_intel
c0115350 121040    0.39621   smp_apic_timer_interrupt
c0119814 126089    0.412738  kmap_atomic
c014b6cc 145095    0.474952  pte_unshare
c01fb11c 145992    0.477888  sync_buffer
c0122a78 148079    0.484719  current_kernel_time
c01168b8 193805    0.634398  x86_profile_hook
c013f140 205233    0.671806  do_no_page
c0164aac 235356    0.77041   d_lookup
c01b18f8 257358    0.842431  __copy_from_user
c0131f7c 275559    0.90201   find_get_page
c011a560 282341    0.92421   scheduler_tick
c0140090 300128    0.982434  vm_enough_memory
c013f4bc 310474    1.0163    handle_mm_fault
c014f3d0 312725    1.02367   get_empty_filp
c011a0a8 365066    1.195     load_balance
c014f9e9 502737    1.64565   .text.lock.file_table
c01b1890 719105    2.35391   __copy_to_user
c0135768 911894    2.98498   __get_page_state
c013ee50 952823    3.11895   do_anonymous_page
c01436d0 1079864   3.53481   page_add_rmap
c01438cc 1186938   3.8853    page_remove_rmap
c0106f38 17763755  58.1476   poll_idle

pfn_to_nid() got lots of icache misses.  Try using a macro.
 arch/i386/kernel/i386_ksyms.c |    1 -
 arch/i386/kernel/numaq.c      |   15 ++-------------
 include/asm-i386/numaq.h      |    3 ++-
 3 files changed, 4 insertions(+), 15 deletions(-)

diff -urpN linux-2.5.52-mm1/arch/i386/kernel/i386_ksyms.c mm1-2.5.52-1/arch/i386/kernel/i386_ksyms.c
--- linux-2.5.52-mm1/arch/i386/kernel/i386_ksyms.c	2002-12-16 19:29:45.000000000 -0800
+++ mm1-2.5.52-1/arch/i386/kernel/i386_ksyms.c	2002-12-17 08:47:25.000000000 -0800
@@ -67,7 +67,6 @@ EXPORT_SYMBOL(EISA_bus);
 EXPORT_SYMBOL(MCA_bus);
 #ifdef CONFIG_DISCONTIGMEM
 EXPORT_SYMBOL(node_data);
-EXPORT_SYMBOL(pfn_to_nid);
 #endif
 #ifdef CONFIG_X86_NUMAQ
 EXPORT_SYMBOL(xquad_portio);
diff -urpN linux-2.5.52-mm1/arch/i386/kernel/numaq.c mm1-2.5.52-1/arch/i386/kernel/numaq.c
--- linux-2.5.52-mm1/arch/i386/kernel/numaq.c	2002-12-15 18:08:13.000000000 -0800
+++ mm1-2.5.52-1/arch/i386/kernel/numaq.c	2002-12-17 08:51:44.000000000 -0800
@@ -27,6 +27,7 @@
 #include <linux/mm.h>
 #include <linux/bootmem.h>
 #include <linux/mmzone.h>
+#include <linux/module.h>
 #include <asm/numaq.h>

 /* These are needed before the pgdat's are created */
@@ -82,19 +83,7 @@ static void __init smp_dump_qct(void)
  * physnode_map[8- ] = -1;
  */
 int physnode_map[MAX_ELEMENTS] = { [0 ... (MAX_ELEMENTS - 1)] = -1};
-
-#define PFN_TO_ELEMENT(pfn) (pfn / PAGES_PER_ELEMENT)
-#define PA_TO_ELEMENT(pa) (PFN_TO_ELEMENT(pa >> PAGE_SHIFT))
-
-int pfn_to_nid(unsigned long pfn)
-{
-	int nid = physnode_map[PFN_TO_ELEMENT(pfn)];
-
-	if (nid == -1)
-		BUG(); /* address is not present */
-
-	return nid;
-}
+EXPORT_SYMBOL(physnode_map);

 /*
  * for each node mark the regions
diff -urpN linux-2.5.52-mm1/include/asm-i386/numaq.h mm1-2.5.52-1/include/asm-i386/numaq.h
--- linux-2.5.52-mm1/include/asm-i386/numaq.h	2002-12-15 18:08:09.000000000 -0800
+++ mm1-2.5.52-1/include/asm-i386/numaq.h	2002-12-17 08:45:19.000000000 -0800
@@ -38,10 +38,11 @@
 #define MAX_ELEMENTS 256
 #define PAGES_PER_ELEMENT (16777216/256)

+extern int physnode_map[];
+#define pfn_to_nid(pfn) ({ physnode_map[(pfn) / PAGES_PER_ELEMENT]; })
 #define pfn_to_pgdat(pfn) NODE_DATA(pfn_to_nid(pfn))
 #define PHYSADDR_TO_NID(pa) pfn_to_nid(pa >> PAGE_SHIFT)
 #define MAX_NUMNODES 8
-extern int pfn_to_nid(unsigned long);
 extern void get_memcfg_numaq(void);

 #define get_memcfg_numa() get_memcfg_numaq()
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: 2.5.52-mm2
  2002-12-19 10:12 ` 2.5.52-mm2 William Lee Irwin III
@ 2002-12-19 10:31   ` Andrew Morton
  2002-12-19 10:51     ` 2.5.52-mm2 William Lee Irwin III
  2002-12-19 15:22     ` 2.5.52-mm2 Martin J. Bligh
  0 siblings, 2 replies; 11+ messages in thread
From: Andrew Morton @ 2002-12-19 10:31 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: lkml, linux-mm

William Lee Irwin III wrote:
>
> On Wed, Dec 18, 2002 at 09:53:18PM -0800, Andrew Morton wrote:
> >>> url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.52/2.5.52-mm2/
>
> On Thu, Dec 19, 2002 at 12:54:26AM -0800, William Lee Irwin III wrote:
> >> Kernel compile on ramfs, shpte off, overcommit on (probably more like a
> >> stress test for shpte):
>
> On Thu, Dec 19, 2002 at 01:28:53AM -0800, William Lee Irwin III wrote:
> > With shpte on:
>
> With the following patch:
>
> c013d4d4 94944     0.310788  zap_pte_range
> c01355d0 104773    0.342962  nr_free_pages
> c014f65c 107566    0.352105  __fput
> c01b1750 112055    0.366799  __copy_user_intel
> c0115350 121040    0.39621   smp_apic_timer_interrupt
> c0119814 126089    0.412738  kmap_atomic
> c014b6cc 145095    0.474952  pte_unshare
> c01fb11c 145992    0.477888  sync_buffer
> c0122a78 148079    0.484719  current_kernel_time
> c01168b8 193805    0.634398  x86_profile_hook
> c013f140 205233    0.671806  do_no_page
> c0164aac 235356    0.77041   d_lookup
> c01b18f8 257358    0.842431  __copy_from_user
> c0131f7c 275559    0.90201   find_get_page
> c011a560 282341    0.92421   scheduler_tick
> c0140090 300128    0.982434  vm_enough_memory
> c013f4bc 310474    1.0163    handle_mm_fault
> c014f3d0 312725    1.02367   get_empty_filp
> c011a0a8 365066    1.195     load_balance
> c014f9e9 502737    1.64565   .text.lock.file_table
> c01b1890 719105    2.35391   __copy_to_user
> c0135768 911894    2.98498   __get_page_state
> c013ee50 952823    3.11895   do_anonymous_page
> c01436d0 1079864   3.53481   page_add_rmap
> c01438cc 1186938   3.8853    page_remove_rmap
> c0106f38 17763755  58.1476   poll_idle

Is that improved?

> pfn_to_nid() got lots of icache misses.  Try using a macro.
What's the callsite?

Actually, just looking at mmzone.h, I have to say "ick".  The non-NUMA
case seems unnecessarily overdone.  eg:

#define page_to_pfn(page) \
	((page - page_zone(page)->zone_mem_map) + page_zone(page)->zone_start_pfn)

Ouch.  Why can't we have the good old `page - mem_map' here?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: 2.5.52-mm2
  2002-12-19 10:31 ` 2.5.52-mm2 Andrew Morton
@ 2002-12-19 10:51   ` William Lee Irwin III
  2002-12-19 15:22 ` 2.5.52-mm2 Martin J. Bligh
  1 sibling, 0 replies; 11+ messages in thread
From: William Lee Irwin III @ 2002-12-19 10:51 UTC (permalink / raw)
To: Andrew Morton; +Cc: lkml, linux-mm

William Lee Irwin III wrote:
>> c014f9e9 502737    1.64565   .text.lock.file_table
>> c01b1890 719105    2.35391   __copy_to_user
>> c0135768 911894    2.98498   __get_page_state
>> c013ee50 952823    3.11895   do_anonymous_page
>> c01436d0 1079864   3.53481   page_add_rmap
>> c01438cc 1186938   3.8853    page_remove_rmap
>> c0106f38 17763755  58.1476   poll_idle

On Thu, Dec 19, 2002 at 02:31:32AM -0800, Andrew Morton wrote:
> Is that improved?

Only in the qualitative sense.  No statistically significant differences
in running times are observable.

William Lee Irwin III wrote:
>> pfn_to_nid() got lots of icache misses.  Try using a macro.

On Thu, Dec 19, 2002 at 02:31:32AM -0800, Andrew Morton wrote:
> What's the callsite?

I was not able to collect this information and would be much obliged to
hear of how to do so for instruction cache profiling.

On Thu, Dec 19, 2002 at 02:31:32AM -0800, Andrew Morton wrote:
> Actually, just looking at mmzone.h, I have to say "ick".  The
> non-NUMA case seems unnecessarily overdone.  eg:
> #define page_to_pfn(page) \
>	((page - page_zone(page)->zone_mem_map) + page_zone(page)->zone_start_pfn)
> Ouch.  Why can't we have the good old `page - mem_map' here?

There is no reason why it could not be re-established for the Pee Cee
case.  IMHO it would be excellent to have the proper architecture
discrimination defines in place so that Pee Cees could utilize the
vastly simpler calculation.  So long as this is devolved to arch code
there is no danger of page->virtual being required by arch-independent
code.

Thanks,
Bill
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: 2.5.52-mm2
  2002-12-19 10:31 ` 2.5.52-mm2 Andrew Morton
  2002-12-19 10:51   ` 2.5.52-mm2 William Lee Irwin III
@ 2002-12-19 15:22 ` Martin J. Bligh
  1 sibling, 0 replies; 11+ messages in thread
From: Martin J. Bligh @ 2002-12-19 15:22 UTC (permalink / raw)
To: Andrew Morton, William Lee Irwin III; +Cc: lkml, linux-mm

> Actually, just looking at mmzone.h, I have to say "ick".  The
> non-NUMA case seems unnecessarily overdone.  eg:
>
> #define page_to_pfn(page) \
>	((page - page_zone(page)->zone_mem_map) + page_zone(page)->zone_start_pfn)
>
> Ouch.  Why can't we have the good old `page - mem_map' here?

Ummm ....

mmzone.h:

#ifdef CONFIG_DISCONTIGMEM
....
#define page_to_pfn(page) \
	((page - page_zone(page)->zone_mem_map) + page_zone(page)->zone_start_pfn)
....
#endif /* CONFIG_DISCONTIGMEM */

page.h:

#ifndef CONFIG_DISCONTIGMEM
#define page_to_pfn(page)	((unsigned long)((page) - mem_map))
#endif /* !CONFIG_DISCONTIGMEM */

I'll admit the file obfuscation hides this from being easy to read, but
I'm not stupid enough to screw things up *that* badly.  Well, not most
of the time ;-)

Want me to reshuffle things around so that the same defines end up in
the same file, and people have a hope in hell of reading it?  If I do
that, it'll probably be based on the struct page breakout patch, and
making these things all static inlines, so people stop blowing their
own feet off.

M.

PS. cscope is cool ;-)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: 2.5.52-mm2
  2002-12-19  5:53 2.5.52-mm2 Andrew Morton
  2002-12-19  8:54 ` 2.5.52-mm2 William Lee Irwin III
@ 2002-12-19  9:41 ` Andrew Morton
  2002-12-19  9:50   ` 2.5.52-mm2 Andrew Morton
  2002-12-19 17:02   ` mremap use-after-free [was Re: 2.5.52-mm2] Hugh Dickins
  1 sibling, 2 replies; 11+ messages in thread
From: Andrew Morton @ 2002-12-19 9:41 UTC (permalink / raw)
To: lkml, linux-mm

Andrew Morton wrote:
>
> ...
> slab-poisoning.patch
>   more informative slab poisoning

This patch has exposed a quite long-standing use-after-free bug in
mremap().  It makes the machine go BUG when starting the X server if
memory debugging is turned on.

The bug might be present in 2.4 as well..

--- 25/mm/mremap.c~move_vma-use-after-free	Thu Dec 19 00:51:49 2002
+++ 25-akpm/mm/mremap.c	Thu Dec 19 01:08:45 2002
@@ -183,14 +183,16 @@ static unsigned long move_vma(struct vm_
 	next = find_vma_prev(mm, new_addr, &prev);
 	if (next) {
 		if (prev && prev->vm_end == new_addr &&
-			can_vma_merge(prev, vma->vm_flags) && !vma->vm_file && !(vma->vm_flags & VM_SHARED)) {
+			can_vma_merge(prev, vma->vm_flags) &&
+			!(vma->vm_flags & VM_SHARED)) {
 			spin_lock(&mm->page_table_lock);
 			prev->vm_end = new_addr + new_len;
 			spin_unlock(&mm->page_table_lock);
 			new_vma = prev;
 			if (next != prev->vm_next)
 				BUG();
-			if (prev->vm_end == next->vm_start && can_vma_merge(next, prev->vm_flags)) {
+			if (prev->vm_end == next->vm_start &&
+			    can_vma_merge(next, prev->vm_flags)) {
 				spin_lock(&mm->page_table_lock);
 				prev->vm_end = next->vm_end;
 				__vma_unlink(mm, next, prev);
@@ -201,7 +203,8 @@ static unsigned long move_vma(struct vm_
 				kmem_cache_free(vm_area_cachep, next);
 			}
 		} else if (next->vm_start == new_addr + new_len &&
-			can_vma_merge(next, vma->vm_flags) && !vma->vm_file && !(vma->vm_flags & VM_SHARED)) {
+			can_vma_merge(next, vma->vm_flags) &&
+			!(vma->vm_flags & VM_SHARED)) {
 			spin_lock(&mm->page_table_lock);
 			next->vm_start = new_addr;
 			spin_unlock(&mm->page_table_lock);
@@ -210,7 +213,8 @@ static unsigned long move_vma(struct vm_
 	} else {
 		prev = find_vma(mm, new_addr-1);
 		if (prev && prev->vm_end == new_addr &&
-			can_vma_merge(prev, vma->vm_flags) && !vma->vm_file && !(vma->vm_flags & VM_SHARED)) {
+			can_vma_merge(prev, vma->vm_flags) &&
+			!(vma->vm_flags & VM_SHARED)) {
 			spin_lock(&mm->page_table_lock);
 			prev->vm_end = new_addr + new_len;
 			spin_unlock(&mm->page_table_lock);
@@ -227,12 +231,16 @@ static unsigned long move_vma(struct vm_
 	}

 	if (!move_page_tables(vma, new_addr, addr, old_len)) {
+		unsigned long must_fault_in;
+		unsigned long fault_in_start;
+		unsigned long fault_in_end;
+
 		if (allocated_vma) {
 			*new_vma = *vma;
 			INIT_LIST_HEAD(&new_vma->shared);
 			new_vma->vm_start = new_addr;
 			new_vma->vm_end = new_addr+new_len;
-			new_vma->vm_pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+			new_vma->vm_pgoff += (addr-vma->vm_start) >> PAGE_SHIFT;
 			if (new_vma->vm_file)
 				get_file(new_vma->vm_file);
 			if (new_vma->vm_ops && new_vma->vm_ops->open)
@@ -251,19 +259,25 @@ static unsigned long move_vma(struct vm_
 		} else
 			vma = NULL;		/* nothing more to do */

-		do_munmap(current->mm, addr, old_len);
-
 		/* Restore VM_ACCOUNT if one or two pieces of vma left */
 		if (vma) {
 			vma->vm_flags |= VM_ACCOUNT;
 			if (split)
 				vma->vm_next->vm_flags |= VM_ACCOUNT;
 		}
+
+		must_fault_in = new_vma->vm_flags & VM_LOCKED;
+		fault_in_start = new_vma->vm_start;
+		fault_in_end = new_vma->vm_end;
+
+		do_munmap(current->mm, addr, old_len);
+
+		/* new_vma could have been invalidated by do_munmap */
+
 		current->mm->total_vm += new_len >> PAGE_SHIFT;
-		if (new_vma->vm_flags & VM_LOCKED) {
+		if (must_fault_in) {
 			current->mm->locked_vm += new_len >> PAGE_SHIFT;
-			make_pages_present(new_vma->vm_start,
-					   new_vma->vm_end);
+			make_pages_present(fault_in_start, fault_in_end);
 		}
 		return new_addr;
 	}
_
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: 2.5.52-mm2
  2002-12-19  9:41 ` 2.5.52-mm2 Andrew Morton
@ 2002-12-19  9:50   ` Andrew Morton
  2002-12-19 17:02   ` mremap use-after-free [was Re: 2.5.52-mm2] Hugh Dickins
  1 sibling, 0 replies; 11+ messages in thread
From: Andrew Morton @ 2002-12-19 9:50 UTC (permalink / raw)
To: lkml, linux-mm

Andrew Morton wrote:
>
> Andrew Morton wrote:
> >
> > ...
> > slab-poisoning.patch
> >   more informative slab poisoning
>
> This patch has exposed a quite long-standing use-after-free bug in
> mremap().  It makes the machine go BUG when starting the X server if
> memory debugging is turned on.
>
> The bug might be present in 2.4 as well..

Here's a 2.4 patch:

move_vma() calls do_munmap() and then uses the memory at *new_vma.  But
when starting X11 it just happens that the memory which do_munmap()
unmapped had the same start address as the range at *new_vma.  So
new_vma is freed by do_munmap().

This was never noticed before because (vm_flags & VM_LOCKED) evaluates
false when vm_flags is 0x5a5a5a5a.  But I just changed that to
0x6b6b6b6b, and we call make_pages_present() with start == end ==
0x6b6b6b6b and it goes BUG.

The fix is for move_vma() to not inspect the values of any vmas after
it has called do_munmap().  The patch does that, for `vma' and
`new_vma'.

It also removes three redundant tests of !vma->vm_file.
can_vma_merge() already tested that.
 mm/mremap.c |   31 +++++++++++++++++++++++--------
 1 files changed, 23 insertions(+), 8 deletions(-)

--- 24/mm/mremap.c~move_vma-use-after-free	Thu Dec 19 01:29:52 2002
+++ 24-akpm/mm/mremap.c	Thu Dec 19 01:31:43 2002
@@ -134,14 +134,16 @@ static inline unsigned long move_vma(str
 	next = find_vma_prev(mm, new_addr, &prev);
 	if (next) {
 		if (prev && prev->vm_end == new_addr &&
-			can_vma_merge(prev, vma->vm_flags) && !vma->vm_file && !(vma->vm_flags & VM_SHARED)) {
+			can_vma_merge(prev, vma->vm_flags) &&
+			!(vma->vm_flags & VM_SHARED)) {
 			spin_lock(&mm->page_table_lock);
 			prev->vm_end = new_addr + new_len;
 			spin_unlock(&mm->page_table_lock);
 			new_vma = prev;
 			if (next != prev->vm_next)
 				BUG();
-			if (prev->vm_end == next->vm_start && can_vma_merge(next, prev->vm_flags)) {
+			if (prev->vm_end == next->vm_start &&
+			    can_vma_merge(next, prev->vm_flags)) {
 				spin_lock(&mm->page_table_lock);
 				prev->vm_end = next->vm_end;
 				__vma_unlink(mm, next, prev);
@@ -151,7 +153,8 @@ static inline unsigned long move_vma(str
 				kmem_cache_free(vm_area_cachep, next);
 			}
 		} else if (next->vm_start == new_addr + new_len &&
-			can_vma_merge(next, vma->vm_flags) && !vma->vm_file && !(vma->vm_flags & VM_SHARED)) {
+			can_vma_merge(next, vma->vm_flags) &&
+			!(vma->vm_flags & VM_SHARED)) {
 			spin_lock(&mm->page_table_lock);
 			next->vm_start = new_addr;
 			spin_unlock(&mm->page_table_lock);
@@ -160,7 +163,8 @@ static inline unsigned long move_vma(str
 	} else {
 		prev = find_vma(mm, new_addr-1);
 		if (prev && prev->vm_end == new_addr &&
-			can_vma_merge(prev, vma->vm_flags) && !vma->vm_file && !(vma->vm_flags & VM_SHARED)) {
+			can_vma_merge(prev, vma->vm_flags) &&
+			!(vma->vm_flags & VM_SHARED)) {
 			spin_lock(&mm->page_table_lock);
 			prev->vm_end = new_addr + new_len;
 			spin_unlock(&mm->page_table_lock);
@@ -177,11 +181,15 @@ static inline unsigned long move_vma(str
 	}

 	if (!move_page_tables(current->mm, new_addr, addr, old_len)) {
+		unsigned long must_fault_in;
+		unsigned long fault_in_start;
+		unsigned long fault_in_end;
+
 		if (allocated_vma) {
 			*new_vma = *vma;
 			new_vma->vm_start = new_addr;
 			new_vma->vm_end = new_addr+new_len;
-			new_vma->vm_pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+			new_vma->vm_pgoff += (addr-vma->vm_start) >> PAGE_SHIFT;
 			new_vma->vm_raend = 0;
 			if (new_vma->vm_file)
 				get_file(new_vma->vm_file);
@@ -189,12 +197,19 @@ static inline unsigned long move_vma(str
 				new_vma->vm_ops->open(new_vma);
 			insert_vm_struct(current->mm, new_vma);
 		}
+
+		must_fault_in = new_vma->vm_flags & VM_LOCKED;
+		fault_in_start = new_vma->vm_start;
+		fault_in_end = new_vma->vm_end;
+
 		do_munmap(current->mm, addr, old_len);
+
+		/* new_vma could have been invalidated by do_munmap */
+
 		current->mm->total_vm += new_len >> PAGE_SHIFT;
-		if (new_vma->vm_flags & VM_LOCKED) {
+		if (must_fault_in) {
 			current->mm->locked_vm += new_len >> PAGE_SHIFT;
-			make_pages_present(new_vma->vm_start,
-					   new_vma->vm_end);
+			make_pages_present(fault_in_start, fault_in_end);
 		}
 		return new_addr;
 	}
_
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 11+ messages in thread
* mremap use-after-free [was Re: 2.5.52-mm2]
  2002-12-19  9:41 ` 2.5.52-mm2 Andrew Morton
  2002-12-19  9:50   ` 2.5.52-mm2 Andrew Morton
@ 2002-12-19 17:02   ` Hugh Dickins
  2002-12-19 18:18     ` Andrew Morton
  1 sibling, 1 reply; 11+ messages in thread
From: Hugh Dickins @ 2002-12-19 17:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: lkml, linux-mm

On Thu, 19 Dec 2002, Andrew Morton wrote:
> Andrew Morton wrote:
> > ...
> > slab-poisoning.patch
> >   more informative slab poisoning
>
> This patch has exposed a quite long-standing use-after-free bug in
> mremap().  It makes the machine go BUG when starting the X server if
> memory debugging is turned on.

Good catch, shame about the patch!  Please don't apply this, or its 2.4
sister, as is.

> The bug might be present in 2.4 as well..

I doubt that (but may be wrong, I haven't time right now to think as
twistedly as mremap demands).  The code (patently!) expects new_vma to
be good at the end, it certainly wasn't intending to unmap it; but 2.5
split_vma has been through a couple of convulsions, either of which
might have resulted in the potential for new_vma to be freed where
before it was guaranteed to remain.

Do you know the vmas before and after, and the mremap which did this?
I couldn't reproduce it starting X here, and an example would help to
uncloud my mind.  But you can reasonably answer that one example won't
prove anything, and where there's doubt, the code must be defensive.
(Besides, I'll be offline shortly.)

On to the patch...
> --- 25/mm/mremap.c~move_vma-use-after-free	Thu Dec 19 00:51:49 2002
> +++ 25-akpm/mm/mremap.c	Thu Dec 19 01:08:45 2002
> @@ -183,14 +183,16 @@ static unsigned long move_vma(struct vm_
>  	next = find_vma_prev(mm, new_addr, &prev);
>  	if (next) {
>  		if (prev && prev->vm_end == new_addr &&
> -			can_vma_merge(prev, vma->vm_flags) && !vma->vm_file && !(vma->vm_flags & VM_SHARED)) {
> +			can_vma_merge(prev, vma->vm_flags) &&
> +			!(vma->vm_flags & VM_SHARED)) {
>  			spin_lock(&mm->page_table_lock);
>  			prev->vm_end = new_addr + new_len;
>  			spin_unlock(&mm->page_table_lock);
>  			new_vma = prev;
>  			if (next != prev->vm_next)
>  				BUG();
> -			if (prev->vm_end == next->vm_start && can_vma_merge(next, prev->vm_flags)) {
> +			if (prev->vm_end == next->vm_start &&
> +			    can_vma_merge(next, prev->vm_flags)) {
>  				spin_lock(&mm->page_table_lock);
>  				prev->vm_end = next->vm_end;
>  				__vma_unlink(mm, next, prev);
> @@ -201,7 +203,8 @@ static unsigned long move_vma(struct vm_
>  				kmem_cache_free(vm_area_cachep, next);
>  			}
>  		} else if (next->vm_start == new_addr + new_len &&
> -			can_vma_merge(next, vma->vm_flags) && !vma->vm_file && !(vma->vm_flags & VM_SHARED)) {
> +			can_vma_merge(next, vma->vm_flags) &&
> +			!(vma->vm_flags & VM_SHARED)) {
>  			spin_lock(&mm->page_table_lock);
>  			next->vm_start = new_addr;
>  			spin_unlock(&mm->page_table_lock);
> @@ -210,7 +213,8 @@ static unsigned long move_vma(struct vm_
>  	} else {
>  		prev = find_vma(mm, new_addr-1);
>  		if (prev && prev->vm_end == new_addr &&
> -			can_vma_merge(prev, vma->vm_flags) && !vma->vm_file && !(vma->vm_flags & VM_SHARED)) {
> +			can_vma_merge(prev, vma->vm_flags) &&
> +			!(vma->vm_flags & VM_SHARED)) {
>  			spin_lock(&mm->page_table_lock);
>  			prev->vm_end = new_addr + new_len;
>  			spin_unlock(&mm->page_table_lock);

Hmmm.  Am I right to suppose that all the changes above are "cleanup"
which you couldn't resist making while you looked through this code,
but entirely irrelevant to the bug in question?
If those mods above were right, they should be the subject of a
separate patch.

There's certainly room for cleanup there, but my preference would be
to remove "can_vma_merge" completely, or at least its use in mremap.c,
using its explicit tests instead.  It looks like it was originally
quite appropriate for a use or two in mmap.c, but obscurely unhelpful
here - because in itself it is testing a bizarre asymmetric subset of
what's needed (that subset which remained to be tested in its original
use in mmap.c).

The problem with your changes above is, you've removed the !vma->vm_file
tests, presumably because you noticed that can_vma_merge already tests
!vma->vm_file.  But "vma" within can_vma_merge is "prev" or "next" here:
they are distinct tests, and you're now liable to merge an anonymous
mapping with a private file mapping - nice if it's from /dev/zero,
but otherwise not.  Please just cut those hunks out.

(Of course, I wouldn't have spotted this if I hadn't embarked on, then
retreated from, a similar cleanup myself a few months back.)

> @@ -227,12 +231,16 @@ static unsigned long move_vma(struct vm_
>  	}
>
>  	if (!move_page_tables(vma, new_addr, addr, old_len)) {
> +		unsigned long must_fault_in;
> +		unsigned long fault_in_start;
> +		unsigned long fault_in_end;
> +
>  		if (allocated_vma) {
>  			*new_vma = *vma;
>  			INIT_LIST_HEAD(&new_vma->shared);
>  			new_vma->vm_start = new_addr;
>  			new_vma->vm_end = new_addr+new_len;
> -			new_vma->vm_pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
> +			new_vma->vm_pgoff += (addr-vma->vm_start) >> PAGE_SHIFT;

Hrrmph.

>  			if (new_vma->vm_file)
>  				get_file(new_vma->vm_file);
>  			if (new_vma->vm_ops && new_vma->vm_ops->open)
> @@ -251,19 +259,25 @@ static unsigned long move_vma(struct vm_
>  		} else
>  			vma = NULL;		/* nothing more to do */
>
> -		do_munmap(current->mm, addr, old_len);
> -

Anguished cry!  There was careful manipulation of VM_ACCOUNT before and
after do_munmap, now you've for no reason moved do_munmap down outside.
>  		/* Restore VM_ACCOUNT if one or two pieces of vma left */
>  		if (vma) {
>  			vma->vm_flags |= VM_ACCOUNT;
>  			if (split)
>  				vma->vm_next->vm_flags |= VM_ACCOUNT;
>  		}
> +
> +		must_fault_in = new_vma->vm_flags & VM_LOCKED;
> +		fault_in_start = new_vma->vm_start;
> +		fault_in_end = new_vma->vm_end;
> +
> +		do_munmap(current->mm, addr, old_len);
> +
> +		/* new_vma could have been invalidated by do_munmap */
> +
>  		current->mm->total_vm += new_len >> PAGE_SHIFT;
> -		if (new_vma->vm_flags & VM_LOCKED) {
> +		if (must_fault_in) {
>  			current->mm->locked_vm += new_len >> PAGE_SHIFT;
> -			make_pages_present(new_vma->vm_start,
> -					new_vma->vm_end);
> +			make_pages_present(fault_in_start, fault_in_end);
>  		}
>  		return new_addr;
>  	}

But the bugfix part of it looks good.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/
* Re: mremap use-after-free [was Re: 2.5.52-mm2]
  2002-12-19 17:02 ` mremap use-after-free [was Re: 2.5.52-mm2] Hugh Dickins
@ 2002-12-19 18:18   ` Andrew Morton
  0 siblings, 0 replies; 11+ messages in thread
From: Andrew Morton @ 2002-12-19 18:18 UTC (permalink / raw)
To: Hugh Dickins; +Cc: lkml, linux-mm

Hugh Dickins wrote:
>
> ...
> I doubt that (but may be wrong, I haven't time right now to think as
> twistedly as mremap demands).

Maybe, maybe not.  Think about it - because VM_LOCKED is cleared by
slab poisoning, the chances of anyone noticing it are very small.

> The code (patently!) expects new_vma
> to be good at the end, it certainly wasn't intending to unmap it;
> but 2.5 split_vma has been through a couple of convulsions, either
> of which might have resulted in the potential for new_vma to be
> freed where before it was guaranteed to remain.

I see no "guarantees" around do_munmap() at all.  The whole thing is
foul; no wonder it keeps blowing up.

It died partway into kde startup.  It was in fact the first mremap call.

> Do you know the vmas before and after, and the mremap which did this?
Breakpoint 1, move_vma (vma=0xcf1ed734, addr=1077805056, old_len=36864,
    new_len=204800, new_addr=1078050816) at mm/mremap.c:176
176     {
(gdb) n
183             next = find_vma_prev(mm, new_addr, &prev);
(gdb)
177             struct mm_struct * mm = vma->vm_mm;
(gdb)
180             int split = 0;
(gdb)
182             new_vma = NULL;
(gdb)
183             next = find_vma_prev(mm, new_addr, &prev);
(gdb)
184             if (next) {
(gdb) p/x next
$1 = 0xce3bee84
(gdb) p/x *next
$2 = {vm_mm = 0xcf7cc624, vm_start = 0xbffcd000, vm_end = 0xc0000000,
  vm_next = 0x0, vm_page_prot = {pgprot = 0x25}, vm_flags = 0x100177,
  vm_rb = {rb_parent = 0xcf1ed74c, rb_color = 0x1, rb_right = 0x0,
    rb_left = 0x0}, shared = {next = 0xce3beeac, prev = 0xce3beeac},
  vm_ops = 0x0, vm_pgoff = 0xffffffce, vm_file = 0x0,
  vm_private_data = 0x0}
(gdb) p prev
$3 = (struct vm_area_struct *) 0xcf1ed734
(gdb) p/x *prev
$4 = {vm_mm = 0xcf7cc624, vm_start = 0x403e0000, vm_end = 0x4041c000,
  vm_next = 0xce3bee84, vm_page_prot = {pgprot = 0x25}, vm_flags = 0x100073,
  vm_rb = {rb_parent = 0xce3be4c4, rb_color = 0x0, rb_right = 0xce3bee9c,
    rb_left = 0xce3be1ac}, shared = {next = 0xcf1ed75c, prev = 0xcf1ed75c},
  vm_ops = 0x0, vm_pgoff = 0x0, vm_file = 0x0, vm_private_data = 0x0}

> ...
>
> Hmmm.  Am I right to suppose that all the changes above are "cleanup"
> which you couldn't resist making while you looked through this code,
> but entirely irrelevant to the bug in question?

to make the code readable so I could work on the bug; it then
"accidentally" got left in.

> If those mods above
> were right, they should be the subject of a separate patch.

well yes it would be nice to clean that code up.  Like, documenting it
and working out what the locking rules are.

What does this do?

	spin_lock(&mm->page_table_lock);
	prev->vm_end = new_addr + new_len;
	spin_unlock(&mm->page_table_lock);

and this?

	spin_lock(&mm->page_table_lock);
	next->vm_start = new_addr;
	spin_unlock(&mm->page_table_lock);

clearly nothing.  OK, but is this effectively-unlocked code safe?
> There's certainly room for cleanup there, but my preference would be
> to remove "can_vma_merge" completely, or at least its use in mremap.c,
> using its explicit tests instead.  It looks like it was originally
> quite appropriate for a use or two in mmap.c, but obscurely unhelpful
> here - because in itself it is testing a bizarre asymmetric subset of
> what's needed (that subset which remained to be tested in its original
> use in mmap.c).

yes

> The problem with your changes above is, you've removed the !vma->vm_file
> tests, presumably because you noticed that can_vma_merge already tests
> !vma->vm_file.  But "vma" within can_vma_merge is "prev" or "next" here:
> they are distinct tests, and you're now liable to merge an anonymous
> mapping with a private file mapping - nice if it's from /dev/zero,
> but otherwise not.  Please just cut those hunks out.

ok

> ...
> > -		do_munmap(current->mm, addr, old_len);
> > -
>
> Anguished cry!  There was careful manipulation of VM_ACCOUNT before and
> after do_munmap, now you've for no reason moved do_munmap down outside.

How do we know that *vma was not altered by the do_munmap() call?

With this change, nobody will touch any locally-cached vma*'s from the
do_munmap() right through to syscall exit.
end of thread, other threads:[~2002-12-19 18:18 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-12-19  5:53 2.5.52-mm2 Andrew Morton
2002-12-19  8:54 ` 2.5.52-mm2 William Lee Irwin III
2002-12-19  9:28   ` 2.5.52-mm2 William Lee Irwin III
2002-12-19 10:12     ` 2.5.52-mm2 William Lee Irwin III
2002-12-19 10:31       ` 2.5.52-mm2 Andrew Morton
2002-12-19 10:51         ` 2.5.52-mm2 William Lee Irwin III
2002-12-19 15:22           ` 2.5.52-mm2 Martin J. Bligh
2002-12-19  9:41 ` 2.5.52-mm2 Andrew Morton
2002-12-19  9:50   ` 2.5.52-mm2 Andrew Morton
2002-12-19 17:02     ` mremap use-after-free [was Re: 2.5.52-mm2] Hugh Dickins
2002-12-19 18:18       ` Andrew Morton