Date: Thu, 18 Sep 2003 13:21:49 -0700
From: Rusty Lynch
To: linux-mm@kvack.org
Cc: rusty@linux.co.intel.com
Subject: swapping to death by stressing mlock
Message-Id: <200309182021.h8IKLnqX006918@penguin.co.intel.com>

While getting more familiar with the vm subsystem I discovered that it
is fairly easy to lock up my system by mlocking enough memory. I believe
what is happening is that I am reducing the amount of swappable physical
RAM to the point that try_to_free_pages() goes into an endless loop
waiting for bdflush to free up some pages. I'm guessing this is not a
valid condition for a properly configured server, but since I'm not
feeling very confident about the above explanation, I'm not so sure this
isn't something to look into.

On my 2.6.0-test5 kernel I run a little utility that allocates a large
chunk of memory, touches every page in the buffer, and then mlocks the
buffer. Just setting vm.overcommit_memory=2 and a really low
vm.overcommit_ratio doesn't help much, since all I have to do is squeeze
out the available physical RAM that can be swapped out.
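A minimal sketch of this kind of utility (illustrative only, not the
exact test program; it assumes the buffer size is given in MB as the
first argument):

/*
 * Allocate a buffer, touch every page so it is actually backed by
 * physical RAM, then pin it with mlock() and sit there.  mlock()
 * wants CAP_IPC_LOCK, so run as root.  Size in MB as argv[1],
 * default 512.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
	size_t mb = (argc > 1) ? strtoul(argv[1], NULL, 0) : 512;
	size_t size = mb * 1024 * 1024;
	long page = sysconf(_SC_PAGESIZE);
	char *buf = malloc(size);
	size_t i;

	if (!buf) {
		perror("malloc");
		return 1;
	}
	for (i = 0; i < size; i += page)
		buf[i] = 1;			/* force each page in */
	if (mlock(buf, size) < 0) {		/* pin the whole buffer */
		perror("mlock");
		return 1;
	}
	printf("locked %lu MB, sleeping\n", (unsigned long)mb);
	pause();				/* hold the memory until killed */
	return 0;
}

An instance sized close to the machine's physical RAM, run as root, is
enough to wedge the box as described below.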
This is what I see for my offending process if I hit meta-sysrq-t:

fat_bastard   D 00000001  4293732848   598   550           (NOTLB)
cc9d3c78 00000082 c1285bc0 00000001 00000003 c1286580 c1285bc0 cc9d3c98
00000000 00000246 c014f520 cc9d3c6c cf033004 cf6ff000 00000007 00000000
00000000 ffff8258 cc9d3c8c 00000000 cc9d3cc4 c0134dde cc9d3c8c ffff8258
Call Trace:
 [] background_writeout+0x0/0xe0
 [] schedule_timeout+0x6e/0xc0
 [] process_timeout+0x0/0x10
 [] io_schedule_timeout+0x2b/0x40
 [] blk_congestion_wait+0x8b/0xa0
 [] autoremove_wake_function+0x0/0x50
 [] autoremove_wake_function+0x0/0x50
 [] try_to_free_pages+0x102/0x1c0
 [] __alloc_pages+0x1f7/0x3a0
 [] read_swap_cache_async+0xb1/0xbd
 [] swapin_readahead+0x42/0x90
 [] do_swap_page+0x268/0x340
 [] save_v86_state+0x4b/0x200
 [] handle_mm_fault+0xf1/0x200
 [] get_user_pages+0xee/0x3a0
 [] insert_vm_struct+0x6d/0x77
 [] make_pages_present+0x8d/0xa0
 [] mlock_fixup+0xe4/0x120
 [] capable+0x24/0x50
 [] do_mlock+0xe9/0x110
 [] sys_mlock+0xc7/0xe0
 [] syscall_call+0x7/0xb

If I attempt to kill all processes with meta-sysrq-i, then I start
seeing init stuck in the same spot:

init          D 00000001    21838320   606     1   605     (NOTLB)
cea9fc5c 00000082 c1285bc0 00000001 00000003 c1286580 c1285bc0 cea9fc7c
00000000 00000246 c014f520 cea9fc50 ce3d0004 cf6ff000 00000007 00000000
00000000 00076d98 cea9fc70 00000000 cea9fca8 c0134dde cea9fc70 00076d98
Call Trace:
 [] background_writeout+0x0/0xe0
 [] schedule_timeout+0x6e/0xc0
 [] process_timeout+0x0/0x10
 [] io_schedule_timeout+0x2b/0x40
 [] blk_congestion_wait+0x8b/0xa0
 [] autoremove_wake_function+0x0/0x50
 [] autoremove_wake_function+0x0/0x50
 [] try_to_free_pages+0x102/0x1c0
 [] __alloc_pages+0x1f7/0x3a0
 [] __do_page_cache_readahead+0x182/0x21e
 [] filemap_nopage+0x11f/0x330
 [] do_no_page+0xd1/0x3f0
 [] handle_mm_fault+0x118/0x200
 [] do_page_fault+0x176/0x4dc
 [] sigprocmask+0x71/0x150
 [] sys_rt_sigprocmask+0xa1/0x1e0
 [] do_page_fault+0x0/0x4dc
 [] error_code+0x2d/0x38

The current process (as seen via meta-sysrq-p) always seems to be the
swapper:

Pid: 0, comm: swapper
EIP: 0060:[] CPU: 0
EIP is at default_idle+0x30/0x40
EFLAGS: 00000246 Not tainted
EAX: 00000000 EBX: c0600000 ECX: 001d9b2e EDX: c0600000
ESI: c0600000 EDI: c010a040 EBP: c0601fb4 DS: 007b ES: 007b
CR0: 8005003b CR2: 0804d6a0 CR3: 0b9b8000 CR4: 00000680
Call Trace:
 [] cpu_idle+0x46/0x50
 [] rest_init+0x0/0x80
 [] start_kernel+0x181/0x1b0
 [] unknown_bootoption+0x0/0x100

I also noticed that try_to_free_pages() ignores the return value of
wakeup_bdflush(), so for kicks I changed it:

-	wakeup_bdflush(total_scanned);
+	WARN_ON(wakeup_bdflush(total_scanned));

After my system is nicely locked up, I start seeing tons of warnings
like:

Badness in try_to_free_pages at mm/vmscan.c:886
Call Trace:
 [] try_to_free_pages+0x1c8/0x1e0
 [] __alloc_pages+0x1f7/0x3a0
 [] __get_free_pages+0x22/0x50
 [] cache_grow+0x125/0x400
 [] del_timer_sync+0x2c/0x80
 [] kernel_map_pages+0x29/0x64
 [] cache_alloc_refill+0x13a/0x4c0
 [] kmem_cache_alloc+0x1b5/0x1e0
 [] getname+0x29/0xd0
 [] __user_walk+0x1b/0x60
 [] select_bits_alloc+0x1e/0x30
 [] vfs_stat+0x1e/0x60
 [] sys_select+0x23b/0x520
 [] sys_stat64+0x1b/0x40
 [] sys_time+0x35/0x70
 [] syscall_call+0x7/0xb

So... is my explanation on target? Is this a condition that would really
only pop up in crazy stress testing? If not, then maybe sys_mlock should
have an additional threshold?

    --rustyl
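P.S. The kind of threshold meant above might look something like this
(a rough, untested sketch against 2.6.0-test5's mm/mlock.c;
sysctl_mlock_reserve_pages is a made-up tunable, not an existing
sysctl):

/*
 * Hypothetical guard in sys_mlock()/do_mlock(): refuse to pin pages
 * once doing so would leave fewer than sysctl_mlock_reserve_pages
 * pages of physical RAM for the rest of the system to reclaim from.
 */
unsigned long locked = current->mm->locked_vm + (len >> PAGE_SHIFT);

if (locked + sysctl_mlock_reserve_pages > num_physpages)
	return -ENOMEM;

That would at least guarantee try_to_free_pages() always has some
minimum pool of unlockable pages left to work on.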