From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243])
	by kanga.kvack.org (Postfix) with SMTP id AA72B6B0071
	for ; Tue, 12 Jan 2010 00:59:21 -0500 (EST)
Received: from m2.gw.fujitsu.co.jp ([10.0.50.72])
	by fgwmail5.fujitsu.co.jp (Fujitsu Gateway) with ESMTP id o0C5xJre000942
	for (envelope-from kamezawa.hiroyu@jp.fujitsu.com);
	Tue, 12 Jan 2010 14:59:19 +0900
Received: from smail (m2 [127.0.0.1])
	by outgoing.m2.gw.fujitsu.co.jp (Postfix) with ESMTP id 1A99045DE4F
	for ; Tue, 12 Jan 2010 14:59:19 +0900 (JST)
Received: from s2.gw.fujitsu.co.jp (s2.gw.fujitsu.co.jp [10.0.50.92])
	by m2.gw.fujitsu.co.jp (Postfix) with ESMTP id DE9E845DE4E
	for ; Tue, 12 Jan 2010 14:59:18 +0900 (JST)
Received: from s2.gw.fujitsu.co.jp (localhost.localdomain [127.0.0.1])
	by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id C6E031DB803B
	for ; Tue, 12 Jan 2010 14:59:18 +0900 (JST)
Received: from m107.s.css.fujitsu.com (m107.s.css.fujitsu.com [10.249.87.107])
	by s2.gw.fujitsu.co.jp (Postfix) with ESMTP id 766411DB8038
	for ; Tue, 12 Jan 2010 14:59:18 +0900 (JST)
Date: Tue, 12 Jan 2010 14:56:03 +0900
From: KAMEZAWA Hiroyuki
Subject: Re: [RFC][BUGFIX][PATCH] memcg: ensure list is empty at rmdir
Message-Id: <20100112145603.06dc2de0.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <20100112140836.45e7fabb.nishimura@mxp.nes.nec.co.jp>
References: <20100112140836.45e7fabb.nishimura@mxp.nes.nec.co.jp>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Daisuke Nishimura
Cc: linux-mm, Balbir Singh
List-ID:

On Tue, 12 Jan 2010 14:08:36 +0900
Daisuke Nishimura wrote:

> I see a bug below at the end of last week after a very long run (more than 17 days)
> while testing mmotm-2009-12-10-17-19 + move charge patch.
>
>
> [1530745.949906] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230
> [1530745.950651] IP: [] mem_cgroup_del_lru_list+0x30/0x80
> [1530745.950651] PGD 3863de067 PUD 3862c7067 PMD 0
> [1530745.950651] Oops: 0002 [#1] SMP
> [1530745.950651] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index1/shared_cpu_map
> [1530745.950651] CPU 3
> [1530745.950651] Modules linked in: configs ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp nfsd nfs_acl auth_rpcgss exportfs autofs4 hidp rfcomm l2cap crc16 bluetooth lockd sunrpc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath scsi_dh video output sbs sbshc battery ac lp kvm_intel kvm sg ide_cd_mod cdrom serio_raw tpm_tis tpm tpm_bios acpi_memhotplug button parport_pc parport rtc_cmos rtc_core rtc_lib e1000 i2c_i801 i2c_core pcspkr dm_region_hash dm_log dm_mod ata_piix libata shpchp megaraid_mbox sd_mod scsi_mod megaraid_mm ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
> [1530745.950651] Pid: 19653, comm: shmem_test_02 Tainted: G M 2.6.32-mm1-00701-g2b04386 #3 Express5800/140Rd-4 [N8100-1065]
> [1530745.950651] RIP: 0010:[] [] mem_cgroup_del_lru_list+0x30/0x80
> [1530745.950651] RSP: 0018:ffff8803863ddcb8 EFLAGS: 00010002
> [1530745.950651] RAX: 00000000000001e0 RBX: ffff8803abc02238 RCX: 00000000000001e0
> [1530745.950651] RDX: 0000000000000000 RSI: ffff88038611a000 RDI: ffff8803abc02238
> [1530745.950651] RBP: ffff8803863ddcc8 R08: 0000000000000002 R09: ffff8803a04c8643
> [1530745.950651] R10: 0000000000000000 R11: ffffffff810c7333 R12: 0000000000000000
> [1530745.950651] R13: ffff880000017f00 R14: 0000000000000092 R15: ffff8800179d0310
> [1530745.950651] FS: 0000000000000000(0000) GS:ffff880017800000(0000) knlGS:0000000000000000
> [1530745.950651] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [1530745.950651] CR2: 0000000000000230 CR3: 0000000379d87000 CR4: 00000000000006e0
> [1530745.950651] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [1530745.950651] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [1530745.950651] Process shmem_test_02 (pid: 19653, threadinfo ffff8803863dc000, task ffff88038612a8a0)
> [1530745.950651] Stack:
> [1530745.950651] ffffea00040c2fe8 0000000000000000 ffff8803863ddd98 ffffffff810c739a
> [1530745.950651] <0> 00000000863ddd18 000000000000000c 0000000000000000 0000000000000000
> [1530745.950651] <0> 0000000000000002 0000000000000000 ffff8803863ddd68 0000000000000046
> [1530745.950651] Call Trace:
> [1530745.950651] [] release_pages+0x142/0x1e7
> [1530745.950651] [] ? pagevec_move_tail+0x6e/0x112
> [1530745.950651] [] pagevec_move_tail+0xfd/0x112
> [1530745.950651] [] lru_add_drain+0x76/0x94
> [1530745.950651] [] exit_mmap+0x6e/0x145
> [1530745.950651] [] mmput+0x5e/0xcf
> [1530745.950651] [] exit_mm+0x11c/0x129
> [1530745.950651] [] ? audit_free+0x196/0x1c9
> [1530745.950651] [] do_exit+0x1f5/0x6b7
> [1530745.950651] [] ? up_read+0x2b/0x2f
> [1530745.950651] [] ? lockdep_sys_exit_thunk+0x35/0x67
> [1530745.950651] [] do_group_exit+0x83/0xb0
> [1530745.950651] [] sys_exit_group+0x17/0x1b
> [1530745.950651] [] system_call_fastpath+0x16/0x1b
> [1530745.950651] Code: 54 53 0f 1f 44 00 00 83 3d cc 29 7c 00 00 41 89 f4 75 63 eb 4e 48 83 7b 08 00 75 04 0f 0b eb fe 48 89 df e8 18 f3 ff ff 44 89 e2 <48> ff 4c d0 50 48 8b 05 2b 2d 7c 00 48 39 43 08 74 39 48 8b 4b
> [1530745.950651] RIP [] mem_cgroup_del_lru_list+0x30/0x80
> [1530745.950651] RSP
> [1530745.950651] CR2: 0000000000000230
> [1530745.950651] ---[ end trace c3419c1bb8acc34f ]---
> [1530745.950651] Fixing recursive fault but reboot is needed!
>
>
> gdb says:
>
> (gdb) list *0xffffffff810fbc11
> 0xffffffff810fbc11 is in mem_cgroup_del_lru_list (mm/memcontrol.c:683).
> warning: Source file is more recent than executable.
> 678     /*
> 679      * We don't check PCG_USED bit. It's cleared when the "page" is finally
> 680      * removed from global LRU.
> 681      */
> 682     mz = page_cgroup_zoneinfo(pc);
> 683     MEM_CGROUP_ZSTAT(mz, lru) -= 1;
> 684     if (mem_cgroup_is_root(pc->mem_cgroup))
> 685             return;
> 686     VM_BUG_ON(list_empty(&pc->lru));
> 687     list_del_init(&pc->lru);
> (gdb) disassemble mem_cgroup_del_lru_list
> Dump of assembler code for function mem_cgroup_del_lru_list:
> 0xffffffff810fbbe1 : push %rbp
> 0xffffffff810fbbe2 : mov %rsp,%rbp
> 0xffffffff810fbbe5 : push %r12
> 0xffffffff810fbbe7 : push %rbx
> 0xffffffff810fbbe8 : callq 0xffffffff81002900
> 0xffffffff810fbbed : cmpl $0x0,0x7c29cc(%rip) # 0xffffffff818be5c0
> 0xffffffff810fbbf4 : mov %esi,%r12d
> 0xffffffff810fbbf7 : jne 0xffffffff810fbc5c
> 0xffffffff810fbbf9 : jmp 0xffffffff810fbc49
> 0xffffffff810fbbfb : cmpq $0x0,0x8(%rbx)
> 0xffffffff810fbc00 : jne 0xffffffff810fbc06
> 0xffffffff810fbc02 : ud2a
> 0xffffffff810fbc04 : jmp 0xffffffff810fbc04
> 0xffffffff810fbc06 : mov %rbx,%rdi
> 0xffffffff810fbc09 : callq 0xffffffff810faf26
> 0xffffffff810fbc0e : mov %r12d,%edx
> 0xffffffff810fbc11 : decq 0x50(%rax,%rdx,8)
> 0xffffffff810fbc16 : mov 0x7c2d2b(%rip),%rax # 0xffffffff818be948
> 0xffffffff810fbc1d : cmp %rax,0x8(%rbx)
> 0xffffffff810fbc21 : je 0xffffffff810fbc5c
> 0xffffffff810fbc23 : mov 0x18(%rbx),%rcx
> 0xffffffff810fbc27 : lea 0x18(%rbx),%rdx
> 0xffffffff810fbc2b : cmp %rdx,%rcx
> 0xffffffff810fbc2e : jne 0xffffffff810fbc34
> 0xffffffff810fbc30 : ud2a
> 0xffffffff810fbc32 : jmp 0xffffffff810fbc32
> 0xffffffff810fbc34 : mov 0x8(%rdx),%rax
> 0xffffffff810fbc38 : mov %rax,0x8(%rcx)
> 0xffffffff810fbc3c : mov %rcx,(%rax)
> 0xffffffff810fbc3f : mov %rdx,0x8(%rdx)
> 0xffffffff810fbc43 : mov %rdx,0x18(%rbx)
> 0xffffffff810fbc47 : jmp 0xffffffff810fbc5c
> 0xffffffff810fbc49 : callq 0xffffffff810ff5e9
> 0xffffffff810fbc4e : mov %rax,%rbx
> 0xffffffff810fbc51 : lock btrl $0x3,(%rax)
> 0xffffffff810fbc56 : sbb %eax,%eax
> 0xffffffff810fbc58 : test %eax,%eax
> 0xffffffff810fbc5a : jne 0xffffffff810fbbfb
> 0xffffffff810fbc5c : pop %rbx
> 0xffffffff810fbc5d : pop %r12
> 0xffffffff810fbc5f : leaveq
> 0xffffffff810fbc60 : retq
> End of assembler dump.
>
>
> These outputs mean that MEM_CGROUP_ZSTAT() is called with @mz = 0x00000000000001e0.
> So I suspect that the mem_cgroup has already been freed by rmdir at this point.
>
> I found a race condition which seems to be the root cause of this problem.
> If it is a valid fix, I think it should go to stable too.
>
> Any comments?
>
Hmm. mem->info.nodeinfo[x] == NULL when the function is called.

>
> Thanks,
> Daisuke Nishimura.
> ===
>
> This patch tries to fix this bug by ensuring not only the usage is zero but also
> all of the LRUs are empty. mem_cgroup_del_lru_list() checks the list is empty
> or not, so we can make use of it.
>
Ah, ok. We call lru_add_drain() but don't check whether the LRU is really empty.

It seems this patch can fix the problem. Thank you for the great fix.

Acked-by: KAMEZAWA Hiroyuki

The following are nitpicks.

> -	}
> -	ret = 0;
> +	} while (mem->res.usage > 0 || ret);

This seems unclear. (Not your mistake, maybe mine.)

BTW, I think it's better to move drain_all_stock_sync() too, as in:
==
	do {
		ret = -EBUSY;
		if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
			goto out;
		ret = -EINTR;
		if (signal_pending(current))
			goto out;
		/* This is for making all *used* pages to be on LRU. */
		lru_add_drain_all();
		ret = 0;
		for_each_node_state(node, N_HIGH_MEMORY) {
		......
		cond_resched();
		/* Need to drain all cached "usage" before we check counter */
		if (!ret)
			drain_all_stock_sync();
		if (ret == -EBUSY)
			cond_resched();
	} while (mem->res.usage != 0);
==

Hmm. Anyway, thank you!!

-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
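
As a cross-check of the quoted analysis, the faulting address can be recomputed from the register dump. This is only a sketch: it assumes the faulting instruction is the `decq 0x50(%rax,%rdx,8)` that gdb reports at 0xffffffff810fbc11, with RAX = 0x1e0 (the value the report identifies as @mz) and RDX = 0, as shown in the oops. The names `effective_address()` and `check_oops_values()` are hypothetical helpers written for this illustration; they are not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Effective address of an x86-64 memory operand written in AT&T syntax
 * as disp(base,index,scale): base + index * scale + disp.
 */
uint64_t effective_address(uint64_t base, uint64_t index,
			   uint64_t scale, uint64_t disp)
{
	return base + index * scale + disp;
}

/* Recompute the fault address from the register values in the quoted oops. */
void check_oops_values(void)
{
	const uint64_t rax = 0x1e0;	/* RAX in the oops, reported as @mz */
	const uint64_t rdx = 0x0;	/* RDX in the oops, the index register */
	const uint64_t cr2 = 0x230;	/* CR2 in the oops, the faulting address */

	/* decq 0x50(%rax,%rdx,8): scale is 8, displacement is 0x50 */
	assert(effective_address(rax, rdx, 8, 0x50) == cr2);
}
```

With those values the operand address is 0x1e0 + 0 * 8 + 0x50 = 0x230, which matches CR2. That is consistent with the decrement at mm/memcontrol.c:683 being applied through a near-NULL @mz rather than through a valid per-zone structure.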