From: "Kirill A. Shutemov"
To: Michal Hocko
Cc: "Mizuma, Masayoshi", Johannes Weiner, Balbir Singh, KAMEZAWA Hiroyuki, cgroups@vger.kernel.org, linux-mm@kvack.org, "Kirill A. Shutemov"
Subject: Re: mm: memcg: An infinite loop in __handle_mm_fault()
Date: Mon, 10 Feb 2014 14:56:55 +0200 (EET)
Message-Id: <20140210125655.4AB48E0090@blue.fi.intel.com>
In-Reply-To: <20140210111928.GA7117@dhcp22.suse.cz>
References: <52F81C5D.6010601@jp.fujitsu.com> <20140210111928.GA7117@dhcp22.suse.cz>

Michal Hocko wrote:
> [CCing Kirill]
> 
> On Mon 10-02-14 09:25:01, Mizuma, Masayoshi wrote:
> > Hi,
> 
> Hi,
> 
> > This is a bug report for a memory cgroup hang.
> > I reproduced it on 3.14-rc1 but could not on 3.7.
> > 
> > When I ran a program (see below) under a memcg limit, the process hung.
> > Using a kprobe trace, I tracked the hang down to __handle_mm_fault().
> > do_huge_pmd_wp_page(), which is called by __handle_mm_fault(), always
> > returns VM_FAULT_OOM, so the code keeps taking the "goto retry" path and
> > the task can't be killed.
> 
> Thanks a lot for this very good report. I would bet the issue is related
> to the THP zero page.
> 
> The retry loop in __handle_mm_fault for VM_FAULT_OOM from do_huge_pmd_wp_page
> expects that the pmd has been marked for splitting so that it can break out
> and retry the fault. That is not the case for the THP zero page, though.
> do_huge_pmd_wp_page checks is_huge_zero_pmd and goes on to allocate a new
> huge page, which succeeds in your case because you are hitting the memcg
> limit, not global memory pressure. But then the new page is charged with
> mem_cgroup_newpage_charge, which fails. Normally the existing page would
> then be split and we would return VM_FAULT_OOM, but page is not set on that
> path, because page = pmd_page(orig_pmd) is only done after the
> is_huge_zero_pmd check.
> 
> I am not very familiar with the THP zero page code, but I guess splitting
> such a zero page is not the way to go. Instead, we should simply drop the
> zero page and retry the fault. I would assume that one of
> do_huge_pmd_wp_zero_page_fallback or do_huge_pmd_wp_page_fallback should
> do the trick, but both of them try to charge the new page(s) before the
> current zero page is uncharged. That makes them prone to the same issue,
> AFAICS.
> 
> But maybe Kirill has a better idea.

Your analysis looks accurate, although I was not able to reproduce the hang.
The problem with do_huge_pmd_wp_zero_page_fallback() is that it can return
VM_FAULT_OOM if it fails to allocate a new *small* page, but in that case it
is a real OOM. The untested patch below tries to fix the issue. Masayoshi,
could you test it?
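Since the reporter's program is not included in this message, here is a
minimal, hypothetical reproducer sketch for the scenario analysed above (not
the original test case). It assumes THP is available and that the process has
been moved into a memcg whose limit is smaller than the mapping; the read
pass populates the range with the huge zero page, and the write pass then
forces the do_huge_pmd_wp_page() copy-on-write path while the memcg charge
for a new huge page fails:

/*
 * Hypothetical reproducer sketch, run inside a small memcg, e.g.:
 *
 *   mkdir /sys/fs/cgroup/memory/test
 *   echo 10M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
 *   echo $$ > /sys/fs/cgroup/memory/test/tasks
 */
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

#define MAP_SIZE	(64UL << 20)	/* larger than the memcg limit */

int main(void)
{
	size_t i;
	volatile char c;
	char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Ask for THP explicitly so the test does not depend on the
	 * system-wide transparent_hugepage setting. */
	madvise(p, MAP_SIZE, MADV_HUGEPAGE);

	/* Read faults first: they map the huge zero page into the range. */
	for (i = 0; i < MAP_SIZE; i += 4096)
		c = p[i];

	/* Write faults now COW the huge zero page; once the memcg limit is
	 * reached, the charge for a new huge page fails and, with the bug,
	 * the fault loops forever in __handle_mm_fault(). */
	for (i = 0; i < MAP_SIZE; i += 4096)
		p[i] = 1;

	(void)c;
	return 0;
}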
BTW, Michal, I've triggered a sleep-in-atomic bug in mem_cgroup_print_oom_info():

[    2.386563] Task in /test killed as a result of limit of /test
[    2.387326] memory: usage 10240kB, limit 10240kB, failcnt 51
[    2.388098] memory+swap: usage 10240kB, limit 10240kB, failcnt 0
[    2.388861] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
[    2.389640] Memory cgroup stats for /test:
[    2.390178] BUG: sleeping function called from invalid context at /home/space/kas/git/public/linux/kernel/cpu.c:68
[    2.391516] in_atomic(): 1, irqs_disabled(): 0, pid: 66, name: memcg_test
[    2.392416] 2 locks held by memcg_test/66:
[    2.392945]  #0:  (memcg_oom_lock#2){+.+...}, at: [] pagefault_out_of_memory+0x14/0x90
[    2.394233]  #1:  (oom_info_lock){+.+...}, at: [] mem_cgroup_print_oom_info+0x2a/0x390
[    2.395496] CPU: 2 PID: 66 Comm: memcg_test Not tainted 3.14.0-rc1-dirty #745
[    2.396536] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
[    2.397540]  ffffffff81a3cc90 ffff88081d26dba0 ffffffff81776ea3 0000000000000000
[    2.398541]  ffff88081d26dbc8 ffffffff8108418a 0000000000000000 ffff88081d15c000
[    2.399533]  0000000000000000 ffff88081d26dbd8 ffffffff8104f6bc ffff88081d26dc10
[    2.400588] Call Trace:
[    2.400908]  [] dump_stack+0x4d/0x66
[    2.401578]  [] __might_sleep+0x16a/0x210
[    2.402295]  [] get_online_cpus+0x1c/0x60
[    2.403005]  [] mem_cgroup_read_stat+0x27/0xb0
[    2.403769]  [] mem_cgroup_print_oom_info+0x260/0x390
[    2.404653]  [] dump_header+0x88/0x251
[    2.405342]  [] ? trace_hardirqs_on+0xd/0x10
[    2.406098]  [] oom_kill_process+0x258/0x3d0
[    2.406833]  [] mem_cgroup_oom_synchronize+0x656/0x6c0
[    2.407674]  [] ? mem_cgroup_charge_common+0xd0/0xd0
[    2.408553]  [] pagefault_out_of_memory+0x14/0x90
[    2.409354]  [] mm_fault_error+0x91/0x189
[    2.410069]  [] __do_page_fault+0x48e/0x580
[    2.410791]  [] ? local_clock+0x16/0x30
[    2.411467]  [] ? trace_hardirqs_on+0xd/0x10
[    2.412248]  [] ? _raw_spin_unlock_irq+0x2c/0x40
[    2.413039]  [] ? finish_task_switch+0x7b/0x100
[    2.413821]  [] ? trace_hardirqs_off_thunk+0x3a/0x3c
[    2.414652]  [] do_page_fault+0xe/0x10
[    2.415330]  [] page_fault+0x22/0x30

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf974e1..974eb9eea2c0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1166,8 +1166,10 @@ alloc:
 		} else {
 			ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
 					pmd, orig_pmd, page, haddr);
-			if (ret & VM_FAULT_OOM)
+			if (ret & VM_FAULT_OOM) {
 				split_huge_page(page);
+				ret |= VM_FAULT_FALLBACK;
+			}
 			put_page(page);
 		}
 		count_vm_event(THP_FAULT_FALLBACK);
@@ -1179,9 +1181,12 @@ alloc:
 		if (page) {
 			split_huge_page(page);
 			put_page(page);
+			ret |= VM_FAULT_FALLBACK;
+		} else {
+			ret = do_huge_pmd_wp_zero_page_fallback(mm, vma,
+					address, pmd, orig_pmd, haddr);
 		}
 		count_vm_event(THP_FAULT_FALLBACK);
-		ret |= VM_FAULT_OOM;
 		goto out;
 	}
 
diff --git a/mm/memory.c b/mm/memory.c
index be6a0c0d4ae0..3b57b7864667 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3703,7 +3703,6 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
-retry:
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);
 	if (!pud)
@@ -3741,20 +3740,13 @@ retry:
 
 			if (dirty && !pmd_write(orig_pmd)) {
 				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
 							  orig_pmd);
-				/*
-				 * If COW results in an oom, the huge pmd will
-				 * have been split, so retry the fault on the
-				 * pte for a smaller charge.
-				 */
-				if (unlikely(ret & VM_FAULT_OOM))
-					goto retry;
-				return ret;
+				if (!(ret & VM_FAULT_FALLBACK))
+					return ret;
 			} else {
 				huge_pmd_set_accessed(mm, vma, address, pmd,
 						      orig_pmd, dirty);
+				return 0;
 			}
-
-			return 0;
 		}
 	}
-- 
Kirill A. Shutemov
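In summary, the patch above replaces the blind retry loop with an explicit
fallback signal: when the huge copy-on-write cannot be completed,
do_huge_pmd_wp_page() either handles the zero-page case directly via
do_huge_pmd_wp_zero_page_fallback() or splits the existing huge page and
reports VM_FAULT_FALLBACK, and __handle_mm_fault() drops the retry label and
falls through to the regular pte path instead of looping. Any VM_FAULT_OOM
that still escapes therefore reflects a genuine small-page allocation failure
rather than a reason to retry.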