From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
To: Michal Hocko <mhocko@suse.cz>
Cc: "Mizuma, Masayoshi" <m.mizuma@jp.fujitsu.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Balbir Singh <bsingharora@gmail.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
cgroups@vger.kernel.org, linux-mm@kvack.org,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: mm: memcg: An infinite loop in __handle_mm_fault()
Date: Mon, 10 Feb 2014 14:56:55 +0200 (EET)
Message-ID: <20140210125655.4AB48E0090@blue.fi.intel.com>
In-Reply-To: <20140210111928.GA7117@dhcp22.suse.cz>
Michal Hocko wrote:
> [CCing Kirill]
>
> On Mon 10-02-14 09:25:01, Mizuma, Masayoshi wrote:
> > Hi,
>
> Hi,
>
> > This is a bug report for a memory cgroup hang.
> > I reproduced this with 3.14-rc1 but couldn't with 3.7.
> >
> > When I ran a program (see below) under a memcg limit, the process hung.
> > Using a kprobe trace, I tracked the hang down to __handle_mm_fault().
> > do_huge_pmd_wp_page(), which is called by __handle_mm_fault(), always returns
> > VM_FAULT_OOM, so the code keeps taking the goto retry path and the task can't be killed.
>
> Thanks a lot for this very good report. I would bet the issue is related
> to the THP zero page.
>
> The __handle_mm_fault retry loop for VM_FAULT_OOM from do_huge_pmd_wp_page
> expects that the huge pmd has been split, so that the retried fault can be
> handled at pte level. This is not the case for the THP zero page, though.
> do_huge_pmd_wp_page checks is_huge_zero_pmd and goes on to allocate a new
> huge page, which succeeds in your case because you are hitting the memcg
> limit, not global memory pressure. The new page is then charged via
> mem_cgroup_newpage_charge, which fails. Normally the existing page would
> be split and VM_FAULT_OOM returned, but in this path page is never
> initialized (it stays NULL), because page = pmd_page(orig_pmd) is only
> reached after the is_huge_zero_pmd check.
>
> I am not very familiar with the THP zero page code, but I guess splitting
> such a zero page is not the way to go. Instead we should simply drop the
> zero page and retry the fault. I would assume that one of
> do_huge_pmd_wp_zero_page_fallback or do_huge_pmd_wp_page_fallback should
> do the trick, but both of them try to charge new page(s) before the
> current zero page is uncharged. That makes them prone to the same issue
> AFAICS.
>
> But maybe Kirill has a better idea.
Your analysis looks accurate, although I was not able to reproduce the
hang here.
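For reference, the 3.14-rc1 path in question looks roughly like this (a
sketch reconstructed around the second hunk below; the surrounding code is
only shown schematically and may not match the source exactly):

	if (is_huge_zero_pmd(orig_pmd))
		goto alloc;			/* page stays NULL for the zero page */
	page = pmd_page(orig_pmd);
	...
alloc:
	/* allocate new_page for the COW copy (succeeds under memcg pressure) */

	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
		put_page(new_page);
		if (page) {			/* NULL in the zero page case: nothing gets split */
			split_huge_page(page);
			put_page(page);
		}
		count_vm_event(THP_FAULT_FALLBACK);
		ret |= VM_FAULT_OOM;		/* caller retries the huge fault forever */
		goto out;
	}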
The problem with do_huge_pmd_wp_zero_page_fallback() is that it can return
VM_FAULT_OOM if it fails to allocate a new *small* page, in which case it is
a real OOM. The untested patch below tries to fix this. Masayoshi, could you
test it?
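In case it is useful for testing, something along these lines should
exercise the zero page COW path under a tight memcg limit (an untested
sketch, not the program from the original report; run it in a memcg with a
small limit, e.g. the 10240kB limit from the log below; pmd alignment of the
mapping is glossed over):

#include <string.h>
#include <sys/mman.h>

#define SIZE	(64UL << 20)	/* well above the memcg limit */

int main(void)
{
	unsigned long i;
	volatile char sink;
	char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/* Read faults first: each pmd-sized 2M range gets the huge zero
	 * page, which is not charged to the memcg. */
	for (i = 0; i < SIZE; i += 2UL << 20)
		sink = p[i];

	/* Write faults now COW the huge zero page; once the memcg charge
	 * for the new huge page fails, 3.14-rc1 spins on VM_FAULT_OOM. */
	memset(p, 0xaa, SIZE);

	return 0;
}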
BTW, Michal, I've triggered a sleep-in-atomic bug in
mem_cgroup_print_oom_info():
[ 2.386563] Task in /test killed as a result of limit of /test
[ 2.387326] memory: usage 10240kB, limit 10240kB, failcnt 51
[ 2.388098] memory+swap: usage 10240kB, limit 10240kB, failcnt 0
[ 2.388861] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
[ 2.389640] Memory cgroup stats for /test:
[ 2.390178] BUG: sleeping function called from invalid context at /home/space/kas/git/public/linux/kernel/cpu.c:68
[ 2.391516] in_atomic(): 1, irqs_disabled(): 0, pid: 66, name: memcg_test
[ 2.392416] 2 locks held by memcg_test/66:
[ 2.392945] #0: (memcg_oom_lock#2){+.+...}, at: [<ffffffff81131014>] pagefault_out_of_memory+0x14/0x90
[ 2.394233] #1: (oom_info_lock){+.+...}, at: [<ffffffff81197b2a>] mem_cgroup_print_oom_info+0x2a/0x390
[ 2.395496] CPU: 2 PID: 66 Comm: memcg_test Not tainted 3.14.0-rc1-dirty #745
[ 2.396536] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
[ 2.397540] ffffffff81a3cc90 ffff88081d26dba0 ffffffff81776ea3 0000000000000000
[ 2.398541] ffff88081d26dbc8 ffffffff8108418a 0000000000000000 ffff88081d15c000
[ 2.399533] 0000000000000000 ffff88081d26dbd8 ffffffff8104f6bc ffff88081d26dc10
[ 2.400588] Call Trace:
[ 2.400908] [<ffffffff81776ea3>] dump_stack+0x4d/0x66
[ 2.401578] [<ffffffff8108418a>] __might_sleep+0x16a/0x210
[ 2.402295] [<ffffffff8104f6bc>] get_online_cpus+0x1c/0x60
[ 2.403005] [<ffffffff8118fb67>] mem_cgroup_read_stat+0x27/0xb0
[ 2.403769] [<ffffffff81197d60>] mem_cgroup_print_oom_info+0x260/0x390
[ 2.404653] [<ffffffff8177314e>] dump_header+0x88/0x251
[ 2.405342] [<ffffffff810a3bfd>] ? trace_hardirqs_on+0xd/0x10
[ 2.406098] [<ffffffff81130618>] oom_kill_process+0x258/0x3d0
[ 2.406833] [<ffffffff81198746>] mem_cgroup_oom_synchronize+0x656/0x6c0
[ 2.407674] [<ffffffff811973a0>] ? mem_cgroup_charge_common+0xd0/0xd0
[ 2.408553] [<ffffffff81131014>] pagefault_out_of_memory+0x14/0x90
[ 2.409354] [<ffffffff817712f7>] mm_fault_error+0x91/0x189
[ 2.410069] [<ffffffff81783eae>] __do_page_fault+0x48e/0x580
[ 2.410791] [<ffffffff8108f656>] ? local_clock+0x16/0x30
[ 2.411467] [<ffffffff810a3bfd>] ? trace_hardirqs_on+0xd/0x10
[ 2.412248] [<ffffffff8177f6fc>] ? _raw_spin_unlock_irq+0x2c/0x40
[ 2.413039] [<ffffffff8108312b>] ? finish_task_switch+0x7b/0x100
[ 2.413821] [<ffffffff813b954a>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 2.414652] [<ffffffff81783fae>] do_page_fault+0xe/0x10
[ 2.415330] [<ffffffff81780552>] page_fault+0x22/0x30
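The pattern behind the splat is roughly the following (a simplified sketch,
not the actual memcg code): mem_cgroup_print_oom_info() takes a spinlock
(and memcg_oom_lock is already held, per the lockdep output), and the
per-cpu stats helper it calls ends up in get_online_cpus(), which takes a
mutex and may sleep:

static DEFINE_SPINLOCK(oom_info_lock);

static void print_oom_info(void)
{
	spin_lock(&oom_info_lock);	/* atomic context from here on */
	/* ... dump memcg usage/limits ... */
	get_online_cpus();		/* grabs cpu_hotplug.lock, a mutex -> may sleep */
	/* ... sum up per-cpu counters for the stats ... */
	put_online_cpus();
	spin_unlock(&oom_info_lock);
}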
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82166bf974e1..974eb9eea2c0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1166,8 +1166,10 @@ alloc:
 		} else {
 			ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
 					pmd, orig_pmd, page, haddr);
-			if (ret & VM_FAULT_OOM)
+			if (ret & VM_FAULT_OOM) {
 				split_huge_page(page);
+				ret |= VM_FAULT_FALLBACK;
+			}
 			put_page(page);
 		}
 		count_vm_event(THP_FAULT_FALLBACK);
@@ -1179,9 +1181,12 @@ alloc:
 		if (page) {
 			split_huge_page(page);
 			put_page(page);
+			ret |= VM_FAULT_FALLBACK;
+		} else {
+			ret = do_huge_pmd_wp_zero_page_fallback(mm, vma,
+					address, pmd, orig_pmd, haddr);
 		}
 		count_vm_event(THP_FAULT_FALLBACK);
-		ret |= VM_FAULT_OOM;
 		goto out;
 	}
 
diff --git a/mm/memory.c b/mm/memory.c
index be6a0c0d4ae0..3b57b7864667 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3703,7 +3703,6 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
 
-retry:
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);
 	if (!pud)
@@ -3741,20 +3740,13 @@ retry:
 			if (dirty && !pmd_write(orig_pmd)) {
 				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
 							  orig_pmd);
-				/*
-				 * If COW results in an oom, the huge pmd will
-				 * have been split, so retry the fault on the
-				 * pte for a smaller charge.
-				 */
-				if (unlikely(ret & VM_FAULT_OOM))
-					goto retry;
-				return ret;
+				if (!(ret & VM_FAULT_FALLBACK))
+					return ret;
 			} else {
 				huge_pmd_set_accessed(mm, vma, address, pmd,
 						      orig_pmd, dirty);
+				return 0;
 			}
-
-			return 0;
 		}
 	}
 
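With both hunks applied, the intended behaviour is that do_huge_pmd_wp_page()
either handles the write fault or reports VM_FAULT_FALLBACK after splitting
the huge page (or falling back from the zero page), and __handle_mm_fault()
then simply falls through to the regular pte path instead of retrying the
huge fault. Roughly (sketch of the resulting flow, not a literal copy of the
patched source):

	if (dirty && !pmd_write(orig_pmd)) {
		ret = do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
		if (!(ret & VM_FAULT_FALLBACK))
			return ret;
		/* huge page was split (or zero page dropped): fall through
		 * and handle the fault with regular ptes below */
	} else {
		huge_pmd_set_accessed(mm, vma, address, pmd, orig_pmd, dirty);
		return 0;
	}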
--
Kirill A. Shutemov