From: "Mizuma, Masayoshi" <m.mizuma@jp.fujitsu.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
Balbir Singh <bsingharora@gmail.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: mm: memcg: An infinite loop in __handle_mm_fault()
Date: Wed, 12 Feb 2014 10:04:34 +0900
Message-ID: <52FAC8A2.1080607@jp.fujitsu.com>
In-Reply-To: <20140210125655.4AB48E0090@blue.fi.intel.com>

On Mon, 10 Feb 2014 14:56:55 +0200 Kirill A. Shutemov wrote:
> Michal Hocko wrote:
>> [CCing Kirill]
>>
>> On Mon 10-02-14 09:25:01, Mizuma, Masayoshi wrote:
>>> Hi,
>>
>> Hi,
>>
>>> This is a bug report for memory cgroup hang up.
>>> I reproduced this on 3.14-rc1 but could not reproduce it on 3.7.
>>>
>>> When I ran a program (see below) under a memcg limit, the process hung.
>>> Using a kprobe trace, I tracked the hang down to __handle_mm_fault():
>>> do_huge_pmd_wp_page(), which is called from __handle_mm_fault(), always returns
>>> VM_FAULT_OOM, so the code keeps taking the "goto retry" path and the task can't be killed.
>>
>> Thanks a lot for this very good report. I would bet the issue is related
>> to the THP zero page.
>>
>> The __handle_mm_fault retry loop for VM_FAULT_OOM from do_huge_pmd_wp_page
>> expects that the pmd has been marked for splitting so that it can break out
>> and retry the fault. This is not the case for the THP zero page, though.
>> do_huge_pmd_wp_page checks is_huge_zero_pmd and goes on to allocate a new
>> huge page, which succeeds in your case because you are hitting the memcg
>> limit, not global memory pressure. The new page is then charged by
>> mem_cgroup_newpage_charge, which fails. At that point the existing page
>> would be split and VM_FAULT_OOM returned, but page is not initialized in
>> that path because page = pmd_page(orig_pmd) is only called after the
>> is_huge_zero_pmd check, so nothing is actually split and the huge zero
>> pmd stays in place.
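
For reference, if I read mm/huge_memory.c in 3.14-rc1 correctly, the failing
path looks roughly like the sketch below (paraphrased from my reading of the
source, so variable names and exact lines may differ):

	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
		put_page(new_page);
		if (page) {		/* page is NULL for the huge zero page */
			split_huge_page(page);
			put_page(page);
		}
		count_vm_event(THP_FAULT_FALLBACK);
		ret |= VM_FAULT_OOM;	/* the huge zero pmd is left in place ... */
		goto out;		/* ... so __handle_mm_fault() retries it forever */
	}
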
>>
>> I am not very familiar with the THP zero page code, but I guess splitting
>> such a zero page is not the way to go. Instead we should simply drop the
>> zero page and retry the fault. I would assume that one of
>> do_huge_pmd_wp_zero_page_fallback or do_huge_pmd_wp_page_fallback should
>> do the trick, but both of them try to charge new page(s) before the
>> current zero page is uncharged. That makes them prone to the same issue
>> AFAICS.
>>
>> But maybe Kirill has a better idea.
>
> Your analysis looks accurate, although I was not able to reproduce the
> hang myself.
>
> The problem with do_huge_pmd_wp_zero_page_fallback() is that it can return
> VM_FAULT_OOM if it fails to allocate a new *small* page, so that is a real OOM.
>
> The untested patch below tries to fix this. Masayoshi, could you test it?

I applied the patch to 3.14-rc2 and confirmed that the issue no longer
happens: the process is killed by the oom-killer normally.

Thank you for analyzing the root cause and providing the fix!

Regards,
Masayoshi Mizuma
>
> BTW, Michal, I've triggered a sleep-in-atomic bug in
> mem_cgroup_print_oom_info():
>
> [ 2.386563] Task in /test killed as a result of limit of /test
> [ 2.387326] memory: usage 10240kB, limit 10240kB, failcnt 51
> [ 2.388098] memory+swap: usage 10240kB, limit 10240kB, failcnt 0
> [ 2.388861] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
> [ 2.389640] Memory cgroup stats for /test:
> [ 2.390178] BUG: sleeping function called from invalid context at /home/space/kas/git/public/linux/kernel/cpu.c:68
> [ 2.391516] in_atomic(): 1, irqs_disabled(): 0, pid: 66, name: memcg_test
> [ 2.392416] 2 locks held by memcg_test/66:
> [ 2.392945] #0: (memcg_oom_lock#2){+.+...}, at: [<ffffffff81131014>] pagefault_out_of_memory+0x14/0x90
> [ 2.394233] #1: (oom_info_lock){+.+...}, at: [<ffffffff81197b2a>] mem_cgroup_print_oom_info+0x2a/0x390
> [ 2.395496] CPU: 2 PID: 66 Comm: memcg_test Not tainted 3.14.0-rc1-dirty #745
> [ 2.396536] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
> [ 2.397540] ffffffff81a3cc90 ffff88081d26dba0 ffffffff81776ea3 0000000000000000
> [ 2.398541] ffff88081d26dbc8 ffffffff8108418a 0000000000000000 ffff88081d15c000
> [ 2.399533] 0000000000000000 ffff88081d26dbd8 ffffffff8104f6bc ffff88081d26dc10
> [ 2.400588] Call Trace:
> [ 2.400908] [<ffffffff81776ea3>] dump_stack+0x4d/0x66
> [ 2.401578] [<ffffffff8108418a>] __might_sleep+0x16a/0x210
> [ 2.402295] [<ffffffff8104f6bc>] get_online_cpus+0x1c/0x60
> [ 2.403005] [<ffffffff8118fb67>] mem_cgroup_read_stat+0x27/0xb0
> [ 2.403769] [<ffffffff81197d60>] mem_cgroup_print_oom_info+0x260/0x390
> [ 2.404653] [<ffffffff8177314e>] dump_header+0x88/0x251
> [ 2.405342] [<ffffffff810a3bfd>] ? trace_hardirqs_on+0xd/0x10
> [ 2.406098] [<ffffffff81130618>] oom_kill_process+0x258/0x3d0
> [ 2.406833] [<ffffffff81198746>] mem_cgroup_oom_synchronize+0x656/0x6c0
> [ 2.407674] [<ffffffff811973a0>] ? mem_cgroup_charge_common+0xd0/0xd0
> [ 2.408553] [<ffffffff81131014>] pagefault_out_of_memory+0x14/0x90
> [ 2.409354] [<ffffffff817712f7>] mm_fault_error+0x91/0x189
> [ 2.410069] [<ffffffff81783eae>] __do_page_fault+0x48e/0x580
> [ 2.410791] [<ffffffff8108f656>] ? local_clock+0x16/0x30
> [ 2.411467] [<ffffffff810a3bfd>] ? trace_hardirqs_on+0xd/0x10
> [ 2.412248] [<ffffffff8177f6fc>] ? _raw_spin_unlock_irq+0x2c/0x40
> [ 2.413039] [<ffffffff8108312b>] ? finish_task_switch+0x7b/0x100
> [ 2.413821] [<ffffffff813b954a>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [ 2.414652] [<ffffffff81783fae>] do_page_fault+0xe/0x10
> [ 2.415330] [<ffffffff81780552>] page_fault+0x22/0x30
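
FWIW, my understanding of that splat: mem_cgroup_print_oom_info() takes the
oom_info_lock spinlock and then calls mem_cgroup_read_stat(), which calls
get_online_cpus(); get_online_cpus() can take a mutex and therefore sleep.
A rough sketch of the chain (paraphrased, not the exact 3.14-rc1 source):

	/* mm/memcontrol.c, paraphrased */
	static DEFINE_SPINLOCK(oom_info_lock);

	void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
	{
		spin_lock(&oom_info_lock);	/* atomic context from here on */
		/* ... the "Memory cgroup stats" lines sum per-cpu counters ... */
		mem_cgroup_read_stat(memcg, i);	/* -> get_online_cpus() -> might_sleep() */
		spin_unlock(&oom_info_lock);
	}
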
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 82166bf974e1..974eb9eea2c0 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1166,8 +1166,10 @@ alloc:
> } else {
> ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
> pmd, orig_pmd, page, haddr);
> - if (ret & VM_FAULT_OOM)
> + if (ret & VM_FAULT_OOM) {
> split_huge_page(page);
> + ret |= VM_FAULT_FALLBACK;
> + }
> put_page(page);
> }
> count_vm_event(THP_FAULT_FALLBACK);
> @@ -1179,9 +1181,12 @@ alloc:
> if (page) {
> split_huge_page(page);
> put_page(page);
> + ret |= VM_FAULT_FALLBACK;
> + } else {
> + ret = do_huge_pmd_wp_zero_page_fallback(mm, vma,
> + address, pmd, orig_pmd, haddr);
> }
> count_vm_event(THP_FAULT_FALLBACK);
> - ret |= VM_FAULT_OOM;
> goto out;
> }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index be6a0c0d4ae0..3b57b7864667 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3703,7 +3703,6 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> if (unlikely(is_vm_hugetlb_page(vma)))
> return hugetlb_fault(mm, vma, address, flags);
>
> -retry:
> pgd = pgd_offset(mm, address);
> pud = pud_alloc(mm, pgd, address);
> if (!pud)
> @@ -3741,20 +3740,13 @@ retry:
> if (dirty && !pmd_write(orig_pmd)) {
> ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
> orig_pmd);
> - /*
> - * If COW results in an oom, the huge pmd will
> - * have been split, so retry the fault on the
> - * pte for a smaller charge.
> - */
> - if (unlikely(ret & VM_FAULT_OOM))
> - goto retry;
> - return ret;
> + if (!(ret & VM_FAULT_FALLBACK))
> + return ret;
> } else {
> huge_pmd_set_accessed(mm, vma, address, pmd,
> orig_pmd, dirty);
> + return 0;
> }
> -
> - return 0;
> }
> }
>
>
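
If I understand the patch correctly, do_huge_pmd_wp_page() now either handles
the zero-page case itself via do_huge_pmd_wp_zero_page_fallback() or, after
splitting the existing huge page, reports VM_FAULT_FALLBACK so that
__handle_mm_fault() falls through to the regular pte path instead of retrying
the huge fault; either way only a single small page has to be charged.
Roughly (paraphrased sketch of mm/memory.c with the patch applied):

	if (pmd_trans_huge(orig_pmd)) {
		/* ... */
		ret = do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
		if (!(ret & VM_FAULT_FALLBACK))
			return ret;
		/* VM_FAULT_FALLBACK: fall through and handle the
		 * fault with regular ptes below */
	}
	/* ... */
	pte = pte_offset_map(pmd, address);
	return handle_pte_fault(mm, vma, address, pte, pmd, flags);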