From: Johannes Weiner <hannes@cmpxchg.org>
To: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>,
bugzilla-daemon@bugzilla.kernel.org, linux-mm@kvack.org,
Paul Furtado <paulfurtado91@gmail.com>
Subject: Re: [Bug 80881] New: Memory cgroup OOM leads to BUG: unable to handle kernel paging request at ffffffffffffffd8
Date: Thu, 24 Jul 2014 08:34:56 -0400 [thread overview]
Message-ID: <20140724123456.GJ1725@cmpxchg.org> (raw)
In-Reply-To: <20140724120959.GC14578@dhcp22.suse.cz>
Hi Michal,
On Thu, Jul 24, 2014 at 02:09:59PM +0200, Michal Hocko wrote:
> On Tue 22-07-14 13:07:41, Andrew Morton wrote:
> [...]
> > > The full log is attached, but here is the part I believe is relevant from the
> > > 3.16.0-rc5 error:
> > > [162005.262545] memory: usage 131072kB, limit 131072kB, failcnt 1314
> > > [162005.262550] memory+swap: usage 0kB, limit 18014398509481983kB, failcnt 0
> > > [162005.262554] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
> > > [162005.262558] Memory cgroup stats for
> > > /mesos/c206ce2a-9f11-4340-a3c9-c59b405690a7: cache:8KB rss:131064KB
> > > rss_huge:0KB mapped_file:0KB writeback:0KB inactive_anon:0KB
> > > active_anon:131064KB inactive_file:0KB active_file:0KB unevictable:0KB
> > > [162005.262581] [ pid ] uid tgid total_vm rss nr_ptes swapents
> > > oom_score_adj name
> > > [162005.262602] [ 3002] 0 3002 544153 22244 151 0
> > > 0 java7
> > > [162005.262609] [ 3061] 0 3061 424397 20423 88 0
> > > 0 java
> > > [162005.262615] Memory cgroup out of memory: Kill process 3002 (java7) score
> > > 662 or sacrifice child
> > > [162005.262623] Killed process 3002 (java7) total-vm:2176612kB,
> > > anon-rss:60400kB, file-rss:28576kB
>
> Nothing unusual here.
>
> [fixed up line wraps]
> > [162005.263453] general protection fault: 0000 [#1] SMP
> > [162005.263463] Modules linked in: ipv6 dm_mod xen_netfront coretemp hwmon x86_pkg_temp_thermal crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 microcode pcspkr ext4 jbd2 mbcache raid0 xen_blkfront
> > [162005.264060] CPU: 3 PID: 3062 Comm: java Not tainted 3.16.0-rc5 #1
> > [162005.264060] task: ffff8801cfe8f170 ti: ffff8801d2ec4000 task.ti: ffff8801d2ec4000
> > [162005.264060] RIP: e030:[<ffffffff811c0b80>] [<ffffffff811c0b80>] mem_cgroup_oom_synchronize+0x140/0x240
> > [162005.264060] RSP: e02b:ffff8801d2ec7d48 EFLAGS: 00010283
> > [162005.264060] RAX: 0000000000000001 RBX: ffff88009d633800 RCX: 000000000000000e
> > [162005.264060] RDX: fffffffffffffffe RSI: ffff88009d630200 RDI: ffff88009d630200
> > [162005.264060] RBP: ffff8801d2ec7da8 R08: 0000000000000012 R09: 00000000fffffffe
> > [162005.264060] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88009d633800
> > [162005.264060] R13: ffff8801d2ec7d48 R14: dead000000100100 R15: ffff88009d633a30
> > [162005.264060] FS: 00007f1748bb4700(0000) GS:ffff8801def80000(0000) knlGS:0000000000000000
> > [162005.264060] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [162005.264060] CR2: 00007f4110300308 CR3: 00000000c05f7000 CR4: 0000000000002660
> > [162005.264060] Stack:
> > [162005.264060] ffff88009d633800 0000000000000000 ffff8801cfe8f170 ffffffff811bae10
> > [162005.264060] ffffffff81ca73f8 ffffffff81ca73f8 ffff8801d2ec7dc8 0000000000000006
> > [162005.264060] 00000000e3b30000 00000000e3b30000 ffff8801d2ec7f58 0000000000000001
> > [162005.264060] Call Trace:
> > [162005.264060] [<ffffffff811bae10>] ? mem_cgroup_wait_acct_move+0x110/0x110
> > [162005.264060] [<ffffffff81159628>] pagefault_out_of_memory+0x18/0x90
> > [162005.264060] [<ffffffff8105cee9>] mm_fault_error+0xa9/0x1a0
> > [162005.264060] [<ffffffff8105d488>] __do_page_fault+0x478/0x4c0
> > [162005.264060] [<ffffffff81004f00>] ? xen_mc_flush+0xb0/0x1b0
> > [162005.264060] [<ffffffff81003ab3>] ? xen_write_msr_safe+0xa3/0xd0
> > [162005.264060] [<ffffffff81012a40>] ? __switch_to+0x2d0/0x600
> > [162005.264060] [<ffffffff8109e273>] ? finish_task_switch+0x53/0xf0
> > [162005.264060] [<ffffffff81643b0a>] ? __schedule+0x37a/0x6d0
> > [162005.264060] [<ffffffff8105d5dc>] do_page_fault+0x2c/0x40
> > [162005.264060] [<ffffffff81649858>] page_fault+0x28/0x30
> > [162005.264060] Code: 44 00 00 48 89 df e8 40 ca ff ff 48 85 c0 49 89 c4 74 35 4c 8b b0 30 02 00 00 4c 8d b8 30 02 00 00 4d 39 fe 74 1b 0f 1f 44 00 00 <49> 8b 7e 10 be 01 00 00 00 e8 42 d2 04 00 4d 8b 36 4d 39 fe 75
> > [162005.264060] RIP [<ffffffff811c0b80>] mem_cgroup_oom_synchronize+0x140/0x240
> > [162005.264060] RSP <ffff8801d2ec7d48>
> > [162005.458051] ---[ end trace 050b00c5503ce96a ]---
>
> This decodes to:
> [162005.264060] Code: 44 00 00 48 89 df e8 40 ca ff ff 48 85 c0 49 89 c4 74 35 4c 8b b0 30 02 00 00 4c 8d b8 30 02 00 00 4d 39 fe 74 1b 0f 1f 44 00 00 <49> 8b 7e 10 be 01 00 00 00 e8 42 d2 04 00 4d 8b 36 4d 39 fe 75
> All code
> ========
> 0: 44 00 00 add %r8b,(%rax)
> 3: 48 89 df mov %rbx,%rdi
> 6: e8 40 ca ff ff callq 0xffffffffffffca4b
> b: 48 85 c0 test %rax,%rax
> e: 49 89 c4 mov %rax,%r12
> 11: 74 35 je 0x48
> 13: 4c 8b b0 30 02 00 00 mov 0x230(%rax),%r14
> 1a: 4c 8d b8 30 02 00 00 lea 0x230(%rax),%r15
> 21: 4d 39 fe cmp %r15,%r14
> 24: 74 1b je 0x41
> 26: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
> 2b:* 49 8b 7e 10 mov 0x10(%r14),%rdi <-- trapping instruction
> 2f: be 01 00 00 00 mov $0x1,%esi
> 34: e8 42 d2 04 00 callq 0x4d27b
> 39: 4d 8b 36 mov (%r14),%r14
> 3c: 4d 39 fe cmp %r15,%r14
> 3f: 75 .byte 0x75
>
> R14 is dead000000100100 which is a poison value. If I am reading the
> code correctly this should be somewhere in mem_cgroup_oom_notify_cb
> where we stumble over event which has been removed from the notify chain.
>
> And indeed there is nothing to protect the oom_notify chain in the oom
> path. {Un}Registration is protected by memcg_oom_lock and that one is
> used in mem_cgroup_oom_trylock but it is taken only locally in that
> function. The issue seems to be introduced by fb2a6fc56be6 (mm: memcg:
> rework and document OOM waiting and wakeup) in 3.12.
>
> The most simplistic fix would be simply using memcg_oom_lock inside
> mem_cgroup_oom_notify_cb, but I cannot say I would like it much. Another
> approach would be using RCU for mem_cgroup_eventfd_list deallocation and
> {un}linking.
Thanks a lot for looking into this. Your analysis makes sense to me.
Would it be better to move mem_cgroup_oom_notify() directly into the
trylock function while the memcg_oom_lock is still held?
> Let's go with simpler route for now as this is not a hot path, though.
> ---
> >From 2c2642dbfb3f7d8c9f20f7793850426daa770078 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Thu, 24 Jul 2014 14:00:39 +0200
> Subject: [PATCH] memcg: oom_notify use-after-free fix
>
> Paul Furtado has reported the following GPF:
> general protection fault: 0000 [#1] SMP
> Modules linked in: ipv6 dm_mod xen_netfront coretemp hwmon x86_pkg_temp_thermal crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 microcode pcspkr ext4 jbd2 mbcache raid0 xen_blkfront
> CPU: 3 PID: 3062 Comm: java Not tainted 3.16.0-rc5 #1
> task: ffff8801cfe8f170 ti: ffff8801d2ec4000 task.ti: ffff8801d2ec4000
> RIP: e030:[<ffffffff811c0b80>] [<ffffffff811c0b80>] mem_cgroup_oom_synchronize+0x140/0x240
> RSP: e02b:ffff8801d2ec7d48 EFLAGS: 00010283
> RAX: 0000000000000001 RBX: ffff88009d633800 RCX: 000000000000000e
> RDX: fffffffffffffffe RSI: ffff88009d630200 RDI: ffff88009d630200
> RBP: ffff8801d2ec7da8 R08: 0000000000000012 R09: 00000000fffffffe
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff88009d633800
> R13: ffff8801d2ec7d48 R14: dead000000100100 R15: ffff88009d633a30
> FS: 00007f1748bb4700(0000) GS:ffff8801def80000(0000) knlGS:0000000000000000
> CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007f4110300308 CR3: 00000000c05f7000 CR4: 0000000000002660
> Stack:
> ffff88009d633800 0000000000000000 ffff8801cfe8f170 ffffffff811bae10
> ffffffff81ca73f8 ffffffff81ca73f8 ffff8801d2ec7dc8 0000000000000006
> 00000000e3b30000 00000000e3b30000 ffff8801d2ec7f58 0000000000000001
> Call Trace:
> [<ffffffff811bae10>] ? mem_cgroup_wait_acct_move+0x110/0x110
> [<ffffffff81159628>] pagefault_out_of_memory+0x18/0x90
> [<ffffffff8105cee9>] mm_fault_error+0xa9/0x1a0
> [<ffffffff8105d488>] __do_page_fault+0x478/0x4c0
> [<ffffffff81004f00>] ? xen_mc_flush+0xb0/0x1b0
> [<ffffffff81003ab3>] ? xen_write_msr_safe+0xa3/0xd0
> [<ffffffff81012a40>] ? __switch_to+0x2d0/0x600
> [<ffffffff8109e273>] ? finish_task_switch+0x53/0xf0
> [<ffffffff81643b0a>] ? __schedule+0x37a/0x6d0
> [<ffffffff8105d5dc>] do_page_fault+0x2c/0x40
> [<ffffffff81649858>] page_fault+0x28/0x30
> Code: 44 00 00 48 89 df e8 40 ca ff ff 48 85 c0 49 89 c4 74 35 4c 8b b0 30 02 00 00 4c 8d b8 30 02 00 00 4d 39 fe 74 1b 0f 1f 44 00 00 <49> 8b 7e 10 be 01 00 00 00 e8 42 d2 04 00 4d 8b 36 4d 39 fe 75
> RIP [<ffffffff811c0b80>] mem_cgroup_oom_synchronize+0x140/0x240
> RSP <ffff8801d2ec7d48>
> ---[ end trace 050b00c5503ce96a ]---
>
> fb2a6fc56be6 (mm: memcg: rework and document OOM waiting and wakeup) has
> moved mem_cgroup_oom_notify outside of memcg_oom_lock assuming it is
> protected by the hierarchical OOM-lock. Although this is true for the
> notification part the protection doesn't cover unregistration of event
> which can happen in parallel now so mem_cgroup_oom_notify can see
> already unlinked and/or freed mem_cgroup_eventfd_list.
>
> Fix this by using memcg_oom_lock also in mem_cgroup_oom_notify.
>
> Reported-by: Paul Furtado <paulfurtado91@gmail.com>
> Fixes: fb2a6fc56be6 (mm: memcg: rework and document OOM waiting and wakeup)
> Cc: stable@vger.kernel.org # 3.12+
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2014-07-24 12:35 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <bug-80881-27@https.bugzilla.kernel.org/>
2014-07-22 20:07 ` Andrew Morton
2014-07-24 12:09 ` Michal Hocko
2014-07-24 12:34 ` Johannes Weiner [this message]
2014-07-24 13:15 ` Michal Hocko
2014-07-25 2:55 ` Paul Furtado
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140724123456.GJ1725@cmpxchg.org \
--to=hannes@cmpxchg.org \
--cc=akpm@linux-foundation.org \
--cc=bugzilla-daemon@bugzilla.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.cz \
--cc=paulfurtado91@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox