From: Mark Hills <mark@xwax.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
Subject: Re: ps lockups, cgroup memory reclaim
Date: Thu, 24 Oct 2013 13:39:38 -0400 (EDT)
Message-ID: <1310241253140.26474@wes.ijneb.com>
In-Reply-To: <1309180141270.29932@wes.ijneb.com>
On Wed, 18 Sep 2013, Mark Hills wrote:
> On Tue, 17 Sep 2013, Johannes Weiner wrote:
>
> > On Tue, Sep 17, 2013 at 04:50:42PM +0100, Mark Hills wrote:
> > > I'm investigating intermittent kernel lockups in an HPC environment, with
> > > the RedHat kernel.
> > >
> > > The symptoms are seen as lockups of multiple ps commands, with one
> > > consuming full CPU:
> > >
> > > # ps aux | grep ps
> > > root 19557 68.9 0.0 108100 908 ? D Sep16 1045:37 ps --ppid 1 -o args=
> > > root 19871 0.0 0.0 108100 908 ? D Sep16 0:00 ps --ppid 1 -o args=
> > >
> > > SIGKILL on the busy one causes the other ps processes to run to completion
> > > (TERM has no effect).
> >
> > Any chance you can get to the stack of the non-busy blocked tasks?
> >
> > It would be /proc/19871/stack in this case.
>
> I had to return the machine above to the cluster, but next time I'll log
> this information.
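
(The automated capture I mention below does something along these lines
for each blocked pid; a minimal sketch -- reading /proc/<pid>/stack
needs root and a kernel built with CONFIG_STACKTRACE:)

    #include <stdio.h>

    /* Dump the kernel stack of one task, pid given on the command line. */
    int main(int argc, char *argv[])
    {
        char path[64], line[256];
        FILE *f;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }

        snprintf(path, sizeof path, "/proc/%s/stack", argv[1]);

        f = fopen(path, "r");
        if (!f) {
            perror(path);
            return 1;
        }

        while (fgets(line, sizeof line, f))
            fputs(line, stdout);
        fclose(f);

        return 0;
    }
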
>
> > > In this case I was able to run my own ps to see the process list,
> > > but not always.
> > >
> > > perf shows the locality of the spinning, roughly:
> > >
> > > proc_pid_cmdline
> > > get_user_pages
> > > handle_mm_fault
> > > mem_cgroup_try_charge_swapin
> > > mem_cgroup_reclaim
> > >
> > > There are two entry points; the codepaths taken are better shown by
> > > the attached profile of CPU time.
> >
> > Looks like it's spinning like crazy in shrink_mem_cgroup_zone(). Maybe
> > an LRU counter underflow, maybe endlessly looping on the
> > should_continue_reclaim() compaction condition. But I don't see an
> > obvious connection to why killing the busy task wakes up the blocked
> > one.
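
(To make the underflow theory concrete: the LRU counters are unsigned,
so a count that drops below zero wraps to a huge value, and a "pages
left to scan" style test then never becomes false. A minimal userspace
sketch of the arithmetic, not the actual kernel code:)

    #include <stdio.h>

    int main(void)
    {
        unsigned long lru_count = 0;    /* hypothetical LRU page count */

        lru_count--;                    /* underflow: wraps to ULONG_MAX */

        /* a reclaim-style "anything left to scan?" check now always
         * passes, so the scan loop would spin indefinitely */
        if (lru_count > 0)
            printf("LRU appears to hold %lu pages\n", lru_count);

        return 0;
    }
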
>
> Maybe it's as simple as a lock taken at quite a high level; perhaps even
> a lock held while reading values for /proc.
>
> But no need for me to guess, we'll find out next time from the /proc
> information.
>
> > So yeah, it would be helpful to know what that task is waiting for.
> >
> > > We've had this behaviour since switching to Scientific Linux 6 (based on
> > > RHEL6, like CentOS) at kernel 2.6.32-279.9.1.el6.x86_64.
> > >
> > > The example above is kernel 2.6.32-358.el6.x86_64.
> >
> > Can you test with the debug build? That should trap LRU counter
> > underflows at least.
>
> Ah, excellent -- I did not realise there was a kernel-debug package.
[...]
OK, it has taken me a while, but I now have some automated capture in
place and the equivalent packaged debug kernel:
$ uname -r
2.6.32-358.el6.x86_64.debug
Here is the initial result. The bad news is that there doesn't appear to
be an obvious BUG, nor any alert to a counter underflow.
With this in mind, I noted the times of OOM-killer invocations (the OOM
killer operates on the containers that users run their jobs in).
The most recent is less than 5 minutes before the lockup was detected, and
my detection runs every 5 minutes. I have extracted it below (I can post
the full dmesg elsewhere if anyone is interested).
I assume pid 8099 in the trace is a thread ID, because that pid does not
reside in the cgroup; only PID 7919 does.
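
(If it happens again I can confirm this while the task is alive by
reading the Tgid field of /proc/<pid>/status; a minimal sketch, with
the pid from this log hard-coded:)

    #include <stdio.h>
    #include <string.h>

    /* Print the thread group id of task 8099; for a thread of
     * frenpic.bin we would expect "Tgid: 7919". */
    int main(void)
    {
        char line[128];
        FILE *f = fopen("/proc/8099/status", "r");

        if (!f) {
            perror("/proc/8099/status");
            return 1;
        }

        while (fgets(line, sizeof line, f))
            if (strncmp(line, "Tgid:", 5) == 0)
                fputs(line, stdout);
        fclose(f);

        return 0;
    }
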
Notably, the process requesting the memory is itself the memory hog; the
cgroup contains only the launching shell and the binary itself.
Perhaps an expert can see an immediate bug from the call trace; e.g.
killing a process at an awkward moment, causing some kind of counter
underflow?
I'm unsure where to go next, given there is no obvious BUG here. I think
next week I will look for a reproducible test; perhaps I can simulate the
same out-of-memory condition (a sketch of what I have in mind follows).
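
Something along these lines, run from a shell already placed in a memcg
with memory.limit_in_bytes set below what it allocates (the cgroup
setup and limit are hypothetical; the program just mimics what
frenpic.bin appears to do):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK (16UL * 1024 * 1024)    /* 16 MiB per step */

    /* Allocate and dirty memory until the cgroup limit forces reclaim
     * and, eventually, the OOM killer. */
    int main(void)
    {
        size_t total = 0;

        for (;;) {
            char *p = malloc(CHUNK);

            if (!p)
                break;
            memset(p, 1, CHUNK);    /* touch every page so it is charged */
            total += CHUNK;
            fprintf(stderr, "allocated %zu MiB\n", total >> 20);
        }

        return 0;
    }

Running a few of these alongside a loop of "ps --ppid 1 -o args="
commands might show whether the reclaim spin reproduces.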
Thanks
--
Mark
frenpic.bin invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
frenpic.bin cpuset=manager-7805 mems_allowed=0-1
Pid: 8099, comm: frenpic.bin Not tainted 2.6.32-358.el6.x86_64.debug #1
Call Trace:
[<ffffffff81541dcb>] ? _spin_unlock+0x2b/0x40
[<ffffffff811314f3>] ? dump_header+0x83/0x210
[<ffffffff810b4ffd>] ? trace_hardirqs_on_caller+0x14d/0x190
[<ffffffff810b504d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff811319f2>] ? oom_kill_process+0x82/0x2f0
[<ffffffff811321e2>] ? mem_cgroup_out_of_memory+0x92/0xb0
[<ffffffff8118cff4>] ? mem_cgroup_handle_oom+0x274/0x2a0
[<ffffffff8118a700>] ? memcg_oom_wake_function+0x0/0xa0
[<ffffffff8118d5e8>] ? __mem_cgroup_try_charge+0x5c8/0x6b0
[<ffffffff8118d218>] ? __mem_cgroup_try_charge+0x1f8/0x6b0
[<ffffffff8118eca0>] ? mem_cgroup_charge_common+0x90/0xd0
[<ffffffff8118ed28>] ? mem_cgroup_newpage_charge+0x48/0x50
[<ffffffff8115a05a>] ? handle_pte_fault+0x72a/0xa90
[<ffffffff810a32bd>] ? sched_clock_cpu+0xcd/0x110
[<ffffffff8115a5fa>] ? handle_mm_fault+0x23a/0x310
[<ffffffff81047533>] ? __do_page_fault+0x163/0x4e0
[<ffffffff810b504d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff81541d10>] ? _spin_unlock_irq+0x30/0x40
[<ffffffff8105c43d>] ? finish_task_switch+0x7d/0x110
[<ffffffff8105c408>] ? finish_task_switch+0x48/0x110
[<ffffffff8153e853>] ? thread_return+0x4e/0x7db
[<ffffffff8100bb10>] ? restore_args+0x0/0x30
[<ffffffff815457be>] ? do_page_fault+0x3e/0xa0
[<ffffffff81542905>] ? page_fault+0x25/0x30
Task in /manager-7805 killed as a result of limit of /manager-7805
memory: usage 16381840kB, limit 16384000kB, failcnt 250389
memory+swap: usage 18022400kB, limit 18022400kB, failcnt 1117
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
CPU 2: hi: 0, btch: 1 usd: 0
CPU 3: hi: 0, btch: 1 usd: 0
CPU 4: hi: 0, btch: 1 usd: 0
CPU 5: hi: 0, btch: 1 usd: 0
CPU 6: hi: 0, btch: 1 usd: 0
CPU 7: hi: 0, btch: 1 usd: 0
CPU 8: hi: 0, btch: 1 usd: 0
CPU 9: hi: 0, btch: 1 usd: 0
CPU 10: hi: 0, btch: 1 usd: 0
CPU 11: hi: 0, btch: 1 usd: 0
CPU 12: hi: 0, btch: 1 usd: 0
CPU 13: hi: 0, btch: 1 usd: 0
CPU 14: hi: 0, btch: 1 usd: 0
CPU 15: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
CPU 1: hi: 186, btch: 31 usd: 0
CPU 2: hi: 186, btch: 31 usd: 0
CPU 3: hi: 186, btch: 31 usd: 0
CPU 4: hi: 186, btch: 31 usd: 0
CPU 5: hi: 186, btch: 31 usd: 0
CPU 6: hi: 186, btch: 31 usd: 0
CPU 7: hi: 186, btch: 31 usd: 0
CPU 8: hi: 186, btch: 31 usd: 161
CPU 9: hi: 186, btch: 31 usd: 164
CPU 10: hi: 186, btch: 31 usd: 171
CPU 11: hi: 186, btch: 31 usd: 178
CPU 12: hi: 186, btch: 31 usd: 0
CPU 13: hi: 186, btch: 31 usd: 0
CPU 14: hi: 186, btch: 31 usd: 0
CPU 15: hi: 186, btch: 31 usd: 0
Node 0 Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 175
CPU 1: hi: 186, btch: 31 usd: 120
CPU 2: hi: 186, btch: 31 usd: 77
CPU 3: hi: 186, btch: 31 usd: 0
CPU 4: hi: 186, btch: 31 usd: 0
CPU 5: hi: 186, btch: 31 usd: 0
CPU 6: hi: 186, btch: 31 usd: 89
CPU 7: hi: 186, btch: 31 usd: 2
CPU 8: hi: 186, btch: 31 usd: 39
CPU 9: hi: 186, btch: 31 usd: 127
CPU 10: hi: 186, btch: 31 usd: 9
CPU 11: hi: 186, btch: 31 usd: 170
CPU 12: hi: 186, btch: 31 usd: 0
CPU 13: hi: 186, btch: 31 usd: 0
CPU 14: hi: 186, btch: 31 usd: 121
CPU 15: hi: 186, btch: 31 usd: 0
Node 1 Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 82
CPU 1: hi: 186, btch: 31 usd: 158
CPU 2: hi: 186, btch: 31 usd: 0
CPU 3: hi: 186, btch: 31 usd: 89
CPU 4: hi: 186, btch: 31 usd: 0
CPU 5: hi: 186, btch: 31 usd: 30
CPU 6: hi: 186, btch: 31 usd: 0
CPU 7: hi: 186, btch: 31 usd: 61
CPU 8: hi: 186, btch: 31 usd: 180
CPU 9: hi: 186, btch: 31 usd: 103
CPU 10: hi: 186, btch: 31 usd: 162
CPU 11: hi: 186, btch: 31 usd: 42
CPU 12: hi: 186, btch: 31 usd: 0
CPU 13: hi: 186, btch: 31 usd: 0
CPU 14: hi: 186, btch: 31 usd: 0
CPU 15: hi: 186, btch: 31 usd: 98
active_anon:7716448 inactive_anon:1137356 isolated_anon:128
active_file:2563426 inactive_file:3632112 isolated_file:0
unevictable:0 dirty:0 writeback:1676 unstable:0
free:1033107 slab_reclaimable:222717 slab_unreclaimable:21054
mapped:12050 shmem:18 pagetables:20595 bounce:0
Node 0 DMA free:15736kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15352kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 3179 32216 32216
Node 0 DMA32 free:1395136kB min:4436kB low:5544kB high:6652kB active_anon:780288kB inactive_anon:376980kB active_file:6968kB inactive_file:25896kB unevictable:0kB isolated(anon):512kB isolated(file):0kB present:3255392kB mlocked:0kB dirty:0kB writeback:6700kB mapped:492kB shmem:0kB slab_reclaimable:3356kB slab_unreclaimable:408kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 29037 29037
Node 0 Normal free:694076kB min:40532kB low:50664kB high:60796kB active_anon:14153412kB inactive_anon:1905888kB active_file:5255884kB inactive_file:7325488kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:29734400kB mlocked:0kB dirty:0kB writeback:0kB mapped:20016kB shmem:68kB slab_reclaimable:533876kB slab_unreclaimable:42860kB kernel_stack:2840kB pagetables:34016kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 1 Normal free:2027480kB min:45116kB low:56392kB high:67672kB active_anon:15932092kB inactive_anon:2266556kB active_file:4990852kB inactive_file:7177064kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:33095680kB mlocked:0kB dirty:0kB writeback:4kB mapped:27692kB shmem:4kB slab_reclaimable:353636kB slab_unreclaimable:40948kB kernel_stack:528kB pagetables:48364kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 2*4kB 2*8kB 2*16kB 2*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15736kB
Node 0 DMA32: 676*4kB 716*8kB 493*16kB 482*32kB 331*64kB 314*128kB 194*256kB 74*512kB 90*1024kB 32*2048kB 258*4096kB = 1395136kB
Node 0 Normal: 409*4kB 355*8kB 360*16kB 90*32kB 48*64kB 1070*128kB 951*256kB 307*512kB 137*1024kB 0*2048kB 0*4096kB = 694076kB
Node 1 Normal: 2112*4kB 480*8kB 251*16kB 65*32kB 40*64kB 18*128kB 1522*256kB 1826*512kB 664*1024kB 0*2048kB 0*4096kB = 2027728kB
6319722 total pagecache pages
124333 pages in swap cache
Swap cache stats: add 8249007, delete 8124674, find 737685/862651
Free swap = 29529892kB
Total swap = 33013752kB
16777215 pages RAM
322632 pages reserved
176578 pages shared
15249459 pages non-shared
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 7806] 243986 7806 28804 507 9 0 0 tcsh
[ 7919] 243986 7919 5133045 4096768 9 0 0 frenpic.bin
Memory cgroup out of memory: Kill process 7919 (frenpic.bin) score 1000 or sacrifice child
Killed process 7919, UID 243986, (frenpic.bin) total-vm:20532180kB, anon-rss:16373904kB, file-rss:13168kB