On Sat, Feb 7, 2015 at 9:56 AM, Pat Erley wrote: > On 02/07/2015 02:42 AM, Sedat Dilek wrote: >> >> On Sat, Feb 7, 2015 at 8:33 AM, Sedat Dilek wrote: >>> >>> On Sat, Feb 7, 2015 at 6:12 AM, Pat Erley wrote: >>>> >>>> I'm seeing the message in $subject on my Xen DOM0 on next-20150204 on >>>> x86_64. I haven't had time to bisect it, but have seen some discussion >>>> on >>>> similar topics here recently. I can trigger this pretty reliably by >>>> watching Netflix. At some point (minutes to hours) into it, the netflix >>>> video goes black (audio keeps going, so it still thinks it's working) >>>> and >>>> the error appears in dmesg. Refreshing the page gets the video going >>>> again, >>>> and it will continue playing for some indeterminate amount of time. >>>> >>>> Kirill, I've CC'd you as looking in the logs, you've patched a false >>>> positive trigger of this very recently(patch in kernel I'm running). Am >>>> I >>>> actually hitting a problem, or is this another false positive case? Any >>>> additional details that might help? >>>> >>>> Dmesg from system attached. >>> >>> >>> [ CC some mm folks ] >>> >>> I have seen this, too. >>> >>> root# grep "BUG: non-zero nr_pmds on freeing mm:" /var/log/kern.log | wc >>> -l >>> 21 >>> >>> Checking my logs: On next-20150203 and next-20150204. >>> >>> I am here not in a VM environment and cannot say what causes these >>> messages. >>> >> >> I checked a bit the logs and commits in mm.git and linux-next.git. >> >> [1] lists: >> >> Kirill A. Shutemov (1): mm: do not use mm->nr_pmds on !MMU configurations >> >> NOTE: next-20150204 has this commit, but next-20150203 not (seen the >> BUG: line in both releases). >> >> Looking at Kirill's commit... >> >> At my 1st quick look I thought Kirill mixed mm_nr_pmds_init() in the >> case of defined(__PAGETABLE_PMD_FOLDED), but I was wrong. >> >> @@ -1440,13 +1440,15 @@ static inline int __pud_alloc(struct mm_struct >> *mm, pgd_t *pgd, >> ... >> #if defined(__PAGETABLE_PMD_FOLDED) || !defined(CONFIG_MMU) >> ... >> static inline void mm_nr_pmds_init(struct mm_struct *mm) >> { >> atomic_long_set(&mm->nr_pmds, 0); >> } >> ... >> #else >> ... >> static inline void mm_nr_pmds_init(struct mm_struct *mm) {} >> ... >> #endif >> >> So, I drop my idea of reverting Kirill's commit. >> >> Pat, not sure how often you build linux-next. >> When doing a daily linux-next testing... Before bisecting I normally >> checked which version of linux-next was the last good and which one >> was the first bad. >> I cannot say which strategy is better. >> But you seem to have a reliable test with watching Netflix. >> >> Regards, >> - Sedat - >> >> [1] >> http://git.kernel.org/cgit/linux/kernel/git/mhocko/mm.git/tag/?id=mmotm-2015-02-03-16-38 >> [2] >> http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=e060ff1f1f00caab403bd208ffa78ed1b7ee0c4d > > > Yeah, I only recently found a patch that lets me boot xen on a recent -next > kernel: > > diff --git a/arch/x86/kernel/cpu/microcode/core.c > b/arch/x86/kernel/cpu/microcode/core.c > index 15c2909..36a8361 100644 > --- a/arch/x86/kernel/cpu/microcode/core.c > +++ b/arch/x86/kernel/cpu/microcode/core.c > @@ -552,7 +552,7 @@ static int __init microcode_init(void) > int error; > > if (paravirt_enabled() || dis_ucode_ldr) > - return 0; > + return -EINVAL; > > if (c->x86_vendor == X86_VENDOR_INTEL) > microcode_ops = init_intel_microcode(); > > that I found on it's way to upstream. The last 'known good' Xen setup for > me was a 3.18.0 rc6 kernel. I only use Xen to experiment with, so I don't > boot every kernel with Xen enabled, only when I'm working on learning it. > So as far as a bisect window goes, that's a pretty large one. I'll wait to > see if anyone else chimes in before attempting the bisect(mostly because > it's 3am here, and they'll all likely have a chance to see this chain of > e-mails before I can get going on the bisect tomorrow). I'll also check to > see if I can trigger it on this kernel without booting in xen. > I have run ltp (20150119) in special the mm testsuite. It produces call-traces here when running OOM tests (oom03, oom04 and oom05). # cd /opt/ltp # cat Version 20150119 root# LC_ALL=C ./runltp -f mm 2>&1 | tee results-ltp_mm-testsuite_$(uname -r).txt 1st snippet in dmesg: ... [ 2808.331428] BUG: non-zero nr_pmds on freeing mm: 17 [ 3283.043499] oom03 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 [ 3283.043505] oom03 cpuset=/ mems_allowed=0 [ 3283.043551] CPU: 2 PID: 14892 Comm: oom03 Not tainted 3.19.0-rc7-next-20150204.14-iniza-small #1 [ 3283.043553] Hardware name: SAMSUNG ELECTRONICS CO., LTD. 530U3BI/530U4BI/530U4BH/530U3BI/530U4BI/530U4BH, BIOS 13XK 03/28/2013 [ 3283.043555] 0000000000000000 ffff88005402fca8 ffffffff817e392d 000000000000000a [ 3283.043559] ffff8800bcf04000 ffff88005402fd38 ffffffff817e1a16 ffff88005402fcd8 [ 3283.043562] ffffffff810d827d 0000000000000206 ffffffff81c6e800 ffff88005402fce8 [ 3283.043565] Call Trace: [ 3283.043571] [] dump_stack+0x4c/0x65 [ 3283.043576] [] dump_header+0x9e/0x261 [ 3283.043580] [] ? trace_hardirqs_on_caller+0x15d/0x200 [ 3283.043583] [] ? trace_hardirqs_on+0xd/0x10 [ 3283.043587] [] oom_kill_process+0x1dc/0x3d0 [ 3283.043590] [] mem_cgroup_oom_synchronize+0x6b8/0x6f0 [ 3283.043594] [] ? mem_cgroup_reset+0xb0/0xb0 [ 3283.043597] [] pagefault_out_of_memory+0x24/0xe0 [ 3283.043600] [] mm_fault_error+0x8d/0x190 [ 3283.043603] [] __do_page_fault+0x4b0/0x4c0 [ 3283.043605] [] do_page_fault+0x31/0x70 [ 3283.043609] [] page_fault+0x28/0x30 [ 3283.043657] Task in /1 killed as a result of limit of /1 [ 3283.043790] memory: usage 1048576kB, limit 1048576kB, failcnt 28578 [ 3283.043792] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 [ 3283.043793] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0 [ 3283.043795] Memory cgroup stats for /1: cache:0KB rss:1048576KB rss_huge:0KB mapped_file:0KB writeback:4316KB inactive_anon:524296KB active_anon:524228KB inactive_file:0KB active_file:0KB unevictable:0KB [ 3283.043867] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name [ 3283.044061] [14891] 0 14891 1618 427 9 3 0 0 oom03 [ 3283.044066] [14892] 0 14892 788050 252632 631 6 65535 0 oom03 [ 3283.044069] Memory cgroup out of memory: Kill process 14892 (oom03) score 943 or sacrifice child [ 3283.044103] Killed process 14892 (oom03) total-vm:3152200kB, anon-rss:1009556kB, file-rss:972kB ... Hope this helps to get the beast. - Sedat -