On 09/08/2014 01:18 PM, Mel Gorman wrote:
> On Thu, Sep 04, 2014 at 05:04:37AM -0400, Sasha Levin wrote:
>> On 08/29/2014 09:23 PM, Sasha Levin wrote:
>>> On 08/27/2014 11:26 AM, Mel Gorman wrote:
>>>>> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
>>>>> index 281870f..ffea570 100644
>>>>> --- a/include/asm-generic/pgtable.h
>>>>> +++ b/include/asm-generic/pgtable.h
>>>>> @@ -723,6 +723,9 @@ static inline pte_t pte_mknuma(pte_t pte)
>>>>>
>>>>>         VM_BUG_ON(!(val & _PAGE_PRESENT));
>>>>>
>>>>> +       /* debugging only, specific to x86 */
>>>>> +       VM_BUG_ON(val & _PAGE_PROTNONE);
>>>>> +
>>>>>         val &= ~_PAGE_PRESENT;
>>>>>         val |= _PAGE_NUMA;
>>> Triggered again, the first VM_BUG_ON got hit, the second one never did.
>>
>> Okay, this bug has reproduced quite a few times since then that I no longer
>> suspect it's random memory corruption. I'd be happy to try out more debug
>> patches if you have any leads.
>>
>
> The fact the second one doesn't trigger makes me think that this is not
> related to how the helpers are called and is instead relating to timing.
> I tried reproducing this but got nothing after 3 hours. How long does it
> typically take to reproduce in a given run? You mentioned that it takes a
> few weeks to hit but maybe the frequency has changed since. I tried todays
> linux-next kernel but it didn't even boot so next-20140826 to match your
> original report but got nothing. Can you also send me the config you used
> in case that's a factor.

The frequency seems to have changed, I can trigger this 5-10 times a day now.

Config is attached.


Thanks,
Sasha

> I had one hunch that this may somehow be related to a collision between
> pagetable teardown during exit and the scanner but I could not find a
> way that could actually happen. During teardown there should be only one
> user of the mm and it can't race with itself.
>
> A worse possibility is that somehow the lock is getting corrupted but
> that's also a tough sell considering that the locks should be allocated
> from a dedicated cache. I guess I could try breaking that to allocate
> one page per lock so DEBUG_PAGEALLOC triggers but I'm not very
> optimistic.