On Fri, May 15, 2009 at 02:53:27PM -0400, starlight@binnacle.cx wrote:
> Here's another possible clue:
>
> I tried the first 'tcbm' testcase on a 2.6.27.7
> kernel that was hanging around from a few months
> ago and it breaks it 100% of the time.
>
> Completely hoses huge memory. Enough "bad pmd"
> errors to fill the kernel log.
>

So I investigated what's wrong with 2.6.27.7. The problem is a race
between exec() and the handling of mlock()ed VMAs, but I can't see
where. The normal teardown of pages is applied to a shared memory
segment as if VM_HUGETLB was not set.

This was fixed between 2.6.27 and 2.6.28, but apparently by accident
during the introduction of CONFIG_UNEVICTABLE_LRU. That patchset made
a number of changes to how mlock()ed VMAs are handled, but I didn't
spot which change was the one that fixed the problem, and reverse
bisecting didn't help. I've added two people who were working on the
unevictable LRU patches to see if they spot something.

For context, the two attached files are used to reproduce a problem
where "bad pmd" messages are scribbled all over the console on
2.6.27.7. Do something like

echo 64 > /proc/sys/vm/nr_hugepages
mount -t hugetlbfs none /mnt
sh ./test-tcbm.sh

I did confirm that it doesn't matter to 2.6.29.1 whether
CONFIG_UNEVICTABLE_LRU is set or not. It's possible the race is still
there, but I don't know where it is.

Any ideas where the race might be?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab