On 05/31/2012 11:08 AM, Andrea Arcangeli wrote:
> Hi,
>
> On Tue, May 29, 2012 at 05:43:09PM +0200, Petr Holasek wrote:
>> Similar problem with __autonuma_migrate_page_remove here.
>>
>> [ 1945.516632] ------------[ cut here ]------------
>> [ 1945.516636] WARNING: at lib/list_debug.c:50 __list_del_entry+0x63/0xd0()
>> [ 1945.516642] Hardware name: ProLiant DL585 G5
>> [ 1945.516651] list_del corruption, ffff88017d68b068->next is LIST_POISON1 (dead000000100100)
>> [ 1945.516682] Modules linked in: ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle lockd ip6t_REJECT sunrpc nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables iptable_nat nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack mperf freq_table kvm_amd kvm pcspkr amd64_edac_mod edac_core serio_raw bnx2 microcode edac_mce_amd shpchp k10temp hpilo ipmi_si ipmi_msghandler hpwdt qla2xxx hpsa ata_generic pata_acpi scsi_transport_fc scsi_tgt cciss pata_amd radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core [last unloaded: scsi_wait_scan]
>> [ 1945.516694] Pid: 150, comm: knuma_migrated0 Tainted: G W 3.4.0aa_alpha+ #3
>> [ 1945.516701] Call Trace:
>> [ 1945.516710] [] warn_slowpath_common+0x7f/0xc0
>> [ 1945.516717] [] warn_slowpath_fmt+0x46/0x50
>> [ 1945.516726] [] __list_del_entry+0x63/0xd0
>> [ 1945.516735] [] list_del+0x11/0x40
>> [ 1945.516743] [] __autonuma_migrate_page_remove+0x48/0x80
>> [ 1945.516746] [] knuma_migrated+0x296/0x8a0
>> [ 1945.516749] [] ? wake_up_bit+0x40/0x40
>> [ 1945.516758] [] ? __autonuma_migrate_page_remove+0x80/0x80
>> [ 1945.516766] [] kthread+0x93/0xa0
>> [ 1945.516780] [] kernel_thread_helper+0x4/0x10
>> [ 1945.516791] [] ? flush_kthread_worker+0x80/0x80
>> [ 1945.516798] [] ? gs_change+0x13/0x13
>> [ 1945.516800] ---[ end trace 7cab294af87bd79f ]---
>
> I didn't manage to reproduce it on my hardware but it seems this was
> caused by the autonuma_migrate_split_huge_page: the tail page list
> linking wasn't surrounded by the compound lock to make list insertion
> and migrate_nid setting atomic like it happens everywhere else (the
> caller holding the lock on the head page wasn't enough to make the
> tails stable too).
>
> I released an AutoNUMA15 branch that includes all pending fixes:
>
> git clone --reference linux -b autonuma15 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

I got this with AutoNUMA14 as well and thought I'd just missed a fix,
but it is now happening with AutoNUMA15 too:

[ 258.911975] BUG: Bad page map in process Xorg pte:80000000c2e4626f pmd:8208b2067
[ 258.911979] addr:00007f495e3fe000 vm_flags:08100073 anon_vma:ffff88081d8bde10 mapping:ffff880821a249f0 index:14
[ 258.911984] vma->vm_ops->fault: filemap_fault+0x0/0x4c0
[ 258.911987] vma->vm_file->f_op->mmap: ext4_file_mmap+0x0/0x60
[ 258.911990] Pid: 4707, comm: Xorg Tainted: G B 3.4.0-09252-gebb5196 #1
[ 258.911992] Call Trace:
[ 258.911997] [] print_bad_pte+0x1d4/0x270
[ 258.912003] [] ? mem_cgroup_count_vm_event+0x1a/0xd0
[ 258.912015] [] vm_normal_page+0x69/0x80
[ 258.912021] [] __pmd_numa_fixup+0x124/0x160
[ 258.912026] [] ? do_page_fault+0xd0/0x570
[ 258.912031] [] handle_mm_fault+0x2b7/0x380
[ 258.912037] [] do_page_fault+0x156/0x570
[ 258.912042] [] ? lockdep_sys_exit+0x28/0x90
[ 258.912052] [] ? lockdep_sys_exit_thunk+0x35/0x67
[ 258.912064] [] page_fault+0x25/0x30
[ 258.912071] BUG: Bad page map in process Xorg pte:80000000c2e4726f pmd:8208b2067
[ 258.912079] addr:00007f495e3ff000 vm_flags:08100073 anon_vma:ffff88081d8bde10 mapping:ffff880821a249f0 index:15
[ 258.912085] vma->vm_ops->fault: filemap_fault+0x0/0x4c0
[ 258.912088] vma->vm_file->f_op->mmap: ext4_file_mmap+0x0/0x60
[ 258.912096] Pid: 4707, comm: Xorg Tainted: G B 3.4.0-09252-gebb5196 #1
[ 258.912098] Call Trace:
[ 258.912107] [] print_bad_pte+0x1d4/0x270
[ 258.912115] [] ? mem_cgroup_count_vm_event+0x1a/0xd0
[ 258.912124] [] vm_normal_page+0x69/0x80
[ 258.912137] [] __pmd_numa_fixup+0x124/0x160
[ 258.912146] [] ? do_page_fault+0xd0/0x570
[ 258.912156] [] handle_mm_fault+0x2b7/0x380
[ 258.912165] [] do_page_fault+0x156/0x570
[ 258.912173] [] ? lockdep_sys_exit+0x28/0x90
[ 258.912180] [] ? lockdep_sys_exit_thunk+0x35/0x67
[ 258.912186] [] page_fault+0x25/0x30

(repeats... a *lot*).

Config file attached. It shows up consistently in Xorg, and likely for
a good reason (see below).

If I turn off cgroups, I still get the problem on fault:

[ 816.631466] BUG: Bad page map in process Xorg pte:80000000c2e3526f pmd:81e57f067
[ 816.631473] addr:00007fa5ae495000 vm_flags:08000075 anon_vma: (null) mapping:ffff88049d3791b0 index:ffffffffffff4
[ 816.631478] vma->vm_ops->fault: filemap_fault+0x0/0x4b0
[ 816.631486] vma->vm_file->f_op->mmap: ext4_file_mmap+0x0/0x60
[ 816.631490] Pid: 4711, comm: Xorg Tainted: G B 3.4.0-09252-gebb5196 #3
[ 816.631496] Call Trace:
[ 816.631501] [] print_bad_pte+0x1d4/0x270
[ 816.631509] [] vm_normal_page+0x69/0x80
[ 816.631519] [] __pmd_numa_fixup+0x124/0x160
[ 816.631529] [] ? do_page_fault+0xd0/0x570
[ 816.631538] [] handle_mm_fault+0x2a7/0x360
[ 816.631548] [] do_page_fault+0x156/0x570
[ 816.631557] [] ? copy_user_generic_string+0x2d/0x40
[ 816.631565] [] ? poll_select_copy_remaining+0x101/0x170
[ 816.631576] [] ? lockdep_sys_exit+0x28/0x90
[ 816.631586] [] ? lockdep_sys_exit_thunk+0x35/0x67
[ 816.631596] [] page_fault+0x25/0x30

From a little printk addition, it looks like this is the following
block in vm_normal_page():

	if (HAVE_PTE_SPECIAL) {
		if (likely(!pte_special(pte)))
			goto check_pfn;
		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
			return NULL;
		if (!is_zero_pfn(pfn)) {
			print_bad_pte(vma, addr, pte, NULL);
		}
		return NULL;
	}

We have a special pte, but a non-zero pfn. Being Xorg, I'm assuming
this is a remap of a kernel page into the user virtual address space,
but that's just a gut instinct.

Since I read the above as "We don't expect to ever take spurious faults
on instantiated special ptes", I would think you'd need to do one of
two things:

- never migrate such special pages, and have __pmd_numa_fixup() skip
  them with an explicit check before the vm_normal_page() call, or
- expand vm_normal_page() to recognize this as a legal case. Since the
  NUMA bit/state is cleared on the pmd/pte prior to this call, you'll
  have to pass the NUMA-ness of the fault down by other means.

Of course... I'm still really ramping up on this kernel, so that could
all be hokum, too. Hopefully it helps.

I can dump the EFI memory map and whatnot to you if you need it, but I
think this is more of an algorithmic issue than a hardware corner case.
The machine has 2 sockets, 6 cores per socket, HT on, and 32GB of RAM
split a little asymmetrically (18GB / 14GB), if it matters.

Thanks,
Don Morris
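
For anyone who wants to poke at the branch quoted above without booting a kernel, here is a minimal user-space sketch of that decision. It is a model, not kernel code: the names `classify()`, `numa_fixup_should_skip()` and `check_reported_values()` are hypothetical, and the constants are assumptions (bit 9 as the x86-64 software bit used for the special pte flag, and the `VM_PFNMAP` / `VM_MIXEDMAP` values as I believe they were defined in kernels of that era). Under those assumptions, the pte and vm_flags values from the reports land on the print_bad_pte() path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, not taken from the report: */
#define PTE_SPECIAL_BIT (1ULL << 9)   /* x86-64 software bit assumed to mean "special" */
#define VM_PFNMAP       0x00000400UL  /* assumed flag value */
#define VM_MIXEDMAP     0x10000000UL  /* assumed flag value */

enum verdict { NORMAL_PAGE, NO_STRUCT_PAGE, BAD_PTE };

bool pte_is_special(uint64_t pte)
{
	return (pte & PTE_SPECIAL_BIT) != 0;
}

/* Model of the HAVE_PTE_SPECIAL branch quoted in the mail. */
enum verdict classify(uint64_t pte, unsigned long vm_flags, bool zero_pfn)
{
	if (!pte_is_special(pte))
		return NORMAL_PAGE;     /* the real code continues at check_pfn */
	if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
		return NO_STRUCT_PAGE;  /* return NULL, silently */
	if (!zero_pfn)
		return BAD_PTE;         /* print_bad_pte(), then return NULL */
	return NO_STRUCT_PAGE;
}

/* Hypothetical guard: skip special ptes before calling vm_normal_page(). */
bool numa_fixup_should_skip(uint64_t pte)
{
	return pte_is_special(pte);
}

/* The pte / vm_flags pairs from the "Bad page map" reports. */
void check_reported_values(void)
{
	assert(classify(0x80000000c2e4626fULL, 0x08100073UL, false) == BAD_PTE);
	assert(classify(0x80000000c2e3526fULL, 0x08000075UL, false) == BAD_PTE);
}
```

With a guard like `numa_fixup_should_skip()` in front of the lookup, the special ptes would never reach the branch that complains, which is the first of the two options suggested above. Whether that is the right fix for AutoNUMA is a separate question.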