On 11/06/2012 05:14 PM, Mel Gorman wrote:
> There are currently two competing approaches to implement support for
> automatically migrating pages to optimise NUMA locality. Performance
> results are available for both, but review highlighted different
> problems in each. They are not compatible with each other even though
> some fundamental mechanics should have been the same.
>
> For example, schednuma implements many of its optimisations before the
> code that benefits most from those optimisations is introduced,
> obscuring what the cost of schednuma might be and whether the
> optimisations can be used elsewhere, independent of the series. It
> also effectively hard-codes PROT_NONE to be the hinting fault even
> though that should be an architecture-specific decision. On the other
> hand, it is well integrated and implements all its work in the context
> of the process that benefits from the migration.
>
> autonuma goes straight to kernel threads for marking PTEs pte_numa to
> capture the statistics it depends on. This obscures the cost of
> autonuma in a manner that is difficult to measure and hard to retrofit
> into the context of the process. Some of these costs are in paths the
> scheduler folk are traditionally very wary of making heavier,
> particularly if that cost is difficult to measure. On the other hand,
> performance tests indicate it is the best performing solution.
>
> As the patch sets do not share any code, it is difficult to
> incrementally develop one to take advantage of the strengths of the
> other. Many of the patches would be code churn that is annoying to
> review, and fairly measuring the results would be problematic.
>
> This series addresses part of the integration and sharing problem by
> implementing a foundation that either the schednuma or the autonuma
> policy can be rebased on. The actual policy it implements is a very
> stupid greedy policy called "Migrate On Reference Of pte_numa Node
> (MORON)". While stupid, it can be faster than the vanilla kernel and
> the expectation is that any clever policy should be able to beat
> MORON. The advantage is that it still defines how a policy needs to
> hook into the core code -- mostly the scheduler and mempolicy -- so
> many optimisations (such as native THP migration) can be shared
> between different policy implementations.
>
> This series steals very heavily from both autonuma and schednuma with
> very little original code. In some cases I removed the signed-off-bys
> because the result was too different. I have noted in the changelog
> where this happened, but the signed-offs can be restored if the
> original authors agree.
>
> Patches 1-3 move some vmstat counters so that migrated pages get
> accounted for. In the past the primary user of migration was
> compaction, but if pages are to migrate for NUMA optimisation then
> the counters need to be generally useful.
>
> Patch 4 defines an arch-specific PTE bit called _PAGE_NUMA that is
> used to trigger faults later in the series. A placement policy is
> expected to use these faults to determine if a page should migrate.
> On x86 the bit is the same as _PAGE_PROTNONE, but other architectures
> may differ.
>
> Patches 5-7 define pte_numa, pmd_numa, pte_mknuma, pte_mknonuma and
> friends [see the sketch after the quoted summary], implement them for
> x86, handle GUP and preserve the _PAGE_NUMA bit across THP splits.
>
> Patch 8 creates the fault handler for p[te|md]_numa PTEs and just
> clears them again.
>
> Patches 9-11 add a migrate-on-fault mode that applications can
> specifically ask for.
> Applications can take advantage of this if they wish. It also means
> that if automatic balancing were broken for some workload, the
> application could disable the automatic behaviour but still get some
> advantage.
>
> Patch 12 adds migrate_misplaced_page, which is responsible for
> migrating a page to a new location.
>
> Patch 13 migrates the page on fault if mpol_misplaced() says to do so.
>
> Patch 14 adds an MPOL_MF_LAZY mempolicy flag that an interested
> application can use. On the next reference, the memory should be
> migrated to the node that references it.
>
> Patch 15 sets pte_numa within the context of the scheduler.
>
> Patch 16 adds some vmstats that can be used to approximate the cost
> of the scheduling policy in a more fine-grained fashion than looking
> at the system CPU usage.
>
> Patch 17 implements the MORON policy.
>
> Patches 18-19 note that the marking of pte_numa has a number of
> disadvantages and instead incrementally update a limited range of the
> address space each tick.
>
> The obvious next step is to rebase a proper placement policy on top
> of this foundation and compare it to MORON (or any other placement
> policy). It should be possible to share optimisations between
> different policies to allow meaningful comparisons.
>
> For now, I am going to compare this patchset with the most recent
> postings of schednuma and autonuma just to get a feeling for where it
> stands. I only ran the autonuma benchmark and specjbb tests.
>
> The baseline kernel has stat patches 1-3 applied.

Hello Mel,
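before the problem report itself, one aside to check my understanding of patches 4-7: on x86 I imagine the new helpers look roughly like the sketch below. This is my own reconstruction from the summary above, not the patch code, assuming _PAGE_NUMA aliases _PAGE_PROTNONE as the cover letter says. The key point is that pte_mknuma() clears _PAGE_PRESENT, so the next access to the page traps into the hinting fault handler from patch 8:

/* Sketch only: assumes _PAGE_NUMA shares a bit with _PAGE_PROTNONE on x86 */
#define _PAGE_NUMA	_PAGE_PROTNONE

/* A pte is "NUMA" when the bit is set while the pte is not present */
static inline int pte_numa(pte_t pte)
{
	return (pte_flags(pte) &
		(_PAGE_NUMA | _PAGE_PRESENT)) == _PAGE_NUMA;
}

/* Mark a pte for NUMA hinting: clearing _PAGE_PRESENT forces a fault */
static inline pte_t pte_mknuma(pte_t pte)
{
	pte = pte_set_flags(pte, _PAGE_NUMA);
	return pte_clear_flags(pte, _PAGE_PRESENT);
}

/* Undo the marking once the hinting fault has been handled */
static inline pte_t pte_mknonuma(pte_t pte)
{
	pte = pte_clear_flags(pte, _PAGE_NUMA);
	return pte_set_flags(pte, _PAGE_PRESENT | _PAGE_ACCESSED);
}

Now, the problem: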
My 2-node machine hits a kernel panic after applying the patch set (based on kernel 3.7.0-rc4); please review it:

.....
[    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.7.0-rc4+ root=UUID=a557cd78-962e-48a2-b606-c77b3d8d22dd console=ttyS0,115200 console=tty0 ro rd.md=0 rd.lvm=0 rd.dm=0 rd.luks=0 init 3 debug earlyprintk=ttyS0,115200 LANG=en_US.UTF-8
[    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[    0.000000] __ex_table already sorted, skipping sort
[    0.000000] Checking aperture...
[    0.000000] No AGP bridge found
[    0.000000] Memory: 8102020k/10485760k available (6112k kernel code, 2108912k absent, 274828k reserved, 3823k data, 1176k init)
[    0.000000] ------------[ cut here ]------------
[    0.000000] kernel BUG at mm/mempolicy.c:1785!
[    0.000000] invalid opcode: 0000 [#1] SMP
[    0.000000] Modules linked in:
[    0.000000] CPU 0
[    0.000000] Pid: 0, comm: swapper Not tainted 3.7.0-rc4+ #9 IBM IBM System x3400 M3 Server -[7379I08]-/69Y4356
[    0.000000] RIP: 0010:[]  [] policy_zonelist+0x1e/0xa0
[    0.000000] RSP: 0000:ffffffff818afe68  EFLAGS: 00010093
[    0.000000] RAX: 0000000000000000 RBX: ffffffff81cbfe00 RCX: 000000000000049d
[    0.000000] RDX: 0000000000000000 RSI: ffffffff81cbfe00 RDI: 0000000000008000
[    0.000000] RBP: ffffffff818afe78 R08: 203a79726f6d654d R09: 0000000000000179
[    0.000000] R10: 303138203a79726f R11: 30312f6b30323032 R12: 0000000000008000
[    0.000000] R13: 0000000000000000 R14: ffffffff818c1420 R15: ffffffff818c1420
[    0.000000] FS:  0000000000000000(0000) GS:ffff88017bc00000(0000) knlGS:0000000000000000
[    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[    0.000000] CR2: 0000000000000000 CR3: 00000000018b9000 CR4: 00000000000006b0
[    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.000000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    0.000000] Process swapper (pid: 0, threadinfo ffffffff818ae000, task ffffffff818c1420)
[    0.000000] Stack:
[    0.000000]  ffff88027ffbe8c0 ffffffff81cbfe00 ffffffff818afec8 ffffffff81176966
[    0.000000]  0000000000000000 0000000000000030 ffffffff818afef8 0000000000000100
[    0.000000]  ffffffff81a12000 0000000000000000 ffff88027ffbe8c0 000000007b5d69a0
[    0.000000] Call Trace:
[    0.000000]  [] alloc_pages_current+0xa6/0x170
[    0.000000]  [] __get_free_pages+0x14/0x50
[    0.000000]  [] kmem_cache_init+0x53/0x2d2
[    0.000000]  [] start_kernel+0x1e0/0x3c7
[    0.000000]  [] ? repair_env_string+0x5e/0x5e
[    0.000000]  [] x86_64_start_reservations+0x131/0x135
[    0.000000]  [] x86_64_start_kernel+0x100/0x10f
[    0.000000] Code: e4 17 00 48 89 e5 5d c3 0f 1f 44 00 00 e8 cb e2 47 00 55 48 89 e5 53 48 83 ec 08 0f b7 46 04 66 83 f8 01 74 08 66 83 f8 02 74 42 <0f> 0b 89 fb 81 e3 00 00 04 00 f6 46 06 02 75 04 0f bf 56 08 31
[    0.000000] RIP  [] policy_zonelist+0x1e/0xa0
[    0.000000]  RSP
[    0.000000] ---[ end trace ce62cfec816bb3fe ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
......

The config file is attached, and I do not see this issue in mainline. Please let me know if you need further info.

Thanks,
Zhouping
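P.S. A possible pointer for triage: the Code: line above loads a 16-bit field at offset 4 of the policy (0f b7 46 04), compares it against 1 and 2, and then hits ud2 (<0f> 0b). That matches policy_zonelist() switching on policy->mode (MPOL_PREFERRED == 1, MPOL_BIND == 2) and BUG()ing on anything else. Below is my reconstruction of the function from the 3.7-rc4 source, so the exact line number may be slightly off:

static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
	int nd)
{
	switch (policy->mode) {
	case MPOL_PREFERRED:
		if (!(policy->flags & MPOL_F_LOCAL))
			nd = policy->v.preferred_node;
		break;
	case MPOL_BIND:
		/*
		 * Honour __GFP_THISNODE by falling back to the first
		 * allowed node if the requested node is not in the
		 * policy's nodemask.
		 */
		if (unlikely(gfp & __GFP_THISNODE) &&
				unlikely(!node_isset(nd, policy->v.nodes)))
			nd = first_node(policy->v.nodes);
		break;
	default:
		BUG();	/* presumably mm/mempolicy.c:1785 */
	}
	return node_zonelist(nd, gfp);
}

If that reading is right, the very first allocation through alloc_pages_current() during kmem_cache_init() is seeing a policy mode other than MPOL_PREFERRED or MPOL_BIND, which might point at how the series initialises or tags the default policy.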