On 11/06/2012 05:14 PM, Mel Gorman wrote:
> There are currently two competing approaches to implement support for
> automatically migrating pages to optimise NUMA locality. Performance
> results are available for both, but review highlighted different
> problems in each. They are not compatible with each other even though
> some fundamental mechanics should have been the same.
>
> For example, schednuma implements many of its optimisations before the
> code that benefits most from those optimisations is introduced,
> obscuring what the cost of schednuma might be and whether the
> optimisations can be used elsewhere, independent of the series. It
> also effectively hard-codes PROT_NONE to be the hinting fault even
> though that should be an architecture-specific decision. On the other
> hand, it is well integrated and implements all its work in the context
> of the process that benefits from the migration.
>
> autonuma goes straight to kernel threads for marking PTEs pte_numa to
> capture the statistics it depends on. This obscures the cost of
> autonuma in a manner that is difficult to measure and hard to retrofit
> into the context of the process. Some of these costs are in paths the
> scheduler folk are traditionally very wary of making heavier,
> particularly if that cost is difficult to measure. On the other hand,
> performance tests indicate it is the best performing solution.
>
> As the patch sets do not share any code, it is difficult to
> incrementally develop one to take advantage of the strengths of the
> other. Many of the patches would be code churn that is annoying to
> review, and fairly measuring the results would be problematic.
>
> This series addresses part of the integration and sharing problem by
> implementing a foundation that either the schednuma or the autonuma
> policy can be rebased on. The actual policy it implements is a very
> stupid greedy policy called "Migrate On Reference Of pte_numa Node
> (MORON)". While stupid, it can be faster than the vanilla kernel and
> the expectation is that any clever policy should be able to beat
> MORON. The advantage is that it still defines how a policy needs to
> hook into the core code -- mostly the scheduler and mempolicy -- so
> many optimisations (such as native THP migration) can be shared
> between different policy implementations.
>
> This series steals very heavily from both autonuma and schednuma with
> very little original code. In some cases I removed the signed-off-bys
> because the result was too different. I have noted in the changelog
> where this happened, but the signed-offs can be restored if the
> original authors agree.
>
> Patches 1-3 move some vmstat counters so that migrated pages get
> accounted for. In the past the primary user of migration was
> compaction, but if pages are to migrate for NUMA optimisation then
> the counters need to be generally useful.
>
> Patch 4 defines an arch-specific PTE bit called _PAGE_NUMA that is
> used to trigger faults later in the series. A placement policy is
> expected to use these faults to determine if a page should migrate.
> On x86 the bit is the same as _PAGE_PROTNONE, but other architectures
> may differ.
>
> Patches 5-7 define pte_numa, pmd_numa, pte_mknuma, pte_mknonuma and
> friends [see the sketch after the quoted summary], implement them for
> x86, handle GUP and preserve the _PAGE_NUMA bit across THP splits.
>
> Patch 8 creates the fault handler for p[te|md]_numa PTEs and just
> clears them again.
>
> Patches 9-11 add a migrate-on-fault mode that applications can
> specifically ask for.
> Applications can take advantage of this if they wish. It also means
> that if automatic balancing were broken for some workload, the
> application could disable the automatic behaviour but still get some
> advantage.
>
> Patch 12 adds migrate_misplaced_page, which is responsible for
> migrating a page to a new location.
>
> Patch 13 migrates the page on fault if mpol_misplaced() says to do so.
>
> Patch 14 adds an MPOL_MF_LAZY mempolicy flag that an interested
> application can use. On the next reference, the memory should be
> migrated to the node that references it.
>
> Patch 15 sets pte_numa within the context of the scheduler.
>
> Patch 16 adds some vmstats that can be used to approximate the cost
> of the scheduling policy in a more fine-grained fashion than looking
> at the system CPU usage.
>
> Patch 17 implements the MORON policy.
>
> Patches 18-19 note that the marking of pte_numa has a number of
> disadvantages and instead incrementally update a limited range of the
> address space each tick.
>
> The obvious next step is to rebase a proper placement policy on top
> of this foundation and compare it to MORON (or any other placement
> policy). It should be possible to share optimisations between
> different policies to allow meaningful comparisons.
>
> For now, I am going to compare this patchset with the most recent
> postings of schednuma and autonuma just to get a feeling for where it
> stands. I only ran the autonuma benchmark and specjbb tests.
>
> The baseline kernel has stat patches 1-3 applied.

Hello Mel,
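before the problem report itself, one aside to check my understanding of patches 4-7: on x86 I imagine the new helpers look roughly like the sketch below. This is my own reconstruction from the summary above, not the patch code, assuming _PAGE_NUMA aliases _PAGE_PROTNONE as the cover letter says. The key point is that pte_mknuma() clears _PAGE_PRESENT, so the next access to the page traps into the hinting fault handler from patch 8:

/* Sketch only: assumes _PAGE_NUMA shares a bit with _PAGE_PROTNONE on x86 */
#define _PAGE_NUMA	_PAGE_PROTNONE

/* A pte is "NUMA" when the bit is set while the pte is not present */
static inline int pte_numa(pte_t pte)
{
	return (pte_flags(pte) &
		(_PAGE_NUMA | _PAGE_PRESENT)) == _PAGE_NUMA;
}

/* Mark a pte for NUMA hinting: clearing _PAGE_PRESENT forces a fault */
static inline pte_t pte_mknuma(pte_t pte)
{
	pte = pte_set_flags(pte, _PAGE_NUMA);
	return pte_clear_flags(pte, _PAGE_PRESENT);
}

/* Undo the marking once the hinting fault has been handled */
static inline pte_t pte_mknonuma(pte_t pte)
{
	pte = pte_clear_flags(pte, _PAGE_NUMA);
	return pte_set_flags(pte, _PAGE_PRESENT | _PAGE_ACCESSED);
}

Now, the problem: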
My 2-node machine hits a kernel panic after applying the patch set (based on kernel 3.7.0-rc4); please review it:

.....
[    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.7.0-rc4+ root=UUID=a557cd78-962e-48a2-b606-c77b3d8d22dd console=ttyS0,115200 console=tty0 ro rd.md=0 rd.lvm=0 rd.dm=0 rd.luks=0 init 3 debug earlyprintk=ttyS0,115200 LANG=en_US.UTF-8
[    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[    0.000000] __ex_table already sorted, skipping sort
[    0.000000] Checking aperture...
[    0.000000] No AGP bridge found
[    0.000000] Memory: 8102020k/10485760k available (6112k kernel code, 2108912k absent, 274828k reserved, 3823k data, 1176k init)
[    0.000000] ------------[ cut here ]------------
[    0.000000] kernel BUG at mm/mempolicy.c:1785!
[    0.000000] invalid opcode: 0000 [#1] SMP
[    0.000000] Modules linked in:
[    0.000000] CPU 0
[    0.000000] Pid: 0, comm: swapper Not tainted 3.7.0-rc4+ #9 IBM IBM System x3400 M3 Server -[7379I08]-/69Y4356
[    0.000000] RIP: 0010:[]  [] policy_zonelist+0x1e/0xa0
[    0.000000] RSP: 0000:ffffffff818afe68  EFLAGS: 00010093
[    0.000000] RAX: 0000000000000000 RBX: ffffffff81cbfe00 RCX: 000000000000049d
[    0.000000] RDX: 0000000000000000 RSI: ffffffff81cbfe00 RDI: 0000000000008000
[    0.000000] RBP: ffffffff818afe78 R08: 203a79726f6d654d R09: 0000000000000179
[    0.000000] R10: 303138203a79726f R11: 30312f6b30323032 R12: 0000000000008000
[    0.000000] R13: 0000000000000000 R14: ffffffff818c1420 R15: ffffffff818c1420
[    0.000000] FS:  0000000000000000(0000) GS:ffff88017bc00000(0000) knlGS:0000000000000000
[    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[    0.000000] CR2: 0000000000000000 CR3: 00000000018b9000 CR4: 00000000000006b0
[    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.000000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    0.000000] Process swapper (pid: 0, threadinfo ffffffff818ae000, task ffffffff818c1420)
[    0.000000] Stack:
[    0.000000]  ffff88027ffbe8c0 ffffffff81cbfe00 ffffffff818afec8 ffffffff81176966
[    0.000000]  0000000000000000 0000000000000030 ffffffff818afef8 0000000000000100
[    0.000000]  ffffffff81a12000 0000000000000000 ffff88027ffbe8c0 000000007b5d69a0
[    0.000000] Call Trace:
[    0.000000]  [] alloc_pages_current+0xa6/0x170
[    0.000000]  [] __get_free_pages+0x14/0x50
[    0.000000]  [] kmem_cache_init+0x53/0x2d2
[    0.000000]  [] start_kernel+0x1e0/0x3c7
[    0.000000]  [] ? repair_env_string+0x5e/0x5e
[    0.000000]  [] x86_64_start_reservations+0x131/0x135
[    0.000000]  [] x86_64_start_kernel+0x100/0x10f
[    0.000000] Code: e4 17 00 48 89 e5 5d c3 0f 1f 44 00 00 e8 cb e2 47 00 55 48 89 e5 53 48 83 ec 08 0f b7 46 04 66 83 f8 01 74 08 66 83 f8 02 74 42 <0f> 0b 89 fb 81 e3 00 00 04 00 f6 46 06 02 75 04 0f bf 56 08 31
[    0.000000] RIP  [] policy_zonelist+0x1e/0xa0
[    0.000000]  RSP
[    0.000000] ---[ end trace ce62cfec816bb3fe ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
......

The config file is attached, and I do not see this issue in mainline. Please let me know if you need further info.

Thanks,
Zhouping
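P.S. A possible pointer for triage: the Code: line above loads a 16-bit field at offset 4 of the policy (0f b7 46 04), compares it against 1 and 2, and then hits ud2 (<0f> 0b). That matches policy_zonelist() switching on policy->mode (MPOL_PREFERRED == 1, MPOL_BIND == 2) and BUG()ing on anything else. Below is my reconstruction of the function from the 3.7-rc4 source, so the exact line number may be slightly off:

static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
	int nd)
{
	switch (policy->mode) {
	case MPOL_PREFERRED:
		if (!(policy->flags & MPOL_F_LOCAL))
			nd = policy->v.preferred_node;
		break;
	case MPOL_BIND:
		/*
		 * Honour __GFP_THISNODE by falling back to the first
		 * allowed node if the requested node is not in the
		 * policy's nodemask.
		 */
		if (unlikely(gfp & __GFP_THISNODE) &&
				unlikely(!node_isset(nd, policy->v.nodes)))
			nd = first_node(policy->v.nodes);
		break;
	default:
		BUG();	/* presumably mm/mempolicy.c:1785 */
	}
	return node_zonelist(nd, gfp);
}

If that reading is right, the very first allocation through alloc_pages_current() during kmem_cache_init() is seeing a policy mode other than MPOL_PREFERRED or MPOL_BIND, which might point at how the series initialises or tags the default policy.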