While running workloads that do lots of forking, process exit and page reclaim (AIM7) on large systems, very high system time (100%) and lots of lock contention were observed.

CPU5:
 [] ? _spin_lock+0x27/0x48
 [] ? anon_vma_link+0x2a/0x5a
 [] ? dup_mm+0x242/0x40c
 [] ? copy_process+0xab1/0x12be
 [] ? do_fork+0x151/0x330
 [] ? default_wake_function+0x0/0x36
 [] ? _spin_lock_irqsave+0x2f/0x68
 [] ? stub_clone+0x13/0x20
 [] ? system_call_fastpath+0x16/0x1b

CPU4:
 [] ? _spin_lock+0x29/0x48
 [] ? anon_vma_unlink+0x2a/0x84
 [] ? free_pgtables+0x3c/0xe1
 [] ? exit_mmap+0xc5/0x110
 [] ? mmput+0x55/0xd9
 [] ? exit_mm+0x109/0x129
 [] ? do_exit+0x1d7/0x712
 [] ? _spin_lock_irqsave+0x2f/0x68
 [] ? do_group_exit+0x86/0xb2
 [] ? sys_exit_group+0x22/0x3e
 [] ? system_call_fastpath+0x16/0x1b

CPU0:
 [] ? _spin_lock+0x29/0x48
 [] ? page_check_address+0x9e/0x16f
 [] ? page_referenced_one+0x53/0x10b
 [] ? page_referenced+0xcd/0x167
 [] ? shrink_active_list+0x1ed/0x2a3
 [] ? shrink_zone+0xa06/0xa38
 [] ? getnstimeofday+0x64/0xce
 [] ? do_try_to_free_pages+0x1e5/0x362
 [] ? try_to_free_pages+0x7a/0x94
 [] ? isolate_pages_global+0x0/0x242
 [] ? __alloc_pages_nodemask+0x397/0x572
 [] ? __get_free_pages+0x19/0x6e
 [] ? copy_process+0xd1/0x12be
 [] ? avc_has_perm+0x5c/0x84
 [] ? user_path_at+0x65/0xa3
 [] ? do_fork+0x151/0x330
 [] ? check_for_new_grace_period+0x78/0xab
 [] ? stub_clone+0x13/0x20
 [] ? system_call_fastpath+0x16/0x1b

------------------------------------------------------------------------------
   PerfTop:     864 irqs/sec  kernel:99.7% [100000 cycles],  (all, 8 CPUs)
------------------------------------------------------------------------------

             samples    pcnt          RIP          kernel function
             _______   _____   ________________   ________________

             3235.00 - 75.1% - ffffffff814afb21 : _spin_lock
              670.00 - 15.6% - ffffffff81101a33 : page_check_address
              165.00 -  3.8% - ffffffffa01cbc39 : rpc_sleep_on    [sunrpc]
               40.00 -  0.9% - ffffffff81102113 : try_to_unmap_one
               29.00 -  0.7% - ffffffff81101c65 : page_referenced_one
               27.00 -  0.6% - ffffffff81101964 : vma_address
                8.00 -  0.2% - ffffffff8125a5a0 : clear_page_c
                6.00 -  0.1% - ffffffff8125a5f0 : copy_page_c
                6.00 -  0.1% - ffffffff811023ca : try_to_unmap_anon
                5.00 -  0.1% - ffffffff810fb014 : copy_page_range
                5.00 -  0.1% - ffffffff810e4d18 : get_page_from_freelist

The cause was determined to be the unconditional call to page_referenced() for every mapped page encountered in shrink_active_list().  page_referenced() takes the anon_vma->lock and calls page_referenced_one() for each vma.  page_referenced_one() then calls page_check_address(), which takes the pte_lockptr spinlock.  If several CPUs are doing this at the same time, there is heavy contention on the pte_lockptr spinlock while the anon_vma->lock is held.  That in turn causes contention on the anon_vma->lock itself, stalling the fork and exit paths (as seen in the CPU5 and CPU4 traces above) and producing very high system time.

Before the split-LRU patch, shrink_active_list() would only call page_referenced() once reclaim_mapped got set, and reclaim_mapped only got set when the priority had worked its way from 12 all the way down to 7.  This prevented page_referenced() from being called from shrink_active_list() until the system was really struggling to reclaim memory.

One way to prevent this is to change page_check_address() to do a spin_trylock(ptl) when it is called via shrink_active_list(), and simply fail if it cannot get the pte_lockptr spinlock.  shrink_active_list() then treats the page as not referenced, and the anon_vma->lock gets dropped much sooner.

The attached patch does just that.  Thoughts?
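For illustration only (this is not the attached patch, and the try_lock flag name below is made up), the idea in mm/rmap.c would look roughly like this:

#include <linux/mm.h>
#include <linux/rmap.h>

/*
 * Rough sketch: page_check_address() with an extra "try_lock" argument.
 * Callers on the reclaim path (page_referenced_one() when invoked via
 * shrink_active_list()) would pass try_lock=1 so we never spin on the
 * pte lock while the anon_vma->lock is held.
 */
pte_t *page_check_address(struct page *page, struct mm_struct *mm,
			  unsigned long address, spinlock_t **ptlp,
			  int try_lock)
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;
	spinlock_t *ptl;

	pgd = pgd_offset(mm, address);
	if (!pgd_present(*pgd))
		return NULL;

	pud = pud_offset(pgd, address);
	if (!pud_present(*pud))
		return NULL;

	pmd = pmd_offset(pud, address);
	if (!pmd_present(*pmd))
		return NULL;

	pte = pte_offset_map(pmd, address);
	/* Quick check before taking the lock */
	if (!pte_present(*pte)) {
		pte_unmap(pte);
		return NULL;
	}

	ptl = pte_lockptr(mm, pmd);
	if (try_lock) {
		/*
		 * Reclaim path: if the pte lock is busy, give up and let
		 * the caller treat the page as not referenced instead of
		 * spinning here with the anon_vma->lock held.
		 */
		if (!spin_trylock(ptl)) {
			pte_unmap(pte);
			return NULL;
		}
	} else {
		spin_lock(ptl);
	}

	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
		*ptlp = ptl;
		return pte;
	}

	pte_unmap_unlock(pte, ptl);
	return NULL;
}

The flag would have to be threaded down from shrink_active_list() through page_referenced() and page_referenced_one(), so that only the active-list scan pays the "might miss a reference" cost while the other page_check_address() callers (e.g. try_to_unmap_one()) keep blocking on the pte lock as they do today.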