Re: [RFC 0/6] the big khugepaged redesign

From: Vlastimil Babka <vbabka@suse.cz>
To: Andres Freund <andres@anarazel.de>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Hugh Dickins <hughd@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Rik van Riel <riel@redhat.com>, Mel Gorman <mgorman@suse.de>,
	Michal Hocko <mhocko@suse.cz>,
	Ebru Akagunduz <ebru.akagunduz@gmail.com>,
	Alex Thorlton <athorlton@sgi.com>,
	David Rientjes <rientjes@google.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Robert Haas <robertmhaas@gmail.com>,
	Josh Berkus <josh@agliodbs.com>
Subject: Re: [RFC 0/6] the big khugepaged redesign
Date: Fri, 06 Mar 2015 08:50:24 +0100
Message-ID: <54F95C40.6040302@suse.cz>
In-Reply-To: <20150306002102.GU30405@awork2.anarazel.de>

On 03/06/2015 01:21 AM, Andres Freund wrote:
> Long mail ahead, sorry for that.

No problem, thanks a lot!

> TL;DR: THP is still noticeable, but not nearly as bad.
> 
> On 2015-03-05 17:30:16 +0100, Vlastimil Babka wrote:
>> That however means the workload is based on hugetlbfs and shouldn't trigger THP
>> page fault activity, which is the aim of this patchset. Some more googling made
>> me recall that last LSF/MM, postgresql people mentioned THP issues and pointed
>> at compaction. See http://lwn.net/Articles/591723/ That's exactly where this
>> patchset should help, but I obviously won't be able to measure this before LSF/MM...
> 
> Just as a reference, this is how some of the more extreme profiles looked
> in the past:
> 
>>     96.50%    postmaster  [kernel.kallsyms]         [k] _spin_lock_irq
>>               |
>>               --- _spin_lock_irq
>>                  |
>>                  |--99.87%-- compact_zone
>>                  |          compact_zone_order
>>                  |          try_to_compact_pages
>>                  |          __alloc_pages_nodemask
>>                  |          alloc_pages_vma
>>                  |          do_huge_pmd_anonymous_page
>>                  |          handle_mm_fault
>>                  |          __do_page_fault
>>                  |          do_page_fault
>>                  |          page_fault
>>                  |          0x631d98
>>                   --0.13%-- [...]
> 
> That specific profile is from a rather old kernel as you probably
> recognize.

Yeah, sounds like synchronous compaction before it was forbidden for THP page
faults...

>> I'm CCing the psql guys from last year's LSF/MM - do you have any insight about
>> psql performance with THPs enabled/disabled on recent kernels, where e.g.
>> compaction is no longer synchronous for THP page faults?
> 
> So, I've managed to get a machine upgraded to 3.19. 4 x E5-4620, 256GB
> RAM.
> 
> First off: It's noticeably harder to trigger problems than it used to
> be. But, I can still trigger various problems that are much worse with
> THP enabled than without.
> 
> There seem to be various different bottlenecks; I can get somewhat
> different profiles.
> 
> In a somewhat artificial workload, that tries to simulate what I've seen
> trigger the problem at a customer, I can quite easily trigger large
> differences between THP=enable and THP=never.  There are two types of
> tasks running, one purely OLTP, another doing somewhat more complex
> statements that require a fair amount of process local memory.
> 
> (ignore the absolute numbers for progress, I just waited for somewhat
> stable results while doing other stuff)
> 
> THP off:
> Task 1 solo:
> progress: 200.0 s, 391442.0 tps, 0.654 ms lat
> progress: 201.0 s, 394816.1 tps, 0.683 ms lat
> progress: 202.0 s, 409722.5 tps, 0.625 ms lat
> progress: 203.0 s, 384794.9 tps, 0.665 ms lat
> 
> combined:
> Task 1:
> progress: 144.0 s, 25430.4 tps, 10.067 ms lat
> progress: 145.0 s, 22260.3 tps, 11.500 ms lat
> progress: 146.0 s, 24089.9 tps, 10.627 ms lat
> progress: 147.0 s, 25888.8 tps, 9.888 ms lat
> 
> Task 2:
> progress: 24.4 s, 30.0 tps, 2134.043 ms lat
> progress: 26.5 s, 29.8 tps, 2150.487 ms lat
> progress: 28.4 s, 29.7 tps, 2151.557 ms lat
> progress: 30.4 s, 28.5 tps, 2245.304 ms lat
> 
> flat profile:
>      6.07%      postgres  postgres            [.] heap_form_minimal_tuple
>      4.36%      postgres  postgres            [.] heap_fill_tuple
>      4.22%      postgres  postgres            [.] ExecStoreMinimalTuple
>      4.11%      postgres  postgres            [.] AllocSetAlloc
>      3.97%      postgres  postgres            [.] advance_aggregates
>      3.94%      postgres  postgres            [.] advance_transition_function
>      3.94%      postgres  postgres            [.] ExecMakeTableFunctionResult
>      3.33%      postgres  postgres            [.] heap_compute_data_size
>      3.30%      postgres  postgres            [.] MemoryContextReset
>      3.28%      postgres  postgres            [.] ExecScan
>      3.04%      postgres  postgres            [.] ExecProject
>      2.96%      postgres  postgres            [.] generate_series_step_int4
>      2.94%      postgres  [kernel.kallsyms]   [k] clear_page_c
> 
> (i.e. most of it postgres, cache miss bound)
> 
> THP on:
> Task 1 solo:
> progress: 140.0 s, 390458.1 tps, 0.656 ms lat
> progress: 141.0 s, 391174.2 tps, 0.654 ms lat
> progress: 142.0 s, 394828.8 tps, 0.648 ms lat
> progress: 143.0 s, 398156.2 tps, 0.643 ms lat
> 
> Task 1:
> progress: 179.0 s, 23963.1 tps, 10.683 ms lat
> progress: 180.0 s, 22712.9 tps, 11.271 ms lat
> progress: 181.0 s, 21211.4 tps, 12.069 ms lat
> progress: 182.0 s, 23207.8 tps, 11.031 ms lat
> 
> Task 2:
> progress: 28.2 s, 19.1 tps, 3349.747 ms lat
> progress: 31.0 s, 19.8 tps, 3230.589 ms lat
> progress: 34.3 s, 21.5 tps, 2979.113 ms lat
> progress: 37.4 s, 20.9 tps, 3055.143 ms lat

So that's 1/3 worse tps for task 2? Not very nice...
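
As a rough check of that estimate, here is a minimal sketch that averages the
Task 2 tps samples from the quoted progress lines and computes the relative
drop. The helper name relative_drop is made up for illustration, and four
samples per run are far too few for anything beyond a sanity check.

```python
from statistics import mean

# Task 2 throughput samples (tps), copied from the quoted progress lines.
thp_off = [30.0, 29.8, 29.7, 28.5]
thp_on = [19.1, 19.8, 21.5, 20.9]


def relative_drop(baseline, measured):
    """Fraction by which mean(measured) falls below mean(baseline)."""
    return 1.0 - mean(measured) / mean(baseline)


drop = relative_drop(thp_off, thp_on)
print(f"mean THP off: {mean(thp_off):.1f} tps")
print(f"mean THP on:  {mean(thp_on):.1f} tps")
print(f"relative drop: {drop:.1%}")
```

The drop comes out a little under one third, which matches the estimate above.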

> flat profile:
>     21.36%      postgres  [kernel.kallsyms]   [k] pageblock_pfn_to_page

Interesting. This function shouldn't be heavyweight, although cache misses are
certainly possible. It's only called once per pageblock, so for it to be this
prominent, the pageblocks are probably marked as unsuitable and compaction just
skips over them uselessly. Compaction doesn't become deferred, since that only
happens for synchronous compaction, and this is probably just doing lots of
asynchronous ones.

I wonder what the /proc/vmstat counters show here for compaction and THP fault
successes...
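
One way to collect those numbers is to snapshot /proc/vmstat before and after
a run and diff the counters. The sketch below is only an illustration: the
function names are hypothetical, and it filters on the "compact_" and "thp_"
name prefixes rather than on a fixed list, because the exact set of counters
(for example compact_stall or thp_fault_alloc) depends on the kernel version.

```python
def parse_vmstat(text, prefixes=("compact_", "thp_")):
    """Return {counter: value} for vmstat lines whose name starts with a prefix."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Each /proc/vmstat line has the form "<name> <unsigned integer>".
        if len(parts) == 2 and parts[0].startswith(prefixes) and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def counter_delta(before, after):
    """Per-counter increase between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


if __name__ == "__main__":
    try:
        with open("/proc/vmstat") as f:
            snapshot = parse_vmstat(f.read())
    except OSError:
        snapshot = {}  # not on Linux, or /proc is not mounted
    for name, value in sorted(snapshot.items()):
        print(f"{name} {value}")
```

Taking one snapshot before starting the workload and one after, then passing
both to counter_delta(), gives the counts attributable to the run itself.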

>      4.93%      postgres  postgres            [.] ExecStoreMinimalTuple
>      4.02%      postgres  postgres            [.] heap_form_minimal_tuple
>      3.55%      postgres  [kernel.kallsyms]   [k] clear_page_c
>      2.85%      postgres  postgres            [.] heap_fill_tuple
>      2.60%      postgres  postgres            [.] ExecMakeTableFunctionResult
>      2.57%      postgres  postgres            [.] AllocSetAlloc
>      2.44%      postgres  postgres            [.] advance_transition_function
>      2.43%      postgres  postgres            [.] generate_series_step_int4
> 
> callgraph:
>     18.23%      postgres  [kernel.kallsyms]   [k] pageblock_pfn_to_page
>                 |
>                 --- pageblock_pfn_to_page
>                    |
>                    |--99.05%-- isolate_migratepages
>                    |          compact_zone
>                    |          compact_zone_order
>                    |          try_to_compact_pages
>                    |          __alloc_pages_direct_compact
>                    |          __alloc_pages_nodemask
>                    |          alloc_pages_vma
>                    |          do_huge_pmd_anonymous_page
>                    |          __handle_mm_fault
>                    |          handle_mm_fault
>                    |          __do_page_fault
>                    |          do_page_fault
>                    |          page_fault
> ....
>                    |
>                     --0.95%-- compact_zone
>                               compact_zone_order
>                               try_to_compact_pages
>                               __alloc_pages_direct_compact
>                               __alloc_pages_nodemask
>                               alloc_pages_vma
>                               do_huge_pmd_anonymous_page
>                               __handle_mm_fault
>                               handle_mm_fault
>                               __do_page_fault
>      4.98%      postgres  postgres            [.] ExecStoreMinimalTuple
>                 |
>      4.20%      postgres  postgres            [.] heap_form_minimal_tuple
>                 |
>      3.69%      postgres  [kernel.kallsyms]   [k] clear_page_c
>                 |
>                 --- clear_page_c
>                    |
>                    |--58.89%-- __do_huge_pmd_anonymous_page
>                    |          do_huge_pmd_anonymous_page
>                    |          __handle_mm_fault
>                    |          handle_mm_fault
>                    |          __do_page_fault
>                    |          do_page_fault
>                    |          page_fault
> 
> As you can see THP on/off makes a noticeable difference, especially for
> Task 2. Compaction suddenly takes a significant amount of time. But:
> It's a relatively gradual slowdown, at pretty extreme concurrency. So
> I'm pretty happy already.
> 
> 
> In the workload tested here most non-shared allocations are short
> lived. So it's not surprising that it's not worth compacting pages. I do
> wonder whether it'd be possible to keep some running statistics about
> THP being worthwhile or not.

My goal was to be more conservative and collapse mostly in khugepaged instead
of page faults. But maybe some running per-thread statistics of hugepage lifetime
could work too...

> This is just one workload, and I saw some different profiles while
> playing around. But I've already invested more time in this today than I
> should have... :)

Again, thanks a lot! If you find some more time, could you please also quickly
check how this workload looks when THP is enabled but page fault compaction is
disabled completely via:

echo never > /sys/kernel/mm/transparent_hugepage/defrag
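
To confirm the setting took effect before rerunning, the same sysfs file can
be read back; the kernel lists the available choices and marks the active one
with square brackets. A minimal sketch follows; the helper name active_choice
is hypothetical, and the set of choices listed differs between kernel
versions.

```python
import re

DEFRAG_PATH = "/sys/kernel/mm/transparent_hugepage/defrag"


def active_choice(contents):
    """Return the bracketed (active) option, e.g. 'never' from 'always madvise [never]'."""
    match = re.search(r"\[([^\]\s]+)\]", contents)
    return match.group(1) if match else None


if __name__ == "__main__":
    try:
        with open(DEFRAG_PATH) as f:
            print("defrag:", active_choice(f.read()))
    except OSError as err:
        # THP may not be built in, or this may not be a Linux system.
        print("cannot read", DEFRAG_PATH, "-", err)
```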

After LSF/MM I might be interested in how to reproduce this locally to use as a
testcase...

> BTW, parallel process exits with large shared mappings isn't
> particularly fun:
> 
>     80.09%      postgres  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>                 |
>                 --- _raw_spin_lock_irqsave
>                    |
>                    |--99.97%-- pagevec_lru_move_fn
>                    |          |
>                    |          |--65.51%-- activate_page

Hm, at first sight it seems odd that page activation would be useful to do when
pages are being unmapped. But I'm not that familiar with this area...

>                    |          |          mark_page_accessed.part.23
>                    |          |          mark_page_accessed
>                    |          |          zap_pte_range
>                    |          |          unmap_page_range
>                    |          |          unmap_single_vma
>                    |          |          unmap_vmas
>                    |          |          exit_mmap
>                    |          |          mmput.part.27
>                    |          |          mmput
>                    |          |          exit_mm
>                    |          |          do_exit
>                    |          |          do_group_exit
>                    |          |          sys_exit_group
>                    |          |          system_call_fastpath
>                    |          |
>                    |           --34.49%-- lru_add_drain_cpu
>                    |                     lru_add_drain
>                    |                     free_pages_and_swap_cache
>                    |                     tlb_flush_mmu_free
>                    |                     zap_pte_range
>                    |                     unmap_page_range
>                    |                     unmap_single_vma
>                    |                     unmap_vmas
>                    |                     exit_mmap
>                    |                     mmput.part.27
>                    |                     mmput
>                    |                     exit_mm
>                    |                     do_exit
>                    |                     do_group_exit
>                    |                     sys_exit_group
>                    |                     system_call_fastpath
>                     --0.03%-- [...]
> 
>      9.75%      postgres  [kernel.kallsyms]  [k] zap_pte_range
>                 |
>                 --- zap_pte_range
>                     unmap_page_range
>                     unmap_single_vma
>                     unmap_vmas
>                     exit_mmap
>                     mmput.part.27
>                     mmput
>                     exit_mm
>                     do_exit
>                     do_group_exit
>                     sys_exit_group
>                     system_call_fastpath
> 
>      1.93%      postgres  [kernel.kallsyms]  [k] release_pages
>                 |
>                 --- release_pages
>                    |
>                    |--77.09%-- free_pages_and_swap_cache
>                    |          tlb_flush_mmu_free
>                    |          zap_pte_range
>                    |          unmap_page_range
>                    |          unmap_single_vma
>                    |          unmap_vmas
>                    |          exit_mmap
>                    |          mmput.part.27
>                    |          mmput
>                    |          exit_mm
>                    |          do_exit
>                    |          do_group_exit
>                    |          sys_exit_group
>                    |          system_call_fastpath
>                    |
>                    |--22.64%-- pagevec_lru_move_fn
>                    |          |
>                    |          |--63.88%-- activate_page
>                    |          |          mark_page_accessed.part.23
>                    |          |          mark_page_accessed
>                    |          |          zap_pte_range
>                    |          |          unmap_page_range
>                    |          |          unmap_single_vma
>                    |          |          unmap_vmas
>                    |          |          exit_mmap
>                    |          |          mmput.part.27
>                    |          |          mmput
>                    |          |          exit_mm
>                    |          |          do_exit
>                    |          |          do_group_exit
>                    |          |          sys_exit_group
>                    |          |          system_call_fastpath
>                    |          |
>                    |           --36.12%-- lru_add_drain_cpu
>                    |                     lru_add_drain
>                    |                     free_pages_and_swap_cache
>                    |                     tlb_flush_mmu_free
>                    |                     zap_pte_range
>                    |                     unmap_page_range
>                    |                     unmap_single_vma
>                    |                     unmap_vmas
>                    |                     exit_mmap
>                    |                     mmput.part.27
>                    |                     mmput
>                    |                     exit_mm
>                    |                     do_exit
>                    |                     do_group_exit
>                    |                     sys_exit_group
>                    |                     system_call_fastpath
>                     --0.27%-- [...]
> 
>      1.91%      postgres  [kernel.kallsyms]  [k] page_remove_file_rmap
>                 |
>                 --- page_remove_file_rmap
>                    |
>                    |--98.18%-- page_remove_rmap
>                    |          zap_pte_range
>                    |          unmap_page_range
>                    |          unmap_single_vma
>                    |          unmap_vmas
>                    |          exit_mmap
>                    |          mmput.part.27
>                    |          mmput
>                    |          exit_mm
>                    |          do_exit
>                    |          do_group_exit
>                    |          sys_exit_group
>                    |          system_call_fastpath
>                    |
>                     --1.82%-- zap_pte_range
>                               unmap_page_range
>                               unmap_single_vma
>                               unmap_vmas
>                               exit_mmap
>                               mmput.part.27
>                               mmput
>                               exit_mm
>                               do_exit
>                               do_group_exit
>                               sys_exit_group
>                               system_call_fastpath
> 
> 
> 
> Greetings,
> 
> Andres Freund
> 
> --
>  Andres Freund	                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
> 


Thread overview: 23+ messages
2015-02-23 12:58 Vlastimil Babka
2015-02-23 12:58 ` [RFC 1/6] mm, thp: stop preallocating hugepages in khugepaged Vlastimil Babka
2015-02-23 12:58 ` [RFC 2/6] mm, thp: make khugepaged check for THP allocability before scanning Vlastimil Babka
2015-02-23 12:58 ` [RFC 3/6] mm, thp: try fault allocations only if we expect them to succeed Vlastimil Babka
2015-02-23 12:58 ` [RFC 4/6] mm, thp: move collapsing from khugepaged to task_work context Vlastimil Babka
2015-02-23 14:25   ` Peter Zijlstra
2015-02-23 12:58 ` [RFC 5/6] mm, thp: wakeup khugepaged when THP allocation fails Vlastimil Babka
2015-02-23 12:58 ` [RFC 6/6] mm, thp: remove no longer needed khugepaged code Vlastimil Babka
2015-02-23 21:03 ` [RFC 0/6] the big khugepaged redesign Andi Kleen
2015-02-23 22:46 ` Davidlohr Bueso
2015-02-23 22:56   ` Andrew Morton
2015-02-23 22:58     ` Sasha Levin
2015-02-24 10:32     ` Vlastimil Babka
2015-02-24 11:24       ` Andrea Arcangeli
2015-02-24 11:45         ` Andrea Arcangeli
2015-02-25 12:42         ` Vlastimil Babka
2015-03-05 16:30       ` Vlastimil Babka
2015-03-05 16:52         ` Andres Freund
2015-03-05 17:01           ` Vlastimil Babka
2015-03-05 17:07             ` Andres Freund
2015-03-06  0:21         ` Andres Freund
2015-03-06  7:50           ` Vlastimil Babka [this message]
2015-03-09  3:17   ` Vlastimil Babka