From: Vlastimil Babka <vbabka@suse.cz>
To: Andres Freund <andres@anarazel.de>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Davidlohr Bueso <dave@stgolabs.net>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Hugh Dickins <hughd@google.com>,
Andrea Arcangeli <aarcange@redhat.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Rik van Riel <riel@redhat.com>, Mel Gorman <mgorman@suse.de>,
Michal Hocko <mhocko@suse.cz>,
Ebru Akagunduz <ebru.akagunduz@gmail.com>,
Alex Thorlton <athorlton@sgi.com>,
David Rientjes <rientjes@google.com>,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@kernel.org>,
Robert Haas <robertmhaas@gmail.com>,
Josh Berkus <josh@agliodbs.com>
Subject: Re: [RFC 0/6] the big khugepaged redesign
Date: Fri, 06 Mar 2015 08:50:24 +0100
Message-ID: <54F95C40.6040302@suse.cz>
In-Reply-To: <20150306002102.GU30405@awork2.anarazel.de>
On 03/06/2015 01:21 AM, Andres Freund wrote:
> Long mail ahead, sorry for that.
No problem, thanks a lot!
> TL;DR: THP is still noticeable, but not nearly as bad.
>
> On 2015-03-05 17:30:16 +0100, Vlastimil Babka wrote:
>> That however means the workload is based on hugetlbfs and shouldn't trigger THP
>> page fault activity, which is the aim of this patchset. Some more googling made
>> me recall that last LSF/MM, postgresql people mentioned THP issues and pointed
>> at compaction. See http://lwn.net/Articles/591723/ That's exactly where this
>> patchset should help, but I obviously won't be able to measure this before LSF/MM...
>
> Just as a reference, this is how some of the more extreme profiles looked
> in the past:
>
>> 96.50% postmaster [kernel.kallsyms] [k] _spin_lock_irq
>> |
>> --- _spin_lock_irq
>> |
>> |--99.87%-- compact_zone
>> | compact_zone_order
>> | try_to_compact_pages
>> | __alloc_pages_nodemask
>> | alloc_pages_vma
>> | do_huge_pmd_anonymous_page
>> | handle_mm_fault
>> | __do_page_fault
>> | do_page_fault
>> | page_fault
>> | 0x631d98
>> --0.13%-- [...]
>
> That specific profile is from a rather old kernel as you probably
> recognize.
Yeah, sounds like synchronous compaction before it was forbidden for THP page
faults...
>> I'm CCing the psql guys from last year LSF/MM - do you have any insight about
>> psql performance with THPs enabled/disabled on recent kernels, where e.g.
>> compaction is no longer synchronous for THP page faults?
>
> So, I've managed to get a machine upgraded to 3.19. 4 x E5-4620, 256GB
> RAM.
>
> First off: It's noticeably harder to trigger problems than it used to
> be. But, I can still trigger various problems that are much worse with
> THP enabled than without.
>
> There seem to be various different bottlenecks; I can get somewhat
> different profiles.
>
> In a somewhat artificial workload, that tries to simulate what I've seen
> trigger the problem at a customer, I can quite easily trigger large
> differences between THP=enable and THP=never. There's two types of
> tasks running, one purely OLTP, another doing somewhat more complex
> statements that require a fair amount of process local memory.
>
> (ignore the absolute numbers for progress, I just waited for somewhat
> stable results while doing other stuff)
>
> THP off:
> Task 1 solo:
> progress: 200.0 s, 391442.0 tps, 0.654 ms lat
> progress: 201.0 s, 394816.1 tps, 0.683 ms lat
> progress: 202.0 s, 409722.5 tps, 0.625 ms lat
> progress: 203.0 s, 384794.9 tps, 0.665 ms lat
>
> combined:
> Task 1:
> progress: 144.0 s, 25430.4 tps, 10.067 ms lat
> progress: 145.0 s, 22260.3 tps, 11.500 ms lat
> progress: 146.0 s, 24089.9 tps, 10.627 ms lat
> progress: 147.0 s, 25888.8 tps, 9.888 ms lat
>
> Task 2:
> progress: 24.4 s, 30.0 tps, 2134.043 ms lat
> progress: 26.5 s, 29.8 tps, 2150.487 ms lat
> progress: 28.4 s, 29.7 tps, 2151.557 ms lat
> progress: 30.4 s, 28.5 tps, 2245.304 ms lat
>
> flat profile:
> 6.07% postgres postgres [.] heap_form_minimal_tuple
> 4.36% postgres postgres [.] heap_fill_tuple
> 4.22% postgres postgres [.] ExecStoreMinimalTuple
> 4.11% postgres postgres [.] AllocSetAlloc
> 3.97% postgres postgres [.] advance_aggregates
> 3.94% postgres postgres [.] advance_transition_function
> 3.94% postgres postgres [.] ExecMakeTableFunctionResult
> 3.33% postgres postgres [.] heap_compute_data_size
> 3.30% postgres postgres [.] MemoryContextReset
> 3.28% postgres postgres [.] ExecScan
> 3.04% postgres postgres [.] ExecProject
> 2.96% postgres postgres [.] generate_series_step_int4
> 2.94% postgres [kernel.kallsyms] [k] clear_page_c
>
> (i.e. most of it postgres, cache miss bound)
>
> THP on:
> Task 1 solo:
> progress: 140.0 s, 390458.1 tps, 0.656 ms lat
> progress: 141.0 s, 391174.2 tps, 0.654 ms lat
> progress: 142.0 s, 394828.8 tps, 0.648 ms lat
> progress: 143.0 s, 398156.2 tps, 0.643 ms lat
>
> Task 1:
> progress: 179.0 s, 23963.1 tps, 10.683 ms lat
> progress: 180.0 s, 22712.9 tps, 11.271 ms lat
> progress: 181.0 s, 21211.4 tps, 12.069 ms lat
> progress: 182.0 s, 23207.8 tps, 11.031 ms lat
>
> Task 2:
> progress: 28.2 s, 19.1 tps, 3349.747 ms lat
> progress: 31.0 s, 19.8 tps, 3230.589 ms lat
> progress: 34.3 s, 21.5 tps, 2979.113 ms lat
> progress: 37.4 s, 20.9 tps, 3055.143 ms lat
So that's 1/3 worse tps for task 2? Not very nice...
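For reference, the "1/3" estimate can be checked from the four Task 2 samples quoted for each configuration. A minimal sketch in Python, using only the tps values from the progress lines above (the helper names are just for illustration):

```python
# Task 2 tps samples copied from the quoted progress lines.
thp_off = [30.0, 29.8, 29.7, 28.5]
thp_on = [19.1, 19.8, 21.5, 20.9]


def mean(samples):
    """Arithmetic mean of a list of samples."""
    return sum(samples) / len(samples)


def relative_drop(baseline, measured):
    """Fraction by which the mean of `measured` falls below the mean of `baseline`."""
    return 1.0 - mean(measured) / mean(baseline)


drop = relative_drop(thp_off, thp_on)
print(f"THP off: {mean(thp_off):.1f} tps, THP on: {mean(thp_on):.1f} tps, drop: {drop:.1%}")
```

With only four samples per run this is a rough figure, but it comes out at about 31%, so "1/3 worse" is a fair summary.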
> flat profile:
> 21.36% postgres [kernel.kallsyms] [k] pageblock_pfn_to_page
Interesting. This function shouldn't be heavyweight, although cache misses are
certainly possible. It's only called once per pageblock, so for it to be so
prominent, the pageblocks are probably marked as unsuitable and compaction just
skips over them uselessly. Compaction doesn't become deferred, since that only
happens for synchronous compaction and this is probably just doing lots of
asynchronous ones.
I wonder what the /proc/vmstat counters for compaction and THP fault successes
look like here...
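A minimal sketch of how those counters could be collected, in Python. It assumes the usual "name value" line format of /proc/vmstat and simply filters on the compact_ and thp_ prefixes (e.g. compact_stall, thp_fault_alloc, thp_fault_fallback); the exact set of counters depends on the kernel version, and the function names are made up for illustration.

```python
from pathlib import Path

# Prefixes of the counters of interest; the exact names vary by kernel version.
PREFIXES = ("compact_", "thp_")


def parse_vmstat(text, prefixes=PREFIXES):
    """Return {name: value} for 'name value' lines whose name starts with a prefix."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith(prefixes) and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def diff_counters(before, after):
    """Per-counter increase between two snapshots taken around a benchmark run."""
    return {name: after[name] - before.get(name, 0) for name in after}


if __name__ == "__main__":
    path = Path("/proc/vmstat")
    if path.exists():
        for name, value in sorted(parse_vmstat(path.read_text()).items()):
            print(f"{name} {value}")
```

Taking one snapshot before and one after the combined run and passing both to diff_counters() would show how many compaction attempts and THP fault allocations or fallbacks happened during the run itself.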
> 4.93% postgres postgres [.] ExecStoreMinimalTuple
> 4.02% postgres postgres [.] heap_form_minimal_tuple
> 3.55% postgres [kernel.kallsyms] [k] clear_page_c
> 2.85% postgres postgres [.] heap_fill_tuple
> 2.60% postgres postgres [.] ExecMakeTableFunctionResult
> 2.57% postgres postgres [.] AllocSetAlloc
> 2.44% postgres postgres [.] advance_transition_function
> 2.43% postgres postgres [.] generate_series_step_int4
>
> callgraph:
> 18.23% postgres [kernel.kallsyms] [k] pageblock_pfn_to_page
> |
> --- pageblock_pfn_to_page
> |
> |--99.05%-- isolate_migratepages
> | compact_zone
> | compact_zone_order
> | try_to_compact_pages
> | __alloc_pages_direct_compact
> | __alloc_pages_nodemask
> | alloc_pages_vma
> | do_huge_pmd_anonymous_page
> | __handle_mm_fault
> | handle_mm_fault
> | __do_page_fault
> | do_page_fault
> | page_fault
> ....
> |
> --0.95%-- compact_zone
> compact_zone_order
> try_to_compact_pages
> __alloc_pages_direct_compact
> __alloc_pages_nodemask
> alloc_pages_vma
> do_huge_pmd_anonymous_page
> __handle_mm_fault
> handle_mm_fault
> __do_page_fault
> 4.98% postgres postgres [.] ExecStoreMinimalTuple
> |
> 4.20% postgres postgres [.] heap_form_minimal_tuple
> |
> 3.69% postgres [kernel.kallsyms] [k] clear_page_c
> |
> --- clear_page_c
> |
> |--58.89%-- __do_huge_pmd_anonymous_page
> | do_huge_pmd_anonymous_page
> | __handle_mm_fault
> | handle_mm_fault
> | __do_page_fault
> | do_page_fault
> | page_fault
>
> As you can see THP on/off makes a noticeable difference, especially for
> Task 2. Compaction suddenly takes a significant amount of time. But:
> It's a relatively gradual slowdown, at pretty extreme concurrency. So
> I'm pretty happy already.
>
>
> In the workload tested here most non-shared allocations are short
> lived. So it's not surprising that it's not worth compacting pages. I do
> wonder whether it'd be possible to keep some running statistics about
> THP being worthwhile or not.
My goal was to be more conservative and collapse mostly in khugepaged instead
of page faults. But maybe some running per-thread statistics of hugepage lifetime
could work too...
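To make the "running statistics" idea a bit more concrete, here is a toy model in Python. It is not kernel code and not part of this patchset. It assumes a hypothetical per-thread tracker that keeps an exponentially weighted moving average of how long huge pages lived, and only considers fault-time THP allocation worthwhile while that average stays above a threshold. The class name, alpha and the threshold are all made up for illustration.

```python
class HugepageLifetimeStats:
    """Toy per-thread tracker; every name and constant here is hypothetical."""

    def __init__(self, alpha=0.25, min_lifetime_s=1.0):
        self.alpha = alpha                    # weight given to the newest sample
        self.min_lifetime_s = min_lifetime_s  # below this, THP is deemed not worthwhile
        self.avg_lifetime_s = None            # no samples seen yet

    def record_free(self, lifetime_s):
        """Fold the lifetime of a just-freed huge page into the running average."""
        if self.avg_lifetime_s is None:
            self.avg_lifetime_s = lifetime_s
        else:
            self.avg_lifetime_s += self.alpha * (lifetime_s - self.avg_lifetime_s)

    def thp_at_fault_worthwhile(self):
        """Stay optimistic until there is evidence that huge pages are short-lived."""
        return self.avg_lifetime_s is None or self.avg_lifetime_s >= self.min_lifetime_s
```

In a workload like the one described, where most non-shared allocations are short-lived, the average would quickly fall below the threshold and page faults would stop attempting huge page allocation, leaving collapsing to khugepaged.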
> This is just one workload, and I saw some different profiles while
> playing around. But I've already invested more time in this today than I
> should have... :)
Again, thanks a lot! If you find some more time, could you please also quickly
try how this workload looks when THPs are enabled but page fault compaction is
disabled completely by:
echo never > /sys/kernel/mm/transparent_hugepage/defrag
After LSF/MM I might be interested in how to reproduce this locally to use as a
testcase...
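When comparing runs it may help to record which THP settings were actually active. A small Python sketch, assuming the usual sysfs format in which the active choice is shown in square brackets (for example "always madvise [never]"); the function names are made up for illustration.

```python
from pathlib import Path

THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def active_option(text):
    """Return the bracketed choice from a line such as 'always madvise [never]'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no active option in {text!r}")


def current_settings(base=THP_DIR):
    """Read the 'enabled' and 'defrag' knobs, skipping any that are missing."""
    settings = {}
    for knob in ("enabled", "defrag"):
        path = base / knob
        if path.exists():
            settings[knob] = active_option(path.read_text())
    return settings


if __name__ == "__main__":
    print(current_settings())
```

Running this before each benchmark and keeping the output next to the results makes it easier to tell the "defrag=never" runs apart from the default ones later.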
> BTW, parallel process exits with large shared mappings isn't
> particularly fun:
>
> 80.09% postgres [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> |
> --- _raw_spin_lock_irqsave
> |
> |--99.97%-- pagevec_lru_move_fn
> | |
> | |--65.51%-- activate_page
Hm, at first sight it seems odd that page activation would be useful to do when
pages are being unmapped. But I'm not that familiar with this area...
> | | mark_page_accessed.part.23
> | | mark_page_accessed
> | | zap_pte_range
> | | unmap_page_range
> | | unmap_single_vma
> | | unmap_vmas
> | | exit_mmap
> | | mmput.part.27
> | | mmput
> | | exit_mm
> | | do_exit
> | | do_group_exit
> | | sys_exit_group
> | | system_call_fastpath
> | |
> | --34.49%-- lru_add_drain_cpu
> | lru_add_drain
> | free_pages_and_swap_cache
> | tlb_flush_mmu_free
> | zap_pte_range
> | unmap_page_range
> | unmap_single_vma
> | unmap_vmas
> | exit_mmap
> | mmput.part.27
> | mmput
> | exit_mm
> | do_exit
> | do_group_exit
> | sys_exit_group
> | system_call_fastpath
> --0.03%-- [...]
>
> 9.75% postgres [kernel.kallsyms] [k] zap_pte_range
> |
> --- zap_pte_range
> unmap_page_range
> unmap_single_vma
> unmap_vmas
> exit_mmap
> mmput.part.27
> mmput
> exit_mm
> do_exit
> do_group_exit
> sys_exit_group
> system_call_fastpath
>
> 1.93% postgres [kernel.kallsyms] [k] release_pages
> |
> --- release_pages
> |
> |--77.09%-- free_pages_and_swap_cache
> | tlb_flush_mmu_free
> | zap_pte_range
> | unmap_page_range
> | unmap_single_vma
> | unmap_vmas
> | exit_mmap
> | mmput.part.27
> | mmput
> | exit_mm
> | do_exit
> | do_group_exit
> | sys_exit_group
> | system_call_fastpath
> |
> |--22.64%-- pagevec_lru_move_fn
> | |
> | |--63.88%-- activate_page
> | | mark_page_accessed.part.23
> | | mark_page_accessed
> | | zap_pte_range
> | | unmap_page_range
> | | unmap_single_vma
> | | unmap_vmas
> | | exit_mmap
> | | mmput.part.27
> | | mmput
> | | exit_mm
> | | do_exit
> | | do_group_exit
> | | sys_exit_group
> | | system_call_fastpath
> | |
> | --36.12%-- lru_add_drain_cpu
> | lru_add_drain
> | free_pages_and_swap_cache
> | tlb_flush_mmu_free
> | zap_pte_range
> | unmap_page_range
> | unmap_single_vma
> | unmap_vmas
> | exit_mmap
> | mmput.part.27
> | mmput
> | exit_mm
> | do_exit
> | do_group_exit
> | sys_exit_group
> | system_call_fastpath
> --0.27%-- [...]
>
> 1.91% postgres [kernel.kallsyms] [k] page_remove_file_rmap
> |
> --- page_remove_file_rmap
> |
> |--98.18%-- page_remove_rmap
> | zap_pte_range
> | unmap_page_range
> | unmap_single_vma
> | unmap_vmas
> | exit_mmap
> | mmput.part.27
> | mmput
> | exit_mm
> | do_exit
> | do_group_exit
> | sys_exit_group
> | system_call_fastpath
> |
> --1.82%-- zap_pte_range
> unmap_page_range
> unmap_single_vma
> unmap_vmas
> exit_mmap
> mmput.part.27
> mmput
> exit_mm
> do_exit
> do_group_exit
> sys_exit_group
> system_call_fastpath
>
>
>
> Greetings,
>
> Andres Freund
>
> --
> Andres Freund http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>