* [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
From: Dave Chinner @ 2015-03-02 1:04 UTC (permalink / raw)
To: linux-kernel; +Cc: linux-mm, xfs
Hi folks,
Running one of my usual benchmarks (fsmark to create 50 million zero
length files in a 500TB filesystem, then running xfs_repair on it)
has indicated a significant regression in xfs_repair performance.
config                           3.19     4.0-rc1
defaults                         8m08s    9m34s
-o ag_stride=-1                  4m04s    4m38s
-o bhash=101073                  6m04s    17m43s
-o ag_stride=-1,bhash=101073     4m54s    9m58s
The default is to create a number of concurrent threads that work on
AGs in parallel (https://lkml.org/lkml/2014/7/3/15), and this is
running on a 500 AG filesystem, so there is plenty of parallelism
available. "-o ag_stride=-1" turns this off and just leaves a single
prefetch group working on AGs sequentially. As you can see, turning
off the concurrency halves the runtime.
The concurrency is really there for large spinning disk arrays,
where IO wait time dominates performance. I'm running on SSDs, so
there is almost no IO wait time.
The "-o bhash=X" controls the size of the buffer cache. The default
value is 4096, which means xfs_repair is oeprating with a memory
footprint of about 1GB and is small enough to suffer from readahead
thrashing on large filesystems. Setting it to 101073 gives increases that
to around 7-10GB and prevents readahead thrashing, so should run
much faster than the default concurrent config. It does run faster
for 3.19, but for 4.0-rc1 it runs almost twice as slow, and burns a
huge amount of system CPU time doing so.
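For reference, the four configs in the table above correspond to
xfs_repair invocations along these lines (the device name is a
placeholder, and the fsmark step that populates the filesystem
beforehand is omitted):

    xfs_repair <dev>
    xfs_repair -o ag_stride=-1 <dev>
    xfs_repair -o bhash=101073 <dev>
    xfs_repair -o ag_stride=-1,bhash=101073 <dev>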
Across the board the 4.0-rc1 numbers are much slower, and the
degradation is far worse when using the large memory footprint
configs. Perf points straight at the cause - this is from 4.0-rc1
on the "-o bhash=101073" config:
- 56.07% 56.07% [kernel] [k] default_send_IPI_mask_sequence_phys
- default_send_IPI_mask_sequence_phys
- 99.99% physflat_send_IPI_mask
- 99.37% native_send_call_func_ipi
smp_call_function_many
- native_flush_tlb_others
- 99.85% flush_tlb_page
ptep_clear_flush
try_to_unmap_one
rmap_walk
try_to_unmap
migrate_pages
migrate_misplaced_page
- handle_mm_fault
- 99.73% __do_page_fault
trace_do_page_fault
do_async_page_fault
+ async_page_fault
0.63% native_send_call_func_single_ipi
generic_exec_single
smp_call_function_single
And the same profile output from 3.19 shows:
- 9.61% 9.61% [kernel] [k] default_send_IPI_mask_sequence_phys
- default_send_IPI_mask_sequence_phys
- 99.98% physflat_send_IPI_mask
- 96.26% native_send_call_func_ipi
smp_call_function_many
- native_flush_tlb_others
- 98.44% flush_tlb_page
ptep_clear_flush
try_to_unmap_one
rmap_walk
try_to_unmap
migrate_pages
migrate_misplaced_page
handle_mm_fault
+ 1.56% flush_tlb_mm_range
+ 3.74% native_send_call_func_single_ipi
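(For reference, call graphs like the two above can be captured with a
plain system-wide callchain record while the repair is running, e.g.

    perf record -g -a -- sleep 30
    perf report

the exact sampling window isn't important.)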
So either there's been a massive increase in the number of IPIs
being sent, or the cost per IPI has greatly increased. Either way,
the result is a pretty significant performance degradation.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
From: Linus Torvalds @ 2015-03-02 19:47 UTC (permalink / raw)
To: Dave Chinner, Andrew Morton, Ingo Molnar, Matt B
Cc: Linux Kernel Mailing List, linux-mm, xfs

On Sun, Mar 1, 2015 at 5:04 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Across the board the 4.0-rc1 numbers are much slower, and the
> degradation is far worse when using the large memory footprint
> configs. Perf points straight at the cause - this is from 4.0-rc1
> on the "-o bhash=101073" config:
>
>  - 56.07% 56.07% [kernel] [k] default_send_IPI_mask_sequence_phys
>     - 99.99% physflat_send_IPI_mask
>        - 99.37% native_send_call_func_ipi
..
>
> And the same profile output from 3.19 shows:
>
>  - 9.61% 9.61% [kernel] [k] default_send_IPI_mask_sequence_phys
>     - 99.98% physflat_send_IPI_mask
>        - 96.26% native_send_call_func_ipi
...
>
> So either there's been a massive increase in the number of IPIs
> being sent, or the cost per IPI has greatly increased. Either way,
> the result is a pretty significant performance degradation.

And on Mon, Mar 2, 2015 at 11:17 AM, Matt <jackdachef@gmail.com> wrote:
>
> Linus already posted a fix to the problem, however I can't seem to
> find the matching commit in his tree (searching for "TLC regression"
> or "TLB cache").

That was commit f045bbb9fa1b, which was then refined by commit
721c21c17ab9, because it turned out that ARM64 had a very subtle
relationship with tlb->end and fullmm. But both of those hit 3.19, so
none of this should affect 4.0-rc1. There's something else going on.

I assume it's the mm queue from Andrew, so adding him to the cc. There
are changes to the page migration etc, which could explain it.

There are also a fair amount of APIC changes in 4.0-rc1, so I guess it
really could be just that the IPI sending itself has gotten much
slower. Adding Ingo for that, although I don't think
default_send_IPI_mask_sequence_phys() itself has actually changed,
only other things around the apic. So I'd be inclined to blame the mm
changes.

Obviously bisection would find it..

              Linus
* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
From: Dave Chinner @ 2015-03-03 1:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Ingo Molnar, Matt B, Linux Kernel Mailing List, linux-mm, xfs

On Mon, Mar 02, 2015 at 11:47:52AM -0800, Linus Torvalds wrote:
> On Sun, Mar 1, 2015 at 5:04 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Across the board the 4.0-rc1 numbers are much slower, and the
> > degradation is far worse when using the large memory footprint
> > configs. Perf points straight at the cause - this is from 4.0-rc1
> > on the "-o bhash=101073" config:
> >
> >  - 56.07% 56.07% [kernel] [k] default_send_IPI_mask_sequence_phys
> >     - 99.99% physflat_send_IPI_mask
> >        - 99.37% native_send_call_func_ipi
> ..
> >
> > And the same profile output from 3.19 shows:
> >
> >  - 9.61% 9.61% [kernel] [k] default_send_IPI_mask_sequence_phys
> >     - 99.98% physflat_send_IPI_mask
> >        - 96.26% native_send_call_func_ipi
> ...
> >
> > So either there's been a massive increase in the number of IPIs
> > being sent, or the cost per IPI has greatly increased. Either way,
> > the result is a pretty significant performance degradation.
....
> I assume it's the mm queue from Andrew, so adding him to the cc. There
> are changes to the page migration etc, which could explain it.
>
> There are also a fair amount of APIC changes in 4.0-rc1, so I guess it
> really could be just that the IPI sending itself has gotten much
> slower. Adding Ingo for that, although I don't think
> default_send_IPI_mask_sequence_phys() itself has actually changed,
> only other things around the apic. So I'd be inclined to blame the mm
> changes.
>
> Obviously bisection would find it..

Yes, though the time it takes to do a 13 step bisection means it's
something I don't do just for an initial bug report. ;)

Anyway, the difference between good and bad is pretty clear, so I'm
pretty confident the bisect is solid:

4d9424669946532be754a6e116618dcb58430cb4 is the first bad commit
commit 4d9424669946532be754a6e116618dcb58430cb4
Author: Mel Gorman <mgorman@suse.de>
Date:   Thu Feb 12 14:58:28 2015 -0800

    mm: convert p[te|md]_mknonnuma and remaining page table manipulations

    With PROT_NONE, the traditional page table manipulation functions are
    sufficient.

    [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
    [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Acked-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
    Tested-by: Sasha Levin <sasha.levin@oracle.com>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Dave Jones <davej@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

:040000 040000 50985a3f84e80bb2bdd049d4f34739d99436f988 1bc79bfac2c138844373b603f9bc5914f0d010f3 M  arch
:040000 040000 ea69bcd1c59f832a4b012a57b4eb1d0c7516947d 0822692fa6c356952e723b56038585716fa51723 M  include
:040000 040000 c11960b9f1ee72edb08dc3fdc46f590fb1d545f7 f5d17ff5b639adcb7363a196a9efe70f2a7312b5 M  mm

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
From: Linus Torvalds @ 2015-03-03 2:22 UTC (permalink / raw)
To: Dave Chinner
Cc: Andrew Morton, Ingo Molnar, Matt B, Linux Kernel Mailing List, linux-mm, xfs

On Mon, Mar 2, 2015 at 5:47 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Anyway, the difference between good and bad is pretty clear, so
> I'm pretty confident the bisect is solid:
>
> 4d9424669946532be754a6e116618dcb58430cb4 is the first bad commit

Well, it's the mm queue from Andrew, so I'm not surprised. That said,
I don't see why that particular one should matter.

Hmm. In your profiles, can you tell which caller of "flush_tlb_page()"
changed the most?

The change from "mknnuma" to "prot_none" *should* be 100% equivalent
(both just change the page to be not-present, just set different bits
elsewhere in the pte), but clearly something wasn't.

Oh. Except for that special "huge-zero-page" special case that got
dropped, but that got re-introduced in commit e944fd67b625.

There might be some other case where the new "just change the
protection" doesn't do the "oh, but if the protection didn't change,
don't bother flushing". I don't see it.

              Linus
* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
From: Linus Torvalds @ 2015-03-03 2:37 UTC (permalink / raw)
To: Dave Chinner, Mel Gorman
Cc: Andrew Morton, Ingo Molnar, Matt B, Linux Kernel Mailing List, linux-mm, xfs

On Mon, Mar 2, 2015 at 6:22 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> There might be some other case where the new "just change the
> protection" doesn't do the "oh, but if the protection didn't change,
> don't bother flushing". I don't see it.

Hmm. I wonder.. In change_pte_range(), we just unconditionally change
the protection bits.

But the old numa code used to do

        if (!pte_numa(oldpte)) {
                ptep_set_numa(mm, addr, pte);

so it would actually avoid the pte update if a numa-prot page was
marked numa-prot again.

But are those migrate-page calls really common enough to make these
things happen often enough on the same pages for this all to matter?

Odd.

So it would be good if your profiles just show "there's suddenly a
*lot* more calls to flush_tlb_page() from XYZ" and the culprit is
obvious that way..

              Linus
* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
From: Dave Chinner @ 2015-03-03 5:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Andrew Morton, Ingo Molnar, Matt B, Linux Kernel Mailing List, linux-mm, xfs

On Mon, Mar 02, 2015 at 06:37:47PM -0800, Linus Torvalds wrote:
> On Mon, Mar 2, 2015 at 6:22 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > There might be some other case where the new "just change the
> > protection" doesn't do the "oh, but if the protection didn't change,
> > don't bother flushing". I don't see it.
>
> Hmm. I wonder.. In change_pte_range(), we just unconditionally change
> the protection bits.
>
> But the old numa code used to do
>
>         if (!pte_numa(oldpte)) {
>                 ptep_set_numa(mm, addr, pte);
>
> so it would actually avoid the pte update if a numa-prot page was
> marked numa-prot again.
>
> But are those migrate-page calls really common enough to make these
> things happen often enough on the same pages for this all to matter?

It's looking like that's a possibility. I am running a fake-numa=4
config on this test VM so it's got 4 nodes of 4p/4GB RAM each. Both
kernels are running through the same page fault path and that is
straight through migrate_pages().

3.19:

   13.70%  0.01%  [kernel]  [k] native_flush_tlb_others
   - native_flush_tlb_others
      - 98.58% flush_tlb_page
           ptep_clear_flush
           try_to_unmap_one
           rmap_walk
           try_to_unmap
           migrate_pages
           migrate_misplaced_page
         - handle_mm_fault
            - 96.88% __do_page_fault
                 trace_do_page_fault
                 do_async_page_fault
               + async_page_fault
            + 3.12% __get_user_pages
      + 1.40% flush_tlb_mm_range

4.0-rc1:

 - 67.12%  0.04%  [kernel]  [k] native_flush_tlb_others
   - native_flush_tlb_others
      - 99.80% flush_tlb_page
           ptep_clear_flush
           try_to_unmap_one
           rmap_walk
           try_to_unmap
           migrate_pages
           migrate_misplaced_page
         - handle_mm_fault
            - 99.50% __do_page_fault
                 trace_do_page_fault
                 do_async_page_fault
               - async_page_fault

Same call chain, just a lot more CPU used further down the stack.

> Odd.
>
> So it would be good if your profiles just show "there's suddenly a
> *lot* more calls to flush_tlb_page() from XYZ" and the culprit is
> obvious that way..

Ok, I did a simple 'perf stat -e tlb:tlb_flush -a -r 6 sleep 10' to
count all the tlb flush events from the kernel. I then pulled the
full events for a 30s period to get a sampling of the reason
associated with each flush event.

4.0-rc1:

 Performance counter stats for 'system wide' (6 runs):

         2,190,503      tlb:tlb_flush           ( +-  8.30% )

      10.001970663 seconds time elapsed         ( +-  0.00% )

The reason breakdown:

    81% TLB_REMOTE_SHOOTDOWN
    19% TLB_FLUSH_ON_TASK_SWITCH

3.19:

 Performance counter stats for 'system wide' (6 runs):

           467,151      tlb:tlb_flush           ( +- 25.50% )

      10.002021491 seconds time elapsed         ( +-  0.00% )

The reason breakdown:

     6% TLB_REMOTE_SHOOTDOWN
    94% TLB_FLUSH_ON_TASK_SWITCH

The difference would appear to be the number of remote TLB shootdowns
that are occurring from otherwise identical page fault paths.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
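[Side note on reproducing the reason breakdown above: the counts come
straight from the tlb:tlb_flush tracepoint, so something along these
lines will produce a similar summary. The exact text format of the
event (the reason field name and spelling) varies between kernel
versions, so treat the grep pattern as an assumption:

    perf record -e tlb:tlb_flush -a -- sleep 30
    perf script | grep -o 'reason[:=][A-Za-z_ ]*' | sort | uniq -c
]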
* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
From: Linus Torvalds @ 2015-03-03 6:56 UTC (permalink / raw)
To: Dave Chinner
Cc: Mel Gorman, Andrew Morton, Ingo Molnar, Matt B, Linux Kernel Mailing List, linux-mm, xfs

On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> But are those migrate-page calls really common enough to make these
>> things happen often enough on the same pages for this all to matter?
>
> It's looking like that's a possibility.

Hmm. Looking closer, commit 10c1045f28e8 already should have
re-introduced the "pte was already NUMA" case.

So that's not it either, afaik. Plus your numbers seem to say that
it's really "migrate_pages()" that is done more. So it feels like the
numa balancing isn't working right.

But I'm not seeing what would cause that in that commit. It really
all looks the same to me. The few special-cases it drops get
re-introduced later (although in a different form).

Mel, do you see what I'm missing?

              Linus
* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
From: Dave Chinner @ 2015-03-03 11:34 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Andrew Morton, Ingo Molnar, Matt B, Linux Kernel Mailing List, linux-mm, xfs

On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> But are those migrate-page calls really common enough to make these
> >> things happen often enough on the same pages for this all to matter?
> >
> > It's looking like that's a possibility.
>
> Hmm. Looking closer, commit 10c1045f28e8 already should have
> re-introduced the "pte was already NUMA" case.
>
> So that's not it either, afaik. Plus your numbers seem to say that
> it's really "migrate_pages()" that is done more. So it feels like the
> numa balancing isn't working right.

So that should show up in the vmstats, right? Oh, and there's a
tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:

3.19:

        55,898      migrate:mm_migrate_pages

And a sample of the events shows 99.99% of these are:

    mm_migrate_pages: nr_succeeded=1 nr_failed=0 mode=MIGRATE_ASYNC reason=

4.0-rc1:

       364,442      migrate:mm_migrate_pages

They are also single page MIGRATE_ASYNC events like for 3.19.

And 'grep "numa\|migrate" /proc/vmstat' output for the entire
xfs_repair run:

3.19:

numa_hit 5163221
numa_miss 121274
numa_foreign 121274
numa_interleave 12116
numa_local 5153127
numa_other 131368
numa_pte_updates 36482466
numa_huge_pte_updates 0
numa_hint_faults 34816515
numa_hint_faults_local 9197961
numa_pages_migrated 1228114
pgmigrate_success 1228114
pgmigrate_fail 0

4.0-rc1:

numa_hit 36952043
numa_miss 92471
numa_foreign 92471
numa_interleave 10964
numa_local 36927384
numa_other 117130
numa_pte_updates 84010995
numa_huge_pte_updates 0
numa_hint_faults 81697505
numa_hint_faults_local 21765799
numa_pages_migrated 32916316
pgmigrate_success 32916316
pgmigrate_fail 0

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
From: Mel Gorman @ 2015-03-03 13:43 UTC (permalink / raw)
To: Dave Chinner
Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Matt B, Linux Kernel Mailing List, linux-mm, xfs

On Tue, Mar 03, 2015 at 10:34:37PM +1100, Dave Chinner wrote:
> On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> > On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >>
> > >> But are those migrate-page calls really common enough to make these
> > >> things happen often enough on the same pages for this all to matter?
> > >
> > > It's looking like that's a possibility.
> >
> > Hmm. Looking closer, commit 10c1045f28e8 already should have
> > re-introduced the "pte was already NUMA" case.
> >
> > So that's not it either, afaik. Plus your numbers seem to say that
> > it's really "migrate_pages()" that is done more. So it feels like the
> > numa balancing isn't working right.
>
> So that should show up in the vmstats, right? Oh, and there's a
> tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:
>

The stats indicate both more updates and more faults. Can you try this
please? It's against 4.0-rc1.

---8<---
mm: numa: Reduce amount of IPI traffic due to automatic NUMA balancing

Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the
  degradation is far worse when using the large memory footprint
  configs. Perf points straight at the cause - this is from 4.0-rc1
  on the "-o bhash=101073" config:

   - 56.07% 56.07% [kernel] [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This was bisected to commit 4d94246699 ("mm: convert p[te|md]_mknonnuma
and remaining page table manipulations") but I expect the full issue is
related to the series up to and including that patch.

There are two important changes that might be relevant here. The first
is that marking huge PMDs to trap a hinting fault potentially sends an
IPI to flush TLBs. This did not show up in Dave's report and it almost
certainly is not a factor but it would affect IPI counts for other
users. The second is that the PTE protection update now clears the PTE,
leaving a window where parallel faults can be trapped, resulting in
more overhead from faults. Higher fault counts, even if correct, can
result in higher scan rates indirectly and may explain what Dave is
saying.

This is not signed off or tested.
---
 mm/huge_memory.c | 11 +++++++++--
 mm/mprotect.c    | 17 +++++++++++++++--
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fc00c8cb5a82..7fc4732c77d7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1494,8 +1494,15 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	}
 
 	if (!prot_numa || !pmd_protnone(*pmd)) {
-		ret = 1;
-		entry = pmdp_get_and_clear_notify(mm, addr, pmd);
+		/*
+		 * NUMA hinting update can avoid a clear and flush as
+		 * it is not a functional correctness issue if access
+		 * occurs after the update
+		 */
+		if (prot_numa)
+			entry = *pmd;
+		else
+			entry = pmdp_get_and_clear_notify(mm, addr, pmd);
 		entry = pmd_modify(entry, newprot);
 		ret = HPAGE_PMD_NR;
 		set_pmd_at(mm, addr, pmd, entry);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 44727811bf4c..1efd03ffa0d8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -77,19 +77,32 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			pte_t ptent;
 
 			/*
-			 * Avoid trapping faults against the zero or KSM
-			 * pages. See similar comment in change_huge_pmd.
+			 * prot_numa does not clear the pte during protection
+			 * update as asynchronous hardware updates are not
+			 * a concern but unnecessary faults while the PTE is
+			 * cleared is overhead.
 			 */
 			if (prot_numa) {
 				struct page *page;
 
 				page = vm_normal_page(vma, addr, oldpte);
+
+				/*
+				 * Avoid trapping faults against the zero or KSM
+				 * pages. See similar comment in change_huge_pmd.
+				 */
 				if (!page || PageKsm(page))
 					continue;
 
 				/* Avoid TLB flush if possible */
 				if (pte_protnone(oldpte))
 					continue;
+
+				ptent = *pte;
+				ptent = pte_modify(ptent, newprot);
+				set_pte_at(mm, addr, pte, ptent);
+				pages++;
+				continue;
 			}
 
 			ptent = ptep_modify_prot_start(mm, addr, pte);
* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
From: Dave Chinner @ 2015-03-03 21:33 UTC (permalink / raw)
To: Mel Gorman
Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Matt B, Linux Kernel Mailing List, linux-mm, xfs

On Tue, Mar 03, 2015 at 01:43:46PM +0000, Mel Gorman wrote:
> On Tue, Mar 03, 2015 at 10:34:37PM +1100, Dave Chinner wrote:
> > On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> > > On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > >>
> > > >> But are those migrate-page calls really common enough to make these
> > > >> things happen often enough on the same pages for this all to matter?
> > > >
> > > > It's looking like that's a possibility.
> > >
> > > Hmm. Looking closer, commit 10c1045f28e8 already should have
> > > re-introduced the "pte was already NUMA" case.
> > >
> > > So that's not it either, afaik. Plus your numbers seem to say that
> > > it's really "migrate_pages()" that is done more. So it feels like the
> > > numa balancing isn't working right.
> >
> > So that should show up in the vmstats, right? Oh, and there's a
> > tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:
> >
>
> The stats indicate both more updates and more faults. Can you try this
> please? It's against 4.0-rc1.
>
> ---8<---
> mm: numa: Reduce amount of IPI traffic due to automatic NUMA balancing

Makes no noticeable difference to behaviour or performance. Stats:

       359,857      migrate:mm_migrate_pages    ( +- 5.54% )

numa_hit 36026802
numa_miss 14287
numa_foreign 14287
numa_interleave 18408
numa_local 36006052
numa_other 35037
numa_pte_updates 81803359
numa_huge_pte_updates 0
numa_hint_faults 79810798
numa_hint_faults_local 21227730
numa_pages_migrated 32037516
pgmigrate_success 32037516
pgmigrate_fail 0

-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
From: Mel Gorman @ 2015-03-04 20:00 UTC (permalink / raw)
To: Dave Chinner
Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Matt B, Linux Kernel Mailing List, linux-mm, xfs

On Wed, Mar 04, 2015 at 08:33:53AM +1100, Dave Chinner wrote:
> On Tue, Mar 03, 2015 at 01:43:46PM +0000, Mel Gorman wrote:
> > On Tue, Mar 03, 2015 at 10:34:37PM +1100, Dave Chinner wrote:
> > > On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> > > > On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > > >>
> > > > >> But are those migrate-page calls really common enough to make these
> > > > >> things happen often enough on the same pages for this all to matter?
> > > > >
> > > > > It's looking like that's a possibility.
> > > >
> > > > Hmm. Looking closer, commit 10c1045f28e8 already should have
> > > > re-introduced the "pte was already NUMA" case.
> > > >
> > > > So that's not it either, afaik. Plus your numbers seem to say that
> > > > it's really "migrate_pages()" that is done more. So it feels like the
> > > > numa balancing isn't working right.
> > >
> > > So that should show up in the vmstats, right? Oh, and there's a
> > > tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:
> > >
> >
> > The stats indicate both more updates and more faults. Can you try this
> > please? It's against 4.0-rc1.
> >
> > ---8<---
> > mm: numa: Reduce amount of IPI traffic due to automatic NUMA balancing
>
> Makes no noticeable difference to behaviour or performance. Stats:
>

After going through the series again, I did not spot why there is a
difference. It's functionally similar and I would hate the theory that
this is somehow hardware related due to the use of bits it takes action
on. There is nothing in the manual that indicates that it would.

Try this as I don't want to leave this hanging before LSF/MM because
it'll mask other reports. It alters the maximum rate at which automatic
NUMA balancing scans ptes.

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce18f3c097a..40ae5d84d4ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -799,7 +799,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * calculated based on the tasks virtual memory size and
  * numa_balancing_scan_size.
  */
-unsigned int sysctl_numa_balancing_scan_period_min = 1000;
+unsigned int sysctl_numa_balancing_scan_period_min = 2000;
 unsigned int sysctl_numa_balancing_scan_period_max = 60000;
 
 /* Portion of address space to scan in MB */
* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
From: Dave Chinner @ 2015-03-04 23:00 UTC (permalink / raw)
To: Mel Gorman
Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Matt B, Linux Kernel Mailing List, linux-mm, xfs

On Wed, Mar 04, 2015 at 08:00:46PM +0000, Mel Gorman wrote:
> On Wed, Mar 04, 2015 at 08:33:53AM +1100, Dave Chinner wrote:
> > On Tue, Mar 03, 2015 at 01:43:46PM +0000, Mel Gorman wrote:
> > > On Tue, Mar 03, 2015 at 10:34:37PM +1100, Dave Chinner wrote:
> > > > On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> > > > > On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > > > >>
> > > > > >> But are those migrate-page calls really common enough to make these
> > > > > >> things happen often enough on the same pages for this all to matter?
> > > > > >
> > > > > > It's looking like that's a possibility.
> > > > >
> > > > > Hmm. Looking closer, commit 10c1045f28e8 already should have
> > > > > re-introduced the "pte was already NUMA" case.
> > > > >
> > > > > So that's not it either, afaik. Plus your numbers seem to say that
> > > > > it's really "migrate_pages()" that is done more. So it feels like the
> > > > > numa balancing isn't working right.
> > > >
> > > > So that should show up in the vmstats, right? Oh, and there's a
> > > > tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:
> > > >
> > >
> > > The stats indicate both more updates and more faults. Can you try this
> > > please? It's against 4.0-rc1.
> > >
> > > ---8<---
> > > mm: numa: Reduce amount of IPI traffic due to automatic NUMA balancing
> >
> > Makes no noticeable difference to behaviour or performance. Stats:
> >
>
> After going through the series again, I did not spot why there is a
> difference. It's functionally similar and I would hate the theory that
> this is somehow hardware related due to the use of bits it takes action
> on.

I doubt it's hardware related - I'm testing inside a VM, and the host
is a year old Dell r820 server, so it's pretty common hardware I'd
think.

Guest:

processor	: 15
vendor_id	: GenuineIntel
cpu family	: 6
model		: 6
model name	: QEMU Virtual CPU version 2.0.0
stepping	: 3
microcode	: 0x1
cpu MHz		: 2199.998
cache size	: 4096 KB
physical id	: 15
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 15
initial apicid	: 15
fpu		: yes
fpu_exception	: yes
cpuid level	: 4
wp		: yes
flags		: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good nopl pni cx16 x2apic popcnt hypervisor lahf_lm
bugs		:
bogomips	: 4399.99
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

Host:

processor	: 31
vendor_id	: GenuineIntel
cpu family	: 6
model		: 45
model name	: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz
stepping	: 7
microcode	: 0x70d
cpu MHz		: 1190.750
cache size	: 16384 KB
physical id	: 1
siblings	: 16
core id		: 7
cpu cores	: 8
apicid		: 47
initial apicid	: 47
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4400.75
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

> There is nothing in the manual that indicates that it would. Try this
> as I don't want to leave this hanging before LSF/MM because it'll mask
> other reports. It alters the maximum rate at which automatic NUMA
> balancing scans ptes.
>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7ce18f3c097a..40ae5d84d4ba 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -799,7 +799,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>   * calculated based on the tasks virtual memory size and
>   * numa_balancing_scan_size.
>   */
> -unsigned int sysctl_numa_balancing_scan_period_min = 1000;
> +unsigned int sysctl_numa_balancing_scan_period_min = 2000;
>  unsigned int sysctl_numa_balancing_scan_period_max = 60000;

Made absolutely no difference:

       357,635      migrate:mm_migrate_pages    ( +- 4.11% )

numa_hit 36724642
numa_miss 92471
numa_foreign 92471
numa_interleave 11835
numa_local 36709671
numa_other 107448
numa_pte_updates 83924860
numa_huge_pte_updates 0
numa_hint_faults 81856035
numa_hint_faults_local 22104529
numa_pages_migrated 32766735
pgmigrate_success 32766735
pgmigrate_fail 0

Runtime was actually a minute worse (18m35s vs 17m39s) than without
this patch.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
From: Ingo Molnar @ 2015-03-04 23:35 UTC (permalink / raw)
To: Dave Chinner
Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Matt B, Linux Kernel Mailing List, linux-mm, xfs

* Dave Chinner <david@fromorbit.com> wrote:

> > After going through the series again, I did not spot why there is
> > a difference. It's functionally similar and I would hate the
> > theory that this is somehow hardware related due to the use of
> > bits it takes action on.
>
> I doubt it's hardware related - I'm testing inside a VM, [...]

That might be significant, I doubt Mel considered KVM's interpretation
of pte details?

Thanks,

        Ingo
* Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
From: Dave Chinner @ 2015-03-04 23:51 UTC (permalink / raw)
To: Ingo Molnar
Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Matt B, Linux Kernel Mailing List, linux-mm, xfs

On Thu, Mar 05, 2015 at 12:35:45AM +0100, Ingo Molnar wrote:
>
> * Dave Chinner <david@fromorbit.com> wrote:
>
> > > After going through the series again, I did not spot why there is
> > > a difference. It's functionally similar and I would hate the
> > > theory that this is somehow hardware related due to the use of
> > > bits it takes action on.
> >
> > I doubt it's hardware related - I'm testing inside a VM, [...]
>
> That might be significant, I doubt Mel considered KVM's interpretation
> of pte details?

I did actually mention that before:

| I am running a fake-numa=4 config on this test VM so it's got 4
| nodes of 4p/4GB RAM each.

but I think it got snipped before Mel was cc'd.

Perhaps the size of the nodes is relevant, too, because the steady
state phase 3 memory usage is 5-6GB when this problem first shows up,
and then continues into phase 4 where memory usage grows again and
peaks at ~10GB....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
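[Side note: the fake-NUMA layout described above is the standard x86
NUMA emulation. On a guest kernel built with CONFIG_NUMA_EMU it is
enabled with a boot parameter along the lines of

    numa=fake=4

and the resulting node layout can be checked from inside the guest
with 'numactl --hardware'.]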