* Re: [PATCH 0/3] x86: Make 5-level paging support unconditional for x86-64
From: Shivank Garg @ 2024-07-31 8:57 UTC (permalink / raw)
To: kirill.shutemov
Cc: ardb, bp, brijesh.singh, corbet, dave.hansen, hpa, jan.kiszka,
jgross, kbingham, linux-doc, linux-efi, linux-kernel, linux-mm,
luto, michael.roth, mingo, peterz, rick.p.edgecombe,
sandipan.das, tglx, thomas.lendacky, x86
I did some experiments to understand the impact of making 5-level page tables
the default.
Machine Info: AMD Zen 4 EPYC server (2-socket system, 128 cores and 1 NUMA
node per socket, SMT enabled). Each NUMA node has approximately 377 GB of
memory.
For the experiments, I bind the benchmark to the CPUs and memory node of a
single socket for consistent results. The two configurations are measured by
enabling/disabling 5-level page tables via CONFIG_X86_5LEVEL.
% Change: (5L-4L)/4L*100
CoV (%): Coefficient of Variation (%)
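For reference, this is roughly how the reported statistics can be derived from
per-run samples (a minimal standalone C sketch, not the actual analysis
scripts; the sample arrays are placeholders and the 95% CI uses a normal
approximation with the 1.96 factor):

/* stats.c - sketch of how Mean, 95% CI, CoV and % Change are computed.
 * Build: gcc -O2 -o stats stats.c -lm
 * The sample values below are placeholders, not the measured data.
 */
#include <math.h>
#include <stdio.h>

struct stats { double mean, ci_lo, ci_hi, cov; };

static struct stats summarize(const double *x, int n)
{
        double sum = 0.0, var = 0.0;
        for (int i = 0; i < n; i++)
                sum += x[i];
        double mean = sum / n;
        for (int i = 0; i < n; i++)
                var += (x[i] - mean) * (x[i] - mean);
        double sd  = sqrt(var / (n - 1));     /* sample standard deviation */
        double sem = sd / sqrt((double)n);    /* standard error of the mean */
        struct stats s = {
                .mean  = mean,
                .ci_lo = mean - 1.96 * sem,   /* 95% CI, normal approximation */
                .ci_hi = mean + 1.96 * sem,
                .cov   = sd / mean * 100.0,   /* CoV (%) */
        };
        return s;
}

int main(void)
{
        /* placeholder runs for the 4-level and 5-level configurations */
        double l4[] = { 0.406, 0.407, 0.408, 0.406, 0.407 };
        double l5[] = { 0.429, 0.430, 0.429, 0.428, 0.430 };
        struct stats s4 = summarize(l4, 5), s5 = summarize(l5, 5);

        printf("4L: mean %.4f CI %.4f-%.4f CoV %.3f%%\n",
               s4.mean, s4.ci_lo, s4.ci_hi, s4.cov);
        printf("5L: mean %.4f CI %.4f-%.4f CoV %.3f%%\n",
               s5.mean, s5.ci_lo, s5.ci_hi, s5.cov);
        /* % Change: (5L - 4L) / 4L * 100 */
        printf("%% Change: %.2f\n", (s5.mean - s4.mean) / s4.mean * 100.0);
        return 0;
}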
Results:
lmbench:lat_pagefault: Metric - page-fault time (us) - Lower is better

            4-Level PT               5-Level PT               % Change
THP-never   Mean: 0.4068             Mean: 0.4294              5.56
            95% CI: 0.4057-0.4078    95% CI: 0.4287-0.4302

THP-Always  Mean: 0.4061             Mean: 0.4288              5.59
            95% CI: 0.4051-0.4071    95% CI: 0.4281-0.4295
Btree (Threads: 32): Metric - Time Taken (in seconds) - Lower is better

             4-Level                   5-Level
             Time Taken(s)  CoV (%)    Time Taken(s)  CoV (%)   % Change
THP Never    382.2          0.219      388.8          1.019      1.73
THP Madvise  383.0          0.261      384.8          0.809      0.47
THP Always   392.8          1.376      386.4          2.147     -1.63
Btree (Threads: 256): Metric - Time Taken (in seconds) - Lower is better

             4-Level                   5-Level
             Time Taken(s)  CoV (%)    Time Taken(s)  CoV (%)   % Change
THP Never    56.6           2.014      55.2           0.810     -2.47
THP Madvise  56.6           2.014      56.4           2.022     -0.35
THP Always   56.6           0.968      56.2           1.489     -0.71
Ebizzy: Metric - records/s - Higher is better

           4-Level                  5-Level
Threads    records/s    CoV (%)     records/s    CoV (%)    % Change
1          844          0.302       837          0.196      -0.85
256        10160        0.315       10288        1.081       1.26
XSBench (Thread:256, THP:Never) - Higher is better
Metric 4-Level 5-Level % Change
Lookups/s 13720556 13396288 -2.36
CoV (%) 1.726 1.317
Hashjoin (Thread:256, THP:Never) - Lower is better
Metric 4-Level 5-Level % Change
Time taken(s) 424.4 427.4 0.707
CoV (%) 0.394 0.209
Graph500(Thread:256, THP:Madvise) - Lower is better
Metric 4-Level 5-Level % Change
Time Taken(s) 0.1879 0.1873 -0.32
CoV (%) 0.165 0.213
GUPS(Thread:128, THP:Madvise) - Higher is better
Metric 4-Level 5-Level % Change
GUPS 1.3265 1.3252 -0.10
CoV (%) 0.037 0.027
pagerank(Thread:256, THP:Madvise) - Lower is better
Metric 4-Level 5-Level % Change
Time taken(s) 143.67 143.67 0.00
CoV (%) 0.402 0.402
Redis(Thread:256, THP:Madvise) - Higher is better
Metric 4-Level 5-Level % Change
Throughput(Ops/s) 141030744 139586376 -1.02
CoV (%) 0.372 0.561
memcached(Thread:256, THP:Madvise) - Higher is better
Metric 4-Level 5-Level % Change
Throughput(Ops/s) 19916313 19743637 -0.87
CoV (%) 0.051 0.095
Inference:
5-level page tables show an increase in page-fault latency but do not
significantly impact the other benchmarks.
Thanks,
Shivank
* Re: [PATCH 0/3] x86: Make 5-level paging support unconditional for x86-64
From: Thomas Gleixner @ 2024-07-31 9:15 UTC (permalink / raw)
To: 20240621164406.256314-1-kirill.shutemov, kirill.shutemov
Cc: ardb, bp, brijesh.singh, corbet, dave.hansen, hpa, jan.kiszka,
jgross, kbingham, linux-doc, linux-efi, linux-kernel, linux-mm,
luto, michael.roth, mingo, peterz, rick.p.edgecombe,
sandipan.das, thomas.lendacky, x86
On Wed, Jul 31 2024 at 14:27, Shivank Garg wrote:
> lmbench:lat_pagefault: Metric - page-fault time (us) - Lower is better
>
>             4-Level PT               5-Level PT               % Change
> THP-never   Mean: 0.4068             Mean: 0.4294              5.56
>             95% CI: 0.4057-0.4078    95% CI: 0.4287-0.4302
>
> THP-Always  Mean: 0.4061             Mean: 0.4288              5.59
>             95% CI: 0.4051-0.4071    95% CI: 0.4281-0.4295
>
> Inference:
> 5-level page tables show an increase in page-fault latency but do not
> significantly impact the other benchmarks.
5% regression on lmbench is a NONO.
5-level page tables add a cost in every hardware page table walk. That's
a matter of fact and there is absolutely no reason to inflict this cost
on everyone.
The solution to this is to make the 5-level mechanics smarter by evaluating
whether the machine has enough memory to require 5-level tables and
selecting the depth at boot time.
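In pseudo-C, a rough sketch of that boot-time policy (hypothetical only, not
the actual x86 boot code; the helpers are stubs standing in for the early
memory map and command-line parsing, and the 64 TiB figure is the limit the
4-level direct map can cover in Linux):

/* Hypothetical sketch of a boot-time paging-depth choice. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FOUR_LEVEL_DIRECT_MAP_LIMIT (64ULL << 40)   /* 64 TiB */

/* Stubs for illustration; the real values would come from the memory
 * map and kernel command line at early boot. */
static uint64_t max_phys_addr(void)      { return 2ULL << 40; } /* pretend: 2 TiB of RAM */
static bool la57_forced_on_cmdline(void) { return false; }      /* pretend: no override  */

static bool want_5level(void)
{
        /*
         * Only pay the extra page-walk level when 4-level paging cannot
         * reach all of physical memory, or when explicitly forced so that
         * CI setups can still exercise the 5-level paths.
         */
        return la57_forced_on_cmdline() ||
               max_phys_addr() > FOUR_LEVEL_DIRECT_MAP_LIMIT;
}

int main(void)
{
        printf("boot-time choice: %s paging\n",
               want_5level() ? "5-level" : "4-level");
        return 0;
}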
Thanks,
tglx
* Re: [PATCH 0/3] x86: Make 5-level paging support unconditional for x86-64
From: Peter Zijlstra @ 2024-07-31 11:11 UTC (permalink / raw)
To: Thomas Gleixner
Cc: 20240621164406.256314-1-kirill.shutemov, kirill.shutemov, ardb,
bp, brijesh.singh, corbet, dave.hansen, hpa, jan.kiszka, jgross,
kbingham, linux-doc, linux-efi, linux-kernel, linux-mm, luto,
michael.roth, mingo, rick.p.edgecombe, sandipan.das,
thomas.lendacky, x86
On Wed, Jul 31, 2024 at 11:15:05AM +0200, Thomas Gleixner wrote:
> On Wed, Jul 31 2024 at 14:27, Shivank Garg wrote:
> > lmbench:lat_pagefault: Metric - page-fault time (us) - Lower is better
> >
> >             4-Level PT               5-Level PT               % Change
> > THP-never   Mean: 0.4068             Mean: 0.4294              5.56
> >             95% CI: 0.4057-0.4078    95% CI: 0.4287-0.4302
> >
> > THP-Always  Mean: 0.4061             Mean: 0.4288              5.59
> >             95% CI: 0.4051-0.4071    95% CI: 0.4281-0.4295
> >
> > Inference:
> > 5-level page tables show an increase in page-fault latency but do not
> > significantly impact the other benchmarks.
>
> 5% regression on lmbench is a NONO.
>
> 5-level page tables add a cost in every hardware page table walk. That's
> a matter of fact and there is absolutely no reason to inflict this cost
> on everyone.
>
> The solution to this is to make the 5-level mechanics smarter by evaluating
> whether the machine has enough memory to require 5-level tables and
> selecting the depth at boot time.
I gotta mention (again) that it's a pain we can't mix and match like
s390. They run their userspace on 4 levels by default, even if the kernel
runs 5. Only silly daft userspace that needs more than insane amounts of
memory gets 5 levels.
* Re: [PATCH 0/3] x86: Make 5-level paging support unconditional for x86-64
From: Kirill A. Shutemov @ 2024-07-31 11:36 UTC (permalink / raw)
To: Thomas Gleixner, Shivank Garg
Cc: ardb, bp, brijesh.singh, corbet, dave.hansen, hpa, jan.kiszka,
jgross, kbingham, linux-doc, linux-efi, linux-kernel, linux-mm,
luto, michael.roth, mingo, peterz, rick.p.edgecombe,
sandipan.das, thomas.lendacky, x86
On Wed, Jul 31, 2024 at 11:15:05AM +0200, Thomas Gleixner wrote:
> On Wed, Jul 31 2024 at 14:27, Shivank Garg wrote:
> > lmbench:lat_pagefault: Metric - page-fault time (us) - Lower is better
> >
> >             4-Level PT               5-Level PT               % Change
> > THP-never   Mean: 0.4068             Mean: 0.4294              5.56
> >             95% CI: 0.4057-0.4078    95% CI: 0.4287-0.4302
> >
> > THP-Always  Mean: 0.4061             Mean: 0.4288              5.59
> >             95% CI: 0.4051-0.4071    95% CI: 0.4281-0.4295
> >
> > Inference:
> > 5-level page tables show an increase in page-fault latency but do not
> > significantly impact the other benchmarks.
>
> 5% regression on lmbench is a NONO.
Yeah, that's a biggy.
In our testing (on Intel HW), both on bare metal and in VMs, we didn't see
any significant difference between 4- and 5-level paging. But we were
focused on TLB fill latency. Maybe something is wrong in the fault path?
It requires a closer look.
Shivank, could you share how you run lat_pagefault? What file size? How
many instances do you run in parallel?
It would also be nice to get perf traces. Maybe it is purely SW issue.
> 5-level page tables add a cost in every hardware page table walk. That's
> a matter of fact and there is absolutely no reason to inflict this cost
> on everyone.
>
> The solution to this is to make the 5-level mechanics smarter by evaluating
> whether the machine has enough memory to require 5-level tables and
> selecting the depth at boot time.
Let's understand the reason first.
The risk with your proposal is that 5-level paging will not get any
testing and rot over time.
I would like to keep it on, if possible.
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH 0/3] x86: Make 5-level paging support unconditional for x86-64
From: Peter Zijlstra @ 2024-07-31 11:40 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Thomas Gleixner, Shivank Garg, ardb, bp, brijesh.singh, corbet,
dave.hansen, hpa, jan.kiszka, jgross, kbingham, linux-doc,
linux-efi, linux-kernel, linux-mm, luto, michael.roth, mingo,
rick.p.edgecombe, sandipan.das, thomas.lendacky, x86
On Wed, Jul 31, 2024 at 02:36:47PM +0300, Kirill A. Shutemov wrote:
> The risk with your proposal is that 5-level paging will not get any
> testing and rot over time.
>
> I would like to keep it on, if possible.
Well, if it is decided at boot time, you just tell your CI to force-enable
5-level and you're done, right? Then the rest of us use 4 levels and we're
all good.
* Re: [PATCH 0/3] x86: Make 5-level paging support unconditional for x86-64
From: Shivank Garg @ 2024-07-31 17:45 UTC (permalink / raw)
To: Kirill A. Shutemov, Thomas Gleixner
Cc: ardb, bp, brijesh.singh, corbet, dave.hansen, hpa, jan.kiszka,
jgross, kbingham, linux-doc, linux-efi, linux-kernel, linux-mm,
luto, michael.roth, mingo, peterz, rick.p.edgecombe,
sandipan.das, thomas.lendacky, x86
On 7/31/2024 5:06 PM, Kirill A. Shutemov wrote:
> On Wed, Jul 31, 2024 at 11:15:05AM +0200, Thomas Gleixner wrote:
>> On Wed, Jul 31 2024 at 14:27, Shivank Garg wrote:
>>> lmbench:lat_pagefault: Metric - page-fault time (us) - Lower is better
>>>
>>>             4-Level PT               5-Level PT               % Change
>>> THP-never   Mean: 0.4068             Mean: 0.4294              5.56
>>>             95% CI: 0.4057-0.4078    95% CI: 0.4287-0.4302
>>>
>>> THP-Always  Mean: 0.4061             Mean: 0.4288              5.59
>>>             95% CI: 0.4051-0.4071    95% CI: 0.4281-0.4295
>>>
>>> Inference:
>>> 5-level page tables show an increase in page-fault latency but do not
>>> significantly impact the other benchmarks.
>>
>> 5% regression on lmbench is a NONO.
>
> Yeah, that's a biggy.
>
> In our testing (on Intel HW) we didn't see any significant difference
> between 4- and 5-level paging. But we were focused on TLB fill latency.
> In both bare metal and in VMs. Maybe something wrong in the fault path?
>
> It requires a closer look.
>
> Shivank, could you share how you run lat_pagefault? What file size? How
> parallel you run it?...
Hi Kirill,
I got lmbench from here:
https://github.com/foss-for-synopsys-dwc-arc-processors/lmbench/blob/master/src/lat_pagefault.c
and I am running it with this command:
numactl --membind=1 --cpunodebind=1 bin/x86_64-linux-gnu/lat_pagefault -N 100 1GB_dev_urandom_file
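For context, the measurement is roughly the following loop (a simplified
sketch, not the actual lmbench source): the file is mmap()ed, one word per
page is touched so that each access takes a minor fault on the file-backed,
page-cache pages, and the elapsed time is divided by the number of pages.

/* Simplified sketch of the lat_pagefault measurement - not lmbench itself. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc < 2)
                return 1;

        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        long psz = sysconf(_SC_PAGESIZE);
        size_t npages = st.st_size / psz;

        /* Map the file read-only; each first touch of a page below takes a
         * minor fault that maps the page-cache page into the process. */
        volatile char *p = (volatile char *)mmap(NULL, st.st_size, PROT_READ,
                                                 MAP_PRIVATE, fd, 0);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < npages; i++)
                (void)p[i * psz];          /* one access per page => one fault */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("%.4f us per page fault\n", us / npages);

        munmap((void *)p, st.st_size);
        close(fd);
        return 0;
}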
>
> It would also be nice to get perf traces. Maybe it is purely SW issue.
>
4-level-page-table:
- 52.31% benchmark
- 49.52% asm_exc_page_fault
- 49.35% exc_page_fault
- 48.36% do_user_addr_fault
- 46.15% handle_mm_fault
- 44.59% __handle_mm_fault
- 42.95% do_fault
- 40.89% filemap_map_pages
- 28.30% set_pte_range
- 23.70% folio_add_file_rmap_ptes
- 14.30% __lruvec_stat_mod_folio
- 10.12% __mod_lruvec_state
- 5.70% __mod_memcg_lruvec_state
0.60% cgroup_rstat_updated
1.06% __mod_node_page_state
2.84% __rcu_read_unlock
0.76% srso_alias_safe_ret
0.84% set_ptes.isra.0
- 5.48% next_uptodate_folio
- 1.19% xas_find
0.96% xas_load
1.00% set_ptes.isra.0
1.22% lock_vma_under_rcu
5-level-page-table:
- 52.75% benchmark
- 50.04% asm_exc_page_fault
- 49.90% exc_page_fault
- 48.91% do_user_addr_fault
- 46.74% handle_mm_fault
- 45.27% __handle_mm_fault
- 43.30% do_fault
- 41.58% filemap_map_pages
- 28.04% set_pte_range
- 22.77% folio_add_file_rmap_ptes
- 17.74% __lruvec_stat_mod_folio
- 10.89% __mod_lruvec_state
- 5.97% __mod_memcg_lruvec_state
1.94% cgroup_rstat_updated
1.09% __mod_node_page_state
0.56% __mod_node_page_state
2.28% __rcu_read_unlock
1.08% set_ptes.isra.0
- 5.94% next_uptodate_folio
- 1.13% xas_find
0.99% xas_load
1.13% srso_alias_safe_ret
0.52% set_ptes.isra.0
1.16% lock_vma_under_rcu
>> 5-level page tables add a cost in every hardware page table walk. That's
>> a matter of fact and there is absolutely no reason to inflict this cost
>> on everyone.
>>
>> The solution to this is to make the 5-level mechanics smarter by evaluating
>> whether the machine has enough memory to require 5-level tables and
>> selecting the depth at boot time.
>
> Let's understand the reason first.
Sure, please let me know how I can help debug this.
Thanks,
Shivank
>
> The risk with your proposal is that 5-level paging will not get any
> testing and rot over time.
>
> I would like to keep it on, if possible.
>
* Re: [PATCH 0/3] x86: Make 5-level paging support unconditional for x86-64
From: Dave Hansen @ 2024-10-31 15:36 UTC (permalink / raw)
To: Shivank Garg, Kirill A. Shutemov, Thomas Gleixner
Cc: ardb, bp, brijesh.singh, corbet, dave.hansen, hpa, jan.kiszka,
jgross, kbingham, linux-doc, linux-efi, linux-kernel, linux-mm,
luto, michael.roth, mingo, peterz, rick.p.edgecombe,
sandipan.das, thomas.lendacky, x86
On 7/31/24 10:45, Shivank Garg wrote:
> It would also be nice to get perf traces. Maybe it is purely SW issue.
Cycle counts aren't going to help much here. For instance, if 5-level
paging makes *ALL* TLB misses slower, you would just see a regression in
any code that misses the TLB, which could show up all over.
On Intel we have some PMU events like this:
dtlb_store_misses.walk_active
[Cycles when at least one PMH is busy
with a page walk for a store]
(there's a load side one as well). If a page walk gets more expensive,
you can see it there. Note that this doesn't actually tell you how much
time the core spent _waiting_ for a page walk to complete. If all the
speculation magic works perfectly in your favor, you could have the PMH
busy 100% of cycles but never have the core waiting on it.
So could we drill down a level in the "perf traces" please, and gather
some of the relevant performance counters and not just cycles?
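A minimal sketch of one way to collect that from the benchmark itself
(something like `perf stat -e dtlb_store_misses.walk_active -- <benchmark>`
would also work from the command line). RAW_WALK_ACTIVE_CONFIG below is a
placeholder: the raw encoding of the walk-active event is model-specific and
would need to be looked up (e.g. via `perf list --details` or libpfm4) before
use; the stride loop is only a stand-in workload.

/* Sketch: count "page-walker busy" cycles around a workload via
 * perf_event_open(). */
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define RAW_WALK_ACTIVE_CONFIG 0x0   /* placeholder - fill in for your CPU */

static int open_counter(uint64_t config)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_RAW;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_hv = 1;

        /* measure this process on any CPU */
        return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
        int fd = open_counter(RAW_WALK_ACTIVE_CONFIG);
        if (fd < 0) {
                perror("perf_event_open");
                return 1;
        }

        /* stand-in workload: stride through 1 GiB to generate TLB misses */
        size_t len = 1UL << 30;
        volatile char *buf = malloc(len);
        if (!buf)
                return 1;

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        for (size_t i = 0; i < len; i += 4096)
                buf[i] = 1;
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t walk_cycles = 0;
        read(fd, &walk_cycles, sizeof(walk_cycles));
        printf("walk-active cycles: %llu\n", (unsigned long long)walk_cycles);

        free((void *)buf);
        close(fd);
        return 0;
}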