* Re: [PATCH 0/3] x86: Make 5-level paging support unconditional for x86-64
From: Shivank Garg @ 2024-07-31 8:57 UTC (permalink / raw)
To: kirill.shutemov
Cc: ardb, bp, brijesh.singh, corbet, dave.hansen, hpa, jan.kiszka,
jgross, kbingham, linux-doc, linux-efi, linux-kernel, linux-mm,
luto, michael.roth, mingo, peterz, rick.p.edgecombe,
sandipan.das, tglx, thomas.lendacky, x86
I did some experiments to understand the impact of making 5-level page tables
the default.
Machine Info: AMD Zen 4 EPYC server (2-socket system, 128 cores and 1 NUMA
node per socket, SMT enabled). Each NUMA node has approximately 377 GB of
memory.
For the experiments, I bind the benchmark to the CPUs and memory node of a
single socket for consistent results. The two configurations are measured by
enabling/disabling 5-level page tables via CONFIG_X86_5LEVEL.
% Change: (5L-4L)/4L*100
CoV (%): Coefficient of Variation (%)
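For reference, this is roughly how the reported statistics can be derived from
per-run samples (a minimal standalone C sketch, not the actual analysis
scripts; the sample arrays are placeholders and the 95% CI uses a normal
approximation with the 1.96 factor):

/* stats.c - sketch of how Mean, 95% CI, CoV and % Change are computed.
 * Build: gcc -O2 -o stats stats.c -lm
 * The sample values below are placeholders, not the measured data.
 */
#include <math.h>
#include <stdio.h>

struct stats { double mean, ci_lo, ci_hi, cov; };

static struct stats summarize(const double *x, int n)
{
        double sum = 0.0, var = 0.0;
        for (int i = 0; i < n; i++)
                sum += x[i];
        double mean = sum / n;
        for (int i = 0; i < n; i++)
                var += (x[i] - mean) * (x[i] - mean);
        double sd  = sqrt(var / (n - 1));     /* sample standard deviation */
        double sem = sd / sqrt((double)n);    /* standard error of the mean */
        struct stats s = {
                .mean  = mean,
                .ci_lo = mean - 1.96 * sem,   /* 95% CI, normal approximation */
                .ci_hi = mean + 1.96 * sem,
                .cov   = sd / mean * 100.0,   /* CoV (%) */
        };
        return s;
}

int main(void)
{
        /* placeholder runs for the 4-level and 5-level configurations */
        double l4[] = { 0.406, 0.407, 0.408, 0.406, 0.407 };
        double l5[] = { 0.429, 0.430, 0.429, 0.428, 0.430 };
        struct stats s4 = summarize(l4, 5), s5 = summarize(l5, 5);

        printf("4L: mean %.4f CI %.4f-%.4f CoV %.3f%%\n",
               s4.mean, s4.ci_lo, s4.ci_hi, s4.cov);
        printf("5L: mean %.4f CI %.4f-%.4f CoV %.3f%%\n",
               s5.mean, s5.ci_lo, s5.ci_hi, s5.cov);
        /* % Change: (5L - 4L) / 4L * 100 */
        printf("%% Change: %.2f\n", (s5.mean - s4.mean) / s4.mean * 100.0);
        return 0;
}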
Results:
lmbench:lat_pagefault: Metric - page-fault time (us) - Lower is better

            4-Level PT               5-Level PT               % Change
THP-never   Mean: 0.4068             Mean: 0.4294              5.56
            95% CI: 0.4057-0.4078    95% CI: 0.4287-0.4302

THP-Always  Mean: 0.4061             Mean: 0.4288              5.59
            95% CI: 0.4051-0.4071    95% CI: 0.4281-0.4295
Btree (Threads: 32): Metric - Time Taken (in seconds) - Lower is better

             4-Level                   5-Level
             Time Taken(s)  CoV (%)    Time Taken(s)  CoV (%)   % Change
THP Never    382.2          0.219      388.8          1.019      1.73
THP Madvise  383.0          0.261      384.8          0.809      0.47
THP Always   392.8          1.376      386.4          2.147     -1.63
Btree (Threads: 256): Metric - Time Taken (in seconds) - Lower is better

             4-Level                   5-Level
             Time Taken(s)  CoV (%)    Time Taken(s)  CoV (%)   % Change
THP Never    56.6           2.014      55.2           0.810     -2.47
THP Madvise  56.6           2.014      56.4           2.022     -0.35
THP Always   56.6           0.968      56.2           1.489     -0.71
Ebizzy: Metric - records/s - Higher is better

           4-Level                  5-Level
Threads    records/s    CoV (%)     records/s    CoV (%)    % Change
1          844          0.302       837          0.196      -0.85
256        10160        0.315       10288        1.081       1.26
XSBench (Thread:256, THP:Never) - Higher is better
Metric 4-Level 5-Level % Change
Lookups/s 13720556 13396288 -2.36
CoV (%) 1.726 1.317
Hashjoin (Thread:256, THP:Never) - Lower is better
Metric 4-Level 5-Level % Change
Time taken(s) 424.4 427.4 0.707
CoV (%) 0.394 0.209
Graph500(Thread:256, THP:Madvise) - Lower is better
Metric 4-Level 5-Level % Change
Time Taken(s) 0.1879 0.1873 -0.32
CoV (%) 0.165 0.213
GUPS(Thread:128, THP:Madvise) - Higher is better
Metric 4-Level 5-Level % Change
GUPS 1.3265 1.3252 -0.10
CoV (%) 0.037 0.027
pagerank(Thread:256, THP:Madvise) - Lower is better
Metric 4-Level 5-Level % Change
Time taken(s) 143.67 143.67 0.00
CoV (%) 0.402 0.402
Redis(Thread:256, THP:Madvise) - Higher is better
Metric 4-Level 5-Level % Change
Throughput(Ops/s) 141030744 139586376 -1.02
CoV (%) 0.372 0.561
memcached(Thread:256, THP:Madvise) - Higher is better
Metric 4-Level 5-Level % Change
Throughput(Ops/s) 19916313 19743637 -0.87
CoV (%) 0.051 0.095
Inference:
5-level page tables show an increase in page-fault latency but do not
significantly impact the other benchmarks.
Thanks,
Shivank
* Re: [PATCH 0/3] x86: Make 5-level paging support unconditional for x86-64
From: Thomas Gleixner @ 2024-07-31 9:15 UTC (permalink / raw)
To: 20240621164406.256314-1-kirill.shutemov, kirill.shutemov
Cc: ardb, bp, brijesh.singh, corbet, dave.hansen, hpa, jan.kiszka,
jgross, kbingham, linux-doc, linux-efi, linux-kernel, linux-mm,
luto, michael.roth, mingo, peterz, rick.p.edgecombe,
sandipan.das, thomas.lendacky, x86
On Wed, Jul 31 2024 at 14:27, Shivank Garg wrote:
> lmbench:lat_pagefault: Metric - page-fault time (us) - Lower is better
>
>             4-Level PT               5-Level PT               % Change
> THP-never   Mean: 0.4068             Mean: 0.4294              5.56
>             95% CI: 0.4057-0.4078    95% CI: 0.4287-0.4302
>
> THP-Always  Mean: 0.4061             Mean: 0.4288              5.59
>             95% CI: 0.4051-0.4071    95% CI: 0.4281-0.4295
>
> Inference:
> 5-level page tables show an increase in page-fault latency but do not
> significantly impact the other benchmarks.
5% regression on lmbench is a NONO.
5-level page tables add a cost in every hardware page table walk. That's
a matter of fact and there is absolutely no reason to inflict this cost
on everyone.
The solution to this is to make the 5-level mechanics smarter by evaluating
whether the machine has enough memory to require 5-level tables and
selecting the depth at boot time.
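In pseudo-C, a rough sketch of that boot-time policy (hypothetical only, not
the actual x86 boot code; the helpers are stubs standing in for the early
memory map and command-line parsing, and the 64 TiB figure is the limit the
4-level direct map can cover in Linux):

/* Hypothetical sketch of a boot-time paging-depth choice. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FOUR_LEVEL_DIRECT_MAP_LIMIT (64ULL << 40)   /* 64 TiB */

/* Stubs for illustration; the real values would come from the memory
 * map and kernel command line at early boot. */
static uint64_t max_phys_addr(void)      { return 2ULL << 40; } /* pretend: 2 TiB of RAM */
static bool la57_forced_on_cmdline(void) { return false; }      /* pretend: no override  */

static bool want_5level(void)
{
        /*
         * Only pay the extra page-walk level when 4-level paging cannot
         * reach all of physical memory, or when explicitly forced so that
         * CI setups can still exercise the 5-level paths.
         */
        return la57_forced_on_cmdline() ||
               max_phys_addr() > FOUR_LEVEL_DIRECT_MAP_LIMIT;
}

int main(void)
{
        printf("boot-time choice: %s paging\n",
               want_5level() ? "5-level" : "4-level");
        return 0;
}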
Thanks,
tglx
* Re: [PATCH 0/3] x86: Make 5-level paging support unconditional for x86-64
From: Peter Zijlstra @ 2024-07-31 11:11 UTC (permalink / raw)
To: Thomas Gleixner
Cc: 20240621164406.256314-1-kirill.shutemov, kirill.shutemov, ardb,
bp, brijesh.singh, corbet, dave.hansen, hpa, jan.kiszka, jgross,
kbingham, linux-doc, linux-efi, linux-kernel, linux-mm, luto,
michael.roth, mingo, rick.p.edgecombe, sandipan.das,
thomas.lendacky, x86
On Wed, Jul 31, 2024 at 11:15:05AM +0200, Thomas Gleixner wrote:
> On Wed, Jul 31 2024 at 14:27, Shivank Garg wrote:
> > lmbench:lat_pagefault: Metric - page-fault time (us) - Lower is better
> >
> >             4-Level PT               5-Level PT               % Change
> > THP-never   Mean: 0.4068             Mean: 0.4294              5.56
> >             95% CI: 0.4057-0.4078    95% CI: 0.4287-0.4302
> >
> > THP-Always  Mean: 0.4061             Mean: 0.4288              5.59
> >             95% CI: 0.4051-0.4071    95% CI: 0.4281-0.4295
> >
> > Inference:
> > 5-level page tables show an increase in page-fault latency but do not
> > significantly impact the other benchmarks.
>
> 5% regression on lmbench is a NONO.
>
> 5-level page tables add a cost in every hardware page table walk. That's
> a matter of fact and there is absolutely no reason to inflict this cost
> on everyone.
>
> The solution to this is to make the 5-level mechanics smarter by evaluating
> whether the machine has enough memory to require 5-level tables and
> selecting the depth at boot time.
I gotta mention (again) that it's a pain we can't mix and match like
s390. They run their userspace on 4 levels by default, even if the kernel
runs 5. Only silly daft userspace that needs more than insane amounts of
memory gets 5 levels.
* Re: [PATCH 0/3] x86: Make 5-level paging support unconditional for x86-64
From: Kirill A. Shutemov @ 2024-07-31 11:36 UTC (permalink / raw)
To: Thomas Gleixner, Shivank Garg
Cc: ardb, bp, brijesh.singh, corbet, dave.hansen, hpa, jan.kiszka,
jgross, kbingham, linux-doc, linux-efi, linux-kernel, linux-mm,
luto, michael.roth, mingo, peterz, rick.p.edgecombe,
sandipan.das, thomas.lendacky, x86
On Wed, Jul 31, 2024 at 11:15:05AM +0200, Thomas Gleixner wrote:
> On Wed, Jul 31 2024 at 14:27, Shivank Garg wrote:
> > lmbench:lat_pagefault: Metric - page-fault time (us) - Lower is better
> >
> >             4-Level PT               5-Level PT               % Change
> > THP-never   Mean: 0.4068             Mean: 0.4294              5.56
> >             95% CI: 0.4057-0.4078    95% CI: 0.4287-0.4302
> >
> > THP-Always  Mean: 0.4061             Mean: 0.4288              5.59
> >             95% CI: 0.4051-0.4071    95% CI: 0.4281-0.4295
> >
> > Inference:
> > 5-level page tables show an increase in page-fault latency but do not
> > significantly impact the other benchmarks.
>
> 5% regression on lmbench is a NONO.
Yeah, that's a biggy.
In our testing (on Intel HW), both on bare metal and in VMs, we didn't see
any significant difference between 4- and 5-level paging. But we were
focused on TLB fill latency. Maybe something is wrong in the fault path?
It requires a closer look.
Shivank, could you share how you run lat_pagefault? What file size? How
many instances do you run in parallel?
It would also be nice to get perf traces. Maybe it is purely SW issue.
> 5-level page tables add a cost in every hardware page table walk. That's
> a matter of fact and there is absolutely no reason to inflict this cost
> on everyone.
>
> The solution to this is to make the 5-level mechanics smarter by evaluating
> whether the machine has enough memory to require 5-level tables and
> selecting the depth at boot time.
Let's understand the reason first.
The risk with your proposal is that 5-level paging will not get any
testing and rot over time.
I would like to keep it on, if possible.
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH 0/3] x86: Make 5-level paging support unconditional for x86-64
From: Peter Zijlstra @ 2024-07-31 11:40 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Thomas Gleixner, Shivank Garg, ardb, bp, brijesh.singh, corbet,
dave.hansen, hpa, jan.kiszka, jgross, kbingham, linux-doc,
linux-efi, linux-kernel, linux-mm, luto, michael.roth, mingo,
rick.p.edgecombe, sandipan.das, thomas.lendacky, x86
On Wed, Jul 31, 2024 at 02:36:47PM +0300, Kirill A. Shutemov wrote:
> The risk with your proposal is that 5-level paging will not get any
> testing and rot over time.
>
> I would like to keep it on, if possible.
Well, if it is decided at boot time, you just tell your CI to force-enable
5-level and you're done, right? Then the rest of us use 4 levels and we're
all good.
* Re: [PATCH 0/3] x86: Make 5-level paging support unconditional for x86-64
From: Shivank Garg @ 2024-07-31 17:45 UTC (permalink / raw)
To: Kirill A. Shutemov, Thomas Gleixner
Cc: ardb, bp, brijesh.singh, corbet, dave.hansen, hpa, jan.kiszka,
jgross, kbingham, linux-doc, linux-efi, linux-kernel, linux-mm,
luto, michael.roth, mingo, peterz, rick.p.edgecombe,
sandipan.das, thomas.lendacky, x86
On 7/31/2024 5:06 PM, Kirill A. Shutemov wrote:
> On Wed, Jul 31, 2024 at 11:15:05AM +0200, Thomas Gleixner wrote:
>> On Wed, Jul 31 2024 at 14:27, Shivank Garg wrote:
>>> lmbench:lat_pagefault: Metric - page-fault time (us) - Lower is better
>>>
>>>             4-Level PT               5-Level PT               % Change
>>> THP-never   Mean: 0.4068             Mean: 0.4294              5.56
>>>             95% CI: 0.4057-0.4078    95% CI: 0.4287-0.4302
>>>
>>> THP-Always  Mean: 0.4061             Mean: 0.4288              5.59
>>>             95% CI: 0.4051-0.4071    95% CI: 0.4281-0.4295
>>>
>>> Inference:
>>> 5-level page tables show an increase in page-fault latency but do not
>>> significantly impact the other benchmarks.
>>
>> 5% regression on lmbench is a NONO.
>
> Yeah, that's a biggy.
>
> In our testing (on Intel HW) we didn't see any significant difference
> between 4- and 5-level paging. But we were focused on TLB fill latency.
> In both bare metal and in VMs. Maybe something wrong in the fault path?
>
> It requires a closer look.
>
> Shivank, could you share how you run lat_pagefault? What file size? How
> parallel you run it?...
Hi Kirill,
I got lmbench from here:
https://github.com/foss-for-synopsys-dwc-arc-processors/lmbench/blob/master/src/lat_pagefault.c
and I am running it with this command:
numactl --membind=1 --cpunodebind=1 bin/x86_64-linux-gnu/lat_pagefault -N 100 1GB_dev_urandom_file
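For context, the measurement is roughly the following loop (a simplified
sketch, not the actual lmbench source): the file is mmap()ed, one word per
page is touched so that each access takes a minor fault on the file-backed,
page-cache pages, and the elapsed time is divided by the number of pages.

/* Simplified sketch of the lat_pagefault measurement - not lmbench itself. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc < 2)
                return 1;

        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        long psz = sysconf(_SC_PAGESIZE);
        size_t npages = st.st_size / psz;

        /* Map the file read-only; each first touch of a page below takes a
         * minor fault that maps the page-cache page into the process. */
        volatile char *p = (volatile char *)mmap(NULL, st.st_size, PROT_READ,
                                                 MAP_PRIVATE, fd, 0);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < npages; i++)
                (void)p[i * psz];          /* one access per page => one fault */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("%.4f us per page fault\n", us / npages);

        munmap((void *)p, st.st_size);
        close(fd);
        return 0;
}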
>
> It would also be nice to get perf traces. Maybe it is purely SW issue.
>
4-level-page-table:
- 52.31% benchmark
- 49.52% asm_exc_page_fault
- 49.35% exc_page_fault
- 48.36% do_user_addr_fault
- 46.15% handle_mm_fault
- 44.59% __handle_mm_fault
- 42.95% do_fault
- 40.89% filemap_map_pages
- 28.30% set_pte_range
- 23.70% folio_add_file_rmap_ptes
- 14.30% __lruvec_stat_mod_folio
- 10.12% __mod_lruvec_state
- 5.70% __mod_memcg_lruvec_state
0.60% cgroup_rstat_updated
1.06% __mod_node_page_state
2.84% __rcu_read_unlock
0.76% srso_alias_safe_ret
0.84% set_ptes.isra.0
- 5.48% next_uptodate_folio
- 1.19% xas_find
0.96% xas_load
1.00% set_ptes.isra.0
1.22% lock_vma_under_rcu
5-level-page-table:
- 52.75% benchmark
- 50.04% asm_exc_page_fault
- 49.90% exc_page_fault
- 48.91% do_user_addr_fault
- 46.74% handle_mm_fault
- 45.27% __handle_mm_fault
- 43.30% do_fault
- 41.58% filemap_map_pages
- 28.04% set_pte_range
- 22.77% folio_add_file_rmap_ptes
- 17.74% __lruvec_stat_mod_folio
- 10.89% __mod_lruvec_state
- 5.97% __mod_memcg_lruvec_state
1.94% cgroup_rstat_updated
1.09% __mod_node_page_state
0.56% __mod_node_page_state
2.28% __rcu_read_unlock
1.08% set_ptes.isra.0
- 5.94% next_uptodate_folio
- 1.13% xas_find
0.99% xas_load
1.13% srso_alias_safe_ret
0.52% set_ptes.isra.0
1.16% lock_vma_under_rcu
>> 5-level page tables add a cost in every hardware page table walk. That's
>> a matter of fact and there is absolutely no reason to inflict this cost
>> on everyone.
>>
>> The solution to this is to make the 5-level mechanics smarter by evaluating
>> whether the machine has enough memory to require 5-level tables and
>> selecting the depth at boot time.
>
> Let's understand the reason first.
Sure, please let me know how I can help debug this.
Thanks,
Shivank
>
> The risk with your proposal is that 5-level paging will not get any
> testing and rot over time.
>
> I would like to keep it on, if possible.
>
* Re: [PATCH 0/3] x86: Make 5-level paging support unconditional for x86-64
From: Dave Hansen @ 2024-10-31 15:36 UTC (permalink / raw)
To: Shivank Garg, Kirill A. Shutemov, Thomas Gleixner
Cc: ardb, bp, brijesh.singh, corbet, dave.hansen, hpa, jan.kiszka,
jgross, kbingham, linux-doc, linux-efi, linux-kernel, linux-mm,
luto, michael.roth, mingo, peterz, rick.p.edgecombe,
sandipan.das, thomas.lendacky, x86
On 7/31/24 10:45, Shivank Garg wrote:
> It would also be nice to get perf traces. Maybe it is purely SW issue.
Cycle counts aren't going to help much here. For instance, if 5-level
paging makes *ALL* TLB misses slower, you would just see a regression in
any code that misses the TLB, which could show up all over.
On Intel we have some PMU events like this:
dtlb_store_misses.walk_active
[Cycles when at least one PMH is busy
with a page walk for a store]
(there's a load side one as well). If a page walk gets more expensive,
you can see it there. Note that this doesn't actually tell you how much
time the core spent _waiting_ for a page walk to complete. If all the
speculation magic works perfectly in your favor, you could have the PMH
busy 100% of cycles but never have the core waiting on it.
So could we drill down a level in the "perf traces" please, and gather
some of the relevant performance counters and not just cycles?
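A minimal sketch of one way to collect that from the benchmark itself
(something like `perf stat -e dtlb_store_misses.walk_active -- <benchmark>`
would also work from the command line). RAW_WALK_ACTIVE_CONFIG below is a
placeholder: the raw encoding of the walk-active event is model-specific and
would need to be looked up (e.g. via `perf list --details` or libpfm4) before
use; the stride loop is only a stand-in workload.

/* Sketch: count "page-walker busy" cycles around a workload via
 * perf_event_open(). */
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define RAW_WALK_ACTIVE_CONFIG 0x0   /* placeholder - fill in for your CPU */

static int open_counter(uint64_t config)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_RAW;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_hv = 1;

        /* measure this process on any CPU */
        return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
        int fd = open_counter(RAW_WALK_ACTIVE_CONFIG);
        if (fd < 0) {
                perror("perf_event_open");
                return 1;
        }

        /* stand-in workload: stride through 1 GiB to generate TLB misses */
        size_t len = 1UL << 30;
        volatile char *buf = malloc(len);
        if (!buf)
                return 1;

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        for (size_t i = 0; i < len; i += 4096)
                buf[i] = 1;
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t walk_cycles = 0;
        read(fd, &walk_cycles, sizeof(walk_cycles));
        printf("walk-active cycles: %llu\n", (unsigned long long)walk_cycles);

        free((void *)buf);
        close(fd);
        return 0;
}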