From: David Nellans <david@nellans.org>
To: Dave Hansen <dave@sr71.net>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
"hpa@zytor.com" <hpa@zytor.com>,
"mingo@redhat.com" <mingo@redhat.com>,
"tglx@linutronix.de" <tglx@linutronix.de>,
"x86@kernel.org" <x86@kernel.org>,
"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
"riel@redhat.com" <riel@redhat.com>,
"mgorman@suse.de" <mgorman@suse.de>
Subject: Re: [PATCH 7/7] x86: mm: set TLB flush tunable to sane value (33)
Date: Wed, 02 Jul 2014 13:16:58 -0500 [thread overview]
Message-ID: <53B44C9A.9070808@nellans.org> (raw)
In-Reply-To: <20140701164856.3020D644@viggo.jf.intel.com>
On 07/01/2014 11:48 AM, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> This has been run through Intel's LKP tests across a wide range
> of modern systems and workloads, and it wasn't shown to make a
> measurable performance difference, positive or negative.
>
> Now that we have some shiny new tracepoints, we can actually
> figure out what the heck is going on.
>
> During a kernel compile, 60% of the flush_tlb_mm_range() calls
> are for a single page. It breaks down like this (flush size in
> pages, share of all flushes, cumulative share, average cost):
>
> size percent percent<=
>   V     V        V
> GLOBAL: 2.20% 2.20% avg cycles: 2283
> 1: 56.92% 59.12% avg cycles: 1276
> 2: 13.78% 72.90% avg cycles: 1505
> 3: 8.26% 81.16% avg cycles: 1880
> 4: 7.41% 88.58% avg cycles: 2447
> 5: 1.73% 90.31% avg cycles: 2358
> 6: 1.32% 91.63% avg cycles: 2563
> 7: 1.14% 92.77% avg cycles: 2862
> 8: 0.62% 93.39% avg cycles: 3542
> 9: 0.08% 93.47% avg cycles: 3289
> 10: 0.43% 93.90% avg cycles: 3570
> 11: 0.20% 94.10% avg cycles: 3767
> 12: 0.08% 94.18% avg cycles: 3996
> 13: 0.03% 94.20% avg cycles: 4077
> 14: 0.02% 94.23% avg cycles: 4836
> 15: 0.04% 94.26% avg cycles: 5699
> 16: 0.06% 94.32% avg cycles: 5041
> 17: 0.57% 94.89% avg cycles: 5473
> 18: 0.02% 94.91% avg cycles: 5396
> 19: 0.03% 94.95% avg cycles: 5296
> 20: 0.02% 94.96% avg cycles: 6749
> 21: 0.18% 95.14% avg cycles: 6225
> 22: 0.01% 95.15% avg cycles: 6393
> 23: 0.01% 95.16% avg cycles: 6861
> 24: 0.12% 95.28% avg cycles: 6912
> 25: 0.05% 95.32% avg cycles: 7190
> 26: 0.01% 95.33% avg cycles: 7793
> 27: 0.01% 95.34% avg cycles: 7833
> 28: 0.01% 95.35% avg cycles: 8253
> 29: 0.08% 95.42% avg cycles: 8024
> 30: 0.03% 95.45% avg cycles: 9670
> 31: 0.01% 95.46% avg cycles: 8949
> 32: 0.01% 95.46% avg cycles: 9350
> 33: 3.11% 98.57% avg cycles: 8534
> 34: 0.02% 98.60% avg cycles: 10977
> 35: 0.02% 98.62% avg cycles: 11400
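
(A rough sketch of how a count-by-size breakdown like the one above
can be collected, assuming the tlb:tlb_flush tracepoint from patch 5/7
is enabled and that its output carries the flushed page count as a
"pages:<n>" field -- the field name is an assumption here. The avg
cycles column would additionally need timestamp deltas, which this
sketch does not attempt:)

#include <stdio.h>
#include <string.h>

int main(void)
{
	/*
	 * Read a snapshot of the ftrace buffer, taken after running the
	 * workload with the event enabled:
	 *   echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable
	 */
	FILE *f = fopen("/sys/kernel/debug/tracing/trace", "r");
	static unsigned long hist[513];
	char line[512];
	int i;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		char *p = strstr(line, "pages:");
		long pages;

		if (!p || sscanf(p, "pages:%ld", &pages) != 1)
			continue;
		if (pages < 0 || pages > 512)
			pages = 512;	/* one bucket for full/huge flushes */
		hist[pages]++;
	}
	fclose(f);
	for (i = 0; i <= 512; i++)
		if (hist[i])
			printf("%3d: %lu flushes\n", i, hist[i]);
	return 0;
}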
>
> We get into diminishing returns pretty quickly. On pre-IvyBridge
> CPUs, we used to set the limit at 8 pages, and it was set at 128
> on IvyBridge. That 128 number looks pretty silly considering that
> less than 0.5% of the flushes are that large.
>
> The previous code tried to size this number based on the size of
> the TLB. Good idea, but it's error-prone, needs maintenance
> (which it didn't get up to now), and probably would not matter
> much in practice.
>
> Setting it to 33 means that we cover the mallopt
> M_TRIM_THRESHOLD, which is the most common size at which
> flushes are done.
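
(For reference, the knob in question, in a minimal userspace sketch:
128KiB is exactly 32 4KiB pages, which lines up with the 33-page spike
in the table above. The M_MMAP_THRESHOLD tweak is only there to keep
the test allocation on the heap instead of in its own mmap():)

#include <malloc.h>	/* mallopt(), M_TRIM_THRESHOLD: glibc-specific */
#include <stdlib.h>
#include <string.h>

int main(void)
{
	/* Blocks above M_MMAP_THRESHOLD are mmap()ed directly; raise it
	 * so the allocation below comes from the heap. */
	mallopt(M_MMAP_THRESHOLD, 1024 * 1024);
	/* glibc's default trim threshold: 128KiB == 32 x 4KiB pages.
	 * Free space at the top of the heap beyond this is returned to
	 * the kernel, and the unmapped range has to be flushed. */
	mallopt(M_TRIM_THRESHOLD, 128 * 1024);

	char *p = malloc(256 * 1024);
	if (!p)
		return 1;
	memset(p, 0, 256 * 1024);	/* fault the pages in */
	free(p);	/* heap shrinks past the threshold -> range flush */
	return 0;
}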
>
> That's the short version. Here's the long one for why I chose 33:
>
> 1. These numbers have a constant bias in the timestamps from the
> tracing. It probably accounts for a couple hundred cycles in each
> of these tests, but it should be fairly _even_ across all of them.
> The smallest delta between the tracepoints I have ever seen is
> 335 cycles. This is one reason the cycles/page cost goes down in
> general as the flushes get larger. The true cost is nearer to
> 100 cycles.
> 2. A full flush is more expensive than a single invlpg, but not
> by much (single-digit percentages).
> 3. A dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
> (~34 cycles). At those rates, refilling the 512-entry dTLB takes
> ~22,000 cycles (the arithmetic is worked through after the
> counter output below).
> 4. 22,000 cycles is approximately the equivalent of doing 85
> invlpg operations. But the odds are that the TLB can actually
> be refilled faster than that, because TLB misses that are close
> in time also tend to hit in the same caches.
> 5. ~98% of flushes are <=33 pages. There are a lot of flushes of
> exactly 33 pages, probably because libc's M_TRIM_THRESHOLD is
> set to 128k (32 pages).
> 6. I've found no consistent data to support changing the IvyBridge
> vs. SandyBridge tunable by a factor of 16.
>
> I used the performance counters on this hardware (IvyBridge i5-3320M)
> to figure out the tlb miss costs:
>
> ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush
>
> 7,720,030,970 dtlb_load_misses_walk_duration [57.13%]
> 169,856,353 dtlb_load_misses_walk_completed [57.15%]
> 708,832,859 dtlb_store_misses_walk_duration [57.17%]
> 19,346,823 dtlb_store_misses_walk_completed [57.17%]
> 2,779,687,402 itlb_misses_walk_duration [57.15%]
> 82,241,148 itlb_misses_walk_completed [57.13%]
> 770,717 itlb_itlb_flush [57.11%]
>
> These show that a dtlb miss is 17.1ns (~45 cycles) and an itlb
> miss is 13.0ns (~34 cycles). At those rates, refilling the
> 512-entry dTLB takes ~22,000 cycles. On a SandyBridge system with
> more cores and larger caches, the numbers are dtlb=13.4ns and
> itlb=9.5ns.
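
(Those per-miss figures fall straight out of the counters above:
walk_duration counts cycles, so cycles-per-miss is walk_duration /
walk_completed, with load and store walks combined for the dTLB. The
nanosecond figures assume the i5-3320M's 2.6GHz nominal clock, which
is an assumption on my part. A minimal check:)

#include <stdio.h>

int main(void)
{
	/* Raw counts from the ocperf run quoted above. */
	double dload_cyc  = 7720030970.0, dload_n  = 169856353.0;
	double dstore_cyc = 708832859.0,  dstore_n = 19346823.0;
	double iwalk_cyc  = 2779687402.0, iwalk_n  = 82241148.0;
	double ghz = 2.6;	/* assumed: i5-3320M nominal clock */

	/* Combine load+store walks for the overall dTLB miss cost. */
	double dtlb = (dload_cyc + dstore_cyc) / (dload_n + dstore_n);
	double itlb = iwalk_cyc / iwalk_n;

	printf("dtlb miss: %.1f cycles = %.1f ns\n", dtlb, dtlb / ghz);
	printf("itlb miss: %.1f cycles = %.1f ns\n", itlb, itlb / ghz);
	printf("512-entry dTLB refill: ~%.0f cycles\n", 512 * dtlb);
	return 0;
}

This prints ~44.6 cycles (17.1ns) for a dtlb miss, ~33.8 cycles
(13.0ns) for an itlb miss, and ~22,800 cycles for a full refill, in
line with the ~22,000 quoted. It also recovers point 4's equivalence:
22,000 cycles over 85 invlpg operations is roughly 260 cycles per
invlpg.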
Is the intuition here that invalidate-caused refills will almost always
be serviced from the L2 or better, since we've recently walked the page
tables to modify the page needing the flush and thus pre-warmed the
caches for any refill? Or is this an artifact of the flush/refill test
setup? Main memory latency even on IvyBridge is ~100 clocks, and worse
in previous generations, so to average a ~30-cycle refill you can
basically never miss in the L1, or maybe the L2, which seems
optimistic.