* [RFC][PATCH 1/6] x86: mm: clean up tlb flushing code
2014-02-18 19:30 [RFC][PATCH 0/6] x86: rework tlb range flushing code Dave Hansen
@ 2014-02-18 19:30 ` Dave Hansen
2014-02-18 19:30 ` [RFC][PATCH 2/6] x86: mm: rip out complicated, out-of-date, buggy TLB flushing Dave Hansen
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Dave Hansen @ 2014-02-18 19:30 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, ak, alex.shi, kirill.shutemov, mgorman, tim.c.chen,
x86, peterz, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
The
if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
line of code is not exactly the easiest to audit, especially when
it ends up at two different indentation levels. This eliminates
one of the copy-n-paste versions. It also gives us a unified
exit point for each path through this function. We need this in
a minute for our tracepoint.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/arch/x86/mm/tlb.c | 23 +++++++++++------------
1 file changed, 11 insertions(+), 12 deletions(-)
diff -puN arch/x86/mm/tlb.c~simplify-tlb-code arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~simplify-tlb-code 2014-02-18 10:59:35.521325070 -0800
+++ b/arch/x86/mm/tlb.c 2014-02-18 10:59:35.529325436 -0800
@@ -161,23 +161,24 @@ void flush_tlb_current_task(void)
void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned long vmflag)
{
+ int need_flush_others_all = 1;
unsigned long addr;
unsigned act_entries, tlb_entries = 0;
unsigned long nr_base_pages;
preempt_disable();
if (current->active_mm != mm)
- goto flush_all;
+ goto out;
if (!current->mm) {
leave_mm(smp_processor_id());
- goto flush_all;
+ goto out;
}
if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1
|| vmflag & VM_HUGETLB) {
local_flush_tlb();
- goto flush_all;
+ goto out;
}
/* In modern CPU, last level tlb used for both data/ins */
@@ -196,22 +197,20 @@ void flush_tlb_mm_range(struct mm_struct
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
local_flush_tlb();
} else {
+ need_flush_others_all = 0;
/* flush range by one by one 'invlpg' */
for (addr = start; addr < end; addr += PAGE_SIZE) {
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
__flush_tlb_single(addr);
}
-
- if (cpumask_any_but(mm_cpumask(mm),
- smp_processor_id()) < nr_cpu_ids)
- flush_tlb_others(mm_cpumask(mm), mm, start, end);
- preempt_enable();
- return;
}
-
-flush_all:
+out:
+ if (need_flush_others_all) {
+ start = 0UL;
+ end = TLB_FLUSH_ALL;
+ }
if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
- flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL);
+ flush_tlb_others(mm_cpumask(mm), mm, start, end);
preempt_enable();
}
_
* [RFC][PATCH 2/6] x86: mm: rip out complicated, out-of-date, buggy TLB flushing
2014-02-18 19:30 [RFC][PATCH 0/6] x86: rework tlb range flushing code Dave Hansen
2014-02-18 19:30 ` [RFC][PATCH 1/6] x86: mm: clean up tlb " Dave Hansen
@ 2014-02-18 19:30 ` Dave Hansen
2014-02-18 19:30 ` [RFC][PATCH 3/6] x86: mm: fix missed global TLB flush stat Dave Hansen
` (3 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Dave Hansen @ 2014-02-18 19:30 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, ak, alex.shi, kirill.shutemov, mgorman, tim.c.chen,
x86, peterz, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
I think the flush_tlb_mm_range() code that tries to tune the
flush sizes based on the CPU needs to get ripped out for
several reasons:
1. It is obviously buggy. It uses mm->total_vm to judge the
task's footprint in the TLB. It should certainly be using
some measure of RSS, *NOT* ->total_vm since only resident
memory can populate the TLB. (A sketch of what an RSS-based
estimate could look like follows this list.)
2. Haswell and several other CPUs are missing from the
intel_tlb_flushall_shift_set() function.
3. It is plain wrong in my vm:
[ 0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 0.037444] tlb_flushall_shift: 6
Which leads it to never use invlpg: with zero reported TLB
entries, act_entries always computes to 0, so every range flush
looks "too big" and falls back to the full flush.
4. The assumptions about TLB refill costs are wrong:
http://lkml.kernel.org/r/1337782555-8088-3-git-send-email-alex.shi@intel.com
(more on this in later patches)
5. I can not reproduce the original data: https://lkml.org/lkml/2012/5/17/59
I believe the sample times were too short. Running the
benchmark in a loop yields times that vary quite a bit.
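For point 1, here is a minimal sketch (not part of this series) of
what an RSS-based footprint estimate could look like. get_mm_rss()
is the usual helper; the function name below is purely illustrative:

#include <linux/mm.h>	/* get_mm_rss() */

/* hypothetical helper, for illustration only */
static unsigned long tlb_footprint_pages(struct mm_struct *mm)
{
	/* only resident pages can actually be cached in the TLB */
	return get_mm_rss(mm);
}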
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/arch/x86/include/asm/processor.h | 1
b/arch/x86/kernel/cpu/amd.c | 7 ---
b/arch/x86/kernel/cpu/common.c | 13 -----
b/arch/x86/kernel/cpu/intel.c | 26 -----------
b/arch/x86/mm/tlb.c | 83 ++++---------------------------------
5 files changed, 11 insertions(+), 119 deletions(-)
diff -puN arch/x86/include/asm/processor.h~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~x8x-mm-rip-out-complicated-tlb-flushing 2014-02-18 10:59:36.430366513 -0800
+++ b/arch/x86/include/asm/processor.h 2014-02-18 10:59:36.452367514 -0800
@@ -72,7 +72,6 @@ extern u16 __read_mostly tlb_lld_4k[NR_I
extern u16 __read_mostly tlb_lld_2m[NR_INFO];
extern u16 __read_mostly tlb_lld_4m[NR_INFO];
extern u16 __read_mostly tlb_lld_1g[NR_INFO];
-extern s8 __read_mostly tlb_flushall_shift;
/*
* CPU type and hardware bug flags. Kept separately for each CPU.
diff -puN arch/x86/kernel/cpu/amd.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/amd.c
--- a/arch/x86/kernel/cpu/amd.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-02-18 10:59:36.434366693 -0800
+++ b/arch/x86/kernel/cpu/amd.c 2014-02-18 10:59:36.489369201 -0800
@@ -765,11 +765,6 @@ static unsigned int amd_size_cache(struc
}
#endif
-static void cpu_set_tlb_flushall_shift(struct cpuinfo_x86 *c)
-{
- tlb_flushall_shift = 6;
-}
-
static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
{
u32 ebx, eax, ecx, edx;
@@ -817,8 +812,6 @@ static void cpu_detect_tlb_amd(struct cp
tlb_lli_2m[ENTRIES] = eax & mask;
tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;
-
- cpu_set_tlb_flushall_shift(c);
}
static const struct cpu_dev amd_cpu_dev = {
diff -puN arch/x86/kernel/cpu/common.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-02-18 10:59:36.438366877 -0800
+++ b/arch/x86/kernel/cpu/common.c 2014-02-18 10:59:36.453367560 -0800
@@ -474,26 +474,17 @@ u16 __read_mostly tlb_lld_2m[NR_INFO];
u16 __read_mostly tlb_lld_4m[NR_INFO];
u16 __read_mostly tlb_lld_1g[NR_INFO];
-/*
- * tlb_flushall_shift shows the balance point in replacing cr3 write
- * with multiple 'invlpg'. It will do this replacement when
- * flush_tlb_lines <= active_lines/2^tlb_flushall_shift.
- * If tlb_flushall_shift is -1, means the replacement will be disabled.
- */
-s8 __read_mostly tlb_flushall_shift = -1;
-
void cpu_detect_tlb(struct cpuinfo_x86 *c)
{
if (this_cpu->c_detect_tlb)
this_cpu->c_detect_tlb(c);
printk(KERN_INFO "Last level iTLB entries: 4KB %d, 2MB %d, 4MB %d\n"
- "Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d, 1GB %d\n"
- "tlb_flushall_shift: %d\n",
+ "Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d, 1GB %d\n",
tlb_lli_4k[ENTRIES], tlb_lli_2m[ENTRIES],
tlb_lli_4m[ENTRIES], tlb_lld_4k[ENTRIES],
tlb_lld_2m[ENTRIES], tlb_lld_4m[ENTRIES],
- tlb_lld_1g[ENTRIES], tlb_flushall_shift);
+ tlb_lld_1g[ENTRIES]);
}
void detect_ht(struct cpuinfo_x86 *c)
diff -puN arch/x86/kernel/cpu/intel.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/kernel/cpu/intel.c
--- a/arch/x86/kernel/cpu/intel.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-02-18 10:59:36.442367060 -0800
+++ b/arch/x86/kernel/cpu/intel.c 2014-02-18 10:59:36.488369155 -0800
@@ -631,31 +631,6 @@ static void intel_tlb_lookup(const unsig
}
}
-static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c)
-{
- switch ((c->x86 << 8) + c->x86_model) {
- case 0x60f: /* original 65 nm celeron/pentium/core2/xeon, "Merom"/"Conroe" */
- case 0x616: /* single-core 65 nm celeron/core2solo "Merom-L"/"Conroe-L" */
- case 0x617: /* current 45 nm celeron/core2/xeon "Penryn"/"Wolfdale" */
- case 0x61d: /* six-core 45 nm xeon "Dunnington" */
- tlb_flushall_shift = -1;
- break;
- case 0x63a: /* Ivybridge */
- tlb_flushall_shift = 2;
- break;
- case 0x61a: /* 45 nm nehalem, "Bloomfield" */
- case 0x61e: /* 45 nm nehalem, "Lynnfield" */
- case 0x625: /* 32 nm nehalem, "Clarkdale" */
- case 0x62c: /* 32 nm nehalem, "Gulftown" */
- case 0x62e: /* 45 nm nehalem-ex, "Beckton" */
- case 0x62f: /* 32 nm Xeon E7 */
- case 0x62a: /* SandyBridge */
- case 0x62d: /* SandyBridge, "Romely-EP" */
- default:
- tlb_flushall_shift = 6;
- }
-}
-
static void intel_detect_tlb(struct cpuinfo_x86 *c)
{
int i, j, n;
@@ -680,7 +655,6 @@ static void intel_detect_tlb(struct cpui
for (j = 1 ; j < 16 ; j++)
intel_tlb_lookup(desc[j]);
}
- intel_tlb_flushall_shift_set(c);
}
static const struct cpu_dev intel_cpu_dev = {
diff -puN arch/x86/mm/tlb.c~x8x-mm-rip-out-complicated-tlb-flushing arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~x8x-mm-rip-out-complicated-tlb-flushing 2014-02-18 10:59:36.445367196 -0800
+++ b/arch/x86/mm/tlb.c 2014-02-18 10:59:36.489369201 -0800
@@ -158,13 +158,14 @@ void flush_tlb_current_task(void)
preempt_enable();
}
+/* in units of pages */
+unsigned long tlb_single_page_flush_ceiling = 5;
+
void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned long vmflag)
{
int need_flush_others_all = 1;
unsigned long addr;
- unsigned act_entries, tlb_entries = 0;
- unsigned long nr_base_pages;
preempt_disable();
if (current->active_mm != mm)
@@ -175,25 +176,12 @@ void flush_tlb_mm_range(struct mm_struct
goto out;
}
- if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1
- || vmflag & VM_HUGETLB) {
+ if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) {
local_flush_tlb();
goto out;
}
- /* In modern CPU, last level tlb used for both data/ins */
- if (vmflag & VM_EXEC)
- tlb_entries = tlb_lli_4k[ENTRIES];
- else
- tlb_entries = tlb_lld_4k[ENTRIES];
-
- /* Assume all of TLB entries was occupied by this task */
- act_entries = tlb_entries >> tlb_flushall_shift;
- act_entries = mm->total_vm > act_entries ? act_entries : mm->total_vm;
- nr_base_pages = (end - start) >> PAGE_SHIFT;
-
- /* tlb_flushall_shift is on balance point, details in commit log */
- if (nr_base_pages > act_entries) {
+ if ((end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) {
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
local_flush_tlb();
} else {
@@ -259,68 +247,15 @@ static void do_kernel_range_flush(void *
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
- unsigned act_entries;
- struct flush_tlb_info info;
-
- /* In modern CPU, last level tlb used for both data/ins */
- act_entries = tlb_lld_4k[ENTRIES];
/* Balance as user space task's flush, a bit conservative */
- if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1 ||
- (end - start) >> PAGE_SHIFT > act_entries >> tlb_flushall_shift)
-
+ if (end == TLB_FLUSH_ALL ||
+ (end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) {
on_each_cpu(do_flush_tlb_all, NULL, 1);
- else {
+ } else {
+ struct flush_tlb_info info;
info.flush_start = start;
info.flush_end = end;
on_each_cpu(do_kernel_range_flush, &info, 1);
}
}
-
-#ifdef CONFIG_DEBUG_TLBFLUSH
-static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
- size_t count, loff_t *ppos)
-{
- char buf[32];
- unsigned int len;
-
- len = sprintf(buf, "%hd\n", tlb_flushall_shift);
- return simple_read_from_buffer(user_buf, count, ppos, buf, len);
-}
-
-static ssize_t tlbflush_write_file(struct file *file,
- const char __user *user_buf, size_t count, loff_t *ppos)
-{
- char buf[32];
- ssize_t len;
- s8 shift;
-
- len = min(count, sizeof(buf) - 1);
- if (copy_from_user(buf, user_buf, len))
- return -EFAULT;
-
- buf[len] = '\0';
- if (kstrtos8(buf, 0, &shift))
- return -EINVAL;
-
- if (shift < -1 || shift >= BITS_PER_LONG)
- return -EINVAL;
-
- tlb_flushall_shift = shift;
- return count;
-}
-
-static const struct file_operations fops_tlbflush = {
- .read = tlbflush_read_file,
- .write = tlbflush_write_file,
- .llseek = default_llseek,
-};
-
-static int __init create_tlb_flushall_shift(void)
-{
- debugfs_create_file("tlb_flushall_shift", S_IRUSR | S_IWUSR,
- arch_debugfs_dir, NULL, &fops_tlbflush);
- return 0;
-}
-late_initcall(create_tlb_flushall_shift);
-#endif
_
* [RFC][PATCH 3/6] x86: mm: fix missed global TLB flush stat
2014-02-18 19:30 [RFC][PATCH 0/6] x86: rework tlb range flushing code Dave Hansen
2014-02-18 19:30 ` [RFC][PATCH 1/6] x86: mm: clean up tlb " Dave Hansen
2014-02-18 19:30 ` [RFC][PATCH 2/6] x86: mm: rip out complicated, out-of-date, buggy TLB flushing Dave Hansen
@ 2014-02-18 19:30 ` Dave Hansen
2014-02-18 19:30 ` [RFC][PATCH 4/6] x86: mm: trace tlb flushes Dave Hansen
` (2 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Dave Hansen @ 2014-02-18 19:30 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, ak, alex.shi, kirill.shutemov, mgorman, tim.c.chen,
x86, peterz, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
If we take the
if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) {
local_flush_tlb();
goto out;
}
path out of flush_tlb_mm_range(), we will have flushed the tlb,
but not incremented NR_TLB_LOCAL_FLUSH_ALL. This unifies the
way out of the function so that we always take a single path when
doing a full tlb flush.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/arch/x86/mm/tlb.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff -puN arch/x86/mm/tlb.c~fix-missed-global-flush-stat arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~fix-missed-global-flush-stat 2014-02-18 10:59:37.611420354 -0800
+++ b/arch/x86/mm/tlb.c 2014-02-18 10:59:37.619420720 -0800
@@ -164,8 +164,9 @@ unsigned long tlb_single_page_flush_ceil
void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned long vmflag)
{
- int need_flush_others_all = 1;
unsigned long addr;
+ /* do a global flush by default */
+ unsigned long base_pages_to_flush = TLB_FLUSH_ALL;
preempt_disable();
if (current->active_mm != mm)
@@ -176,16 +177,14 @@ void flush_tlb_mm_range(struct mm_struct
goto out;
}
- if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) {
- local_flush_tlb();
- goto out;
- }
+ if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB))
+ base_pages_to_flush = (end - start) >> PAGE_SHIFT;
- if ((end - start) > tlb_single_page_flush_ceiling * PAGE_SIZE) {
+ if (base_pages_to_flush > tlb_single_page_flush_ceiling) {
+ base_pages_to_flush = TLB_FLUSH_ALL;
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
local_flush_tlb();
} else {
- need_flush_others_all = 0;
/* flush range by one by one 'invlpg' */
for (addr = start; addr < end; addr += PAGE_SIZE) {
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
@@ -193,7 +192,7 @@ void flush_tlb_mm_range(struct mm_struct
}
}
out:
- if (need_flush_others_all) {
+ if (base_pages_to_flush == TLB_FLUSH_ALL) {
start = 0UL;
end = TLB_FLUSH_ALL;
}
_
* [RFC][PATCH 4/6] x86: mm: trace tlb flushes
2014-02-18 19:30 [RFC][PATCH 0/6] x86: rework tlb range flushing code Dave Hansen
` (2 preceding siblings ...)
2014-02-18 19:30 ` [RFC][PATCH 3/6] x86: mm: fix missed global TLB flush stat Dave Hansen
@ 2014-02-18 19:30 ` Dave Hansen
2014-02-18 19:30 ` [RFC][PATCH 5/6] x86: mm: new tunable for single vs full TLB flush Dave Hansen
2014-02-18 19:30 ` [RFC][PATCH 6/6] x86: mm: set TLB flush tunable to sane value Dave Hansen
5 siblings, 0 replies; 7+ messages in thread
From: Dave Hansen @ 2014-02-18 19:30 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, ak, alex.shi, kirill.shutemov, mgorman, tim.c.chen,
x86, peterz, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
We don't have any good way to figure out what kinds of flushes
are being attempted. Right now, we can try to use the vm
counters, but those only tell us what we actually did with the
hardware (one-by-one vs full) and don't tell us what was actually
_requested_.
This allows us to select out "interesting" TLB flushes that we
might want to optimize (like the ranged ones) and ignore the ones
that we have very little control over (the ones at context
switch).
Also, since we have a pair of tracepoint calls in
flush_tlb_mm_range(), we can time the deltas between them to make
sure that we got the "invlpg vs. global flush" balance correct in
practice.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/arch/x86/include/asm/mmu_context.h | 6 +++++
b/arch/x86/mm/tlb.c | 13 ++++++++++--
b/include/linux/mm_types.h | 10 +++++++++
b/include/trace/events/tlb.h | 37 +++++++++++++++++++++++++++++++++++
b/mm/Makefile | 2 -
b/mm/trace_tlb.c | 12 +++++++++++
6 files changed, 77 insertions(+), 3 deletions(-)
diff -puN arch/x86/include/asm/mmu_context.h~tlb-trace-flushes arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~tlb-trace-flushes 2014-02-18 10:59:38.570464074 -0800
+++ b/arch/x86/include/asm/mmu_context.h 2014-02-18 10:59:38.593465123 -0800
@@ -3,6 +3,10 @@
#include <asm/desc.h>
#include <linux/atomic.h>
+#include <linux/mm_types.h>
+
+#include <trace/events/tlb.h>
+
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include <asm/paravirt.h>
@@ -44,6 +48,7 @@ static inline void switch_mm(struct mm_s
/* Re-load page tables */
load_cr3(next->pgd);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
/* Stop flush ipis for the previous mm */
cpumask_clear_cpu(cpu, mm_cpumask(prev));
@@ -71,6 +76,7 @@ static inline void switch_mm(struct mm_s
* to make sure to use no freed page tables.
*/
load_cr3(next->pgd);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
load_LDT_nolock(&next->context);
}
}
diff -puN arch/x86/mm/tlb.c~tlb-trace-flushes arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~tlb-trace-flushes 2014-02-18 10:59:38.574464256 -0800
+++ b/arch/x86/mm/tlb.c 2014-02-18 11:03:52.426034364 -0800
@@ -14,6 +14,9 @@
#include <asm/uv/uv.h>
#include <linux/debugfs.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/tlb.h>
+
DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate)
= { &init_mm, 0, };
@@ -49,6 +52,7 @@ void leave_mm(int cpu)
if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) {
cpumask_clear_cpu(cpu, mm_cpumask(active_mm));
load_cr3(swapper_pg_dir);
+ trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
}
}
EXPORT_SYMBOL_GPL(leave_mm);
@@ -105,9 +109,10 @@ static void flush_tlb_func(void *info)
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) {
- if (f->flush_end == TLB_FLUSH_ALL)
+ if (f->flush_end == TLB_FLUSH_ALL) {
local_flush_tlb();
- else if (!f->flush_end)
+ trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, TLB_FLUSH_ALL);
+ } else if (!f->flush_end)
__flush_tlb_single(f->flush_start);
else {
unsigned long addr;
@@ -152,7 +157,9 @@ void flush_tlb_current_task(void)
preempt_disable();
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+ trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
local_flush_tlb();
+ trace_tlb_flush(TLB_LOCAL_SHOOTDOWN_DONE, TLB_FLUSH_ALL);
if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL);
preempt_enable();
@@ -180,6 +187,7 @@ void flush_tlb_mm_range(struct mm_struct
if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB))
base_pages_to_flush = (end - start) >> PAGE_SHIFT;
+ trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN, base_pages_to_flush);
if (base_pages_to_flush > tlb_single_page_flush_ceiling) {
base_pages_to_flush = TLB_FLUSH_ALL;
count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
@@ -191,6 +199,7 @@ void flush_tlb_mm_range(struct mm_struct
__flush_tlb_single(addr);
}
}
+ trace_tlb_flush(TLB_LOCAL_MM_SHOOTDOWN_DONE, base_pages_to_flush);
out:
if (base_pages_to_flush == TLB_FLUSH_ALL) {
start = 0UL;
diff -puN include/linux/mm_types.h~tlb-trace-flushes include/linux/mm_types.h
--- a/include/linux/mm_types.h~tlb-trace-flushes 2014-02-18 10:59:38.578464440 -0800
+++ b/include/linux/mm_types.h 2014-02-18 10:59:38.595465214 -0800
@@ -509,4 +509,14 @@ static inline void clear_tlb_flush_pendi
}
#endif
+enum tlb_flush_reason {
+ TLB_FLUSH_ON_TASK_SWITCH,
+ TLB_REMOTE_SHOOTDOWN,
+ TLB_LOCAL_SHOOTDOWN,
+ TLB_LOCAL_SHOOTDOWN_DONE,
+ TLB_LOCAL_MM_SHOOTDOWN,
+ TLB_LOCAL_MM_SHOOTDOWN_DONE,
+ NR_TLB_FLUSH_REASONS,
+};
+
#endif /* _LINUX_MM_TYPES_H */
diff -puN /dev/null include/trace/events/tlb.h
--- /dev/null 2014-01-15 16:08:30.019511980 -0800
+++ b/include/trace/events/tlb.h 2014-02-18 11:05:13.176713847 -0800
@@ -0,0 +1,37 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM tlb
+
+#if !defined(_TRACE_TLB_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TLB_H
+
+#include <linux/mm_types.h>
+#include <linux/tracepoint.h>
+
+extern const char * const tlb_flush_reason_desc[];
+
+TRACE_EVENT(tlb_flush,
+
+ TP_PROTO(int reason, unsigned long pages),
+ TP_ARGS(reason, pages),
+
+ TP_STRUCT__entry(
+ __field( int, reason)
+ __field(unsigned long, pages)
+ ),
+
+ TP_fast_assign(
+ __entry->reason = reason;
+ __entry->pages = pages;
+ ),
+
+ TP_printk("pages: %ld reason: %d (%s)",
+ __entry->pages,
+ __entry->reason,
+ tlb_flush_reason_desc[__entry->reason])
+);
+
+#endif /* _TRACE_TLB_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
+
diff -puN mm/Makefile~tlb-trace-flushes mm/Makefile
--- a/mm/Makefile~tlb-trace-flushes 2014-02-18 10:59:38.583464667 -0800
+++ b/mm/Makefile 2014-02-18 10:59:38.596465261 -0800
@@ -5,7 +5,7 @@
mmu-y := nommu.o
mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
- vmalloc.o pagewalk.o pgtable-generic.o
+ vmalloc.o pagewalk.o pgtable-generic.o trace_tlb.o
ifdef CONFIG_CROSS_MEMORY_ATTACH
mmu-$(CONFIG_MMU) += process_vm_access.o
diff -puN /dev/null mm/trace_tlb.c
--- /dev/null 2014-01-15 16:08:30.019511980 -0800
+++ b/mm/trace_tlb.c 2014-02-18 10:59:38.596465261 -0800
@@ -0,0 +1,12 @@
+#define DEFINE_TRACE_POINTS
+#include <trace/events/tlb.h>
+
+const char * const tlb_flush_reason_desc[] = {
+ __stringify(TLB_FLUSH_ON_TASK_SWITCH),
+ __stringify(TLB_REMOTE_SHOOTDOWN),
+ __stringify(TLB_LOCAL_SHOOTDOWN),
+ __stringify(TLB_LOCAL_SHOOTDOWN_DONE),
+ __stringify(TLB_LOCAL_MM_SHOOTDOWN),
+ __stringify(TLB_LOCAL_MM_SHOOTDOWN_DONE),
+};
+
_
* [RFC][PATCH 5/6] x86: mm: new tunable for single vs full TLB flush
2014-02-18 19:30 [RFC][PATCH 0/6] x86: rework tlb range flushing code Dave Hansen
` (3 preceding siblings ...)
2014-02-18 19:30 ` [RFC][PATCH 4/6] x86: mm: trace tlb flushes Dave Hansen
@ 2014-02-18 19:30 ` Dave Hansen
2014-02-18 19:30 ` [RFC][PATCH 6/6] x86: mm: set TLB flush tunable to sane value Dave Hansen
5 siblings, 0 replies; 7+ messages in thread
From: Dave Hansen @ 2014-02-18 19:30 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, ak, alex.shi, kirill.shutemov, mgorman, tim.c.chen,
x86, peterz, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
Most of the logic here is in the documentation file. Please take
a look at it.
I know we've come full-circle here back to a tunable, but this
new one is *WAY* simpler. I challenge anyone to describe in one
sentence how the old one worked. Here's the way the new one
works:
If we are flushing more pages than the ceiling, we use
the full flush, otherwise we use invlpg.
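In code, that policy is roughly the following (a condensed sketch of
what patches 2 and 3 converge on, not a literal excerpt from the
diff):

	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;

	if (end == TLB_FLUSH_ALL || (vmflag & VM_HUGETLB) ||
	    nr_pages > tlb_single_page_flush_ceiling) {
		local_flush_tlb();			/* one full flush */
	} else {
		unsigned long addr;

		for (addr = start; addr < end; addr += PAGE_SIZE)
			__flush_tlb_single(addr);	/* one invlpg per page */
	}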
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/Documentation/x86/tlb.txt | 64 ++++++++++++++++++++++++++++++++++++++++++++
b/arch/x86/mm/tlb.c | 47 +++++++++++++++++++++++++++++++-
2 files changed, 110 insertions(+), 1 deletion(-)
diff -puN arch/x86/mm/tlb.c~new-tunable-for-single-vs-full-tlb-flush arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~new-tunable-for-single-vs-full-tlb-flush 2014-02-18 10:59:39.420502826 -0800
+++ b/arch/x86/mm/tlb.c 2014-02-18 10:59:39.427503145 -0800
@@ -167,7 +167,6 @@ void flush_tlb_current_task(void)
/* in units of pages */
unsigned long tlb_single_page_flush_ceiling = 5;
-
void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned long vmflag)
{
@@ -267,3 +266,49 @@ void flush_tlb_kernel_range(unsigned lon
on_each_cpu(do_kernel_range_flush, &info, 1);
}
}
+
+static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
+ size_t count, loff_t *ppos)
+{
+ char buf[32];
+ unsigned int len;
+
+ len = sprintf(buf, "%ld\n", tlb_single_page_flush_ceiling);
+ return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t tlbflush_write_file(struct file *file,
+ const char __user *user_buf, size_t count, loff_t *ppos)
+{
+ char buf[32];
+ ssize_t len;
+ int ceiling;
+
+ len = min(count, sizeof(buf) - 1);
+ if (copy_from_user(buf, user_buf, len))
+ return -EFAULT;
+
+ buf[len] = '\0';
+ if (kstrtoint(buf, 0, &ceiling))
+ return -EINVAL;
+
+ if (ceiling < 0)
+ return -EINVAL;
+
+ tlb_single_page_flush_ceiling = ceiling;
+ return count;
+}
+
+static const struct file_operations fops_tlbflush = {
+ .read = tlbflush_read_file,
+ .write = tlbflush_write_file,
+ .llseek = default_llseek,
+};
+
+static int __init create_tlb_single_page_flush_ceiling(void)
+{
+ debugfs_create_file("tlb_single_page_flush_ceiling", S_IRUSR | S_IWUSR,
+ arch_debugfs_dir, NULL, &fops_tlbflush);
+ return 0;
+}
+late_initcall(create_tlb_single_page_flush_ceiling);
diff -puN /dev/null Documentation/x86/tlb.txt
--- /dev/null 2014-01-15 16:08:30.019511980 -0800
+++ b/Documentation/x86/tlb.txt 2014-02-18 10:59:39.427503145 -0800
@@ -0,0 +1,64 @@
+When the kernel unmaps or modifies the attributes of a range of
+memory, it has two choices:
+ 1. Flush the entire TLB with a two-instruction sequence. This is
+ a quick operation, but it causes collateral damage: TLB entries
+ from areas other than the one we are trying to flush will be
+ destroyed and must be refilled later, at some cost.
+ 2. Use the invlpg instruction to invalidate a single page at a
time. This could potentially cost many more instructions, but
+ it is a much more precise operation, causing no collateral
+ damage to other TLB entries.
+
+Which method to use depends on a few things:
+ 1. The size of the flush being performed. A flush of the entire
+ address space is obviously better performed by flushing the
+ entire TLB than doing 2^48/PAGE_SIZE invlpg calls.
+ 2. The contents of the TLB. If the TLB is empty, then there will
+ be no collateral damage caused by doing the global flush, and
+ all of the invlpg calls will have ended up being wasted work.
+ Whether or not the range being flushed was in the TLB matters
+ as well.
+ 3. The size of the TLB. The larger the TLB, the more collateral
+ damage we do with a full flush. So, the larger the TLB, the
more attractive invlpg looks.
+ 4. The microarchitecture. The TLB has become a multi-level
+ cache on modern CPUs, and the global flushes have become more
+ expensive relative to single-page flushes.
+
+There is obviously no way the kernel can know all these things,
+especially the contents of the TLB during a given flush. The
+sizes of the flush will vary greatly depending on the workload as
+well. There is essentially no "right" point to choose.
+
+If you believe that invlpg is being called too often, you can
+lower the tunable:
+
+ /sys/kernel/debug/x86/tlb_single_page_flush_ceiling
+
+This will cause us to do the global flush for more cases.
+Lowering it to 0 will disable the use of invlpg.
+
+You might see invlpg inside of flush_tlb_mm_range() show up in
+profiles, or you can use the trace_tlb_flush() tracepoints to
+determine how long the flush operations are taking.
+
+Essentially, you are balancing the cycles you spend doing invlpg
+with the cycles that you spend refilling the TLB later.
+
+You can measure how expensive TLB refills are by using
+performance counters and 'perf stat', like this:
+
+perf stat -e
+ cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
+ cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,
+ cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,
+ cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,
+ cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,
+ cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
+
+That works on an IvyBridge-era CPU (i5-3320M). Different CPUs
+may have differently-named counters, but they should at least
+be there in some form. You can use pmu-tools 'ocperf list'
+(https://github.com/andikleen/pmu-tools) to find the right
+counters for a given CPU.
+
_
* [RFC][PATCH 6/6] x86: mm: set TLB flush tunable to sane value
2014-02-18 19:30 [RFC][PATCH 0/6] x86: rework tlb range flushing code Dave Hansen
` (4 preceding siblings ...)
2014-02-18 19:30 ` [RFC][PATCH 5/6] x86: mm: new tunable for single vs full TLB flush Dave Hansen
@ 2014-02-18 19:30 ` Dave Hansen
5 siblings, 0 replies; 7+ messages in thread
From: Dave Hansen @ 2014-02-18 19:30 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, ak, alex.shi, kirill.shutemov, mgorman, tim.c.chen,
x86, peterz, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
Now that we have some shiny new tracepoints, we can actually
figure out what the heck is going on.
During a kernel compile, 60% of the flush_tlb_mm_range() calls
are for a single page. It breaks down like this:
size      percent    percent<= (cumulative)    avg cycles
GLOBAL: 2.20% 2.20% avg cycles: 2283
1: 56.92% 59.12% avg cycles: 1276
2: 13.78% 72.90% avg cycles: 1505
3: 8.26% 81.16% avg cycles: 1880
4: 7.41% 88.58% avg cycles: 2447
5: 1.73% 90.31% avg cycles: 2358
6: 1.32% 91.63% avg cycles: 2563
7: 1.14% 92.77% avg cycles: 2862
8: 0.62% 93.39% avg cycles: 3542
9: 0.08% 93.47% avg cycles: 3289
10: 0.43% 93.90% avg cycles: 3570
11: 0.20% 94.10% avg cycles: 3767
12: 0.08% 94.18% avg cycles: 3996
13: 0.03% 94.20% avg cycles: 4077
14: 0.02% 94.23% avg cycles: 4836
15: 0.04% 94.26% avg cycles: 5699
16: 0.06% 94.32% avg cycles: 5041
17: 0.57% 94.89% avg cycles: 5473
18: 0.02% 94.91% avg cycles: 5396
19: 0.03% 94.95% avg cycles: 5296
20: 0.02% 94.96% avg cycles: 6749
21: 0.18% 95.14% avg cycles: 6225
22: 0.01% 95.15% avg cycles: 6393
23: 0.01% 95.16% avg cycles: 6861
24: 0.12% 95.28% avg cycles: 6912
25: 0.05% 95.32% avg cycles: 7190
26: 0.01% 95.33% avg cycles: 7793
27: 0.01% 95.34% avg cycles: 7833
28: 0.01% 95.35% avg cycles: 8253
29: 0.08% 95.42% avg cycles: 8024
30: 0.03% 95.45% avg cycles: 9670
31: 0.01% 95.46% avg cycles: 8949
32: 0.01% 95.46% avg cycles: 9350
33: 3.11% 98.57% avg cycles: 8534
34: 0.02% 98.60% avg cycles: 10977
35: 0.02% 98.62% avg cycles: 11400
We get into diminishing returns pretty quickly. On pre-IvyBridge
CPUs, we used to set the limit at 8 pages, and it was set at 128
on IvyBridge. That 128 number looks pretty silly considering that
less than 0.5% of the flushes are that large.
The previous code tried to size this number based on the size of
the TLB. Good idea, but it's error-prone, needs maintenance
(which it didn't get up to now), and probably would not matter
much in practice.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/arch/x86/mm/tlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff -puN arch/x86/mm/tlb.c~set-tunable-to-sane-value arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~set-tunable-to-sane-value 2014-02-18 11:05:37.304813166 -0800
+++ b/arch/x86/mm/tlb.c 2014-02-18 11:05:37.306813257 -0800
@@ -166,7 +166,7 @@ void flush_tlb_current_task(void)
}
/* in units of pages */
-unsigned long tlb_single_page_flush_ceiling = 5;
+unsigned long tlb_single_page_flush_ceiling = 33;
void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned long vmflag)
{
_