linux-mm.kvack.org archive mirror
From: Brian Gerst <brgerst@gmail.com>
To: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	moritz.lipp@iaik.tugraz.at, daniel.gruss@iaik.tugraz.at,
	michael.schwarz@iaik.tugraz.at, Andy Lutomirski <luto@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Kees Cook <keescook@google.com>,
	hughd@google.com, the arch/x86 maintainers <x86@kernel.org>
Subject: Re: [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching
Date: Tue, 31 Oct 2017 20:43:49 -0400	[thread overview]
Message-ID: <CAMzpN2gP5SQWrbwNn9A+c6y5yLpqrV8Hpxou+TSypQ2WL+JXkQ@mail.gmail.com> (raw)
In-Reply-To: <20171031223148.5334003A@viggo.jf.intel.com>

On Tue, Oct 31, 2017 at 6:31 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> This is largely code from Andy Lutomirski.  I fixed a few bugs
> in it, and added a few SWITCH_TO_* spots.
>
> KAISER needs to switch to a different CR3 value when it enters
> the kernel and switch back when it exits.  This essentially
> needs to be done before we leave assembly code.
>
> This is extra challenging because the context in which we have to
> make this switch is tricky: the registers we are allowed to
> clobber can vary.  It's also hard to store things on the stack
> because there are already things on it with an established ABI
> (ptregs) or the stack is unsafe to use at all.
>
> This patch establishes a set of macros that allow changing to
> the user and kernel CR3 values, but do not actually switch
> CR3.  The code will, however, clobber the registers that it
> says it will and also does perform *writes* to CR3.  So, this
> patch by itself tests that the registers we are clobbering
> and restoring from are OK, and that things like our stack
> manipulation are in safe places.
>
> In other words, if you bisect to here, this *does* introduce
> changes that can break things.
>
> Interactions with SWAPGS: previous versions of the KAISER code
> relied on having per-cpu scratch space so we have a register
> to clobber for our CR3 MOV.  The %GS register is what we use
> to index into our per-cpu space, so SWAPGS *had* to be done
> before the CR3 switch.  That scratch space is gone now, but we
> still keep the semantic that SWAPGS must be done before the
> CR3 MOV.  This is good to keep because it is not that hard to
> do and it allows us to do things like add per-cpu debugging
> information to help us figure out what goes wrong sometimes.
>
> What this does in the NMI code is worth pointing out.  NMIs
> can interrupt *any* context and they can also be nested with
> NMIs interrupting other NMIs.  The comments below
> ".Lnmi_from_kernel" explain the format of the stack we use
> to deal with this situation.  Changing the format of
> this stack is not a fun exercise: I tried.  Instead of
> storing the old CR3 value on the stack, we depend on the
> *regular* register save/restore mechanism and then use %r14
> to keep CR3 during the NMI.  It will not be clobbered by the
> C NMI handlers that get called.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
> Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
> Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <keescook@google.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: x86@kernel.org
> ---
>
>  b/arch/x86/entry/calling.h         |   40 +++++++++++++++++++++++++++++++++++++
>  b/arch/x86/entry/entry_64.S        |   33 +++++++++++++++++++++++++-----
>  b/arch/x86/entry/entry_64_compat.S |   13 ++++++++++++
>  3 files changed, 81 insertions(+), 5 deletions(-)
>
> diff -puN arch/x86/entry/calling.h~kaiser-luto-base-cr3-work arch/x86/entry/calling.h
> --- a/arch/x86/entry/calling.h~kaiser-luto-base-cr3-work        2017-10-31 15:03:48.105007253 -0700
> +++ b/arch/x86/entry/calling.h  2017-10-31 15:03:48.113007631 -0700
> @@ -1,5 +1,6 @@
>  #include <linux/jump_label.h>
>  #include <asm/unwind_hints.h>
> +#include <asm/cpufeatures.h>
>
>  /*
>
> @@ -217,6 +218,45 @@ For 32-bit we have the following convent
>  #endif
>  .endm
>
> +.macro ADJUST_KERNEL_CR3 reg:req
> +.endm
> +
> +.macro ADJUST_USER_CR3 reg:req
> +.endm
> +
> +.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
> +       mov     %cr3, \scratch_reg
> +       ADJUST_KERNEL_CR3 \scratch_reg
> +       mov     \scratch_reg, %cr3
> +.endm
> +
> +.macro SWITCH_TO_USER_CR3 scratch_reg:req
> +       mov     %cr3, \scratch_reg
> +       ADJUST_USER_CR3 \scratch_reg
> +       mov     \scratch_reg, %cr3
> +.endm
> +
> +.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
> +       movq    %cr3, %r\scratch_reg
> +       movq    %r\scratch_reg, \save_reg
> +       /*
> +        * Just stick a random bit in here that never gets set.  Fixed
> +        * up in real KAISER patches in a moment.
> +        */
> +       bt      $63, %r\scratch_reg
> +       jz      .Ldone_\@
> +
> +       ADJUST_KERNEL_CR3 %r\scratch_reg
> +       movq    %r\scratch_reg, %cr3
> +
> +.Ldone_\@:
> +.endm
> +
> +.macro RESTORE_CR3 save_reg:req
> +       /* optimize this */
> +       movq    \save_reg, %cr3
> +.endm
> +
>  #endif /* CONFIG_X86_64 */
>
>  /*
> diff -puN arch/x86/entry/entry_64_compat.S~kaiser-luto-base-cr3-work arch/x86/entry/entry_64_compat.S
> --- a/arch/x86/entry/entry_64_compat.S~kaiser-luto-base-cr3-work        2017-10-31 15:03:48.107007348 -0700
> +++ b/arch/x86/entry/entry_64_compat.S  2017-10-31 15:03:48.113007631 -0700
> @@ -48,8 +48,13 @@
>  ENTRY(entry_SYSENTER_compat)
>         /* Interrupts are off on entry. */
>         SWAPGS_UNSAFE_STACK
> +
>         movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>
> +       pushq   %rdi
> +       SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
> +       popq    %rdi
> +
>         /*
>          * User tracing code (ptrace or signal handlers) might assume that
>          * the saved RAX contains a 32-bit number when we're invoking a 32-bit
> @@ -91,6 +96,9 @@ ENTRY(entry_SYSENTER_compat)
>         pushq   $0                      /* pt_regs->r15 = 0 */
>         cld
>
> +       pushq   %rdi
> +       SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
> +       popq    %rdi
>         /*
>          * SYSENTER doesn't filter flags, so we need to clear NT and AC
>          * ourselves.  To save a few cycles, we can check whether
> @@ -214,6 +222,8 @@ GLOBAL(entry_SYSCALL_compat_after_hwfram
>         pushq   $0                      /* pt_regs->r14 = 0 */
>         pushq   $0                      /* pt_regs->r15 = 0 */
>
> +       SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
> +
>         /*
>          * User mode is traced as though IRQs are on, and SYSENTER
>          * turned them off.
> @@ -240,6 +250,7 @@ sysret32_from_system_call:
>         popq    %rsi                    /* pt_regs->si */
>         popq    %rdi                    /* pt_regs->di */
>
> +       SWITCH_TO_USER_CR3 scratch_reg=%r8
>          /*
>           * USERGS_SYSRET32 does:
>           *  GSBASE = user's GS base
> @@ -324,6 +335,7 @@ ENTRY(entry_INT80_compat)
>         pushq   %r15                    /* pt_regs->r15 */
>         cld
>
> +       SWITCH_TO_KERNEL_CR3 scratch_reg=%r11
>         /*
>          * User mode is traced as though IRQs are on, and the interrupt
>          * gate turned them off.
> @@ -337,6 +349,7 @@ ENTRY(entry_INT80_compat)
>         /* Go back to user mode. */
>         TRACE_IRQS_ON
>         SWAPGS
> +       SWITCH_TO_USER_CR3 scratch_reg=%r11
>         jmp     restore_regs_and_iret
>  END(entry_INT80_compat)
>
> diff -puN arch/x86/entry/entry_64.S~kaiser-luto-base-cr3-work arch/x86/entry/entry_64.S
> --- a/arch/x86/entry/entry_64.S~kaiser-luto-base-cr3-work       2017-10-31 15:03:48.109007442 -0700
> +++ b/arch/x86/entry/entry_64.S 2017-10-31 15:03:48.115007726 -0700
> @@ -147,8 +147,6 @@ ENTRY(entry_SYSCALL_64)
>         movq    %rsp, PER_CPU_VAR(rsp_scratch)
>         movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>
> -       TRACE_IRQS_OFF
> -
>         /* Construct struct pt_regs on stack */
>         pushq   $__USER_DS                      /* pt_regs->ss */
>         pushq   PER_CPU_VAR(rsp_scratch)        /* pt_regs->sp */
> @@ -169,6 +167,13 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
>         sub     $(6*8), %rsp                    /* pt_regs->bp, bx, r12-15 not saved */
>         UNWIND_HINT_REGS extra=0
>
> +       /* NB: right here, all regs except r11 are live. */
> +
> +       SWITCH_TO_KERNEL_CR3 scratch_reg=%r11
> +
> +       /* Must wait until we have the kernel CR3 to call C functions: */
> +       TRACE_IRQS_OFF
> +
>         /*
>          * If we need to do entry work or if we guess we'll need to do
>          * exit work, go straight to the slow path.
> @@ -220,6 +225,7 @@ entry_SYSCALL_64_fastpath:
>         TRACE_IRQS_ON           /* user mode is traced as IRQs on */
>         movq    RIP(%rsp), %rcx
>         movq    EFLAGS(%rsp), %r11
> +       SWITCH_TO_USER_CR3 scratch_reg=%rdi
>         RESTORE_C_REGS_EXCEPT_RCX_R11
>         movq    RSP(%rsp), %rsp
>         UNWIND_HINT_EMPTY
> @@ -313,6 +319,7 @@ return_from_SYSCALL_64:
>          * perf profiles. Nothing jumps here.
>          */
>  syscall_return_via_sysret:
> +       SWITCH_TO_USER_CR3 scratch_reg=%rdi
>         /* rcx and r11 are already restored (see code above) */
>         RESTORE_C_REGS_EXCEPT_RCX_R11
>         movq    RSP(%rsp), %rsp
> @@ -320,6 +327,7 @@ syscall_return_via_sysret:
>         USERGS_SYSRET64
>
>  opportunistic_sysret_failed:
> +       SWITCH_TO_USER_CR3 scratch_reg=%rdi
>         SWAPGS
>         jmp     restore_c_regs_and_iret
>  END(entry_SYSCALL_64)
> @@ -422,6 +430,7 @@ ENTRY(ret_from_fork)
>         movq    %rsp, %rdi
>         call    syscall_return_slowpath /* returns with IRQs disabled */
>         TRACE_IRQS_ON                   /* user mode is traced as IRQS on */
> +       SWITCH_TO_USER_CR3 scratch_reg=%rdi
>         SWAPGS
>         jmp     restore_regs_and_iret
>
> @@ -611,6 +620,7 @@ GLOBAL(retint_user)
>         mov     %rsp,%rdi
>         call    prepare_exit_to_usermode
>         TRACE_IRQS_IRETQ
> +       SWITCH_TO_USER_CR3 scratch_reg=%rdi
>         SWAPGS
>         jmp     restore_regs_and_iret
>
> @@ -1091,7 +1101,11 @@ ENTRY(paranoid_entry)
>         js      1f                              /* negative -> in kernel */
>         SWAPGS
>         xorl    %ebx, %ebx
> -1:     ret
> +
> +1:
> +       SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=ax save_reg=%r14
> +
> +       ret
>  END(paranoid_entry)
>
>  /*
> @@ -1118,6 +1132,7 @@ ENTRY(paranoid_exit)
>  paranoid_exit_no_swapgs:
>         TRACE_IRQS_IRETQ_DEBUG
>  paranoid_exit_restore:
> +       RESTORE_CR3     %r14
>         RESTORE_EXTRA_REGS
>         RESTORE_C_REGS
>         REMOVE_PT_GPREGS_FROM_STACK 8
> @@ -1144,6 +1159,9 @@ ENTRY(error_entry)
>          */
>         SWAPGS
>
> +       /* We have user CR3.  Change to kernel CR3. */
> +       SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
> +
>  .Lerror_entry_from_usermode_after_swapgs:
>         /*
>          * We need to tell lockdep that IRQs are off.  We can't do this until
> @@ -1190,9 +1208,10 @@ ENTRY(error_entry)
>
>  .Lerror_bad_iret:
>         /*
> -        * We came from an IRET to user mode, so we have user gsbase.
> -        * Switch to kernel gsbase:
> +        * We came from an IRET to user mode, so we have user
> +        * gsbase and CR3.  Switch to kernel gsbase and CR3:
>          */
> +       SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
>         SWAPGS
>
>         /*
> @@ -1313,6 +1332,7 @@ ENTRY(nmi)
>         UNWIND_HINT_REGS
>         ENCODE_FRAME_POINTER
>
> +       SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
>         /*
>          * At this point we no longer need to worry about stack damage
>          * due to nesting -- we're on the normal thread stack and we're
> @@ -1328,6 +1348,7 @@ ENTRY(nmi)
>          * work, because we don't want to enable interrupts.
>          */
>         SWAPGS
> +       SWITCH_TO_USER_CR3 scratch_reg=%rdi
>         jmp     restore_regs_and_iret
>
>  .Lnmi_from_kernel:
> @@ -1538,6 +1559,8 @@ end_repeat_nmi:
>         movq    $-1, %rsi
>         call    do_nmi
>
> +       RESTORE_CR3 save_reg=%r14
> +
>         testl   %ebx, %ebx                      /* swapgs needed? */
>         jnz     nmi_restore
>  nmi_swapgs:
> _

This all needs to be conditional on a config option.  Something with
this amount of performance impact needs to be 100% optional.

--
Brian Gerst

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

Thread overview: 102+ messages
2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
2017-10-31 22:31 ` [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching Dave Hansen
2017-11-01  0:43   ` Brian Gerst [this message]
2017-11-01  1:08     ` Dave Hansen
2017-11-01 18:18   ` Borislav Petkov
2017-11-01 18:27     ` Dave Hansen
2017-11-01 20:42       ` Borislav Petkov
2017-11-01 21:01   ` Thomas Gleixner
2017-11-01 22:58     ` Dave Hansen
2017-10-31 22:31 ` [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables Dave Hansen
2017-11-01 21:11   ` Thomas Gleixner
2017-11-01 21:24     ` Andy Lutomirski
2017-11-01 21:28       ` Thomas Gleixner
2017-11-01 21:52         ` Dave Hansen
2017-11-01 22:11           ` Thomas Gleixner
2017-11-01 22:12           ` Linus Torvalds
2017-11-01 22:20             ` Thomas Gleixner
2017-11-01 22:45               ` Kees Cook
2017-11-02  7:10               ` Andy Lutomirski
2017-11-02 11:33                 ` Thomas Gleixner
2017-11-02 11:59                   ` Andy Lutomirski
2017-11-02 12:56                     ` Thomas Gleixner
2017-11-02 16:38                   ` Dave Hansen
2017-11-02 18:19                     ` Andy Lutomirski
2017-11-02 18:24                       ` Thomas Gleixner
2017-11-02 18:24                       ` Linus Torvalds
2017-11-02 18:40                         ` Thomas Gleixner
2017-11-02 18:57                           ` Linus Torvalds
2017-11-02 21:41                             ` Thomas Gleixner
2017-11-02  7:07         ` Andy Lutomirski
2017-11-02 11:21           ` Thomas Gleixner
2017-10-31 22:31 ` [PATCH 03/23] x86, kaiser: disable global pages Dave Hansen
2017-11-01 21:18   ` Thomas Gleixner
2017-11-01 22:12     ` Dave Hansen
2017-11-01 22:28       ` Thomas Gleixner
2017-10-31 22:31 ` [PATCH 04/23] x86, tlb: make CR4-based TLB flushes more robust Dave Hansen
2017-11-01  8:01   ` Andy Lutomirski
2017-11-01 10:11     ` Kirill A. Shutemov
2017-11-01 10:38       ` Andy Lutomirski
2017-11-01 10:56         ` Kirill A. Shutemov
2017-11-01 11:18           ` Andy Lutomirski
2017-11-01 22:21             ` Dave Hansen
2017-11-01 21:25   ` Thomas Gleixner
2017-11-01 22:24     ` Dave Hansen
2017-11-01 22:30       ` Thomas Gleixner
2017-10-31 22:31 ` [PATCH 05/23] x86, mm: document X86_CR4_PGE toggling behavior Dave Hansen
2017-10-31 23:31   ` Kees Cook
2017-10-31 22:31 ` [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas Dave Hansen
2017-11-01 21:47   ` Thomas Gleixner
2017-10-31 22:31 ` [PATCH 07/23] x86, kaiser: unmap kernel from userspace page tables (core patch) Dave Hansen
2017-10-31 22:32 ` [PATCH 08/23] x86, kaiser: only populate shadow page tables for userspace Dave Hansen
2017-10-31 23:35   ` Kees Cook
2017-10-31 22:32 ` [PATCH 09/23] x86, kaiser: allow NX to be set in p4d/pgd Dave Hansen
2017-10-31 22:32 ` [PATCH 10/23] x86, kaiser: make sure static PGDs are 8k in size Dave Hansen
2017-10-31 22:32 ` [PATCH 11/23] x86, kaiser: map GDT into user page tables Dave Hansen
2017-10-31 22:32 ` [PATCH 12/23] x86, kaiser: map dynamically-allocated LDTs Dave Hansen
2017-11-01  8:00   ` Andy Lutomirski
2017-11-01  8:06     ` Ingo Molnar
2017-10-31 22:32 ` [PATCH 13/23] x86, kaiser: map espfix structures Dave Hansen
2017-10-31 22:32 ` [PATCH 14/23] x86, kaiser: map entry stack variables Dave Hansen
2017-10-31 22:32 ` [PATCH 15/23] x86, kaiser: map trace interrupt entry Dave Hansen
2017-10-31 22:32 ` [PATCH 16/23] x86, kaiser: map debug IDT tables Dave Hansen
2017-10-31 22:32 ` [PATCH 17/23] x86, kaiser: map virtually-addressed performance monitoring buffers Dave Hansen
2017-10-31 22:32 ` [PATCH 18/23] x86, mm: Move CR3 construction functions Dave Hansen
2017-10-31 22:32 ` [PATCH 19/23] x86, mm: remove hard-coded ASID limit checks Dave Hansen
2017-10-31 22:32 ` [PATCH 20/23] x86, mm: put mmu-to-h/w ASID translation in one place Dave Hansen
2017-10-31 22:32 ` [PATCH 21/23] x86, pcid, kaiser: allow flushing for future ASID switches Dave Hansen
2017-11-01  8:03   ` Andy Lutomirski
2017-11-01 14:17     ` Dave Hansen
2017-11-01 20:31       ` Andy Lutomirski
2017-11-01 20:59         ` Dave Hansen
2017-11-01 21:04           ` Andy Lutomirski
2017-11-01 21:06             ` Dave Hansen
2017-10-31 22:32 ` [PATCH 22/23] x86, kaiser: use PCID feature to make user and kernel switches faster Dave Hansen
2017-10-31 22:32 ` [PATCH 23/23] x86, kaiser: add Kconfig Dave Hansen
2017-10-31 23:59   ` Kees Cook
2017-11-01  9:07     ` Borislav Petkov
2017-10-31 23:27 ` [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Linus Torvalds
2017-10-31 23:44   ` Dave Hansen
2017-11-01  0:21     ` Dave Hansen
2017-11-01  7:59     ` Andy Lutomirski
2017-11-01 16:08     ` Linus Torvalds
2017-11-01 17:31       ` Dave Hansen
2017-11-01 17:58         ` Randy Dunlap
2017-11-01 18:27         ` Linus Torvalds
2017-11-01 18:46           ` Dave Hansen
2017-11-01 19:05             ` Linus Torvalds
2017-11-01 20:33               ` Andy Lutomirski
2017-11-02  7:32                 ` Andy Lutomirski
2017-11-02  7:54                   ` Andy Lutomirski
2017-11-01 15:53   ` Dave Hansen
2017-11-01  8:54 ` Ingo Molnar
2017-11-01 14:09   ` Thomas Gleixner
2017-11-01 22:14   ` Dave Hansen
2017-11-01 22:28     ` Linus Torvalds
2017-11-02  8:03     ` Peter Zijlstra
2017-11-03 11:07     ` Kirill A. Shutemov
2017-11-02 19:01 ` Will Deacon
2017-11-02 19:38   ` Dave Hansen
2017-11-03 13:41     ` Will Deacon
2017-11-22 16:19 ` Pavel Machek
2017-11-23 10:47   ` Pavel Machek
