From: Dave Hansen <dave.hansen@intel.com>
To: Tycho Andersen <tycho@docker.com>, linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, kernel-hardening@lists.openwall.com,
	Marco Benatto <marco.antonio.780@gmail.com>,
	Juerg Haefliger <juerg.haefliger@canonical.com>,
	x86@kernel.org
Subject: Re: [PATCH v6 03/11] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
Date: Wed, 20 Sep 2017 08:48:36 -0700	[thread overview]
Message-ID: <34454a32-72c2-c62e-546c-1837e05327e1@intel.com> (raw)
In-Reply-To: <20170907173609.22696-4-tycho@docker.com>

On 09/07/2017 10:36 AM, Tycho Andersen wrote:
...
> Whenever a page destined for userspace is allocated, it is
> unmapped from physmap (the kernel's page table). When such a page is
> reclaimed from userspace, it is mapped back to physmap.

I'm looking for the code where a page is unmapped when it is
allocated to userspace.  I see TLB flushing and 'struct xpfo'
manipulation, but I don't see the unmapping itself.  Where is that
happening?
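
For reference, what I expected to find somewhere under the allocator
is something along these lines (hypothetical, reusing the set_kpte()
helper this patch adds):

	/*
	 * Hypothetical: drop the physmap (kernel) mapping for a page
	 * headed to userspace; __pgprot(0) clears _PAGE_PRESENT.
	 */
	set_kpte(page_address(page), page, __pgprot(0));
	xpfo_flush_kernel_tlb(page, order);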

How badly does this hurt performance?  Since we (generally) have
different migrate types for user and kernel allocations, a given page
shouldn't oscillate between user- and kernel-allocated very often,
but I'm curious how that works out in practice.  Doesn't the IPI load
from all the TLB flushes eat you alive?

It's a bit scary to have such a deep code path under the main allocator.

This all seems insanely expensive.  It will *barely* work on an
allocation-heavy workload on a desktop.  I'm pretty sure the locking
will just fall over entirely on any reasonably-sized server.

I really have to wonder whether there are better ret2dir defenses than
this.  The allocator just seems like the *wrong* place to be doing this
because it's such a hot path.

> +		cpa.vaddr = kaddr;
> +		cpa.pages = &page;
> +		cpa.mask_set = prot;
> +		cpa.mask_clr = msk_clr;
> +		cpa.numpages = 1;
> +		cpa.flags = 0;
> +		cpa.curpage = 0;
> +		cpa.force_split = 0;
> +
> +
> +		do_split = try_preserve_large_page(pte, (unsigned 

Is this safe to do without a TLB flush?  I thought we had plenty of
CPU bugs around having multiple TLB entries for the same address at
once.  We're *REALLY* careful when we split large pages for THP, and
I'm surprised we don't do the same here.
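
For the record, the careful pattern (as I understand it; this is a
sketch with assumed variable names, not the actual pageattr.c code)
keeps the translation identical while the tables are switched, then
flushes:

	/*
	 * Sketch: the 4k ptes below must translate to exactly the
	 * same frames/protections as the old 2M entry, so a CPU
	 * still holding the stale large entry gets a correct
	 * translation until the flush catches up.
	 */
	for (i = 0; i < PTRS_PER_PTE; i++)
		set_pte(&pbase[i], pfn_pte(pfn + i, ref_prot));
	set_pmd(pmd, __pmd(__pa(pbase) | _KERNPG_TABLE));
	flush_tlb_kernel_range(kaddr, kaddr + PMD_SIZE);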

Why do you even bother keeping large pages around?  Won't the entire
kernel just degrade to using 4k everywhere, eventually?

> +		if (do_split) {
> +			struct page *base;
> +
> +			base = alloc_pages(GFP_ATOMIC | __GFP_NOTRACK, 

Ugh, GFP_ATOMIC.  That's nasty.  Do you really want this allocation to
fail all the time?  GFP_ATOMIC could really be called
GFP_YOU_BETTER_BE_OK_WITH_THIS_FAILING. :)

You probably want to do what the THP code does here and keep a spare
page around, then allocate it before you take the locks.
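
Something like this pattern, roughly (a sketch only; cpa_lock is a
stand-in for whatever lock actually protects the split, and error
handling is elided):

	/*
	 * Sketch: allocate the spare up front, outside the lock, so
	 * the locked path never has to allocate.
	 */
	struct page *spare = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);

	spin_lock(&cpa_lock);
	if (do_split) {
		base = spare;
		spare = NULL;
	}
	spin_unlock(&cpa_lock);

	if (spare)
		__free_pages(spare, 0);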

> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
> +{
> +	int level;
> +	unsigned long size, kaddr;
> +
> +	kaddr = (unsigned long)page_address(page);
> +
> +	if (unlikely(!lookup_address(kaddr, &level))) {
> +		WARN(1, "xpfo: invalid address to flush %lx %d\n", kaddr, level);
> +		return;
> +	}
> +
> +	switch (level) {
> +	case PG_LEVEL_4K:
> +		size = PAGE_SIZE;
> +		break;
> +	case PG_LEVEL_2M:
> +		size = PMD_SIZE;
> +		break;
> +	case PG_LEVEL_1G:
> +		size = PUD_SIZE;
> +		break;
> +	default:
> +		WARN(1, "xpfo: unsupported page level %x\n", level);
> +		return;
> +	}
> +
> +	flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
> +}

I'm not sure flush_tlb_kernel_range() is the best primitive to be
calling here.

Let's say you walk the page tables and find level=PG_LEVEL_1G.  If
you then call flush_tlb_kernel_range(), you will be above
tlb_single_page_flush_ceiling and will do a full TLB flush.  But with
a 1GB page, you could have just used a single INVLPG and skipped the
global flush.
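
Something like this, maybe (a sketch; __flush_tlb_one() is the local
single-entry flush, and the cross-CPU IPI plumbing is omitted):

	/*
	 * Sketch: one INVLPG covers an entire 2M/1G entry, so only
	 * fall back to the range flush for 4k mappings.
	 */
	if (level != PG_LEVEL_4K)
		__flush_tlb_one(kaddr);
	else
		flush_tlb_kernel_range(kaddr,
				       kaddr + (1 << order) * PAGE_SIZE);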

I guess the cost of the IPI is way more than the flush itself, but it's
still a shame to toss the entire TLB when you don't have to.

I also think the TLB flush should be done closer to the page table
manipulation that it is connected to.  It's hard to figure out whether
the flush is the right one otherwise.

Also, the "(1 << order) * size" thing looks goofy to me.  Let's say you
are flushing a order=1 (8k) page and its mapped in a 1GB mapping.  You
flush 2GB.  Is that intentional?
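
Working through that case:

	/*
	 * order = 1 (8k), level = PG_LEVEL_1G:
	 *   size         = PUD_SIZE   = 1 GB
	 *   (1 << order) = 2
	 *   flush length = 2 * 1 GB   = 2 GB
	 * The whole 8k allocation sits inside a single 1GB entry,
	 * so flushing 1GB (one entry) would already have covered it.
	 */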

> +
> +void xpfo_free_pages(struct page *page, int order)
> +{
...
> +		/*
> +		 * Map the page back into the kernel if it was previously
> +		 * allocated to user space.
> +		 */
> +		if (test_and_clear_bit(XPFO_PAGE_USER, &xpfo->flags)) {
> +			clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
> +			set_kpte(page_address(page + i), page + i,
> +				 PAGE_KERNEL);
> +		}
> +	}
> +}

This seems like a bad idea, performance-wise.  Kernel and userspace
pages tend to be separated by migrate types.  So, a given physical page
will tend to be used as kernel *or* for userspace.  With this nugget,
every time a userspace page is freed, we will go to the trouble of
making it *back* into a kernel page.  Then, when it is allocated again
(probably as userspace), we will re-make it into a userspace page.  That
seems horribly inefficient.

Also, this weakens the security guarantees.  Let's say you're mounting a
ret2dir attack.  You populate a page with your evil data and you know
the kernel address for the page.  All you have to do is coordinate your
attack with freeing the page, and you control when that happens.  Now
xpfo_free_pages() has helpfully mapped your attack data back into the
kernel.

Why not *just* do these moves at allocation time?
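
Something in this direction, perhaps (a sketch only; lookup_xpfo()
and the flag names come from this patch, but xpfo_alloc_pages() as
written here and the user-page test are my invention):

	/* Sketch: flip the physmap mapping at allocation time only. */
	void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp)
	{
		int i;

		for (i = 0; i < (1 << order); i++) {
			struct xpfo *xpfo = lookup_xpfo(page + i);

			/* placeholder predicate for "user page" */
			if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
				/* Headed to userspace: unmap from physmap. */
				set_bit(XPFO_PAGE_USER, &xpfo->flags);
				set_kpte(page_address(page + i), page + i,
					 __pgprot(0));
			} else if (test_and_clear_bit(XPFO_PAGE_USER,
						      &xpfo->flags)) {
				/*
				 * Kernel allocation of a former user
				 * page: map it back now, not at free.
				 */
				set_kpte(page_address(page + i), page + i,
					 PAGE_KERNEL);
			}
		}
		xpfo_flush_kernel_tlb(page, order);
	}

That way a freed user page stays unmapped while it sits on the free
lists, and the set_kpte()/flush expense is only paid when a page
actually changes sides.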
