From: Nadav Amit <nadav.amit@gmail.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>,
"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>
Subject: Re: Potential race in TLB flush batching?
Date: Tue, 11 Jul 2017 03:40:02 -0700
Message-ID: <E37E0D40-821A-4C82-B924-F1CE6DF97719@gmail.com>
In-Reply-To: <20170711092935.bogdb4oja6v7kilq@suse.de>
Mel Gorman <mgorman@suse.de> wrote:
> On Tue, Jul 11, 2017 at 12:30:28AM -0700, Nadav Amit wrote:
>> Mel Gorman <mgorman@suse.de> wrote:
>>
>>> On Mon, Jul 10, 2017 at 05:52:25PM -0700, Nadav Amit wrote:
>>>> Something bothers me about the TLB flushes batching mechanism that Linux
>>>> uses on x86 and I would appreciate your opinion regarding it.
>>>>
>>>> As you know, try_to_unmap_one() can batch TLB invalidations. While doing so,
>>>> however, the page-table lock(s) are not held, and I see no indication of the
>>>> pending flush saved (and regarded) in the relevant mm-structs.
>>>>
>>>> So, my question: what prevents, at least in theory, the following scenario:
>>>>
>>>> CPU0                                CPU1
>>>> ----                                ----
>>>>                                     user accesses memory using RW PTE
>>>>                                     [PTE now cached in TLB]
>>>> try_to_unmap_one()
>>>> ==> ptep_get_and_clear()
>>>> ==> set_tlb_ubc_flush_pending()
>>>>                                     mprotect(addr, PROT_READ)
>>>>                                     ==> change_pte_range()
>>>>                                     ==> [ PTE non-present - no flush ]
>>>>
>>>>                                     user writes using cached RW PTE
>>>> ...
>>>> try_to_unmap_flush()
>>>>
>>>> As you can see, the write on CPU1 should have failed, but it may succeed.
>>>>
>>>> Now I don't have a PoC since in practice it seems hard to create such a
>>>> scenario: try_to_unmap_one() is likely to find the PTE accessed and the PTE
>>>> would not be reclaimed.
>>>
>>> That is the same as a race whereby there is no batching mechanism and the
>>> racing operation happens between a PTE clear and a flush, as ptep_clear_flush
>>> is not atomic. All that differs is that the race window is a different size.
>>> The application on CPU1 is buggy in that the write may or may not succeed,
>>> but it is buggy regardless of whether a batching mechanism is used or not.
>>
>> Thanks for your quick and detailed response, but I fail to see how it can
>> happen without batching. Indeed, the PTE clear and flush are not "atomic",
>> but without batching they are both performed under the page table lock
>> (which is acquired in page_vma_mapped_walk and released in
>> page_vma_mapped_walk_done). Since the lock is taken, other cores should not
>> be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
>> and change_pte_range, acquire the lock before accessing the PTEs.
>
> I was primarily thinking in terms of memory corruption or data loss.
> However, we are still protected although it's not particularly obvious why.
>
> On the reclaim side, we are either reclaiming clean pages (which ignore
> the accessed bit) or doing normal reclaim. If it's clean pages then any
> parallel write must update the dirty bit at minimum. If it's normal reclaim
> then the accessed bit is checked and, if cleared in try_to_unmap_one, it
> uses a ptep_clear_flush_young_notify so the TLB gets flushed. We don't
> reclaim the page in either case, whether as part of page_referenced or
> try_to_unmap_one, but clearing the accessed bit flushes the TLB.

Wait. Are you looking at the x86 arch function? The TLB is not flushed when
the accessed bit is cleared:
int ptep_clear_flush_young(struct vm_area_struct *vma,
			   unsigned long address, pte_t *ptep)
{
	/*
	 * On x86 CPUs, clearing the accessed bit without a TLB flush
	 * doesn't cause data corruption. [ It could cause incorrect
	 * page aging and the (mistaken) reclaim of hot pages, but the
	 * chance of that should be relatively low. ]
	 *
	 * So as a performance optimization don't flush the TLB when
	 * clearing the accessed bit, it will eventually be flushed by
	 * a context switch or a VM operation anyway. [ In the rare
	 * event of it not getting flushed for a long time the delay
	 * shouldn't really matter because there's no real memory
	 * pressure for swapout to react to. ]
	 */
	return ptep_test_and_clear_young(vma, address, ptep);
}

>
> On the mprotect side then, as the page was first accessed, clearing the
> accessed bit incurs a TLB flush on the reclaim side before the second write.
> That means any TLB entry that exists cannot have the accessed bit set so
> a second write needs to update it.
>
> While it's not clearly documented, I checked with hardware engineers
> at the time that an update of the accessed or dirty bit even with a TLB
> entry will check the underlying page tables and trap if it's not present
> and the subsequent fault will then fail on sigsegv if the VMA protections
> no longer allow the write.
>
> So, on one side, if the accessed bit is ignored during reclaim, the pages
> are clean, so any access will set the dirty bit and trap if unmapped in
> parallel. On the other side, if the accessed bit was set, clearing it
> flushed the TLB, and if it was not set, then the hardware needs to update
> it and again will trap if unmapped in parallel.

Yet, even regardless of the TLB flush, it seems there is still a possible
race:

CPU0                                    CPU1
----                                    ----
ptep_clear_flush_young_notify
==> PTE.A==0
                                        access PTE
                                        ==> PTE.A=1
ptep_get_and_clear
change mapping (and PTE)
                                        use stale TLB entry

Since clearing the accessed bit did not flush, the cached translation
survives and can still be used after the mapping has changed.

> If this guarantee from hardware was ever shown to be wrong or another
> architecture wanted to add batching without the same guarantee then mprotect
> would need to do a local_flush_tlb if no pages were updated by the mprotect
> but right now, this should not be necessary.
>
>> Can you please explain why you consider the application to be buggy?
>
> I considered it a bit dumb to mprotect for READ/NONE and then try writing
> the same mapping. However, it will behave as expected.
I don’t think that this is the only scenario. For example, the application
may create a new memory mapping of a different file using mmap at the same
memory address that was used before, just as that memory is reclaimed. The
application can (inadvertently) cause such a scenario by using MAP_FIXED.
But even without MAP_FIXED, running mmap->munmap->mmap can reuse the same
virtual address.
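
Just to illustrate that reuse pattern, here is a hypothetical userspace
sketch (not a PoC of the race itself; the size, flags and names are
arbitrary):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 20;

	/* First mapping: write to it so a TLB entry for a writable PTE
	 * can exist on this CPU. */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	p[0] = 1;

	/* Unmap and map again: the kernel is free to hand back the same
	 * virtual range, so an address the application used before now
	 * refers to a completely different mapping. If a deferred
	 * (batched) reclaim flush were still pending on another CPU,
	 * a stale entry could then cover this new, unrelated mapping. */
	munmap(p, len);
	char *q = mmap(NULL, len, PROT_READ,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (q == MAP_FAILED)
		return 1;

	printf("first=%p second=%p%s\n", p, q,
	       p == q ? " (same range reused)" : "");
	return 0;
}
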
>> AFAIU
>> an application can wish to trap certain memory accesses using userfaultfd or
>> SIGSEGV. For example, it may do it for garbage collection or sandboxing. To
>> do so, it can use mprotect with PROT_NONE and expect to be able to trap
>> future accesses to that memory. This use-case is described in userfaultfd
>> documentation.
>
> Such applications are safe due to how the accessed bit is handled by the
> software (flushes TLB if clearing young) and hardware (traps if updating
> the accessed or dirty bit and the underlying PTE was unmapped even if
> there is a TLB entry).
I don’t think it is so. And I also think there are many additional
potentially problematic scenarios.
Thanks for your patience,
Nadav