From: Nadav Amit <nadav.amit@gmail.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>,
"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>
Subject: Re: Potential race in TLB flush batching?
Date: Tue, 11 Jul 2017 03:40:02 -0700
Message-ID: <E37E0D40-821A-4C82-B924-F1CE6DF97719@gmail.com>
In-Reply-To: <20170711092935.bogdb4oja6v7kilq@suse.de>
Mel Gorman <mgorman@suse.de> wrote:
> On Tue, Jul 11, 2017 at 12:30:28AM -0700, Nadav Amit wrote:
>> Mel Gorman <mgorman@suse.de> wrote:
>>
>>> On Mon, Jul 10, 2017 at 05:52:25PM -0700, Nadav Amit wrote:
>>>> Something bothers me about the TLB flushes batching mechanism that Linux
>>>> uses on x86 and I would appreciate your opinion regarding it.
>>>>
>>>> As you know, try_to_unmap_one() can batch TLB invalidations. While doing so,
>>>> however, the page-table lock(s) are not held, and I see no indication of the
>>>> pending flush saved (and regarded) in the relevant mm-structs.
>>>>
>>>> So, my question: what prevents, at least in theory, the following scenario:
>>>>
>>>> CPU0                                CPU1
>>>> ----                                ----
>>>>                                     user accesses memory using RW PTE
>>>>                                     [PTE now cached in TLB]
>>>> try_to_unmap_one()
>>>> ==> ptep_get_and_clear()
>>>> ==> set_tlb_ubc_flush_pending()
>>>>                                     mprotect(addr, PROT_READ)
>>>>                                     ==> change_pte_range()
>>>>                                     ==> [ PTE non-present - no flush ]
>>>>
>>>>                                     user writes using cached RW PTE
>>>> ...
>>>> try_to_unmap_flush()
>>>>
>>>> As you can see, the write on CPU1 should have failed, but it may succeed.
>>>>
>>>> Now I don't have a PoC since in practice it seems hard to create such a
>>>> scenario: try_to_unmap_one() is likely to find the PTE accessed and the PTE
>>>> would not be reclaimed.
>>>
>>> That is the same as a race whereby there is no batching mechanism and the
>>> racing operation happens between a PTE clear and a flush, as ptep_clear_flush
>>> is not atomic. All that differs is that the race window is a different size.
>>> The application on CPU1 is buggy in that the write may or may not succeed,
>>> but it is buggy regardless of whether a batching mechanism is used or not.
>>
>> Thanks for your quick and detailed response, but I fail to see how it can
>> happen without batching. Indeed, the PTE clear and flush are not "atomic",
>> but without batching they are both performed under the page table lock
>> (which is acquired in page_vma_mapped_walk and released in
>> page_vma_mapped_walk_done). Since the lock is taken, other cores should not
>> be able to inspect/modify the PTE. Relevant functions, e.g., zap_pte_range
>> and change_pte_range, acquire the lock before accessing the PTEs.
>
> I was primarily thinking in terms of memory corruption or data loss.
> However, we are still protected although it's not particularly obvious why.
>
> On the reclaim side, we are either reclaiming clean pages (which ignore
> the accessed bit) or doing normal reclaim. If it's clean pages then any
> parallel write must update the dirty bit at minimum. If it's normal reclaim
> then the accessed bit is checked and, if cleared in try_to_unmap_one, it
> uses a ptep_clear_flush_young_notify so the TLB gets flushed. We don't
> reclaim the page in either case, whether as part of page_referenced or
> try_to_unmap_one, but clearing the accessed bit flushes the TLB.

Wait. Are you looking at the x86 arch function? The TLB is not flushed when
the accessed bit is cleared:
int ptep_clear_flush_young(struct vm_area_struct *vma,
			   unsigned long address, pte_t *ptep)
{
	/*
	 * On x86 CPUs, clearing the accessed bit without a TLB flush
	 * doesn't cause data corruption. [ It could cause incorrect
	 * page aging and the (mistaken) reclaim of hot pages, but the
	 * chance of that should be relatively low. ]
	 *
	 * So as a performance optimization don't flush the TLB when
	 * clearing the accessed bit, it will eventually be flushed by
	 * a context switch or a VM operation anyway. [ In the rare
	 * event of it not getting flushed for a long time the delay
	 * shouldn't really matter because there's no real memory
	 * pressure for swapout to react to. ]
	 */
	return ptep_test_and_clear_young(vma, address, ptep);
}

>
> On the mprotect side then, as the page was first accessed, clearing the
> accessed bit incurs a TLB flush on the reclaim side before the second write.
> That means any TLB entry that exists cannot have the accessed bit set so
> a second write needs to update it.
>
> While it's not clearly documented, I checked with hardware engineers
> at the time that an update of the accessed or dirty bit even with a TLB
> entry will check the underlying page tables and trap if it's not present
> and the subsequent fault will then fail on sigsegv if the VMA protections
> no longer allow the write.
>
> So, on one side, if the accessed bit is ignored during reclaim, the pages
> are clean, so any access will set the dirty bit and trap if unmapped in
> parallel. On the other side, if the accessed bit was set, clearing it
> flushed the TLB, and if it was not set, then the hardware needs to update
> it and again will trap if unmapped in parallel.

Yet, even regardless of the TLB flush, it seems there is still a possible
race:

CPU0                                    CPU1
----                                    ----
ptep_clear_flush_young_notify
==> PTE.A==0
                                        access PTE
                                        ==> PTE.A=1
ptep_get_and_clear
change mapping (and PTE)
                                        use stale TLB entry

Since clearing the accessed bit did not flush, the cached translation
survives and can still be used after the mapping has changed.

> If this guarantee from hardware was ever shown to be wrong or another
> architecture wanted to add batching without the same guarantee then mprotect
> would need to do a local_flush_tlb if no pages were updated by the mprotect
> but right now, this should not be necessary.
>
>> Can you please explain why you consider the application to be buggy?
>
> I considered it a bit dumb to mprotect for READ/NONE and then try writing
> the same mapping. However, it will behave as expected.
I don’t think that this is the only scenario. For example, the application
may create a new memory mapping of a different file using mmap at the same
memory address that was used before, just as that memory is reclaimed. The
application can (inadvertently) cause such a scenario by using MAP_FIXED.
But even without MAP_FIXED, running mmap->munmap->mmap can reuse the same
virtual address.
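
Just to illustrate that reuse pattern, here is a hypothetical userspace
sketch (not a PoC of the race itself; the size, flags and names are
arbitrary):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 20;

	/* First mapping: write to it so a TLB entry for a writable PTE
	 * can exist on this CPU. */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	p[0] = 1;

	/* Unmap and map again: the kernel is free to hand back the same
	 * virtual range, so an address the application used before now
	 * refers to a completely different mapping. If a deferred
	 * (batched) reclaim flush were still pending on another CPU,
	 * a stale entry could then cover this new, unrelated mapping. */
	munmap(p, len);
	char *q = mmap(NULL, len, PROT_READ,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (q == MAP_FAILED)
		return 1;

	printf("first=%p second=%p%s\n", p, q,
	       p == q ? " (same range reused)" : "");
	return 0;
}
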
>> AFAIU
>> an application can wish to trap certain memory accesses using userfaultfd or
>> SIGSEGV. For example, it may do it for garbage collection or sandboxing. To
>> do so, it can use mprotect with PROT_NONE and expect to be able to trap
>> future accesses to that memory. This use-case is described in userfaultfd
>> documentation.
>
> Such applications are safe due to how the accessed bit is handled by the
> software (flushes TLB if clearing young) and hardware (traps if updating
> the accessed or dirty bit and the underlying PTE was unmapped even if
> there is a TLB entry).
I don’t think it is so. And I also think there are many additional
potentially problematic scenarios.
Thanks for your patience,
Nadav