From: Nadav Amit <nadav.amit@gmail.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>
Subject: Re: Potential race in TLB flush batching?
Date: Wed, 12 Jul 2017 16:27:23 -0700	[thread overview]
Message-ID: <591A2865-13B8-4B3A-B094-8B83A7F9814B@gmail.com> (raw)
In-Reply-To: <20170712082733.ouf7yx2bnvwwcfms@suse.de>

Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Jul 11, 2017 at 03:27:55PM -0700, Nadav Amit wrote:
>> Mel Gorman <mgorman@suse.de> wrote:
>> 
>>> On Tue, Jul 11, 2017 at 09:09:23PM +0100, Mel Gorman wrote:
>>>> On Tue, Jul 11, 2017 at 08:18:23PM +0100, Mel Gorman wrote:
>>>>> I don't think we should be particularly clever about this and instead just
>>>>> flush the full mm if there is a risk of a parallel batching of flushing is
>>>>> in progress resulting in a stale TLB entry being used. I think tracking mms
>>>>> that are currently batching would end up being costly in terms of memory,
>>>>> fairly complex, or both. Something like this?
>>>> 
>>>> mremap and madvise(DONTNEED) would also need to flush. Memory policies are
>>>> fine as a move_pages call that hits the race will simply fail to migrate
>>>> a page that is being freed and once migration starts, it'll be flushed so
>>>> a stale access has no further risk. copy_page_range should also be ok as
>>>> the old mm is flushed and the new mm cannot have entries yet.
>>> 
>>> Adding those results in
>> 
>> You are way too fast for me.
>> 
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -637,12 +637,34 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>>> 		return false;
>>> 
>>> 	/* If remote CPUs need to be flushed then defer batch the flush */
>>> -	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
>>> +	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) {
>>> 		should_defer = true;
>>> +		mm->tlb_flush_batched = true;
>>> +	}
>> 
>> Since mm->tlb_flush_batched is set before the PTE is actually cleared, it
>> still seems to leave a short window for a race.
>> 
>> CPU0				CPU1
>> ---- 				----
>> should_defer_flush
>> => mm->tlb_flush_batched=true		
>> 				flush_tlb_batched_pending (another PT)
>> 				=> flush TLB
>> 				=> mm->tlb_flush_batched=false
>> ptep_get_and_clear
>> ...
>> 
>> 				flush_tlb_batched_pending (batched PT)
>> 				use the stale PTE
>> ...
>> try_to_unmap_flush
>> 
>> IOW it seems that mm->tlb_flush_batched should be set after the PTE is
>> cleared (and have some compiler barrier to be on the safe side).
> 
> I'm relying on the setting and clearing of tlb_flush_batched being under
> a PTL that is contended if the race is active.
> 
> If reclaim is first, it'll take the PTL, set batched while a racing
> mprotect/munmap/etc spins. On release, the racing mprotect/munmap
> immediately calls flush_tlb_batched_pending() before proceeding as normal,
> finding pte_none with the TLB flushed.

This is the scenario I described in my example. Notice that when the first
flush_tlb_batched_pending is called, CPU0 and CPU1 hold different page-table
locks, allowing them to run concurrently. As a result,
flush_tlb_batched_pending is executed before the PTE is cleared, and
mm->tlb_flush_batched is cleared. Later, after CPU0 runs ptep_get_and_clear,
mm->tlb_flush_batched remains clear, and CPU1 can use the stale PTE.

> If the mprotect/munmap/etc is first, it'll take the PTL, observe that
> pte_present and handle the flushing itself while reclaim potentially
> spins. When reclaim acquires the lock, it'll still set tlb_flush_batched.
> 
> As it's PTL that is taken for that field, it is possible for the accesses
> to be re-ordered but only in the case where a race is not occurring.
> I'll think some more about whether barriers are necessary but concluded
> they weren't needed in this instance. Doing the setting/clear+flush under
> the PTL, the protection is similar to normal page table operations that
> do not batch the flush.
> 
>> One more question, please: how does elevated page count or even locking the
>> page help (as you mention in regard to uprobes and ksm)? Yes, the page will
>> not be reclaimed, but IIUC try_to_unmap is called before the reference count
>> is frozen, and the page lock is dropped on each iteration of the loop in
>> shrink_page_list. In this case, it seems to me that uprobes or ksm may still
>> not flush the TLB.
> 
> If the page lock is held then reclaim skips the page entirely, and uprobe,
> ksm and cow hold the page lock for pages that may be observed by reclaim.
> That is the primary protection for those paths.

It is really hard, at least for me, to track this synchronization scheme, as
each path is protected by different means. I still don’t understand why it
is true, since the loop in shrink_page_list calls __ClearPageLocked(page) on
each iteration, before the actual flush takes place.

Actually, I think that based on Andy’s patches there is a relatively
reasonable solution. For each mm we would hold both a “pending_tlb_gen”
(increased under the PT-lock) and an “executed_tlb_gen”. Once
flush_tlb_mm_range finishes flushing, it will use cmpxchg to update
executed_tlb_gen to the pending_tlb_gen value sampled prior to the flush
(the cmpxchg ensures the TLB gen only goes forward). Then, whenever
pending_tlb_gen differs from executed_tlb_gen, a flush is needed.

Nadav 

