From: Andrea Arcangeli <aarcange@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Nadav Amit <nadav.amit@gmail.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Nadav Amit <namit@vmware.com>, Yu Zhao <yuzhao@google.com>,
Andy Lutomirski <luto@kernel.org>, Peter Xu <peterx@redhat.com>,
Pavel Emelyanov <xemul@openvz.org>,
Mike Kravetz <mike.kravetz@oracle.com>,
Mike Rapoport <rppt@linux.vnet.ibm.com>,
Minchan Kim <minchan@kernel.org>, Will Deacon <will@kernel.org>,
Mel Gorman <mgorman@suse.de>
Subject: Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
Date: Tue, 5 Jan 2021 12:58:55 -0500 [thread overview]
Message-ID: <X/So39qVhYYzDZ+8@redhat.com> (raw)
In-Reply-To: <20210105085857.GE3040@hirez.programming.kicks-ass.net>
On Tue, Jan 05, 2021 at 09:58:57AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote:
> > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> > >
> > > > The scenario that happens in selftests/vm/userfaultfd is as follows:
> > > >
> > > > cpu0 cpu1 cpu2
> > > > ---- ---- ----
> > > > [ Writable PTE
> > > > cached in TLB ]
> > > > userfaultfd_writeprotect()
> > > > [ write-*unprotect* ]
> > > > mwriteprotect_range()
> > > > mmap_read_lock()
> > > > change_protection()
> > > >
> > > > change_protection_range()
> > > > ...
> > > > change_pte_range()
> > > > [ *clear* “write”-bit ]
> > > > [ defer TLB flushes ]
> > > > [ page-fault ]
> > > > ...
> > > > wp_page_copy()
> > > > cow_user_page()
> > > > [ copy page ]
> > > > [ write to old
> > > > page ]
> > > > ...
> > > > set_pte_at_notify()
> > >
> > > Yuck!
> > >
> >
> > Note, the above was posted before we figured out the details so it
> > wasn't showing the real deferred tlb flush that caused problems (the
> > one showed on the left causes zero issues).
> >
> > The problematic one not pictured is the one of the wrprotect that has
> > to be running in another CPU which is also isn't picture above. More
> > accurate traces are posted later in the thread.
>
> Lets assume CPU0 does a read-lock, W -> RO with deferred flush.
I was mistaken saying the deferred tlb flush was not shown in the v2
trace, just this appears a new different case we didn't happen to
consider before.
In the previous case we discussed earlier, when un-wrprotect above is
called it never should have been a W->RO since a wrprotect run first.
Doesn't it ring a bell that if an un-wrprotect does a W->RO
transition, something is a bit going backwards?
I don't recall from previous discussion that un-wrprotect was
considered as called on read-write memory. I think we need the below
change to fix this new case.
if (uffd_wp) {
+ if (unlikely(pte_uffd_wp(oldpte)))
+ continue;
ptent = pte_wrprotect(ptent);
ptent = pte_mkuffd_wp(ptent);
} else if (uffd_wp_resolve) {
+ if (unlikely(!pte_uffd_wp(oldpte)))
+ continue;
/*
* Leave the write bit to be handled
* by PF interrupt handler, then
* things like COW could be properly
* handled.
*/
ptent = pte_clear_uffd_wp(ptent);
}
I now get why the v2 patch touches preserved_write, but this is not
about preserve_write, it's not about leaving the write bit alone. This
is about leaving the whole pte alone if the uffd-wp bit doesn't
actually change.
We shouldn't just defer the tlb flush if un-wprotect is called on
read-write memory: we should not have flushed the tlb at all in such
case.
Same for hugepmd in huge_memory.c which will be somewhere else.
Once the above is optimized, then un-wrprotect as in
MM_CP_UFFD_WP_RESOLVE is usually preceded by wrprotect as in
MM_CP_UFFD_WP, and so it'll never be a W->RO but a RO->RO transition
that just clears the uffd_wp flag and nothing else and whose tlb flush
is in turn irrelevant.
The fix discussed still works for this new case too: I'm not
suggesting we should rely on the above optimization for the tlb
safety. The above is just a missing optimization.
> > > Isn't this all rather similar to the problem that resulted in the
> > > tlb_flush_pending mess?
> > >
> > > I still think that's all fundamentally buggered, the much saner solution
> > > (IMO) would've been to make things wait for the pending flush, instead
> >
> > How do intend you wait in PT lock while the writer also has to take PT
> > lock repeatedly before it can do wake_up_var?
> >
> > If you release the PT lock before calling wait_tlb_flush_pending it
> > all falls apart again.
>
> I suppose you can check for pending, if found, release lock, wait for 0,
> and re-take the fault?
Aborting the page fault unconditionally while MADV_DONTNEED is running
on some other unrelated vma, sounds not desirable.
Doing it only for !VM_SOFTDIRTY or soft dirty not compiled in sounds
less bad but it would still mean that while clear_refs is running, no
thread can write to any anon memory of the process.
> > This I guess explains why a local pte/hugepmd smp local invlpg is the
> > only working solution for this issue, similarly to how it's done in rmap.
>
> In that case a local invalidate on CPU1 simply doesn't help anything.
>
> CPU1 needs to do a global invalidate or wait for the in-progress one to
> complete, such that CPU2 is sure to not have a W entry left before CPU1
> goes and copies the page.
Yes, it was a global invlpg, definitely not local sorry for the
confusion, as in the PoC posted here which needs cleaning up:
https://lkml.kernel.org/r/X+QLr1WmGXMs33Ld@redhat.com
+ flush_tlb_page(vma, vmf->address);
I think instead of the flush_tlb_page above, we just need an
ad-hoc abstraction there.
The added complexity to the page fault common code consist in having
to call such abstract call in the right place of the page fault.
The vm_flags to check will be the same for both the flush_tlb_page and
the wait_tlb_pending approaches.
Once the filter on vm_flags pass, the only difference is between
"flush_tlb_page; return void" or "PT unlock; wait_; return
VM_FAULT_RETRY" so it looks more an implementation detail with a
different tradeoff at runtime.
Thanks,
Andrea
next prev parent reply other threads:[~2021-01-05 17:59 UTC|newest]
Thread overview: 95+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-12-25 9:25 [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Nadav Amit
2020-12-25 9:25 ` [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect Nadav Amit
2021-01-04 12:22 ` Peter Zijlstra
2021-01-04 19:24 ` Andrea Arcangeli
2021-01-04 19:35 ` Nadav Amit
2021-01-04 20:19 ` Andrea Arcangeli
2021-01-04 20:39 ` Nadav Amit
2021-01-04 21:01 ` Andrea Arcangeli
2021-01-04 21:26 ` Nadav Amit
2021-01-05 18:45 ` Andrea Arcangeli
2021-01-05 19:05 ` Nadav Amit
2021-01-05 19:45 ` Andrea Arcangeli
2021-01-05 20:06 ` Nadav Amit
2021-01-05 21:06 ` Andrea Arcangeli
2021-01-05 21:43 ` Peter Xu
2021-01-05 8:13 ` Peter Zijlstra
2021-01-05 8:52 ` Nadav Amit
2021-01-05 14:26 ` Peter Zijlstra
2021-01-05 8:58 ` Peter Zijlstra
2021-01-05 9:22 ` Nadav Amit
2021-01-05 17:58 ` Andrea Arcangeli [this message]
2021-01-05 15:08 ` Peter Xu
2021-01-05 18:08 ` Andrea Arcangeli
2021-01-05 18:41 ` Peter Xu
2021-01-05 18:55 ` Andrea Arcangeli
2021-01-05 19:07 ` Nadav Amit
2021-01-05 19:43 ` Peter Xu
2020-12-25 9:25 ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Nadav Amit
2021-01-05 15:08 ` Will Deacon
2021-01-05 18:20 ` Andrea Arcangeli
2021-01-05 19:26 ` Nadav Amit
2021-01-05 20:39 ` Andrea Arcangeli
2021-01-05 21:20 ` Yu Zhao
2021-01-05 21:22 ` Nadav Amit
2021-01-05 22:16 ` Will Deacon
2021-01-06 0:29 ` Andrea Arcangeli
2021-01-06 0:02 ` Andrea Arcangeli
2021-01-07 20:04 ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Andrea Arcangeli
2021-01-07 20:04 ` [PATCH 1/2] mm: proc: Invalidate TLB after clearing soft-dirty page state Andrea Arcangeli
2021-01-07 20:04 ` [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending Andrea Arcangeli
2021-01-07 20:17 ` Linus Torvalds
2021-01-07 20:25 ` Linus Torvalds
2021-01-07 20:58 ` Andrea Arcangeli
2021-01-07 21:29 ` Linus Torvalds
2021-01-07 21:53 ` John Hubbard
2021-01-07 22:00 ` Linus Torvalds
2021-01-07 22:14 ` John Hubbard
2021-01-07 22:20 ` Linus Torvalds
2021-01-07 22:24 ` Linus Torvalds
2021-01-07 22:37 ` John Hubbard
2021-01-15 11:27 ` Jan Kara
2021-01-07 22:31 ` Andrea Arcangeli
2021-01-07 22:42 ` Linus Torvalds
2021-01-07 22:51 ` Linus Torvalds
2021-01-07 23:48 ` Andrea Arcangeli
2021-01-08 0:25 ` Linus Torvalds
2021-01-08 12:48 ` Will Deacon
2021-01-08 16:14 ` Andrea Arcangeli
2021-01-08 17:39 ` Linus Torvalds
2021-01-08 17:53 ` Andrea Arcangeli
2021-01-08 19:25 ` Linus Torvalds
2021-01-09 0:12 ` Andrea Arcangeli
2021-01-08 17:30 ` Linus Torvalds
2021-01-07 23:28 ` Andrea Arcangeli
2021-01-07 21:36 ` kernel test robot
2021-01-07 20:25 ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Jason Gunthorpe
2021-01-07 20:32 ` Linus Torvalds
2021-01-07 21:05 ` Linus Torvalds
2021-01-07 22:02 ` Andrea Arcangeli
2021-01-07 22:17 ` Linus Torvalds
2021-01-07 22:56 ` Andrea Arcangeli
2021-01-09 19:32 ` Matthew Wilcox
2021-01-09 19:46 ` Linus Torvalds
2021-01-15 14:30 ` Jan Kara
2021-01-07 21:54 ` Andrea Arcangeli
2021-01-07 21:45 ` Andrea Arcangeli
2021-01-08 13:36 ` Jason Gunthorpe
2021-01-08 17:00 ` Andrea Arcangeli
2021-01-08 18:19 ` Jason Gunthorpe
2021-01-08 18:31 ` Andy Lutomirski
2021-01-08 18:38 ` Linus Torvalds
2021-01-08 23:34 ` Andrea Arcangeli
2021-01-09 19:03 ` Andy Lutomirski
2021-01-09 19:15 ` Linus Torvalds
2021-01-08 18:59 ` Linus Torvalds
2021-01-08 22:43 ` Andrea Arcangeli
2021-01-09 0:42 ` Jason Gunthorpe
2021-01-09 2:50 ` Andrea Arcangeli
2021-01-11 14:30 ` Jason Gunthorpe
2021-01-13 21:56 ` Jerome Glisse
2021-01-13 23:39 ` Jason Gunthorpe
2021-01-14 2:35 ` Jerome Glisse
2021-01-09 3:49 ` Hillf Danton
2021-01-11 14:39 ` Jason Gunthorpe
2021-01-05 21:55 ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Peter Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=X/So39qVhYYzDZ+8@redhat.com \
--to=aarcange@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=luto@kernel.org \
--cc=mgorman@suse.de \
--cc=mike.kravetz@oracle.com \
--cc=minchan@kernel.org \
--cc=nadav.amit@gmail.com \
--cc=namit@vmware.com \
--cc=peterx@redhat.com \
--cc=peterz@infradead.org \
--cc=rppt@linux.vnet.ibm.com \
--cc=will@kernel.org \
--cc=xemul@openvz.org \
--cc=yuzhao@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox