From: Mike Rapoport <rppt@linux.ibm.com>
To: Peter Xu <peterx@redhat.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Hugh Dickins <hughd@google.com>, Maya Gokhale <gokhale2@llnl.gov>,
Jerome Glisse <jglisse@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Martin Cracauer <cracauer@cons.org>,
Denis Plotnikov <dplotnikov@virtuozzo.com>,
Shaohua Li <shli@fb.com>, Andrea Arcangeli <aarcange@redhat.com>,
Pavel Emelyanov <xemul@parallels.com>,
Mike Kravetz <mike.kravetz@oracle.com>,
Marty McFadden <mcfadden8@llnl.gov>,
Mike Rapoport <rppt@linux.vnet.ibm.com>,
Mel Gorman <mgorman@suse.de>,
"Kirill A . Shutemov" <kirill@shutemov.name>,
"Dr . David Alan Gilbert" <dgilbert@redhat.com>
Subject: Re: [PATCH RFC 07/24] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
Date: Fri, 25 Jan 2019 09:54:53 +0200
Message-ID: <20190125075453.GF31519@rapoport-lnx>
In-Reply-To: <20190124092848.GL18231@xz-x1>
On Thu, Jan 24, 2019 at 05:28:48PM +0800, Peter Xu wrote:
> On Thu, Jan 24, 2019 at 09:27:07AM +0200, Mike Rapoport wrote:
> > On Thu, Jan 24, 2019 at 12:56:15PM +0800, Peter Xu wrote:
> > > On Mon, Jan 21, 2019 at 12:42:33PM +0200, Mike Rapoport wrote:
> > >
> > > [...]
> > >
> > > > > @@ -1343,7 +1344,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > >
> > > > > /* check not compatible vmas */
> > > > > ret = -EINVAL;
> > > > > - if (!vma_can_userfault(cur))
> > > > > + if (!vma_can_userfault(cur, vm_flags))
> > > > > goto out_unlock;
> > > > >
> > > > > /*
> > > > > @@ -1371,6 +1372,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > > if (end & (vma_hpagesize - 1))
> > > > > goto out_unlock;
> > > > > }
> > > > > + if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_WRITE))
> > > > > + goto out_unlock;
> > > >
> > > > > This is problematic for the non-cooperative use-case. We may still want to
> > > > monitor a read-only area because it may eventually become writable, e.g. if
> > > > the monitored process runs mprotect().
> > >
> > > Firstly I think I should be able to change it to VM_MAYWRITE which
> > > > seems to suit better.
> > >
> > > Meanwhile, frankly speaking I didn't think a lot about how to nest the
> > > usages of uffd-wp and mprotect(), so far I was only considering it as
> > > a replacement of mprotect(). But indeed it can happen that the
> > > monitored process calls mprotect(). Is there an existing scenario of
> > > such usage?
> > >
> > > The problem is I'm uncertain about whether this scenario can work
> > > after all. Say, the monitor process A write protected process B's
> > > page P, so logically A will definitely receive a message before B
> > > writes to page P. However here if we allow process B to do
> > > mprotect(PROT_WRITE) upon page P and grant write permission to it on
> > > its own, then A will not be able to capture the write operation at
> > > all? Then I don't know how it can work here... or whether we should
> > > fail the mprotect() at least upon uffd-wp ranges?
> >
> > The use-case we've discussed a while ago was to use uffd-wp instead of
> > soft-dirty for tracking memory changes in CRIU for pre-copy migration.
> > Currently, we enable soft-dirty for the migrated process and monitor
> > /proc/pid/pagemap between memory dump iterations to see what memory pages
> > have been changed.
> > With uffd-wp we thought to register all the process memory with uffd-wp and
> > then track changes with uffd-wp notifications. Back then it was considered
> > only at the very general level without paying much attention to details.
> >
> > So my initial thought was that we do register the entire memory with
> > uffd-wp. If an area changes from RO to RW at some point, uffd-wp will
> > generate notifications to the monitor, it would be able to notice the
> > change and the write will continue normally.
> >
> > If we are to limit uffd-wp register only to VMAs with VM_WRITE and even
> > VM_MAYWRITE, we'd need a way to handle the possible changes of VMA
> > protection and an ability to add monitoring for areas that changed from RO
> > to RW.
> >
> > Can't say I have a clear picture in mind at the moment, will continue to
> > think about it.
>
> Thanks for these details. Though I have a question about how it's
> used.
>
> Since we're talking about replacing soft dirty with uffd-wp here, I
> noticed that there's a major interface difference between soft-dirty
> and uffd-wp: the soft-dirty was all about /proc operations so a
> monitor process can easily monitor mostly any process on the system as
> long as knowing its PID. However I'm unsure about uffd-wp since
> userfaultfd was always bound to a mm_struct. For example, the syscall
> userfaultfd() will always attach the current process mm_struct to the
> newly created userfaultfd but it cannot be attached to another random
> mm_struct of other processes. Or is there any way that the CRIU
> monitor process can gain a userfaultfd of any process of the system
> somehow?

Yes, there is. For CRIU to read the process state during snapshot (or on
the source in case of migration) we inject parasite code into the
victim process. The parasite code communicates with the "main" CRIU monitor
via a UNIX socket to pass information that cannot be obtained from outside.

For the uffd-wp usage we thought about creating the uffd context in the
parasite code, registering the memory and passing the userfault file
descriptor to the CRIU core via that UNIX socket.
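
The descriptor hand-off described above can be sketched with SCM_RIGHTS
ancillary data. This is only an illustration, not CRIU's actual parasite
code: send_fd() and recv_fd() are hypothetical helper names, and any open
descriptor can stand in for the one the parasite would get from
userfaultfd().

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: pass one descriptor over a connected UNIX socket. */
static int send_fd(int sock, int fd)
{
	char dummy = 'F';	/* one byte of ordinary data to carry the control message */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment of buf */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor, or return -1 on failure. */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;	/* a new descriptor referring to the same open file */
}
```

The receiver gets its own descriptor number for the same open file, so
the monitor would be able to issue ioctls on the uffd context that was
created inside the victim.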
> >
> > > > > In particular, using uffd-wp as a replacement for soft-dirty would
> > > > > require it.
> > > >
> > > > >
> > > > > /*
> > > > > * Check that this vma isn't already owned by a
> > > > > @@ -1400,7 +1403,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > > do {
> > > > > cond_resched();
> > > > >
> > > > > - BUG_ON(!vma_can_userfault(vma));
> > > > > + BUG_ON(!vma_can_userfault(vma, vm_flags));
> > > > > BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
> > > > > vma->vm_userfaultfd_ctx.ctx != ctx);
> > > > > WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
> > > > > @@ -1760,6 +1763,46 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
> > > > > return ret;
> > > > > }
> > > > >
> > > > > +static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> > > > > + unsigned long arg)
> > > > > +{
> > > > > + int ret;
> > > > > + struct uffdio_writeprotect uffdio_wp;
> > > > > + struct uffdio_writeprotect __user *user_uffdio_wp;
> > > > > + struct userfaultfd_wake_range range;
> > > > > +
> > > >
> > > > In the non-cooperative mode the userfaultfd_writeprotect() may race with VM
> > > > layout changes, pretty much as uffdio_copy() [1]. My solution for uffdio_copy()
> > > > was to return -EAGAIN if such race is encountered. I think the same would
> > > > apply here.
> > >
> > > I tried to understand the problem at [1] but failed... could you help
> > > to clarify it a bit more?
> > >
> > > I'm quoting some of the discussions from [1] here directly between you
> > > and Pavel:
> > >
> > > > Since the monitor cannot assume that the process will access all its memory
> > > > it has to copy some pages "in the background". A simple monitor may look
> > > > like:
> > > >
> > > > > 	for (;;) {
> > > > > 		wait_for_uffd_events(timeout);
> > > > > 		handle_uffd_events();
> > > > > 		uffd_copy(some not faulted pages);
> > > > > 	}
> > > >
> > > > Then, if the "background" uffd_copy() races with fork, the pages we've
> > > > copied may be already present in parent's mappings before the call to
> > > > copy_page_range() and may be not.
> > > >
> > > > If the pages were not present, uffd_copy'ing them again to the child's
> > > > memory would be ok.
> > > >
> > > > But if uffd_copy() was first to catch mmap_sem, and we would uffd_copy them
> > > > again, child process will get memory corruption.
> > >
> > > Here I don't understand why the child process will get memory
> > > corruption if uffd_copy() caught the mmap_sem first.
> > >
> > > If it did it, then IMHO when uffd_copy() copies the page again it'll
> > > simply get a -EEXIST showing that the page has already been copied.
> > > > Could you explain why there would be data corruption?
> >
> > Let's say we do post-copy migration of a process A with CRIU and its page at
> > address 0x1000 is already copied. Now it modifies the contents of this
> > > page. At this point the contents of the page at 0x1000 are different on the
> > source and the destination.
> > Next, process A forks process B. The CRIU's uffd monitor gets
> > UFFD_EVENT_FORK, and starts filling process B memory with UFFDIO_COPY.
> > It may happen, that UFFDIO_COPY to 0x1000 of the process B will occur
>
> I think this is the place I started to get confused...
>
> The mmap copy phase and the FORK event path is in dup_mmap() as
> mentioned in the patch too:
>
>   dup_mmap()
>      down_write(old_mm)
>      down_write(new_mm)
>      foreach(vma)
>         copy_page_range()            (a)
>      up_write(new_mm)
>      up_write(old_mm)
>      dup_userfaultfd_complete()      (b)
>
> Here if we already received UFFD_EVENT_FORK and started to copy pages
> to process B in the background, then we should have at least passed
> (b) above since otherwise we won't even know the existence of process
> B. However if so, we should have already passed the point to copy
> data at (a) too, then how could copy_page_range() race? It seems that
> I might have missed something important out there but it's not easy
> for me to figure out myself...

Apparently, I confused myself as well...
I clearly remember that there was a problem with fork(), but the sequence
that causes it keeps evading me :(

Anyway, some means of synchronization between uffd_copy and the
non-cooperative events is required. Take, for example, MADV_DONTNEED. When
it races with uffdio_copy(), a process may end up reading non-zero values
right after the MADV_DONTNEED call.

uffd monitor           | process
-----------------------+-------------------------------------------
uffdio_copy(0x1000)    | madvise(MADV_DONTNEED, 0x1000)
                       |    down_read(mmap_sem)
                       |    zap_pte_range(0x1000)
                       |    up_read(mmap_sem)
   down_read(mmap_sem) |
   copy()              |
   up_read(mmap_sem)   |
                       | read(0x1000) != 0

Similar issues happen with mremap() and munmap().
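
On the monitor side, the -EAGAIN approach from [1] could be consumed
roughly as follows. This is a sketch under assumptions:
copy_with_retry(), the callback types and the fake_* stand-ins are
hypothetical names. In a real monitor the copy callback would wrap the
UFFDIO_COPY ioctl and the drain callback would read and handle the queued
non-cooperative events from the uffd.

```c
#include <errno.h>

/* Hypothetical callback types; both return 0 or a negative errno value. */
typedef int (*copy_fn)(void *ctx);	/* one background copy attempt */
typedef int (*drain_fn)(void *ctx);	/* handle pending non-cooperative events */

/*
 * Retry a background copy while it reports -EAGAIN, i.e. while the
 * address space layout is changing underneath it. Pending events are
 * processed before every retry so the monitor's view is up to date.
 */
static int copy_with_retry(copy_fn copy, drain_fn drain, void *ctx,
			   int max_tries)
{
	int ret = -EAGAIN;
	int i;

	for (i = 0; i < max_tries; i++) {
		ret = copy(ctx);
		if (ret != -EAGAIN)
			return ret;	/* success or a hard error */

		ret = drain(ctx);
		if (ret < 0)
			return ret;
		ret = -EAGAIN;		/* still unresolved if we run out of tries */
	}
	return ret;
}

/* Stand-ins for illustration only: fail with -EAGAIN a few times. */
struct fake_state {
	int eagain_left;	/* how many attempts should still report -EAGAIN */
	int drained;		/* how many times events were drained */
};

static int fake_copy(void *ctx)
{
	struct fake_state *st = ctx;

	if (st->eagain_left > 0) {
		st->eagain_left--;
		return -EAGAIN;
	}
	return 0;
}

static int fake_drain(void *ctx)
{
	struct fake_state *st = ctx;

	st->drained++;
	return 0;
}
```

The point of draining before the retry is that the monitor learns about
the MADV_DONTNEED, mremap() or munmap() first and can decide whether the
copy still makes sense at all.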
> Thanks,
>
> > *before* fork() completes and it may race with copy_page_range().
> > If UFFDIO_COPY wins the race, it will fill the page with the contents from
> > the source, although the correct data is what process A set in that page.
> >
> > Hope it helps.
>
> > > >
> > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df2cc96e77011cf7989208b206da9817e0321028
> > > >
>
> --
> Peter Xu
>
--
Sincerely yours,
Mike.