From: Kirill Smelkov <kirr@nexedi.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
qemu-devel@nongnu.org, kvm@vger.kernel.org,
linux-api@vger.kernel.org, Pavel Emelyanov <xemul@parallels.com>,
Sanidhya Kashyap <sanidhya.gatech@gmail.com>,
zhang.zhanghailiang@huawei.com,
Linus Torvalds <torvalds@linux-foundation.org>,
"Kirill A. Shutemov" <kirill@shutemov.name>,
Andres Lagar-Cavilla <andreslc@google.com>,
Dave Hansen <dave.hansen@intel.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Rik van Riel <riel@redhat.com>, Mel Gorman <mgorman@suse.de>,
Andy Lutomirski <luto@amacapital.net>,
Hugh Dickins <hughd@google.com>,
Peter Feiner <pfeiner@google.com>,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
"Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Subject: Re: [PATCH 00/23] userfaultfd v4
Date: Thu, 21 May 2015 16:11:11 +0300
Message-ID: <20150521131111.GA8932@teco.navytux.spb.ru>
In-Reply-To: <1431624680-20153-1-git-send-email-aarcange@redhat.com>
Hello up there,
On Thu, May 14, 2015 at 07:30:57PM +0200, Andrea Arcangeli wrote:
> Hello everyone,
>
> This is the latest userfaultfd patchset against mm-v4.1-rc3
> 2015-05-14-10:04.
>
> The postcopy live migration feature on the qemu side is mostly ready
> to be merged and it entirely depends on the userfaultfd syscall to be
> merged as well. So it'd be great if this patchset could be reviewed
> for merging in -mm.
>
> Userfaults allow to implement on demand paging from userland and more
> generally they allow userland to more efficiently take control of the
> behavior of page faults than what was available before
> (PROT_NONE + SIGSEGV trap).
>
> The use cases are:
[...]
> Even though there wasn't a real use case requesting it yet, it also
> allows to implement distributed shared memory in a way that readonly
> shared mappings can exist simultaneously in different hosts and they
> can be become exclusive at the first wrprotect fault.

Sorry for maybe speaking up too late, but here is an additional real
potential use case which, in my view, overlaps with the above:

Recently we needed to implement persistence for NumPy arrays, that is,
to track changes made to array memory and transactionally either abandon
those changes on transaction abort, or store them back to storage on
transaction commit.

Since arrays can be large, it would be slow and thus not practical to
keep a copy of the original data and compare memory against it to find
which parts of the array have changed.

So I've implemented a scheme where array data is initially PROT_READ
protected. We then catch SIGSEGV, and if the fault is a write and the
address belongs to array data, we mark that page as PROT_WRITE and
continue. At commit time we know which parts were modified.
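
A minimal sketch of the write-tracking scheme described above, for Linux. It is not taken from wendelin.core: the names wt_demo, wt_on_segv and wt_dirty, the 4-page anonymous area and the error handling are all assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define WT_NPAGES 4                      /* size of the tracked area, in pages */

static unsigned char *wt_base;           /* start of the tracked area */
static size_t wt_pagesize;
static volatile sig_atomic_t wt_dirty[WT_NPAGES];  /* 1 = page was written */

/* SIGSEGV handler. The area is readable, so a fault inside it can only be
 * a write: remember the page, make it writable, and return so that the
 * faulting store is restarted. */
static void wt_on_segv(int sig, siginfo_t *si, void *uctx)
{
    (void)sig; (void)uctx;
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)wt_base;

    if (addr < base || addr >= base + WT_NPAGES * wt_pagesize)
        _exit(1);        /* not ours; a real handler would chain to the old one */

    size_t idx = (addr - base) / wt_pagesize;
    wt_dirty[idx] = 1;
    /* mprotect() is not on the POSIX async-signal-safe list; calling it
     * here is common practice on Linux but is an assumption of this sketch. */
    if (mprotect(wt_base + idx * wt_pagesize, wt_pagesize,
                 PROT_READ | PROT_WRITE) != 0)
        _exit(2);
}

/* Runs one "transaction"; returns the number of dirty pages, -1 on error. */
int wt_demo(void)
{
    struct sigaction sa, old;
    int ndirty = 0;

    wt_pagesize = (size_t)sysconf(_SC_PAGESIZE);
    void *m = mmap(NULL, WT_NPAGES * wt_pagesize, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return -1;
    wt_base = m;
    assert((uintptr_t)wt_base % wt_pagesize == 0);
    for (int i = 0; i < WT_NPAGES; i++)
        wt_dirty[i] = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = wt_on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(wt_base, WT_NPAGES * wt_pagesize);
        return -1;
    }

    volatile unsigned char *p = wt_base;
    (void)p[0];                      /* read: allowed, no fault        */
    p[1 * wt_pagesize + 5] = 42;     /* first write to page 1: fault   */
    p[3 * wt_pagesize] = 7;          /* first write to page 3: fault   */
    p[3 * wt_pagesize + 1] = 8;      /* page 3 already writable        */

    for (int i = 0; i < WT_NPAGES; i++)   /* "commit": collect dirty pages */
        ndirty += wt_dirty[i];

    sigaction(SIGSEGV, &old, NULL);
    munmap(wt_base, WT_NPAGES * wt_pagesize);
    return ndirty;
}
```

Calling wt_demo() from a main() should report two dirty pages (1 and 3). Each page costs one SIGSEGV round trip and one mprotect() call on its first write.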

Also, since arrays can be large, even bigger than RAM, and only sparse
parts of them may be needed to get the required information, for reading
it also makes sense to load data lazily in the SIGSEGV handler, starting
from PROT_NONE protection.
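
The lazy-loading half can be sketched the same way, again for Linux and again not taken from wendelin.core. ll_load_page is a hypothetical stand-in for reading a page from storage; here it just fills the page with a byte derived from the page index.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LL_NPAGES 8                      /* size of the mapped "file", in pages */

static unsigned char *ll_base;
static size_t ll_pagesize;
static volatile sig_atomic_t ll_loaded;  /* number of pages loaded so far */

/* Hypothetical storage backend: page idx is filled with the byte idx+1. */
static void ll_load_page(size_t idx, unsigned char *dst)
{
    memset(dst, (int)(idx + 1), ll_pagesize);
}

/* SIGSEGV handler: on first access to a PROT_NONE page, load its content
 * and leave the page readable (PROT_NONE -> PROT_READ). */
static void ll_on_segv(int sig, siginfo_t *si, void *uctx)
{
    (void)sig; (void)uctx;
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)ll_base;

    if (addr < base || addr >= base + LL_NPAGES * ll_pagesize)
        _exit(1);                        /* not ours */

    size_t idx = (addr - base) / ll_pagesize;
    unsigned char *page = ll_base + idx * ll_pagesize;

    /* The page must be writable while it is being filled. In a threaded
     * program other threads could observe it half-loaded in this window. */
    if (mprotect(page, ll_pagesize, PROT_READ | PROT_WRITE) != 0)
        _exit(2);
    ll_load_page(idx, page);
    if (mprotect(page, ll_pagesize, PROT_READ) != 0)
        _exit(2);
    ll_loaded = ll_loaded + 1;
}

/* Touches two pages; returns how many pages were loaded, -1 on error. */
int ll_demo(void)
{
    struct sigaction sa, old;

    ll_pagesize = (size_t)sysconf(_SC_PAGESIZE);
    void *m = mmap(NULL, LL_NPAGES * ll_pagesize, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return -1;
    ll_base = m;
    ll_loaded = 0;
    assert((uintptr_t)ll_base % ll_pagesize == 0);

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = ll_on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(ll_base, LL_NPAGES * ll_pagesize);
        return -1;
    }

    volatile unsigned char *p = ll_base;
    int ok = (p[2 * ll_pagesize] == 3)          /* loads page 2 */
          && (p[5 * ll_pagesize + 100] == 6)    /* loads page 5 */
          && (p[2 * ll_pagesize + 1] == 3);     /* already loaded */
    int loaded = ll_loaded;

    sigaction(SIGSEGV, &old, NULL);
    munmap(ll_base, LL_NPAGES * ll_pagesize);
    return ok ? loaded : -1;
}
```

Only the two touched pages are ever populated; the other six stay PROT_NONE and consume no RAM. The temporary read-write window in the handler is one place where an atomic copy-in primitive, such as the UFFDIO_COPY proposed in this series, looks attractive.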

This is very similar to how memory-mapped files work, but adds
transactionality which, as far as I know, is not provided by any
filesystem currently in the Linux kernel.

The system is implemented at the level of files, and arrays are then
built on top of files memory-mapped this way. So from now on we can
forget about NumPy arrays and only talk about files, their mapping, lazy
loading, and transactionally storing in-memory changes back to file
storage.

To get this working, I rolled a custom user-space virtual memory
manager, which manages RAM "pages" and file mappings into the virtual
address space, tracks page protection, and handles SIGSEGV
appropriately.

The gist of the virtual memory manager is here:

    https://lab.nexedi.cn/kirr/wendelin.core/blob/master/include/wendelin/bigfile/virtmem.h
    https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c (vma_on_pagefault)

For its operation it currently needs:

- establishing virtual memory areas and hooking them up for tracking

- changing page protection:

    PROT_NONE or absent                              - initially
    PROT_NONE      -> PROT_READ                      - after read
    PROT_READ      -> PROT_READWRITE                 - after write
    PROT_READWRITE -> PROT_READ                      - after commit
    PROT_READWRITE -> PROT_NONE or absent (again)    - after abort
    PROT_READ      -> PROT_NONE or absent (again)    - on reclaim

- working with aliasable memory (thus taken from tmpfs):

    there can be two mappings of a file (array) that overlap in the
    file, requested at different times, and changes made through one
    mapping should propagate to the other -> for the common part, a
    single page should be memory-mapped into 2 places in the address
    space.
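
The aliasing requirement in the last item can be illustrated with a small Linux sketch. It assumes memfd_create(2) as the source of shmem-backed RAM; that is an illustrative choice, and wendelin.core may obtain its tmpfs pages differently. The name alias_demo and the 3-page layout are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/* Two mappings that overlap in the file: A covers file pages 0-1,
 * B covers file pages 1-2. File page 1 is therefore visible at two
 * different virtual addresses. Returns 1 if writes propagate both ways,
 * 0 if they do not, -1 on error. */
int alias_demo(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);

    int fd = memfd_create("alias-demo", 0);   /* anonymous RAM-backed file */
    if (fd < 0)
        return -1;
    if (ftruncate(fd, 3 * ps) != 0) {
        close(fd);
        return -1;
    }

    /* MAP_SHARED is what makes both mappings refer to the same page. */
    void *ma = mmap(NULL, 2 * ps, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void *mb = mmap(NULL, 2 * ps, PROT_READ | PROT_WRITE, MAP_SHARED, fd, ps);
    close(fd);                                /* mappings keep the file alive */
    if (ma == MAP_FAILED || mb == MAP_FAILED) {
        if (ma != MAP_FAILED) munmap(ma, 2 * ps);
        if (mb != MAP_FAILED) munmap(mb, 2 * ps);
        return -1;
    }

    volatile unsigned char *a = ma;
    volatile unsigned char *b = mb;

    a[ps + 10] = 0x5a;                 /* write file page 1 through A ...  */
    int ok = (b[10] == 0x5a);          /* ... and see it through B         */

    b[11] = 0x33;                      /* write file page 1 through B ...  */
    ok = ok && (a[ps + 11] == 0x33);   /* ... and see it through A         */

    munmap(ma, 2 * ps);
    munmap(mb, 2 * ps);
    return ok ? 1 : 0;
}
```

Because the shared page has a single physical copy, no explicit synchronisation between the two mappings is needed; the cost is that the memory has to come from a shmem/tmpfs file rather than from private anonymous memory.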

So what is currently lacking on the userfaultfd side is:

- the ability to remove already mapped pages or make them PROT_NONE
  (UFFDIO_REMAP was recently dropped)

- the ability to arbitrarily change page protection (e.g. RW -> R)

- the ability to inject aliasable memory from tmpfs (or better,
  hugetlbfs) into several places (UFFDIO_REMAP + some mapping-copy
  semantic).

The code is ugly because it is only a prototype. You can clone/read it
all from here:

    https://lab.nexedi.cn/kirr/wendelin.core

The virtual memory manager even has tests, and from them one can see
how the system is supposed to work (after each access: which pages are
mapped, where, and how):

    https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/tests/test_virtmem.c

The performance is currently not great, partly because of page clearing
when getting RAM from tmpfs, and partly because of mprotect/SIGSEGV/vma
overhead and other dumb things on my side.

I still wanted to show the case, as userfaultfd has the potential here
to remove the kernel-related overhead.

Thanks beforehand for feedback,
Kirill

P.S. Some context:

    http://www.wendelin.io/NXD-Wendelin.Core.Non.Secret/asEntireHTML