linux-mm.kvack.org archive mirror
From: Stefan Hajnoczi <stefanha@redhat.com>
To: "Andy Lutomirski" <luto@kernel.org>,
	"Adalbert Lazăr" <alazar@bitdefender.com>
Cc: "Christian Brauner" <christian.brauner@ubuntu.com>,
	Linux-MM <linux-mm@kvack.org>,
	"Linux API" <linux-api@vger.kernel.org>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Alexander Graf" <graf@amazon.com>,
	"Jerome Glisse" <jglisse@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Mihai Donțu" <mdontu@bitdefender.com>,
	"Mircea Cirjaliu" <mcirjaliu@bitdefender.com>,
	"Arnd Bergmann" <arnd@arndb.de>,
	"Sargun Dhillon" <sargun@sargun.me>,
	"Aleksa Sarai" <cyphar@cyphar.com>,
	"Oleg Nesterov" <oleg@redhat.com>, "Jann Horn" <jannh@google.com>,
	"Kees Cook" <keescook@chromium.org>,
	"Matthew Wilcox" <willy@infradead.org>
Subject: Re: [RESEND RFC PATCH 0/5] Remote mapping
Date: Wed, 9 Sep 2020 12:38:51 +0100
Message-ID: <20200909113851.GB15584@stefanha-x1.localdomain>
In-Reply-To: <CALCETrUSUp_7svg8EHNTk3nQ0x9sdzMCU=h8G-Sy6=SODq5GHg@mail.gmail.com>

On Mon, Sep 07, 2020 at 01:43:48PM -0700, Andy Lutomirski wrote:
> On Mon, Sep 7, 2020 at 8:05 AM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > > This patchset adds support for the remote mapping feature.
> > > Remote mapping, as its name suggests, is a means for transparent and
> > > zero-copy access of a remote process' address space.
> > >
> > > The feature was designed according to a specification suggested by
> > > Paolo Bonzini:
> > > >> The proposed API is a new pidfd system call, through which the parent
> > > >> can map portions of its virtual address space into a file descriptor
> > > >> and then pass that file descriptor to a child.
> > > >>
> > > >> This should be:
> > > >>
> > > >> - upstreamable, pidfd is the new cool thing and we could sell it as a
> > > >> better way to do PTRACE_{PEEK,POKE}DATA
> >
> > In all honesty, that sentence made me a bit uneasy as it reads like this
> > is implemented on top of pidfds because it makes it more likely to go
> > upstream not because it is the right design. To be clear, I'm not
> > implying any sort of malicious intent on your part but I would suggest
> > to phrase this a little better. :)
> 
> 
> I thought about this whole thing some more, and here are some thoughts.
> 
> First, I was nervous about two things.  One was faulting in pages from
> the wrong context.  (When a normal page fault or KVM faults in a page,
> the mm is loaded.  (In the KVM case, the mm is sort of not loaded when
> the actual fault happens, but the mm is loaded when the fault is
> handled, I think.  Maybe there are workqueues involved and I'm wrong.)
>  When a remote mapping faults in a page, the mm is *not* loaded.)
> This ought not to be a problem, though -- get_user_pages_remote() also
> faults in pages from a non-current mm, and that's at least supposed to
> work correctly, so maybe this is okay.
> 
> Second is recursion.  I think this is a genuine problem.
> 
> And I think that tying this to pidfds is the wrong approach.  In fact,
> tying it to processes at all seems wrong.  There is a lot of demand
> for various forms of memory isolation in which memory is mapped only
> by its intended user.  Using something tied to a process mm gets in
> the way of this in the same way that KVM's current mapping model gets
> in the way.
> 
> All that being said, I think the whole idea of making fancy address
> spaces composed from other mappable objects is neat and possibly quite
> useful.  And, if you squint a bit, this is a lot like what KVM does
> today.
> 
> So I suggest something that may be more generally useful as an
> alternative.  This is a sketch and very subject to bikeshedding:
> 
> Create an empty address space:
> 
> int address_space_create(int flags, etc);
> 
> Map an fd into an address space:
> 
> int address_space_mmap(int asfd, int fd_to_map, offset, size, prot,
> ...);  /* might run out of args here */
> 
> Unmap from an address space:
> 
> int address_space_munmap(int asfd, unsigned long addr, unsigned long len);
> 
> Stick an address space into KVM:
> 
> ioctl(vmfd, KVM_MAP_ADDRESS_SPACE, asfd);  /* or similar */
> 
> Maybe some day allow mapping an address space into a process.
> 
> mmap(..., asfd, ...);
> 
> 
> And at least for now, there's a rule that an address space that is
> address_space_mmapped into an address space is disallowed.
> 
> 
> Maybe some day we also allow mremap(), madvise(), etc.  And maybe some
> day we allow creating a special address_space that represents a real
> process's address space.
> 
> 
> Under the hood, an address_space could own an mm_struct that is not
> used by any tasks.  And we could have special memfds that are bound to
> a VM such that all you can do with them is stick them into an
> address_space and map that address_space into the VM in question.  For
> this to work, we would want a special vm_operation for mapping into a
> VM.
> 
> 
> What do you all think?  Is this useful?  Does it solve your problems?
> Is it a good approach going forward?

Hi Adalbert and Andy,

As everyone continues to discuss how the mechanism should look, I want
to share two use cases for something like this. Let me know if you would
like more detail on these use cases.

The requirement in both cases is that process A can map a virtual
memory range from process B so that mmap/munmap operations within the
memory range in process B also affect process A.
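
To make the gap concrete, here is a minimal sketch of what plain fd
passing gives us today. It assumes Linux with memfd_create(2), and it
uses two MAP_SHARED mappings inside one process as stand-ins for the
views held by process B and process A. The function name and the memfd
name are made up for illustration. Data written through one view is
visible through the other, but munmap() of B's view leaves A's view
untouched, which is exactly the behaviour the requirement above wants
to change:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Two MAP_SHARED mappings of one memfd model the views that process B
 * (owner) and process A (consumer) would hold after fd passing.
 * Returns 0 if the data is shared but unmapping B's view leaves A's
 * view intact, 1 if that does not hold, -1 on setup failure.
 */
int shared_fd_views_are_independent(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int fd = memfd_create("guest-ram", MFD_CLOEXEC); /* name is arbitrary */
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }

    char *view_b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *view_a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (view_a == MAP_FAILED || view_b == MAP_FAILED) {
        if (view_a != MAP_FAILED)
            munmap(view_a, len);
        if (view_b != MAP_FAILED)
            munmap(view_b, len);
        close(fd);
        return -1;
    }

    /* Contents are shared: a store through B's view shows up in A's. */
    strcpy(view_b, "hello");
    int shared = strcmp(view_a, "hello") == 0;

    /* Mappings are not shared: B unmapping does not revoke A's access. */
    munmap(view_b, len);
    int still_mapped = strcmp(view_a, "hello") == 0;

    munmap(view_a, len);
    close(fd);
    return (shared && still_mapped) ? 0 : 1;
}
```

With a remote mapping mechanism, the second check is the one that would
be expected to behave differently.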

An enforcing vIOMMU for vhost-user and vfio-user
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
vhost-user, vfio-user, and other out-of-process device emulation
interfaces need a way for the virtual machine manager (VMM) to enforce
the vIOMMU mappings on the device emulation process. The VMM emulates
the vIOMMU and only wants to expose a subset of memory to the device
emulation process. This subset can change as the guest programs the
vIOMMU.

Today the VMM passes all guest RAM fds to the device emulation process
and has no way of restricting access or revoking it later.

The new mechanism would allow the VMM to add/remove mappings so that the
device emulation process can only access ranges of memory programmed by
the guest vIOMMU. Accesses to unmapped addresses would raise a signal.
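
As an illustration of the "unmapped access raises a signal" semantics,
the sketch below uses only existing POSIX calls inside a single
process. It is not the proposed remote mechanism. probe_read() and
revoke_demo() are hypothetical helper names, and munmap() stands in for
the VMM removing a vIOMMU mapping. The probe catches SIGSEGV (and
SIGBUS) and unwinds with siglongjmp() so the caller can observe the
fault instead of being killed:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf probe_env;

/* Fault handler: unwind back into probe_read() instead of dying. */
static void probe_fault(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);
}

/* Return 1 if one byte at addr is readable, 0 if the access faults. */
int probe_read(const volatile char *addr)
{
    struct sigaction sa, old_segv, old_bus;
    int readable;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = probe_fault;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);

    if (sigsetjmp(probe_env, 1) == 0) {
        (void)*addr;          /* volatile read, may fault */
        readable = 1;
    } else {
        readable = 0;         /* arrived here via siglongjmp() */
    }

    /* Restore whatever handlers were installed before the probe. */
    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    return readable;
}

/* Map a page, "revoke" it with munmap(), and probe before and after. */
int revoke_demo(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    int before = probe_read(p);
    if (munmap(p, len) != 0)
        return -1;
    int after = probe_read(p);

    /* Expect: readable while mapped, faulting once unmapped. */
    return (before == 1 && after == 0) ? 0 : 1;
}
```

A device emulation process would need a similar handler (or would
simply be allowed to crash) when it touches a range the guest has not
programmed into the vIOMMU.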

Accelerating the virtio-fs DAX window
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The virtiofsd vhost-user process handles guest file map/unmap messages.
The map/unmap messages allow the guest to map ranges of files into its
memory space. The guest kernel then uses DAX to access the file pages
without copying their contents into the guest page cache, and mmap
MAP_SHARED stays coherent when guests access the same file.

Today virtiofsd sends a message to the VMM over a UNIX domain socket
asking for an mmap/munmap. The VMM must perform the mapping on behalf of
virtiofsd. This communication and file descriptor passing is clumsy and
slow.

The new mechanism would allow virtiofsd to map/unmap without extra
coordination with the VMM. The VMM only needs to perform an initial mmap
of the DAX window so that kvm.ko can resolve page faults to that region.
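
For reference, the map/unmap operations on a DAX window can be sketched
with existing calls: reserve the window once as a PROT_NONE region,
place file ranges into it with MAP_FIXED, and remove them by putting
the PROT_NONE reservation back. This is only a sketch under
assumptions. The dax_window_*() names are hypothetical, a memfd stands
in for the file being shared, and it runs in one process, whereas the
point of the new mechanism is that virtiofsd could do the map/unmap
steps against a window owned by the VMM:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Reserve an inaccessible address range to act as the DAX window. */
void *dax_window_reserve(size_t size)
{
    return mmap(NULL, size, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

/* Map len bytes of fd at file_off into the window at win_off. */
int dax_window_map(void *window, size_t win_off, int fd,
                   off_t file_off, size_t len)
{
    void *p = mmap((char *)window + win_off, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, fd, file_off);
    return p == MAP_FAILED ? -1 : 0;
}

/* Drop a file mapping by putting the PROT_NONE reservation back. */
int dax_window_unmap(void *window, size_t win_off, size_t len)
{
    void *p = mmap((char *)window + win_off, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_FIXED,
                   -1, 0);
    return p == MAP_FAILED ? -1 : 0;
}

/* Map page 1 of a two-page file into slot 2 of a four-page window. */
int dax_window_demo(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    const char msg[] = "file data";
    int rc = 1;

    int fd = memfd_create("dax-file", MFD_CLOEXEC); /* stand-in for a file */
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)(2 * page)) < 0 ||
        pwrite(fd, msg, sizeof(msg), (off_t)page) != (ssize_t)sizeof(msg)) {
        close(fd);
        return -1;
    }

    char *window = dax_window_reserve(4 * page);
    if (window == MAP_FAILED) {
        close(fd);
        return -1;
    }

    if (dax_window_map(window, 2 * page, fd, (off_t)page, page) == 0) {
        /* The file contents are now visible inside the window. */
        int visible = memcmp(window + 2 * page, msg, sizeof(msg)) == 0;
        int unmapped = dax_window_unmap(window, 2 * page, page) == 0;
        rc = (visible && unmapped) ? 0 : 1;
    }

    munmap(window, 4 * page);
    close(fd);
    return rc;
}
```

Replacing the range with PROT_NONE rather than calling munmap() keeps
the window's address range reserved, so nothing else can be placed
there between an unmap and the next map.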

Stefan
