Here is a proposed design document for supporting mapping GPU VRAM
and/or file-backed memory into other domains.  It's not in the form of
a patch because the leading + characters would just make it harder to
read for no particular gain, and because this is still RFC right now.
Once it is ready to merge, I'll send a proper patch.  Nevertheless,
you can consider this to be

Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>

This approach is very different from the "frontend-allocates"
approach used elsewhere in Xen.  It is very much Linux-centric,
rather than Xen-centric.  In fact, MMU notifiers were invented for
KVM, and this approach is exactly the same as the one KVM implements.
However, to the best of my understanding, the design described here is
the only viable one.  Linux MM and GPU drivers require it, and changes
to either to relax this requirement will not be accepted upstream.
---
# Memory lending: Mapping pageable memory, such as GPU VRAM, from one Xen domain into another

## Background

Some Linux kernel subsystems require full control over certain memory
regions.  This includes the ability to handle page faults from any
entity accessing this memory.  Such entities include not only that
kernel's userspace, but also kernels belonging to other guests.

For instance, GPU drivers reserve the right to migrate data between
VRAM and system RAM at any time.  Furthermore, there is a set of
page tables between the "aperture" (mapped as a PCI BAR) and the
actual VRAM.  This means that the GPU driver can make the memory
temporarily inaccessible to the CPU.  This is in fact _required_
when resizable BAR is not supported, as otherwise there is too much
VRAM to expose it all via a single BAR.

Since the backing storage of this memory must be movable, pinning
it is not supported.  However, the existing grant table interface
requires pinned memory.  Therefore, such memory currently cannot be
shared with another guest.  As a result, implementing virtio-GPU blob
objects is not possible.  Since blob objects are a prerequisite for
both Venus and native contexts, supporting Vulkan via virtio-GPU on
Xen is also impossible.

Direct Access to Differentiated Memory (DAX) also relies on non-pinned
memory.  In the (now rare) case of persistent memory, it is because
the filesystem may need to move data blocks around on disk.  In the
case of virtio-pmem and virtio-fs, it is because page faults on write
operations are used to inform filesystems that they need to write the
data back at some point.  Without these page faults, filesystems will
not write back the data and silent data loss will result.

There are other use-cases for this too.  For instance, virtio-GPU
cross-domain Wayland exposes host shared memory buffers to the guest.
These buffers are mmap()'d file descriptors provided by the Wayland
compositor, and as such are not guaranteed to be anonymous memory.
Using grant tables for such mappings would conflict with the design
of existing virtio-GPU implementations, which assume that GPU VRAM
and shared memory can be handled uniformly.

Additionally, this is needed to support paging guest memory out to the
host's disks.  While this is significantly less efficient than using
an in-guest balloon driver, it has the advantage of not requiring
guest cooperation.  Therefore, it can be useful for situations in
which the performance of a guest is irrelevant, but where saving the
guest isn't appropriate.

## Informing drivers that they must stop using memory: MMU notifiers

Kernel drivers, such as xen_privcmd, in the same domain that has
the GPU (the "host") may map GPU memory buffers.  However, they must
register an *MMU notifier*.  This is a callback that Linux core memory
management code ("MM") uses to tell the driver that it must stop
all accesses to the memory.  Once the memory is no longer accessed,
Linux assumes it can do whatever it wants with this memory:

- The GPU driver can move it from VRAM to system RAM or vice versa,
  move it within VRAM or system RAM, or make it temporarily
  inaccessible so that other VRAM can be accessed.
- MM can swap the page out to disk/zram/etc.
- MM can move the page in system RAM to create huge pages.
- MM can write the pages out to their backing files and then free them.
- Anything else in Linux can do whatever it wants with the memory.

Suspending access to memory is not allowed to block indefinitely.
It can sleep, but it must finish in finite time regardless of what
userspace (or other VMs) do.  Otherwise, bad things (which I believe
include deadlocks) may result.  I believe it can fail temporarily,
but permanent failure is also not allowed.  Once the MMU notifier
has succeeded, userspace or other domains **must not be allowed to
access the memory**.  Allowing such access would be an exploitable
use-after-free vulnerability.

Due to these requirements, MMU notifier callbacks must not require
cooperation from other guests.  This means that they are not allowed to
wait for memory that has been granted to another guest to no longer
be mapped by that guest.  Therefore, MMU notifiers and the use of
grant tables are inherently incompatible.

## Memory lending: A different approach

Instead, xen_privcmd must use a different hypercall to _lend_ memory to
another domain (the "guest").  When MM triggers the MMU notifier,
xen_privcmd _tells_ Xen (via hypercall) to revoke the guest's access
to the memory.  This hypercall _must succeed in bounded time_ even
if the guest is malicious.

Since the other guests are not aware this has happened, they will
continue to access the memory.  This will cause p2m faults, which
trap to Xen.  Xen normally kills the guest in this situation, which is
obviously not desired behavior.  Instead, Xen must pause the guest
and inform the host's kernel.  xen_privcmd will have registered a
handler for such events, so it will be informed when this happens.

When xen_privcmd is told that a guest wants to access the revoked
page, it will ask core MM to make the page available.  Once the page
_is_ available, core MM will inform xen_privcmd, which will in turn
provide a page to Xen that will be mapped into the guest's stage 2
translation tables.  This page will generally be different than the
one that was originally lent.
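
The revoke / fault / refill cycle described above can be sketched as
a small state machine.  This is only an illustrative userspace model,
not Xen or Linux code: every name in it (`lend_state`, `lent_frame`,
`lend_revoke`, `lend_guest_access`, `lend_refill`) is hypothetical,
and the real hypercall interface is not yet defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical states of one lent guest frame (not a real p2m type). */
enum lend_state {
    LEND_MAPPED,   /* guest can access the backing page */
    LEND_REVOKED,  /* access revoked; the next guest access faults */
    LEND_FAULTED,  /* guest is paused, host was asked for a page */
};

struct lent_frame {
    enum lend_state state;
    uint64_t backing_page;  /* host page backing the frame, 0 if none */
    bool vcpu_paused;
};

/*
 * Host side, called from the MMU notifier path.  It only changes
 * state owned by the model of Xen, so it never waits for the guest.
 */
static void lend_revoke(struct lent_frame *f)
{
    if (f->state == LEND_MAPPED)
        f->state = LEND_REVOKED;
    f->backing_page = 0;
}

/*
 * Xen side: the guest touched the frame.  Returns true if the access
 * may proceed.  On a revoked frame the guest is paused (not killed)
 * and the host is expected to be notified.
 */
static bool lend_guest_access(struct lent_frame *f)
{
    if (f->state == LEND_MAPPED)
        return true;
    f->state = LEND_FAULTED;
    f->vcpu_paused = true;
    return false;
}

/*
 * Host side: core MM made a page available.  The new page may differ
 * from the one that was originally lent.
 */
static void lend_refill(struct lent_frame *f, uint64_t new_page)
{
    assert(new_page != 0);
    f->backing_page = new_page;
    f->state = LEND_MAPPED;
    f->vcpu_paused = false;
}
```

Note that `lend_revoke()` has no failure path and no loop that depends
on the guest, which is the property the MMU notifier needs.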

Requesting a new page can fail.  This is usually due to rare errors,
such as a GPU being hot-unplugged or an I/O error while faulting pages
in from disk.  In these cases, the old content of the page is lost.

When this happens, xen_privcmd can do one of two things:

1. It can provide a page that is filled with zeros.
2. It can tell Xen that it is unable to fulfill the request.

Which choice it makes is under userspace control.  If userspace
chooses the second option, Xen injects a fault into the guest.
It is up to the guest to handle the fault correctly.
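
A minimal sketch of this decision, assuming a per-mapping policy that
userspace sets in advance.  The names `refill_fail_policy`,
`refill_outcome`, `handle_refill_failure` and `SKETCH_PAGE_SIZE` are
hypothetical and exist only for illustration; they are not part of
any existing interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096  /* assumed page size for this model */

/* Policy chosen by userspace for a failed page request. */
enum refill_fail_policy {
    REFILL_ZERO_FILL,     /* option 1: hand the guest a zeroed page */
    REFILL_INJECT_FAULT,  /* option 2: tell Xen the request failed */
};

/* What the (modelled) driver ends up reporting to Xen. */
enum refill_outcome {
    OUTCOME_ZERO_PAGE,
    OUTCOME_FAULT_INJECTED,
};

/*
 * Called only after requesting a new page has failed, so the old
 * content is already lost.  `replacement` is a spare page that is
 * used only if the policy asks for zero fill.
 */
static enum refill_outcome
handle_refill_failure(enum refill_fail_policy policy,
                      unsigned char *replacement)
{
    assert(replacement != NULL);
    if (policy == REFILL_ZERO_FILL) {
        memset(replacement, 0, SKETCH_PAGE_SIZE);
        return OUTCOME_ZERO_PAGE;
    }
    /* The replacement page is left untouched; the guest sees a fault. */
    return OUTCOME_FAULT_INJECTED;
}
```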

## Restrictions on lent memory

Lent memory is still considered to belong to the lending domain.
The borrowing domain can only access it via its p2m.  Hypercalls made
by the borrowing domain act as if the borrowed memory were not present.
This includes, but is not limited to:

- Using pointers to borrowed memory in hypercall arguments.
- Granting borrowed memory to other VMs.
- Any other operation that depends on whether a page is accessible
  by a domain.

Furthermore:

- Borrowed memory isn't mapped into the IOMMU of any PCIe devices
  the guest has attached, because IOTLB faults generally are not
  replayable.

- Foreign mapping hypercalls that reference lent memory will fail.
  Otherwise, the domain making the foreign mapping hypercall could
  continue to access the borrowed memory after the lease had been
  revoked.  This is true even if the domain performing the foreign
  mapping is an all-powerful dom0.  Otherwise, an emulated device
  could access memory whose lease had been revoked.
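
The restrictions above amount to a single rule: borrowed memory is
reachable only through the borrower's own CPU accesses via its p2m.
The following is a simplified model of that rule, not Xen code.  The
types `frame_kind`, `access_path` and `sketch_frame`, and the function
`access_allowed`, are hypothetical, and the model deliberately ignores
the privilege checks that apply to ordinary owned memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* How a guest frame is backed, from the borrowing domain's view. */
enum frame_kind {
    FRAME_OWNED,     /* ordinary memory belonging to the domain */
    FRAME_BORROWED,  /* lent by another domain, revocable */
    FRAME_ABSENT,    /* nothing mapped at this frame */
};

/* The path through which the frame is being reached. */
enum access_path {
    ACCESS_CPU_P2M,        /* guest load/store via stage 2 tables */
    ACCESS_HYPERCALL_ARG,  /* pointer passed as a hypercall argument */
    ACCESS_GRANT,          /* granting the frame to another VM */
    ACCESS_FOREIGN_MAP,    /* foreign mapping, even by dom0 */
    ACCESS_IOMMU,          /* DMA from a passed-through PCIe device */
};

struct sketch_frame {
    enum frame_kind kind;
};

/*
 * Returns true if the access may proceed.  For every path other than
 * a direct CPU access, a borrowed frame behaves as if it were absent.
 */
static bool access_allowed(const struct sketch_frame *f,
                           enum access_path path)
{
    assert(f != NULL);
    switch (f->kind) {
    case FRAME_OWNED:
        return true;  /* simplification: normal checks not modelled */
    case FRAME_BORROWED:
        return path == ACCESS_CPU_P2M;
    case FRAME_ABSENT:
    default:
        return false;
    }
}
```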

This also means that live migration of a domain that has borrowed
memory requires cooperation from the lending domain.  For now, it
will be considered out of scope.  Live migration is typically used
with server workloads, and accelerators for server hardware often
support SR-IOV.

## Where will lent memory appear in a guest's address space?

Typically, lent memory will be an emulated PCI BAR.  It may be emulated
by dom0 or an alternate ioreq server.  However, it is not *required*
to be a PCI BAR.

## Privileges required for memory lending

For obvious reasons, the domain lending the memory must be privileged
over the domain borrowing it.  The lending domain does not inherently
need to be privileged over the whole system.  However, supporting
situations where the lending domain is not dom0 will require
extensions to Xen's permission model, except for the case where the
lending domain only serves a single VM.

Memory lending hypercalls are not subject to the restrictions of
XSA-77.  They may safely be delegated to VMs other than dom0.

## Userspace API

To the extent possible, the memory lending API should be similar
to KVM's uAPI.  Ideally, userspace should be able to abstract over
the differences.  Using the API should not require root privileges
or be equivalent to root on the host.  It should only require a file
descriptor that only allows controlling a single domain.

## Future directions: Creating & running Xen VMs without special privileges

With the exception of a single page used for hypercalls, it is
possible for a Xen domain to *only* have borrowed memory.  Such a
domain can be managed by an entirely unprivileged userspace process,
just like it would manage a KVM VM.  Since the "host" in this scenario
only needs privilege over a domain it itself created, it is possible
(once a subset of XSA-77 restrictions is lifted) for this domain
to not actually be dom0.

Even with XSA-77, the domain could still request dom0 to create and
destroy the domain on its behalf.  Qubes OS already allows unprivileged
guests to cause domain creation and destruction, so this does not
introduce any new Xen attack surface.

This could allow unprivileged processes in a domU to create and manage
sub-domUs, just as if the domU had nested virtualization support and
KVM was used.  However, this should provide significantly better
performance than nested virtualization.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)