From: Shachar Raindel <raindel@mellanox.com>
To: Michel Lespinasse <walken@google.com>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
Andrea Arcangeli <aarcange@redhat.com>,
Roland Dreier <roland@purestorage.com>,
Haggai Eran <haggaie@mellanox.com>,
Or Gerlitz <ogerlitz@mellanox.com>,
Sagi Grimberg <sagig@mellanox.com>,
Liran Liss <liranl@mellanox.com>
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
Date: Sun, 10 Feb 2013 09:54:57 +0200 [thread overview]
Message-ID: <51175251.3040209@mellanox.com> (raw)
In-Reply-To: <CANN689Ff6vSu4ZvHek4J4EMzFG7EjF-Ej48hJKV_4SrLoj+mCA@mail.gmail.com>
On 2/9/2013 8:05 AM, Michel Lespinasse wrote:
> On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel <raindel@mellanox.com> wrote:
>> Hi,
>>
>> We would like to present a reference implementation for safely sharing
>> memory pages from user space with the hardware, without pinning.
>>
>> We will be happy to hear the community feedback on our prototype
>> implementation, and suggestions for future improvements.
>>
>> We would also like to discuss adding features to the core MM subsystem to
>> assist hardware access to user memory without pinning.
> This sounds kinda scary TBH; however I do understand the need for such
> technology.
The technological challenges here are actually rather similar to the ones
experienced by hypervisors that want to allow swapping of virtual machines.
As a result, we benefit greatly from the mmu notifiers implemented for KVM.
Reading the page table directly will be another level of challenge.
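For background, the hookup looks roughly like the sketch below. This is
only an illustration of the 2013-era notifier API, not our actual driver
code; all my_* names are made up:

    #include <linux/mmu_notifier.h>
    #include <linux/sched.h>

    /* Illustrative per-process driver context. */
    struct my_dev_ctx {
            struct mmu_notifier mn;
            /* ... device address translation table state ... */
    };

    /*
     * The MM invokes this before invalidating [start, end) of the
     * address space (munmap, swap-out, ...).  The driver must make
     * sure the hardware stops using the stale translations before
     * this returns.
     */
    static void my_invalidate_range_start(struct mmu_notifier *mn,
                                          struct mm_struct *mm,
                                          unsigned long start,
                                          unsigned long end)
    {
            struct my_dev_ctx *ctx =
                    container_of(mn, struct my_dev_ctx, mn);

            /* ... tell the device to drop translations for the range ... */
    }

    static const struct mmu_notifier_ops my_mn_ops = {
            .invalidate_range_start = my_invalidate_range_start,
    };

    /* Called when the process opens the device: */
    static int my_ctx_init(struct my_dev_ctx *ctx)
    {
            ctx->mn.ops = &my_mn_ops;
            return mmu_notifier_register(&ctx->mn, current->mm);
    }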
> I think one issue is that many MM developers are insufficiently aware
> of such developments; having a technology presentation would probably
> help there; but traditionally LSF/MM sessions are more interactive
> between developers who are already quite familiar with the technology.
> I think it would help if you could send in advance a detailed
> presentation of the problem and the proposed solutions (and then what
> they require of the MM layer) so people can be better prepared.
We hope to send out an RFC patch-set of the feature implementation for our
hardware soon, which might help to demonstrate a use case for the technology.
The current programming model for InfiniBand (and related network
protocols - RoCE, iWARP) relies on the user space program registering
memory regions for use with the hardware. Upon registration, the driver
pins the memory area (via get_user_pages), updates a mapping table in the
hardware, and provides the user application with a handle for the mapping.
The user space application then uses this handle to request the hardware
to access this area for network IO.
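For example, with libibverbs the registration step looks like this
(error handling omitted; assumes a protection domain pd was already
allocated with ibv_alloc_pd(), and buf/len are placeholders):

    #include <infiniband/verbs.h>

    /* buf/len: an ordinary allocation in the process address space. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    /*
     * mr->lkey / mr->rkey are the handles used in subsequent work
     * requests.  Under the hood the driver has called get_user_pages()
     * on the range, and the pages stay pinned until ibv_dereg_mr(mr).
     */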
While achieving unbeatable IO performance (round-trip latency, for user
space programs, of less than 2 microseconds, and bandwidth of 56
Gbit/second), this model is relatively hard to use:
- The need for explicit memory registration for each area makes the API
  rather complex to use. An ideal API would have a single handle per
  process that allows it to communicate with the hardware using the
  process virtual addresses.
- After a part of the address space has been registered, the application
  must be careful not to move the pages around. For example, doing a fork
  results in all of the memory registrations pointing to the wrong pages
  (which is very hard to debug). This was partially addressed at [1], but
  the cure is nearly as bad as the disease - when MADV_DONTFORK is used
  on the heap, a simple call to malloc in the child process might crash
  the process (see the madvise sketch after this list).
- Memory which was registered is not swappable. As a result, one cannot
  write applications that overcommit physical memory while using this
  API. Similarly to what Jerome described about GPU applications, for
  network access the application might want to use ~10% of its allocated
  memory space, but it is required to either pin all of the memory, use
  heuristics to predict what memory will be used, or perform expensive
  copying/pinning for every network transaction. All of these are
  suboptimal.
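The madvise sketch mentioned above - roughly how applications work around
the fork problem today, and why the workaround is fragile (buf/len are
placeholders):

    #include <sys/mman.h>

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* Keep these pages owned solely by the parent across fork(): */
    madvise(buf, len, MADV_DONTFORK);
    /* ... register buf with the hardware, e.g. ibv_reg_mr() ... */

    /*
     * The hazard: if MADV_DONTFORK is applied to heap pages instead of
     * a dedicated private mapping like the above, the child inherits a
     * hole in its heap where those pages were, and an innocent malloc()
     * that lands in the hole crashes the child on first touch.
     */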
> And first I'd like to ask, aren't IOMMUs supposed to already largely
> solve this problem ? (probably a dumb question, but that just tells
> you how much you need to explain :)
>
IOMMU v1 doesn't solve this problem, as it gives you only one mapping
table per PCI function. If you want ~64 processes on your machine to be
able to access the network, this is not nearly enough. It helps in
implementing PCI pass-through for virtualized guests (with the hardware
devices exposing several virtual PCI functions for the guests), but that
is still not enough for user space applications.
To some extent, IOMMU v1 might even be an obstacle to implementing such a
feature, as it prevents PCI devices from accessing parts of the memory,
requiring driver intervention for every page fault, even if the page is
resident in memory.
IOMMU v2 [2] is a step in the direction we are moving towards, offering
PASID - a unique identifier for each transaction that the device performs,
which allows associating the transaction with a specific process. However,
the challenges there are similar to those we encounter when using an
address translation table on the PCI device itself (NIC/GPU).
References:
1. MADV_DONTFORK - http://lwn.net/Articles/171956/
2. AMD IOMMU v2 -
http://www.linux-kvm.org/wiki/images/b/b1/2011-forum-amd-iommuv2-kvm.pdf