linux-mm.kvack.org archive mirror
From: Shachar Raindel <raindel@mellanox.com>
To: Michel Lespinasse <walken@google.com>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	Andrea Arcangeli <aarcange@redhat.com>,
	Roland Dreier <roland@purestorage.com>,
	Haggai Eran <haggaie@mellanox.com>,
	Or Gerlitz <ogerlitz@mellanox.com>,
	Sagi Grimberg <sagig@mellanox.com>,
	Liran Liss <liranl@mellanox.com>
Subject: Re: [LSF/MM TOPIC] Hardware initiated paging of user process pages, hardware access to the CPU page tables of user processes
Date: Sun, 10 Feb 2013 09:54:57 +0200	[thread overview]
Message-ID: <51175251.3040209@mellanox.com> (raw)
In-Reply-To: <CANN689Ff6vSu4ZvHek4J4EMzFG7EjF-Ej48hJKV_4SrLoj+mCA@mail.gmail.com>


On 2/9/2013 8:05 AM, Michel Lespinasse wrote:
> On Fri, Feb 8, 2013 at 3:18 AM, Shachar Raindel<raindel@mellanox.com>  wrote:
>> Hi,
>>
>> We would like to present a reference implementation for safely sharing
>> memory pages from user space with the hardware, without pinning.
>>
>> We will be happy to hear the community feedback on our prototype
>> implementation, and suggestions for future improvements.
>>
>> We would also like to discuss adding features to the core MM subsystem to
>> assist hardware access to user memory without pinning.
> This sounds kinda scary TBH; however I do understand the need for such
> technology.
The technological challenges here are actually rather similar to the ones
experienced by hypervisors that want to allow swapping of virtual machines.
As a result, we benefit greatly from the mmu notifiers implemented for KVM.
Reading the page table directly will be another level of challenge.
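
For concreteness, here is a minimal sketch of that direction, written against
the mmu notifier interface as it stands today. All my_* names are hypothetical
driver code for illustration, not our actual patches: the driver registers a
notifier on the process mm and invalidates the device mapping table from the
callback, instead of pinning the pages up front.

#include <linux/mmu_notifier.h>
#include <linux/mm.h>

struct my_umem {                       /* hypothetical per-registration state */
	struct mmu_notifier mn;
	/* ... handle to the device's address mapping table ... */
};

/* Hypothetical device operation: drop HW translations for [start, end). */
void my_hw_invalidate(struct my_umem *umem,
		      unsigned long start, unsigned long end);

static void my_invalidate_range_start(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start, unsigned long end)
{
	struct my_umem *umem = container_of(mn, struct my_umem, mn);

	/* Tell the hardware to stop using [start, end); a later device
	 * page fault re-resolves the pages through the driver. */
	my_hw_invalidate(umem, start, end);
}

static const struct mmu_notifier_ops my_mn_ops = {
	.invalidate_range_start = my_invalidate_range_start,
};

static int my_register(struct my_umem *umem, struct mm_struct *mm)
{
	umem->mn.ops = &my_mn_ops;
	/* Note: no get_user_pages() here - nothing is pinned. */
	return mmu_notifier_register(&umem->mn, mm);
}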
> I think one issue is that many MM developers are insufficiently aware
> of such developments; having a technology presentation would probably
> help there; but traditionally LSF/MM sessions are more interactive
> between developers who are already quite familiar with the technology.
> I think it would help if you could send in advance a detailed
> presentation of the problem and the proposed solutions (and then what
> they require of the MM layer) so people can be better prepared.
We hope to send out an RFC patch-set of the feature implementation for our
hardware soon, which might help to demonstrate a use case for the technology.

The current programming model for InfiniBand (and related network protocols -
RoCE, iWarp) relies on the user space program registering memory regions for
use with the hardware. Upon registration, the driver pins the memory area
(via get_user_pages), updates a mapping table in the hardware, and provides
the user application with a handle for the mapping. The user space
application then uses this handle to request that the hardware access this
area for network IO.
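
To make the model concrete, here is a minimal user-space sketch using the
standard libibverbs calls (error handling omitted for brevity);
ibv_reg_mr() is the point where the driver pins the pages today:

#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
	struct ibv_device **dev_list = ibv_get_device_list(NULL);
	struct ibv_context *ctx = ibv_open_device(dev_list[0]);
	struct ibv_pd *pd = ibv_alloc_pd(ctx);

	size_t len = 4096;
	void *buf = malloc(len);

	/* Pins the pages backing buf (get_user_pages in the driver) and
	 * programs the HW mapping table. */
	struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
				       IBV_ACCESS_LOCAL_WRITE |
				       IBV_ACCESS_REMOTE_READ |
				       IBV_ACCESS_REMOTE_WRITE);

	/* mr->lkey / mr->rkey are the handles posted to the hardware in
	 * work requests; the pages stay pinned until deregistration. */
	printf("lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

	ibv_dereg_mr(mr);                /* unpins the pages */
	free(buf);
	ibv_dealloc_pd(pd);
	ibv_close_device(ctx);
	ibv_free_device_list(dev_list);
	return 0;
}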

While achieving unbeatable IO performance (round-trip latency, for user space
programs, of less than 2 microseconds, and bandwidth of 56 Gbit/second), this
model is relatively hard to use:

- The need for explicit memory registration for each area makes the API
   rather complex to use. An ideal API would provide a single handle per
   process, allowing it to communicate with the hardware using the process
   virtual addresses.

- After a part of the address space has been registered, the application
   must be careful not to move the pages around. For example, doing a fork
   results in all of the memory registrations pointing to the wrong pages
   (which is very hard to debug). This was partially addressed by
   MADV_DONTFORK [1], but the cure is nearly as bad as the disease: when
   MADV_DONTFORK is used on the heap, a simple call to malloc in the child
   process might crash the process (see the sketch after this list).

- Memory which was registered is not swappable. As a result, one cannot
   write applications that overcommit physical memory while using this API.
   Similarly to what Jerome described about GPU applications, a network
   application might want to actively use only ~10% of its allocated memory
   space, but it is required to either pin all of the memory, use heuristics
   to predict which memory will be used, or perform expensive copying/pinning
   for every network transaction. All of these are non-optimal.
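
The sketch below illustrates the MADV_DONTFORK workaround and its failure
mode (plain Linux calls; error handling omitted):

#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2 * 1024 * 1024;
	/* Page-aligned region, as madvise() requires. */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Keep the registered region out of child address spaces, so
	 * copy-on-write cannot silently redirect the HW registration
	 * to the wrong pages. */
	madvise(buf, len, MADV_DONTFORK);

	if (fork() == 0) {
		/* In the child, buf is now an unmapped hole. If the heap
		 * allocator had placed ordinary allocations inside this
		 * range, touching them here would segfault - the failure
		 * mode described above. */
		_exit(0);
	}
	return 0;
}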

> And first I'd like to ask, aren't IOMMUs supposed to already largely
> solve this problem ? (probably a dumb question, but that just tells
> you how much you need to explain :)
>

IOMMU v1 doesn't solve this problem, as it gives you only one mapping table
per PCI function. If you want ~64 processes on your machine to be able to
access the network, this is not nearly enough. It helps in implementing PCI
pass-through for virtualized guests (with the hardware devices exposing
several virtual PCI functions for the guests), but that is still not enough
for user space applications.

To some extent, IOMMU v1 might even be an obstacle to implementing such a
feature, as it prevents PCI devices from accessing parts of the memory,
requiring driver intervention for every page fault, even if the page is in
memory.

IOMMU v2 [2] is a step in the same direction that we are moving towards,
offering PASID - a unique identifier for each transaction that the device
performs, allowing the transaction to be associated with a specific process.
However, the challenges there are similar to those we encounter when using
an address translation table on the PCI device itself (NIC/GPU).
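
For reference, a sketch of the in-kernel interface the AMD IOMMUv2 driver
exports (going by my reading of <linux/amd-iommu.h> in current kernels;
bind_process_to_device() and the PASID count are made up for illustration):

#include <linux/amd-iommu.h>
#include <linux/pci.h>
#include <linux/sched.h>

static int bind_process_to_device(struct pci_dev *pdev, int pasid)
{
	int ret;

	/* Enable IOMMUv2 (PASID/PRI/ATS) state for this device. */
	ret = amd_iommu_init_device(pdev, 16 /* max PASIDs, arbitrary */);
	if (ret)
		return ret;

	/* DMA transactions tagged with 'pasid' are now translated through
	 * current->mm's page tables; faults are reported on the IOMMU's
	 * peripheral page request (PPR) queue instead of being fatal. */
	return amd_iommu_bind_pasid(pdev, pasid, current);
}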

References:

1. MADV_DONTFORK - http://lwn.net/Articles/171956/
2. AMD IOMMU v2 - http://www.linux-kvm.org/wiki/images/b/b1/2011-forum-amd-iommuv2-kvm.pdf



