From: Christoph Lameter <clameter@sgi.com>
To: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Roland Dreier <rdreier@cisco.com>, Rik van Riel <riel@redhat.com>,
steiner@sgi.com, Andrea Arcangeli <andrea@qumranet.com>,
a.p.zijlstra@chello.nl, izike@qumranet.com,
linux-kernel@vger.kernel.org, avi@qumranet.com,
linux-mm@kvack.org, daniel.blueman@quadrics.com,
Robin Holt <holt@sgi.com>,
general@lists.openfabrics.org,
Andrew Morton <akpm@linux-foundation.org>,
kvm-devel@lists.sourceforge.net
Subject: Re: [ofa-general] Re: Demand paging for memory regions
Date: Tue, 12 Feb 2008 18:35:09 -0800 (PST)
Message-ID: <Pine.LNX.4.64.0802121819530.12328@schroedinger.engr.sgi.com>
In-Reply-To: <20080213012638.GD31435@obsidianresearch.com>
On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
> The problem is that the existing wire protocols do not have a
> provision for doing an 'are you ready' or 'I am not ready' exchange
> and they are not designed to store page tables on both sides as you
> propose. The remote side can send RDMA WRITE traffic at any time after
> the RDMA region is established. The local side must be able to handle
> it. There is no way to signal that a page is not ready and the remote
> should not send.
>
> This means the only possible implementation is to stall/discard at the
> local adaptor when a RDMA WRITE is received for a page that has been
> reclaimed. This is what leads to deadlock/poor performance.
You would only use the wire protocols *after* having established the RDMA
region. The notifier chains allow an RDMA region (or parts thereof) to be
torn down on demand by the VM. The region can be reestablished if either
side accesses it. I hope I got that right; I have not had much exposure to
InfiniBand so far.
Let's say you have two systems, A and B. Each has its memory region, MemA
and MemB. Each side also has page tables for its region, PtA and PtB.
Now you establish an RDMA connection between both sides. The pages in both
MemA and MemB are present, and so are the entries in PtA and PtB. RDMA
traffic can proceed.
The VM on system A now gets into a situation in which memory becomes
heavily used by another (maybe non-RDMA) process. After checking via a
notifier aging callback that there was no recent reference to MemA or
MemB, it decides to reclaim memory from MemA.
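To make the aging check concrete, here is a toy Python model. It is not the mmu_notifier interface: Page, notifier_age and pick_victims are hypothetical names, and the assumption is simply that a page counts as reclaimable only if neither A nor B referenced it since the last scan.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical per-page "referenced" bits, one per side of the connection.
    local_young: bool = False   # touched by a process on A
    remote_young: bool = False  # touched by RDMA traffic from B


def notifier_age(page: Page) -> bool:
    """Hypothetical aging callback: report, then clear, the referenced bits.

    Returns True if either side touched the page since the last scan."""
    referenced = page.local_young or page.remote_young
    page.local_young = False
    page.remote_young = False
    return referenced


def pick_victims(pages: dict) -> list:
    """Return the page numbers that neither side referenced since the last scan."""
    return [pfn for pfn, page in sorted(pages.items()) if not notifier_age(page)]


pages = {0: Page(local_young=True), 1: Page(), 2: Page(remote_young=True)}
print(pick_victims(pages))  # [1]: pages 0 and 2 were recently used
print(pick_victims(pages))  # [0, 1, 2]: the first scan cleared the bits
```

Clearing the bits on every scan means a page has to stay idle for a full interval before it becomes a candidate, which is the usual second-chance behaviour of reference-bit aging.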
In that case it will notify the RDMA subsystem on A that it is trying to
reclaim a certain page.
The RDMA subsystem on A will then send a message to B notifying it that
the memory will be going away. B now has to remove its corresponding page
from memory (and drop the entry in PtB) and confirm to A that this has
happened. RDMA traffic is then stopped for this page. Then A can also
remove its page and the corresponding entry in PtA, and the page is
reclaimed or pushed out to swap, completing the page reclaim.
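The ordering of that handshake can be sketched as a toy Python model. Node, handle_invalidate and reclaim are hypothetical names, the direct method call stands in for sending a message to B and waiting for its reply, and timeouts and lost messages are left out. The point it shows is that A drops its own PtA entry only after B has confirmed.

```python
class Node:
    """Toy model of one side (A or B) of an RDMA connection."""

    def __init__(self, name, log):
        self.name = name
        self.pt = {}      # stand-in for PtA / PtB: page number -> frame
        self.peer = None
        self.log = log    # shared event log, to make the ordering visible

    def handle_invalidate(self, pfn):
        # Runs on B: stop RDMA to the page and drop the PtB entry, then confirm.
        self.pt.pop(pfn, None)
        self.log.append(f"{self.name}: dropped {pfn}")
        return "ack"

    def reclaim(self, pfn):
        # Runs on A when reclaim has picked this page. The direct method call
        # stands in for a message to B and waiting for its reply.
        if self.peer.handle_invalidate(pfn) != "ack":
            return False  # no confirmation: the page must stay in place
        self.pt.pop(pfn, None)  # only now is it safe to drop the PtA entry
        self.log.append(f"{self.name}: reclaimed {pfn}")
        return True


log = []
a, b = Node("A", log), Node("B", log)
a.peer, b.peer = b, a
a.pt[7] = "frame-A7"
b.pt[7] = "frame-B7"
a.reclaim(7)
print(log)  # ['B: dropped 7', 'A: reclaimed 7']
```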
If either side then accesses the page again, the reverse process
happens. If B accesses the page, it will first incur a page fault
because the entry in PtB is missing. The fault will then cause a
message to be sent to A to establish the page again. A will create an
entry in PtA and then confirm to B that the page was established. At
that point RDMA operations can occur again.
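The reverse path can be modelled the same way. This is again a toy sketch with hypothetical names (Side, handle_establish, access), and a method call stands in for the request/confirm exchange between B and A. A missing PtB entry plays the role of the page fault.

```python
class Side:
    """Toy model of one side of the connection (hypothetical names)."""

    def __init__(self, name, log):
        self.name = name
        self.pt = {}      # stand-in for PtA / PtB: page number -> frame
        self.peer = None
        self.log = log

    def handle_establish(self, pfn):
        # Runs on A: bring the page back (e.g. from swap) and map it in PtA.
        if pfn not in self.pt:
            self.pt[pfn] = f"{self.name}-frame-{pfn}"
            self.log.append(f"{self.name}: established {pfn}")
        return "ok"

    def access(self, pfn):
        # Runs on B: a missing PtB entry is treated as a page fault.
        if pfn not in self.pt:
            self.log.append(f"{self.name}: fault on {pfn}")
            # Method call stands in for the request/confirm message exchange.
            if self.peer.handle_establish(pfn) != "ok":
                raise RuntimeError(f"peer refused to re-establish page {pfn}")
            self.pt[pfn] = f"{self.name}-frame-{pfn}"
        self.log.append(f"{self.name}: RDMA to {pfn}")


log = []
a, b = Side("A", log), Side("B", log)
a.peer, b.peer = b, a
b.access(3)
b.access(3)
print(log)
# ['B: fault on 3', 'A: established 3', 'B: RDMA to 3', 'B: RDMA to 3']
```

Only the first access pays for the round trip; once both entries exist again, RDMA proceeds without further messages.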
So the whole scheme does not really need a hardware page table in the RDMA
hardware. The page tables of the two systems A and B are sufficient.
The scheme can also be applied to a larger range than only a single page.
The RDMA subsystem could tear down a large section when reclaim is
pushing on it and then reestablish it as needed.
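A range-based variant of the teardown could look like the following toy sketch. Endpoint, handle_invalidate_range and reclaim_range are hypothetical names, ranges are assumed half-open [start, end), and the counter only illustrates that one message can cover the whole range instead of one message per page.

```python
class Endpoint:
    """Toy model of one side, tearing down whole ranges (hypothetical names)."""

    def __init__(self, name, pages):
        self.name = name
        self.pt = {pfn: f"{name}-frame-{pfn}" for pfn in pages}
        self.peer = None
        self.messages_sent = 0

    def _drop_range(self, start, end):
        # Remove every mapping with start <= pfn < end; return how many.
        victims = [pfn for pfn in self.pt if start <= pfn < end]
        for pfn in victims:
            del self.pt[pfn]
        return len(victims)

    def handle_invalidate_range(self, start, end):
        # Runs on the peer: one request invalidates the whole range.
        return self._drop_range(start, end)

    def reclaim_range(self, start, end):
        # Runs on the reclaiming side: one message instead of one per page.
        self.messages_sent += 1
        self.peer.handle_invalidate_range(start, end)
        return self._drop_range(start, end)


a, b = Endpoint("A", range(16)), Endpoint("B", range(16))
a.peer, b.peer = b, a
print(a.reclaim_range(4, 12), a.messages_sent)  # 8 1
print(sorted(b.pt))  # [0, 1, 2, 3, 12, 13, 14, 15]
```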
Swapping and page reclaim certainly do not improve the speed of the
application affected by them, but they allow the VM to manage memory
effectively if multiple loads are running on a system.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .