From: jglisse@redhat.com
To: linux-mm@kvack.org
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Rafael J . Wysocki" <rafael@kernel.org>,
	"Ross Zwisler" <ross.zwisler@linux.intel.com>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Dave Hansen" <dave.hansen@intel.com>,
	"Haggai Eran" <haggaie@mellanox.com>,
	"Balbir Singh" <balbirs@au1.ibm.com>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>,
	"Benjamin Herrenschmidt" <benh@kernel.crashing.org>,
	"Felix Kuehling" <felix.kuehling@amd.com>,
	"Philip Yang" <Philip.Yang@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Paul Blinzer" <Paul.Blinzer@amd.com>,
	"Logan Gunthorpe" <logang@deltatee.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Ralph Campbell" <rcampbell@nvidia.com>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Jonathan Cameron" <jonathan.cameron@huawei.com>,
	"Mark Hairgrove" <mhairgrove@nvidia.com>,
	"Vivek Kini" <vkini@nvidia.com>,
	"Mel Gorman" <mgorman@techsingularity.net>,
	"Dave Airlie" <airlied@redhat.com>,
	"Ben Skeggs" <bskeggs@redhat.com>,
	"Andrea Arcangeli" <aarcange@redhat.com>
Subject: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
Date: Mon,  3 Dec 2018 18:34:57 -0500	[thread overview]
Message-ID: <20181203233509.20671-3-jglisse@redhat.com> (raw)
In-Reply-To: <20181203233509.20671-1-jglisse@redhat.com>

From: Jérôme Glisse <jglisse@redhat.com>

Add documentation for what HMS is and what it is for (see patch content).

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paul Blinzer <Paul.Blinzer@amd.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Vivek Kini <vkini@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 Documentation/vm/hms.rst | 275 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 246 insertions(+), 29 deletions(-)

diff --git a/Documentation/vm/hms.rst b/Documentation/vm/hms.rst
index dbf0f71918a9..bd7c9e8e7077 100644
--- a/Documentation/vm/hms.rst
+++ b/Documentation/vm/hms.rst
@@ -4,32 +4,249 @@
 Heterogeneous Memory System (HMS)
 =================================
 
-System with complex memory topology needs a more versatile memory topology
-description than just node where a node is a collection of memory and CPU.
-In heterogeneous memory system we consider four types of object::
-   - target: which is any kind of memory
-   - initiator: any kind of device or CPU
-   - inter-connect: any kind of links that connects target and initiator
-   - bridge: a link between two inter-connects
-
-Properties (like bandwidth, latency, bus width, ...) are define per bridge
-and per inter-connect. Property of an inter-connect apply to all initiators
-which are link to that inter-connect. Not all initiators are link to all
-inter-connect and thus not all initiators can access all memory (this apply
-to CPU too ie some CPU might not be able to access all memory).
-
-Bridges allow initiators (that can use the bridge) to access target for
-which they do not have a direct link with (ie they do not share a common
-inter-connect with the target).
-
-Through this four types of object we can describe any kind of system memory
-topology. To expose this to userspace we expose a new sysfs hierarchy (that
-co-exist with the existing one)::
-   - /sys/bus/hms/target* all targets in the system
-   - /sys/bus/hms/initiator* all initiators in the system
-   - /sys/bus/hms/interconnect* all inter-connects in the system
-   - /sys/bus/hms/bridge* all bridges in the system
-
-Inside each bridge or inter-connect directory they are symlinks to targets
-and initiators that are linked to that bridge or inter-connect. Properties
-are defined inside bridge and inter-connect directory.
+Heterogeneous memory systems are becoming the norm. In those systems
+there is not only the main system memory for each node, but also device
+memory and/or a memory hierarchy to consider. Device memory can come
+from a device like a GPU or FPGA, or from a memory only device
+(persistent memory, or a high density memory device).
+
+A memory hierarchy is when you not only have the main memory but also
+other types of memory like HBM (High Bandwidth Memory, often stacked
+on the CPU or GPU die), persistent memory or high density memory (ie
+something slower than a regular DDR DIMM but much bigger).
+
+On top of this diversity of memories you also have to account for the
+system bus topology, ie how all CPUs and devices are connected to each
+other. Userspace does not care about the exact physical topology but
+cares about the topology from a behavioral point of view: what are all
+the paths between an initiator (anything that can initiate memory
+access like a CPU, GPU, FPGA, network controller, ...) and a target
+memory, and what are the properties of each of those paths (bandwidth,
+latency, granularity, ...).
+
+This means that it is no longer sufficient to consider a flat view for
+each node in a system; for maximum performance we need to account for
+all of this new memory and also for the system topology. This is why
+this proposal is unlike the HMAT proposal [1], which tries to extend
+the existing NUMA model to new types of memory. Here we are tackling a
+much more profound change that departs from NUMA.
+
+
+One of the reasons for such a radical change is that the advance of
+accelerators like GPUs or FPGAs means that the CPU is no longer the
+only place where computation happens. It is becoming more and more
+common for an application to mix and match different accelerators to
+perform its computation. So we can no longer satisfy ourselves with a
+CPU centric and flat view of the system like NUMA and NUMA distance.
+
+
+HMS tackles these problems through three aspects:
+    1 - Expose the complex system topology and the various kinds of
+        memory to user space so that applications have a standard way
+        and a single place to get all the information they care about.
+    2 - A new API for user space to bind or provide hints to the
+        kernel on which memory to use for a range of virtual addresses
+        (hbind(), a new mbind()-like syscall).
+    3 - Kernel side changes to the vm policy code to handle this.
+
+
+The rest of this document is split into 3 sections. The first section
+talks about complex system topologies: what they are, how they are
+used today and how to describe them tomorrow. The second section talks
+about the new API to bind or provide hints to the kernel for a range
+of virtual addresses. The third section talks about the new mechanism
+to track bindings/hints provided by user space or device drivers
+inside the kernel.
+
+
+1) Complex system topologies and how to represent them
+=======================================================
+
+Inside a node you can have a complex topology of memory. For instance
+you can have multiple HBM memories in a node, each HBM memory tied to
+a set of CPUs (all of which are in the same node). This means that you
+have a hierarchy of memory for the CPUs: the local fast HBM, which is
+expected to be relatively small compared to main memory, and then the
+main memory itself. New memory technologies might also deepen this
+hierarchy with another level of yet slower but gigantic memory (some
+persistent memory technologies might fall into that category). Another
+example is device memory, and devices themselves can have a hierarchy,
+like HBM on top of the device cores plus main device memory.
+
+On top of that you can have multiple paths to access each memory and
+each path can have different properties (latency, bandwidth, ...).
+Also there is not always symmetry, ie some memory might only be
+accessible by some devices or CPUs and not accessible by everyone.
+
+So a flat hierarchy for each node is not capable of representing this
+kind of complexity. To simplify the discussion, and because we do not
+want to single out CPUs from devices, from here on out we will use
+initiator to refer to either a CPU or a device. An initiator is any
+kind of CPU or device that can access memory (ie initiate memory
+accesses).
+
+At this point an example of such a system might help:
+    - 2 nodes and for each node:
+        - 1 CPU per node with 2 complexes of CPU cores per CPU
+        - one HBM memory for each complex of CPU cores (200GB/s)
+        - the CPU core complexes are linked to each other (100GB/s)
+        - main memory (90GB/s)
+        - 4 GPUs, each with:
+            - HBM memory (1000GB/s) (not CPU accessible)
+            - GDDR memory (500GB/s) (CPU accessible)
+            - connected to the CPU root controller (60GB/s)
+            - connected to the other GPUs (even GPUs from the second
+              node) with a GPU link (400GB/s)
+
+In this example we restrict ourselves to bandwidth and ignore bus
+width and latency. This is just to simplify the discussion but
+obviously they also factor in.
+
+
+Userspace very much would like to know this information. For instance
+HPC folks have developed complex libraries to manage this and there is
+wide research on the topic [2] [3] [4] [5]. Today most of the work is
+done by hardcoding things for a specific platform, which is somewhat
+acceptable for HPC folks where the platform stays the same for a long
+period of time.
+
+Roughly speaking I see two broad use cases for topology information.
+The first is virtualization, where you want to segment your hardware
+properly for each VM (binding memory, CPUs and GPUs that are all close
+to each other). The second is applications, many of which can
+partition their workload to minimize exchanges between partitions,
+allowing each partition to be bound to a subset of devices and CPUs
+that are close to each other (for maximum locality). Here it is much
+more than just NUMA distance: you can leverage the memory hierarchy
+and the system topology all together (see [2] [3] [4] [5] for more
+references and details).
+
+So this is not exposing topology just for the sake of cool graphs in
+userspace. There are active users of such information today, and if we
+want to grow and broaden its usage we should provide a unified API to
+standardize how that information is accessed by everyone.
+
+
+One proposal so far to handle new types of memory is to use CPU-less
+nodes for them [6]. While the same idea can apply to device memory, it
+is still hard to describe multiple paths with different properties in
+such a scheme. While it is backward compatible and requires minimal
+changes, it simply cannot convey a complex topology (think any kind of
+random graph, not just a tree-like graph).
+
+So HMS uses a new way to expose the system topology to userspace. It
+relies on 4 types of objects:
+    - target: any kind of memory (main memory, HBM, device, ...)
+    - initiator: CPU or device (anything that can access memory)
+    - link: anything that links initiators and targets
+    - bridge: anything that allows a group of initiators to access a
+      remote target (ie a target they are not connected to directly
+      through a link)
+
+Properties like bandwidth, latency, ... are all set per bridge and per
+link. All initiators connected to a link can access any target memory
+also connected to the same link, all with the same link properties.
+
+Links do not need to match physical hardware, ie a single physical
+link can be exposed as one or multiple software links. This allows
+modeling devices connected to the same physical link (like PCIE for
+instance) but not with the same characteristics (like the number of
+lanes or the lane speed in PCIE). The reverse is also true, ie a
+single software link can map to multiple physical links.
+
+Bridges allow initiators to access remote links. A bridge connects two
+links to each other and is also specific to a list of initiators (ie
+not all initiators connected to each of the links can use the bridge).
+Bridges have their own properties (bandwidth, latency, ...) and the
+actual value of each property is the lowest common denominator between
+the bridge and each of the links.
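+
+As an illustration, here is a minimal sketch, with made-up C
+structures that are not part of this patch, of how the example system
+from section 1 could be encoded with these four object types::
+
+    /* Made-up structures, only to illustrate the object model. */
+    struct hms_target    { unsigned uid; unsigned long long size; };
+    struct hms_initiator { unsigned uid; };
+
+    struct hms_link {
+        unsigned uid;
+        unsigned bandwidth;    /* GB/s, eg 200 for an HBM link above  */
+        unsigned ntargets, ninitiators;
+        unsigned *targets;     /* uids of targets on this link        */
+        unsigned *initiators;  /* uids of initiators on this link     */
+    };
+
+    struct hms_bridge {
+        unsigned uid;
+        unsigned bandwidth;    /* eg 60 for the CPU root controller   */
+        unsigned links[2];     /* uids of the two links it connects   */
+        unsigned ninitiators;
+        unsigned *initiators;  /* only these may cross the bridge     */
+        /* effective bandwidth = min(bridge, link[0], link[1])        */
+    };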
+
+
+This model allows describing any kind of directed graph and thus any
+kind of topology we might see in the future. It also makes it easier
+to add new properties to each object type.
+
+Moreover it can be used to expose devices capable of doing peer to
+peer between themselves. For that, simply have all devices capable of
+peer to peer share a common link, or use the bridge object if the peer
+to peer capability is only one way for instance.
+
+
+HMS uses the above scheme to expose the system topology through sysfs
+under /sys/bus/hms/ with:
+    - /sys/bus/hms/devices/v%version-%id-target/ : a target memory;
+      each has a UID and you find the usual values in that folder
+      (node id, size, ...)
+
+    - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
+      (CPU or device); each has an HMS UID but also a CPU id for CPUs
+      (which matches the CPU id in /sys/bus/cpu/; for a device you
+      have a path that can be the PCIE bus ID for instance)
+
+    - /sys/bus/hms/devices/v%version-%id-link : a link; each has a
+      UID and a file per property (bandwidth, latency, ...); you also
+      find a symlink to every target and initiator connected to that
+      link.
+
+    - /sys/bus/hms/devices/v%version-%id-bridge : a bridge; each has
+      a UID and a file per property (bandwidth, latency, ...); you
+      also find a symlink to all initiators that can use that bridge.
+
+To help with forward compatibility each object has a version value and
+it is mandatory for user space to only use targets or initiators with
+a version supported by the user space. For instance if user space only
+knows what version 1 means and sees a target with version 2 then it
+must ignore that target as if it did not exist.
+
+Mandating that allows the addition of new properties that break
+backward compatibility, ie user space must know how the new property
+affects the object to be able to use it safely.
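+
+A minimal userspace sketch of that rule, assuming the directory layout
+above (the name parsing here is only illustrative)::
+
+    #include <dirent.h>
+    #include <stdio.h>
+
+    #define SUPPORTED_VERSION 1
+
+    int main(void)
+    {
+        DIR *dir = opendir("/sys/bus/hms/devices");
+        struct dirent *ent;
+        unsigned version, uid;
+        char type[32];
+
+        if (!dir)
+            return 1;
+        while ((ent = readdir(dir)) != NULL) {
+            /* entries are named v%version-%id-%type */
+            if (sscanf(ent->d_name, "v%u-%u-%31s",
+                       &version, &uid, type) != 3)
+                continue;
+            if (version != SUPPORTED_VERSION)
+                continue;   /* unknown version: pretend it does not exist */
+            printf("usable %s uid %u (version %u)\n", type, uid, version);
+        }
+        closedir(dir);
+        return 0;
+    }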
+
+The main memory of each node is exposed under a common target. For now
+device drivers are responsible for registering the memory they want to
+expose through that scheme, but in the future that information might
+come from the system firmware (this is a different discussion).
+
+
+
+2) hbind(): bind a range of virtual addresses to heterogeneous memory
+======================================================================
+
+Instead of using a node bitmap like mbind(), hbind() takes an array of
+uids, where each uid identifies a unique memory target inside the new
+memory topology description. User space also provides an array of
+modifiers. A modifier can be seen as the flags parameter of mbind(),
+but here we use an array so that user space can not only supply a
+modifier but also a value with it. This should allow the API to grow
+more features in the future. The kernel should return -EINVAL if it is
+provided with an unknown modifier and just ignore the call altogether,
+forcing user space to restrict itself to the modifiers supported by
+the kernel it is running on (I know I am dreaming about well behaved
+user space).
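+
+A purely illustrative sketch of what a call could look like (the exact
+prototype, header and modifier encoding are hypothetical here; the
+real interface is defined by the hbind() patches, not by this
+document)::
+
+    #include <stddef.h>
+    #include <stdint.h>
+
+    /* Hypothetical prototype, for illustration only. */
+    int hbind(void *addr, size_t len,
+              const uint32_t *targets, unsigned ntargets,
+              const uint32_t *modifiers, unsigned nmodifiers);
+
+    static int bind_to_target(void *buf, size_t size, uint32_t target_uid)
+    {
+        uint32_t targets[] = { target_uid }; /* HMS target uid from sysfs */
+
+        /*
+         * No modifiers here; with modifiers, an unknown one would make
+         * the kernel fail the whole call with -EINVAL.
+         */
+        return hbind(buf, size, targets, 1, NULL, 0);
+    }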
+
+
+Note that none of this excludes automatic memory placement like
+autonuma. I also believe that we will see something similar to
+autonuma for device memory.
+
+
+3) Tracking and applying heterogeneous memory policies
+======================================================
+
+The current memory policy infrastructure is node oriented. Instead of
+changing that and risking breakage and regressions, HMS adds a new
+heterogeneous policy tracking infrastructure. The expectation is that
+existing applications can keep using mbind() and all the existing
+infrastructure undisturbed and unaffected, while new applications will
+use the new API and should avoid mixing and matching both (as they can
+achieve the same thing with the new API).
+
+Also the policy is not directly tied to the vma structure for a few
+reasons:
+    - avoid having to split a vma for a policy that does not cover the
+      full vma
+    - avoid changing too much vma code
+    - avoid growing the vma structure with an extra pointer
+
+The overall design is simple: on an hbind() call a hms policy
+structure is created for the supplied range and hms uses the callback
+associated with the target memory. This callback is provided by the
+device driver for device memory, or by core HMS for regular main
+memory. The callback can decide to migrate the range to the target
+memories or do nothing (this can be influenced by the flags provided
+to hbind() too).
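+
+A rough sketch of the shape such a callback could take (the names and
+prototype below are hypothetical; the real ones are defined by the HMS
+patches themselves)::
+
+    /* Hypothetical sketch, not the actual kernel structures. */
+    struct hms_policy_range;        /* policy for one hbind() range */
+
+    struct hms_target_ops {
+        /*
+         * Called when an hbind() naming this target covers
+         * [start, end); may migrate the range right away or simply
+         * record the hint, depending on the hbind() flags.
+         */
+        int (*bind)(struct hms_policy_range *range,
+                    unsigned long start, unsigned long end,
+                    unsigned long flags);
+    };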
-- 
2.17.2

Thread overview: 95+ messages
2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
2018-12-03 23:34 ` [RFC PATCH 01/14] mm/hms: heterogeneous memory system (sysfs infrastructure) jglisse
2018-12-03 23:34 ` jglisse [this message]
2018-12-04 17:06   ` [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation Andi Kleen
2018-12-04 18:24     ` Jerome Glisse
2018-12-04 18:31       ` Dan Williams
2018-12-04 18:57         ` Jerome Glisse
2018-12-04 19:11           ` Logan Gunthorpe
2018-12-04 19:22             ` Jerome Glisse
2018-12-04 19:41               ` Logan Gunthorpe
2018-12-04 20:13                 ` Jerome Glisse
2018-12-04 20:30                   ` Logan Gunthorpe
2018-12-04 20:59                     ` Jerome Glisse
2018-12-04 21:19                       ` Logan Gunthorpe
2018-12-04 21:51                         ` Jerome Glisse
2018-12-04 22:16                           ` Logan Gunthorpe
2018-12-04 23:56                             ` Jerome Glisse
2018-12-05  1:15                               ` Logan Gunthorpe
2018-12-05  2:31                                 ` Jerome Glisse
2018-12-05 17:41                                   ` Logan Gunthorpe
2018-12-05 18:07                                     ` Jerome Glisse
2018-12-05 18:20                                       ` Logan Gunthorpe
2018-12-05 18:33                                         ` Jerome Glisse
2018-12-05 18:48                                           ` Logan Gunthorpe
2018-12-05 18:55                                             ` Jerome Glisse
2018-12-05 19:10                                               ` Logan Gunthorpe
2018-12-05 22:58                                                 ` Jerome Glisse
2018-12-05 23:09                                                   ` Logan Gunthorpe
2018-12-05 23:20                                                     ` Jerome Glisse
2018-12-05 23:23                                                       ` Logan Gunthorpe
2018-12-05 23:27                                                         ` Jerome Glisse
2018-12-06  0:08                                                           ` Dan Williams
2018-12-05  2:34                                 ` Dan Williams
2018-12-05  2:37                                   ` Jerome Glisse
2018-12-05 17:25                                     ` Logan Gunthorpe
2018-12-05 18:01                                       ` Jerome Glisse
2018-12-04 20:14             ` Andi Kleen
2018-12-04 20:47               ` Logan Gunthorpe
2018-12-04 21:15                 ` Jerome Glisse
2018-12-05  0:54             ` Kuehling, Felix
2018-12-04 19:19           ` Dan Williams
2018-12-04 19:32             ` Jerome Glisse
2018-12-04 20:12       ` Andi Kleen
2018-12-04 20:41         ` Jerome Glisse
2018-12-05  4:36       ` Aneesh Kumar K.V
2018-12-05  4:41         ` Jerome Glisse
2018-12-05 10:52   ` Mike Rapoport
2018-12-03 23:34 ` [RFC PATCH 03/14] mm/hms: add target memory to heterogeneous memory system infrastructure jglisse
2018-12-03 23:34 ` [RFC PATCH 04/14] mm/hms: add initiator " jglisse
2018-12-03 23:35 ` [RFC PATCH 05/14] mm/hms: add link " jglisse
2018-12-03 23:35 ` [RFC PATCH 06/14] mm/hms: add bridge " jglisse
2018-12-03 23:35 ` [RFC PATCH 07/14] mm/hms: register main memory with heterogenenous memory system jglisse
2018-12-03 23:35 ` [RFC PATCH 08/14] mm/hms: register main CPUs " jglisse
2018-12-03 23:35 ` [RFC PATCH 09/14] mm/hms: hbind() for heterogeneous memory system (aka mbind() for HMS) jglisse
2018-12-03 23:35 ` [RFC PATCH 10/14] mm/hbind: add heterogeneous memory policy tracking infrastructure jglisse
2018-12-03 23:35 ` [RFC PATCH 11/14] mm/hbind: add bind command to heterogeneous memory policy jglisse
2018-12-03 23:35 ` [RFC PATCH 12/14] mm/hbind: add migrate command to hbind() ioctl jglisse
2018-12-03 23:35 ` [RFC PATCH 13/14] drm/nouveau: register GPU under heterogeneous memory system jglisse
2018-12-03 23:35 ` [RFC PATCH 14/14] test/hms: tests for " jglisse
2018-12-04  7:44 ` [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() Aneesh Kumar K.V
2018-12-04 14:44   ` Jerome Glisse
2018-12-04 18:02 ` Dave Hansen
2018-12-04 18:49   ` Jerome Glisse
2018-12-04 18:54     ` Dave Hansen
2018-12-04 19:11       ` Jerome Glisse
2018-12-04 21:37     ` Dave Hansen
2018-12-04 21:57       ` Jerome Glisse
2018-12-04 23:58         ` Dave Hansen
2018-12-05  0:29           ` Jerome Glisse
2018-12-05  1:22         ` Kuehling, Felix
2018-12-05 11:27     ` Aneesh Kumar K.V
2018-12-05 16:09       ` Jerome Glisse
2018-12-04 23:54 ` Dave Hansen
2018-12-05  0:15   ` Jerome Glisse
2018-12-05  1:06     ` Dave Hansen
2018-12-05  2:13       ` Jerome Glisse
2018-12-05 17:27         ` Dave Hansen
2018-12-05 17:53           ` Jerome Glisse
2018-12-06 18:25             ` Dave Hansen
2018-12-06 19:20               ` Jerome Glisse
2018-12-06 19:31                 ` Dave Hansen
2018-12-06 20:11                   ` Logan Gunthorpe
2018-12-06 22:04                     ` Dave Hansen
2018-12-06 22:39                       ` Jerome Glisse
2018-12-06 23:09                         ` Dave Hansen
2018-12-06 23:28                           ` Logan Gunthorpe
2018-12-06 23:34                             ` Dave Hansen
2018-12-06 23:38                             ` Dave Hansen
2018-12-06 23:48                               ` Logan Gunthorpe
2018-12-07  0:20                                 ` Jerome Glisse
2018-12-07 15:06                                   ` Jonathan Cameron
2018-12-07 19:37                                     ` Jerome Glisse
2018-12-07  0:15                           ` Jerome Glisse
2018-12-06 20:27                   ` Jerome Glisse
2018-12-06 21:46                     ` Jerome Glisse
