From: Jerome Glisse <jglisse@redhat.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>,
Andi Kleen <ak@linux.intel.com>, Linux MM <linux-mm@kvack.org>,
Andrew Morton <akpm@linux-foundation.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
"Rafael J. Wysocki" <rafael@kernel.org>,
Dave Hansen <dave.hansen@intel.com>,
Haggai Eran <haggaie@mellanox.com>,
balbirs@au1.ibm.com,
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
"Kuehling, Felix" <felix.kuehling@amd.com>,
Philip.Yang@amd.com, "Koenig,
Christian" <christian.koenig@amd.com>,
"Blinzer, Paul" <Paul.Blinzer@amd.com>,
John Hubbard <jhubbard@nvidia.com>,
rcampbell@nvidia.com
Subject: Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
Date: Tue, 4 Dec 2018 21:37:24 -0500 [thread overview]
Message-ID: <20181205023724.GF3045@redhat.com> (raw)
In-Reply-To: <CAPcyv4ihEesx1G1on6JA8qZ6RooOsgO2CL_=1gXVMXpMJW_N9w@mail.gmail.com>
On Tue, Dec 04, 2018 at 06:34:37PM -0800, Dan Williams wrote:
> On Tue, Dec 4, 2018 at 5:15 PM Logan Gunthorpe <logang@deltatee.com> wrote:
> >
> >
> >
> > On 2018-12-04 4:56 p.m., Jerome Glisse wrote:
> > > One example I have is 4 nodes (CPU sockets), each node with 8 GPUs, and
> > > two 8-GPU nodes connected to each other with a fast mesh (i.e. each
> > > GPU can peer to peer with every other GPU at the same bandwidth). Then
> > > these 2 blocks are connected to the other block through a shared link.
> > >
> > > So it looks like:
> > > SOCKET0----SOCKET1-----SOCKET2----SOCKET3
> > >    |          |           |          |
> > > S0-GPU0====S1-GPU0     S2-GPU0====S3-GPU0
> > >    ||     \\//            ||     \\//
> > >    ||     //\\            ||     //\\
> > >   ...  ====  ...  -----  ...  ====  ...
> > >    ||     \\//            ||     \\//
> > >    ||     //\\            ||     //\\
> > > S0-GPU7====S1-GPU7     S2-GPU7====S3-GPU7
> >
> > Well the existing NUMA node stuff tells userspace which GPU belongs to
> > which socket (every device in sysfs already has a numa_node attribute).
> > And if that's not good enough we should work to improve how that works
> > for all devices. This problem isn't specific to GPUs or devices with
> > memory and seems rather orthogonal to an API to bind to device memory.
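> >
> > For instance, something like the following would read that attribute
> > from userspace (a minimal sketch; the PCI address below is purely
> > illustrative, and the attribute reports -1 when no affinity is known):
> >
> >   #include <stdio.h>
> >
> >   int main(void)
> >   {
> >           /* every struct device exposes numa_node in sysfs */
> >           FILE *f = fopen("/sys/bus/pci/devices/0000:03:00.0/numa_node",
> >                           "r");
> >           int node = -1;
> >
> >           if (!f)
> >                   return 1;
> >           if (fscanf(f, "%d", &node) != 1)
> >                   node = -1;
> >           fclose(f);
> >           printf("device is on NUMA node %d\n", node);
> >           return 0;
> >   }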
> >
> > > How would the above example look? I fail to see how to do it
> > > inside the current sysfs. Maybe by creating multiple virtual devices,
> > > one for each of the inter-connects? So something like
> > >
> > > link0 -> device:00 which itself has S0-GPU0 ... S0-GPU7 as children
> > > link1 -> device:01 which itself has S1-GPU0 ... S1-GPU7 as children
> > > link2 -> device:02 which itself has S2-GPU0 ... S2-GPU7 as children
> > > link3 -> device:03 which itself has S3-GPU0 ... S3-GPU7 as children
> >
> > I think the "links" between GPUs themselves would be a bus. In the same
> > way a NUMA node is a bus. Each device in sysfs would then need a
> > directory or something to describe what "link bus(es)" they are a part
> > of. Though there are other ways to do this: a GPU driver could simply
> > create symlinks to other GPUs inside a "neighbours" directory under the
> > device path or something like that.
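> >
> > Roughly, and only as a sketch of that idea (the "neighbours" name is
> > made up here, and gpu_dev/peer_dev stand in for the driver's struct
> > device pointers), the driver side could be:
> >
> >   #include <linux/device.h>
> >   #include <linux/kobject.h>
> >   #include <linux/sysfs.h>
> >
> >   /* one-time, per GPU: add a "neighbours" subdir under its device dir */
> >   struct kobject *dir = kobject_create_and_add("neighbours",
> >                                                &gpu_dev->kobj);
> >
> >   /* then, for each directly connected peer, point a symlink at it */
> >   if (dir)
> >           sysfs_create_link(dir, &peer_dev->kobj, dev_name(peer_dev));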
> >
> > The point is that this seems like it is specific to GPUs and could
> > easily be solved in the GPU community without any new universal concepts
> > or big APIs.
> >
> > And for applications that need topology information, a lot of it is
> > already there, we just need to fill in the gaps with small changes that
> > would be much less controversial. Then if you want to create a libhms
> > (or whatever) to help applications parse this information out of
> > existing sysfs that would make sense.
> >
> > > My proposal is to put HMS behind staging for a while and also avoid
> > > any disruption to existing code paths. See whether people living on the
> > > bleeding edge get interested in that information. If not, then
> > > I can strip my work down to the bare minimum, which is about device
> > > memory.
> >
> > This isn't my area or decision to make, but it seems to me like this is
> > not what staging is for. Staging is for introducing *drivers* that
> > aren't up to the kernel's quality level, and they all reside under the
> > drivers/staging path. It's not meant to introduce experimental APIs
> > around the kernel that might be revoked at any time.
> >
> > DAX introduced itself by marking the config option as EXPERIMENTAL and
> > printing warnings to dmesg when someone tries to use it. But, to my
> > knowledge, DAX also wasn't creating APIs with the intention of changing
> > or revoking them -- it was introducing features using largely existing
> > APIs that had many broken corner cases.
> >
> > Do you know of any precedents where big APIs were introduced and then
> > later revoked or radically changed like you are proposing to do?
>
> This came up before for APIs even better defined than HMS, as well as
> more limited in scope, i.e. experimental ABI availability only for -rc
> kernels. Linus said this:
>
> "There are no loopholes. No "but it's been only one release". No, no,
> no. The whole point is that users are supposed to be able to *trust*
> the kernel. If we do something, we keep on doing it.
>
> And if it makes it harder to add new user-visible interfaces, then
> that's a *good* thing." [1]
>
> The takeaway being don't land work-in-progress ABIs in the kernel.
> Once an application depends on it, there are no more incompatible
> changes possible regardless of the warnings, experimental notices, or
> "staging" designation. DAX is experimental because there are cases
> where it currently does not work with respect to another kernel
> feature like xfs-reflink, RDMA. The plan is to fix those, not continue
> to hide behind an experimental designation, and fix them in a way that
> preserves the user visible behavior that has already been exposed,
> i.e. no regressions.
>
> [1]: https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-August/004742.html
So I guess I am heading down the vXX road ... such is my life :)
Cheers,
Jérôme