Re: RFC: Memory Tiering Kernel Interfaces

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Dan Williams <dan.j.williams@intel.com>
To: Yang Shi <shy828301@gmail.com>
Cc: Wei Xu <weixugc@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Dave Hansen <dave.hansen@linux.intel.com>,
	Huang Ying <ying.huang@intel.com>,  Linux MM <linux-mm@kvack.org>,
	Greg Thelen <gthelen@google.com>,
	 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Jagdish Gediya <jvgediya@linux.ibm.com>,
	 Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Alistair Popple <apopple@nvidia.com>,
	 Davidlohr Bueso <dave@stgolabs.net>,
	Michal Hocko <mhocko@kernel.org>,
	 Baolin Wang <baolin.wang@linux.alibaba.com>,
	Brice Goglin <brice.goglin@gmail.com>,
	 Feng Tang <feng.tang@intel.com>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>
Subject: Re: RFC: Memory Tiering Kernel Interfaces
Date: Sun, 1 May 2022 11:35:01 -0700	[thread overview]
Message-ID: <CAPcyv4g7kyPsSKGT1rR4yy680VD6UJ8V7wzj0OUqN2y2-PjOpQ@mail.gmail.com> (raw)
In-Reply-To: <CAHbLzkq1YXXLMiREpGnzhJjPssu4WpSsnkTmrLJ=hAEhZVUr9w@mail.gmail.com>

On Fri, Apr 29, 2022 at 8:59 PM Yang Shi <shy828301@gmail.com> wrote:
>
> Hi Wei,
>
> Thanks for the nice writing. Please see the below inline comments.
>
> On Fri, Apr 29, 2022 at 7:10 PM Wei Xu <weixugc@google.com> wrote:
> >
> > The current kernel has the basic memory tiering support: Inactive
> > pages on a higher tier NUMA node can be migrated (demoted) to a lower
> > tier NUMA node to make room for new allocations on the higher tier
> > NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> > migrated (promoted) to a higher tier NUMA node to improve the
> > performance.
> >
> > A tiering relationship between NUMA nodes in the form of demotion path
> > is created during the kernel initialization and updated when a NUMA
> > node is hot-added or hot-removed.  The current implementation puts all
> > nodes with CPU into the top tier, and then builds the tiering hierarchy
> > tier-by-tier by establishing the per-node demotion targets based on
> > the distances between nodes.
> >
> > The current memory tiering interface needs to be improved to address
> > several important use cases:
> >
> > * The current tiering initialization code always initializes
> >   each memory-only NUMA node into a lower tier.  But a memory-only
> >   NUMA node may have a high performance memory device (e.g. a DRAM
> >   device attached via CXL.mem or a DRAM-backed memory-only node on
> >   a virtual machine) and should be put into the top tier.
> >
> > * The current tiering hierarchy always puts CPU nodes into the top
> >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> >   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
> >   with CPUs are better to be placed into the next lower tier.
> >
> > * Also because the current tiering hierarchy always puts CPU nodes
> >   into the top tier, when a CPU is hot-added (or hot-removed) and
> >   triggers a memory node from CPU-less into a CPU node (or vice
> >   versa), the memory tiering hierarchy gets changed, even though no
> >   memory node is added or removed.  This can make the tiering
> >   hierarchy much less stable.
>
> I'd prefer the firmware builds up tiers topology then passes it to
> kernel so that kernel knows what nodes are in what tiers. No matter
> what nodes are hot-removed/hot-added they always stay in their tiers
> defined by the firmware. I think this is important information like
> numa distances. NUMA distance alone can't satisfy all the usecases
> IMHO.

Just want to note here that the platform firmware can only describe
the tiers of static memory present at boot. CXL hotplug breaks this
model and the kernel is left to dynamically determine the device's
performance characteristics and the performance of the topology to
reach that device. Now, the platform firmware does set expectations
for the perfomance class of different memory ranges, but there is no
way to know in advance the performance of devices that will be asked
to be physically or logically added to the memory configuration. That
said, it's probably still too early to define ABI for those
exceptional cases where the kernel needs to make a policy decision
about a device that does not fit into the firmware's performance
expectations, but just note that there are limits to the description
that platform firmware can provide.

I agree that NUMA distance alone is inadequate and the kernel needs to
make better use of data like ACPI HMAT to determine the default
tiering order.

next prev parent reply	other threads:[~2022-05-01 18:35 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-30  2:10 Wei Xu
2022-04-30  3:59 ` Yang Shi
2022-04-30  6:37   ` Wei Xu
2022-05-06  0:01     ` Alistair Popple
2022-05-10  4:32       ` Wei Xu
2022-05-10  5:37         ` Alistair Popple
2022-05-10 11:38           ` Aneesh Kumar K.V
2022-05-11  5:30             ` Wei Xu
2022-05-11  7:34               ` Alistair Popple
2022-05-11  7:49               ` ying.huang
2022-05-11 17:07                 ` Wei Xu
2022-05-12  1:42                   ` ying.huang
2022-05-12  2:39                     ` Wei Xu
2022-05-12  3:13                       ` ying.huang
2022-05-12  3:37                         ` Wei Xu
2022-05-12  6:24                         ` Wei Xu
2022-05-06 18:56     ` Yang Shi
2022-05-09 14:32       ` Hesham Almatary
2022-05-10  3:24         ` Yang Shi
2022-05-10  9:59           ` Hesham Almatary
2022-05-10 12:10             ` Aneesh Kumar K V
2022-05-11  5:42               ` Wei Xu
2022-05-11  7:12                 ` Alistair Popple
2022-05-11  9:05                   ` Hesham Almatary
2022-05-12  3:02                     ` ying.huang
2022-05-12  4:40                   ` Aneesh Kumar K V
2022-05-12  4:49                     ` Wei Xu
2022-05-10  4:22         ` Wei Xu
2022-05-10 10:01           ` Hesham Almatary
2022-05-10 11:44           ` Aneesh Kumar K.V
2022-05-01 18:35   ` Dan Williams [this message]
2022-05-03  6:36     ` Wei Xu
2022-05-06 19:05     ` Yang Shi
2022-05-07  7:56     ` ying.huang
2022-05-01 17:58 ` Davidlohr Bueso
2022-05-02  1:04   ` David Rientjes
2022-05-02  7:23   ` Aneesh Kumar K.V
2022-05-03  2:07   ` Baolin Wang
2022-05-03  6:06   ` Wei Xu
2022-05-03 17:14   ` Alistair Popple
2022-05-03 17:47     ` Dave Hansen
2022-05-03 22:35       ` Alistair Popple
2022-05-03 23:54         ` Dave Hansen
2022-05-04  1:31           ` Wei Xu
2022-05-04 17:02             ` Dave Hansen
2022-05-05  6:35               ` Wei Xu
2022-05-05 14:24                 ` Dave Hansen
2022-05-10  4:43                   ` Wei Xu
2022-05-02  6:25 ` Aneesh Kumar K.V
2022-05-03  7:02   ` Wei Xu
2022-05-02 15:20 ` Dave Hansen
2022-05-03  7:19   ` Wei Xu
2022-05-03 19:12 ` Tim Chen
2022-05-05  7:02   ` Wei Xu
2022-05-05  8:57 ` ying.huang
2022-05-05 23:57 ` Alistair Popple
2022-05-06  0:25   ` Alistair Popple

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPcyv4g7kyPsSKGT1rR4yy680VD6UJ8V7wzj0OUqN2y2-PjOpQ@mail.gmail.com \
    --to=dan.j.williams@intel.com \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=apopple@nvidia.com \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=brice.goglin@gmail.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=dave@stgolabs.net \
    --cc=feng.tang@intel.com \
    --cc=gthelen@google.com \
    --cc=jvgediya@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=shy828301@gmail.com \
    --cc=weixugc@google.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox