Re: [External] Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers


From: Hao Xiang <hao.xiang@bytedance.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: "aneesh.kumar@linux.ibm.com" <aneesh.kumar@linux.ibm.com>,
	 Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	Gregory Price <gregory.price@memverge.com>,
	 Srinivasulu Thanneeru <sthanneeru@micron.com>,
	Srinivasulu Opensrc <sthanneeru.opensrc@micron.com>,
	 "linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	 "dan.j.williams@intel.com" <dan.j.williams@intel.com>,
	"mhocko@suse.com" <mhocko@suse.com>,
	 "tj@kernel.org" <tj@kernel.org>,
	"john@jagalactic.com" <john@jagalactic.com>,
	 Eishan Mirakhur <emirakhur@micron.com>,
	Vinicius Tavares Petrucci <vtavarespetr@micron.com>,
	 Ravis OpenSrc <Ravis.OpenSrc@micron.com>,
	 "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	 Wei Xu <weixugc@google.com>,
	"Ho-Ren (Jack) Chuang" <horenchuang@bytedance.com>
Subject: Re: [External] Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers
Date: Fri, 12 Jan 2024 00:14:04 -0800
Message-ID: <CAAYibXh5DWcAJrqXi-V1v61DY_Xeb8BiMGoOxn1fJ_YBc2L8KQ@mail.gmail.com>
In-Reply-To: <87il3z2g03.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Thu, Jan 11, 2024 at 11:02 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hao Xiang <hao.xiang@bytedance.com> writes:
>
> > On Wed, Jan 10, 2024 at 6:18 AM Jonathan Cameron
> > <Jonathan.Cameron@huawei.com> wrote:
> >>
> >> On Tue, 9 Jan 2024 16:28:15 -0800
> >> Hao Xiang <hao.xiang@bytedance.com> wrote:
> >>
> >> > On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <gregory.price@memverge.com> wrote:
> >> > >
> >> > > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
> >> > > > On Tue, 09 Jan 2024 11:41:11 +0800
> >> > > > "Huang, Ying" <ying.huang@intel.com> wrote:
> >> > > > > Gregory Price <gregory.price@memverge.com> writes:
> >> > > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> >> > > > > It's possible for the performance of a NUMA node to change, if we
> >> > > > > hot-remove a memory device, then hot-add another different memory
> >> > > > > device.  It's hoped that the CDAT changes too.
> >> > > >
> >> > > > Not supported, but ACPI has _HMA methods to in theory allow changing
> >> > > > HMAT values based on firmware notifications...  So we 'could' make
> >> > > > it work for HMAT based description.
> >> > > >
> >> > > > Ultimately my current thinking is we'll end up emulating CXL type3
> >> > > > devices (hiding topology complexity) and you can update CDAT but
> >> > > > IIRC that is only meant to be for degraded situations - so if you
> >> > > > want multiple performance regions, CDAT should describe them from the start.
> >> > > >
> >> > >
> >> > > That was my thought.  I don't think it's particularly *realistic* for
> >> > > HMAT/CDAT values to change at runtime, but I can imagine a case where
> >> > > it could be valuable.
> >> > >
> >> > > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
> >> > > > > >
> >> > > > > > This group wants to enable passing CXL memory through to KVM/QEMU
> >> > > > > > (i.e. host CXL expander memory passed through to the guest), and
> >> > > > > > allow the guest to apply memory tiering.
> >> > > > > >
> >> > > > > > There are multiple issues with this, presently:
> >> > > > > >
> >> > > > > > 1. The QEMU CXL virtual device is not and probably never will be
> >> > > > > >    performant enough to be a commodity class virtualization.
> >> > > >
> >> > > > I'd flex that a bit - we will end up with a solution for virtualization but
> >> > > > it isn't the emulation that is there today because it's not possible to
> >> > > > emulate some of the topology in a performant manner (interleaving with sub
> >> > > > page granularity / interleaving at all (to a lesser degree)). There are
> >> > > > ways to do better than we are today, but they start to look like
> >> > > > software disaggregated memory setups (think lots of page faults in the host).
> >> > > >
> >> > >
> >> > > Agreed, the emulated device as-is can't be the virtualization device,
> >> > > but it doesn't mean it can't be the basis for it.
> >> > >
> >> > > My thought is, if you want to pass host CXL *memory* through to the
> >> > > guest, you don't actually care to pass CXL *control* through to the
> >> > > guest.  That control lies pretty squarely with the host/hypervisor.
> >> > >
> >> > > So, at least in theory, you can just cut the type3 device out of the
> >> > > QEMU configuration entirely and just pass it through as a distinct numa
> >> > > node with specific hmat qualities.
> >> > >
> >> > > Barring that, if we must go through the type3 device, the question is
> >> > > how difficult would it be to just make a stripped down type3 device
> >> > > to provide the informational components, but hack off anything
> >> > > topology/interleave related? Then you just do direct passthrough as you
> >> > > described below.
> >> > >
> >> > > qemu/kvm would report errors if you tried to touch the naughty bits.
> >> > >
> >> > > The second question is... is that device "compliant" or does it need
> >> > > super special handling from the kernel driver :D?  If what I described
> >> > > is not "compliant", then it's probably a bad idea, and KVM/QEMU should
> >> > > just hide the CXL device entirely from the guest (for this use case)
> >> > > and just pass the memory through as a numa node.
> >> > >
> >> > > Which gets us back to: The memory-tiering component needs a way to
> >> > > place nodes in different tiers based on HMAT/CDAT/User Whim. All three
> >> > > of those seem like totally valid ways to go about it.
> >> > >
> >> > > > > >
> >> > > > > > 2. When passing memory through as an explicit NUMA node, but not as
> >> > > > > >    part of a CXL memory device, the nodes are lumped together in the
> >> > > > > >    DRAM tier.
> >> > > > > >
> >> > > > > > None of this has to do with firmware.
> >> > > > > >
> >> > > > > > Memory-type is an awful way of denoting membership of a tier, but we
> >> > > > > > have HMAT information that can be passed through via QEMU:
> >> > > > > >
> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node0 \
> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node1 \
> >> > > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> >> > > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> >> > > > > >
> >> > > > > > Not only would it be nice if we could change tier membership based on
> >> > > > > > this data, it's realistically the only way to allow guests to accomplish
> >> > > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
> >> > > >
> >> > > > This I fully agree with.  There will be systems with a bunch of normal DDR with different
> >> > > > access characteristics irrespective of CXL. + likely HMAT solutions will be used
> >> > > > before we get anything more complex in place for CXL.
> >> > > >
> >> > >
> >> > > Had not even considered this, but that's completely accurate as well.
> >> > >
> >> > > And more discretely: What of devices that don't provide HMAT/CDAT? That
> >> > > isn't necessarily a violation of any standard.  There probably could be
> >> > > a release valve for us to still make those devices useful.
> >> > >
> >> > > The concern I have with not implementing a movement mechanism *at all*
> >> > > is that a one-size-fits-all initial-placement heuristic feels gross
> >> > > when we're, at least ideologically, moving toward "software defined memory".
> >> > >
> >> > > Personally I think the movement mechanism is a good idea that gets folks
> >> > > where they're going sooner, and it doesn't hurt anything by existing. We
> >> > > can change the initial placement mechanism too.
> >> >
> >> > I think providing users a way to "FIX" the memory tiering is a backup
> >> > option. Given that DDRs with different access characteristics provide
> >> > the relevant CDAT/HMAT information, the kernel should be able to
> >> > correctly establish memory tiering on boot.
> >>
> >> Include hotplug and I'll be happier!  I know that's messy though.
> >>
> >> > Current memory tiering code has
> >> > 1) memory_tier_init() to iterate through all boot onlined memory
> >> > nodes. All nodes are assumed to be fast tier (adistance
> >> > MEMTIER_ADISTANCE_DRAM is used).
> >> > 2) dev_dax_kmem_probe to iterate through all devdax controlled memory
> >> > nodes. This is the place the kernel reads the memory attributes from
> >> > HMAT and recognizes the memory nodes into the correct tier (devdax
> >> > controlled CXL, pmem, etc).
> >> > If we want DDRs with different memory characteristics to be put into
> >> > the correct tier (as in the guest VM memory tiering case), we probably
> >> > need a third path to iterate the boot onlined memory nodes and also be
> >> > able to read their memory attributes. I don't think we can do that in
> >> > 1) because the ACPI subsystem is not yet initialized.
> >>
> >> Can we move it later in general?  Or drag HMAT parsing earlier?
> >> ACPI table availability is pretty early, it's just that we don't bother
> >> with HMAT because nothing early uses it.
> >> IIRC SRAT parsing occurs way before memory_tier_init() will be called.
> >
> > I tested the call sequence under a debugger earlier. hmat_init() is
> > called after memory_tier_init(). Let me poke around and see what our
> > options are.
>
> This sounds reasonable.
>
> Please keep in mind that we need a way to identify the baseline memory
> type (default_dram_type).  A simple method is to use NUMA nodes with CPU
> attached.  But I remember that Aneesh said that some NUMA nodes without
> CPU will need to be put in default_dram_type too on their systems.  We
> need a way to identify that.

Yes, I am prototyping it the way you described. In
memory_tier_init(), we will set the memory tier only for the NUMA
nodes with CPUs. In hmat_init(), I am trying to call back into mm to
finish the memory tier initialization for the CPU-less NUMA nodes. If a
CPU-less NUMA node can't get an effective adistance from
mt_calc_adistance(), we will fall back to adding that node to
default_dram_type.
The other thing I want to experiment with is calling mt_calc_adistance()
on a memory node with CPUs to see what adistance is returned.
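
To make the two-phase idea concrete, here is a rough user-space
sketch, not kernel code. Everything prefixed with sim_ is made up for
illustration, ADIST_DRAM only stands in for MEMTIER_ADISTANCE_DRAM,
and the latency-ratio scaling in sim_calc_adistance() is an assumption
rather than what mt_calc_adistance() actually computes. The only thing
it is meant to show is the ordering: nodes with CPUs are placed early,
CPU-less nodes are finished later, and a node without usable
performance data falls back to the DRAM default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for MEMTIER_ADISTANCE_DRAM; the value is illustrative only. */
#define ADIST_DRAM   576
#define ADIST_UNSET  (-1)

struct sim_node {
	bool has_cpu;     /* node has CPUs attached */
	bool has_perf;    /* HMAT-style data is available for the node */
	int  latency_ns;  /* access latency, meaningful only if has_perf */
	int  adist;       /* assigned abstract distance, or ADIST_UNSET */
};

/*
 * Hypothetical stand-in for mt_calc_adistance(): scale the DRAM
 * distance by latency relative to a baseline. Returns false when no
 * performance data is known for the node.
 */
static bool sim_calc_adistance(const struct sim_node *n, int base_latency_ns,
			       int *adist)
{
	if (!n->has_perf || base_latency_ns <= 0)
		return false;
	*adist = ADIST_DRAM * n->latency_ns / base_latency_ns;
	return true;
}

/* Phase 1 (memory_tier_init() time): only nodes with CPUs get a tier. */
static void sim_early_init(struct sim_node *nodes, size_t count)
{
	assert(nodes != NULL);
	for (size_t i = 0; i < count; i++)
		nodes[i].adist = nodes[i].has_cpu ? ADIST_DRAM : ADIST_UNSET;
}

/* Phase 2 (hmat_init() time): finish the CPU-less nodes. */
static void sim_late_init(struct sim_node *nodes, size_t count,
			  int base_latency_ns)
{
	assert(nodes != NULL);
	for (size_t i = 0; i < count; i++) {
		int adist;

		if (nodes[i].adist != ADIST_UNSET)
			continue;	/* already placed in phase 1 */
		if (sim_calc_adistance(&nodes[i], base_latency_ns, &adist))
			nodes[i].adist = adist;
		else
			nodes[i].adist = ADIST_DRAM;	/* default_dram_type fallback */
	}
}
```

With the latencies from the QEMU example quoted above (10 for node 0,
20 for node 1), the CPU-less node 1 would end up with twice the DRAM
distance in this toy model, while a CPU-less node with no HMAT data
would stay at the DRAM default.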

>
> --
> Best Regards,
> Huang, Ying
