From: Hao Xiang <hao.xiang@bytedance.com>
Date: Fri, 12 Jan 2024 00:14:04 -0800
Subject: Re: [External] Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers
To: "Huang, Ying"
Cc: "aneesh.kumar@linux.ibm.com", Jonathan Cameron, Gregory Price,
 Srinivasulu Thanneeru, Srinivasulu Opensrc, "linux-cxl@vger.kernel.org",
 "linux-mm@kvack.org", "dan.j.williams@intel.com", "mhocko@suse.com",
 "tj@kernel.org", "john@jagalactic.com", Eishan Mirakhur,
 Vinicius Tavares Petrucci, Ravis OpenSrc, "linux-kernel@vger.kernel.org",
 Johannes Weiner, Wei Xu, "Ho-Ren (Jack) Chuang"
In-Reply-To: <87il3z2g03.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <87fs00njft.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87edezc5l1.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87a5pmddl5.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87wmspbpma.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o7dv897s.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <20240109155049.00003f13@Huawei.com>
 <20240110141821.0000370d@Huawei.com>
 <87il3z2g03.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Thu, Jan 11, 2024 at 11:02 PM Huang, Ying wrote:
>
> Hao Xiang writes:
>
> > On Wed, Jan 10, 2024 at 6:18 AM Jonathan Cameron
> > wrote:
> >>
> >> On Tue, 9 Jan 2024 16:28:15 -0800
> >> Hao Xiang wrote:
> >>
> >> > On Tue, Jan 9, 2024 at 9:59 AM Gregory Price wrote:
> >> > >
> >> > > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
> >> > > > On Tue, 09 Jan 2024 11:41:11 +0800
> >> > > > "Huang, Ying" wrote:
> >> > > > > Gregory Price writes:
> >> > > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> >> > > > > It's possible for the performance of a NUMA node to change, if we
> >> > > > > hot-remove a memory device, then hot-add another different memory
> >> > > > > device. It's hoped that the CDAT changes too.
> >> > > >
> >> > > > Not supported, but ACPI has _HMA methods to in theory allow changing
> >> > > > HMAT values based on firmware notifications... So we 'could' make
> >> > > > it work for HMAT based description.
> >> > > >
> >> > > > Ultimately my current thinking is we'll end up emulating CXL type3
> >> > > > devices (hiding topology complexity) and you can update CDAT but
> >> > > > IIRC that is only meant to be for degraded situations - so if you
> >> > > > want multiple performance regions, CDAT should describe them from the start.
> >> > > >
> >> > >
> >> > > That was my thought. I don't think it's particularly *realistic* for
> >> > > HMAT/CDAT values to change at runtime, but I can imagine a case where
> >> > > it could be valuable.
> >> > >
> >> > > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
> >> > > > > >
> >> > > > > > This group wants to enable passing CXL memory through to KVM/QEMU
> >> > > > > > (i.e. host CXL expander memory passed through to the guest), and
> >> > > > > > allow the guest to apply memory tiering.
> >> > > > > >
> >> > > > > > There are multiple issues with this, presently:
> >> > > > > >
> >> > > > > > 1. The QEMU CXL virtual device is not and probably never will be
> >> > > > > > performant enough to be a commodity class virtualization.
> >> > > >
> >> > > > I'd flex that a bit - we will end up with a solution for virtualization but
> >> > > > it isn't the emulation that is there today because it's not possible to
> >> > > > emulate some of the topology in a performant manner (interleaving with sub
> >> > > > page granularity / interleaving at all (to a lesser degree)). There are
> >> > > > ways to do better than we are today, but they start to look like
> >> > > > software disaggregated memory setups (think lots of page faults in the host).
> >> > > >
> >> > >
> >> > > Agreed, the emulated device as-is can't be the virtualization device,
> >> > > but it doesn't mean it can't be the basis for it.
> >> > >
> >> > > My thought is, if you want to pass host CXL *memory* through to the
> >> > > guest, you don't actually care to pass CXL *control* through to the
> >> > > guest. That control lies pretty squarely with the host/hypervisor.
> >> > >
> >> > > So, at least in theory, you can just cut the type3 device out of the
> >> > > QEMU configuration entirely and just pass it through as a distinct numa
> >> > > node with specific hmat qualities.
> >> > >
> >> > > Barring that, if we must go through the type3 device, the question is
> >> > > how difficult would it be to just make a stripped down type3 device
> >> > > to provide the informational components, but hack off anything
> >> > > topology/interleave related? Then you just do direct passthrough as you
> >> > > described below.
> >> > >
> >> > > qemu/kvm would report errors if you tried to touch the naughty bits.
> >> > >
> >> > > The second question is... is that device "compliant" or does it need
> >> > > super special handling from the kernel driver :D? If what i described
> >> > > is not "compliant", then it's probably a bad idea, and KVM/QEMU should
> >> > > just hide the CXL device entirely from the guest (for this use case)
> >> > > and just pass the memory through as a numa node.
> >> > >
> >> > > Which gets us back to: The memory-tiering component needs a way to
> >> > > place nodes in different tiers based on HMAT/CDAT/User Whim. All three
> >> > > of those seem like totally valid ways to go about it.
> >> > >
> >> > > > > >
> >> > > > > > 2. When passing memory through as an explicit NUMA node, but not as
> >> > > > > > part of a CXL memory device, the nodes are lumped together in the
> >> > > > > > DRAM tier.
> >> > > > > >
> >> > > > > > None of this has to do with firmware.
> >> > > > > >
> >> > > > > > Memory-type is an awful way of denoting membership of a tier, but we
> >> > > > > > have HMAT information that can be passed through via QEMU:
> >> > > > > >
> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node0 \
> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node1 \
> >> > > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> >> > > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> >> > > > > >
> >> > > > > > Not only would it be nice if we could change tier membership based on
> >> > > > > > this data, it's realistically the only way to allow guests to accomplish
> >> > > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
> >> > > >
> >> > > > This I fully agree with. There will be systems with a bunch of normal DDR with different
> >> > > > access characteristics irrespective of CXL. + likely HMAT solutions will be used
> >> > > > before we get anything more complex in place for CXL.
> >> > > >
> >> > >
> >> > > Had not even considered this, but that's completely accurate as well.
> >> > >
> >> > > And more discretely: What of devices that don't provide HMAT/CDAT? That
> >> > > isn't necessarily a violation of any standard. There probably could be
> >> > > a release valve for us to still make those devices useful.
> >> > >
> >> > > The concern I have with not implementing a movement mechanism *at all*
> >> > > is that a one-size-fits-all initial-placement heuristic feels gross
> >> > > when we're, at least ideologically, moving toward "software defined memory".
> >> > >
> >> > > Personally I think the movement mechanism is a good idea that gets folks
> >> > > where they're going sooner, and it doesn't hurt anything by existing. We
> >> > > can change the initial placement mechanism too.
> >> >
> >> > I think providing users a way to "FIX" the memory tiering is a backup
> >> > option. Given that DDRs with different access characteristics provide
> >> > the relevant CDAT/HMAT information, the kernel should be able to
> >> > correctly establish memory tiering on boot.
> >>
> >> Include hotplug and I'll be happier! I know that's messy though.
> >>
> >> > Current memory tiering code has
> >> > 1) memory_tier_init() to iterate through all boot onlined memory
> >> > nodes. All nodes are assumed to be fast tier (adistance
> >> > MEMTIER_ADISTANCE_DRAM is used).
> >> > 2) dev_dax_kmem_probe to iterate through all devdax controlled memory
> >> > nodes. This is where the kernel reads the memory attributes from
> >> > HMAT and places the memory nodes into the correct tier (devdax
> >> > controlled CXL, pmem, etc).
> >> > If we want DDRs with different memory characteristics to be put into
> >> > the correct tier (as in the guest VM memory tiering case), we probably
> >> > need a third path to iterate the boot onlined memory nodes and also be
> >> > able to read their memory attributes. I don't think we can do that in
> >> > 1) because the ACPI subsystem is not yet initialized.
> >>
> >> Can we move it later in general? Or drag HMAT parsing earlier?
> >> ACPI table availability is pretty early, it's just that we don't bother
> >> with HMAT because nothing early uses it.
> >> IIRC SRAT parsing occurs way before memory_tier_init() will be called.
> >
> > I tested the call sequence under a debugger earlier. hmat_init() is
> > called after memory_tier_init(). Let me poke around and see what our
> > options are.
>
> This sounds reasonable.
>
> Please keep in mind that we need a way to identify the baseline memory
> type (default_dram_type). A simple method is to use NUMA nodes with CPU
> attached. But I remember that Aneesh said that some NUMA nodes without
> CPU will need to be put in default_dram_type too on their systems. We
> need a way to identify that.

Yes, I am doing some prototyping the way you described. In
memory_tier_init(), we will just set the memory tier for the NUMA nodes
with CPU. In hmat_init(), I am trying to call back to mm to finish the
memory tier initialization for the CPUless NUMA nodes. If a CPUless
NUMA node can't get an effective adistance from mt_calc_adistance(), we
will fall back to adding that node to default_dram_type. A rough
standalone sketch of this flow is at the end of this mail.

The other thing I want to experiment with is to call mt_calc_adistance()
on a memory node with CPU and see what kind of adistance will be
returned.

> --
> Best Regards,
> Huang, Ying
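
For reference, here is a minimal, standalone C sketch of the two-pass flow
described above, just to make the intended control flow concrete. It is not
kernel code and makes no claim about the real implementation: only
memory_tier_init(), hmat_init(), mt_calc_adistance(), default_dram_type and
MEMTIER_ADISTANCE_DRAM are names taken from this thread, while the struct,
the model_* helpers and the baseline adistance value are made up purely for
the illustration.

/*
 * Standalone model of the proposed two-pass tier assignment.
 * Pass 1 (memory_tier_init()): only nodes with CPUs get the baseline
 * DRAM adistance. Pass 2 (hmat_init() calling back into mm): CPUless
 * nodes get an HMAT-derived adistance, falling back to the baseline
 * (default_dram_type) when no usable attributes exist.
 */
#include <stdbool.h>
#include <stdio.h>

/* Arbitrary baseline for the model; not the real MEMTIER_ADISTANCE_DRAM. */
#define MODEL_ADISTANCE_DRAM 100

struct model_node {
	bool has_cpu;     /* node has CPUs attached */
	bool hmat_valid;  /* HMAT/CDAT attributes are available for the node */
	int hmat_adist;   /* abstract distance derived from HMAT, if valid */
	int adist;        /* assigned abstract distance (0 = not assigned yet) */
};

/* Stand-in for mt_calc_adistance(): fills *adist and returns 0 on success. */
static int model_calc_adistance(const struct model_node *n, int *adist)
{
	if (!n->hmat_valid)
		return -1;   /* no effective adistance for this node */
	*adist = n->hmat_adist;
	return 0;
}

/* Pass 1: boot-time init, only CPU nodes are placed (in the DRAM tier). */
static void model_memory_tier_init(struct model_node *nodes, int nr)
{
	for (int i = 0; i < nr; i++)
		if (nodes[i].has_cpu)
			nodes[i].adist = MODEL_ADISTANCE_DRAM;
}

/* Pass 2: HMAT-time callback, finish the CPUless nodes with a fallback. */
static void model_hmat_finish(struct model_node *nodes, int nr)
{
	for (int i = 0; i < nr; i++) {
		int adist;

		if (nodes[i].has_cpu || nodes[i].adist)
			continue;   /* already placed in pass 1 */
		if (model_calc_adistance(&nodes[i], &adist))
			adist = MODEL_ADISTANCE_DRAM;   /* default_dram_type fallback */
		nodes[i].adist = adist;
	}
}

int main(void)
{
	struct model_node nodes[3] = {
		{ .has_cpu = true },                 /* regular DRAM node */
		{ .hmat_valid = true,
		  .hmat_adist = 2 * MODEL_ADISTANCE_DRAM }, /* slower CPUless node */
		{ 0 },                               /* CPUless node, no HMAT data */
	};

	model_memory_tier_init(nodes, 3);
	model_hmat_finish(nodes, 3);

	for (int i = 0; i < 3; i++)
		printf("node %d: adistance %d\n", i, nodes[i].adist);
	return 0;
}

With this layout, node 0 and node 2 end up at the baseline adistance (same
tier as the CPU nodes) while node 1 ends up with a larger adistance, i.e. a
slower tier, which is the behaviour we want for CPUless NUMA nodes backed by
slower memory in a guest.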