Date: Fri, 7 Apr 2023 09:35:59 -0500
From: Dragan Stancevic
Subject: Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
To: "Huang, Ying"
Cc: Mike Rapoport, Kyungsan Kim, dan.j.williams@intel.com,
 lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, linux-cxl@vger.kernel.org,
 a.manzanares@samsung.com, viacheslav.dubeyko@bytedance.com,
 nil-migration@lists.linux.dev
In-Reply-To: <87a5zky0c8.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <641b7b2117d02_1b98bb294cb@dwillia2-xfh.jf.intel.com.notmuch>
 <20230323105105.145783-1-ks0204.kim@samsung.com>
 <362a9e19-fea5-e45a-3c22-3aa47e851aea@stancevic.com>
 <81baa7f2-6c95-5225-a675-71d1290032f0@stancevic.com>
 <87sfdgywha.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87a5zky0c8.fsf@yhuang6-desk2.ccr.corp.intel.com>

Hi Ying-

On 4/6/23 19:58, Huang, Ying wrote:
> Dragan Stancevic writes:
>
>> Hi Ying-
>>
>> On 4/4/23 01:47, Huang, Ying wrote:
>>> Dragan Stancevic writes:
>>>
>>>> Hi Mike,
>>>>
>>>> On 4/3/23 03:44, Mike Rapoport wrote:
>>>>> Hi Dragan,
>>>>> On Thu, Mar 30, 2023 at 05:03:24PM -0500, Dragan Stancevic wrote:
>>>>>> On 3/26/23 02:21, Mike Rapoport wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> [..]
>>>>>>>> One problem we experienced occurred in the combination of the
>>>>>>>> hot-remove and kernelspace allocation use cases.
>>>>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow
>>>>>>>> hot-remove because the kernel resides there all the time.
>>>>>>>> ZONE_MOVABLE allows hot-remove due to page migration, but it only
>>>>>>>> allows userspace allocation.
>>>>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by
>>>>>>>> adding the GFP_MOVABLE flag.
>>>>>>>> In that case, oopses and system hangs occasionally occurred because
>>>>>>>> ZONE_MOVABLE can be swapped.
>>>>>>>> We resolved the issue using ZONE_EXMEM by allowing a selective choice
>>>>>>>> between the two use cases.
>>>>>>>> As you well know, among heterogeneous DRAM devices, CXL DRAM is the
>>>>>>>> first PCIe-based device, which allows hot-pluggability, different RAS,
>>>>>>>> and extended connectivity.
>>>>>>>> So we thought it could be a graceful approach to add a new zone and
>>>>>>>> manage the new features separately.
>>>>>>>
>>>>>>> This still does not describe what the use cases are that require having
>>>>>>> kernel allocations on CXL.mem.
>>>>>>>
>>>>>>> I believe it's important to start with an explanation of *why* it is
>>>>>>> important to have kernel allocations on removable devices.
>>>>>>
>>>>>> Hi Mike,
>>>>>>
>>>>>> not speaking for Kyungsan here, but I am starting to tackle hypervisor
>>>>>> clustering and VM migration over cxl.mem [1].
>>>>>>
>>>>>> And in my mind, at least one reason that I can think of for having kernel
>>>>>> allocations from cxl.mem devices is where you have multiple VH connections
>>>>>> sharing the memory [2]. Where for example you have a user space application
>>>>>> stored in cxl.mem, and then you want the metadata about this
>>>>>> process/application that the kernel keeps on one hypervisor to be "passed on"
>>>>>> to another hypervisor. So basically the same way processors in a single
>>>>>> hypervisor cooperate on memory, you extend that across processors that span
>>>>>> over physical hypervisors. If that makes sense...
>>>>>
>>>>> Let me reiterate to make sure I understand your example.
>>>>> If we focus on the VM use case, your suggestion is to store the VM's memory and
>>>>> associated KVM structures on a CXL.mem device shared by several nodes.
>>>>
>>>> Yes, correct. That is what I am exploring, two different approaches:
>>>>
>>>> Approach 1: Use CXL.mem for VM migration between hypervisors. In this
>>>> approach the VM and the metadata execute/reside on a traditional
>>>> NUMA node (cpu+dram) and only use CXL.mem to transition between
>>>> hypervisors. It's not kept there permanently. So basically on
>>>> hypervisor A you would do something along the lines of migrate_pages
>>>> into cxl.mem, and then on hypervisor B you would migrate_pages from
>>>> cxl.mem onto the regular NUMA node (cpu+dram).
>>>>
>>>> Approach 2: Use CXL.mem to cluster hypervisors to improve high
>>>> availability of VMs. In this approach the VM and metadata would be
>>>> kept in CXL.mem permanently, and each hypervisor accessing this shared
>>>> memory could have the potential to schedule/run the VM if the other
>>>> hypervisor experienced a failure.
>>>>
>>>>> Even putting aside the aspect of keeping KVM structures on presumably
>>>>> slower memory,
>>>>
>>>> Totally agree, presumption of memory speed duly noted. As far as I am
>>>> aware, CXL.mem at this point has higher latency than DRAM, and
>>>> switched CXL.mem has additional latency. That may or may not change
>>>> in the future, but even with the actual CXL-induced latency I think there
>>>> are benefits to the approaches.
>>>>
>>>> In example #1 above, I think even if you had a very noisy VM that
>>>> is dirtying pages at a high rate, once migrate_pages has occurred, it
>>>> wouldn't have to be quiesced for the migration to happen. A migration
>>>> could basically occur in between the CPU slices: once a VCPU is done
>>>> with its slice on hypervisor A, the next slice could be on hypervisor B.
>>>>
>>>> And in example #2 above, you are trading memory speed for
>>>> high availability, where either hypervisor A or B could run the CPU
>>>> load of the VM. You could even have a VM where some of the VCPUs are
>>>> executing on hypervisor A and others on hypervisor B, to be able to
>>>> shift CPU load across hypervisors in quasi real time.
>>>>
>>>>> what ZONE_EXMEM will provide that cannot be accomplished
>>>>> with having the cxl memory in a memoryless node and using that node to
>>>>> allocate VM metadata?
>>>>
>>>> It has crossed my mind to perhaps use NUMA node distance for the two
>>>> approaches above. But I think that is not sufficient, because we can
>>>> have varying distance, and distance in itself doesn't indicate
>>>> switched/shared CXL.mem or non-switched/non-shared CXL.mem.
>>>> Strictly speaking just for myself here, with the two approaches above, the
>>>> crucial differentiator in order for #1 and #2 to work would be that
>>>> switched/shared CXL.mem would have to be indicated as such in some way.
>>>> Because switched memory would have to be treated and formatted in some
>>>> kind of ABI way that would allow hypervisors to cooperate and follow
>>>> certain protocols when using this memory.
>>>>
>>>> I can't answer what ZONE_EXMEM will provide since we haven't seen
>>>> Kyungsan's talk yet; that's why I myself was very curious to find out
>>>> more about the ZONE_EXMEM proposal and whether it includes some provisions
>>>> for CXL switched/shared memory.
>>>>
>>>> To me, I don't think it makes a difference if pages are coming from
>>>> ZONE_NORMAL or ZONE_EXMEM, but the part that I was curious about was
>>>> whether I could allocate from or migrate_pages to (ZONE_EXMEM | type
>>>> "SWITCHED/SHARED"). So it's not the zone that is crucial for me, it's
>>>> the typing. That's what I meant with my initial response, but I guess
>>>> it wasn't clear enough: "_if_ ZONE_EXMEM had some typing mechanism, in
>>>> my case, this is where you'd have kernel allocations on CXL.mem"
>>>
>>> We have 2 choices here.
>>> a) Put CXL.mem in a separate NUMA node, with an existing ZONE type
>>> (normal or movable). Then you can migrate pages there with
>>> move_pages(2) or migrate_pages(2). Or you can run your workload on the
>>> CXL.mem with numactl.
>>> b) Put CXL.mem in an existing NUMA node, with a new ZONE type. To
>>> control your workloads in user space, you need a set of new ABIs.
>>> Anything you cannot do in a)?
>>
>> I like the CXL.mem as a NUMA node approach, and also think it's best
>> to do this with move/migrate_pages and numactl, and those a & b are
>> good choices.
>>
>> I think there is an option c too though, which is an amalgamation of a
>> & b. Here is my thinking, and please do let me know what you think
>> about this approach.
>>
>> If you think about CXL 3.0 shared/switched memory as a portal for a VM
>> to move from one hypervisor to another, I think each switched memory
>> should be represented by its own node and have a distinct type so the
>> migration path becomes more deterministic. I was thinking along the
>> lines that there would be some kind of user space clustering/migration
>> app/script that runs on all the hypervisors, which would read, let's
>> say, /proc/pagetypeinfo to find these "portals":
>> Node 4, zone Normal, type Switched ....
>> Node 6, zone Normal, type Switched ....
>>
>> Then it would build a traversal Graph, find per-hypervisor reach and
>> critical connections, where critical connections are cross-rack or
>> cross-pod, perhaps something along the lines of this pseudo/python code:
>>
>> class Graph:
>>     def __init__(self, mydict):
>>         self.dict = mydict          # adjacency: vertex -> set of neighbors
>>         self.visited = set()
>>         self.critical = list()      # critical (bridge) connections found
>>         self.reach = dict()         # per vertex: discovery 'id' and lowest reachable 'reach'
>>         self.id = 0
>>
>>     def depth_first_search(self, vertex, parent):
>>         self.visited.add(vertex)
>>         if vertex not in self.reach:
>>             self.reach[vertex] = {'id': self.id, 'reach': self.id}
>>             self.id += 1
>>         for next_vertex in self.dict[vertex] - {parent}:
>>             if next_vertex not in self.visited:
>>                 self.depth_first_search(next_vertex, vertex)
>>             if self.reach[next_vertex]['reach'] < self.reach[vertex]['reach']:
>>                 self.reach[vertex]['reach'] = self.reach[next_vertex]['reach']
>>         # if nothing below this vertex reaches back above it, the edge to
>>         # its parent is a critical connection
>>         if parent is not None and self.reach[vertex]['id'] == self.reach[vertex]['reach']:
>>             self.critical.append([parent, vertex])
>>         return self.critical
>>
>> critical = mygraph.depth_first_search("hostname-foo4", None)
>>
>> That way you could have a VM migrate between only two hypervisors
>> sharing switched memory, or pass through a subset of hypervisors (that
>> don't necessarily share switched memory) to reach its destination.
>> This may be rack-confined, or go across a rack or even a pod using
>> critical connections.
>>
>> Long way of saying that if you do a), then the clustering/migration
>> script only sees a bunch of nodes and a bunch of normal zones; it
>> wouldn't know how to build the "flight path" and where to send a
>> VM. You'd probably have to add an additional interface in the kernel
>> for the script to query the paths somehow, whereas on the other hand
>> pulling things from proc/sys is easy.
>>
>> And then if you do b) and put it in an existing NUMA node with a
>> "Switched" type, you could potentially end up with several "Switched"
>> types under the same node. So when you numactl/move/migrate pages they
>> could go in either direction, and you could send some pages through one
>> "portal" and others through another "portal", which is not what you
>> want to do.
>>
>> That's why I think the c option might be the most optimal, where each
>> switched memory has its own node number. And then displaying the type as
>> "Switched" just makes it easier to detect and graph the topology.
>>
>> And with regards to an ABI, I was referring to an ABI needed between
>> the kernels running on separate hypervisors. When hypervisor B boots,
>> it needs to detect through an ABI whether this switched/shared memory is
>> already initialized and whether there are VMs in there which are used by
>> another hypervisor, say A. Also, during the migration, hypervisors A
>> and B would have to use this ABI to synchronize the hand-off between
>> the two physical hosts. Not an all-inclusive list, but I was referring
>> to those types of scenarios.
>>
>> What do you think?
>
> It seems unnecessary to add a new zone type to mark a node with some
> attribute. For example, in the following patch, a per-node attribute
> can be added and shown in sysfs.
>
> https://lore.kernel.org/linux-mm/20220704135833.1496303-10-martin.fernandez@eclypsium.com/

That's a very good suggestion, Ying, thank you, I appreciate it. So perhaps
have switched memory on its own node (option a) and export a sysfs
attribute like "switched". That might also be a good place to export the
hypervisor partners that share the same switched memory, for the script
to build up a connection topology graph.
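
To make that last part a bit more concrete, here is a rough sketch of what
the discovery and hand-off side of such a clustering script could look like.
This is only a sketch under assumptions: the per-node "switched" sysfs
attribute is hypothetical (the patch above just shows the general mechanism
for exporting a per-node attribute), and the actual page movement is simply
delegated to migratepages(8) from the numactl tools:

#!/usr/bin/env python3
# Sketch only: the per-node "switched" attribute is hypothetical, modeled on
# the per-node sysfs attribute approach from the patch referenced above.
import glob
import os
import re
import subprocess

def find_switched_nodes():
    """Return NUMA node ids that expose a (hypothetical) 'switched' attribute."""
    nodes = []
    for path in glob.glob("/sys/devices/system/node/node[0-9]*"):
        attr = os.path.join(path, "switched")   # hypothetical attribute name
        if os.path.isfile(attr):
            with open(attr) as f:
                if f.read().strip() == "1":
                    nodes.append(int(re.search(r"node(\d+)$", path).group(1)))
    return sorted(nodes)

def migrate_vm(pid, from_node, to_node):
    """Push a VM's pages into (or pull them out of) a switched 'portal' node
    using migratepages(8) from the numactl tools."""
    subprocess.run(["migratepages", str(pid), str(from_node), str(to_node)],
                   check=True)

if __name__ == "__main__":
    portals = find_switched_nodes()
    print("switched CXL.mem nodes:", portals)
    # e.g. on hypervisor A: migrate_vm(vm_pid, 0, portals[0])
    #      on hypervisor B: migrate_vm(vm_pid, portals[0], 1)

If the sysfs attribute isn't available, the same discovery step could fall
back to parsing /proc/pagetypeinfo for the "Switched" type, as in the
example output earlier in the thread.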

--
Peace can only come as a natural consequence of universal enlightenment
-Dr. Nikola Tesla