Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL

From: "Huang, Ying" <ying.huang@intel.com>
To: Dragan Stancevic <dragan@stancevic.com>
Cc: Mike Rapoport <rppt@kernel.org>,
	 Kyungsan Kim <ks0204.kim@samsung.com>,
	dan.j.williams@intel.com,  lsf-pc@lists.linux-foundation.org,
	linux-mm@kvack.org,  linux-fsdevel@vger.kernel.org,
	linux-cxl@vger.kernel.org,  a.manzanares@samsung.com,
	viacheslav.dubeyko@bytedance.com,  nil-migration@lists.linux.dev
Subject: Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
Date: Fri, 07 Apr 2023 08:58:47 +0800
Message-ID: <87a5zky0c8.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <a81875d6-10d4-6e94-4c21-18dad9f1640e@stancevic.com> (Dragan Stancevic's message of "Thu, 6 Apr 2023 17:27:22 -0500")

Dragan Stancevic <dragan@stancevic.com> writes:

> Hi Ying-
>
> On 4/4/23 01:47, Huang, Ying wrote:
>> Dragan Stancevic <dragan@stancevic.com> writes:
>> 
>>> Hi Mike,
>>>
>>> On 4/3/23 03:44, Mike Rapoport wrote:
>>>> Hi Dragan,
>>>> On Thu, Mar 30, 2023 at 05:03:24PM -0500, Dragan Stancevic wrote:
>>>>> On 3/26/23 02:21, Mike Rapoport wrote:
>>>>>> Hi,
>>>>>>
>>>>>> [..]
>>>>>>> One problem we experienced occurred in the combination of
>>>>>>> hot-remove and kernelspace allocation usecases.
>>>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because the kernel resides there all the time.
>>>>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding the GFP_MOVABLE flag.
>>>>>>> In that case, oopses and system hangs occasionally occurred because ZONE_MOVABLE can be swapped.
>>>>>>> We resolved the issue using ZONE_EXMEM by allowing a selective choice between the two usecases.
>>>>>>> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe-based device, which allows hot-pluggability, different RAS, and extended connectivity.
>>>>>>> So, we thought it could be a graceful approach to add a new zone and manage the new features separately.
>>>>>>
>>>>>> This still does not describe what are the use cases that require having
>>>>>> kernel allocations on CXL.mem.
>>>>>>
>>>>>> I believe it's important to start with explanation *why* it is important to
>>>>>> have kernel allocations on removable devices.
>>>>>
>>>>> Hi Mike,
>>>>>
>>>>> not speaking for Kyungsan here, but I am starting to tackle hypervisor
>>>>> clustering and VM migration over cxl.mem [1].
>>>>>
>>>>> And in my mind, at least one reason that I can think of having kernel
>>>>> allocations from cxl.mem devices is where you have multiple VH connections
>>>>> sharing the memory [2]. Where for example you have a user space application
>>>>> stored in cxl.mem, and then you want the metadata about this
>>>>> process/application that the kernel keeps on one hypervisor be "passed on"
>>>>> to another hypervisor. So basically the same way processors in a single
>>>>> hypervisor cooperate on memory, you extend that across processors that span
>>>>> over physical hypervisors. If that makes sense...
>>>> Let me reiterate to make sure I understand your example.
>>>> If we focus on VM usecase, your suggestion is to store VM's memory and
>>>> associated KVM structures on a CXL.mem device shared by several nodes.
>>>
>>> Yes correct. That is what I am exploring, two different approaches:
>>>
>>> Approach 1: Use CXL.mem for VM migration between hypervisors. In this
>>> approach the VM and the metadata executes/resides on a traditional
>>> NUMA node (cpu+dram) and only uses CXL.mem to transition between
>>> hypervisors. It's not kept permanently there. So basically on
>>> hypervisor A you would do something along the lines of migrate_pages
>>> into cxl.mem and then on hypervisor B you would migrate_pages from
>>> cxl.mem and onto the regular NUMA node (cpu+dram).
>>>
>>> Approach 2: Use CXL.mem to cluster hypervisors to improve high
>>> availability of VMs. In this approach the VM and metadata would be
>>> kept in CXL.mem permanently and each hypervisor accessing this shared
>>> memory could have the potential to schedule/run the VM if the other
>>> hypervisor experienced a failure.
>>>
>>>> Even putting aside the aspect of keeping KVM structures on presumably
>>>> slower memory,
>>>
>>> Totally agree, presumption of memory speed duly noted. As far as I am
>>> aware, CXL.mem at this point has higher latency than DRAM, and
>>> switched CXL.mem has an additional latency. That may or may not change
>>> in the future, but even with actual CXL induced latency I think there
>>> are benefits to the approaches.
>>>
>>> In the example #1 above, I think even if you had a very noisy VM that
>>> is dirtying pages at a high rate, once migrate_pages has occurred, it
>>> wouldn't have to be quiesced for the migration to happen. A migration
>>> could basically occur in-between the CPU slices, once a VCPU is done
>>> with its slice on hypervisor A, the next slice could be on hypervisor
>>> B.
>>>
>>> And the example #2 above, you are trading memory speed for
>>> high-availability. Where either hypervisor A or B could run the CPU
>>> load of the VM. You could even have a VM where some of the VCPUs are
>>> executing on hypervisor A and others on hypervisor B to be able to
>>> shift CPU load across hypervisors in quasi real-time.
>>>
>>>
>>>> what ZONE_EXMEM will provide that cannot be accomplished
>>>> with having the cxl memory in a memoryless node and using that node to
>>>> allocate VM metadata?
>>>
>>> It has crossed my mind to perhaps use NUMA node distance for the two
>>> approaches above. But I think that is not sufficient because we can
>>> have varying distance, and distance in itself doesn't indicate
>>> switched/shared CXL.mem or non-switched/non-shared CXL.mem. Strictly
>>> speaking just for myself here, with the two approaches above, the
>>> crucial differentiator in order for #1 and #2 to work would be that
>>> switched/shared CXL.mem would have to be indicated as such in a way.
>>> Because switched memory would have to be treated and formatted in some
>>> kind of ABI way that would allow hypervisors to cooperate and follow
>>> certain protocols when using this memory.
>>>
>>>
>>> I can't answer what ZONE_EXMEM will provide since we haven't seen
>>> Kyungsan's talk yet, that's why I myself was very curious to find out
>>> more about ZONE_EXMEM proposal and if it includes some provisions for
>>> CXL switched/shared memory.
>>>
>>> To me, I don't think it makes a difference if pages are coming from
>>> ZONE_NORMAL, or ZONE_EXMEM but the part that I was curious about was
>>> if I could allocate from or migrate_pages to (ZONE_EXMEM | type
>>> "SWITCHED/SHARED"). So it's not the zone that is crucial for me,  it's
>>> the typing. That's what I meant with my initial response but I guess
>>> it wasn't clear enough, "_if_ ZONE_EXMEM had some typing mechanism, in
>>> my case, this is where you'd have kernel allocations on CXL.mem"
>>>
>> We have 2 choices here.
>> a) Put CXL.mem in a separate NUMA node, with an existing ZONE type
>> (normal or movable).  Then you can migrate pages there with
>> move_pages(2) or migrate_pages(2).  Or you can run your workload on the
>> CXL.mem with numactl.
>> b) Put CXL.mem in an existing NUMA node, with a new ZONE type.  To
>> control your workloads in user space, you need a set of new ABIs.
>> Anything you cannot do in a)?
>
> I like the CXL.mem as a NUMA node approach, and also think it's best
> to do this with move/migrate_pages and numactl and those a & b are
> good choices.
>
> I think there is an option c too though, which is an amalgamation of a
> & b. Here is my thinking, and please do let me know what you think
> about this approach.
>
> If you think about CXL 3.0 shared/switched memory as a portal for a VM
> to move from one hypervisor to another, I think each switched memory 
> should be represented by its own node and have a distinct type so the
> migration path becomes more deterministic. I was thinking along the 
> lines that there would be some kind of user space clustering/migration
> app/script that runs on all the hypervisors. Which would read, let's
> say /proc/pagetypeinfo to find these "portals":
> Node 4, zone Normal, type Switched ....
> Node 6, zone Normal, type Switched ....
>
> Then it would build a traversal Graph, find per hypervisor reach and
> critical connections, where critical connections are cross-rack or 
> cross-pod, perhaps something along the lines of this pseudo/python code:
> class Graph:
> 	def __init__(self, mydict):
> 		self.dict = mydict
> 		self.visited = set()
> 		self.critical = list()
> 		self.reach = dict()
> 		self.id = 0
> 	def depth_first_search(self, vertex, parent):
> 		self.visited.add(vertex)
> 		if vertex not in self.reach:
> 			self.reach[vertex] = {'id':self.id, 'reach':self.id}
> 			self.id += 1
> 		for next_vertex in self.dict[vertex] - {parent}:
> 			if next_vertex not in self.visited:
> 				self.depth_first_search(next_vertex, vertex)
> 			if self.reach[next_vertex]['reach'] < self.reach[vertex]['reach']:
> 				self.reach[vertex]['reach'] = self.reach[next_vertex]['reach']
> 		if parent is not None and self.reach[vertex]['id'] == self.reach[vertex]['reach']:
> 			self.critical.append([parent, vertex])
> 		return self.critical
>
> # mydict maps each hostname to the set of hostnames it is connected to
> mygraph = Graph(mydict)
> critical = mygraph.depth_first_search("hostname-foo4", None)
>
> that way you could have a VM migrate between only two hypervisors
> sharing switched memory, or pass through a subset of hypervisors (that 
> don't necessarily share switched memory) to reach its
> destination. This may be rack confined, or across a rack or even a pod
> using critical connections.
>
> Long way of saying that if you do a) then the clustering/migration
> script only sees a bunch of nodes and a bunch of normal zones, so it
> wouldn't know how to build the "flight-path" and where to send a
> VM. You'd probably have to add an additional interface in the kernel
> for the script to query the paths somehow, where on the other hand
> pulling things from proc/sys is easy.
>
>
> And then if you do b) and put it in an existing NUMA node with a
> "Switched" type, you could potentially end up with several "Switched" 
> types under the same node. So when you numactl/move/migrate pages they
> could go in either direction and you could send some pages through one 
> "portal" and others through another "portal", which is not what you
> want to do.
>
> That's why I think the c option might be the most optimal, where each
> switched memory has its own node number. And then displaying type as
> "Switched" just makes it easier to detect and Graph the topology.
>
>
> And with regards to an ABI, I was referring to an ABI needed between
> the kernels running on separate hypervisors. When hypervisor B boots,
> it needs to detect through an ABI if this switched/shared memory is
> already initialized and if there are VMs in there which are used by
> another hypervisor, say A. Also during the migration, hypervisors A
> and B would have to use this ABI to synchronize the hand-off between
> the two physical hosts. Not an all-inclusive list, but I was referring
> to those types of scenarios.
>
> What do you think?

It seems unnecessary to add a new zone type to mark a node with some
attribute.  For example, in the following patch, a per-node attribute
can be added and shown in sysfs.

https://lore.kernel.org/linux-mm/20220704135833.1496303-10-martin.fernandez@eclypsium.com/
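
As a rough illustration of how a user space clustering script could consume
such a per-node attribute, here is a minimal sketch. It assumes a
hypothetical attribute file named "switched" under each
/sys/devices/system/node/nodeN directory; that name, the "1" value, and the
helper nodes_with_attribute() are made up for the example and are not
taken from the patch above.

```python
import os
import re


def nodes_with_attribute(attr, expected="1",
                         sysfs_root="/sys/devices/system/node"):
    """Return the sorted NUMA node ids whose per-node file `attr` reads `expected`.

    `attr` is a hypothetical per-node sysfs attribute name (an assumption
    for this sketch, not an existing kernel interface).
    """
    found = []
    if not os.path.isdir(sysfs_root):
        return found
    for entry in os.listdir(sysfs_root):
        # Only consider the nodeN directories, skip files like "online".
        match = re.fullmatch(r"node(\d+)", entry)
        if not match:
            continue
        path = os.path.join(sysfs_root, entry, attr)
        try:
            with open(path) as f:
                value = f.read().strip()
        except OSError:
            # Attribute not present on this node (or not supported at all).
            continue
        if value == expected:
            found.append(int(match.group(1)))
    return sorted(found)


if __name__ == "__main__":
    # "switched" is a placeholder attribute name for illustration only.
    print(nodes_with_attribute("switched"))
```

With something like this the script can still discover the "portal" nodes
and build its graph from sysfs, without a new zone type.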

Best Regards,
Huang, Ying
