From: Kyungsan Kim <ks0204.kim@samsung.com>
To: dragan@stancevic.com
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
linux-fsdevel@vger.kernel.org, linux-cxl@vger.kernel.org,
a.manzanares@samsung.com, viacheslav.dubeyko@bytedance.com,
dan.j.williams@intel.com, seungjun.ha@samsung.com,
wj28.lee@samsung.com
Subject: RE: Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
Date: Wed, 5 Apr 2023 19:18:39 +0900
Message-ID: <20230405101839.415029-1-ks0204.kim@samsung.com>
In-Reply-To: <81baa7f2-6c95-5225-a675-71d1290032f0@stancevic.com>
>Hi Mike,
>
>On 4/3/23 03:44, Mike Rapoport wrote:
>> Hi Dragan,
>>
>> On Thu, Mar 30, 2023 at 05:03:24PM -0500, Dragan Stancevic wrote:
>>> On 3/26/23 02:21, Mike Rapoport wrote:
>>>> Hi,
>>>>
>>>> [..]
>>>>> One problem we experienced occurred in the combination of hot-remove and kernelspace allocation usecases.
>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel allocations stay resident all the time.
>>>>> ZONE_MOVABLE allows hot-remove thanks to page migration, but it only allows userspace allocation.
>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding the GFP_MOVABLE flag.
>>>>> In that case, oopses and system hangs occasionally occurred because ZONE_MOVABLE pages can be swapped.
>>>>> We resolved the issue using ZONE_EXMEM by allowing a selective choice between the two usecases.
>>>>> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe-based device, which allows hot-pluggability, different RAS, and extended connectivity.
>>>>> So, we thought it could be a graceful approach to add a new zone and manage the new features separately.
>>>>
>>>> This still does not describe what are the use cases that require having
>>>> kernel allocations on CXL.mem.
>>>>
>>>> I believe it's important to start with explanation *why* it is important to
>>>> have kernel allocations on removable devices.
>>>
>>> Hi Mike,
>>>
>>> not speaking for Kyungsan here, but I am starting to tackle hypervisor
>>> clustering and VM migration over cxl.mem [1].
>>>
>>> And in my mind, at least one reason that I can think of having kernel
>>> allocations from cxl.mem devices is where you have multiple VH connections
>>> sharing the memory [2]. Where for example you have a user space application
>>> stored in cxl.mem, and then you want the metadata about this
>>> process/application that the kernel keeps on one hypervisor be "passed on"
>>> to another hypervisor. So basically the same way processors in a single
>>> hypervisors cooperate on memory, you extend that across processors that span
>>> over physical hypervisors. If that makes sense...
>>
>> Let me reiterate to make sure I understand your example.
>> If we focus on VM usecase, your suggestion is to store VM's memory and
>> associated KVM structures on a CXL.mem device shared by several nodes.
>
>Yes correct. That is what I am exploring, two different approaches:
>
>Approach 1: Use CXL.mem for VM migration between hypervisors. In this
>approach the VM and the metadata executes/resides on a traditional NUMA
>node (cpu+dram) and only uses CXL.mem to transition between hypervisors.
>It's not kept permanently there. So basically on hypervisor A you would
>do something along the lines of migrate_pages into cxl.mem and then on
>hypervisor B you would migrate_pages from cxl.mem and onto the regular
>NUMA node (cpu+dram).
>
>Approach 2: Use CXL.mem to cluster hypervisors to improve high
>availability of VMs. In this approach the VM and metadata would be kept
>in CXL.mem permanently and each hypervisor accessing this shared memory
>could have the potential to schedule/run the VM if the other hypervisor
>experienced a failure.
>
>> Even putting aside the aspect of keeping KVM structures on presumably
>> slower memory,
>
>Totally agree, presumption of memory speed duly noted. As far as I am
>aware, CXL.mem at this point has higher latency than DRAM, and switched
>CXL.mem has an additional latency. That may or may not change in the
>future, but even with actual CXL induced latency I think there are
>benefits to the approaches.
>
>In the example #1 above, I think even if you had a very noisy VM that is
>dirtying pages at a high rate, once migrate_pages has occurred, it
>wouldn't have to be quiesced for the migration to happen. A migration
>could basically occur in-between the CPU slices, once VCPU is done with
>its slice on hypervisor A, the next slice could be on hypervisor B.
>
>And the example #2 above, you are trading memory speed for
>high-availability. Where either hypervisor A or B could run the CPU load
>of the VM. You could even have a VM where some of the VCPUs are
>executing on hypervisor A and others on hypervisor B to be able to shift
>CPU load across hypervisors in quasi real-time.
>
>
>> what ZONE_EXMEM will provide that cannot be accomplished
>> with having the cxl memory in a memoryless node and using that node to
>> allocate VM metadata?
>
>It has crossed my mind to perhaps use NUMA node distance for the two
>approaches above. But I think that is not sufficient because we can have
>varying distance, and distance in itself doesn't indicate
>switched/shared CXL.mem or non-switched/non-shared CXL.mem. Strictly
>speaking just for myself here, with the two approaches above, the
>crucial differentiator in order for #1 and #2 to work would be that
>switched/shared CXL.mem would have to be indicated as such in a way.
>Because switched memory would have to be treated and formatted in some
>kind of ABI way that would allow hypervisors to cooperate and follow
>certain protocols when using this memory.
>
>
>I can't answer what ZONE_EXMEM will provide since we haven't seen
>Kyungsan's talk yet, that's why I myself was very curious to find out
>more about ZONE_EXMEM proposal and if it includes some provisions for
>CXL switched/shared memory.
>
>To me, I don't think it makes a difference if pages are coming from
>ZONE_NORMAL, or ZONE_EXMEM but the part that I was curious about was if
>I could allocate from or migrate_pages to (ZONE_EXMEM | type
>"SWITCHED/SHARED"). So it's not the zone that is crucial for me, it's
>the typing. That's what I meant with my initial response but I guess it
>wasn't clear enough, "_if_ ZONE_EXMEM had some typing mechanism, in my
>case, this is where you'd have kernel allocations on CXL.mem"

Hi Dragan, sorry for the late reply; we wanted to answer carefully.

ZONE_EXMEM can be movable. The calling context determines movability (movable/unmovable).
I'm not sure whether it relates to the provision you have in mind, but ZONE_EXMEM allows capacity and bandwidth aggregation across multiple CXL DRAM channels.
Multiple CXL DRAM devices can be grouped into one ZONE_EXMEM, which can then be exposed as a single memory node[1].
As the number of CXL DRAM channels grows through (multi-level) switches and enhanced CXL server systems, we thought the kernel should manage them seamlessly.
Otherwise, userspace would see many nodes, and a third-party tool such as numactl or libnuma would always be needed.
Of course, a CXL switch can do this part, but HW and SW means each have pros and cons in many ways, so we thought the two could coexist.
Also, given the composability expected of CXL, I think memory sharing among VM/KVM instances fits CXL well.
This is just a gut feeling for now, but security and permission matters could possibly be handled in the zone dimension.
In general, given the nature of CXL (PCIe-based) and its topology expansion (direct->switches->fabrics),
we carefully guess that more functionality and performance issues will be raised.
We have proposed ZONE_EXMEM as a separate logical management dimension for extended memory types, which as of now means CXL DRAM.
To help clarify, please see the slide that explains our proposal[2].

[1] https://github.com/OpenMPDK/SMDK/wiki/2.-SMDK-Architecture#memory-partition
[2] https://github.com/OpenMPDK/SMDK/wiki/93.-%5BLSF-MM-BPF-TOPIC%5D-SMDK-inspired-MM-changes-for-CXL
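
To make the zone trade-off discussed in this thread concrete, here is a minimal toy model in Python. It is only an illustration under stated assumptions and is not kernel code or the SMDK implementation: the Zone class, make_zones(), and the per-channel capacities are hypothetical names and numbers chosen for the sketch. It captures three points from the discussion: ZONE_MOVABLE admits only movable allocations, ZONE_EXMEM lets the calling context pick movability, and several CXL DRAM channels can be summed into one zone's capacity. Hot-remove is modeled very simply as "every live allocation is movable".

```python
from dataclasses import dataclass, field
from enum import Enum


class Movability(Enum):
    MOVABLE = "movable"
    UNMOVABLE = "unmovable"


@dataclass
class Zone:
    """Toy stand-in for a memory zone; not a kernel data structure."""
    name: str
    accepts: frozenset                                # movability kinds the zone admits
    channels_gib: list = field(default_factory=list)  # hypothetical capacity per device
    live: list = field(default_factory=list)          # movability of live allocations

    def capacity_gib(self) -> int:
        # Aggregate capacity across all backing channels.
        return sum(self.channels_gib)

    def allocate(self, movability: Movability) -> bool:
        # Reject requests whose movability the zone does not admit.
        if movability not in self.accepts:
            return False
        self.live.append(movability)
        return True

    def can_hot_remove(self) -> bool:
        # Simplification: removable only if every live allocation can migrate.
        return all(m is Movability.MOVABLE for m in self.live)


def make_zones() -> dict:
    both = frozenset(Movability)
    return {
        "ZONE_NORMAL": Zone("ZONE_NORMAL", both),
        "ZONE_MOVABLE": Zone("ZONE_MOVABLE", frozenset({Movability.MOVABLE})),
        # Three hypothetical CXL DRAM channels grouped into one zone.
        "ZONE_EXMEM": Zone("ZONE_EXMEM", both, channels_gib=[128, 128, 256]),
    }


if __name__ == "__main__":
    zones = make_zones()
    exmem = zones["ZONE_EXMEM"]
    print("aggregated GiB:", exmem.capacity_gib())
    print("unmovable into ZONE_MOVABLE:",
          zones["ZONE_MOVABLE"].allocate(Movability.UNMOVABLE))
    exmem.allocate(Movability.MOVABLE)
    print("hot-remove with movable only:", exmem.can_hot_remove())
    exmem.allocate(Movability.UNMOVABLE)
    print("hot-remove after unmovable alloc:", exmem.can_hot_remove())
```

In this sketch the caller's choice of movability decides whether the zone stays hot-removable, which is the selective behavior described above; a real implementation would also have to deal with migration, reclaim, and device topology.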
>
>
>Sorry if it got long, hope that makes sense... :)
>
>
>>
>>> [1] A high-level explanation is at http://nil-migration.org/
>>> [2] Compute Express Link Specification r3.0, v1.0 8/1/22, Page 51, figure
>>> 1-4, black color scheme circle(3) and bars.
>>>
>