From: Dragan Stancevic <dragan@stancevic.com>
To: Gregory Price <gregory.price@memverge.com>
Cc: lsf-pc@lists.linux-foundation.org, nil-migration@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory
Date: Thu, 13 Apr 2023 22:32:48 -0500
Message-ID: <253e7a73-be3c-44d4-1ca3-d0d060313517@stancevic.com>
In-Reply-To: <ZDS8WH+yViVfsuMi@memverge.com>
Hi Gregory-
On 4/10/23 20:48, Gregory Price wrote:
> On Mon, Apr 10, 2023 at 07:56:01PM -0500, Dragan Stancevic wrote:
>> Hi Gregory-
>>
>> On 4/7/23 19:05, Gregory Price wrote:
>>> 3. This is changing the semantics of migration from a virtual memory
>>> movement to a physical memory movement. Typically you would expect
>>> the RDMA process for live migration to work something like...
>>>
>>> a) migration request arrives
>>> b) source host informs destination host of size requirements
>>> c) destination host allocations memory and passes a Virtual Address
>>> back to source host
>>> d) source host initiates an RDMA from HostA-VA to HostB-VA
>>> e) CPU task is migrated
>>>
>>> Importantly, the allocation of memory by Host B handles the important
>>> step of creating HVA->HPA mappings, and the Extended/Nested Page
>>> Tables can simply be flushed and re-created after the VM is fully
>>> migrated.
>>>
>>> too long, didn't read: live migration is a virtual address operation,
>>> and node-migration is a PHYSICAL address operation; the virtual
>>> addresses remain the same.
>>>
>>> This is problematic, as it's changing the underlying semantics of the
>>> migration operation.
>>
>> Those are all valid points, but what if you don't need to recreate HVA->HPA
>> mappings? If I am understanding the CXL 3.0 spec correctly, neither the
>> virtual addresses nor the physical addresses would have to change, because
>> the fabric "virtualizes" host physical addresses and the translation is done
>> by the G-FAM/GFD, which can translate multi-host HPAs to its internal DPAs.
>> So if two hypervisors see the same device physical address, that might work?
>>
>>
>
> Hm. I hadn't considered the device side translation (decoders), though
> that's obviously a tool in the toolbox. You still have to know how to
> slide ranges of data (which you mention below).
Hmm, do you have any quick thoughts on that?
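In the meantime, here is roughly the userspace view I keep picturing; this
is purely illustrative and not code against any existing driver API. It
assumes the switch-attached shared region shows up on both hypervisors as a
devdax device (the /dev/dax0.0 name and the 1 GiB size are placeholders)
and that the GFD decoders make a given region-relative offset resolve to
the same DPA on both hosts, so the only thing the two sides have to agree
on is offsets within the region:

/* Sketch: both hypervisors open the same switch-attached CXL region,
 * exposed here as a hypothetical devdax device, and map it MAP_SHARED.
 * Offset 0 of the device is assumed to decode to the same DPA on both
 * hosts, so "offset into the region" is the only shared coordinate.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_DEV   "/dev/dax0.0"      /* placeholder shared CXL region */
#define REGION_SIZE  (1ULL << 30)       /* 1 GiB window, for illustration */

int main(void)
{
	int fd = open(REGION_DEV, O_RDWR);
	if (fd < 0) { perror("open"); return 1; }

	/* A VM's pages copied here by host A can be found by host B at
	 * the same region-relative offset, whatever its local VA is. */
	void *shared = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
			    MAP_SHARED, fd, 0);
	if (shared == MAP_FAILED) { perror("mmap"); return 1; }

	printf("shared CXL region mapped at %p (host-local VA)\n", shared);

	munmap(shared, REGION_SIZE);
	close(fd);
	return 0;
}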
>>> The reference in this case is... the page tables. You need to know how
>>> to interpret the data in the CXL memory region on the remote host, and
>>> that's a "relative page table translation" (to coin a phrase? I'm not
>>> sure how to best describe it).
>>
>> Right, coining phrases... I have been thinking of a "super-page" (for lack
>> of a better word): a metadata region sitting on the switched CXL.mem
>> device that would allow hypervisors to synchronize on various aspects, such
>> as "relative page table translation", host is up, host is down, list of
>> peers, who owns what, etc. In a perfect scenario, I would love to see the
>> hypervisors cooperating on a switched CXL.mem device the same way CPUs on
>> different NUMA nodes cooperate on memory in a single hypervisor. If either
>> host can allocate and schedule from this space, then the "NIL" aspect of
>> migration is "free".
>>
>>
>
> The core of the problem is still that each of the hosts has to agree on
> the location (physically) of this region of memory, which could be
> problematic unless you have very strong BIOS and/or kernel driver
> controls to ensure certain devices are guaranteed to be mapped into
> certain spots in the CFMW.
Right, true. The way I am thinking of it, this would be part of the
data-center ops setup, which at first pass would be a somewhat manual
setup, much like other pre-OS configuration. Later down the road this
could perhaps be automated, either through some pre-agreed auto-range
detection or similar; it's not unusual for dc ops to name hypervisors
depending on where in the dc/rack/etc. they sit.
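To make the "super-page" idea from earlier a bit more concrete, below is a
purely hypothetical header layout; none of these structures or names exist
today, it is just a sketch of the kind of metadata I have in mind. A magic
value and region id at offset 0 would also let a host detect that it mapped
the wrong spot in the CFMW, which partially addresses your concern above:

/* Illustrative layout for a "super-page" metadata header at offset 0 of
 * the shared CXL region: hosts use it to (a) sanity-check that they
 * mapped the region they think they mapped and (b) coordinate ownership
 * and liveness during a migration.  All names here are made up.
 */
#include <stdint.h>

#define NIL_SUPER_MAGIC  0x4e494c4dULL   /* "NILM", made-up magic */
#define NIL_MAX_PEERS    16

struct nil_peer {
	uint8_t  uuid[16];       /* hypervisor identity */
	uint64_t heartbeat;      /* monotonic counter: "host is up" */
	uint64_t state;          /* idle / source / destination / down */
};

struct nil_super_page {
	uint64_t magic;          /* NIL_SUPER_MAGIC: sanity-check the mapping */
	uint32_t version;
	uint32_t region_id;      /* which shared window this is */
	uint64_t region_bytes;   /* usable payload after this header */
	uint64_t owner;          /* index into peers[]: current owner */
	uint64_t pt_base_offset; /* where the "relative page table" info lives */
	struct nil_peer peers[NIL_MAX_PEERS];
};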
> After that it's a matter of treating this memory as incoherent shared
> memory and handling ownership in a safe way. If the memory is only used
> for migrations, then you don't have to worry about performance.
>
> So I agree, as long as shared memory mapped into the same CFMW area is
> used, this mechanism is totally sound.
>
> My main concern is that I don't know of a mechanism to ensure that. I
> suppose for those interested, and with special BIOS/EFI, you could do
> that - but I think that's going to be a tall ask in a heterogeneous cloud
> environment.
Yeah, I get that. But in my experience even heterogeneous setups have
some level of homogeneity, whether it's per rack or per pod. As old
things are sunset and new things are brought in, you get these
segments of homogeneity with more or less advanced features. So at the
end of the day, if someone wants feature X they will need to
understand its requirements and limitations. I feel like I deal
with hardware/feature fragmentation all the time, but that doesn't
preclude bringing newer things in. You just have to plant it appropriately.
>>> That's... complicated to say the least.
>>>
>>> <... snip ...>
>>>
>>> An Option: Make pages physically contiguous on migration to CXL
>>>
>>> In this case, you don't necessarily care about the Host Virtual
>>> Addresses, what you actually care about are the structure of the pages
>>> in memory (are they physically contiguous? or do you need to
>>> reconstruct the contiguity by inspecting the page tables?).
>>>
>>> If a migration API were capable of reserving large swaths of contiguous
>>> CXL memory, you could discard individual page information and instead
>>> send page range information, reconstructing the virtual-physical
>>> mappings this way.
>>
>> Yeah, good points, but this is all tricky... it seems this would
>> require quiescing the VM, and that is something I would like to avoid if
>> possible. I'd like to see the VM still executing while all of its pages are
>> migrated onto the CXL NUMA node on the source hypervisor. And I would like
>> to see the VM executing on the destination hypervisor while migrate_pages is
>> moving pages off of CXL. Of course, what you are describing above would
>> still be a very fast VM migration, but it would require quiescing.
>>
>>
>
> Possibly. If you're going to quiesce you're probably better off just
> snapshotting to shared memory and migrating the snapshot.
That is exactly my thought too.
> Maybe that's the better option for a first-pass migration mechanism. I
> don't know.
I definitely see your point: a "canning" and "re-hydration" approach as a
first pass. I'd be happy with even just a "Hello World" page migration
as a first pass :)
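Just to spell out what I mean by a "Hello World" page migration, here is a
small userspace sketch that moves a single page of the calling process to a
target NUMA node via move_pages(2). The CXL_NODE id is a placeholder for
whatever node the CXL memory gets onlined as (e.g. through dax/kmem); in
the real thing the hypervisor/kernel would drive this, not a userspace toy:

/* "Hello World" page migration: move one page of this process to a
 * target NUMA node (assumed here to be backed by CXL memory).
 *
 * Build: gcc -o movepage movepage.c -lnuma
 */
#include <numaif.h>     /* move_pages(), MPOL_MF_MOVE */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CXL_NODE 2      /* placeholder NUMA node id for the CXL memory */

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	void *page;

	if (posix_memalign(&page, page_size, page_size))
		return 1;
	memset(page, 0x5a, page_size);          /* fault the page in */

	void *pages[1]  = { page };
	int   nodes[1]  = { CXL_NODE };
	int   status[1] = { -1 };

	/* Ask the kernel to migrate this one page to CXL_NODE. */
	if (move_pages(0 /* self */, 1, pages, nodes, status, MPOL_MF_MOVE))
		perror("move_pages");
	else
		printf("page now on node %d\n", status[0]);

	free(page);
	return 0;
}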
>
> Anyway, would love to attend this session.
>
> ~Gregory
>
--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla