Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory

From: Dragan Stancevic <dragan@stancevic.com>
To: Gregory Price <gregory.price@memverge.com>
Cc: lsf-pc@lists.linux-foundation.org, nil-migration@lists.linux.dev,
	linux-cxl@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory
Date: Mon, 10 Apr 2023 19:56:01 -0500
Message-ID: <9d22b56b-80ef-b36f-731b-4b3b588bc4bd@stancevic.com>
In-Reply-To: <ZDCv3lxLbquITy8M@memverge.com>

Hi Gregory-

On 4/7/23 19:05, Gregory Price wrote:
> On Fri, Apr 07, 2023 at 04:05:31PM -0500, Dragan Stancevic wrote:
>> Hi folks-
>>
>> if it's not too late for the schedule...
>>
>> I am starting to tackle VM live migration and hypervisor clustering over
>> switched CXL memory[1][2], intended for cloud virtualization types of loads.
>>
>> I'd be interested in doing a small BoF session with some slides and get into
>> a discussion/brainstorming with other people that deal with VM/LM cloud
>> loads. Among other things to discuss would be page migrations over switched
>> CXL memory, shared in-memory ABI to allow VM hand-off between hypervisors,
>> etc...
>>
>> A few of us discussed some of this under the ZONE_XMEM thread, but I figured
>> it might be better to start a separate thread.
>>
>> If there is interest, thank you.
>>
>>
>> [1]. High-level overview available at http://nil-migration.org/
>> [2]. Based on CXL spec 3.0
>>
>> --
>> Peace can only come as a natural consequence
>> of universal enlightenment -Dr. Nikola Tesla
> 
> I've been chatting about this with folks offline, figured I'd toss my
> thoughts on the issue here.

excellent brain dump, thank you


> Some things to consider:
> 
> 1. If secure-compute is being used, then this mechanism won't work as
>     pages will be pinned, and therefore not movable and excluded from
>     using cxl memory at all.
> 
>     This issue does not exist with traditional live migration, because
>     typically some kind of copy is used from one virtual space to another
>     (i.e. RDMA), so pages aren't really migrated in the kernel memory
>     block/numa node sense.

right, agreed... I don't think we can migrate in all scenarios, such as 
pinning or forms of pass-through, etc

my opinion just to start off, as a base requirement, would be that the 
pages be movable.



> 2. During the migration process, the memory needs to be forced not to be
>     migrated to another node by other means (tiering software, swap,
>     etc).  The obvious way of doing this would be to migrate and
>     temporarily pin the page... but going back to problem #1 we see that
>     ZONE_MOVABLE and Pinning are mutually exclusive.  So that's
>     troublesome.

Yeah, true. I'd have to check the code, but I wonder if perhaps we could
bump the mapcount or refcount of the pages upon migration onto CXL
switched memory. If my memory serves me right, wouldn't move_pages back
off or stall? I guess it's TBD how workable or useful that would be, but
it's good to be thinking of different ways of doing this.


> 3. This is changing the semantics of migration from a virtual memory
>     movement to a physical memory movement.  Typically you would expect
>     the RDMA process for live migration to work something like...
> 
>     a) migration request arrives
>     b) source host informs destination host of size requirements
>     c) destination host allocates memory and passes a Virtual Address
>        back to source host
>     d) source host initiates an RDMA from HostA-VA to HostB-VA
>     e) CPU task is migrated
> 
>     Importantly, the allocation of memory by Host B handles the important
>     step of creating HVA->HPA mappings, and the Extended/Nested Page
>     Tables can simply be flushed and re-created after the VM is fully
>     migrated.
> 
>     too long; didn't read: live migration is a virtual address operation,
>     and node-migration is a PHYSICAL address operation, the virtual
>     addresses remain the same.
> 
>     This is problematic, as it's changing the underlying semantics of the
>     migration operation.

Those are all valid points, but what if you don't need to recreate
HVA->HPA mappings? If I am understanding the CXL 3.0 spec correctly,
then neither the virtual addresses nor the physical addresses would have
to change, because the fabric "virtualizes" host physical addresses and
the translation is done by the G-FAM/GFD, which has the capability to
translate multi-host HPAs to its internal DPAs. So if you have two
hypervisors seeing the device physical address as the same physical
address, that might work?
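
To make that concrete, here is a minimal sketch of the translation idea
in Python. It is a toy model under stated assumptions, not the decode
logic from the CXL spec: each host is given one linear window, and the
HostWindow class, its fields and all addresses are made up for
illustration.

```python
from dataclasses import dataclass

GIB = 1 << 30


@dataclass(frozen=True)
class HostWindow:
    """Hypothetical per-host window: a linear HPA range backed by device memory."""
    hpa_base: int  # where this host maps the device in its physical address space
    dpa_base: int  # first device physical address behind the window
    size: int      # window length in bytes

    def hpa_to_dpa(self, hpa: int) -> int:
        # Simplified linear decode; a real device may interleave or remap.
        offset = hpa - self.hpa_base
        if not 0 <= offset < self.size:
            raise ValueError(f"HPA {hpa:#x} is outside this host's window")
        return self.dpa_base + offset


# Hosts A and B map the device at the same HPA base, host C at a different one.
host_a = HostWindow(hpa_base=0x10_0000_0000, dpa_base=0, size=4 * GIB)
host_b = HostWindow(hpa_base=0x10_0000_0000, dpa_base=0, size=4 * GIB)
host_c = HostWindow(hpa_base=0x20_0000_0000, dpa_base=0, size=4 * GIB)

page_offset = 0x1234000
hpa = host_a.hpa_base + page_offset

dpa_a = host_a.hpa_to_dpa(hpa)                               # source hypervisor
dpa_b = host_b.hpa_to_dpa(hpa)                               # same HPA on host B
dpa_c = host_c.hpa_to_dpa(host_c.hpa_base + page_offset)     # rebased HPA on host C
```

In this model hosts A and B can reuse the same HPA unchanged, while host
C reaches the same device page only after rebasing the HPA by the
difference between the two window bases.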


> Problem #1 and #2 are head-scratchers, but maybe solvable.
> 
> Problem #3 is the meat and potatoes of the issue in my opinion. So let's
> consider that a little more closely.
> 
> Generically: NIL Migration is basically a pass by reference operation.

Yup, agreed


> The reference in this case is... the page tables.  You need to know how
> to interpret the data in the CXL memory region on the remote host, and
> that's a "relative page table translation" (to coin a phrase? I'm not
> sure how to best describe it).

right, coining phrases... I have been thinking of a "super-page" (for
lack of a better word), a metadata region sitting on the switched
CXL.mem device that would allow hypervisors to synchronize on various
aspects, such as "relative page table translation", host is up, host is
down, list of peers, who owns what, etc... In a perfect scenario, I would
love to see the hypervisors cooperating on a switched CXL.mem device the
same way CPUs on different NUMA nodes cooperate on memory in a single
hypervisor. If either host can allocate and schedule from this space,
then the "NIL" aspect of migration is "free".
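
As a rough illustration of what such a "super-page" could hold, here is
a small Python sketch that packs a header and a peer table into a byte
buffer standing in for the shared region. The layout, the magic value,
the field names and the heartbeat counter are all assumptions made up
for this example; nothing like this is defined anywhere yet, and the
sketch ignores the coherence and atomicity questions a real shared
region would raise.

```python
import struct

MAGIC = b"NILM"                    # hypothetical magic value
LAYOUT_VERSION = 1                 # hypothetical layout version
HEADER = struct.Struct("<4sHH")    # magic, layout version, peer count
PEER = struct.Struct("<IB3xQ")     # host id, state, 3 pad bytes, heartbeat counter
HOST_DOWN, HOST_UP = 0, 1


def write_superpage(region: bytearray, peers) -> None:
    """Store the header followed by one fixed-size entry per peer hypervisor."""
    HEADER.pack_into(region, 0, MAGIC, LAYOUT_VERSION, len(peers))
    for i, (host_id, state, heartbeat) in enumerate(peers):
        PEER.pack_into(region, HEADER.size + i * PEER.size,
                       host_id, state, heartbeat)


def read_superpage(region) -> list:
    """Return the peer table as a list of (host_id, state, heartbeat) tuples."""
    magic, version, count = HEADER.unpack_from(region, 0)
    if magic != MAGIC or version != LAYOUT_VERSION:
        raise ValueError("region does not hold a recognized super-page")
    return [PEER.unpack_from(region, HEADER.size + i * PEER.size)
            for i in range(count)]


# A 4 KiB buffer stands in for the metadata region on the CXL.mem device.
region = bytearray(4096)
write_superpage(region, [(1, HOST_UP, 42), (2, HOST_DOWN, 7)])
peers = read_superpage(region)
```

Fixed-size little-endian records keep the layout identical for every
host that maps the region; a "who owns what" table could follow the peer
table in the same style.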


> That's... complicated to say the least.
> 1) Pages on the physical hardware do not need to be contiguous
> 2) The CFMW on source and target host do not need to be mapped at the
>     same place
> 3) There's no pre-allocation in these charts, and migration isn't
>     targeted, so having the source-host "expertly place" the data isn't
>     possible (right now, I suppose you could make kernel extensions).
> 4) Similar to problem #2 above, even with a pre-allocate added in, you
>     would need to ensure those mappings were pinned during migration,
>     lest the target host end up swapping a page or something.
> 
> 
> 
> An Option:  Make pages physically contiguous on migration to CXL
> 
> In this case, you don't necessarily care about the Host Virtual
> Addresses, what you actually care about is the structure of the pages
> in memory (are they physically contiguous? or do you need to
> reconstruct the contiguity by inspecting the page tables?).
> 
> If a migration API were capable of reserving large swaths of contiguous
> CXL memory, you could discard individual page information and instead
> send page range information, reconstructing the virtual-physical
> mappings this way.

yeah, good points, but this is all tricky... it seems this would
require quiescing the VM and that is something I would like to avoid if
possible. I'd like to see the VM still executing while all of its pages
are migrated onto CXL NUMA on the source hypervisor. And I would like to
see the VM executing on the destination hypervisor while migrate_pages
is moving pages off of CXL. Of course, what you are describing above
would still be a very fast VM migration, but would require quiescing.
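
For reference, the page-range idea quoted above can be sketched in a few
lines of Python: collapse a per-page mapping into runs that are
contiguous on both sides, send only the runs, and expand them again on
the destination. The names (vpn for a virtual page number, dpn for a
page number on the CXL device) and the sample numbers are assumptions
for illustration, not an existing migration API.

```python
def coalesce(mappings: dict) -> list:
    """Collapse {vpn: dpn} into (vpn_start, dpn_start, npages) runs."""
    ranges = []
    for vpn in sorted(mappings):
        dpn = mappings[vpn]
        if ranges:
            v0, d0, n = ranges[-1]
            # Extend the last run only if both sides stay contiguous.
            if vpn == v0 + n and dpn == d0 + n:
                ranges[-1] = (v0, d0, n + 1)
                continue
        ranges.append((vpn, dpn, 1))
    return ranges


def expand(ranges: list) -> dict:
    """Rebuild the per-page {vpn: dpn} mapping on the destination host."""
    return {v0 + i: d0 + i for v0, d0, n in ranges for i in range(n)}


# Six pages: two physically contiguous runs plus one stray page.
mappings = {0: 100, 1: 101, 2: 102, 3: 200, 4: 201, 8: 202}
ranges = coalesce(mappings)
rebuilt = expand(ranges)
```

The more contiguous the placement on CXL, the fewer runs need to be
handed over, which is the benefit the contiguous-reservation option is
after.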



> That's about as far as I've thought about it so far.  Feel free to rip
> it apart! :]

Those are all great thoughts and I appreciate you sharing them. I don't 
have all the answers either :)


> ~Gregory
> 


--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla
