From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 877E2C3ABAA for ; Mon, 5 May 2025 14:37:16 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 282086B008A; Mon, 5 May 2025 10:37:15 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 20AAF6B008C; Mon, 5 May 2025 10:37:15 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 087606B0092; Mon, 5 May 2025 10:37:15 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id D42A06B008A for ; Mon, 5 May 2025 10:37:14 -0400 (EDT) Received: from smtpin21.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 6D8DB1CC889 for ; Mon, 5 May 2025 14:37:15 +0000 (UTC) X-FDA: 83409106830.21.B151597 Received: from mail-lj1-f173.google.com (mail-lj1-f173.google.com [209.85.208.173]) by imf20.hostedemail.com (Postfix) with ESMTP id 4AFFD1C000A for ; Mon, 5 May 2025 14:37:13 +0000 (UTC) Authentication-Results: imf20.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b="WPLE/cza"; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf20.hostedemail.com: domain of ryabinin.a.a@gmail.com designates 209.85.208.173 as permitted sender) smtp.mailfrom=ryabinin.a.a@gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1746455833; a=rsa-sha256; cv=none; b=HeC6BwBzqXmty18sPrpxjHM6ckCWsgwk95wMQ7QxTVlqYLloACRieFp7Y3CM194+iwWJCG 6RrqnFZF+SOrhw+XEPC4jaaZBDhy6NC2Wxn2k0Lsi025osabdLS3ZjhHAJouWkzsum4oVb AG29hWSJCWt7DKPrArBoydaI8FUkBG0= ARC-Authentication-Results: i=1; imf20.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b="WPLE/cza"; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf20.hostedemail.com: domain of ryabinin.a.a@gmail.com designates 209.85.208.173 as permitted sender) smtp.mailfrom=ryabinin.a.a@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1746455833; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=1NHcDk4r4MC4EBq9o/mXd6iYMNAbw2nwb2cXbhASKss=; b=FEj60RYr8sXEJMTt9buDqecTvYlXeNAVD1ZJy9v9i7zjZG19oVvEUq5A4Nex+eeeXk2IS8 aOqd1s999cRmcyM12UvGCnEPfsmyjAFCnGDGQpZP2M6EE1Ebffgy2AoTsdtF6PlcZ8TQYt z16GQZN9t6ZZe1cS3BLvCnqFwmvROfI= Received: by mail-lj1-f173.google.com with SMTP id 38308e7fff4ca-30d8cb711e2so3421711fa.0 for ; Mon, 05 May 2025 07:37:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1746455831; x=1747060631; darn=kvack.org; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=1NHcDk4r4MC4EBq9o/mXd6iYMNAbw2nwb2cXbhASKss=; b=WPLE/czaLKgL6h8r7jNTpFDitq/tT/t4fui37SB9YTGkOhzagUo6W7J+wU2X+HpoA9 b7qp84AIbgDh2UZFZtSNcglmFRcETOUUhiW5sEzIKg8MoL239DzDTJV+3eTl/X/KhQgn QJp7lgaNV0hjLrDRshGAT7xed6mcWuvlJj+GJ9q2rJ9dob6sB9/tuhA2aVRHyyf8jNNN GF5K9wFFUeX+Y43TRMlnYCr/JmM8LC7wm+0WNMFSmeTJXgPx/DraxIVvlwTtFcf6QNvk n8P3OKQRH0pp+jWssP8SHNc4mp8np1738fpF/M24gPHGdENjs/qbhL8PESzVQ1oWVsmE mMFg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1746455831; x=1747060631; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=1NHcDk4r4MC4EBq9o/mXd6iYMNAbw2nwb2cXbhASKss=; b=U05FvDjnVqOSh8gepHYg+FiedUXNU/id/VioAzuELN4vVCcDc984wulXwz07Xszzq5 Z2xqtYhr8IMOnvDyG408J5JbOvb5fBQEe1p/ebglJxBfou/7qAztdkdc+cy+ph/b2Dpk vF4RNQIwyPz7sjq5AsbvENMSjJ831xwF1KUJjbCZkIzxLrLYiYOaBTAttm5hwlHPtXRC IliwPCGfeXSX6ZYIKVM10+hT2Bud1/UMIwazqAN/rTEWH/OVXsGjaPf13iqiS9J5dEf9 SqR6UhVTL4y/qayFuL+Mx1+u9qUDhZEncC0WXbQEJC/XPPQFan4bqmvZ9RCdl2E440yO fsjA== X-Forwarded-Encrypted: i=1; AJvYcCXNTDJsDtATStFuHzMWqOJTbozGJPNtpMYHVOa4nkoMKlfWNo89SEdvUeJQ9nJUE2AIWZgoqrsLmw==@kvack.org X-Gm-Message-State: AOJu0YypIUeMdPwse6GpHhRF0hnTam/xzv7mYnICm3AwGocOuAmcDxx/ 3oyPnJ1nQF9UVF3M+If0VTP1cmqYCL25zwMNnf/wYvYpAF2GeM5i X-Gm-Gg: ASbGncuuGm2dR30Zv+fVlg1BYOIKPRIB5XpM6L35meK6+d12CjrJ+sHOsxymuL+PIf2 JJHZ5ZcdiuJzV1AgFUQAJ+5nIGyZ8kfbjsbs/0PtJGvC0W/p8oiL/cL6BQ+dXQiznbSf9ERBGKO H+hC35+eo9O/wPjV0WKRWVE0dDY6F4HUwzAQwZj1ZJ+IyCRtdS9Fa3Ju++Mb3NjjiFnDBvwArfm WDRdtXPghB2/z4rUGhDoTALTN5ruYGE5+EtOIMeBo5LD+l/5ygh5MAUp/feq+y5lz59NxOfujxw RRwoxeIsGknZ4Wctl+tmZnyGCfBgBPynnEEDbTfyiubjtU7SFn/nCY4IHg== X-Google-Smtp-Source: AGHT+IGEtTrk5xxPxYLHFQLp2mHEJ3luWELdgPGmJBEEinUSHAR9t94mHQ/40kEYFTJUGu32Rm+2eQ== X-Received: by 2002:a2e:be8d:0:b0:30b:b852:2f5d with SMTP id 38308e7fff4ca-320c63b8d9amr12492721fa.11.1746455830940; Mon, 05 May 2025 07:37:10 -0700 (PDT) Received: from [10.214.35.248] ([80.93.240.68]) by smtp.gmail.com with ESMTPSA id 38308e7fff4ca-3202ad908b9sm16303971fa.108.2025.05.05.07.37.09 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 05 May 2025 07:37:10 -0700 (PDT) Message-ID: Date: Mon, 5 May 2025 16:35:57 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec To: Chris Li Cc: linux-kernel@vger.kernel.org, Alexander Graf , James Gowans , Mike Rapoport , Andrew Morton , linux-mm@kvack.org, Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H . Peter Anvin" , Eric Biederman , kexec@lists.infradead.org, Pratyush Yadav , Jason Gunthorpe , Pasha Tatashin , David Rientjes References: <20250310120318.2124-1-arbn@yandex-team.com> Content-Language: en-US From: Andrey Ryabinin In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Rspamd-Queue-Id: 4AFFD1C000A X-Stat-Signature: 9wyoqgqhodw39ojwo7krqz1kap59iatm X-Rspam-User: X-Rspamd-Server: rspam08 X-HE-Tag: 1746455833-440347 X-HE-Meta: U2FsdGVkX1/xqHetSWIMpJWivmN5QhtmRSiddb4FWTZ+l7I7VR/fjdew9bcSURg+hXpxg6s0+9BWVOekL5lEmPeY5jRRmTmdMSCpPlrn3A8BjmtFQ1xaCilfMSbfBuW1zvTx8rHy0N5jfaWys1uhWWsis/aZC8oVjWHBl/dYLQY1Pj4zyfqQFKiHhbNLNXzSJaVk4tsPkElv4qYa5JhBT3BZiWJ8KAdRSbhQ15SjXcEc2nXqQEwqiMcnwYZ0Fn84DODEo7GMMhlM83L802vxsGeG/eLB5qjWbN5QiRZ7mWbKjNTHE8yC5408Z/pflJc0C4qMaCvV63nkjmeWsumwnS4nbFq9xWLbEc3nBc24yBb4j4RXHHltnL1wcasncY6RitoouyRB0SARwdNWOsxMz/JYDNSpoyNH9+piSKiejRiIi4XpgEL9POUSGGNvv1HmMU3y3Y+orBPxpgqczsQ2OcBYkW+8fbXuH2EIlOlFnjj80yjMpTME3HpA4wtAA4BMexMZ1O4Kj+pOr9t6zTJ3q3WsxZ8bqHN26Yr/hAAoOamXyK9PcSeKfTAchFQAmzbchy7N92qOl+vB9q2yDNw8vF688Lir08SPcfPNN0jqDBfzoeKhjBPps+ydY/acfsw/LCGbdRmMebnGMrKARDboZcVHaAU9C+Syr4F4799L/lgrKWuafobccwI9pDZOeLxDZNBXHTtnnsDKKimbyWBPGG36qOxI3/cEkT84T0RyCF5numVpZ2xyh9oNl+KXV98j6nYIWwa5OgQrxmHdXE+rIBsMOC1V6Yr0NKVIH/rQuxHCBYhHIPiDr2V3Xp24BNUwF395ilNxag9iQf1x+rXa6FJzix7De2STn7nV/kHugP6kOYqWMxYs6+iqLApYUH+dWJSvapqA0poREuGDDelmyj2WvU/b6eCYEImupdqaIbKEKk1tm+sb1fIPruQ602HQBJrOp+INJkB772ojvwV RVGO+h8n zzgycUtWF8zaZ2T0o6sjmqh1y6ZSHndHaM2t36uXvPPpN7LKSfW1jUJ7htMLvbED3NichkmlpNBD+6YoVKuFNsNtnYuue9FEera4dlUzvp41dxSiSlZm2jkVZQ3/41D1Gbb4Dxe/JimsbyVWvrqWdH0NL2LIaajbr4dkXT20VVQcIv2Tn4T19iXLPKwlrThUukPbwM5Udu7D6X9rQrxGFsT9kVw82le24NBJgImojOgY3cbkBEQbX3oDh0kWfUYv+Sc9L+9gU8vPWNPNkYDjqM4bz0JqPdK02mldik3a47r3ki4bGJXSGyW/8W5LTLXxXpqE6y0pT4a2EM7I5gQH6HKsyUp7Nmhu9ChVDDpQsevtM14ODDp6yy5RoISeoTG71Y7gFf7RpW3qk6dN4t7ukcP7MaRW0H1rw5DpGWYea+wAaayHWpIhVe1cOkw9epRxK0/Aow+WAOi7rKb9myAJBH/lB9JvXaetwtM/X3XiU7A/RwT9kmAepgu9ecJs6R1Mhf0n6ubv6JjzmE5M= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 4/29/25 1:01 AM, Chris Li wrote: > Hi Andrey, > > I am working on the PCI portion of the live update and looking at > using KSTATE as an alternative to the FDT. Here are some high level > feedbacks. > Hi, thanks a lot. > On Mon, Mar 10, 2025 at 5:04 AM Andrey Ryabinin wrote: >> >> Main changes from v1 [1]: >> - Get rid of abusing crashkernel and implent proper way to pass memory to new kernel >> - Lots of misc cleanups/refactorings. >> >> kstate (kernel state) is a mechanism to describe internal some part of the >> kernel state, save it into the memory and restore the state after kexec >> in the new kernel. >> >> The end goal here and the main use case for this is to be able to >> update host kernel under VMs with VFIO pass-through devices running >> on that host. Since we are pretty far from that end goal yet, this >> only establishes some basic infrastructure to describe and migrate complex >> in-kernel states. >> >> The idea behind KSTATE resembles QEMU's migration framework [1], which >> solves quite similar problem - migrate state of VM/emulated devices >> across different versions of QEMU. >> >> This is an altenative to Kexec Hand Over (KHO [3]). >> >> So, why not KHO? >> > > KHO does more than just serializing/unserializing. It also has scratch > areas etc to allow safely performing early allocation without stepping > on the preserved memory. I see KSTATE as an alternative to libFDT as > ways of serializing the preserved memory. Not a replacement for KHO. > > With that, it would be great to see a KSTATE build on top of the > current version of KHO. The V6 version of KHO uses a recursive FDT > object. I see recursive FDT can map to the C struct description > similar to the KSTATE field description nicely. However, that will > require KSTATE to make some considerable changes to embrace the KHO > v6. For example, the KSTATE uses one contiguous stream buffer and KHO > V6 uses many recursive physical address object pointers for different > objects. Maybe a KSTATE V3? > Yep, I'll take a look into combinig KSTATE with KHO. .... > >> a_kho_restore() >> { >> ... >> a.i = fdt_getprop(fdt, offset, "i", &len); >> if (!a.i || len != sizeof(a.i)) >> goto err >> *a.p_ulong = fdt_getprop.... >> } >> >> Each driver/subsystem has to solve this problem in their own way. >> Also if we use fdt properties for individual fields, that might be wastefull >> in terms of used memory, as these properties use strings as keys. > > Right, I need to write a lot of boilerplate code to do the per > property save/restore. I am not worried too much about memory usage. A > lot of string keys are not much longer than 8 bytes. The memory saving > convert to binary index is not huge. I actually would suggest adding > the string version of the field name to the description table, so that > we can dump the state in KSTATE just like the YAML FDT output for > debugging purposes. It is a very useful feature of FDT to dump the > current saving state into a human readable form. KSTATE can have the > same feature added. > kstate_field already have string with name of the field: #define KSTATE_BASE_TYPE(_f, _state, _type) { \ .name = (__stringify(_f)), \ Currently it's not used in code, but it's there for debug purposes >> While with KSTATE solves the same problem in more elegant way, with this: >> struct kstate_description a_state = { >> .name = "a_struct", >> .version_id = 1, >> .id = KSTATE_TEST_ID, >> .state_list = LIST_HEAD_INIT(test_state.state_list), >> .fields = (const struct kstate_field[]) { >> KSTATE_BASE_TYPE(i, struct a, int), >> KSTATE_BASE_TYPE(s, struct a, char [10]), >> KSTATE_POINTER(p_ulong, struct a), >> KSTATE_PAGE(page, struct a), >> KSTATE_END_OF_LIST() >> }, >> }; >> >> >> { >> static unsigned long ulong >> static struct a a_data = { .p_ulong = &ulong }; >> >> kstate_register(&test_state, &a_data); >> } >> >> The driver needs only to have a proper 'kstate_description' and call kstate_register() >> to save/restore a_data. >> Basically 'struct kstate_description' provides instructions how to save/restore 'struct a'. >> And kstate_register() does all this save/restore stuff under the hood. > > It seems the KSTATE uses one contiguous stream and the object has to > be loaded in the order it was saved. For the PCI code, the PCI device > scanning and probing might cause the device load out of the order of > saving. (The PCI probing is actually the reverse order of saving). > This kstate_register() might pose restrictions on the restore order. > PCI will need to look up and find the device state based on the PCI > device ID. Other subsystems will likely have the requirement to look > up their own saved state as well. > I also see KSTATE can be extended to support that. > Absolutely agreed. I think we need to decouple restore and register, ie remove restore_misgrate_state() from kstate_register(). Add instance_id argument to kstate_register(), so the PCI code could do: kstate_register(&pci_state, pdev, PCI_DEVID(pdev->bus->number, pdev->devfn)); And on probing stage (probably in pci_device_add()) call kstate_restore(&pci_state, dev, PCI_DEVID(bus->number, dev->devfn)) which would locate state for the device if any and restore it. >> - Another bonus point - kstate can preserve migratable memory, which is required >> to preserve guest memory >> >> >> So now to the part how this works. >> >> State of kernel data (usually it's some struct) is described by the >> 'struct kstate_description' containing the array of individual >> fields descpriptions - 'struct kstate_field'. Each field >> has set of bits in ->flags which instructs how to save/restore >> a certain field of the struct. E.g.: >> - KS_BASE_TYPE flag tells that field can be just copied by value, >> >> - KS_POINTER means that the struct member is a pointer to the actual >> data, so it needs to be dereference before saving/restoring data >> to/from kstate data steam. >> >> - KS_STRUCT - contains another struct, field->ksd must point to >> another 'struct kstate_dscription' > > The field can't have both bits set for KS_BASE_TYPE and KS_STRUCT > type, right? Some of these flag combinations do not make sense. This > part might need more careful planning to keep it simple. Maybe some of > the flags bits should be enum. > Yes, this needs more thought. Mutually exclusive flags could be moved in separate enum field. Some may be not needed at all. e.g. instead of KS_STRUCT we could just check if (field->ksd != NULL) >> >> - KS_CUSTOM - Some non-trivial field that requires custom kstate_field->save() >> ->restore() callbacks to save/restore data. >> >> - KS_ARRAY_OF_POINTER - array of pointers, the size of array determined by the >> field->count() callback >> - KS_ADDRESS - field is a pointer to either vmemmap area (struct page) or >> linear address. Store offset > > I think we want to describe different stream types. > For example the most simple stream container is just a contiguous > buffer with start address and size. > The more complex one might be and size then an array of page pointers, > all those pointers add to up the new buffer which describe an saved > KSTATE that is larger than 4K and spread into an array of pages. Those > pages don't need to be contiguous. Such a page array buffer stores the > KSTATE entry described by a separate description table. > Agreed, I had similar thoughts. But this complicates code, so I started with something simple. >> >> - KS_END - special flag indicating the end of migration stream data. >> >> kstate_register() call accepts kstate_description along with an instance >> of an object and registers it in the global 'states' list. >> >> During kexec reboot phase we go through the list of 'kstate_description's >> and each instance of kstate_description forms the 'struct kstate_entry' >> which save into the kstate's data stream. >> >> The 'kstate_entry' contains information like ID of kstate_description, version >> of it, size of migration data and the data itself. The ->data is formed in >> accordance to the kstate_field's of the corresponding kstate_description. > > The version for the kstate_description might not be enough. The > version works if there is a linear history. Here we are likely to have > different vendors add their own extension to the device state saving. I think vendors can just declare separate kstate_description with different ID. The only problem with this, is that kstate_description.id is integer, so it would be a problem to allocate those without conflicts. Perhaps change the id to string? So vendors can just add vendor prefix to ID. > I suggest instead we save the old kernel's kstate_description table > (once per description table as a recursive object) alongside the > object physical address as well. The new kernel has their new version > of the description table. It can compare between the old and new > description tables and find out what fields need to be upgraded or > downgraded. The new kernel will use the old kstate_description table > to decode the previous kernel's saved object. Hmm.. I'm not sure, there is a lot to think about. This might make changes in kstate_description painful, e.g. if I want to rearrange some ->flags for whatever reason. So how to deal with changes in kstate_description itself? How do we save links to methods in kstate_field (->restore/->save/->count), and what if we'll need to change function prototypes of these methods ? > I think that way it is> more flexible to support adding one or more features and not tight to > which version has what feature. It can also make sure the new kernel > can always dump the old KSTATE into YAML. > > That way we might be able to simplify the subsection and the > depreciation flags. The new kernel doesn't need to carry the history > of changes made to the old description table. > >> After the reboot, when the kstate_register() called it parses migration >> stream, finds the appropriate 'kstate_entry' and restores the contents of >> the object in accordance with kstate_description and ->fields. > > Again this restoring can happen in a different order when the PCI > device scanning and probing order. The restoration might not happen in > one single call chain. Material for V3? Agreed. > I am happy to work with you to get KSTATE working with the existing KHO effort. > Thanks for useful feedback, appreciated.