From: Pasha Tatashin <pasha.tatashin@soleen.com>
Date: Thu, 27 Mar 2025 15:29:18 -0400
Subject: Re: [RFC v1 1/3] luo: Live Update Orchestrator
To: Jason Gunthorpe
Cc: changyuanl@google.com, graf@amazon.com, rppt@kernel.org,
 rientjes@google.com, corbet@lwn.net, rdunlap@infradead.org,
 ilpo.jarvinen@linux.intel.com, kanie@linux.alibaba.com, ojeda@kernel.org,
 aliceryhl@google.com, masahiroy@kernel.org, akpm@linux-foundation.org,
 tj@kernel.org, yoann.congal@smile.fr, mmaurer@google.com,
 roman.gushchin@linux.dev, chenridong@huawei.com, axboe@kernel.dk,
 mark.rutland@arm.com, jannh@google.com, vincent.guittot@linaro.org,
 hannes@cmpxchg.org, dan.j.williams@intel.com, david@redhat.com,
 joel.granados@kernel.org, rostedt@goodmis.org, anna.schumaker@oracle.com,
 song@kernel.org, zhangguopeng@kylinos.cn, linux@weissschuh.net,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-mm@kvack.org, gregkh@linuxfoundation.org, tglx@linutronix.de,
 mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
 x86@kernel.org, hpa@zytor.com, rafael@kernel.org, dakr@kernel.org,
 bartosz.golaszewski@linaro.org, cw00.choi@samsung.com,
 myungjoo.ham@samsung.com, yesanishhere@gmail.com,
 Jonathan.Cameron@huawei.com, quic_zijuhu@quicinc.com,
 aleksander.lobakin@intel.com, ira.weiny@intel.com,
 andriy.shevchenko@linux.intel.com, leon@kernel.org, lukas@wunner.de,
 bhelgaas@google.com, wagi@kernel.org, djeffery@redhat.com,
 stuart.w.hayes@gmail.com, jgowans@amazon.com, Pratyush Yadav
References: <20250320024011.2995837-1-pasha.tatashin@soleen.com>
 <20250320024011.2995837-2-pasha.tatashin@soleen.com>
 <20250320144338.GW9311@nvidia.com>
 <20250320192601.GG206770@nvidia.com>
In-Reply-To: <20250320192601.GG206770@nvidia.com>

On Thu, Mar 20, 2025 at 3:26 PM Jason Gunthorpe wrote:
>
> On Thu, Mar 20, 2025 at 03:00:31PM -0400, Pasha Tatashin wrote:
>
> > > I also think we should give up on the sysfs.
> > > If fdbox is going forward
> > > in a char dev direction then I think we should have two char devs
> > > /dev/kho/serialize and /dev/kho/deserialize and run the whole thing
> >
> > KHO is a mechanism to preserve kernel memory across reboots. It can be
> > used independently of live update, for example, to preserve kexec
> > reboot telemetry, traces, and for other purposes. The LUO utilizes KHO
> > for memory preservation but also orchestrates specifically a live
> > update process, provides a generic way for subsystems and devices to
> > participate, handles error recovery, unclaimed devices, and other live
> > update-specific steps.
> >
> > That said, I can transition the LUO interface from sysfs to a character device.
>
> Sure, I mean pick whatever name makes sense for this whole bundle..
>
> > > through that. The concepts shown in the fdbox patches should be merged
> > > into the kho/serialize char dev as just a general architecture of open
> > > the char dev, put stuff into it, then finalize and do the kexec.
> >
> > Some participating subsystems, such as interrupts, do not have a way
> > to export a file descriptor.
>
> Interrupts that need to be preserved are owned by VFIO. Why do we need
> to preserve interrupts? I thought the model was to halt all interrupts
> and then re-inject a spurious one?
>
> > It is unclear why we would require this
> > for kernel-internal state that needs to be preserved for live update,
> > which should instead register with internally.
>
> Because there is almost no kernel state which is machine global and
> unconditionally should be included. eg Interrupts for devices that are
> not doing preservation should not be serialized. Only userspace knows
> what should be preserved so you must always need a mechanism to tell
> the kernel.
>
> > IMO, the current API and state machine are quite simple (I plan to
> > present and go through them at one of the Hypervisor Live Update
> > meetings).
> > However, I am open to changing to a different API, and we
> > can expose it through a character device.
>
> Everything seems simple before you actually try to use it :)
>
> > > Also agree with Greg, I think this needs more thoughtful patch staging
> > > with actual complete solutions. I think focusing on a progression of
> > > demonstrable kexec preservation:
> > >  - A simple KVM and the VM's backing memory in a memfd is preserved
> > >  - A simple vfio-noiommu doing DMA to a preserved memfd, including not
> > >    resetting the device (but with no iommu driver)
> > >  - iommufd
> >
> > We are working on this. However, each component builds upon the
> > previous one, so it makes sense to discuss the lower layers early to
> > get early feedback.
>

Hi Jason,

Thanks for your thoughts. I agree with your observation about
components being worked on separately when they might be intrinsically
linked, especially given that kvm/vfio/iommu all have FD counterparts
to the global state or device state.

> I think part of the problem is there are lots of people working on
> pieces as though they are separate components, and I'm not sure this
> is entirely wise, or the components are actually separate. I see
> fdbox and this luo patch series as effectively being the same
> component, just different aspects of it.

You've articulated precisely the point we discussed at LSF/MM. Based
on that conversation, the next proposal will focus on unifying FDBox
and the Live Update Orchestrator into a single, cohesive component.

Here's a summary of the planned approach:

1. Unified Location: LUO will be moved under misc/liveupdate/ to house
   the consolidated functionality.

2. User Interfaces: A primary character device (/dev/liveupdate)
   utilizing an ioctl interface for control operations.
   (An initial draft of this interface is available here:
   https://raw.githubusercontent.com/soleen/linux/refs/heads/luo/rfc-v2.1/include/uapi/linux/liveupdate.h)
   An optional sysfs interface will allow userspace applications to
   monitor the LUO's state and react appropriately. For example, it
   allows systemd to load different services during different live
   update states.

3. Dependency Management: The viability of preserving a specific
   resource (file, device) will be checked when it initially requests
   participation. However, the actual dependencies will only be pulled
   and the final ordered list assembled during the prepare phase. This
   avoids the churn of repeatedly adding/removing dependencies as
   individual components register.

To manage the preservation logic, we'll use specific handles
categorized into three types: fd, device, and global. Each handle type
will define callbacks for the different phases of the live update
process. For instance, a file-system-related handle might look
something like this:

struct liveupdate_fs_handle {
        struct list_head liveupdate_entry;
        /* Callback during prepare phase */
        int (*prepare)(struct file *filp, void *preserve_page, ...);
        /* Callback during reboot phase */
        int (*reboot)(struct file *filp, void *preserve_page, ...);
        /* Callback after successful update to do state clean-up */
        void (*finish)(struct file *filp, void *preserve_page, ...);
        /* Callback if prepare/reboot is cancelled */
        void (*cancel)(struct file *filp, void *preserve_page, ...);
};

The overall preservation sequence involves processing these handles in
a specific order:

1. Preserved File Descriptors (e.g., memfd, kvmfd, iommufd, vfiofd)
2. Preserved Devices (ordered appropriately, leaves-to-root)
3. Global State Components

Let me know if this direction aligns with your expectations.

Pasha