Date: Sun, 16 Mar 2025 20:52:43 -0700 (PDT)
From: David Rientjes
To: Alexander Graf, Anthony Yznaga, Dave Hansen, David Hildenbrand,
    Frank van der Linden, James Gowans, Jason Gunthorpe, Junaid Shahid,
    Matthew Wilcox, Mike Rapoport, Pankaj Gupta, Pasha Tatashin,
    Pratyush Yadav, Vipin Sharma, Vishal Annapurve, "Woodhouse, David"
cc: linux-mm@kvack.org, kexec@lists.infradead.org
Subject: [Hypervisor Live Update] Notes from March 10, 2025

Hi everybody,

Here are the notes from the last Hypervisor Live Update call that happened
on Monday, March 10. Thanks to everybody who was involved!

These notes are intended to bring people up to speed who could not attend
the call as well as keep the conversation going in between meetings.

----->o-----
Mike discussed taking Jason's proposal in response to v4 with xarray and
extending it a bit for memory reservations; it appeared to be working
correctly. He's hoping to have the first implementation for that by this
week.

Mike noted that the next KHO series was being prepared to be sent out
before LSF/MM/BPF, including device tree.

----->o-----
Mike noted that Pratyush found that the KHO scratch area does not work
well with swiotlb[1]. The scratch area is reserved before the swiotlb is
initialized; the second kernel doesn't have enough low memory for the
swiotlb because it's still allocated from memblock. The current scratch
areas are allocated in higher memory. Mike posted a patch series that
split meminit for all architectures so this should be easier to fix. This
will affect any driver that requires memory in the first 4GB.

Alexander suggested allocating a scratch region in the low memory area.
Pratyush proposed this as a solution, although he wondered if it would be
possible to move swiotlb allocations to after the buddy allocator was up.
Heuristics were discussed to determine how much memory should be reserved
in low memory for this.

Mike noted that for successive kexecs, there will be multiple regions of
scratch area for each NUMA node. For low memory, this would be sized
suitably for allocations that must originate below 4GB for DMA.

Mike said a solution would still need to be developed for overlap with
preserved scratch memory and Pratyush noted that should be explicit by
denying those reservations. Pasha asked how drivers would know if
reservations would be denied in the first 4GB of memory. Mike said an
error code would be returned. Pasha was specific about devices that
wanted to preserve the memory because they knew DMA would be ongoing
during the reboot.
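
As a rough illustration of the explicit denial discussed above (this is
not code from any posted series), a reservation check could be modelled
as below. It is a Python model rather than kernel C, purely for brevity.
The function names, the (start, size) representation of scratch ranges,
and the choice of -EBUSY as the error are all assumptions made for this
sketch; the call only established that some error code would be returned.

```python
import errno


def ranges_overlap(a_start, a_size, b_start, b_size):
    """True if half-open ranges [a_start, a_start + a_size) and
    [b_start, b_start + b_size) share at least one byte."""
    return a_start < b_start + b_size and b_start < a_start + a_size


def try_preserve(start, size, scratch_ranges):
    """Hypothetical check: deny preserving memory that overlaps scratch.

    scratch_ranges is a list of (start, size) tuples. Returns 0 when the
    request is acceptable, or -errno.EBUSY (an assumed error code) when
    it overlaps any scratch range.
    """
    for s_start, s_size in scratch_ranges:
        if ranges_overlap(start, size, s_start, s_size):
            return -errno.EBUSY
    return 0


# Example: one 64MB scratch range starting at 16MB, below 4GB.
LOW_SCRATCH = [(0x1000000, 0x4000000)]
```

A driver that needs memory below 4GB to stay preserved would see the
negative return value and could fall back or refuse to participate, which
is the kind of explicit check Pratyush suggested.
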
Pasha's point became a more general question: what devices should we
support for KHO and what should we not (what is considered too legacy?).
In the meantime, Pratyush suggested explicit checks for this.

----->o-----
We shifted to talking about Pratyush's patch series supporting fdbox for
memfd[2]. Reaction to this was mixed: some feedback focused on the use of
miscdevice and there were security concerns. Pratyush noted that there
was no intent to propose this as a generic concept outside KHO.

Pratyush noted there was no way to preserve folio orders in KHO and he
also noted there was a need for page flags. He also said it would be
possible to move away from miscdevice and perhaps toward VFS but would
need to look more into this. Pasha asked about how the page flags were
preserved. Pratyush said there was another property that would store
them currently.

Pasha asked how cgroups would be handled, but there was no current
support for that. Pratyush said the current RFC focused on anon memfd and
has not yet looked at hugetlb. Pasha emphasized the importance of
focusing on one type of memory to start.

Pratyush noted in chat: "With FDBox work, I also realized that you can't
use FDT code from modules. Should not really be a problem since we can
export those symbols I suppose, but it doesn't work _currently_ at
least".

----->o-----
Andrey had recently sent another patch series for KSTATE[3] that was
discussed, now in v2, which was closer to being a formal submission
rather than an RFC. He noted his concern with KHO was how hard it was to
write serialization code. His goal was to give drivers the ability to
migrate structs across kexec which could be more elegant (see the struct
kstate_description). He suggested this would be more maintainable. This
approach had previously been used for live migration in qemu.

Andrey noted that each description would have a version field that
enables defining the minimal supported version for each driver.
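
To make the version field idea concrete, here is a small Python model
(not the KSTATE API: Field, Description, min_version, since_version and
fields_to_load are hypothetical names invented for this sketch). It
assumes each description records the version it writes and the oldest
version it accepts, and that each member records the version in which it
was introduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    since_version: int = 1  # first description version carrying this member


@dataclass(frozen=True)
class Description:
    name: str
    version: int       # version this kernel writes
    min_version: int   # oldest incoming version this kernel accepts
    members: tuple     # tuple of Field


def fields_to_load(desc, incoming_version):
    """Return the member names to restore from state saved at
    incoming_version, or raise ValueError if that version is unsupported."""
    if incoming_version < desc.min_version:
        raise ValueError("saved state is older than the minimal supported version")
    if incoming_version > desc.version:
        raise ValueError("saved state is newer than this description")
    return [f.name for f in desc.members if f.since_version <= incoming_version]


# Example: "queue_depth" was added when the version was bumped from 1 to 2.
DEMO = Description(
    name="demo_dev",
    version=2,
    min_version=1,
    members=(Field("irq_mask"), Field("queue_depth", since_version=2)),
)
```

With this model, state saved by a version 1 kernel restores only
irq_mask, while state saved at version 2 restores both members; anything
outside the [min_version, version] window is rejected up front.
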
Andrey made the connection between this version field and version
control in qemu.

Pasha asked how this solves the problem when memory becomes sufficiently
fragmented and the next kernel cannot boot due to it; Andrey noted the
kexec would fail. Andrey suggested allocating a big contiguous area; the
source and destination ranges would be the same.

Mike noted that kstate_description definitions and the way drivers
declare their state to preserve are independent from scratch memory
reservation. Andrey noted this wasn't a replacement for KHO but rather
could be built on top of KHO. Mike suggested on top of KHO we have FDT,
then what Pasha is proposing for dynamic tree on top of that, then
perhaps kstate on top of that. He would need to look more into kstate.

----->o-----
Mike asked if kstate descriptions depend on how the state is preserved on
the backend; an earlier version had a migration stream. Andrey suggested
using FDT underneath, but there is no strong dependency.

Pasha asked what architectures were supported today for kstate; Andrey
said x86. Pasha suggested that anything that lands upstream should likely
support both x86 and ARM.

Chris Li asked how kstate descriptions handle a struct adding or removing
a member. Andrey said if you want to add a new member, then you can bump
the version number. He showed an example from qemu[4] that could be used
as reference for this. You could also add a new kstate description with a
new id; on downgrade it would not be used, which keeps backward
compatibility.

Alexander suggested starting with FDT logic because it already exists and
then serialize and de-serialize binary data using a UAPI. Then, we should
discuss deprecating FDT if/when we have something better. That won't be
problematic unless we gain hundreds of users. He emphasized we should
focus on how to easily and quickly preserve memory across kexec, calling
back to drivers to store their state at the right time, etc. The data
format for how to serialize is a tiny detail in comparison.
Pasha fully agreed with this.

----->o-----
Next meeting is PREEMPTED for LSF/MM/BPF 2025 in Montreal. So the next
meeting will be on Monday, April 7 at 8am PDT (UTC-7). I'll send a
reminder on this mailing list.

Topics I think we should cover in the next meeting:

 - debrief discussions at LSF/MM/BPF 2025

 - update on Mike's patch series for memory reservation

 - update on Pratyush's progress for allocating swiotlb in low memory
   regions and any additional support required based on device
   requirements (who needs this scratch support?)

 - discuss whether the fdbox support would obsolete the need for
   guestmemfs in the long term

 - alignment on memblock as the first use case for KHO to justify
   upstreaming, including ftrace use cases

 - discuss Live Update Orchestrator (LUO) based on RFC patches sent by
   Pasha before then that helps to define the state machine

 - discuss how KSTATE plays into KHO upstreaming and complementary or
   overlapping goals

 - decoupling 1GB pages for hugetlb, guest_memfd, and memfds and how fds
   can be added to an fdbox

 - iommufd patch series (as well as qemu) from James

 - establishing an API for callbacks into drivers to serialize state
   during brownout

 - topics proposed by Pasha: reducing blackout window, relaxed
   serialization, and KHO activation requirements

 - implications of preserving vIOMMU state

 - testing methodology for these components, including selftests

Please let me know if you'd like to propose additional topics for
discussion, thank you!

[1] https://lore.kernel.org/all/mafs0cyf4ii4k.fsf@kernel.org
[2] https://lore.kernel.org/lkml/20250307005830.65293-5-ptyadav@amazon.de/T/
[3] https://lore.kernel.org/linux-mm/20250310120318.2124-6-arbn@yandex-team.com/T/
[4] https://www.qemu.org/docs/master/devel/migration/main.html#vmstate