Date: Tue, 22 Aug 2023 19:45:00 -0700
From: Stanislav Kinsburskii
To: "Gowans, James"
Cc: "gregkh@linuxfoundation.org", "rppt@kernel.org",
	"madvenka@linux.microsoft.com", "anthony.yznaga@oracle.com",
	"steven.sistare@oracle.com", "stanislav.kinsburskii@gmail.com",
	"linux-kernel@vger.kernel.org", "seanjc@google.com",
	"pbonzini@redhat.com", "kys@microsoft.com", "arnd@arndb.de",
	"Graf (AWS), Alexander", "wei.liu@kernel.org",
	"anrayabh@linux.microsoft.com", "dragan.cvetic@amd.com",
	"jinankjain@linux.microsoft.com", "derek.kiernan@amd.com",
	"linux-mm@kvack.org", Andrew Morton
Subject: Re: [RFC PATCH] Introduce persistent memory pool
Message-ID: <20230823024500.GA25462@skinsburskii.>
References: <64e7cbf7.050a0220.114c7.b70dSMTPIN_ADDED_BROKEN@mx.google.com>
	<2023082506-enchanted-tripping-d1d5@gregkh>

+akpm, +linux-mm

On Fri, Aug 25, 2023 at 01:32:40PM +0000, Gowans, James wrote:
> On Fri, 2023-08-25 at 10:05 +0200, Greg Kroah-Hartman wrote:
>
> Thanks for adding me to this thread Greg!
>
> > On Tue, Aug 22, 2023 at 11:34:34AM -0700, Stanislav Kinsburskii wrote:
> > > This patch addresses the need for a memory allocator dedicated to
> > > persistent memory within the kernel. This allocator will preserve
> > > kernel-specific states like DMA passthrough device states, IOMMU state, and
> > > more across kexec.
> > > The proposed solution offers a foundational implementation for potential
> > > custom solutions that might follow. Though the implementation is
> > > intentionally kept concise and straightforward to foster discussion and
> > > feedback, it's fully functional in its current state.
>
> Hi Stanislav, it looks like we're working on similar things. I'm looking
> to develop a mechanism to support hypervisor live update for when KVM is
> running VMs with PCI device passthrough. VMs with device passthrough
> also necessitates passing and re-hydrating IOMMU state so that DMA can
> continue during live update.
>
> Planning on having an LPC session on this topic:
> https://lpc.events/event/17/abstracts/1629/ (currently it's only a
> submitted abstract so not sure if visible, hopefully it will be soon).
>
> We are looking at implementing persistence across kexec via an in-memory
> filesystem on top of reserved memory. This would have files for anything
> that needs to be persisted. That includes files for IOMMU pgtables, for
> guest memory or userspace-accessible memory.
>
> It may be nice to solve all kexec persistence requirements with one
> solution, but we can consider IOMMU separately. There are at least three
> ways that this can be done:
> a) carving out reserved memory for pgtables. This is done by your
> proposal here, as well as my suggestion of a filesystem.
> b) pre/post kexec hooks for drivers to serialise state and pass it
> across in a structured format from old to new kernel.
> c) Reconstructing IOMMU state in the new kernel by starting at the
> hardware registers and walking the page tables. No state passing needed.
>
> Have you considered option (b) and (c) here? One of the implications of
> (b) and (c) are that they would need to hook into the buddy allocator
> really early to be able to carve out the reconstructed page tables
> before the allocator is used. Similar to how pkram [0] hooks in early to
> carve out pages used for its filesystem.
>

Hi James,

We are indeed working on similar things, so thanks for chiming in.
I've seen the pkram proposal as well as your comments there.

I think (b) will need some persistent-over-kexec memory to pass the state
across kexec, as well as some key-value store that is persisted too. The
proposed persistent memory pool is aimed at exactly this purpose.
Or do you have some other way in mind to pass a driver's data across kexec?

I haven't considered (c) yet, thanks for the pointer.

I have a question in this scope: how is the register state of PCI devices
persisted across kexec with the file system you are working on? I.e. how
does a driver know that the device shouldn't be reinitialized?

> >
> > >
> > > Potential applications include:
> > >
> > > 1. Allowing various in-kernel entities to allocate persistent pages from
> > >    a singular memory pool, eliminating the need for multiple region
> > >    reservations.
> > >
> > > 2. For in-kernel components that require the allocation address to be
> > >    available on kernel kexec, this address can be exposed to user space and
> > >    then passed via the command line.
>
> Do you have specific examples of other state that needs to be passed
> across? Trying to see whether tailoring specifically to the IOMMU case
> is okay. Conceptually IOMMU state can be reconstructed starting with
> hardware registers, not needing reserved memory. Other use-cases may not
> have this option.
>

Well, basically it's IOMMU state and PCI devices to skip/avoid
initializing.
I bet there can be other misc (and unrelated) things like persistent
filesystems, block devices, etc. But I don't have a solid set of use cases
to present.

> >
> > As you have no in-kernel users of this, it's not something we can even
> > consider at the moment for obvious reasons (neither would you want us
> > to.)
> >
> > Can you make this part of a patch series that actually adds a user,
> > probably more than one, so that we can see if any of this even makes
> > sense?
>
> I'm very keen to see this as well. The way that the IOMMU drivers are
> enlightened to hook into your memory pool will likely be similar to how
> they would hook into my proposal of an in-memory filesystem.
> Do you have code available showing the IOMMU integration?
>

No, I don't have such code yet. But I was thinking that using such an
allocator in the mempool allows hiding this implementation under the hood
of an existing generic mechanism, which can then be used to create
persistent objects (a file system, for example) on top of it.

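To make the layering idea above a bit more concrete, here is a rough
userspace sketch in C (it is not the pmpool code from the patch). It
assumes a reserved region that both kernels can find again, modelled here
by a malloc'ed buffer, and puts a small key-value directory in front of a
bump allocator. All names (pool_hdr, pool_init, pool_attach, pool_alloc,
pool_lookup, POOL_MAGIC, demo_roundtrip) are hypothetical and only serve
the illustration.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define POOL_MAGIC   0x504d504cu	/* hypothetical marker value */
#define POOL_ENTRIES 8
#define KEY_LEN      16

/* Directory entry: maps a key to an offset inside the region. */
struct pool_entry {
	char     key[KEY_LEN];
	uint32_t offset;
	uint32_t size;
};

/* Header at the start of the region. Only offsets are stored, never
 * pointers, so the contents stay valid if the region is mapped elsewhere. */
struct pool_hdr {
	uint32_t magic;
	uint32_t size;	/* total region size in bytes */
	uint32_t next;	/* bump-allocator cursor */
	uint32_t nr;	/* number of used directory entries */
	struct pool_entry ent[POOL_ENTRIES];
};

/* "Old kernel" side: format an empty pool in the region. */
static struct pool_hdr *pool_init(void *region, uint32_t size)
{
	struct pool_hdr *h = region;

	if (size < sizeof(*h))
		return NULL;
	memset(h, 0, sizeof(*h));
	h->magic = POOL_MAGIC;
	h->size = size;
	h->next = sizeof(*h);
	return h;
}

/* "New kernel" side: validate an existing pool instead of formatting it. */
static struct pool_hdr *pool_attach(void *region, uint32_t size)
{
	struct pool_hdr *h = region;

	if (size < sizeof(*h) || h->magic != POOL_MAGIC || h->size != size ||
	    h->next > h->size || h->nr > POOL_ENTRIES)
		return NULL;
	return h;
}

/* Allocate a named chunk; the name is what the next kernel looks up. */
static void *pool_alloc(struct pool_hdr *h, const char *key, uint32_t size)
{
	uint32_t aligned = (size + 7u) & ~7u;	/* keep chunks 8-byte aligned */
	struct pool_entry *e;

	if (h->nr == POOL_ENTRIES || strlen(key) >= KEY_LEN ||
	    aligned < size || aligned > h->size - h->next)
		return NULL;
	e = &h->ent[h->nr++];
	strcpy(e->key, key);
	e->offset = h->next;
	e->size = size;
	h->next += aligned;
	return (char *)h + e->offset;
}

/* Find a previously allocated chunk by key. */
static void *pool_lookup(struct pool_hdr *h, const char *key, uint32_t *size)
{
	for (uint32_t i = 0; i < h->nr; i++) {
		if (strcmp(h->ent[i].key, key) == 0) {
			if (size)
				*size = h->ent[i].size;
			return (char *)h + h->ent[i].offset;
		}
	}
	return NULL;
}

/* Simulated round trip: store state, re-attach to the same bytes, and
 * find the state again by key. Returns 0 on success, -1 otherwise. */
static int demo_roundtrip(void)
{
	uint32_t bytes = 4096, len = 0, *state;
	void *region = malloc(bytes);	/* stands in for reserved memory */
	struct pool_hdr *h;
	int ret = -1;

	if (!region)
		return -1;
	h = pool_init(region, bytes);
	state = h ? pool_alloc(h, "iommu", sizeof(*state)) : NULL;
	if (state) {
		*state = 0xabcd;
		h = pool_attach(region, bytes);	/* "after kexec" */
		state = h ? pool_lookup(h, "iommu", &len) : NULL;
		if (state && len == sizeof(*state) && *state == 0xabcd)
			ret = 0;
	}
	free(region);
	return ret;
}
```

demo_roundtrip() returns 0 when the state stored before the simulated
kexec is found again after re-attaching. Storing offsets rather than
pointers is what keeps the directory usable if the new kernel maps the
region at a different virtual address.
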
> >
> > >  drivers/misc/Kconfig   |    7 +
> > >  drivers/misc/Makefile  |    1
> > >  drivers/misc/pmpool.c  |  270 ++++++++++++++++++++++++++++++++++++++++++++++++
> > >  include/linux/pmpool.h |   20 ++++
> > >  4 files changed, 298 insertions(+)
> > >  create mode 100644 drivers/misc/pmpool.c
> > >  create mode 100644 include/linux/pmpool.h
> >
> > misc is not for memory pools, as this is not a driver. please put this
> > in the properly location instead of trying to hide it from the mm
> > maintainers and subsystem :)
>
> One of the reasons I thought a proper filesystem would be a better way
> of exposing this functionality.
>

Yes, I see the point of having a file system for the goals you are
targeting. It looks like the right way forward.

What I'm not sure about is whether persistent-over-kexec memory management
and the actual preservation mechanism should be embedded in this file
system.

What I'm trying to propose is one part of a generic mechanism for state
persistence across kexec, which can then be used by your file system or by
something else.

What do you think about this approach?

Thanks,
Stanislav

> JG
>
>
> [0]
> https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com/T/