linux-mm.kvack.org archive mirror
* [RFC PATCH v2 0/7] Introduce persistent memory pool
@ 2023-09-25 21:27 Stanislav Kinsburskii
  2023-09-25 21:27 ` [RFC PATCH v2 1/7] kexec_file: Add fdt modification callback support Stanislav Kinsburskii
                   ` (6 more replies)
  0 siblings, 7 replies; 27+ messages in thread
From: Stanislav Kinsburskii @ 2023-09-25 21:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
	stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
	kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini

This series introduces a memory allocator specifically tailored for
persistent memory within the kernel. The allocator maintains
kernel-specific states like DMA passthrough device states, IOMMU state, and
more across kexec.

The current implementation provides a foundation for custom solutions that
may be developed in the future. Although the design is kept concise and
straightforward to encourage discussion and feedback, it remains fully
functional.

The persistent memory pool builds upon the contiguous memory allocator
(CMA) and ensures CMA state persistence across kexec by placing the
CMA bitmap inside the managed memory region instead of allocating it
from kernel memory.

Persistent memory pool metadata is passed across kexec using a Flattened
Device Tree (FDT), which is added as another kexec segment on the x86
architecture.
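
For illustration, here is a minimal, hypothetical sketch of how an
in-kernel user could draw persistent pages from the default pool with the
helpers added later in the series (pmpool_alloc()/pmpool_release()); the
function names and page count below are made up:

    #include <linux/pmpool.h>

    static struct page *example_pages;

    static int example_acquire(void)
    {
            /* Allocate 16 pages that are meant to survive kexec. */
            example_pages = pmpool_alloc(16);
            if (!example_pages)
                    return -ENOMEM;	/* no pool configured, or pool exhausted */
            return 0;
    }

    static void example_release(void)
    {
            if (example_pages)
                    pmpool_release(example_pages, 16);
    }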

Potential applications include:

  1. Enabling various in-kernel entities to allocate persistent pages from
     a unified memory pool, obviating the need for reserving multiple
     regions.

  2. For in-kernel components that need the allocation address to be
     retained on kernel kexec, this address can be exposed to user space
     and subsequently passed through the command line.

  3. Distinct subsystems or drivers can set aside their region, allocating
     a segment for their persistent memory pool, suitable for uses such as
     file systems, key-value stores, and other applications.

Notes:

  1. The last patch of the series represents a use case for the feature.
     However, the patch won't compile and is for illustrative purposes only
     as the code being patched hasn't been merged yet.

  2. The code being patched is currently under review by the community. The
     series is named "Introduce /dev/mshv drivers":

         https://lkml.org/lkml/2023/9/22/1117


Changes since v1:

  1. Persistent memory pool is now a wrapper on top of CMA instead of being a
     new allocator.

  2. Persistent memory pool metadata no longer lives inside the pool and is
     instead passed to the new kernel over kexec via the Flattened Device
     Tree.

The following series implements...

---

Stanislav Kinsburskii (7):
      kexec_file: Add fdt modification callback support
      x86: kexec: Transfer existing fdt to the new kernel
      x86: kexec: Enable fdt modification in callbacks
      pmpool: Introduce persistent memory pool
      pmpool: Update device tree on kexec
      pmpool: Restore state from device tree post-kexec
      Drivers: hv: Allocate persistent pages for root partition


 arch/x86/Kconfig                  |   16 +++
 arch/x86/kernel/kexec-bzimage64.c |   97 +++++++++++++++++
 drivers/hv/hv_common.c            |   13 ++
 include/linux/kexec.h             |    7 +
 include/linux/pmpool.h            |   22 ++++
 kernel/kexec_file.c               |   24 ++++
 mm/Kconfig                        |    9 ++
 mm/Makefile                       |    1 
 mm/pmpool.c                       |  208 +++++++++++++++++++++++++++++++++++++
 9 files changed, 394 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/pmpool.h
 create mode 100644 mm/pmpool.c



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH v2 1/7] kexec_file: Add fdt modification callback support
  2023-09-25 21:27 [RFC PATCH v2 0/7] Introduce persistent memory pool Stanislav Kinsburskii
@ 2023-09-25 21:27 ` Stanislav Kinsburskii
  2023-09-25 21:27 ` [RFC PATCH v2 2/7] x86: kexec: Transfer existing fdt to the new kernel Stanislav Kinsburskii
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Stanislav Kinsburskii @ 2023-09-25 21:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
	stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
	kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini

From: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>

Introduce primitives to:
- Register and unregister callbacks for flattened device tree (fdt)
  modifications.
- Invoke all registered callbacks.
- Check for any registered callbacks.

These primitives allow kernel subsystems to store their state in the device
tree that is handed to the next kernel on kexec.
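
As an illustration only (not part of this patch), a subsystem could register
a callback along these lines to add its own node to the FDT handed to the
next kernel; the "foo" node, property and variable names are made up:

    static int foo_kexec_fdt_update(struct notifier_block *nb,
                                    unsigned long action, void *data)
    {
            void *fdt = data;
            int node;

            /* Store a value the next kernel can look up after kexec. */
            node = fdt_add_subnode(fdt, 0, "foo-state");
            if (node >= 0)
                    fdt_setprop_u64(fdt, node, "cookie", foo_cookie);

            return NOTIFY_DONE;
    }

    static struct notifier_block foo_kexec_fdt_nb = {
            .notifier_call = foo_kexec_fdt_update,
    };

    /* In the subsystem init path: */
    register_kexec_fdt_notifier(&foo_kexec_fdt_nb);
    /* ...and unregister_kexec_fdt_notifier(&foo_kexec_fdt_nb) on teardown. */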

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 include/linux/kexec.h |    7 +++++++
 kernel/kexec_file.c   |   24 ++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 22b5cd24f581..c9c70551796d 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -282,6 +282,13 @@ arch_kexec_apply_relocations(struct purgatory_info *pi, Elf_Shdr *section,
 	return -ENOEXEC;
 }
 #endif
+
+struct notifier_block;
+extern int register_kexec_fdt_notifier(struct notifier_block *nb);
+extern int unregister_kexec_fdt_notifier(struct notifier_block *nb);
+extern bool kexec_fdt_notify_list_empty(void);
+extern int kexec_fdt_notify(void *fdt);
+
 #endif /* CONFIG_KEXEC_FILE */
 
 #ifdef CONFIG_KEXEC_ELF
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 881ba0d1714c..f9245d5e4459 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -43,6 +43,30 @@ static int kexec_calculate_store_digests(struct kimage *image);
 /* Maximum size in bytes for kernel/initrd files. */
 #define KEXEC_FILE_SIZE_MAX	min_t(s64, 4LL << 30, SSIZE_MAX)
 
+static BLOCKING_NOTIFIER_HEAD(kexec_fdt_notify_list);
+
+bool kexec_fdt_notify_list_empty(void)
+{
+	return kexec_fdt_notify_list.head == NULL;
+}
+
+int kexec_fdt_notify(void *fdt)
+{
+	return blocking_notifier_call_chain(&kexec_fdt_notify_list, 0, fdt);
+}
+
+int register_kexec_fdt_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&kexec_fdt_notify_list, nb);
+}
+EXPORT_SYMBOL(register_kexec_fdt_notifier);
+
+int unregister_kexec_fdt_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&kexec_fdt_notify_list, nb);
+}
+EXPORT_SYMBOL(unregister_kexec_fdt_notifier);
+
 /*
  * Currently this is the only default function that is exported as some
  * architectures need it to do additional handlings.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH v2 2/7] x86: kexec: Transfer existing fdt to the new kernel
  2023-09-25 21:27 [RFC PATCH v2 0/7] Introduce persistent memory pool Stanislav Kinsburskii
  2023-09-25 21:27 ` [RFC PATCH v2 1/7] kexec_file: Add fdt modification callback support Stanislav Kinsburskii
@ 2023-09-25 21:27 ` Stanislav Kinsburskii
  2023-09-25 21:28 ` [RFC PATCH v2 3/7] x86: kexec: Enable fdt modification in callbacks Stanislav Kinsburskii
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Stanislav Kinsburskii @ 2023-09-25 21:27 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
	stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
	kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini

From: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>

Enable passing of the Flattened Device Tree (fdt) over kexec for the x86
architecture, as outlined in Documentation/x86/booting-dt.rst.
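
For reference, the setup_data node used here to carry the FDT has the
standard x86 boot protocol layout (see arch/x86/include/uapi/asm/bootparam.h);
the patch fills it with type SETUP_DTB and copies the FDT into data[]:

    struct setup_data {
            __u64 next;     /* physical address of the next node, 0 if last */
            __u32 type;     /* SETUP_DTB for a flattened device tree blob */
            __u32 len;      /* length of data[] in bytes */
            __u8  data[];   /* the FDT itself, fdt_totalsize() bytes */
    };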

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 arch/x86/Kconfig                  |    8 +++++
 arch/x86/kernel/kexec-bzimage64.c |   58 +++++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e36261b4ea14..efb472e267ec 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2070,6 +2070,14 @@ config KEXEC_FILE
 	  for kernel and initramfs as opposed to list of segments as
 	  accepted by previous system call.
 
+config KEXEC_FILE_FDT
+	bool "Pass fdt over kexec"
+	depends on KEXEC_FILE && X86_64
+	depends on OF_FLATTREE
+	help
+	  This option enables passing the existing Flattened Device Tree to the
+	  new kernel when kexec is invoked via the file-based system call.
+
 config ARCH_HAS_KEXEC_PURGATORY
 	def_bool KEXEC_FILE
 
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index a61c12c01270..ab9ae02c9a5f 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -18,6 +18,8 @@
 #include <linux/mm.h>
 #include <linux/efi.h>
 #include <linux/random.h>
+#include <linux/of_fdt.h>
+#include <linux/libfdt.h>
 
 #include <asm/bootparam.h>
 #include <asm/setup.h>
@@ -381,7 +383,59 @@ static int bzImage64_probe(const char *buf, unsigned long len)
 
 	return ret;
 }
+#ifdef CONFIG_KEXEC_FILE_FDT
+static void *fdt_get_runtime(void)
+{
+	return initial_boot_params;
+}
+
+static int kexec_setup_fdt(struct kexec_buf *kbuf, struct boot_params *params)
+{
+	void *fdt;
+	struct setup_data *sd;
+	unsigned long fdt_load_addr, fdt_sz;
+	int ret;
+
+	fdt = fdt_get_runtime();
+	if (!fdt)
+		return 0;
+
+	fdt_sz = fdt_totalsize(fdt);
+
+	kbuf->bufsz = kbuf->memsz = sizeof(struct setup_data) + fdt_sz;
+
+	sd = kzalloc(kbuf->bufsz, GFP_KERNEL);
+	if (!sd)
+		return -ENOMEM;
+
+	kbuf->buffer = sd;
+	kbuf->buf_align = PAGE_SIZE;
+	kbuf->buf_min = MIN_INITRD_LOAD_ADDR;
+	kbuf->mem = KEXEC_BUF_MEM_UNKNOWN;
+	ret = kexec_add_buffer(kbuf);
+	if (ret)
+		return ret;
+
+	fdt_load_addr = kbuf->mem;
 
+	pr_debug("Loaded fdt at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+		fdt_load_addr, fdt_sz, fdt_sz);
+
+	sd->type = SETUP_DTB;
+	sd->len = fdt_sz;
+	memcpy(sd->data, fdt, fdt_sz);
+
+	sd->next = params->hdr.setup_data;
+	params->hdr.setup_data = fdt_load_addr;
+
+	return 0;
+}
+#else
+static int kexec_setup_fdt(struct kexec_buf *kbuf, struct boot_params *params)
+{
+	return 0;
+}
+#endif
 static void *bzImage64_load(struct kimage *image, char *kernel,
 			    unsigned long kernel_len, char *initrd,
 			    unsigned long initrd_len, char *cmdline,
@@ -561,6 +615,10 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
 	if (ret)
 		goto out_free_params;
 
+	ret = kexec_setup_fdt(&kbuf, params);
+	if (ret)
+		goto out_free_params;
+
 	/* Allocate loader specific data */
 	ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
 	if (!ldata) {




^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH v2 3/7] x86: kexec: Enable fdt modification in callbacks
  2023-09-25 21:27 [RFC PATCH v2 0/7] Introduce persistent memory pool Stanislav Kinsburskii
  2023-09-25 21:27 ` [RFC PATCH v2 1/7] kexec_file: Add fdt modification callback support Stanislav Kinsburskii
  2023-09-25 21:27 ` [RFC PATCH v2 2/7] x86: kexec: Transfer existing fdt to the new kernel Stanislav Kinsburskii
@ 2023-09-25 21:28 ` Stanislav Kinsburskii
  2023-09-25 21:28 ` [RFC PATCH v2 4/7] pmpool: Introduce persistent memory pool Stanislav Kinsburskii
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Stanislav Kinsburskii @ 2023-09-25 21:28 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
	stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
	kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini

From: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>

This option allows kernel subsystems to modify (or create, if necessary)
the Flattened Device Tree (fdt) using registered callbacks and then pass
the modified version to the new kernel.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 arch/x86/Kconfig                  |    8 +++++++
 arch/x86/kernel/kexec-bzimage64.c |   41 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index efb472e267ec..90da51fbb8f8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2078,6 +2078,14 @@ config KEXEC_FILE_FDT
 	  This option enables passing the existing Flattened Device Tree to the
 	  new kernel when kexec is invoked via the file-based system call.
 
+config KEXEC_FILE_FDT_CALLBACK
+	bool "Enable kexec fdt modification support"
+	depends on KEXEC_FILE_FDT
+	select LIBFDT
+	help
+	  This option enables Flattened Device Tree modification (and creation
+	  if needed) by kernel subsystems that registered a corresponding callback.
+
 config ARCH_HAS_KEXEC_PURGATORY
 	def_bool KEXEC_FILE
 
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index ab9ae02c9a5f..3c6df28d3637 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -384,11 +384,50 @@ static int bzImage64_probe(const char *buf, unsigned long len)
 	return ret;
 }
 #ifdef CONFIG_KEXEC_FILE_FDT
+#ifdef CONFIG_KEXEC_FILE_FDT_CALLBACK
+static void *fdt_get_runtime(void)
+{
+	void *fdt;
+	size_t fdt_size = SZ_2M;
+	int status;
+
+	/* Nothing to do without an existing fdt and without any callbacks */
+	if (!initial_boot_params && kexec_fdt_notify_list_empty())
+		return NULL;
+
+	fdt = kzalloc(fdt_size, GFP_KERNEL);
+	if (!fdt)
+		return NULL;
+
+	if (initial_boot_params)
+		status = fdt_open_into(initial_boot_params, fdt, fdt_size);
+	else
+		status = fdt_create_empty_tree(fdt, fdt_size);
+	if (status != 0) {
+		pr_err("failed to get fdt\n");
+		goto free_fdt;
+	}
+
+	status = kexec_fdt_notify(fdt);
+	if (status) {
+		pr_err("fdt notification failed\n");
+		goto free_fdt;
+	}
+
+	fdt_pack(fdt);
+
+	return fdt;
+
+free_fdt:
+	kfree(fdt);
+	return NULL;
+}
+#else
 static void *fdt_get_runtime(void)
 {
 	return initial_boot_params;
 }
-
+#endif
 static int kexec_setup_fdt(struct kexec_buf *kbuf, struct boot_params *params)
 {
 	void *fdt;




^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH v2 4/7] pmpool: Introduce persistent memory pool
  2023-09-25 21:27 [RFC PATCH v2 0/7] Introduce persistent memory pool Stanislav Kinsburskii
                   ` (2 preceding siblings ...)
  2023-09-25 21:28 ` [RFC PATCH v2 3/7] x86: kexec: Enable fdt modification in callbacks Stanislav Kinsburskii
@ 2023-09-25 21:28 ` Stanislav Kinsburskii
  2023-09-25 21:28 ` [RFC PATCH v2 5/7] pmpool: Update device tree on kexec Stanislav Kinsburskii
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Stanislav Kinsburskii @ 2023-09-25 21:28 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
	stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
	kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini

From: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>

This patch introduces a memory allocator specifically tailored for
persistent memory within the kernel. The allocator maintains
kernel-specific states like DMA passthrough device states, IOMMU state, and
more across kexec.

The current implementation provides a foundation for custom solutions that
may be developed in the future. Although the design is kept concise and
straightforward to encourage discussion and feedback, it remains fully
functional.

The persistent memory pool builds upon the contiguous memory allocator
(CMA) and ensures CMA state persistence across kexec by placing the CMA
bitmap inside the managed memory region.
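
As a usage illustration (the address and size below are arbitrary), the
default pool is reserved at boot time with the kernel parameter parsed by
parse_pmpool_opt(), e.g. a 256 MiB pool at the 8 GiB physical address:

    pmpool=0x200000000,0x10000000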

Potential applications include:

  1. Enabling various in-kernel entities to allocate persistent pages from
     a unified memory pool, obviating the need for reserving multiple
     regions.

  2. For in-kernel components that need the allocation address to be
     retained on kernel kexec, this address can be exposed to user space
     and subsequently passed through the command line.

  3. Distinct subsystems or drivers can set aside their region, allocating
     a segment for their persistent memory pool, suitable for uses such as
     file systems, key-value stores, and other applications.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 include/linux/pmpool.h |   22 +++++++++++
 mm/Kconfig             |    8 ++++
 mm/Makefile            |    1 
 mm/pmpool.c            |  100 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 131 insertions(+)
 create mode 100644 include/linux/pmpool.h
 create mode 100644 mm/pmpool.c

diff --git a/include/linux/pmpool.h b/include/linux/pmpool.h
new file mode 100644
index 000000000000..b41f16fa9660
--- /dev/null
+++ b/include/linux/pmpool.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _PMPOOL_H
+#define _PMPOOL_H
+
+struct page;
+
+#if defined(CONFIG_PMPOOL)
+struct page *pmpool_alloc(unsigned long count);
+bool pmpool_release(struct page *pages, unsigned long count);
+#else
+static inline struct page *pmpool_alloc(unsigned long count)
+{
+	return NULL;
+}
+static inline bool pmpool_release(struct page *pages, unsigned long count)
+{
+	return false;
+}
+#endif
+
+#endif /* _PMPOOL_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 09130434e30d..e7c10094fb10 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -922,6 +922,14 @@ config CMA_AREAS
 
 	  If unsure, leave the default value "7" in UMA and "19" in NUMA.
 
+config PMPOOL
+	bool "Persistent memory pool support"
+	select CMA
+	help
+	  This option adds support for the CMA-based persistent memory pool
+	  feature, which provides page allocation and freeing from a set of
+	  persistent memory ranges deposited to the memory pool.
+
 config MEM_SOFT_DIRTY
 	bool "Track memory changes"
 	depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY && PROC_FS
diff --git a/mm/Makefile b/mm/Makefile
index 678530a07326..8d3579e58c2c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -139,3 +139,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_PMPOOL) += pmpool.o
diff --git a/mm/pmpool.c b/mm/pmpool.c
new file mode 100644
index 000000000000..12a8cac75558
--- /dev/null
+++ b/mm/pmpool.c
@@ -0,0 +1,100 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) "pmpool: " fmt
+
+#include <linux/bitmap.h>
+#include <linux/cma.h>
+#include <linux/io.h>
+#include <linux/kexec.h>
+#include <linux/memblock.h>
+#include <linux/mm.h>
+#include <linux/pmpool.h>
+
+#include "cma.h"
+
+struct pmpool {
+	struct cma *cma;
+};
+
+static struct pmpool *default_pmpool;
+
+bool pmpool_release(struct page *pages, unsigned long count)
+{
+	if (!default_pmpool)
+		return false;
+
+	return cma_release(default_pmpool->cma, pages, count);
+}
+
+struct page *pmpool_alloc(unsigned long count)
+{
+	if (!default_pmpool)
+		return NULL;
+
+	return cma_alloc(default_pmpool->cma, count, 0, true);
+}
+
+static void pmpool_fixup_cma(struct cma *cma)
+{
+	unsigned long bitmap_size;
+
+	bitmap_free(cma->bitmap);
+	cma->bitmap = phys_to_virt(PFN_PHYS(cma->base_pfn));
+
+	bitmap_size = BITS_TO_LONGS(cma_bitmap_maxno(cma));
+	memset(cma->bitmap, 0, bitmap_size);
+	bitmap_set(cma->bitmap, 0, PAGE_ALIGN(bitmap_size) >> PAGE_SHIFT);
+
+	pr_info("CMA bitmap moved to %#llx\n", virt_to_phys(cma->bitmap));
+}
+
+static int __init default_pmpool_fixup_cma(void)
+{
+	if (!default_pmpool)
+		return 0;
+
+	pmpool_fixup_cma(default_pmpool->cma);
+	return 0;
+}
+postcore_initcall(default_pmpool_fixup_cma);
+
+static int __init parse_pmpool_opt(char *str)
+{
+	static struct pmpool pmpool;
+	phys_addr_t base, size;
+	int err;
+
+	/* Format is pmpool=<base>,<size> */
+	base = memparse(str, &str);
+	size = memparse(str + 1, NULL);
+
+	err = memblock_is_region_reserved(base, size);
+	if (err) {
+		pr_err("memory block overlaps with another one: %d\n", err);
+		return 0;
+	}
+
+	err = memblock_reserve(base, size);
+	if (err) {
+		pr_err("failed to reserve memory block: %d\n", err);
+		return 0;
+	}
+
+	err = cma_init_reserved_mem(base, size, 0, "pmpool", &pmpool.cma);
+	if (err) {
+		pr_err("failed to initialize CMA: %d\n", err);
+		goto free_memblock;
+	}
+
+	pr_info("default memory pool is created: %#llx-%#llx\n",
+		base, base + size);
+
+	default_pmpool = &pmpool;
+
+	return 0;
+
+free_memblock:
+	memblock_phys_free(base, size);
+	return 0;
+}
+early_param("pmpool", parse_pmpool_opt);




^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH v2 5/7] pmpool: Update device tree on kexec
  2023-09-25 21:27 [RFC PATCH v2 0/7] Introduce persistent memory pool Stanislav Kinsburskii
                   ` (3 preceding siblings ...)
  2023-09-25 21:28 ` [RFC PATCH v2 4/7] pmpool: Introduce persistent memory pool Stanislav Kinsburskii
@ 2023-09-25 21:28 ` Stanislav Kinsburskii
  2023-09-25 21:28 ` [RFC PATCH v2 6/7] pmpool: Restore state from device tree post-kexec Stanislav Kinsburskii
  2023-09-25 21:28 ` [RFC PATCH v2 7/7] Drivers: hv: Allocate persistent pages for root partition Stanislav Kinsburskii
  6 siblings, 0 replies; 27+ messages in thread
From: Stanislav Kinsburskii @ 2023-09-25 21:28 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
	stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
	kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini

From: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>

Introduce a pmpool kexec fdt notifier that enables pmpool to pass its
metadata, including the bitmap address, to the new kernel during kexec.
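
For illustration, the node constructed by pmpool_fdt_update() below would
look roughly like this in device tree source form (the zero cells are
placeholders for the 64-bit physical addresses/sizes written by the code):

    chosen {
            default_pmpool {
                    compatible = "pmpool";
                    bitmap = <0x0 0x0>;   /* phys address of the relocated CMA bitmap */
                    size   = <0x0 0x0>;   /* pool size in bytes */
                    base   = <0x0 0x0>;   /* phys base address of the pool */
            };
    };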

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 mm/Kconfig  |    1 +
 mm/pmpool.c |   64 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index e7c10094fb10..1eefdd4c82ba 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -925,6 +925,7 @@ config CMA_AREAS
 config PMPOOL
 	bool "Persistent memory pool support"
 	select CMA
+	select LIBFDT
 	help
 	  This option adds support for the CMA-based persistent memory pool
 	  feature, which provides page allocation and freeing from a set of
diff --git a/mm/pmpool.c b/mm/pmpool.c
index 12a8cac75558..f2173db782d6 100644
--- a/mm/pmpool.c
+++ b/mm/pmpool.c
@@ -6,6 +6,7 @@
 #include <linux/cma.h>
 #include <linux/io.h>
 #include <linux/kexec.h>
+#include <linux/libfdt.h>
 #include <linux/memblock.h>
 #include <linux/mm.h>
 #include <linux/pmpool.h>
@@ -58,6 +59,59 @@ static int __init default_pmpool_fixup_cma(void)
 }
 postcore_initcall(default_pmpool_fixup_cma);
 
+static int pmpool_fdt_update(struct notifier_block *nb, unsigned long val,
+			     void *data)
+{
+	void *fdt = data;
+	int node, status;
+
+	if (!fdt)
+		goto err;
+
+	node = fdt_subnode_offset(fdt, 0, "chosen");
+	if (node < 0) {
+		node = fdt_add_subnode(fdt, 0, "chosen");
+		if (node < 0)
+			goto err;
+	}
+
+	node = fdt_add_subnode(fdt, node, "default_pmpool");
+	if (node == -FDT_ERR_EXISTS)
+		return 0;
+	if (node < 0)
+		goto err;
+
+	status = fdt_setprop(fdt, node, "compatible",
+			     "pmpool", sizeof("pmpool"));
+	if (status)
+		goto err;
+
+	status = fdt_setprop_u64(fdt, node, "bitmap",
+				 virt_to_phys(default_pmpool->cma->bitmap));
+	if (status)
+		goto err;
+
+	status = fdt_setprop_u64(fdt, node, "size",
+				 default_pmpool->cma->count << PAGE_SHIFT);
+	if (status)
+		goto err;
+
+	status = fdt_setprop_u64(fdt, node, "base",
+				 default_pmpool->cma->base_pfn << PAGE_SHIFT);
+	if (status)
+		goto err;
+
+	return NOTIFY_DONE;
+
+err:
+	pr_err("failed to update fdt\n");
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block pmpool_kexec_fdt_nb = {
+	.notifier_call  = pmpool_fdt_update,
+};
+
 static int __init parse_pmpool_opt(char *str)
 {
 	static struct pmpool pmpool;
@@ -80,10 +134,16 @@ static int __init parse_pmpool_opt(char *str)
 		return 0;
 	}
 
+	err = register_kexec_fdt_notifier(&pmpool_kexec_fdt_nb);
+	if (err) {
+		pr_err("failed to register kexec fdt notifier: %d\n", err);
+		goto free_memblock;
+	}
+
 	err = cma_init_reserved_mem(base, size, 0, "pmpool", &pmpool.cma);
 	if (err) {
 		pr_err("failed to initialize CMA: %d\n", err);
-		goto free_memblock;
+		goto notifier_unregister;
 	}
 
 	pr_info("default memory pool is created: %#llx-%#llx\n",
@@ -93,6 +153,8 @@ static int __init parse_pmpool_opt(char *str)
 
 	return 0;
 
+notifier_unregister:
+	unregister_kexec_fdt_notifier(&pmpool_kexec_fdt_nb);
 free_memblock:
 	memblock_phys_free(base, size);
 	return 0;




^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH v2 6/7] pmpool: Restore state from device tree post-kexec
  2023-09-25 21:27 [RFC PATCH v2 0/7] Introduce persistent memory pool Stanislav Kinsburskii
                   ` (4 preceding siblings ...)
  2023-09-25 21:28 ` [RFC PATCH v2 5/7] pmpool: Update device tree on kexec Stanislav Kinsburskii
@ 2023-09-25 21:28 ` Stanislav Kinsburskii
  2023-09-25 21:28 ` [RFC PATCH v2 7/7] Drivers: hv: Allocate persistent pages for root partition Stanislav Kinsburskii
  6 siblings, 0 replies; 27+ messages in thread
From: Stanislav Kinsburskii @ 2023-09-25 21:28 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
	stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
	kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini

From: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>

Retrieve the pmpool bitmap from metadata in the fdt passed over kexec,
bypassing the need for reinitialization. This ensures the seamless transfer
of the pmpool state across kexec.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 mm/pmpool.c |   46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/mm/pmpool.c b/mm/pmpool.c
index f2173db782d6..6c1a28fd3493 100644
--- a/mm/pmpool.c
+++ b/mm/pmpool.c
@@ -9,6 +9,7 @@
 #include <linux/libfdt.h>
 #include <linux/memblock.h>
 #include <linux/mm.h>
+#include <linux/of.h>
 #include <linux/pmpool.h>
 
 #include "cma.h"
@@ -49,11 +50,56 @@ static void pmpool_fixup_cma(struct cma *cma)
 	pr_info("CMA bitmap moved to %#llx\n", virt_to_phys(cma->bitmap));
 }
 
+static int pmpool_fdt_restore(struct cma *cma)
+{
+	struct device_node *dn;
+	u64 val;
+
+	dn = of_find_compatible_node(NULL, NULL, "pmpool");
+	if (!dn)
+		return -ENOENT;
+
+	if (of_property_read_u64(dn, "base", &val)) {
+		pr_err("invalid fdt: no base\n");
+		return -EINVAL;
+	}
+	if (val != PFN_PHYS(cma->base_pfn)) {
+		pr_err("fdt base doesn't match: %#llx != %#llx\n",
+			val, PFN_PHYS(cma->base_pfn));
+		return -EINVAL;
+	}
+
+	if (of_property_read_u64(dn, "size", &val)) {
+		pr_err("invalid fdt: no size\n");
+		return -EINVAL;
+	}
+	if (val != (cma->count << PAGE_SHIFT)) {
+		pr_err("fdt size doesn't match: %#llx != %#lx\n",
+			val, cma->count << PAGE_SHIFT);
+		return -EINVAL;
+	}
+
+	if (of_property_read_u64(dn, "bitmap", &val)) {
+		pr_err("invalid fdt: no bitmap\n");
+		return -EINVAL;
+	}
+
+	pr_info("CMA bitmap restored to %#llx\n", val);
+
+	bitmap_free(cma->bitmap);
+	cma->bitmap = phys_to_virt(val);
+
+	return 0;
+}
+
 static int __init default_pmpool_fixup_cma(void)
 {
 	if (!default_pmpool)
 		return 0;
 
+	if (!pmpool_fdt_restore(default_pmpool->cma))
+		return 0;
+
 	pmpool_fixup_cma(default_pmpool->cma);
 	return 0;
 }




^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH v2 7/7] Drivers: hv: Allocate persistent pages for root partition
  2023-09-25 21:27 [RFC PATCH v2 0/7] Introduce persistent memory pool Stanislav Kinsburskii
                   ` (5 preceding siblings ...)
  2023-09-25 21:28 ` [RFC PATCH v2 6/7] pmpool: Restore state from device tree post-kexec Stanislav Kinsburskii
@ 2023-09-25 21:28 ` Stanislav Kinsburskii
  6 siblings, 0 replies; 27+ messages in thread
From: Stanislav Kinsburskii @ 2023-09-25 21:28 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
	stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
	kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini

Deposited pages are owned by the hypervisor. Accessing them can trigger a
kernel panic due to a general protection fault.

This patch ensures that pages for the root partition are allocated from the
persistent memory pool. This allocation guarantees stability post-kexec,
protecting hypervisor-deposited pages from unintended reuse by the new
kernel.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
 drivers/hv/hv_common.c |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 335aec5ec504..a81c5613e745 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -426,7 +426,10 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
 		order = 31 - __builtin_clz(num_pages);
 
 		while (1) {
-			pages[i] = alloc_pages_node(node, GFP_KERNEL, order);
+			if (partition_id == hv_current_partition_id)
+				pages[i] = pmpool_alloc(1 << order);
+			else
+				pages[i] = alloc_pages_node(node, GFP_KERNEL, order);
 			if (pages[i])
 				break;
 			if (!order) {
@@ -471,8 +474,12 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages)
 err_free_allocations:
 	for (i = 0; i < num_allocations; ++i) {
 		base_pfn = page_to_pfn(pages[i]);
-		for (j = 0; j < counts[i]; ++j)
-			__free_page(pfn_to_page(base_pfn + j));
+		if (partition_id == hv_current_partition_id) {
+			pmpool_release(pages[i], counts[i]);
+			continue;
+		}
+		for (j = 0; j < counts[i]; ++j)
+			__free_page(pfn_to_page(base_pfn + j));
 	}
 
 free_buf:




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
  2023-09-28  2:46                   ` Stanislav Kinsburskii
@ 2023-09-29 10:13                     ` Shutemov, Kirill
  2023-09-28  9:16                       ` Stanislav Kinsburskii
  0 siblings, 1 reply; 27+ messages in thread
From: Shutemov, Kirill @ 2023-09-29 10:13 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: Dave Hansen, Baoquan He, tglx, mingo, bp, dave.hansen, x86, hpa,
	ebiederm, akpm, stanislav.kinsburskii, corbet, linux-kernel,
	kexec, linux-mm, kys, jgowans, wei.liu, arnd, gregkh, graf,
	pbonzini

On Wed, Sep 27, 2023 at 07:46:36PM -0700, Stanislav Kinsburskii wrote:
> I'd answer yes, "System MAP" must be persisted across kexec.
> Could you elaborate on why there should be a mechanism to tell the
> kernel anything special about the existent "System map" in this context?
> Say, one can reserve a CMA region (or a crash kernel region, etc), store
> there some data, and then pass it across kexec. Reserved CMA region will
> still be a part of the "System MAP", won't it?

Em. When the crash kernel starts, all System RAM of the first kernel
becomes E820_TYPE_RESERVED and only memory pre-allocated for the crash
scenario becomes E820_TYPE_RAM. See crash_setup_memmap_entries().

Can't you go the same path? Report all deposited memory as
E820_TYPE_RESERVED.

Or do you have too many deposited memory ranges, so we would run out of
e820 entries?
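
A rough sketch of that direction, assuming the deposited ranges are known
when the kexec segments are prepared (modeled on crash_setup_memmap_entries();
add_e820_entry() is the static helper in arch/x86/kernel/crash.c and the
deposited_base/deposited_size variables are made up; reusing the helper here
is an assumption, not something the series currently does):

    /* Report a hypervisor-deposited range to the next kernel as reserved,
     * so that any kernel parsing the boot_params memory map stays away
     * from it.
     */
    struct e820_entry ei = {
            .addr = deposited_base,
            .size = deposited_size,
            .type = E820_TYPE_RESERVED,
    };

    add_e820_entry(params, &ei);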

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
       [not found]                   ` <64208.123092816192300612@us-mta-483.us.mimecast.lan>
@ 2023-09-28 23:56                     ` Baoquan He
  2023-09-28  7:18                       ` Stanislav Kinsburskii
  0 siblings, 1 reply; 27+ messages in thread
From: Baoquan He @ 2023-09-28 23:56 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: Dave Hansen, tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm,
	akpm, stanislav.kinsburskii, corbet, linux-kernel, kexec,
	linux-mm, kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini,
	Shutemov, Kirill

On 09/27/23 at 07:46pm, Stanislav Kinsburskii wrote:
> On Thu, Sep 28, 2023 at 12:16:31PM -0700, Dave Hansen wrote:
> > On 9/27/23 17:38, Stanislav Kinsburskii wrote:
> > > On Thu, Sep 28, 2023 at 11:00:12AM -0700, Dave Hansen wrote:
> > >> On 9/27/23 17:02, Stanislav Kinsburskii wrote:
> > >>> On Thu, Sep 28, 2023 at 10:29:32AM -0700, Dave Hansen wrote:
> > >> ...
> > >>> Well, not exactly. That's something I'd like to have indeed, but from my
> > >>> POV this goal is out of scope of discussion at the moment.
> > >>> Let me try to express it the same way you did above:
> > >>>
> > >>> 1. Boot some kernel
> > >>> 2. Grow the deposited memory a bunch
> > >>> 5. Kexec
> > >>> 4. Kernel panic due to GPF upon accessing the memory deposited to
> > >>> hypervisor.
> > >>
> > >> I basically consider this a bug in the first kernel.  It *can't* kexec
> > >> when it's left RAM in shambles.  It doesn't know what features the new
> > >> kernel has and whether this is even safe.
> > >>
> > > 
> > > Could you elaborate more on why this is a bug in the first kernel?
> > > Say, kernel memory can be allocated in big physically consequitive
> > > chunks by the first kernel for depositing. The information about these
> > > chunks is then passed the the second kernel via FDT or even command
> > > line, so the seconds kernel can reserve this region during booting.
> > > What's wrong with this approach?
> > 
> > How do you know the second kernel can parse the FDT entry or the
> > command-line you pass to it?
> > 
> > >> Can the new kernel even read the new device tree data?
> > > 
> > > I'm not sure I understand the question, to be honest.
> > > Why can't it? This series contains code parts for both the first and
> > > second kernels.
> > 
> > How do you know the second kernel isn't the version *before* this series
> > gets merged?
> > 
> 
> The answer to both questions above is the following: the feature is deployed
> fleet-wide first, and enabled only upon the next deployment.
> It's worth mentioning that fleet-wide deployments usually don't need to
> support updates to a version older than the previous one.
> Also, since kexec is initiated by user space, it can always be
> enlightened about kernel capabilities and simply not kexec to an
> incompatible kernel version.
> One more thing to mention: in real life this problem exists only
> during the initial transition, as once the upgrade to a kernel with the
> feature has happened, there won't be a revert to a version without it.
> 
> > ...
> > >> I still think the only way this will possibly work when kexec'ing both
> > >> old and new kernels is to do it with the memory maps that *all* kernels
> > >> can read.
> > > 
> > > Could you elaborate more on this?
> > > The available memory map actually stays the same for both kernels. The
> > > difference here can be in a different list of memory regions to reserve,
> > > when the first kernel allocated and deposited another chunk, and thus
> > > the second kernel needs to reserve this memory as a new region upon
> > > booting.
> > 
> > Please take a step back from your implementation for a moment.  There
> > are two basic design points that need to be considered.
> > 
> > First, *must* "System RAM" (according to the memory map) be persisted
> > across kexec?  If no, then there's no problem to solve and we can stop
> > this thread.  If yes, then some mechanism must be used to tell the new
> > kernel that the "System RAM" in the memory map is not normal RAM.
> > 
> > Second, *if* we agree that some data must communicate across kexec, then
> > what mechanism should be used?  You're arguing for a new mechanism that
> > only new kernels can use.  I'm arguing that you should likely reuse an
> > existing mechanism (probably the UEFI/e820 maps) so that *ALL* kernels
> > can consume the information, old and new.
> > 
> 
> I'd answer yes, "System MAP" must be persisted across kexec.
> Could you elaborate on why there should be a mechanism to tell the
> kernel anything special about the existent "System map" in this context?
> Say, one can reserve a CMA region (or a crash kernel region, etc), store
> there some data, and then pass it across kexec. Reserved CMA region will
> still be a part of the "System MAP", won't it?

Well, I haven't gone through the whole discussion thread and clearly
grasped your intention and motivation. But here I have to say there's a
misunderstanding. At least I was astonished when I read the above
description. Who said a CMA region or a crash kernel region needs to be
passed across kexec? Think of kexec as a bootloader; in essence it's no
different than any other bootloader. When it jumps to the 2nd kernel, the
whole system will be booted up and reconstructed on the system resources.
The only difference kexec has is that it won't go through firmware to do
the detecting/testing/init. If the intention is to preserve any state or
region of the 1st kernel, you absolutely got it wrong.

This is not the first time people want to put a burden on kexec because
of a specific scenario, and it's not the 2nd or even the 3rd time in the
recent 2 years. But I would say please think about what a kexec reboot
is, what we expect it to do, and whether the problem can be fixed on its
own side.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
  2023-09-28  0:38               ` Stanislav Kinsburskii
@ 2023-09-28 19:16                 ` Dave Hansen
  2023-09-28  2:46                   ` Stanislav Kinsburskii
       [not found]                   ` <64208.123092816192300612@us-mta-483.us.mimecast.lan>
  0 siblings, 2 replies; 27+ messages in thread
From: Dave Hansen @ 2023-09-28 19:16 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: Baoquan He, tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm,
	akpm, stanislav.kinsburskii, corbet, linux-kernel, kexec,
	linux-mm, kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini,
	Shutemov, Kirill

On 9/27/23 17:38, Stanislav Kinsburskii wrote:
> On Thu, Sep 28, 2023 at 11:00:12AM -0700, Dave Hansen wrote:
>> On 9/27/23 17:02, Stanislav Kinsburskii wrote:
>>> On Thu, Sep 28, 2023 at 10:29:32AM -0700, Dave Hansen wrote:
>> ...
>>> Well, not exactly. That's something I'd like to have indeed, but from my
>>> POV this goal is out of scope of discussion at the moment.
>>> Let me try to express it the same way you did above:
>>>
>>> 1. Boot some kernel
>>> 2. Grow the deposited memory a bunch
>>> 3. Kexec
>>> 4. Kernel panic due to GPF upon accessing the memory deposited to
>>> hypervisor.
>>
>> I basically consider this a bug in the first kernel.  It *can't* kexec
>> when it's left RAM in shambles.  It doesn't know what features the new
>> kernel has and whether this is even safe.
>>
> 
> Could you elaborate more on why this is a bug in the first kernel?
> Say, kernel memory can be allocated in big physically contiguous
> chunks by the first kernel for depositing. The information about these
> chunks is then passed to the second kernel via FDT or even the command
> line, so the second kernel can reserve this region during booting.
> What's wrong with this approach?

How do you know the second kernel can parse the FDT entry or the
command-line you pass to it?

>> Can the new kernel even read the new device tree data?
> 
> I'm not sure I understand the question, to be honest.
> Why can't it? This series contains code parts for both the first and
> second kernels.

How do you know the second kernel isn't the version *before* this series
gets merged?

...
>> I still think the only way this will possibly work when kexec'ing both
>> old and new kernels is to do it with the memory maps that *all* kernels
>> can read.
> 
> Could you elaborate more on this?
> The available memory map actually stays the same for both kernels. The
> difference here can be in a different list of memory regions to reserve,
> when the first kernel allocated and deposited another chunk, and thus
> the second kernel needs to reserve this memory as a new region upon
> booting.

Please take a step back from your implementation for a moment.  There
are two basic design points that need to be considered.

First, *must* "System RAM" (according to the memory map) be persisted
across kexec?  If no, then there's no problem to solve and we can stop
this thread.  If yes, then some mechanism must be used to tell the new
kernel that the "System RAM" in the memory map is not normal RAM.

Second, *if* we agree that some data must communicate across kexec, then
what mechanism should be used?  You're arguing for a new mechanism that
only new kernels can use.  I'm arguing that you should likely reuse an
existing mechanism (probably the UEFI/e820 maps) so that *ALL* kernels
can consume the information, old and new.

I'm not convinced that this series is going in the right direction on
either of those points.

> Can all this be considered as, say, the first kernel using a device tree
> to inform the second kernel about the memory regions to reserve?
> In this case the first kernel behaves a bit like a piece of firmware for
> the second one.
> 
>> Can the hypervisor be improved to make this release operation faster?
> 
> I guess it can, but shutting down guests contributes to downtime the
> most. And without shutting down the guests the deposited memory can't be
> withdrawn.

Do you really need to fully shut down each guest?  Or do you just need
to get them to a quiescent state where the hypervisor and devices aren't
writing to the deposited memory?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
  2023-09-28  0:02           ` Stanislav Kinsburskii
@ 2023-09-28 18:00             ` Dave Hansen
  2023-09-28  0:38               ` Stanislav Kinsburskii
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2023-09-28 18:00 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: Baoquan He, tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm,
	akpm, stanislav.kinsburskii, corbet, linux-kernel, kexec,
	linux-mm, kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini,
	Shutemov, Kirill

On 9/27/23 17:02, Stanislav Kinsburskii wrote:
> On Thu, Sep 28, 2023 at 10:29:32AM -0700, Dave Hansen wrote:
...
> Well, not exactly. That's something I'd like to have indeed, but from my
> POV this goal is out of scope of discussion at the moment.
> Let me try to express it the same way you did above:
> 
> 1. Boot some kernel
> 2. Grow the deposited memory a bunch
> 3. Kexec
> 4. Kernel panic due to GPF upon accessing the memory deposited to
> hypervisor.

I basically consider this a bug in the first kernel.  It *can't* kexec
when it's left RAM in shambles.  It doesn't know what features the new
kernel has and whether this is even safe.

Can the new kernel even read the new device tree data?

>> Can't the deposited memory just be shrunk before kexec?  Surely there
>> aren't a bunch of pathological things consuming that memory right before
>> kexec, which is basically a reboot.
> 
> In general it can. But for this to happen the hypervisor needs to release
> this memory. And it can release the memory only if the guests are stopped.
> And stopping the guests during kexec isn't something we want to have in the
> long run.
> Also, even if we stop the guests before kexec, we need to restart them
> after boot, meaning we have to deposit the pages once again.
> All this: stopping the guests, withdrawing the pages upon kexec,
> allocating after boot and depositing them again significantly affects
> guest downtime.

Ahh, and you're presumably kexec'ing in the first place because you've
got a bug in the first kernel and you want a second kernel with fewer bugs.

I still think the only way this will possibly work when kexec'ing both
old and new kernels is to do it with the memory maps that *all* kernels
can read.

Can the hypervisor be improved to make this release operation faster?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
  2023-09-28 17:35       ` David Hildenbrand
@ 2023-09-28 17:37         ` Dave Hansen
  0 siblings, 0 replies; 27+ messages in thread
From: Dave Hansen @ 2023-09-28 17:37 UTC (permalink / raw)
  To: David Hildenbrand, Stanislav Kinsburskii, Baoquan He
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
	stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
	kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini

On 9/28/23 10:35, David Hildenbrand wrote:
> On 28.09.23 15:22, Dave Hansen wrote:
>> On 9/27/23 09:13, Stanislav Kinsburskii wrote:
>>> Once deposited, these pages can't be accessed by Linux anymore and thus
>>> must be preserved in "used" state across kexec, as hypervisor state is
>>> unaware of kexec.
>>
>> If Linux can't access them, they're not RAM any more.  I'd much rather
>> remove them from the memory map and move on with life rather than
>> implement a bunch of new ABI that's got to be handed across kernels.
> 
> The motivation of handling kexec (faster?) in a hyper-v domain doesn't
> sound particularly compelling got me for such features. If you inflated
> memory, just don't allow to kexec. It's been broken for years IIUC.

That's a good point.  What prevents deflating before kexec?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
  2023-09-28 13:22     ` Dave Hansen
  2023-09-27 23:25       ` Stanislav Kinsburskii
@ 2023-09-28 17:35       ` David Hildenbrand
  2023-09-28 17:37         ` Dave Hansen
  1 sibling, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2023-09-28 17:35 UTC (permalink / raw)
  To: Dave Hansen, Stanislav Kinsburskii, Baoquan He
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
	stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
	kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini

On 28.09.23 15:22, Dave Hansen wrote:
> On 9/27/23 09:13, Stanislav Kinsburskii wrote:
>> Once deposited, these pages can't be accessed by Linux anymore and thus
>> must be preserved in "used" state across kexec, as hypervisor state is
>> unaware of kexec.
> 
> If Linux can't access them, they're not RAM any more.  I'd much rather
> remove them from the memory map and move on with life rather than
> implement a bunch of new ABI that's got to be handed across kernels.

The motivation of handling kexec (faster?) in a Hyper-V domain doesn't
sound particularly compelling to me for such features. If you inflated
memory, just don't allow kexec. It's been broken for years IIUC.

Maybe the other use cases are more "relevant".

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
  2023-09-28 10:25     ` Baoquan He
  2023-09-27 22:44       ` Stanislav Kinsburskii
@ 2023-09-28 17:29       ` David Hildenbrand
  1 sibling, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2023-09-28 17:29 UTC (permalink / raw)
  To: Baoquan He, Stanislav Kinsburskii
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
	stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
	kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini

On 28.09.23 12:25, Baoquan He wrote:
> On 09/27/23 at 09:13am, Stanislav Kinsburskii wrote:
>> On Wed, Sep 27, 2023 at 01:44:38PM +0800, Baoquan He wrote:
>>> Hi Stanislav,
>>>
>>> On 09/25/23 at 02:27pm, Stanislav Kinsburskii wrote:
>>>> This patch introduces a memory allocator specifically tailored for
>>>> persistent memory within the kernel. The allocator maintains
>>>> kernel-specific states like DMA passthrough device states, IOMMU state, and
>>>> more across kexec.
>>>
>>> Can you give more details about how this persistent memory pool will be
>>>> utilized in an actual scenario? I mean, what problem have you met so that
>>>> you have to introduce a persistent memory pool to solve it?
>>>
>>
>> The major reason we have at the moment, is that Linux root partition
>> running on top of the Microsoft hypervisor needs to deposit pages to
>> hypervisor in runtime, when hypervisor runs out of memory.
>> "Depositing" here means, that Linux passes a set of its PFNs to the
>> hypervisor via hypercall, and hypervisor then uses these pages for its
>> own needs.
>>
>> Once deposited, these pages can't be accessed by Linux anymore and thus
>> must be preserved in "used" state across kexec, as hypervisor state is
>> unaware of kexec. At the same time, these pages can be withdrawn when
>> unused. Thus, an allocator persistent across kexec looks reasonable for
>> this particular matter.
> 
> Thanks for these details.
>   
> The deposit and withdraw remind me of the Balloon driver, David's virtio-mem,
> and DLPAR on ppc, which can hot-increase or shrink physical memory on the
> guest OS. Can't the Microsoft hypervisor do a similar thing to reclaim or
> give back the memory from or to the 'Linux root partition' running on top of
> the hypervisor?

virtio-mem was designed with kexec support in mind. You only expose the 
initial memory to the second kernel, and that memory can never have such 
holes. That does not apply to memory ballooning implementations, like 
Hyper-V dynamic memory.

In the virtio-mem paper I have the following:

"In our experiments, Hyper-V VMs crashed reliably when
trying to use kexec under Linux for fast OS reboots with
an inflated balloon. Other memory ballooning mechanisms
either have to temporarily deflate the whole balloon or al-
low access to inflated memory, which is undesired in cloud
environments."

I remember XEN does something elaborate, whereby they allow access to 
all inflated memory during reboot, but limit the total number of pages 
they will hand out. IIRC, you then have to work around things like 
"Windows initializes all memory with 0s when booting, and cope with 
that". So there are ways how hypervisors handled that in the past.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
  2023-09-27 23:25       ` Stanislav Kinsburskii
@ 2023-09-28 17:29         ` Dave Hansen
  2023-09-28  0:02           ` Stanislav Kinsburskii
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2023-09-28 17:29 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: Baoquan He, tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm,
	akpm, stanislav.kinsburskii, corbet, linux-kernel, kexec,
	linux-mm, kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini

On 9/27/23 16:25, Stanislav Kinsburskii wrote:
> On Thu, Sep 28, 2023 at 06:22:54AM -0700, Dave Hansen wrote:
>> On 9/27/23 09:13, Stanislav Kinsburskii wrote:
>>> Once deposited, these pages can't be accessed by Linux anymore and thus
>>> must be preserved in "used" state across kexec, as hypervisor state is
>>> unware of kexec.
>>
>> If Linux can't access them, they're not RAM any more.  I'd much rather
>> remove them from the memory map and move on with life rather than
>> implement a bunch of new ABI that's got to be handed across kernels.
> 
> Could you elaborate more on the new ABIs? FDT is handled by x86 already,
> and passing it over kexec looks like a natural extension.
> Also, adding more state to it doesn't look like a new ABI.
> Or does it?

FDT makes it easier to pass arbitrary data around, but you're still
creating a new "default_pmpool" device tree node on one end and
consuming it on the other.  That's a new ABI in my book.

> Let me also comment on removing these regions from the memory map. The
> major peculiarity here is that the hypervisor distinguishes between the pages
> deposited for guests to run and the pages deposited for the Linux root
> partition to keep the guest-related portion of hypervisor state in the
> root partition. And the latter is the matter in question.
> 
> We can indeed isolate and deposit an excessive amount of memory upfront
> in the hope that the hypervisor will never get into the situation when it
> needs more memory.
> However, it's not reliable, as the amount of memory will always be an
> estimation, depending on the number of expected guests, guest-attached
> devices, etc. And this becomes an even bigger problem when most of the
> memory is already removed from the memory map to host guest partitions.
> It's also not efficient, as the amount of memory required by the hypervisor
> can grow or shrink depending on the use case or host configuration, and
> depositing an excessive amount of memory will be a waste.
> 
> But, actually, the idea of removing the pages from memory map was
> reflected to some extent in the first version of this proposal,
> so let me elaborate on it a bit.
> 
> Effectively, instead of reserving and depositing a lot of memory to the
> hypervisor upfront, the memory can be allocated from kernel memory when
> needed and then returned when unused.
> This would still require removing pages from the memory map upon kexec,
> but that's another problem.

Let's distill this down a bit.

I agree that it's a waste to reserve an obscene amount of memory up
front for all guests for rare cases.  Having the amount of consumed
memory grow is a nice feature.

You can also quite easily *shrink* the amount of memory on a given
kernel without new code.  Right?

The problem comes when you've grown the footprint of hypervisor-donated
memory, kexec, and *THEN* want to shrink it.  That's what needs new
metadata to be communicated over to the new kernel.

1. Boot some kernel
2. Grow the deposited memory a bunch
3. Kexec
4. Shrink the deposited memory

Right?

That's where you lose me.

Can't the deposited memory just be shrunk before kexec?  Surely there
aren't a bunch of pathological things consuming that memory right before
kexec, which is basically a reboot.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
  2023-09-27 16:13   ` Stanislav Kinsburskii
@ 2023-09-28 13:22     ` Dave Hansen
  2023-09-27 23:25       ` Stanislav Kinsburskii
  2023-09-28 17:35       ` David Hildenbrand
  0 siblings, 2 replies; 27+ messages in thread
From: Dave Hansen @ 2023-09-28 13:22 UTC (permalink / raw)
  To: Stanislav Kinsburskii, Baoquan He
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
	stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
	kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini

On 9/27/23 09:13, Stanislav Kinsburskii wrote:
> Once deposited, these pages can't be accessed by Linux anymore and thus
> must be preserved in "used" state across kexec, as hypervisor state is
> unaware of kexec.

If Linux can't access them, they're not RAM any more.  I'd much rather
remove them from the memory map and move on with life rather than
implement a bunch of new ABI that's got to be handed across kernels.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
       [not found]   ` <58146.123092712145601339@us-mta-73.us.mimecast.lan>
@ 2023-09-28 10:25     ` Baoquan He
  2023-09-27 22:44       ` Stanislav Kinsburskii
  2023-09-28 17:29       ` David Hildenbrand
  0 siblings, 2 replies; 27+ messages in thread
From: Baoquan He @ 2023-09-28 10:25 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
	stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
	kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini, david

On 09/27/23 at 09:13am, Stanislav Kinsburskii wrote:
> On Wed, Sep 27, 2023 at 01:44:38PM +0800, Baoquan He wrote:
> > Hi Stanislav,
> > 
> > On 09/25/23 at 02:27pm, Stanislav Kinsburskii wrote:
> > > This patch introduces a memory allocator specifically tailored for
> > > persistent memory within the kernel. The allocator maintains
> > > kernel-specific states like DMA passthrough device states, IOMMU state, and
> > > more across kexec.
> > 
> > Can you give more details about how this persistent memory pool will be
> > utilized in an actual scenario? I mean, what problem have you met so that
> > you have to introduce a persistent memory pool to solve it?
> > 
> 
> The major reason we have at the moment, is that Linux root partition
> running on top of the Microsoft hypervisor needs to deposit pages to
> hypervisor in runtime, when hypervisor runs out of memory.
> "Depositing" here means, that Linux passes a set of its PFNs to the
> hypervisor via hypercall, and hypervisor then uses these pages for its
> own needs.
> 
> Once deposited, these pages can't be accessed by Linux anymore and thus
> must be preserved in "used" state across kexec, as hypervisor state is
> unaware of kexec. At the same time, these pages can be withdrawn when
> unused. Thus, an allocator persistent across kexec looks reasonable for
> this particular matter.

Thanks for these details.
 
The deposit and withdraw remind me of the Balloon driver, David's virtio-mem,
and DLPAR on ppc, which can hot-increase or shrink physical memory on the
guest OS. Can't the Microsoft hypervisor do a similar thing to reclaim or
give back the memory from or to the 'Linux root partition' running on top of
the hypervisor?

Thanks
Baoquan



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
  2023-09-29 10:13                     ` Shutemov, Kirill
@ 2023-09-28  9:16                       ` Stanislav Kinsburskii
  0 siblings, 0 replies; 27+ messages in thread
From: Stanislav Kinsburskii @ 2023-09-28  9:16 UTC (permalink / raw)
  To: Shutemov, Kirill
  Cc: Dave Hansen, Baoquan He, tglx, mingo, bp, dave.hansen, x86, hpa,
	ebiederm, akpm, stanislav.kinsburskii, corbet, linux-kernel,
	kexec, linux-mm, kys, jgowans, wei.liu, arnd, gregkh, graf,
	pbonzini

On Fri, Sep 29, 2023 at 01:13:24PM +0300, Shutemov, Kirill wrote:
> On Wed, Sep 27, 2023 at 07:46:36PM -0700, Stanislav Kinsburskii wrote:
> > I'd answer yes, "System MAP" must be persisted across kexec.
> > Could you elaborate on why there should be a mechanism to tell the
> > kernel anything special about the existing "System map" in this context?
> > Say, one can reserve a CMA region (or a crash kernel region, etc), store
> > there some data, and then pass it across kexec. A reserved CMA region will
> > still be a part of the "System MAP", won't it?
> 
> Em. When the crash kernel starts, all System RAM of the first kernel
> becomes E820_TYPE_RESERVED and only memory pre-allocated for crash
> scenario becomes E820_TYPE_RAM. See crash_setup_memmap_entries().
> 
> Can't you go the same path? Report all deposited memory as
> E820_TYPE_RESERVED.
> 

Sure I can.
This approach will require the corresponding command line option, and is
therefore less flexible. But if passing a device tree across kexec on x86
is the major concern, then of course I can change it the way you suggest.

> Or do you have too many deposited memory ranges, so we would run out of
> e820 entries?
> 

No, I don't think so.
I can imagine how such a pool with a lot of regions could exhaust the e820
table, but the implementation currently proposed is based on CMA and is
thus limited to 19 entries by default, so I guess running out of e820
entries is unlikely in real-world scenarios.
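
For illustration, here is a rough sketch of what reporting the deposited
ranges as E820_TYPE_RESERVED could look like when the kexec boot
parameters are prepared (the deposited_ranges list and the
add_e820_entry() stand-in below are placeholders, not code from this
series):

#include <asm/bootparam.h>
#include <asm/e820/types.h>

/*
 * Sketch only: mark every range deposited to the hypervisor as reserved
 * in the e820 copy handed to the kexec'd kernel, so that even an old,
 * unmodified kernel never treats those pages as usable RAM.
 */
struct deposited_range {
	u64 start;
	u64 size;
};

/* assumed to be maintained by the depositing code; not an existing API */
extern struct deposited_range deposited_ranges[];
extern int nr_deposited_ranges;

static void mark_deposited_ranges_reserved(struct boot_params *params)
{
	int i;

	for (i = 0; i < nr_deposited_ranges; i++) {
		struct e820_entry entry = {
			.addr = deposited_ranges[i].start,
			.size = deposited_ranges[i].size,
			.type = E820_TYPE_RESERVED,
		};

		/*
		 * add_e820_entry() stands in for whatever helper appends
		 * entries to the boot_params e820 table during kexec setup
		 * (cf. crash_setup_memmap_entries()).
		 */
		add_e820_entry(params, &entry);
	}
}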

Thanks,
Stanislav

> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
  2023-09-28 23:56                     ` Baoquan He
@ 2023-09-28  7:18                       ` Stanislav Kinsburskii
  0 siblings, 0 replies; 27+ messages in thread
From: Stanislav Kinsburskii @ 2023-09-28  7:18 UTC (permalink / raw)
  To: Baoquan He
  Cc: Dave Hansen, tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm,
	akpm, stanislav.kinsburskii, corbet, linux-kernel, kexec,
	linux-mm, kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini,
	Shutemov, Kirill

On Fri, Sep 29, 2023 at 07:56:37AM +0800, Baoquan He wrote:
> On 09/27/23 at 07:46pm, Stanislav Kinsburskii wrote:
> > On Thu, Sep 28, 2023 at 12:16:31PM -0700, Dave Hansen wrote:
> > > On 9/27/23 17:38, Stanislav Kinsburskii wrote:
> > > > On Thu, Sep 28, 2023 at 11:00:12AM -0700, Dave Hansen wrote:
> > > >> On 9/27/23 17:02, Stanislav Kinsburskii wrote:
> > > >>> On Thu, Sep 28, 2023 at 10:29:32AM -0700, Dave Hansen wrote:
> > > >> ...
> > > >>> Well, not exactly. That's something I'd like to have indeed, but from my
> > > >>> POV this goal is out of scope of discussion at the moment.
> > > >>> Let me try to express it the same way you did above:
> > > >>>
> > > >>> 1. Boot some kernel
> > > >>> 2. Grow the deposited memory a bunch
> > > >>> 3. Kexec
> > > >>> 4. Kernel panic due to GPF upon accessing the memory deposited to
> > > >>> hypervisor.
> > > >>
> > > >> I basically consider this a bug in the first kernel.  It *can't* kexec
> > > >> when it's left RAM in shambles.  It doesn't know what features the new
> > > >> kernel has and whether this is even safe.
> > > >>
> > > > 
> > > > Could you elaborate more on why this is a bug in the first kernel?
> > > > Say, kernel memory can be allocated in big physically consecutive
> > > > chunks by the first kernel for depositing. The information about these
> > > > chunks is then passed to the second kernel via FDT or even the command
> > > > line, so the second kernel can reserve this region during booting.
> > > > What's wrong with this approach?
> > > 
> > > How do you know the second kernel can parse the FDT entry or the
> > > command-line you pass to it?
> > > 
> > > >> Can the new kernel even read the new device tree data?
> > > > 
> > > > I'm not sure I understand the question, to be honest.
> > > > Why can't it? This series contains code parts for both the first and
> > > > second kernels.
> > > 
> > > How do you know the second kernel isn't the version *before* this series
> > > gets merged?
> > > 
> > 
> > The answer to both questions above is the following: the feature is
> > deployed fleet-wide first, and enabled only upon the next deployment.
> > It's worth mentioning that fleet-wide deployments usually don't need to
> > support updates to a version older than the previous one.
> > Also, since kexec is initiated by user space, it can always be enlightened
> > about kernel capabilities and simply not kexec to an incompatible kernel
> > version.
> > One more bit to mention: in real life this problem exists only during the
> > initial transition, as once the upgrade to a kernel with the feature has
> > happened, there won't be a revert to a version without it.
> > 
> > > ...
> > > >> I still think the only way this will possibly work when kexec'ing both
> > > >> old and new kernels is to do it with the memory maps that *all* kernels
> > > >> can read.
> > > > 
> > > > Could you elaborate more on this?
> > The available memory map actually stays the same for both kernels. The
> > difference here can be a different list of memory regions to reserve:
> > when the first kernel has allocated and deposited another chunk, the
> > second kernel needs to reserve this memory as a new region upon booting.
> > > 
> > > Please take a step back from your implementation for a moment.  There
> > > are two basic design points that need to be considered.
> > > 
> > > First, *must* "System RAM" (according to the memory map) be persisted
> > > across kexec?  If no, then there's no problem to solve and we can stop
> > > this thread.  If yes, then some mechanism must be used to tell the new
> > > kernel that the "System RAM" in the memory map is not normal RAM.
> > > 
> > > Second, *if* we agree that some data must be communicated across kexec, then
> > > what mechanism should be used?  You're arguing for a new mechanism that
> > > only new kernels can use.  I'm arguing that you should likely reuse an
> > > existing mechanism (probably the UEFI/e820 maps) so that *ALL* kernels
> > > can consume the information, old and new.
> > > 
> > 
> > I'd answer yes, "System MAP" must be persisted across kexec.
> > Could you elaborate on why there should be a mechanism to tell the
> > kernel anything special about the existing "System map" in this context?
> > Say, one can reserve a CMA region (or a crash kernel region, etc), store
> > there some data, and then pass it across kexec. A reserved CMA region will
> > still be a part of the "System MAP", won't it?
> 
> Well, I haven't gone through the whole discussion thread and clearly got
> your intention and motivation. But here I have to say there's a
> misunderstanding. At least I was astonished when I read the above
> description. Who said a CMA region or a crash kernel region needs to be
> passed across kexec? Think of kexec as a bootloader; in essence it's no
> different than any other bootloader. When it jumps to the 2nd kernel, the
> whole system will be booted up and reconstructed on the system resources.
> All the difference kexec has is that it won't go through firmware to do
> the detecting/testing/init. If the intention is to preserve any state or
> region of the 1st kernel, you absolutely got it wrong.
> 
> This is not the first time people have wanted to put a burden on kexec
> because of a specific scenario, and it's not the 2nd or even the 3rd time
> in the recent 2 years. But I would say please think about what a kexec
> reboot is, what we expect it to do, and whether the problem can be fixed
> on its own side.

Frankly, I'm confused, as I don't really understand what exactly you are
arguing with... Maybe I triggered some pain point, but I don't think you
are reacting to what I actually said.
I never said that either CMA or the crash kernel region needs to be passed
across kexec: I said they may be (and actually are) passed in real-world
scenarios. Also, it's not just CMA, but pmem backed by RAM as well.
What am I missing here?

And to me it looks like I do think about kexec as a boot loader, just as
you mentioned, since the proposal in this series is to construct a device
tree exactly the same way it's constructed by (for example) uboot for both
x86 and arm64.
So, if we think about kexec as a bootloader, why can uboot pass a resource
to the new kernel while the previous kernel can't do the same, and why
would that be considered an additional burden?

Thanks,
Stanislav


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
  2023-09-28 19:16                 ` Dave Hansen
@ 2023-09-28  2:46                   ` Stanislav Kinsburskii
  2023-09-29 10:13                     ` Shutemov, Kirill
       [not found]                   ` <64208.123092816192300612@us-mta-483.us.mimecast.lan>
  1 sibling, 1 reply; 27+ messages in thread
From: Stanislav Kinsburskii @ 2023-09-28  2:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Baoquan He, tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm,
	akpm, stanislav.kinsburskii, corbet, linux-kernel, kexec,
	linux-mm, kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini,
	Shutemov, Kirill

On Thu, Sep 28, 2023 at 12:16:31PM -0700, Dave Hansen wrote:
> On 9/27/23 17:38, Stanislav Kinsburskii wrote:
> > On Thu, Sep 28, 2023 at 11:00:12AM -0700, Dave Hansen wrote:
> >> On 9/27/23 17:02, Stanislav Kinsburskii wrote:
> >>> On Thu, Sep 28, 2023 at 10:29:32AM -0700, Dave Hansen wrote:
> >> ...
> >>> Well, not exactly. That's something I'd like to have indeed, but from my
> >>> POV this goal is out of scope of discussion at the moment.
> >>> Let me try to express it the same way you did above:
> >>>
> >>> 1. Boot some kernel
> >>> 2. Grow the deposited memory a bunch
> >>> 3. Kexec
> >>> 4. Kernel panic due to GPF upon accessing the memory deposited to
> >>> hypervisor.
> >>
> >> I basically consider this a bug in the first kernel.  It *can't* kexec
> >> when it's left RAM in shambles.  It doesn't know what features the new
> >> kernel has and whether this is even safe.
> >>
> > 
> > Could you elaborate more on why this is a bug in the first kernel?
> > Say, kernel memory can be allocated in big physically consecutive
> > chunks by the first kernel for depositing. The information about these
> > chunks is then passed to the second kernel via FDT or even the command
> > line, so the second kernel can reserve this region during booting.
> > What's wrong with this approach?
> 
> How do you know the second kernel can parse the FDT entry or the
> command-line you pass to it?
> 
> >> Can the new kernel even read the new device tree data?
> > 
> > I'm not sure I understand the question, to be honest.
> > Why can't it? This series contains code parts for both the first and
> > second kernels.
> 
> How do you know the second kernel isn't the version *before* this series
> gets merged?
> 

The answer to both questions above is the following: the feature is
deployed fleet-wide first, and enabled only upon the next deployment.
It's worth mentioning that fleet-wide deployments usually don't need to
support updates to a version older than the previous one.
Also, since kexec is initiated by user space, it can always be enlightened
about kernel capabilities and simply not kexec to an incompatible kernel
version.
One more bit to mention: in real life this problem exists only during the
initial transition, as once the upgrade to a kernel with the feature has
happened, there won't be a revert to a version without it.

> ...
> >> I still think the only way this will possibly work when kexec'ing both
> >> old and new kernels is to do it with the memory maps that *all* kernels
> >> can read.
> > 
> > Could you elaborate more on this?
> > The available memory map actually stays the same for both kernels. The
> > difference here can be a different list of memory regions to reserve:
> > when the first kernel has allocated and deposited another chunk, the
> > second kernel needs to reserve this memory as a new region upon booting.
> 
> Please take a step back from your implementation for a moment.  There
> are two basic design points that need to be considered.
> 
> First, *must* "System RAM" (according to the memory map) be persisted
> across kexec?  If no, then there's no problem to solve and we can stop
> this thread.  If yes, then some mechanism must be used to tell the new
> kernel that the "System RAM" in the memory map is not normal RAM.
> 
> Second, *if* we agree that some data must be communicated across kexec, then
> what mechanism should be used?  You're arguing for a new mechanism that
> only new kernels can use.  I'm arguing that you should likely reuse an
> existing mechanism (probably the UEFI/e820 maps) so that *ALL* kernels
> can consume the information, old and new.
> 

I'd answer yes, "System MAP" must be persisted across kexec.
Could you elaborate on why there should be a mechanism to tell the
kernel anything special about the existing "System map" in this context?
Say, one can reserve a CMA region (or a crash kernel region, etc), store
there some data, and then pass it across kexec. A reserved CMA region will
still be a part of the "System MAP", won't it?

Regarding the communication mechanism, device tree is indeed not the only
one.
However, could you elaborate on how an e820 extension can help to
communicate things here without introducing a new ABI?
And if it can't be done without a new ABI, then why is an e820 extension
better than a device tree extension? AFAIU e820 isn't really designed to
pass arbitrary data bits.
Are you suggesting introducing another e820_type like E820_TYPE_PMPOOL?

> I'm not convinced that this series is going in the right direction on
> either of those points.
> 

I understand the skepticism. I appreciate your efforts in helping to
find a solution.

> > Can all this be considered as, say, the first kernel using a device tree
> > to inform the second kernel about the memory regions to reserve?
> > In this case the first kernel behaves a bit like a piece of firmware for
> > the second one.
> > 
> >> Can the hypervisor be improved to make this release operation faster?
> > 
> > I guess it can, but shutting down guests contributes to downtime the
> > most. And without shutting down the guests the deposited memory can't be
> > withdrawn.
> 
> Do you really need to fully shut down each guest?  Or do you just need
> to get them to a quiescent state where the hypervisor and devices aren't
> writing to the deposited memory?

Unfortunately, quiescing is not enough, as the guest-related state in the
root partition will still exist in the hypervisor.

The way it works right now is that the hypervisor can return an
"ENOMEM"-like error upon a guest-altering hypercall from the root
partition (like partition creation or device addition), and then Linux
deposits more memory to the hypervisor. IOW, while the guest is running,
the corresponding root partition pages are "used" by the hypervisor and
can't be withdrawn.

Also, guest quiescing itself isn't mandatory with type 1 hypervisors, as a
guest can be scheduled by the hypervisor without VMM support, and VM exits
can be trapped at the hypervisor level using that persistent guest-related
state in the root partition. The VMM can then reattach to the persistent
state after kexec.
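
As a rough illustration of the grow-on-demand flow described above (the
function names and constants below are placeholders, not the actual
Hyper-V interfaces):

#define DEPOSIT_BATCH_PAGES	256

/*
 * Sketch only: retry a guest-altering hypercall, depositing more root
 * partition pages whenever the hypervisor reports it is out of memory.
 * Pages deposited this way stay "used" by the hypervisor, including
 * across kexec, until they are explicitly withdrawn.
 */
static int partition_call_with_deposit(u64 partition_id, void *args)
{
	int ret;

	for (;;) {
		/* hypothetical wrapper around a guest-altering hypercall */
		ret = hv_partition_call(partition_id, args);
		if (ret != -ENOMEM)
			return ret;

		/* allocate persistent pages, deposit them, then retry */
		ret = hv_deposit_pages(partition_id, DEPOSIT_BATCH_PAGES);
		if (ret)
			return ret;
	}
}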

Thanks,
Stanislav


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
  2023-09-28 18:00             ` Dave Hansen
@ 2023-09-28  0:38               ` Stanislav Kinsburskii
  2023-09-28 19:16                 ` Dave Hansen
  0 siblings, 1 reply; 27+ messages in thread
From: Stanislav Kinsburskii @ 2023-09-28  0:38 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Baoquan He, tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm,
	akpm, stanislav.kinsburskii, corbet, linux-kernel, kexec,
	linux-mm, kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini,
	Shutemov, Kirill

On Thu, Sep 28, 2023 at 11:00:12AM -0700, Dave Hansen wrote:
> On 9/27/23 17:02, Stanislav Kinsburskii wrote:
> > On Thu, Sep 28, 2023 at 10:29:32AM -0700, Dave Hansen wrote:
> ...
> > Well, not exactly. That's something I'd like to have indeed, but from my
> > POV this goal is out of scope of discussion at the moment.
> > Let me try to express it the same way you did above:
> > 
> > 1. Boot some kernel
> > 2. Grow the deposited memory a bunch
> > 3. Kexec
> > 4. Kernel panic due to GPF upon accessing the memory deposited to
> > hypervisor.
> 
> I basically consider this a bug in the first kernel.  It *can't* kexec
> when it's left RAM in shambles.  It doesn't know what features the new
> kernel has and whether this is even safe.
> 

Could you elaborate more on why this is a bug in the first kernel?
Say, kernel memory can be allocated in big physically consecutive
chunks by the first kernel for depositing. The information about these
chunks is then passed to the second kernel via FDT or even the command
line, so the second kernel can reserve this region during booting.
What's wrong with this approach?

> Can the new kernel even read the new device tree data?
> 

I'm not sure I understand the question, to be honest.
Why can't it? This series contains code parts for both the first and
second kernels.

> >> Can't the deposited memory just be shrunk before kexec?  Surely there
> >> aren't a bunch of pathological things consuming that memory right before
> >> kexec, which is basically a reboot.
> > 
> > In general it can. But for this to happen hypervisor needs to release
> > this memory. And it can release the memory iff the guests are stopped.
> > And stopping the guests during kexec isn't something we want to have in the
> > long run.
> > Also, even if we stop the guests before kexec, we need to restart them
> > after boot, meaning we have to deposit the pages once again.
> > All of this: stopping the guests, withdrawing the pages upon kexec,
> > allocating after boot and depositing them again, significantly affects
> > guest downtime.
> 
> Ahh, and you're presumably kexec'ing in the first place because you've
> got a bug in the first kernel and you want a second kernel with fewer bugs.
> 

Right. All this is for "kernel servicing" purposes, when kexec is used
to update the kernel in a fleet in an attempt to reduce user downtime
as much as possible.
I'm sorry for keeping this bit of context to myself instead of
explicitly stating it in the series description: it wasn't intentional.

> I still think the only way this will possibly work when kexec'ing both
> old and new kernels is to do it with the memory maps that *all* kernels
> can read.
> 

Could you elaborate more on this?
The available memory map actually stays the same for both kernels. The
difference here can be a different list of memory regions to reserve:
when the first kernel has allocated and deposited another chunk, the
second kernel needs to reserve this memory as a new region upon booting.

Can all this be considered as, say, the first kernel using a device tree
to inform the second kernel about the memory regions to reserve?
In this case the first kernel behaves a bit like a piece of firmware for
the second one.

> Can the hypervisor be improved to make this release operation faster?

I guess it can, but shutting down guests contributes to downtime the
most. And without shutting down the guests the deposited memory can't be
withdrawn.

Thanks,
Stanislav


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
  2023-09-28 17:29         ` Dave Hansen
@ 2023-09-28  0:02           ` Stanislav Kinsburskii
  2023-09-28 18:00             ` Dave Hansen
  0 siblings, 1 reply; 27+ messages in thread
From: Stanislav Kinsburskii @ 2023-09-28  0:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Baoquan He, tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm,
	akpm, stanislav.kinsburskii, corbet, linux-kernel, kexec,
	linux-mm, kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini

On Thu, Sep 28, 2023 at 10:29:32AM -0700, Dave Hansen wrote:
> On 9/27/23 16:25, Stanislav Kinsburskii wrote:
> > On Thu, Sep 28, 2023 at 06:22:54AM -0700, Dave Hansen wrote:
> >> On 9/27/23 09:13, Stanislav Kinsburskii wrote:
> >>> Once deposited, these pages can't be accessed by Linux anymore and thus
> >>> must be preserved in "used" state across kexec, as hypervisor state is
> >>> unaware of kexec.
> >>
> >> If Linux can't access them, they're not RAM any more.  I'd much rather
> >> remove them from the memory map and move on with life rather than
> >> implement a bunch of new ABI that's got to be handed across kernels.
> > 
> > Could you elaborate more on the new ABIs? FDT is handled by x86 already,
> > and passing it over kexec looks like a natural extension.
> > Also, adding more state to it doesn't look like a new ABI.
> > Or does it?
> 
> FDT makes it easier to pass arbitrary data around, but you're still
> creating a new "default_pmpool" device tree node on one end and
> consuming it on the other.  That's a new ABI in my book.
> 

Well, then yes, it's a new ABI.
I guess it could still be named "linux,cma", but then another compatible
value needs to be introduced, and that's again a new ABI, isn't it?
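
Just to make the ABI surface concrete, a minimal sketch of recording the
pool in the FDT handed over kexec could look roughly like this (the node
and property names are made up for illustration, not the ones used in the
series):

#include <linux/libfdt.h>

/*
 * Sketch only: describe the persistent pool's placement in the FDT
 * passed to the next kernel so it can re-reserve the region early in
 * boot. Node and property names here are placeholders.
 */
static int pmpool_fdt_record(void *fdt, u64 base, u64 size)
{
	int node, err;

	node = fdt_add_subnode(fdt, 0, "persistent-pool");
	if (node < 0)
		return node;

	err = fdt_setprop_string(fdt, node, "compatible", "linux,pmpool");
	if (!err)
		err = fdt_setprop_u64(fdt, node, "base", base);
	if (!err)
		err = fdt_setprop_u64(fdt, node, "size", size);

	return err;
}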

> > Let me also comment on removing these regions from the memory map. The
> > major peculiarity here is that the hypervisor distinguishes between the
> > pages deposited for guests to run and the pages deposited for the Linux root
> > partition to keep the guest-related portion of hypervisor state in the
> > root partition. And the latter is the matter in question.
> > 
> > We can indeed isolate and deposit an excessive amount of memory upfront
> > in hope that hypervisor will never get into the situation, when it needs
> > more memory.
> > However, it's not reliable, as the amount of memory will always be an
> > estimation, depending on the number of expected guests, guest-attached
> > devices, etc. And this becomes even a bigger problem when most of the
> > memory is already removed from the memory map to host guest partitions.
> > It's also not efficient as the amount of memory required by hypervisor
> > can grow or shrink depending on the use case or host configuration, and
> > depositing an excessive amount of memory will be a waste.
> > 
> > But, actually, the idea of removing the pages from memory map was
> > reflected to some extent in the first version of this proposal,
> > so let me elaborate on it a bit.
> > 
> > Effectively, instead of reserving and depositing a lot of memory to
> > hypervisor upfront, the memory can be allocated from kernel memory when
> > needed and then returned back when unused.
> > This would still require pages removal from the memory map upon kexec,
> > but that's another problem.
> 
> Let's distill this down a bit.
> 
> I agree that it's a waste to reserve an obscene amount of memory up
> front for all guests for rare cases.  Having the amount of consumed
> memory grow is a nice feature.
> 
> You can also quite easily *shrink* the amount of memory on a given
> kernel without new code.  Right?
> 
> The problem comes when you've grown the footprint of hypervisor-donated
> memory, kexec, and *THEN* want to shrink it.  That's what needs new
> metadata to be communicated over to the new kernel.
> 
> 1. Boot some kernel
> 2. Grow the deposited memory a bunch
> 3. Kexec
> 4. Shrink the deposited memory
> 
> Right?
> 

Well, not exactly. That's something I'd like to have indeed, but from my
POV this goal is out of scope of discussion at the moment.
Let me try to express it the same way you did above:

1. Boot some kernel
2. Grow the deposited memory a bunch
3. Kexec
4. Kernel panic due to GPF upon accessing the memory deposited to the
hypervisor.

> That's where you lose me.
> 
> Can't the deposited memory just be shrunk before kexec?  Surely there
> aren't a bunch of pathological things consuming that memory right before
> kexec, which is basically a reboot.

In general it can. But for this to happen the hypervisor needs to release
this memory, and it can release the memory only if the guests are stopped.
Stopping the guests during kexec isn't something we want to have in the
long run.
Also, even if we stop the guests before kexec, we need to restart them
after boot, meaning we have to deposit the pages once again.
All of this: stopping the guests, withdrawing the pages upon kexec,
allocating after boot and depositing them again, significantly affects
guest downtime.

Thanks,
Stanislav


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
  2023-09-28 13:22     ` Dave Hansen
@ 2023-09-27 23:25       ` Stanislav Kinsburskii
  2023-09-28 17:29         ` Dave Hansen
  2023-09-28 17:35       ` David Hildenbrand
  1 sibling, 1 reply; 27+ messages in thread
From: Stanislav Kinsburskii @ 2023-09-27 23:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Baoquan He, tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm,
	akpm, stanislav.kinsburskii, corbet, linux-kernel, kexec,
	linux-mm, kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini

On Thu, Sep 28, 2023 at 06:22:54AM -0700, Dave Hansen wrote:
> On 9/27/23 09:13, Stanislav Kinsburskii wrote:
> > Once deposited, these pages can't be accessed by Linux anymore and thus
> > must be preserved in "used" state across kexec, as hypervisor state is
> > unaware of kexec.
> 
> If Linux can't access them, they're not RAM any more.  I'd much rather
> remove them from the memory map and move on with life rather than
> implement a bunch of new ABI that's got to be handed across kernels.

Could you elaborate more on the new ABIs? FDT is handled by x86 already,
and passing it over kexec looks like a natural extension.
Also, adding more state to it doesn't look like a new ABI.
Or does it?

Let me also comment on removing these regions from the memory map. The
major peculiarity here is that the hypervisor distinguishes between the
pages deposited for guests to run and the pages deposited for the Linux
root partition to keep the guest-related portion of hypervisor state in
the root partition. And the latter is the matter in question.

We can indeed isolate and deposit an excessive amount of memory upfront
in the hope that the hypervisor will never get into the situation where
it needs more memory.
However, it's not reliable, as the amount of memory will always be an
estimation, depending on the number of expected guests, guest-attached
devices, etc. And this becomes an even bigger problem when most of the
memory is already removed from the memory map to host guest partitions.
It's also not efficient, as the amount of memory required by the
hypervisor can grow or shrink depending on the use case or host
configuration, and depositing an excessive amount of memory would be a
waste.

But, actually, the idea of removing the pages from the memory map was
reflected to some extent in the first version of this proposal, so let me
elaborate on it a bit.

Effectively, instead of reserving and depositing a lot of memory to the
hypervisor upfront, the memory can be allocated from kernel memory when
needed and then returned when unused.
This would still require removing pages from the memory map upon kexec,
but that's another problem.

Thanks,
Stanislav



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
  2023-09-28 10:25     ` Baoquan He
@ 2023-09-27 22:44       ` Stanislav Kinsburskii
  2023-09-28 17:29       ` David Hildenbrand
  1 sibling, 0 replies; 27+ messages in thread
From: Stanislav Kinsburskii @ 2023-09-27 22:44 UTC (permalink / raw)
  To: Baoquan He
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
	stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
	kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini, david

On Thu, Sep 28, 2023 at 06:25:44PM +0800, Baoquan He wrote:
> On 09/27/23 at 09:13am, Stanislav Kinsburskii wrote:
> > On Wed, Sep 27, 2023 at 01:44:38PM +0800, Baoquan He wrote:
> > > Hi Stanislav,
> > > 
> > > On 09/25/23 at 02:27pm, Stanislav Kinsburskii wrote:
> > > > This patch introduces a memory allocator specifically tailored for
> > > > persistent memory within the kernel. The allocator maintains
> > > > kernel-specific states like DMA passthrough device states, IOMMU state, and
> > > > more across kexec.
> > > 
> > > Can you give more details about how this persistent memory pool will be
> > > utilized in an actual scenario? I mean, what problem have you met so that
> > > you have to introduce persistent memory pool to solve it?
> > > 
> > 
> > The major reason we have at the moment, is that Linux root partition
> > running on top of the Microsoft hypervisor needs to deposit pages to
> > the hypervisor at runtime, when the hypervisor runs out of memory.
> > "Depositing" here means, that Linux passes a set of its PFNs to the
> > hypervisor via hypercall, and hypervisor then uses these pages for its
> > own needs.
> > 
> > Once deposited, these pages can't be accessed by Linux anymore and thus
> > must be preserved in "used" state across kexec, as hypervisor state is
> > unaware of kexec. At the same time, these pages can be withdrawn when
> > unused. Thus, an allocator persistent across kexec looks reasonable for
> > this particular matter.
> 
> Thanks for these details.
>  
> The deposit and withdraw remind me of the balloon driver, David's
> virtio-mem, and DLPAR on ppc, which can dynamically increase or shrink
> physical memory on a guest OS. Can't the Microsoft hypervisor do a
> similar thing to reclaim or give back the memory from or to the 'Linux
> root partition' running on top of the hypervisor?
> 

Although the Microsoft hypervisor is a type 1 hypervisor and runs on the
physical hardware, like Xen, it doesn't control all the memory, but is
rather granted memory by either the boot loader or the Linux root
partition (a similarly privileged VM is called "Dom0" in the Xen world).
IOW, this works in the opposite direction: Linux gives memory to the
hypervisor and can reclaim it back. However, doing so on kexec increases
downtime, as withdrawn pages must be deposited back again after booting
to restore the guests ("DomU" in Xen terminology).

It's worth mentioning that the "deposited pages" in this context don't
mean guest pages, but the pages required by the hypervisor to store the
Linux root partition state used to control guest partitions.

Also, reclaiming pages is not possible if guests are left running during
kexec, as the hypervisor requires the Linux root partition-related state
to be kept intact to keep the guest state consistent.

> Thanks
> Baoquan


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
  2023-09-27  5:44 ` [RFC PATCH v2 0/7] Introduce persistent memory pool Baoquan He
@ 2023-09-27 16:13   ` Stanislav Kinsburskii
  2023-09-28 13:22     ` Dave Hansen
       [not found]   ` <58146.123092712145601339@us-mta-73.us.mimecast.lan>
  1 sibling, 1 reply; 27+ messages in thread
From: Stanislav Kinsburskii @ 2023-09-27 16:13 UTC (permalink / raw)
  To: Baoquan He
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
	stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
	kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini

On Wed, Sep 27, 2023 at 01:44:38PM +0800, Baoquan He wrote:
> Hi Stanislav,
> 
> On 09/25/23 at 02:27pm, Stanislav Kinsburskii wrote:
> > This patch introduces a memory allocator specifically tailored for
> > persistent memory within the kernel. The allocator maintains
> > kernel-specific states like DMA passthrough device states, IOMMU state, and
> > more across kexec.
> 
> Can you give more details about how this persistent memory pool will be
> utilized in an actual scenario? I mean, what problem have you met so that
> you have to introduce persistent memory pool to solve it?
> 

The major reason we have at the moment is that the Linux root partition
running on top of the Microsoft hypervisor needs to deposit pages to the
hypervisor at runtime, when the hypervisor runs out of memory.
"Depositing" here means that Linux passes a set of its PFNs to the
hypervisor via a hypercall, and the hypervisor then uses these pages for
its own needs.

Once deposited, these pages can't be accessed by Linux anymore and thus
must be preserved in "used" state across kexec, as hypervisor state is
unaware of kexec. At the same time, these pages can be withdrawn when
unused. Thus, an allocator persistent across kexec looks reasonable for
this particular matter.

Also, the last patch in the series is aimed at demonstrating the usage
described above.
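
As a rough illustration of the flow described above (none of the names
below are the series' actual API; they are placeholders):

#include <linux/mm.h>

#define DEPOSIT_BATCH	64

/*
 * Sketch only: pages handed to the hypervisor come from an allocator
 * whose state survives kexec, so after kexec they are still accounted
 * as "used" and are never handed back to the kernel's page allocator.
 */
static int deposit_more_memory(void)
{
	u64 pfns[DEPOSIT_BATCH];
	struct page *page;
	int i;

	for (i = 0; i < DEPOSIT_BATCH; i++) {
		page = pmpool_alloc_page();	/* persistent across kexec */
		if (!page)
			return -ENOMEM;
		pfns[i] = page_to_pfn(page);
	}

	/* pass the PFN list to the hypervisor via the deposit hypercall */
	return hv_deposit_memory(pfns, DEPOSIT_BATCH);
}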

Thanks,
Stanislav

> Thanks
> Baoquan
> 
> > 
> > The current implementation provides a foundation for custom solutions that
> > may be developed in the future. Although the design is kept concise and
> > straightforward to encourage discussion and feedback, it remains fully
> > functional.
> > 
> > The persistent memory pool builds upon the continuous memory allocator
> > (CMA) and ensures CMA state persistency across kexec by incorporating the
> > CMA bitmap into the memory region instead of allocation it from kernel
> > memory.
> > 
> > Persistent memory pool metadata is passed across kexec by using Flattened
> > Device Tree, which is added as another kexec segment for x86 architecture.
> > 
> > Potential applications include:
> > 
> >   1. Enabling various in-kernel entities to allocate persistent pages from
> >      a unified memory pool, obviating the need for reserving multiple
> >      regions.
> > 
> >   2. For in-kernel components that need the allocation address to be
> >      retained on kernel kexec, this address can be exposed to user space
> >      and subsequently passed through the command line.
> > 
> >   3. Distinct subsystems or drivers can set aside their region, allocating
> >      a segment for their persistent memory pool, suitable for uses such as
> >      file systems, key-value stores, and other applications.
> > 
> > Notes:
> > 
> >   1. The last patch of the series represents a use case for the feature.
> >      However, the patch won't compile and is for illustrative purposes only
> >      as the code being patched hasn't been merged yet.
> > 
> >   2. The code being patched is currently under review by the community. The
> >      series is named "Introduce /dev/mshv drivers":
> > 
> >          https://lkml.org/lkml/2023/9/22/1117
> > 
> > 
> > Changes since v1:
> > 
> >   1. Persistent memory pool is now a wrapper on top of CMA instead of being a
> >      new allocator.
> > 
> >   2. Persistent memory pool metadata doesn't belong to the pool anymore and
> >      is now passed via Flattened Device Tree instead over kexec to the new
> >      kernel.
> > 
> > The following series implements...
> > 
> > ---
> > 
> > Stanislav Kinsburskii (7):
> >       kexec_file: Add fdt modification callback support
> >       x86: kexec: Transfer existing fdt to the new kernel
> >       x86: kexec: Enable fdt modification in callbacks
> >       pmpool: Introduce persistent memory pool
> >       pmpool: Update device tree on kexec
> >       pmpool: Restore state from device tree post-kexec
> >       Drivers: hv: Allocate persistent pages for root partition
> > 
> > 
> >  arch/x86/Kconfig                  |   16 +++
> >  arch/x86/kernel/kexec-bzimage64.c |   97 +++++++++++++++++
> >  drivers/hv/hv_common.c            |   13 ++
> >  include/linux/kexec.h             |    7 +
> >  include/linux/pmpool.h            |   22 ++++
> >  kernel/kexec_file.c               |   24 ++++
> >  mm/Kconfig                        |    9 ++
> >  mm/Makefile                       |    1 
> >  mm/pmpool.c                       |  208 +++++++++++++++++++++++++++++++++++++
> >  9 files changed, 394 insertions(+), 3 deletions(-)
> >  create mode 100644 include/linux/pmpool.h
> >  create mode 100644 mm/pmpool.c
> > 
> > 
> > _______________________________________________
> > kexec mailing list
> > kexec@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/kexec
> > 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
       [not found] <01828.123092517290700465@us-mta-156.us.mimecast.lan>
@ 2023-09-27  5:44 ` Baoquan He
  2023-09-27 16:13   ` Stanislav Kinsburskii
       [not found]   ` <58146.123092712145601339@us-mta-73.us.mimecast.lan>
  0 siblings, 2 replies; 27+ messages in thread
From: Baoquan He @ 2023-09-27  5:44 UTC (permalink / raw)
  To: Stanislav Kinsburskii
  Cc: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
	stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
	kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini

Hi Stanislav,

On 09/25/23 at 02:27pm, Stanislav Kinsburskii wrote:
> This patch introduces a memory allocator specifically tailored for
> persistent memory within the kernel. The allocator maintains
> kernel-specific states like DMA passthrough device states, IOMMU state, and
> more across kexec.

Can you give more details about how this persistent memory pool will be
utilized in an actual scenario? I mean, what problem have you met so that
you have to introduce persistent memory pool to solve it?

Thanks
Baoquan

> 
> The current implementation provides a foundation for custom solutions that
> may be developed in the future. Although the design is kept concise and
> straightforward to encourage discussion and feedback, it remains fully
> functional.
> 
> The persistent memory pool builds upon the continuous memory allocator
> (CMA) and ensures CMA state persistency across kexec by incorporating the
> CMA bitmap into the memory region instead of allocation it from kernel
> memory.
> 
> Persistent memory pool metadata is passed across kexec by using Flattened
> Device Tree, which is added as another kexec segment for x86 architecture.
> 
> Potential applications include:
> 
>   1. Enabling various in-kernel entities to allocate persistent pages from
>      a unified memory pool, obviating the need for reserving multiple
>      regions.
> 
>   2. For in-kernel components that need the allocation address to be
>      retained on kernel kexec, this address can be exposed to user space
>      and subsequently passed through the command line.
> 
>   3. Distinct subsystems or drivers can set aside their region, allocating
>      a segment for their persistent memory pool, suitable for uses such as
>      file systems, key-value stores, and other applications.
> 
> Notes:
> 
>   1. The last patch of the series represents a use case for the feature.
>      However, the patch won't compile and is for illustrative purposes only
>      as the code being patched hasn't been merged yet.
> 
>   2. The code being patched is currently under review by the community. The
>      series is named "Introduce /dev/mshv drivers":
> 
>          https://lkml.org/lkml/2023/9/22/1117
> 
> 
> Changes since v1:
> 
>   1. Persistent memory pool is now a wrapper on top of CMA instead of being a
>      new allocator.
> 
>   2. Persistent memory pool metadata doesn't belong to the pool anymore and
>      is now passed via Flattened Device Tree instead over kexec to the new
>      kernel.
> 
> The following series implements...
> 
> ---
> 
> Stanislav Kinsburskii (7):
>       kexec_file: Add fdt modification callback support
>       x86: kexec: Transfer existing fdt to the new kernel
>       x86: kexec: Enable fdt modification in callbacks
>       pmpool: Introduce persistent memory pool
>       pmpool: Update device tree on kexec
>       pmpool: Restore state from device tree post-kexec
>       Drivers: hv: Allocate persistent pages for root partition
> 
> 
>  arch/x86/Kconfig                  |   16 +++
>  arch/x86/kernel/kexec-bzimage64.c |   97 +++++++++++++++++
>  drivers/hv/hv_common.c            |   13 ++
>  include/linux/kexec.h             |    7 +
>  include/linux/pmpool.h            |   22 ++++
>  kernel/kexec_file.c               |   24 ++++
>  mm/Kconfig                        |    9 ++
>  mm/Makefile                       |    1 
>  mm/pmpool.c                       |  208 +++++++++++++++++++++++++++++++++++++
>  9 files changed, 394 insertions(+), 3 deletions(-)
>  create mode 100644 include/linux/pmpool.h
>  create mode 100644 mm/pmpool.c
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
> 



^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2023-09-29 16:09 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-25 21:27 [RFC PATCH v2 0/7] Introduce persistent memory pool Stanislav Kinsburskii
2023-09-25 21:27 ` [RFC PATCH v2 1/7] kexec_file: Add fdt modification callback support Stanislav Kinsburskii
2023-09-25 21:27 ` [RFC PATCH v2 2/7] x86: kexec: Transfer existing fdt to the new kernel Stanislav Kinsburskii
2023-09-25 21:28 ` [RFC PATCH v2 3/7] x86: kexec: Enable fdt modification in callbacks Stanislav Kinsburskii
2023-09-25 21:28 ` [RFC PATCH v2 4/7] pmpool: Introduce persistent memory pool Stanislav Kinsburskii
2023-09-25 21:28 ` [RFC PATCH v2 5/7] pmpool: Update device tree on kexec Stanislav Kinsburskii
2023-09-25 21:28 ` [RFC PATCH v2 6/7] pmpool: Restore state from device tree post-kexec Stanislav Kinsburskii
2023-09-25 21:28 ` [RFC PATCH v2 7/7] Drivers: hv: Allocate persistent pages for root partition Stanislav Kinsburskii
     [not found] <01828.123092517290700465@us-mta-156.us.mimecast.lan>
2023-09-27  5:44 ` [RFC PATCH v2 0/7] Introduce persistent memory pool Baoquan He
2023-09-27 16:13   ` Stanislav Kinsburskii
2023-09-28 13:22     ` Dave Hansen
2023-09-27 23:25       ` Stanislav Kinsburskii
2023-09-28 17:29         ` Dave Hansen
2023-09-28  0:02           ` Stanislav Kinsburskii
2023-09-28 18:00             ` Dave Hansen
2023-09-28  0:38               ` Stanislav Kinsburskii
2023-09-28 19:16                 ` Dave Hansen
2023-09-28  2:46                   ` Stanislav Kinsburskii
2023-09-29 10:13                     ` Shutemov, Kirill
2023-09-28  9:16                       ` Stanislav Kinsburskii
     [not found]                   ` <64208.123092816192300612@us-mta-483.us.mimecast.lan>
2023-09-28 23:56                     ` Baoquan He
2023-09-28  7:18                       ` Stanislav Kinsburskii
2023-09-28 17:35       ` David Hildenbrand
2023-09-28 17:37         ` Dave Hansen
     [not found]   ` <58146.123092712145601339@us-mta-73.us.mimecast.lan>
2023-09-28 10:25     ` Baoquan He
2023-09-27 22:44       ` Stanislav Kinsburskii
2023-09-28 17:29       ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox