* [RFC PATCH v3 0/3] Introduce persistent memory pool
@ 2023-10-04 22:23 Stanislav Kinsburskii
2023-10-04 22:23 ` [RFC PATCH v3 1/3] x86/boot/e820: Expose kexec range update, remove and table update functions Stanislav Kinsburskii
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Stanislav Kinsburskii @ 2023-10-04 22:23 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini, bhe,
dave.hansen, kirill.shutemov
This series introduces a memory allocator specifically tailored for
persistent memory within the kernel. The allocator preserves
kernel-specific state, such as DMA passthrough device state and IOMMU
state, across kexec.
The current implementation provides a foundation for custom solutions that
may be developed in the future. Although the design is kept concise and
straightforward to encourage discussion and feedback, it remains fully
functional.
The immediate need for the allocator is to persist kernel pages deposited
into the Microsoft Hypervisor across kexec: these pages must not be
accessed by the kernel while deposited, but they can be withdrawn and
released back to the kernel. Kexec, in turn, is used for servicing and is
aimed at minimizing service downtime during kernel upgrades across a fleet
of machines.
The persistent memory pool builds upon the contiguous memory allocator
(CMA) and ensures CMA state persistency across kexec by incorporating the
CMA bitmap into the memory region instead of allocating it from kernel
memory.
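As a rough illustration of what keeping the CMA bitmap inside the region
costs, the userspace sketch below computes how many bytes and pages such a
bitmap would occupy at the start of a pool, and marks those pages as in use
in the bitmap itself. It assumes one bit per 4 KiB page; all names
(pool_bitmap_bytes(), pool_bitmap_pages(), pool_self_reserve(),
pool_page_is_used()) are hypothetical and are not part of the series.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

/* Assumption: 4 KiB pages and one bitmap bit per page. */
#define SK_PAGE_SHIFT    12UL
#define SK_PAGE_SIZE     (1UL << SK_PAGE_SHIFT)
#define SK_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Bytes needed for the bitmap, rounded up to whole longs. */
static unsigned long pool_bitmap_bytes(unsigned long pool_pages)
{
	unsigned long longs = (pool_pages + SK_BITS_PER_LONG - 1) / SK_BITS_PER_LONG;

	return longs * sizeof(unsigned long);
}

/* Pages at the start of the pool that the bitmap itself occupies. */
static unsigned long pool_bitmap_pages(unsigned long pool_pages)
{
	return (pool_bitmap_bytes(pool_pages) + SK_PAGE_SIZE - 1) >> SK_PAGE_SHIFT;
}

/* Zero the whole bitmap, then mark the pages holding it as allocated. */
static void pool_self_reserve(unsigned long *bitmap, unsigned long pool_pages)
{
	unsigned long i, own = pool_bitmap_pages(pool_pages);

	assert(bitmap != NULL);
	memset(bitmap, 0, pool_bitmap_bytes(pool_pages));
	for (i = 0; i < own; i++)
		bitmap[i / SK_BITS_PER_LONG] |= 1UL << (i % SK_BITS_PER_LONG);
}

/* True if page number 'page' of the pool is marked as allocated. */
static bool pool_page_is_used(const unsigned long *bitmap, unsigned long page)
{
	return (bitmap[page / SK_BITS_PER_LONG] >> (page % SK_BITS_PER_LONG)) & 1UL;
}
```

Under these assumptions a 1 GiB pool (262144 pages) needs a 32768-byte
bitmap, so the first 8 pages of the pool are consumed by the bitmap and are
never handed out to callers.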
Persistent memory pool metadata is passed across kexec by using Flattened
Device Tree, which is added as another kexec segment for x86 architecture.
Potential applications include:
1. Enabling various in-kernel entities to allocate persistent pages from
a unified memory pool, obviating the need for reserving multiple
regions.
2. For in-kernel components that need the allocation address to be
retained on kernel kexec, this address can be exposed to user space
and subsequently passed through the command line.
3. Distinct subsystems or drivers can set aside their region, allocating
a segment for their persistent memory pool, suitable for uses such as
file systems, key-value stores, and other applications.
Changes since v2:
1. Device tree-related changes are removed.
2. The persistent memory pool region is marked as "reserved by kernel" in
the kexec e820 table, which indicates to the new kernel that the pool
must be restored.
Changes since v1:
1. Persistent memory pool is now a wrapper on top of CMA instead of being a
new allocator.
2. Persistent memory pool metadata doesn't belong to the pool anymore and
is now passed over kexec to the new kernel via Flattened Device Tree
instead.
The following series implements...
---
Stanislav Kinsburskii (3):
x86/boot/e820: Expose kexec range update, remove and table update functions
pmpool: Introduce persistent memory pool
pmpool: Mark reserved range as "kernel reserved" in kexec e820 table
arch/x86/include/asm/e820/api.h | 4 +
arch/x86/kernel/e820.c | 21 ++++-
include/linux/pmpool.h | 22 +++++
mm/Kconfig | 8 ++
mm/Makefile | 1
mm/pmpool.c | 159 +++++++++++++++++++++++++++++++++++++++
6 files changed, 209 insertions(+), 6 deletions(-)
create mode 100644 include/linux/pmpool.h
create mode 100644 mm/pmpool.c
* [RFC PATCH v3 1/3] x86/boot/e820: Expose kexec range update, remove and table update functions
From: Stanislav Kinsburskii @ 2023-10-04 22:23 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini, bhe,
dave.hansen, kirill.shutemov
These functions are to be used by other kernel subsystems to reserve
memory regions in the kexec kernel.
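To make the intent concrete, here is a simplified userspace model of what a
range update does to a memory map: it retypes a range and splits any entry
that only partly overlaps it. This is a sketch under stated assumptions, not
the kernel's __e820__range_update(); the types and helpers (sk_table,
sk_add(), sk_range_update(), sk_type_at()) are hypothetical, and the model
neither sorts nor merges entries the way e820__update_table() does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the e820 entry types. */
enum sk_type { SK_NONE, SK_RAM, SK_RESERVED_KERN };

struct sk_entry { uint64_t addr, size; enum sk_type type; };

#define SK_MAX_ENTRIES 16
struct sk_table { unsigned int nr; struct sk_entry e[SK_MAX_ENTRIES]; };

static void sk_add(struct sk_table *t, uint64_t addr, uint64_t size, enum sk_type type)
{
	assert(t->nr < SK_MAX_ENTRIES);
	t->e[t->nr++] = (struct sk_entry){ addr, size, type };
}

/*
 * Retype [start, start + size) from old_type to new_type, splitting
 * entries that only partly overlap. Returns the number of bytes retyped.
 */
static uint64_t sk_range_update(struct sk_table *t, uint64_t start, uint64_t size,
				enum sk_type old_type, enum sk_type new_type)
{
	uint64_t end = start + size, updated = 0;
	unsigned int i, nr = t->nr; /* entries appended below are not revisited */

	for (i = 0; i < nr; i++) {
		struct sk_entry *en = &t->e[i];
		uint64_t e_start = en->addr, e_end = en->addr + en->size;
		uint64_t lo, hi;

		if (en->type != old_type || e_end <= start || e_start >= end)
			continue;

		lo = e_start > start ? e_start : start;
		hi = e_end < end ? e_end : end;

		/* Part of the entry after the range keeps the old type. */
		if (hi < e_end)
			sk_add(t, hi, e_end - hi, old_type);

		if (lo > e_start) {
			/* Part before the range keeps the old type. */
			en->size = lo - e_start;
			sk_add(t, lo, hi - lo, new_type);
		} else {
			/* Entry starts inside the range: retype it in place. */
			en->size = hi - lo;
			en->type = new_type;
		}
		updated += hi - lo;
	}
	return updated;
}

/* Type of the entry covering addr, or SK_NONE if no entry covers it. */
static enum sk_type sk_type_at(const struct sk_table *t, uint64_t addr)
{
	unsigned int i;

	for (i = 0; i < t->nr; i++)
		if (addr >= t->e[i].addr && addr - t->e[i].addr < t->e[i].size)
			return t->e[i].type;
	return SK_NONE;
}
```

For example, retyping 256 MiB at 1 GiB inside a single RAM entry leaves
three entries: RAM before the range, the retyped range, and RAM after it.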
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
arch/x86/include/asm/e820/api.h | 4 ++++
arch/x86/kernel/e820.c | 21 +++++++++++++++------
2 files changed, 19 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/e820/api.h b/arch/x86/include/asm/e820/api.h
index e8f58ddd06d9..24bb8da928bb 100644
--- a/arch/x86/include/asm/e820/api.h
+++ b/arch/x86/include/asm/e820/api.h
@@ -22,6 +22,10 @@ extern void e820__print_table(char *who);
extern int e820__update_table(struct e820_table *table);
extern void e820__update_table_print(void);
+extern u64 e820__range_update_kexec(u64 start, u64 size, enum e820_type old_type, enum e820_type new_type);
+extern u64 e820__range_remove_kexec(u64 start, u64 size, enum e820_type old_type, bool check_type);
+extern void e820__update_table_kexec(void);
+
extern unsigned long e820__end_of_ram_pfn(void);
extern unsigned long e820__end_of_low_ram_pfn(void);
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index fb8cf953380d..f339815029f7 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -532,13 +532,12 @@ u64 __init e820__range_update(u64 start, u64 size, enum e820_type old_type, enum
return __e820__range_update(e820_table, start, size, old_type, new_type);
}
-static u64 __init e820__range_update_kexec(u64 start, u64 size, enum e820_type old_type, enum e820_type new_type)
+u64 __init e820__range_update_kexec(u64 start, u64 size, enum e820_type old_type, enum e820_type new_type)
{
return __e820__range_update(e820_table_kexec, start, size, old_type, new_type);
}
-/* Remove a range of memory from the E820 table: */
-u64 __init e820__range_remove(u64 start, u64 size, enum e820_type old_type, bool check_type)
+u64 __init __e820__range_remove(struct e820_table *table, u64 start, u64 size, enum e820_type old_type, bool check_type)
{
int i;
u64 end;
@@ -553,8 +552,8 @@ u64 __init e820__range_remove(u64 start, u64 size, enum e820_type old_type, bool
e820_print_type(old_type);
pr_cont("\n");
- for (i = 0; i < e820_table->nr_entries; i++) {
- struct e820_entry *entry = &e820_table->entries[i];
+ for (i = 0; i < table->nr_entries; i++) {
+ struct e820_entry *entry = &table->entries[i];
u64 final_start, final_end;
u64 entry_end;
@@ -599,6 +598,16 @@ u64 __init e820__range_remove(u64 start, u64 size, enum e820_type old_type, bool
return real_removed_size;
}
+u64 __init e820__range_remove(u64 start, u64 size, enum e820_type old_type, bool check_type)
+{
+ return __e820__range_remove(e820_table, start, size, old_type, check_type);
+}
+
+u64 __init e820__range_remove_kexec(u64 start, u64 size, enum e820_type old_type, bool check_type)
+{
+ return __e820__range_remove(e820_table_kexec, start, size, old_type, check_type);
+}
+
void __init e820__update_table_print(void)
{
if (e820__update_table(e820_table))
@@ -608,7 +617,7 @@ void __init e820__update_table_print(void)
e820__print_table("modified");
}
-static void __init e820__update_table_kexec(void)
+void __init e820__update_table_kexec(void)
{
e820__update_table(e820_table_kexec);
}
* [RFC PATCH v3 2/3] pmpool: Introduce persistent memory pool
From: Stanislav Kinsburskii @ 2023-10-04 22:23 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini, bhe,
dave.hansen, kirill.shutemov
This patch introduces a memory allocator specifically tailored for
persistent memory within the kernel. The allocator preserves
kernel-specific state, such as DMA passthrough device state and IOMMU
state, across kexec.
The current implementation provides a foundation for custom solutions that
may be developed in the future. Although the design is kept concise and
straightforward to encourage discussion and feedback, it remains fully
functional.
The persistent memory pool builds upon the contiguous memory allocator
(CMA) and ensures CMA state persistency across kexec by incorporating the
CMA bitmap into the memory region.
Potential applications include:
1. Enabling various in-kernel entities to allocate persistent pages from
a unified memory pool, obviating the need for reserving multiple
regions.
2. For in-kernel components that need the allocation address to be
retained on kernel kexec, this address can be exposed to user space
and subsequently passed through the command line.
3. Distinct subsystems or drivers can set aside their region, allocating
a segment for their persistent memory pool, suitable for uses such as
file systems, key-value stores, and other applications.
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
include/linux/pmpool.h | 22 +++++++++
mm/Kconfig | 8 +++
mm/Makefile | 1
mm/pmpool.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 146 insertions(+)
create mode 100644 include/linux/pmpool.h
create mode 100644 mm/pmpool.c
diff --git a/include/linux/pmpool.h b/include/linux/pmpool.h
new file mode 100644
index 000000000000..b41f16fa9660
--- /dev/null
+++ b/include/linux/pmpool.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _PMPOOL_H
+#define _PMPOOL_H
+
+struct page;
+
+#if defined(CONFIG_PMPOOL)
+struct page *pmpool_alloc(unsigned long count);
+bool pmpool_release(struct page *pages, unsigned long count);
+#else
+static inline struct page *pmpool_alloc(unsigned long count)
+{
+ return NULL;
+}
+static inline bool pmpool_release(struct page *pages, unsigned long count)
+{
+ return false;
+}
+#endif
+
+#endif /* _PMPOOL_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 09130434e30d..e7c10094fb10 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -922,6 +922,14 @@ config CMA_AREAS
If unsure, leave the default value "7" in UMA and "19" in NUMA.
+config PMPOOL
+ bool "Persistent memory pool support"
+ select CMA
+ help
+ This option adds support for CMA-based persistent memory pool
+ feature, which provides pages allocation and freeing from a set of
+ persistent memory ranges, deposited to the memory pool.
+
config MEM_SOFT_DIRTY
bool "Track memory changes"
depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY && PROC_FS
diff --git a/mm/Makefile b/mm/Makefile
index 678530a07326..8d3579e58c2c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -139,3 +139,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_PMPOOL) += pmpool.o
diff --git a/mm/pmpool.c b/mm/pmpool.c
new file mode 100644
index 000000000000..c74f09b99283
--- /dev/null
+++ b/mm/pmpool.c
@@ -0,0 +1,115 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) "pmpool: " fmt
+
+#include <linux/bitmap.h>
+#include <linux/cma.h>
+#include <linux/io.h>
+#include <linux/ioport.h>
+#include <linux/kexec.h>
+#include <linux/memblock.h>
+#include <linux/mm.h>
+#include <linux/pmpool.h>
+
+#include "cma.h"
+
+struct pmpool {
+ struct resource resource;
+ struct cma *cma;
+};
+
+static struct pmpool *default_pmpool;
+
+bool pmpool_release(struct page *pages, unsigned long count)
+{
+ if (!default_pmpool)
+ return false;
+
+ return cma_release(default_pmpool->cma, pages, count);
+}
+
+struct page *pmpool_alloc(unsigned long count)
+{
+ if (!default_pmpool)
+ return NULL;
+
+ return cma_alloc(default_pmpool->cma, count, 0, true);
+}
+
+static void pmpool_cma_accomodate_bitmap(struct cma *cma)
+{
+ unsigned long bitmap_size;
+
+ bitmap_free(cma->bitmap);
+ cma->bitmap = phys_to_virt(PFN_PHYS(cma->base_pfn));
+
+ bitmap_size = BITS_TO_LONGS(cma_bitmap_maxno(cma)) * sizeof(long);
+ memset(cma->bitmap, 0, bitmap_size);
+ bitmap_set(cma->bitmap, 0, PAGE_ALIGN(bitmap_size) >> PAGE_SHIFT);
+
+ pr_info("CMA bitmap moved to %#llx\n", virt_to_phys(cma->bitmap));
+}
+
+static int __init default_pmpool_fixup(void)
+{
+ if (!default_pmpool)
+ return 0;
+
+ if (insert_resource(&iomem_resource, &default_pmpool->resource))
+ pr_err("failed to insert resource\n");
+
+ pmpool_cma_accomodate_bitmap(default_pmpool->cma);
+ return 0;
+}
+postcore_initcall(default_pmpool_fixup);
+
+static int __init parse_pmpool_opt(char *str)
+{
+ static struct pmpool pmpool = {
+ .resource = {
+ .name = "Persistent Memory Pool",
+ .flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
+ .desc = IORES_DESC_CXL
+ }
+ };
+ phys_addr_t base, size, end;
+ int err;
+
+ /* Format is pmpool=<base>,<size> */
+ base = memparse(str, &str);
+ size = memparse(str + 1, NULL);
+ end = base + size - 1;
+
+ err = memblock_is_region_reserved(base, size);
+ if (err) {
+ pr_err("memory block overlaps with another one: %d\n", err);
+ return 0;
+ }
+
+ err = memblock_reserve(base, size);
+ if (err) {
+ pr_err("failed to reserve memory block: %d\n", err);
+ return 0;
+ }
+
+ err = cma_init_reserved_mem(base, size, 0, "pmpool", &pmpool.cma);
+ if (err) {
+ pr_err("failed to initialize CMA: %d\n", err);
+ goto free_memblock;
+ }
+
+ pmpool.resource.start = base;
+ pmpool.resource.end = end;
+
+ pr_info("default memory pool is created: %#llx-%#llx\n",
+ base, end);
+
+ default_pmpool = &pmpool;
+
+ return 0;
+
+free_memblock:
+ memblock_phys_free(base, size);
+ return 0;
+}
+early_param("pmpool", parse_pmpool_opt);
* [RFC PATCH v3 3/3] pmpool: Mark reserved range as "kernel reserved" in kexec e820 table
From: Stanislav Kinsburskii @ 2023-10-04 22:23 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, hpa, ebiederm, akpm,
stanislav.kinsburskii, corbet, linux-kernel, kexec, linux-mm,
kys, jgowans, wei.liu, arnd, gregkh, graf, pbonzini, bhe,
dave.hansen, kirill.shutemov
Update the logic to classify the persistent memory pool in the kexec e820
table as "kernel reserved" when its corresponding e820 region type is
"System RAM". Restore the pool when its type is "kernel reserved". This
ensures the persistence of the memory pool across kexec operations.
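The classification described above can be summarized as a small decision
table. The userspace sketch below models it under simplifying assumptions:
the enum values stand in for the e820 types, and sk_plan_pool() and the
sk_pool_plan fields are hypothetical names that only mirror the intent of
the patch, not its code.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the e820 types the patch distinguishes. */
enum sk_region_type { SK_REGION_RAM, SK_REGION_RESERVED_KERN, SK_REGION_OTHER };

enum sk_pool_action { SK_POOL_CREATE, SK_POOL_RESTORE, SK_POOL_REJECT };

struct sk_pool_plan {
	enum sk_pool_action action;
	bool retype_in_kexec_table; /* RAM -> "kernel reserved" for the next kernel */
	bool reset_bitmap;          /* zero the in-region CMA bitmap */
};

/* Decide what to do with the pool range based on its current type. */
static struct sk_pool_plan sk_plan_pool(enum sk_region_type type)
{
	struct sk_pool_plan plan = { SK_POOL_REJECT, false, false };

	switch (type) {
	case SK_REGION_RAM:
		/* First boot: create the pool and flag it for the next kernel. */
		plan.action = SK_POOL_CREATE;
		plan.retype_in_kexec_table = true;
		plan.reset_bitmap = true;
		break;
	case SK_REGION_RESERVED_KERN:
		/* After kexec: reuse the bitmap already stored in the region. */
		plan.action = SK_POOL_RESTORE;
		break;
	default:
		/* Any other type is not supported. */
		break;
	}
	return plan;
}

/* Usage: only a freshly created pool gets its bitmap cleared. */
static void sk_plan_demo(void)
{
	assert(sk_plan_pool(SK_REGION_RAM).reset_bitmap);
	assert(!sk_plan_pool(SK_REGION_RESERVED_KERN).reset_bitmap);
	assert(sk_plan_pool(SK_REGION_OTHER).action == SK_POOL_REJECT);
}
```

The restore branch is only safe if the "kernel reserved" range really is a
previously initialized pool with the same base and size, which is the set of
assumptions the TODO comment in the patch calls out.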
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
mm/pmpool.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 47 insertions(+), 3 deletions(-)
diff --git a/mm/pmpool.c b/mm/pmpool.c
index c74f09b99283..1e3a2dffc5d3 100644
--- a/mm/pmpool.c
+++ b/mm/pmpool.c
@@ -11,11 +11,14 @@
#include <linux/mm.h>
#include <linux/pmpool.h>
+#include <asm/e820/api.h>
+
#include "cma.h"
struct pmpool {
struct resource resource;
struct cma *cma;
+ bool exists;
};
static struct pmpool *default_pmpool;
@@ -50,6 +53,18 @@ static void pmpool_cma_accomodate_bitmap(struct cma *cma)
pr_info("CMA bitmap moved to %#llx\n", virt_to_phys(cma->bitmap));
}
+static void pmpool_cma_restore_bitmap(struct cma *cma)
+{
+ u64 base;
+
+ base = PFN_PHYS(cma->base_pfn);
+
+ bitmap_free(cma->bitmap);
+ cma->bitmap = phys_to_virt(base);
+
+ pr_info("CMA bitmap restored to %#llx\n", base);
+}
+
static int __init default_pmpool_fixup(void)
{
if (!default_pmpool)
@@ -58,7 +73,11 @@ static int __init default_pmpool_fixup(void)
if (insert_resource(&iomem_resource, &default_pmpool->resource))
pr_err("failed to insert resource\n");
- pmpool_cma_accomodate_bitmap(default_pmpool->cma);
+ if (default_pmpool->exists)
+ pmpool_cma_restore_bitmap(default_pmpool->cma);
+ else
+ pmpool_cma_accomodate_bitmap(default_pmpool->cma);
+
return 0;
}
postcore_initcall(default_pmpool_fixup);
@@ -73,7 +92,7 @@ static int __init parse_pmpool_opt(char *str)
}
};
phys_addr_t base, size, end;
- int err;
+ int err, e820_type;
/* Format is pmpool=<base>,<size> */
base = memparse(str, &str);
@@ -92,10 +111,33 @@ static int __init parse_pmpool_opt(char *str)
return 0;
}
+ e820_type = e820__get_entry_type(base, end);
+ switch (e820_type) {
+ case E820_TYPE_RAM:
+ e820__range_update_kexec(base, size, E820_TYPE_RAM,
+ E820_TYPE_RESERVED_KERN);
+ e820__update_table_kexec();
+ break;
+ case E820_TYPE_RESERVED_KERN:
+ /*
+ * TODO: there are several assumptions here:
+ * 1. That the kernel reserved region represents pmpool,
+ * 2. That the region had the same base and size and
+ * 3. That the region was properly initialized.
+ * All these assumptions aren't valid in general case and this
+ * should be addressed.
+ */
+ pmpool.exists = true;
+ break;
+ default:
+ pr_err("unsupported e820 type: %d\n", e820_type);
+ goto free_memblock;
+ }
+
err = cma_init_reserved_mem(base, size, 0, "pmpool", &pmpool.cma);
if (err) {
pr_err("failed to initialize CMA: %d\n", err);
- goto free_memblock;
+ goto remove_e820_kexec_range;
}
pmpool.resource.start = base;
@@ -108,6 +150,8 @@ static int __init parse_pmpool_opt(char *str)
return 0;
+remove_e820_kexec_range:
+ e820__range_remove_kexec(base, size, E820_TYPE_RESERVED_KERN, 1);
free_memblock:
memblock_phys_free(base, size);
return 0;