* [PATCH 1/4] s390/mm: Support removal of boot-allocated virtual memory map
2025-09-26 13:15 [PATCH 0/4] Support dynamic (de)configuration of memory Sumanth Korikkar
@ 2025-09-26 13:15 ` Sumanth Korikkar
2025-09-26 13:15 ` [PATCH 2/4] s390/sclp: Add support for dynamic (de)configuration of memory Sumanth Korikkar
` (4 subsequent siblings)
5 siblings, 0 replies; 20+ messages in thread
From: Sumanth Korikkar @ 2025-09-26 13:15 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, linux-mm
Cc: LKML, linux-s390, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Sumanth Korikkar
On s390, memory blocks are currently never removed via
arch_remove_memory(). With the upcoming dynamic memory (de)configuration
support, memory blocks can be removed at runtime. Internally, this
involves tearing down the identity mapping and the virtual memory map
(vmemmap), and freeing the physical memory that backs the struct page
metadata.
During early boot, physical memory used to back the struct pages
metadata in vmemmap is allocated through:
setup_arch()
  -> sparse_init()
    -> sparse_init_nid()
      -> __populate_section_memmap()
        -> vmemmap_alloc_block_buf()
          -> sparse_buffer_alloc()
            -> memblock_alloc()
Here, sparse_init_nid() sets up the virtual-to-physical mapping for
struct pages backed by memblock_alloc(). This differs from runtime
addition of hotplug memory, which uses the buddy allocator later.
To correctly free identity mappings and vmemmap mappings during
hot-remove, boot-time and runtime allocations must be distinguished
using the PageReserved bit:
* Boot-time memory, such as identity-mapped page tables allocated via
boot_crst_alloc() and reserved via reserve_pgtables(), is marked
PageReserved in memmap_init_reserved_pages().
* Physical memory backing vmemmap (struct pages from memblock_alloc())
is marked PageReserved in the same way.
During teardown, the PageReserved bit is checked to distinguish a
boot-time allocation from a buddy allocation.
This is similar to commit 645d5ce2f7d6 ("powerpc/mm/radix: Fix PTE/PMD
fragment count for early page table mappings").
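The PageReserved-based distinction described above can be sketched as a small user-space model. This is only an illustration under stated assumptions: struct model_page, enum free_path and model_free_pages() are hypothetical names, and the real code operates on struct page via free_bootmem_page() and free_pages().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; it only models two flags. */
struct model_page {
	bool reserved;	/* models PageReserved(): set for memblock-backed pages */
	bool freed;	/* set once the model has released the page */
};

enum free_path { FREE_PATH_BOOTMEM, FREE_PATH_BUDDY };

/*
 * Models the branch in vmem_free_pages(): the reserved bit of the first
 * page selects the release path for the whole 2^order range.
 */
enum free_path model_free_pages(struct model_page *page, unsigned int order)
{
	unsigned int nr_pages = 1u << order;

	assert(page != NULL);
	if (page->reserved) {
		/* Boot-time (memblock) allocation: release page by page. */
		while (nr_pages--)
			(page++)->freed = true;
		return FREE_PATH_BOOTMEM;
	}
	/* Runtime allocation: in the kernel this is one free_pages() call. */
	while (nr_pages--)
		(page++)->freed = true;
	return FREE_PATH_BUDDY;
}
```

In the model both paths mark the pages as freed; the point is only that the reserved bit, not the caller, decides which path is taken.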
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
---
arch/s390/mm/pgalloc.c | 2 ++
arch/s390/mm/vmem.c | 21 ++++++++++++---------
2 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index ad3e0f7f7fc1..596c05244ed0 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -164,6 +164,8 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
 {
 	struct ptdesc *ptdesc = virt_to_ptdesc(table);
 
+	if (pagetable_is_reserved(ptdesc))
+		return free_reserved_ptdesc(ptdesc);
 	pagetable_dtor_free(ptdesc);
 }
 
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index f48ef361bc83..d96587b84e81 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -4,6 +4,7 @@
  */
 
 #include <linux/memory_hotplug.h>
+#include <linux/bootmem_info.h>
 #include <linux/cpufeature.h>
 #include <linux/memblock.h>
 #include <linux/pfn.h>
@@ -39,15 +40,21 @@ static void __ref *vmem_alloc_pages(unsigned int order)
 
 static void vmem_free_pages(unsigned long addr, int order, struct vmem_altmap *altmap)
 {
+	unsigned int nr_pages = 1 << order;
+	struct page *page;
+
 	if (altmap) {
 		vmem_altmap_free(altmap, 1 << order);
 		return;
 	}
-	/* We don't expect boot memory to be removed ever. */
-	if (!slab_is_available() ||
-	    WARN_ON_ONCE(PageReserved(virt_to_page((void *)addr))))
-		return;
-	free_pages(addr, order);
+	page = virt_to_page((void *)addr);
+	if (PageReserved(page)) {
+		/* allocated from memblock */
+		while (nr_pages--)
+			free_bootmem_page(page++);
+	} else {
+		free_pages(addr, order);
+	}
 }
 
 void *vmem_crst_alloc(unsigned long val)
@@ -79,10 +86,6 @@ pte_t __ref *vmem_pte_alloc(void)
 
 static void vmem_pte_free(unsigned long *table)
 {
-	/* We don't expect boot memory to be removed ever. */
-	if (!slab_is_available() ||
-	    WARN_ON_ONCE(PageReserved(virt_to_page(table))))
-		return;
 	page_table_free(&init_mm, table);
 }
 
--
2.48.1
* [PATCH 2/4] s390/sclp: Add support for dynamic (de)configuration of memory
2025-09-26 13:15 [PATCH 0/4] Support dynamic (de)configuration of memory Sumanth Korikkar
2025-09-26 13:15 ` [PATCH 1/4] s390/mm: Support removal of boot-allocated virtual memory map Sumanth Korikkar
@ 2025-09-26 13:15 ` Sumanth Korikkar
2025-10-07 20:07 ` David Hildenbrand
2025-09-26 13:15 ` [PATCH 3/4] s390/sclp: Remove MHP_OFFLINE_INACCESSIBLE Sumanth Korikkar
` (3 subsequent siblings)
5 siblings, 1 reply; 20+ messages in thread
From: Sumanth Korikkar @ 2025-09-26 13:15 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, linux-mm
Cc: LKML, linux-s390, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Sumanth Korikkar
Provide a new interface for dynamic configuration and deconfiguration of
hotplug memory, with or without memmap_on_memory support. This follows up
on the discussion with David when memmap_on_memory support was introduced
for s390, and adds support for dynamic (de)configuration of memory:
https://lore.kernel.org/all/ee492da8-74b4-4a97-8b24-73e07257f01d@redhat.com/
https://lore.kernel.org/all/20241202082732.3959803-1-sumanthk@linux.ibm.com/
The original motivation for introducing memmap_on_memory on s390 was to
avoid using online memory to store struct pages metadata, particularly
for standby memory blocks. This became critical in cases where there was
an imbalance between standby and online memory, potentially leading to
boot failures due to insufficient memory for metadata allocation.
To address this, memmap_on_memory was utilized on s390. However, in its
current form, it adds struct pages metadata at the start of each memory
block at the time of addition, and this configuration is static: it
cannot be changed at runtime, for example when the user needs contiguous
physical memory.
In order to give the user more flexibility and overcome the above
limitation, add an option to dynamically configure and deconfigure
hotpluggable memory blocks with or without memmap_on_memory.
With the new interface, s390 no longer adds all possible hotplug memory
in advance to make it visible in sysfs for online/offline actions.
Instead, before a memory block can be set online, it has to be
configured via a new interface in /sys/firmware/memory/memoryX/config,
which makes s390 similar to other architectures: adding hotpluggable
memory is controlled by the user instead of being done at boot time.
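The ordering rules implied by this interface can be sketched as a small user-space state model. It is an illustration only, not kernel code: struct model_mblock and the model_set_*() helpers are hypothetical, and the -ENOENT result for onlining a deconfigured block is an assumption standing in for the missing /sys/devices/system/memory/memoryX directory.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one hotpluggable memory block. */
struct model_mblock {
	bool config;		/* mirrors /sys/firmware/memory/memoryX/config */
	bool memmap_on_memory;	/* mirrors .../memoryX/memmap_on_memory */
	bool online;		/* mirrors /sys/devices/system/memory/memoryX/online */
};

/* Configure or deconfigure; deconfiguring an online block is refused. */
int model_set_config(struct model_mblock *b, bool value)
{
	assert(b != NULL);
	if (value == b->config)
		return 0;	/* already in the requested state */
	if (!value && b->online)
		return -EBUSY;
	b->config = value;
	return 0;
}

/* memmap_on_memory can only be changed while the block is deconfigured. */
int model_set_memmap_on_memory(struct model_mblock *b, bool value)
{
	assert(b != NULL);
	if (b->config)
		return -EBUSY;
	b->memmap_on_memory = value;
	return 0;
}

/* Onlining requires a configured block (assumed error code, see above). */
int model_set_online(struct model_mblock *b, bool value)
{
	assert(b != NULL);
	if (!b->config)
		return -ENOENT;
	b->online = value;
	return 0;
}
```

The resulting order is: choose memmap_on_memory, configure, online; and in reverse: offline, deconfigure.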
The s390 kernel sysfs interface to configure and deconfigure memory is
as follows (considering the upcoming lsmem changes):
* Initial memory layout:
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x7fffffff 2G online 0-15 yes no
0x80000000-0xffffffff 2G offline 16-31 no yes
* Configure memory
sys="/sys"
echo 1 > $sys/firmware/memory/memory16/config
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x7fffffff 2G online 0-15 yes no
0x80000000-0x87ffffff 128M offline 16 yes yes
0x88000000-0xffffffff 1.9G offline 17-31 no yes
* Deconfigure memory
echo 0 > $sys/firmware/memory/memory16/config
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x7fffffff 2G online 0-15 yes no
0x80000000-0xffffffff 2G offline 16-31 no yes
* Enable memmap_on_memory and online it
echo 0 > $sys/devices/system/memory/memory5/online
echo 0 > $sys/firmware/memory/memory5/config
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x27ffffff 640M online 0-4 yes no
0x28000000-0x2fffffff 128M offline 5 no no
0x30000000-0x7fffffff 1.3G online 6-15 yes no
0x80000000-0xffffffff 2G offline 16-31 no yes
echo 1 > $sys/firmware/memory/memory5/memmap_on_memory
echo 1 > $sys/firmware/memory/memory5/config
echo 1 > $sys/devices/system/memory/memory5/online
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x27ffffff 640M online 0-4 yes no
0x28000000-0x2fffffff 128M online 5 yes yes
0x30000000-0x7fffffff 1.3G online 6-15 yes no
0x80000000-0xffffffff 2G offline 16-31 no yes
* Disable memmap_on_memory and online it
echo 0 > $sys/devices/system/memory/memory5/online
echo 0 > $sys/firmware/memory/memory5/config
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x27ffffff 640M online 0-4 yes no
0x28000000-0x2fffffff 128M offline 5 no yes
0x30000000-0x7fffffff 1.3G online 6-15 yes no
0x80000000-0xffffffff 2G offline 16-31 no yes
echo 0 > $sys/firmware/memory/memory5/memmap_on_memory
echo 1 > $sys/firmware/memory/memory5/config
echo 1 > $sys/devices/system/memory/memory5/online
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x7fffffff 2G online 0-15 yes no
0x80000000-0xffffffff 2G offline 16-31 no yes
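The relation between the BLOCK column, the RANGE column and the memoryX path in the examples above can be checked with a short sketch. It assumes the 128M block size visible in the lsmem output; MODEL_BLOCK_SIZE and the helper names are hypothetical, and the kernel obtains the real size from memory_block_size_bytes().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption for illustration: 128M blocks, as shown by lsmem above. */
#define MODEL_BLOCK_SIZE (128ULL << 20)

/* First byte of block `id`, i.e. id * block_size. */
uint64_t model_block_start(unsigned int id)
{
	return (uint64_t)id * MODEL_BLOCK_SIZE;
}

/* Last byte of block `id` (inclusive, as printed in the RANGE column). */
uint64_t model_block_end(unsigned int id)
{
	return model_block_start(id) + MODEL_BLOCK_SIZE - 1;
}

/* Block id that contains `addr`, i.e. addr / block_size. */
unsigned int model_block_id(uint64_t addr)
{
	return (unsigned int)(addr / MODEL_BLOCK_SIZE);
}

/* Builds the path of the config attribute for block `id`. */
int model_config_path(char *buf, size_t len, unsigned int id)
{
	assert(buf != NULL);
	return snprintf(buf, len, "/sys/firmware/memory/memory%u/config", id);
}
```

For example, block 16 starts at 0x80000000 and ends at 0x87ffffff, which matches the 128M row shown after configuring memory16.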
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
---
drivers/s390/char/sclp_mem.c | 291 +++++++++++++++++++++++++++++------
1 file changed, 241 insertions(+), 50 deletions(-)
diff --git a/drivers/s390/char/sclp_mem.c b/drivers/s390/char/sclp_mem.c
index 27f49f5fd358..802439230294 100644
--- a/drivers/s390/char/sclp_mem.c
+++ b/drivers/s390/char/sclp_mem.c
@@ -9,9 +9,12 @@
#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
#include <linux/cpufeature.h>
+#include <linux/container_of.h>
#include <linux/err.h>
#include <linux/errno.h>
#include <linux/init.h>
+#include <linux/kobject.h>
+#include <linux/kstrtox.h>
#include <linux/memory.h>
#include <linux/memory_hotplug.h>
#include <linux/mm.h>
@@ -27,7 +30,6 @@
#define SCLP_CMDW_ASSIGN_STORAGE 0x000d0001
#define SCLP_CMDW_UNASSIGN_STORAGE 0x000c0001
-static DEFINE_MUTEX(sclp_mem_mutex);
static LIST_HEAD(sclp_mem_list);
static u8 sclp_max_storage_id;
static DECLARE_BITMAP(sclp_storage_ids, 256);
@@ -38,6 +40,18 @@ struct memory_increment {
int standby;
};
+struct mblock {
+ struct kobject kobj;
+ unsigned int id;
+ unsigned int memmap_on_memory;
+ unsigned int config;
+};
+
+struct memory_block_arg {
+ struct mblock *mblocks;
+ struct kset *kset;
+};
+
struct assign_storage_sccb {
struct sccb_header header;
u16 rn;
@@ -185,15 +199,11 @@ static int sclp_mem_notifier(struct notifier_block *nb,
{
unsigned long start, size;
struct memory_notify *arg;
- unsigned char id;
int rc = 0;
arg = data;
start = arg->start_pfn << PAGE_SHIFT;
size = arg->nr_pages << PAGE_SHIFT;
- mutex_lock(&sclp_mem_mutex);
- for_each_clear_bit(id, sclp_storage_ids, sclp_max_storage_id + 1)
- sclp_attach_storage(id);
switch (action) {
case MEM_GOING_OFFLINE:
/*
@@ -204,45 +214,201 @@ static int sclp_mem_notifier(struct notifier_block *nb,
if (contains_standby_increment(start, start + size))
rc = -EPERM;
break;
- case MEM_PREPARE_ONLINE:
- /*
- * Access the altmap_start_pfn and altmap_nr_pages fields
- * within the struct memory_notify specifically when dealing
- * with only MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers.
- *
- * When altmap is in use, take the specified memory range
- * online, which includes the altmap.
- */
- if (arg->altmap_nr_pages) {
- start = PFN_PHYS(arg->altmap_start_pfn);
- size += PFN_PHYS(arg->altmap_nr_pages);
- }
- rc = sclp_mem_change_state(start, size, 1);
- if (rc || !arg->altmap_nr_pages)
- break;
- /*
- * Set CMMA state to nodat here, since the struct page memory
- * at the beginning of the memory block will not go through the
- * buddy allocator later.
- */
- __arch_set_page_nodat((void *)__va(start), arg->altmap_nr_pages);
+ default:
break;
- case MEM_FINISH_OFFLINE:
+ }
+ return rc ? NOTIFY_BAD : NOTIFY_OK;
+}
+
+static ssize_t config_mblock_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+ struct mblock *mblock = container_of(kobj, struct mblock, kobj);
+
+ return sysfs_emit(buf, "%u\n", READ_ONCE(mblock->config));
+}
+
+static ssize_t config_mblock_store(struct kobject *kobj, struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long long addr, block_size;
+ struct memory_block *mem;
+ struct mblock *mblock;
+ unsigned char id;
+ bool value;
+ int rc;
+
+ rc = kstrtobool(buf, &value);
+ if (rc)
+ return rc;
+ mblock = container_of(kobj, struct mblock, kobj);
+ block_size = memory_block_size_bytes();
+ addr = mblock->id * block_size;
+ /*
+ * Hold device_hotplug_lock when adding/removing memory blocks.
+ * Additionally, also protect calls to find_memory_block() and
+ * sclp_attach_storage().
+ */
+ rc = lock_device_hotplug_sysfs();
+ if (rc)
+ goto out;
+ for_each_clear_bit(id, sclp_storage_ids, sclp_max_storage_id + 1)
+ sclp_attach_storage(id);
+ if (value) {
+ if (mblock->config)
+ goto out_unlock;
+ rc = sclp_mem_change_state(addr, block_size, 1);
+ if (rc)
+ goto out_unlock;
/*
- * When altmap is in use, take the specified memory range
- * offline, which includes the altmap.
+ * Set entire memory block CMMA state to nodat. Later, when
+ * page tables pages are allocated via __add_memory(), those
+ * regions are marked __arch_set_page_dat().
*/
- if (arg->altmap_nr_pages) {
- start = PFN_PHYS(arg->altmap_start_pfn);
- size += PFN_PHYS(arg->altmap_nr_pages);
+ __arch_set_page_nodat((void *)__va(addr), block_size >> PAGE_SHIFT);
+ rc = __add_memory(0, addr, block_size,
+ mblock->memmap_on_memory ?
+ MHP_MEMMAP_ON_MEMORY | MHP_OFFLINE_INACCESSIBLE : MHP_NONE);
+ if (rc)
+ goto out_unlock;
+ mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(addr)));
+ put_device(&mem->dev);
+ WRITE_ONCE(mblock->config, 1);
+ } else {
+ if (!mblock->config)
+ goto out_unlock;
+ mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(addr)));
+ if (mem->state != MEM_OFFLINE) {
+ put_device(&mem->dev);
+ rc = -EBUSY;
+ goto out_unlock;
}
- sclp_mem_change_state(start, size, 0);
- break;
- default:
- break;
+ /* drop the ref just got via find_memory_block() */
+ put_device(&mem->dev);
+ sclp_mem_change_state(addr, block_size, 0);
+ __remove_memory(addr, block_size);
+ WRITE_ONCE(mblock->config, 0);
}
- mutex_unlock(&sclp_mem_mutex);
- return rc ? NOTIFY_BAD : NOTIFY_OK;
+out_unlock:
+ unlock_device_hotplug();
+out:
+ return rc ? rc : count;
+}
+
+static ssize_t memmap_on_memory_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+ struct mblock *mblock = container_of(kobj, struct mblock, kobj);
+
+ return sysfs_emit(buf, "%u\n", READ_ONCE(mblock->memmap_on_memory));
+}
+
+static ssize_t memmap_on_memory_store(struct kobject *kobj, struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long block_size;
+ struct memory_block *mem;
+ struct mblock *mblock;
+ bool value;
+ int rc;
+
+ rc = kstrtobool(buf, &value);
+ if (rc)
+ return rc;
+ rc = lock_device_hotplug_sysfs();
+ if (rc)
+ return rc;
+ block_size = memory_block_size_bytes();
+ mblock = container_of(kobj, struct mblock, kobj);
+ mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(mblock->id * block_size)));
+ if (!mem) {
+ WRITE_ONCE(mblock->memmap_on_memory, value);
+ } else {
+ put_device(&mem->dev);
+ rc = -EBUSY;
+ }
+ unlock_device_hotplug();
+ return rc ? rc : count;
+}
+
+static void mblock_sysfs_release(struct kobject *kobj)
+{
+ struct mblock *mblock = container_of(kobj, struct mblock, kobj);
+
+ kfree(mblock);
+}
+
+static const struct kobj_type ktype = {
+ .release = mblock_sysfs_release,
+ .sysfs_ops = &kobj_sysfs_ops,
+};
+
+static struct kobj_attribute memmap_attr =
+ __ATTR(memmap_on_memory, 0644, memmap_on_memory_show, memmap_on_memory_store);
+static struct kobj_attribute config_attr =
+ __ATTR(config, 0644, config_mblock_show, config_mblock_store);
+
+static struct attribute *mblock_attrs[] = {
+ &memmap_attr.attr,
+ &config_attr.attr,
+ NULL,
+};
+
+static struct attribute_group mblock_attr_group = {
+ .attrs = mblock_attrs,
+};
+
+static int create_mblock(struct mblock *mblock, struct kset *kset,
+ unsigned int id, bool config, bool memmap_on_memory)
+{
+ int rc;
+
+ mblock->memmap_on_memory = memmap_on_memory;
+ mblock->config = config;
+ mblock->id = id;
+ kobject_init(&mblock->kobj, &ktype);
+ rc = kobject_add(&mblock->kobj, &kset->kobj, "memory%d", id);
+ if (rc)
+ return rc;
+ rc = sysfs_create_group(&mblock->kobj, &mblock_attr_group);
+ if (rc)
+ kobject_put(&mblock->kobj);
+ return rc;
+}
+
+/*
+ * Create /sys/firmware/memory/memoryX for boottime configured online memory
+ * blocks
+ */
+static int create_online_mblock(struct memory_block *mem, void *argument)
+{
+ struct memory_block_arg *arg;
+ struct mblock *mblocks;
+ struct kset *kset;
+ unsigned int id;
+
+ id = mem->dev.id;
+ arg = (struct memory_block_arg *)argument;
+ mblocks = arg->mblocks;
+ kset = arg->kset;
+ return create_mblock(&mblocks[id], kset, id, true, false);
+}
+
+static int __init create_initial_online_mblocks(struct mblock *mblocks, struct kset *kset)
+{
+ struct memory_block_arg arg;
+
+ arg.mblocks = mblocks;
+ arg.kset = kset;
+ return for_each_memory_block(&arg, create_online_mblock);
+}
+
+static struct mblock * __init allocate_mblocks(void)
+{
+ u64 max_mblocks;
+ u64 block_size;
+
+ block_size = memory_block_size_bytes();
+ max_mblocks = roundup(sclp.rnmax * sclp.rzm, block_size) / block_size;
+ return kcalloc(max_mblocks, sizeof(struct mblock), GFP_KERNEL);
}
static struct notifier_block sclp_mem_nb = {
@@ -264,14 +430,17 @@ static void __init align_to_block_size(unsigned long *start,
*size = size_align;
}
-static void __init add_memory_merged(u16 rn)
+static int __init create_standby_mblocks_merged(struct mblock *mblocks,
+ struct kset *kset, u16 rn)
{
unsigned long start, size, addr, block_size;
static u16 first_rn, num;
+ unsigned int id;
+ int rc = 0;
if (rn && first_rn && (first_rn + num == rn)) {
num++;
- return;
+ return rc;
}
if (!first_rn)
goto skip_add;
@@ -286,24 +455,31 @@ static void __init add_memory_merged(u16 rn)
if (!size)
goto skip_add;
for (addr = start; addr < start + size; addr += block_size) {
- add_memory(0, addr, block_size,
- cpu_has_edat1() ?
- MHP_MEMMAP_ON_MEMORY | MHP_OFFLINE_INACCESSIBLE : MHP_NONE);
+ id = addr / block_size;
+ rc = create_mblock(&mblocks[id], kset, id, false, mhp_supports_memmap_on_memory());
+ if (rc)
+ break;
}
skip_add:
first_rn = rn;
num = 1;
+ return rc;
}
-static void __init sclp_add_standby_memory(void)
+static int __init create_standby_mblocks(struct mblock *mblocks, struct kset *kset)
{
struct memory_increment *incr;
+ int rc = 0;
list_for_each_entry(incr, &sclp_mem_list, list) {
if (incr->standby)
- add_memory_merged(incr->rn);
+ rc = create_standby_mblocks_merged(mblocks, kset, incr->rn);
+ if (rc)
+ goto out;
}
- add_memory_merged(0);
+ rc = create_standby_mblocks_merged(mblocks, kset, 0);
+out:
+ return rc;
}
static void __init insert_increment(u16 rn, int standby, int assigned)
@@ -336,10 +512,12 @@ static void __init insert_increment(u16 rn, int standby, int assigned)
list_add(&new_incr->list, prev);
}
-static int __init sclp_detect_standby_memory(void)
+static int __init sclp_setup_memory(void)
{
struct read_storage_sccb *sccb;
int i, id, assigned, rc;
+ struct mblock *mblocks;
+ struct kset *kset;
/* No standby memory in kdump mode */
if (oldmem_data.start)
@@ -391,9 +569,22 @@ static int __init sclp_detect_standby_memory(void)
rc = register_memory_notifier(&sclp_mem_nb);
if (rc)
goto out;
- sclp_add_standby_memory();
+ mblocks = allocate_mblocks();
+ if (!mblocks) {
+ rc = -ENOMEM;
+ goto out;
+ }
+ kset = kset_create_and_add("memory", NULL, firmware_kobj);
+ if (!kset) {
+ rc = -ENOMEM;
+ goto out;
+ }
+ rc = create_initial_online_mblocks(mblocks, kset);
+ if (rc)
+ goto out;
+ rc = create_standby_mblocks(mblocks, kset);
out:
free_page((unsigned long)sccb);
return rc;
}
-__initcall(sclp_detect_standby_memory);
+__initcall(sclp_setup_memory);
--
2.48.1
* Re: [PATCH 2/4] s390/sclp: Add support for dynamic (de)configuration of memory
2025-09-26 13:15 ` [PATCH 2/4] s390/sclp: Add support for dynamic (de)configuration of memory Sumanth Korikkar
@ 2025-10-07 20:07 ` David Hildenbrand
2025-10-08 6:46 ` Sumanth Korikkar
0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2025-10-07 20:07 UTC (permalink / raw)
To: Sumanth Korikkar, Andrew Morton, linux-mm
Cc: LKML, linux-s390, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev
[...]
> ---
> drivers/s390/char/sclp_mem.c | 291 +++++++++++++++++++++++++++++------
> 1 file changed, 241 insertions(+), 50 deletions(-)
>
> diff --git a/drivers/s390/char/sclp_mem.c b/drivers/s390/char/sclp_mem.c
> index 27f49f5fd358..802439230294 100644
> --- a/drivers/s390/char/sclp_mem.c
> +++ b/drivers/s390/char/sclp_mem.c
> @@ -9,9 +9,12 @@
> #define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
>
> #include <linux/cpufeature.h>
> +#include <linux/container_of.h>
> #include <linux/err.h>
> #include <linux/errno.h>
> #include <linux/init.h>
> +#include <linux/kobject.h>
> +#include <linux/kstrtox.h>
> #include <linux/memory.h>
> #include <linux/memory_hotplug.h>
> #include <linux/mm.h>
> @@ -27,7 +30,6 @@
> #define SCLP_CMDW_ASSIGN_STORAGE 0x000d0001
> #define SCLP_CMDW_UNASSIGN_STORAGE 0x000c0001
>
> -static DEFINE_MUTEX(sclp_mem_mutex);
> static LIST_HEAD(sclp_mem_list);
> static u8 sclp_max_storage_id;
> static DECLARE_BITMAP(sclp_storage_ids, 256);
> @@ -38,6 +40,18 @@ struct memory_increment {
> int standby;
> };
>
> +struct mblock {
> + struct kobject kobj;
> + unsigned int id;
> + unsigned int memmap_on_memory;
> + unsigned int config;
> +};
> +
> +struct memory_block_arg {
> + struct mblock *mblocks;
> + struct kset *kset;
> +};
I would avoid using "memory_block_arg" as it is reminiscent of the core-mm "struct memory_block".
Similarly, I'd not call this "mblock".
What about incorporating the "sclp" side of things?
"struct sclp_mem" / "struct sclp_mem_arg"
Nicely fits "sclp_mem.c" ;)
Something like that might be better.
> +
> struct assign_storage_sccb {
> struct sccb_header header;
> u16 rn;
> @@ -185,15 +199,11 @@ static int sclp_mem_notifier(struct notifier_block *nb,
> {
> unsigned long start, size;
> struct memory_notify *arg;
> - unsigned char id;
> int rc = 0;
>
> arg = data;
> start = arg->start_pfn << PAGE_SHIFT;
> size = arg->nr_pages << PAGE_SHIFT;
> - mutex_lock(&sclp_mem_mutex);
> - for_each_clear_bit(id, sclp_storage_ids, sclp_max_storage_id + 1)
> - sclp_attach_storage(id);
> switch (action) {
> case MEM_GOING_OFFLINE:
> /*
> @@ -204,45 +214,201 @@ static int sclp_mem_notifier(struct notifier_block *nb,
> if (contains_standby_increment(start, start + size))
> rc = -EPERM;
> break;
Is there any reason this notifier is still needed? I'd assume we can just allow
offlining + re-onlining as we please now.
In fact, I'd assume we can get rid of the notifier entirely now?
> - case MEM_PREPARE_ONLINE:
> - /*
> - * Access the altmap_start_pfn and altmap_nr_pages fields
> - * within the struct memory_notify specifically when dealing
> - * with only MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers.
> - *
> - * When altmap is in use, take the specified memory range
> - * online, which includes the altmap.
> - */
> - if (arg->altmap_nr_pages) {
> - start = PFN_PHYS(arg->altmap_start_pfn);
> - size += PFN_PHYS(arg->altmap_nr_pages);
> - }
> - rc = sclp_mem_change_state(start, size, 1);
> - if (rc || !arg->altmap_nr_pages)
> - break;
> - /*
> - * Set CMMA state to nodat here, since the struct page memory
> - * at the beginning of the memory block will not go through the
> - * buddy allocator later.
> - */
> - __arch_set_page_nodat((void *)__va(start), arg->altmap_nr_pages);
> + default:
> break;
> - case MEM_FINISH_OFFLINE:
> + }
> + return rc ? NOTIFY_BAD : NOTIFY_OK;
> +}
> +
> +static ssize_t config_mblock_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
> +{
> + struct mblock *mblock = container_of(kobj, struct mblock, kobj);
> +
> + return sysfs_emit(buf, "%u\n", READ_ONCE(mblock->config));
> +}
> +
> +static ssize_t config_mblock_store(struct kobject *kobj, struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + unsigned long long addr, block_size;
"unsigned long" should be sufficient I'm sure :)
> + struct memory_block *mem;
> + struct mblock *mblock;
> + unsigned char id;
> + bool value;
> + int rc;
> +
> + rc = kstrtobool(buf, &value);
> + if (rc)
> + return rc;
> + mblock = container_of(kobj, struct mblock, kobj);
> + block_size = memory_block_size_bytes();
> + addr = mblock->id * block_size;
> + /*
> + * Hold device_hotplug_lock when adding/removing memory blocks.
> + * Additionally, also protect calls to find_memory_block() and
> + * sclp_attach_storage().
> + */
> + rc = lock_device_hotplug_sysfs();
> + if (rc)
> + goto out;
> + for_each_clear_bit(id, sclp_storage_ids, sclp_max_storage_id + 1)
> + sclp_attach_storage(id);
> + if (value) {
> + if (mblock->config)
> + goto out_unlock;
> + rc = sclp_mem_change_state(addr, block_size, 1);
> + if (rc)
> + goto out_unlock;
> /*
> - * When altmap is in use, take the specified memory range
> - * offline, which includes the altmap.
> + * Set entire memory block CMMA state to nodat. Later, when
> + * page tables pages are allocated via __add_memory(), those
> + * regions are marked __arch_set_page_dat().
> */
> - if (arg->altmap_nr_pages) {
> - start = PFN_PHYS(arg->altmap_start_pfn);
> - size += PFN_PHYS(arg->altmap_nr_pages);
> + __arch_set_page_nodat((void *)__va(addr), block_size >> PAGE_SHIFT);
> + rc = __add_memory(0, addr, block_size,
> + mblock->memmap_on_memory ?
> + MHP_MEMMAP_ON_MEMORY | MHP_OFFLINE_INACCESSIBLE : MHP_NONE);
> + if (rc)
> + goto out_unlock;
Do we have to undo the state change?
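For illustration only, the kind of unwind in question can be sketched in user space with stubbed helpers. Nothing here is code from this series: struct model_block, model_change_state(), model_add_memory() and the failure injection are hypothetical stand-ins for sclp_mem_change_state() and __add_memory().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the state touched while configuring a block. */
struct model_block {
	bool assigned;	/* models the result of sclp_mem_change_state(.., 1) */
	bool added;	/* models a successful __add_memory() */
};

/* Stub for the state change; the model never fails here. */
int model_change_state(struct model_block *b, bool online)
{
	assert(b != NULL);
	b->assigned = online;
	return 0;
}

/* Stub for adding memory; `fail` injects an error for the sketch. */
int model_add_memory(struct model_block *b, bool fail)
{
	assert(b != NULL);
	if (fail)
		return -ENOMEM;
	b->added = true;
	return 0;
}

/* Configure path with the state change rolled back when adding fails. */
int model_configure(struct model_block *b, bool inject_add_failure)
{
	int rc;

	rc = model_change_state(b, true);
	if (rc)
		return rc;
	rc = model_add_memory(b, inject_add_failure);
	if (rc) {
		/* Undo the state change so the block stays deconfigured. */
		model_change_state(b, false);
		return rc;
	}
	return 0;
}
```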
> + mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(addr)));
> + put_device(&mem->dev);
> + WRITE_ONCE(mblock->config, 1);
> + } else {
> + if (!mblock->config)
> + goto out_unlock;
> + mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(addr)));
> + if (mem->state != MEM_OFFLINE) {
> + put_device(&mem->dev);
> + rc = -EBUSY;
> + goto out_unlock;
> }
> - sclp_mem_change_state(start, size, 0);
> - break;
> - default:
> - break;
> + /* drop the ref just got via find_memory_block() */
> + put_device(&mem->dev);
> + sclp_mem_change_state(addr, block_size, 0);
> + __remove_memory(addr, block_size);
> + WRITE_ONCE(mblock->config, 0);
> }
> - mutex_unlock(&sclp_mem_mutex);
> - return rc ? NOTIFY_BAD : NOTIFY_OK;
> +out_unlock:
> + unlock_device_hotplug();
> +out:
> + return rc ? rc : count;
> +}
> +
> +static ssize_t memmap_on_memory_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
> +{
> + struct mblock *mblock = container_of(kobj, struct mblock, kobj);
> +
> + return sysfs_emit(buf, "%u\n", READ_ONCE(mblock->memmap_on_memory));
> +}
> +
> +static ssize_t memmap_on_memory_store(struct kobject *kobj, struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + unsigned long block_size;
> + struct memory_block *mem;
> + struct mblock *mblock;
> + bool value;
> + int rc;
> +
> + rc = kstrtobool(buf, &value);
> + if (rc)
> + return rc;
> + rc = lock_device_hotplug_sysfs();
> + if (rc)
> + return rc;
> + block_size = memory_block_size_bytes();
> + mblock = container_of(kobj, struct mblock, kobj);
> + mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(mblock->id * block_size)));
> + if (!mem) {
> + WRITE_ONCE(mblock->memmap_on_memory, value);
> + } else {
> + put_device(&mem->dev);
> + rc = -EBUSY;
> + }
> + unlock_device_hotplug();
> + return rc ? rc : count;
> +}
> +
> +static void mblock_sysfs_release(struct kobject *kobj)
> +{
> + struct mblock *mblock = container_of(kobj, struct mblock, kobj);
> +
> + kfree(mblock);
> +}
> +
> +static const struct kobj_type ktype = {
> + .release = mblock_sysfs_release,
> + .sysfs_ops = &kobj_sysfs_ops,
> +};
> +
> +static struct kobj_attribute memmap_attr =
> + __ATTR(memmap_on_memory, 0644, memmap_on_memory_show, memmap_on_memory_store);
> +static struct kobj_attribute config_attr =
> + __ATTR(config, 0644, config_mblock_show, config_mblock_store);
> +
> +static struct attribute *mblock_attrs[] = {
> + &memmap_attr.attr,
> + &config_attr.attr,
> + NULL,
> +};
> +
> +static struct attribute_group mblock_attr_group = {
> + .attrs = mblock_attrs,
> +};
> +
> +static int create_mblock(struct mblock *mblock, struct kset *kset,
> + unsigned int id, bool config, bool memmap_on_memory)
> +{
> + int rc;
> +
> + mblock->memmap_on_memory = memmap_on_memory;
> + mblock->config = config;
> + mblock->id = id;
> + kobject_init(&mblock->kobj, &ktype);
> + rc = kobject_add(&mblock->kobj, &kset->kobj, "memory%d", id);
> + if (rc)
> + return rc;
> + rc = sysfs_create_group(&mblock->kobj, &mblock_attr_group);
> + if (rc)
> + kobject_put(&mblock->kobj);
> + return rc;
> +}
> +
> +/*
> + * Create /sys/firmware/memory/memoryX for boottime configured online memory
> + * blocks
> + */
> +static int create_online_mblock(struct memory_block *mem, void *argument)
"online" is confusing. It's "initial" / "configured". The same applies to the other
functions that mention "online".
> +{
> + struct memory_block_arg *arg;
> + struct mblock *mblocks;
> + struct kset *kset;
> + unsigned int id;
> +
> + id = mem->dev.id;
> + arg = (struct memory_block_arg *)argument;
> + mblocks = arg->mblocks;
> + kset = arg->kset;
> + return create_mblock(&mblocks[id], kset, id, true, false);
> +}
> +
> +static int __init create_initial_online_mblocks(struct mblock *mblocks, struct kset *kset)
> +{
> + struct memory_block_arg arg;
> +
> + arg.mblocks = mblocks;
> + arg.kset = kset;
> + return for_each_memory_block(&arg, create_online_mblock);
> +}
> +
> +static struct mblock * __init allocate_mblocks(void)
> +{
> + u64 max_mblocks;
Nit: why a u64? The block ids are "unsigned int id;"
> + u64 block_size;
> +
> + block_size = memory_block_size_bytes();
> + max_mblocks = roundup(sclp.rnmax * sclp.rzm, block_size) / block_size;
> + return kcalloc(max_mblocks, sizeof(struct mblock), GFP_KERNEL);
I think you should structure the code a bit differently, not splitting
the function up into tiny helpers.
static int __init init_sclp_mem(void)
{
	const u64 block_size = memory_block_size_bytes();
	const u64 max_mblocks = roundup(sclp.rnmax * sclp.rzm, block_size) / block_size;
	struct sclp_mem_arg arg;
	struct kset *kset;
	int rc;

	/* We'll allocate memory for all blocks ahead of time. */
	sclp_mem = kcalloc(max_mblocks, sizeof(struct mblock), GFP_KERNEL);
	if (!sclp_mem)
		return -ENOMEM;

	kset = kset_create_and_add("memory", NULL, firmware_kobj);
	if (!kset)
		return -ENOMEM;

	/* Initial memory is in the "configured" state already. */
	arg.sclp_mem = sclp_mem;
	arg.kset = kset;
	rc = for_each_memory_block(&arg, create_configured_sclp_mem);
	if (rc)
		return rc;

	/* Standby memory is "deconfigured". */
	return create_standby_sclp_mem(sclp_mem, kset);
}
Should still be quite readable.
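The block-count arithmetic used above, roundup(rnmax * rzm, block_size) / block_size, can be sketched in user space as follows. model_max_blocks() is a hypothetical helper, and the values in the usage note are made up for illustration rather than taken from a real machine.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of memory blocks needed to cover rnmax increments of rzm bytes
 * each, rounding a partial trailing block up to a full one. This is
 * equivalent to roundup(rnmax * rzm, block_size) / block_size.
 */
uint64_t model_max_blocks(uint64_t rnmax, uint64_t rzm, uint64_t block_size)
{
	uint64_t total = rnmax * rzm;

	assert(block_size != 0);
	return (total + block_size - 1) / block_size;
}
```

With 16 increments of 256M and 128M blocks this yields 32 blocks; with 3 increments of 64M it yields 2, because the trailing 64M still needs a block of its own.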
> }
>
> static struct notifier_block sclp_mem_nb = {
> @@ -264,14 +430,17 @@ static void __init align_to_block_size(unsigned long *start,
> *size = size_align;
> }
>
> -static void __init add_memory_merged(u16 rn)
> +static int __init create_standby_mblocks_merged(struct mblock *mblocks,
> + struct kset *kset, u16 rn)
> {
> unsigned long start, size, addr, block_size;
> static u16 first_rn, num;
> + unsigned int id;
> + int rc = 0;
>
> if (rn && first_rn && (first_rn + num == rn)) {
> num++;
> - return;
> + return rc;
> }
> if (!first_rn)
> goto skip_add;
> @@ -286,24 +455,31 @@ static void __init add_memory_merged(u16 rn)
> if (!size)
> goto skip_add;
> for (addr = start; addr < start + size; addr += block_size) {
> - add_memory(0, addr, block_size,
> - cpu_has_edat1() ?
> - MHP_MEMMAP_ON_MEMORY | MHP_OFFLINE_INACCESSIBLE : MHP_NONE);
> + id = addr / block_size;
> + rc = create_mblock(&mblocks[id], kset, id, false, mhp_supports_memmap_on_memory());
> + if (rc)
> + break;
> }
> skip_add:
> first_rn = rn;
> num = 1;
> + return rc;
> }
>
> -static void __init sclp_add_standby_memory(void)
> +static int __init create_standby_mblocks(struct mblock *mblocks, struct kset *kset)
> {
> struct memory_increment *incr;
> + int rc = 0;
>
> list_for_each_entry(incr, &sclp_mem_list, list) {
> if (incr->standby)
> - add_memory_merged(incr->rn);
> + rc = create_standby_mblocks_merged(mblocks, kset, incr->rn);
> + if (rc)
> + goto out;
> }
> - add_memory_merged(0);
> + rc = create_standby_mblocks_merged(mblocks, kset, 0);
> +out:
> + return rc;
> }
>
> static void __init insert_increment(u16 rn, int standby, int assigned)
> @@ -336,10 +512,12 @@ static void __init insert_increment(u16 rn, int standby, int assigned)
> list_add(&new_incr->list, prev);
> }
>
> -static int __init sclp_detect_standby_memory(void)
> +static int __init sclp_setup_memory(void)
> {
> struct read_storage_sccb *sccb;
> int i, id, assigned, rc;
> + struct mblock *mblocks;
> + struct kset *kset;
>
> /* No standby memory in kdump mode */
> if (oldmem_data.start)
Wouldn't we still want to create the ones for initial memory at least?
[...]
--
Cheers
David / dhildenb
* Re: [PATCH 2/4] s390/sclp: Add support for dynamic (de)configuration of memory
2025-10-07 20:07 ` David Hildenbrand
@ 2025-10-08 6:46 ` Sumanth Korikkar
2025-10-08 8:05 ` David Hildenbrand
0 siblings, 1 reply; 20+ messages in thread
From: Sumanth Korikkar @ 2025-10-08 6:46 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, linux-mm, LKML, linux-s390, Gerald Schaefer,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev
On Tue, Oct 07, 2025 at 10:07:43PM +0200, David Hildenbrand wrote:
> [...]
>
> > ---
> > drivers/s390/char/sclp_mem.c | 291 +++++++++++++++++++++++++++++------
> > 1 file changed, 241 insertions(+), 50 deletions(-)
> >
> > diff --git a/drivers/s390/char/sclp_mem.c b/drivers/s390/char/sclp_mem.c
> > index 27f49f5fd358..802439230294 100644
> > --- a/drivers/s390/char/sclp_mem.c
> > +++ b/drivers/s390/char/sclp_mem.c
> > @@ -9,9 +9,12 @@
> > #define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
> > #include <linux/cpufeature.h>
> > +#include <linux/container_of.h>
> > #include <linux/err.h>
> > #include <linux/errno.h>
> > #include <linux/init.h>
> > +#include <linux/kobject.h>
> > +#include <linux/kstrtox.h>
> > #include <linux/memory.h>
> > #include <linux/memory_hotplug.h>
> > #include <linux/mm.h>
> > @@ -27,7 +30,6 @@
> > #define SCLP_CMDW_ASSIGN_STORAGE 0x000d0001
> > #define SCLP_CMDW_UNASSIGN_STORAGE 0x000c0001
> > -static DEFINE_MUTEX(sclp_mem_mutex);
> > static LIST_HEAD(sclp_mem_list);
> > static u8 sclp_max_storage_id;
> > static DECLARE_BITMAP(sclp_storage_ids, 256);
> > @@ -38,6 +40,18 @@ struct memory_increment {
> > int standby;
> > };
> > +struct mblock {
> > + struct kobject kobj;
> > + unsigned int id;
> > + unsigned int memmap_on_memory;
> > + unsigned int config;
> > +};
> > +
> > +struct memory_block_arg {
> > + struct mblock *mblocks;
> > + struct kset *kset;
> > +};
>
> I would avoid using "memory_block_arg" as it reminds of core mm "struct memory_block".
>
> Similarly, I'd not call this "mblock".
>
> What about incorporating the "sclp" side of things?
>
> "struct sclp_mem" / "struct sclp_mem_arg"
>
> Nicely fits "sclp_mem.c" ;)
>
> Something like that might be better.
Sure. I will change it. Thanks
> > +
> > struct assign_storage_sccb {
> > struct sccb_header header;
> > u16 rn;
> > @@ -185,15 +199,11 @@ static int sclp_mem_notifier(struct notifier_block *nb,
> > {
> > unsigned long start, size;
> > struct memory_notify *arg;
> > - unsigned char id;
> > int rc = 0;
> > arg = data;
> > start = arg->start_pfn << PAGE_SHIFT;
> > size = arg->nr_pages << PAGE_SHIFT;
> > - mutex_lock(&sclp_mem_mutex);
> > - for_each_clear_bit(id, sclp_storage_ids, sclp_max_storage_id + 1)
> > - sclp_attach_storage(id);
> > switch (action) {
> > case MEM_GOING_OFFLINE:
> > /*
> > @@ -204,45 +214,201 @@ static int sclp_mem_notifier(struct notifier_block *nb,
> > if (contains_standby_increment(start, start + size))
> > rc = -EPERM;
> > break;
>
> Is there any reason this notifier is still needed? I'd assume we can just allow
> for offlining + re-onlining as we please now.
>
> In fact, I'd assume we can get rid of the notifier entirely now?
I was initially uncertain about the contains_standby_increment() use case
and didn't change it here. However, after testing with the
contains_standby_increment() checks removed, I observed that the common
memory hotplug code already prevents offlining a memory block that
contains holes. This ensures safety without relying on these checks.
c5e79ef561b0 ("mm/memory_hotplug.c: don't allow to online/offline memory blocks with holes")
e.g. #cp define storage 3504M standby 2148M
This leads to a configuration where memory block 27 contains both
assigned and standby increments.
But offlining it will not succeed:
chmem -d 0x00000000d8000000-0x00000000dfffffff
chmem: Memory Block 27 (0x00000000d8000000-0x00000000dfffffff) disable
failed: Invalid argument
Hence, I will remove it. Thanks.
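
For readers following the arithmetic, here is a small userspace sketch of why
block 27 ends up mixed. It assumes a 128 MiB memory block size (inferred from
the chmem range above, 0xd8000000-0xdfffffff) and that standby storage starts
right where the 3504M of assigned storage ends. The helper names block_id()
and block_is_mixed() are made up for illustration and are not kernel
functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)

/* Memory block id that covers a given physical address. */
static unsigned int block_id(uint64_t addr, uint64_t block_size)
{
	assert(block_size != 0);
	return (unsigned int)(addr / block_size);
}

/*
 * True if the assigned/standby boundary lies strictly inside a block,
 * i.e. that block holds both assigned and standby increments.
 * Assumption: standby storage begins exactly at 'boundary'.
 */
static bool block_is_mixed(uint64_t boundary, uint64_t block_size)
{
	assert(block_size != 0);
	return (boundary % block_size) != 0;
}
```

With these assumptions, block_id(0xd8000000, 128 * MiB) and
block_id(3504 * MiB, 128 * MiB) both give 27, and 3504M is not a multiple of
128M, so the boundary falls inside that block.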
> > - case MEM_PREPARE_ONLINE:
> > - /*
> > - * Access the altmap_start_pfn and altmap_nr_pages fields
> > - * within the struct memory_notify specifically when dealing
> > - * with only MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers.
> > - *
> > - * When altmap is in use, take the specified memory range
> > - * online, which includes the altmap.
> > - */
> > - if (arg->altmap_nr_pages) {
> > - start = PFN_PHYS(arg->altmap_start_pfn);
> > - size += PFN_PHYS(arg->altmap_nr_pages);
> > - }
> > - rc = sclp_mem_change_state(start, size, 1);
> > - if (rc || !arg->altmap_nr_pages)
> > - break;
> > - /*
> > - * Set CMMA state to nodat here, since the struct page memory
> > - * at the beginning of the memory block will not go through the
> > - * buddy allocator later.
> > - */
> > - __arch_set_page_nodat((void *)__va(start), arg->altmap_nr_pages);
> > + default:
> > break;
> > - case MEM_FINISH_OFFLINE:
> > + }
> > + return rc ? NOTIFY_BAD : NOTIFY_OK;
> > +}
> > +
> > +static ssize_t config_mblock_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
> > +{
> > + struct mblock *mblock = container_of(kobj, struct mblock, kobj);
> > +
> > + return sysfs_emit(buf, "%u\n", READ_ONCE(mblock->config));
> > +}
> > +
> > +static ssize_t config_mblock_store(struct kobject *kobj, struct kobj_attribute *attr,
> > + const char *buf, size_t count)
> > +{
> > + unsigned long long addr, block_size;
>
> "unsigned long" should be sufficient I'm sure :)
Left over. I will do so.
> > + struct memory_block *mem;
> > + struct mblock *mblock;
> > + unsigned char id;
> > + bool value;
> > + int rc;
> > +
> > + rc = kstrtobool(buf, &value);
> > + if (rc)
> > + return rc;
> > + mblock = container_of(kobj, struct mblock, kobj);
> > + block_size = memory_block_size_bytes();
> > + addr = mblock->id * block_size;
> > + /*
> > + * Hold device_hotplug_lock when adding/removing memory blocks.
> > + * Additionally, also protect calls to find_memory_block() and
> > + * sclp_attach_storage().
> > + */
> > + rc = lock_device_hotplug_sysfs();
> > + if (rc)
> > + goto out;
> > + for_each_clear_bit(id, sclp_storage_ids, sclp_max_storage_id + 1)
> > + sclp_attach_storage(id);
> > + if (value) {
> > + if (mblock->config)
> > + goto out_unlock;
> > + rc = sclp_mem_change_state(addr, block_size, 1);
> > + if (rc)
> > + goto out_unlock;
> > /*
> > - * When altmap is in use, take the specified memory range
> > - * offline, which includes the altmap.
> > + * Set entire memory block CMMA state to nodat. Later, when
> > + * page tables pages are allocated via __add_memory(), those
> > + * regions are marked __arch_set_page_dat().
> > */
> > - if (arg->altmap_nr_pages) {
> > - start = PFN_PHYS(arg->altmap_start_pfn);
> > - size += PFN_PHYS(arg->altmap_nr_pages);
> > + __arch_set_page_nodat((void *)__va(addr), block_size >> PAGE_SHIFT);
> > + rc = __add_memory(0, addr, block_size,
> > + mblock->memmap_on_memory ?
> > + MHP_MEMMAP_ON_MEMORY | MHP_OFFLINE_INACCESSIBLE : MHP_NONE);
> > + if (rc)
> > + goto out_unlock;
>
> Do we have to undo the state change?
The intention was to keep error handling simple. If add_memory() fails,
the state stays set to 1 (the storage is not given back). But a
subsequent configuration request for that block will not have an impact.
...
> > +static int create_mblock(struct mblock *mblock, struct kset *kset,
> > + unsigned int id, bool config, bool memmap_on_memory)
> > +{
> > + int rc;
> > +
> > + mblock->memmap_on_memory = memmap_on_memory;
> > + mblock->config = config;
> > + mblock->id = id;
> > + kobject_init(&mblock->kobj, &ktype);
> > + rc = kobject_add(&mblock->kobj, &kset->kobj, "memory%d", id);
> > + if (rc)
> > + return rc;
> > + rc = sysfs_create_group(&mblock->kobj, &mblock_attr_group);
> > + if (rc)
> > + kobject_put(&mblock->kobj);
> > + return rc;
> > +}
> > +
> > +/*
> > + * Create /sys/firmware/memory/memoryX for boottime configured online memory
> > + * blocks
> > + */
> > +static int create_online_mblock(struct memory_block *mem, void *argument)
>
> "online" is confusing. It's "initial" / "configured". Same applies to the other functions
> that mention "online".
Sure. I will change it.
> > +{
> > + struct memory_block_arg *arg;
> > + struct mblock *mblocks;
> > + struct kset *kset;
> > + unsigned int id;
> > +
> > + id = mem->dev.id;
> > + arg = (struct memory_block_arg *)argument;
> > + mblocks = arg->mblocks;
> > + kset = arg->kset;
> > + return create_mblock(&mblocks[id], kset, id, true, false);
> > +}
> > +
> > +static int __init create_initial_online_mblocks(struct mblock *mblocks, struct kset *kset)
> > +{
> > + struct memory_block_arg arg;
> > +
> > + arg.mblocks = mblocks;
> > + arg.kset = kset;
> > + return for_each_memory_block(&arg, create_online_mblock);
> > +}
> > +
> > +static struct mblock * __init allocate_mblocks(void)
> > +{
> > + u64 max_mblocks;
>
> Nit: why an u64? The block ids are "unsigned int id;"
Sure. I will correct it.
> > + u64 block_size;
> > +
> > + block_size = memory_block_size_bytes();
> > + max_mblocks = roundup(sclp.rnmax * sclp.rzm, block_size) / block_size;
> > + return kcalloc(max_mblocks, sizeof(struct mblock), GFP_KERNEL);
>
>
> I think you should structure the code a bit differently, not splitting
> the function up into tiny helpers.
>
> static int __init init_sclp_mem(void)
> {
> 	const u64 block_size = memory_block_size_bytes();
> 	const u64 max_mblocks = roundup(sclp.rnmax * sclp.rzm, block_size) / block_size;
> 	struct sclp_mem_arg arg;
> 	struct kset *kset;
> 	int rc;
>
> 	/* We'll allocate memory for all blocks ahead of time. */
> 	sclp_mem = kcalloc(max_mblocks, sizeof(struct mblock), GFP_KERNEL);
> 	if (!sclp_mem)
> 		return -ENOMEM;
>
> 	kset = kset_create_and_add("memory", NULL, firmware_kobj);
> 	if (!kset)
> 		return -ENOMEM;
>
> 	/* Initial memory is in the "configured" state already. */
> 	arg.sclp_mem = sclp_mem;
> 	arg.kset = kset;
> 	rc = for_each_memory_block(&arg, create_configured_sclp_mem);
> 	if (rc)
> 		return rc;
>
> 	/* Standby memory is "deconfigured". */
> 	return create_standby_sclp_mem(sclp_mem, kset);
> }
>
> Should still be quite readable.
Then, I'll make use of it.
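
As a side note on the array sizing in that function, the following userspace
sketch shows what roundup(sclp.rnmax * sclp.rzm, block_size) / block_size
evaluates to: the number of memory blocks needed to cover all storage
increments, rounded up. The function name max_blocks() and the sample values
in the note below are illustrative assumptions, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of memory blocks needed to cover rnmax increments of rzm bytes
 * each. Equivalent to roundup(total, block_size) / block_size, i.e. a
 * ceiling division.
 */
static uint64_t max_blocks(uint64_t rnmax, uint64_t rzm, uint64_t block_size)
{
	assert(block_size != 0);
	return (rnmax * rzm + block_size - 1) / block_size;
}
```

For example, with hypothetical 64 MiB increments and 128 MiB blocks, 3
increments (192 MiB) need 2 blocks, and 5 increments (320 MiB) need 3.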
...
> > -static int __init sclp_detect_standby_memory(void)
> > +static int __init sclp_setup_memory(void)
> > {
> > struct read_storage_sccb *sccb;
> > int i, id, assigned, rc;
> > + struct mblock *mblocks;
> > + struct kset *kset;
> > /* No standby memory in kdump mode */
> > if (oldmem_data.start)
>
> Wouldn't we still want to create the ones for initial memory at least?
The intention was the following:
configuration and deconfiguration of memory with optional
memmap-on-memory is mostly needed only for standby memory.
If standby memory is absent or sclp is unavailable, we continue using
the previous behavior (only software offline/online), since the sclp
memory notifier was not registered in that case before either.
Thank you David
* Re: [PATCH 2/4] s390/sclp: Add support for dynamic (de)configuration of memory
2025-10-08 6:46 ` Sumanth Korikkar
@ 2025-10-08 8:05 ` David Hildenbrand
0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-10-08 8:05 UTC (permalink / raw)
To: Sumanth Korikkar
Cc: Andrew Morton, linux-mm, LKML, linux-s390, Gerald Schaefer,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev
>> Is there any reason this notifier is still needed? I'd assume we can just allow
>> for offlining + re-onlining as we please now.
>>
>> In fact, I'd assume we can get rid of the notifier entirely now?
>
> I was initially uncertain about the contains_standby_increment() use case
> and didn't change it here. However, after testing with the
> contains_standby_increment() checks removed, I observed that the common
> memory hotplug code already prevents offlining a memory block that
> contains holes. This ensures safety without relying on these checks.
>
> c5e79ef561b0 ("mm/memory_hotplug.c: don't allow to online/offline memory blocks with holes")
Rings a bell :)
>
> e.g. #cp define storage 3504M standby 2148M
> This leads to a configuration where memory block 27 contains both
> assigned and standby increments.
>
> But offlining it will not succeed:
> chmem -d 0x00000000d8000000-0x00000000dfffffff
> chmem: Memory Block 27 (0x00000000d8000000-0x00000000dfffffff) disable
> failed: Invalid argument
>
> Hence, I will remove it. Thanks.
Cool!
>
>>> - case MEM_PREPARE_ONLINE:
>>> - /*
>>> - * Access the altmap_start_pfn and altmap_nr_pages fields
>>> - * within the struct memory_notify specifically when dealing
>>> - * with only MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers.
>>> - *
>>> - * When altmap is in use, take the specified memory range
>>> - * online, which includes the altmap.
>>> - */
>>> - if (arg->altmap_nr_pages) {
>>> - start = PFN_PHYS(arg->altmap_start_pfn);
>>> - size += PFN_PHYS(arg->altmap_nr_pages);
>>> - }
>>> - rc = sclp_mem_change_state(start, size, 1);
>>> - if (rc || !arg->altmap_nr_pages)
>>> - break;
>>> - /*
>>> - * Set CMMA state to nodat here, since the struct page memory
>>> - * at the beginning of the memory block will not go through the
>>> - * buddy allocator later.
>>> - */
>>> - __arch_set_page_nodat((void *)__va(start), arg->altmap_nr_pages);
>>> + default:
>>> break;
>>> - case MEM_FINISH_OFFLINE:
>>> + }
>>> + return rc ? NOTIFY_BAD : NOTIFY_OK;
>>> +}
>>> +
>>> +static ssize_t config_mblock_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
>>> +{
>>> + struct mblock *mblock = container_of(kobj, struct mblock, kobj);
>>> +
>>> + return sysfs_emit(buf, "%u\n", READ_ONCE(mblock->config));
>>> +}
>>> +
>>> +static ssize_t config_mblock_store(struct kobject *kobj, struct kobj_attribute *attr,
>>> + const char *buf, size_t count)
>>> +{
>>> + unsigned long long addr, block_size;
>>
>> "unsigned long" should be sufficient I'm sure :)
>
> Left over. I will do so.
>
>>> + struct memory_block *mem;
>>> + struct mblock *mblock;
>>> + unsigned char id;
>>> + bool value;
>>> + int rc;
>>> +
>>> + rc = kstrtobool(buf, &value);
>>> + if (rc)
>>> + return rc;
>>> + mblock = container_of(kobj, struct mblock, kobj);
>>> + block_size = memory_block_size_bytes();
>>> + addr = mblock->id * block_size;
>>> + /*
>>> + * Hold device_hotplug_lock when adding/removing memory blocks.
>>> + * Additionally, also protect calls to find_memory_block() and
>>> + * sclp_attach_storage().
>>> + */
>>> + rc = lock_device_hotplug_sysfs();
>>> + if (rc)
>>> + goto out;
>>> + for_each_clear_bit(id, sclp_storage_ids, sclp_max_storage_id + 1)
>>> + sclp_attach_storage(id);
>>> + if (value) {
>>> + if (mblock->config)
>>> + goto out_unlock;
>>> + rc = sclp_mem_change_state(addr, block_size, 1);
>>> + if (rc)
>>> + goto out_unlock;
>>> /*
>>> - * When altmap is in use, take the specified memory range
>>> - * offline, which includes the altmap.
>>> + * Set entire memory block CMMA state to nodat. Later, when
>>> + * page tables pages are allocated via __add_memory(), those
>>> + * regions are marked __arch_set_page_dat().
>>> */
>>> - if (arg->altmap_nr_pages) {
>>> - start = PFN_PHYS(arg->altmap_start_pfn);
>>> - size += PFN_PHYS(arg->altmap_nr_pages);
>>> + __arch_set_page_nodat((void *)__va(addr), block_size >> PAGE_SHIFT);
>>> + rc = __add_memory(0, addr, block_size,
>>> + mblock->memmap_on_memory ?
>>> + MHP_MEMMAP_ON_MEMORY | MHP_OFFLINE_INACCESSIBLE : MHP_NONE);
>>> + if (rc)
>>> + goto out_unlock;
>>
>> Do we have to undo the state change?
>
> The intention was to keep error handling simple. If add_memory() fails,
> the state stays set to 1 (the storage is not given back). But a
> subsequent configuration request for that block will not have an impact.
I mean, if we can cleanup easily here by doing another
sclp_mem_change_state(), I think we should just do that.
I'd assume that sclp_mem_change_state() to 0 will usually not fail (I
might be wrong :) ).
[...]
>>> -static int __init sclp_detect_standby_memory(void)
>>> +static int __init sclp_setup_memory(void)
>>> {
>>> struct read_storage_sccb *sccb;
>>> int i, id, assigned, rc;
>>> + struct mblock *mblocks;
>>> + struct kset *kset;
>>> /* No standby memory in kdump mode */
>>> if (oldmem_data.start)
>>
>> Wouldn't we still want to create the ones for initial memory at least?
>
> The intention was the following:
> configuration and deconfiguration of memory with optional
> memmap-on-memory is mostly needed only for standby memory.
>
> If standby memory is absent or sclp is unavailable, we continue using
> the previous behavior (only software offline/online), since the sclp
> memory notifier was not registered in that case before either.
I mean, probably nobody in the kdump kernel cares about it either way,
agreed.
--
Cheers
David / dhildenb
* [PATCH 3/4] s390/sclp: Remove MHP_OFFLINE_INACCESSIBLE
2025-09-26 13:15 [PATCH 0/4] Support dynamic (de)configuration of memory Sumanth Korikkar
2025-09-26 13:15 ` [PATCH 1/4] s390/mm: Support removal of boot-allocated virtual memory map Sumanth Korikkar
2025-09-26 13:15 ` [PATCH 2/4] s390/sclp: Add support for dynamic (de)configuration of memory Sumanth Korikkar
@ 2025-09-26 13:15 ` Sumanth Korikkar
2025-10-07 19:39 ` David Hildenbrand
2025-09-26 13:15 ` [PATCH 4/4] mm/memory_hotplug: Remove MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers Sumanth Korikkar
` (2 subsequent siblings)
5 siblings, 1 reply; 20+ messages in thread
From: Sumanth Korikkar @ 2025-09-26 13:15 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, linux-mm
Cc: LKML, linux-s390, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Sumanth Korikkar
The mhp_flag MHP_OFFLINE_INACCESSIBLE was used to mark memory as not
accessible until the memory hotplug online phase begins.
Earlier, standby memory blocks were added upfront at boot time and the
MHP_OFFLINE_INACCESSIBLE flag avoided page_init_poison() on the memmap
during the mhp addition phase.
However, with dynamic runtime configuration of memory, standby memory can
be brought to an accessible state before performing add_memory(). Hence,
remove MHP_OFFLINE_INACCESSIBLE.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
---
drivers/s390/char/sclp_mem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/s390/char/sclp_mem.c b/drivers/s390/char/sclp_mem.c
index 802439230294..f49b8457e721 100644
--- a/drivers/s390/char/sclp_mem.c
+++ b/drivers/s390/char/sclp_mem.c
@@ -267,7 +267,7 @@ static ssize_t config_mblock_store(struct kobject *kobj, struct kobj_attribute *
__arch_set_page_nodat((void *)__va(addr), block_size >> PAGE_SHIFT);
rc = __add_memory(0, addr, block_size,
mblock->memmap_on_memory ?
- MHP_MEMMAP_ON_MEMORY | MHP_OFFLINE_INACCESSIBLE : MHP_NONE);
+ MHP_MEMMAP_ON_MEMORY : MHP_NONE);
if (rc)
goto out_unlock;
mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(addr)));
--
2.48.1
* Re: [PATCH 3/4] s390/sclp: Remove MHP_OFFLINE_INACCESSIBLE
2025-09-26 13:15 ` [PATCH 3/4] s390/sclp: Remove MHP_OFFLINE_INACCESSIBLE Sumanth Korikkar
@ 2025-10-07 19:39 ` David Hildenbrand
0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-10-07 19:39 UTC (permalink / raw)
To: Sumanth Korikkar, Andrew Morton, linux-mm
Cc: LKML, linux-s390, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev
On 26.09.25 15:15, Sumanth Korikkar wrote:
> mhp_flag MHP_OFFLINE_INACCESSIBLE was used to mark memory as not
> accessible until memory hotplug online phase begins.
>
> Earlier, standby memory blocks were added upfront during boottime and
> MHP_OFFLINE_INACCESSIBLE flag avoided page_init_poison() on memmap
> during mhp addition phase.
>
> However with dynamic runtime configuration of memory, standby memory can
> be brought to accessible state before performing add_memory(). Hence,
> remove MHP_OFFLINE_INACCESSIBLE.
>
> Acked-by: Heiko Carstens <hca@linux.ibm.com>
> Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
> ---
Reviewed-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
* [PATCH 4/4] mm/memory_hotplug: Remove MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers
2025-09-26 13:15 [PATCH 0/4] Support dynamic (de)configuration of memory Sumanth Korikkar
` (2 preceding siblings ...)
2025-09-26 13:15 ` [PATCH 3/4] s390/sclp: Remove MHP_OFFLINE_INACCESSIBLE Sumanth Korikkar
@ 2025-09-26 13:15 ` Sumanth Korikkar
2025-10-07 14:30 ` [PATCH 0/4] Support dynamic (de)configuration of memory Sumanth Korikkar
2025-10-07 16:11 ` David Hildenbrand
5 siblings, 0 replies; 20+ messages in thread
From: Sumanth Korikkar @ 2025-09-26 13:15 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, linux-mm
Cc: LKML, linux-s390, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Sumanth Korikkar
MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers were introduced
to prepare the transition of memory to and from a physically accessible
state. This enhancement was crucial for implementing the "memmap on memory"
feature for s390.
With the introduction of dynamic (de)configuration of hotpluggable
memory, memory can be brought to an accessible state before add_memory()
and to an inaccessible state before remove_memory(). Hence, the
MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers are no longer
needed.
This basically reverts commit
c5f1e2d18909 ("mm/memory_hotplug: introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers")
Additionally, apply minor adjustments to the function parameters of
move_pfn_range_to_zone() and mhp_supports_memmap_on_memory() to ensure
compatibility with the latest branch.
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
---
drivers/base/memory.c | 23 +----------------------
include/linux/memory.h | 9 ---------
include/linux/memory_hotplug.h | 18 +-----------------
include/linux/memremap.h | 1 -
mm/memory_hotplug.c | 17 +++--------------
mm/sparse.c | 3 +--
6 files changed, 6 insertions(+), 65 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 5c6c1d6bb59f..67a41575ac77 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -226,7 +226,6 @@ static int memory_block_online(struct memory_block *mem)
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
unsigned long nr_vmemmap_pages = 0;
- struct memory_notify arg;
struct zone *zone;
int ret;
@@ -246,19 +245,9 @@ static int memory_block_online(struct memory_block *mem)
if (mem->altmap)
nr_vmemmap_pages = mem->altmap->free;
- arg.altmap_start_pfn = start_pfn;
- arg.altmap_nr_pages = nr_vmemmap_pages;
- arg.start_pfn = start_pfn + nr_vmemmap_pages;
- arg.nr_pages = nr_pages - nr_vmemmap_pages;
mem_hotplug_begin();
- ret = memory_notify(MEM_PREPARE_ONLINE, &arg);
- ret = notifier_to_errno(ret);
- if (ret)
- goto out_notifier;
-
if (nr_vmemmap_pages) {
- ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages,
- zone, mem->altmap->inaccessible);
+ ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
if (ret)
goto out;
}
@@ -280,11 +269,7 @@ static int memory_block_online(struct memory_block *mem)
nr_vmemmap_pages);
mem->zone = zone;
- mem_hotplug_done();
- return ret;
out:
- memory_notify(MEM_FINISH_OFFLINE, &arg);
-out_notifier:
mem_hotplug_done();
return ret;
}
@@ -297,7 +282,6 @@ static int memory_block_offline(struct memory_block *mem)
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
unsigned long nr_vmemmap_pages = 0;
- struct memory_notify arg;
int ret;
if (!mem->zone)
@@ -329,11 +313,6 @@ static int memory_block_offline(struct memory_block *mem)
mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
mem->zone = NULL;
- arg.altmap_start_pfn = start_pfn;
- arg.altmap_nr_pages = nr_vmemmap_pages;
- arg.start_pfn = start_pfn + nr_vmemmap_pages;
- arg.nr_pages = nr_pages - nr_vmemmap_pages;
- memory_notify(MEM_FINISH_OFFLINE, &arg);
out:
mem_hotplug_done();
return ret;
diff --git a/include/linux/memory.h b/include/linux/memory.h
index 40eb70ccb09d..e42534b5c5ec 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -96,17 +96,8 @@ int set_memory_block_size_order(unsigned int order);
#define MEM_GOING_ONLINE (1<<3)
#define MEM_CANCEL_ONLINE (1<<4)
#define MEM_CANCEL_OFFLINE (1<<5)
-#define MEM_PREPARE_ONLINE (1<<6)
-#define MEM_FINISH_OFFLINE (1<<7)
struct memory_notify {
- /*
- * The altmap_start_pfn and altmap_nr_pages fields are designated for
- * specifying the altmap range and are exclusively intended for use in
- * MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers.
- */
- unsigned long altmap_start_pfn;
- unsigned long altmap_nr_pages;
unsigned long start_pfn;
unsigned long nr_pages;
};
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 23f038a16231..f2f16cdd73ee 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -58,22 +58,6 @@ typedef int __bitwise mhp_t;
* implies the node id (nid).
*/
#define MHP_NID_IS_MGID ((__force mhp_t)BIT(2))
-/*
- * The hotplugged memory is completely inaccessible while the memory is
- * offline. The memory provider will handle MEM_PREPARE_ONLINE /
- * MEM_FINISH_OFFLINE notifications and make the memory accessible.
- *
- * This flag is only relevant when used along with MHP_MEMMAP_ON_MEMORY,
- * because the altmap cannot be written (e.g., poisoned) when adding
- * memory -- before it is set online.
- *
- * This allows for adding memory with an altmap that is not currently
- * made available by a hypervisor. When onlining that memory, the
- * hypervisor can be instructed to make that memory available, and
- * the onlining phase will not require any memory allocations, which is
- * helpful in low-memory situations.
- */
-#define MHP_OFFLINE_INACCESSIBLE ((__force mhp_t)BIT(3))
/*
* Extended parameters for memory hotplug:
@@ -123,7 +107,7 @@ extern void adjust_present_page_count(struct page *page,
long nr_pages);
/* VM interface that may be used by firmware interface */
extern int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
- struct zone *zone, bool mhp_off_inaccessible);
+ struct zone *zone);
extern void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages);
extern int online_pages(unsigned long pfn, unsigned long nr_pages,
struct zone *zone, struct memory_group *group);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 4aa151914eab..7467035d4f29 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -25,7 +25,6 @@ struct vmem_altmap {
unsigned long free;
unsigned long align;
unsigned long alloc;
- bool inaccessible;
};
/*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 74318c787715..db95933daa4c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1088,7 +1088,7 @@ void adjust_present_page_count(struct page *page, struct memory_group *group,
}
int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
- struct zone *zone, bool mhp_off_inaccessible)
+ struct zone *zone)
{
unsigned long end_pfn = pfn + nr_pages;
int ret, i;
@@ -1097,15 +1097,6 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
if (ret)
return ret;
- /*
- * Memory block is accessible at this stage and hence poison the struct
- * pages now. If the memory block is accessible during memory hotplug
- * addition phase, then page poisining is already performed in
- * sparse_add_section().
- */
- if (mhp_off_inaccessible)
- page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages);
-
move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE,
false);
@@ -1444,7 +1435,7 @@ static void remove_memory_blocks_and_altmaps(u64 start, u64 size)
}
static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
- u64 start, u64 size, mhp_t mhp_flags)
+ u64 start, u64 size)
{
unsigned long memblock_size = memory_block_size_bytes();
u64 cur_start;
@@ -1460,8 +1451,6 @@ static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
};
mhp_altmap.free = memory_block_memmap_on_memory_pages();
- if (mhp_flags & MHP_OFFLINE_INACCESSIBLE)
- mhp_altmap.inaccessible = true;
params.altmap = kmemdup(&mhp_altmap, sizeof(struct vmem_altmap),
GFP_KERNEL);
if (!params.altmap) {
@@ -1547,7 +1536,7 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
*/
if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
mhp_supports_memmap_on_memory()) {
- ret = create_altmaps_and_memory_blocks(nid, group, start, size, mhp_flags);
+ ret = create_altmaps_and_memory_blocks(nid, group, start, size);
if (ret)
goto error;
} else {
diff --git a/mm/sparse.c b/mm/sparse.c
index e6075b622407..24323122f6cb 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -951,8 +951,7 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
* Poison uninitialized struct pages in order to catch invalid flags
* combinations.
*/
- if (!altmap || !altmap->inaccessible)
- page_init_poison(memmap, sizeof(struct page) * nr_pages);
+ page_init_poison(memmap, sizeof(struct page) * nr_pages);
ms = __nr_to_section(section_nr);
set_section_nid(section_nr, nid);
--
2.48.1
* Re: [PATCH 0/4] Support dynamic (de)configuration of memory
2025-09-26 13:15 [PATCH 0/4] Support dynamic (de)configuration of memory Sumanth Korikkar
` (3 preceding siblings ...)
2025-09-26 13:15 ` [PATCH 4/4] mm/memory_hotplug: Remove MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers Sumanth Korikkar
@ 2025-10-07 14:30 ` Sumanth Korikkar
2025-10-07 16:02 ` David Hildenbrand
2025-10-07 16:11 ` David Hildenbrand
5 siblings, 1 reply; 20+ messages in thread
From: Sumanth Korikkar @ 2025-10-07 14:30 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, linux-mm
Cc: LKML, linux-s390, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev
On Fri, Sep 26, 2025 at 03:15:23PM +0200, Sumanth Korikkar wrote:
> Hi,
>
> Patchset provides a new interface for dynamic configuration and
> deconfiguration of hotplug memory on s390, allowing with/without
> memmap_on_memory support. It is a follow up on the discussion with David
> when introducing memmap_on_memory support for s390 and support dynamic
> (de)configuration of memory:
> https://lore.kernel.org/all/ee492da8-74b4-4a97-8b24-73e07257f01d@redhat.com/
> https://lore.kernel.org/all/20241202082732.3959803-1-sumanthk@linux.ibm.com/
>
> The original motivation for introducing memmap_on_memory on s390 was to
> avoid using online memory to store struct pages metadata, particularly
> for standby memory blocks. This became critical in cases where there was
> an imbalance between standby and online memory, potentially leading to
> boot failures due to insufficient memory for metadata allocation.
>
> To address this, memmap_on_memory was utilized on s390. However, in its
> current form, it adds struct pages metadata at the start of each memory
> block at the time of addition (only standby memory), and this
> configuration is static. It cannot be changed at runtime (When the user
> needs continuous physical memory).
>
> Inorder to provide more flexibility to the user and overcome the above
> limitation, add an option to dynamically configure and deconfigure
> hotpluggable memory block with/without memmap_on_memory.
>
> With the new interface, s390 will not add all possible hotplug memory in
> advance, like before, to make it visible in sysfs for online/offline
> actions. Instead, before memory block can be set online, it has to be
> configured via a new interface in /sys/firmware/memory/memoryX/config,
> which makes s390 similar to others. i.e. Adding of hotpluggable memory is
> controlled by the user instead of adding it at boottime.
Hi David,
Looking forward to your feedback to proceed further.
Thank you,
Sumanth
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/4] Support dynamic (de)configuration of memory
2025-10-07 14:30 ` [PATCH 0/4] Support dynamic (de)configuration of memory Sumanth Korikkar
@ 2025-10-07 16:02 ` David Hildenbrand
0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-10-07 16:02 UTC (permalink / raw)
To: Sumanth Korikkar, Andrew Morton, linux-mm
Cc: LKML, linux-s390, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev
On 07.10.25 16:30, Sumanth Korikkar wrote:
> On Fri, Sep 26, 2025 at 03:15:23PM +0200, Sumanth Korikkar wrote:
>> Hi,
>>
>> Patchset provides a new interface for dynamic configuration and
>> deconfiguration of hotplug memory on s390, allowing with/without
>> memmap_on_memory support. It is a follow up on the discussion with David
>> when introducing memmap_on_memory support for s390 and support dynamic
>> (de)configuration of memory:
>> https://lore.kernel.org/all/ee492da8-74b4-4a97-8b24-73e07257f01d@redhat.com/
>> https://lore.kernel.org/all/20241202082732.3959803-1-sumanthk@linux.ibm.com/
>>
>> The original motivation for introducing memmap_on_memory on s390 was to
>> avoid using online memory to store struct pages metadata, particularly
>> for standby memory blocks. This became critical in cases where there was
>> an imbalance between standby and online memory, potentially leading to
>> boot failures due to insufficient memory for metadata allocation.
>>
>> To address this, memmap_on_memory was utilized on s390. However, in its
>> current form, it adds struct pages metadata at the start of each memory
>> block at the time of addition (only standby memory), and this
>> configuration is static. It cannot be changed at runtime (When the user
>> needs continuous physical memory).
>>
>> Inorder to provide more flexibility to the user and overcome the above
>> limitation, add an option to dynamically configure and deconfigure
>> hotpluggable memory block with/without memmap_on_memory.
>>
>> With the new interface, s390 will not add all possible hotplug memory in
>> advance, like before, to make it visible in sysfs for online/offline
>> actions. Instead, before memory block can be set online, it has to be
>> configured via a new interface in /sys/firmware/memory/memoryX/config,
>> which makes s390 similar to others. i.e. Adding of hotpluggable memory is
>> controlled by the user instead of adding it at boottime.
>
> Hi David,
>
> Looking forward to your feedback to proceed further.
Thanks for bumping it up in my inbox, will comment today :)
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/4] Support dynamic (de)configuration of memory
2025-09-26 13:15 [PATCH 0/4] Support dynamic (de)configuration of memory Sumanth Korikkar
` (4 preceding siblings ...)
2025-10-07 14:30 ` [PATCH 0/4] Support dynamic (de)configuration of memory Sumanth Korikkar
@ 2025-10-07 16:11 ` David Hildenbrand
2025-10-07 17:56 ` Sumanth Korikkar
5 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2025-10-07 16:11 UTC (permalink / raw)
To: Sumanth Korikkar, Andrew Morton, linux-mm
Cc: LKML, linux-s390, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev
On 26.09.25 15:15, Sumanth Korikkar wrote:
> Hi,
Hi,
>
> Patchset provides a new interface for dynamic configuration and
> deconfiguration of hotplug memory on s390, allowing with/without
> memmap_on_memory support. It is a follow up on the discussion with David
> when introducing memmap_on_memory support for s390 and support dynamic
> (de)configuration of memory:
> https://lore.kernel.org/all/ee492da8-74b4-4a97-8b24-73e07257f01d@redhat.com/
> https://lore.kernel.org/all/20241202082732.3959803-1-sumanthk@linux.ibm.com/
>
> The original motivation for introducing memmap_on_memory on s390 was to
> avoid using online memory to store struct pages metadata, particularly
> for standby memory blocks. This became critical in cases where there was
> an imbalance between standby and online memory, potentially leading to
> boot failures due to insufficient memory for metadata allocation.
>
> To address this, memmap_on_memory was utilized on s390. However, in its
> current form, it adds struct pages metadata at the start of each memory
> block at the time of addition (only standby memory), and this
> configuration is static. It cannot be changed at runtime (When the user
> needs continuous physical memory).
>
> Inorder to provide more flexibility to the user and overcome the above
> limitation, add an option to dynamically configure and deconfigure
> hotpluggable memory block with/without memmap_on_memory.
This will cleanly add/remove the memory, including the directmap and
other tracking data, so I like it.
>
> With the new interface, s390 will not add all possible hotplug memory in
> advance, like before, to make it visible in sysfs for online/offline
> actions. Instead, before memory block can be set online, it has to be
> configured via a new interface in /sys/firmware/memory/memoryX/config,
> which makes s390 similar to others. i.e. Adding of hotpluggable memory is
> controlled by the user instead of adding it at boottime.
Before I dig into the details, will onlining/offlining still trigger
hypervisor action, or does that now really happen when memory is
added/removed?
That would be really nice, because it would remove the whole need for
"standby" memory, and having to treat hotplugged memory differently
under LPAR/z/VM than anywhere else (-> keep it offline).
>
> s390 kernel sysfs interface to configure/deconfigure memory with
> memmap_on_memory (with upcoming lsmem changes):
>
> * Initial memory layout:
> lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
> RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
> 0x00000000-0x7fffffff 2G online 0-15 yes no
> 0x80000000-0xffffffff 2G offline 16-31 no yes
Could we instead modify "STATE" to reflect that it is "not added" / "not
configured" / "disabled" etc?
Like
lsmem -o RANGE,SIZE,STATE,BLOCK,MEMMAP_ON_MEMORY
RANGE SIZE STATE BLOCK
0x00000000-0x7fffffff 2G online 0-15
0x80000000-0xffffffff 2G disabled 16-31
Or is that an attempt to maintain backwards compatibility?
>
> * Configure memory
> echo 1 > /sys/firmware/memory/memory16/config
The granularity here is also memory_block_size_bytes(), correct?
> lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
> RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
> 0x00000000-0x7fffffff 2G online 0-15 yes no
> 0x80000000-0x87ffffff 128M offline 16 yes yes
> 0x88000000-0xffffffff 1.9G offline 17-31 no yes
>
> * Deconfigure memory
> echo 0 > /sys/firmware/memory/memory16/config
> lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
> RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
> 0x00000000-0x7fffffff 2G online 0-15 yes no
> 0x80000000-0xffffffff 2G offline 16-31 no yes
>
> * Enable memmap_on_memory and online it.
> (Deconfigure first)
> echo 0 > /sys/devices/system/memory/memory5/online
> echo 0 > /sys/firmware/memory/memory5/config
>
> lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
> RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
> 0x00000000-0x27ffffff 640M online 0-4 yes no
> 0x28000000-0x2fffffff 128M offline 5 no no
> 0x30000000-0x7fffffff 1.3G online 6-15 yes no
> 0x80000000-0xffffffff 2G offline 16-31 no yes
>
> (Enable memmap_on_memory and online it)
> echo 1 > /sys/firmware/memory/memory5/memmap_on_memory
> echo 1 > /sys/firmware/memory/memory5/config
> echo 1 > /sys/devices/system/memory/memory5/online
I guess the use for memmap_on_memory would now be limited to making
hotplug more likely to succeed in OOM scenarios.
>
> lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
> RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
> 0x00000000-0x27ffffff 640M online 0-4 yes no
> 0x28000000-0x2fffffff 128M online 5 yes yes
> 0x30000000-0x7fffffff 1.3G online 6-15 yes no
> 0x80000000-0xffffffff 2G offline 16-31 no yes
>
> * Disable memmap_on_memory and online it.
> (Deconfigure first)
> echo 0 > /sys/devices/system/memory/memory5/online
> echo 0 > /sys/firmware/memory/memory5/config
>
> lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
> RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
> 0x00000000-0x27ffffff 640M online 0-4 yes no
> 0x28000000-0x2fffffff 128M offline 5 no yes
> 0x30000000-0x7fffffff 1.3G online 6-15 yes no
> 0x80000000-0xffffffff 2G offline 16-31 no yes
>
> (Disable memmap_on_memory and online it)
> echo 0 > /sys/firmware/memory/memory5/memmap_on_memory
> echo 1 > /sys/firmware/memory/memory5/config
> echo 1 > /sys/devices/system/memory/memory5/online
>
> lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
> RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
> 0x00000000-0x7fffffff 2G online 0-15 yes no
> 0x80000000-0xffffffff 2G offline 16-31 no yes
>
> * Userspace changes:
> lsmem/chmem tool is also changed to use the new interface. I will send
> it to util-linux soon.
>
> Patch 1 adds support for removal of boot-allocated memory blocks.
>
> Patch 2 provides option to dynamically configure and deconfigure memory
> with/without memmap_on_memory.
>
> Patch 3 removes MHP_OFFLINE_INACCESSIBLE from s390. The mhp flag was
> used to mark memory as not accessible until memory hotplug online phase
> begins. However, with patch 2, it is no longer essential. Memory can be
> brought to accessible state before adding memory, as the memory is added
> during runttime now instead of boottime.
Nice.
>
> Patch 4 removes the MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers. It
> is no longer needed. Memory can be brought to accessible state before
> adding memory now, with runtime (de)configuration of memory.
Nice.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/4] Support dynamic (de)configuration of memory
2025-10-07 16:11 ` David Hildenbrand
@ 2025-10-07 17:56 ` Sumanth Korikkar
2025-10-07 19:35 ` David Hildenbrand
0 siblings, 1 reply; 20+ messages in thread
From: Sumanth Korikkar @ 2025-10-07 17:56 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, linux-mm, LKML, linux-s390, Gerald Schaefer,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev
> > With the new interface, s390 will not add all possible hotplug memory in
> > advance, like before, to make it visible in sysfs for online/offline
> > actions. Instead, before memory block can be set online, it has to be
> > configured via a new interface in /sys/firmware/memory/memoryX/config,
> > which makes s390 similar to others. i.e. Adding of hotpluggable memory is
> > controlled by the user instead of adding it at boottime.
>
> Before I dig into the details, will onlining/offling still trigger
> hypervisor action, or does that now really happen when memory is
> added/removed?
>
> That would be really nice, because it would remove the whole need for
> "standby" memory, and having to treat hotplugged memory differently under
> LPAR/z/VM than anywhere else (-> keep it offline).
With this approach, hypervisor actions are triggered only when memory is
actually added or removed.
Online and offline operations are then plain common-code memory hotplug
actions, and the s390 memory notifier does little or nothing.
> > s390 kernel sysfs interface to configure/deconfigure memory with
> > memmap_on_memory (with upcoming lsmem changes):
> > * Initial memory layout:
> > lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
> > RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
> > 0x00000000-0x7fffffff 2G online 0-15 yes no
> > 0x80000000-0xffffffff 2G offline 16-31 no yes
>
> Could we instead modify "STATE" to reflect that it is "not added" / "not
> configured" / "disabled" etc?
>
> Like
>
> lsmem -o RANGE,SIZE,STATE,BLOCK,MEMMAP_ON_MEMORY
> RANGE SIZE STATE BLOCK
> 0x00000000-0x7fffffff 2G online 0-15
> 0x80000000-0xffffffff 2G disabled 16-31
>
> Or is that an attempt to maintain backwards compatibility?
Mostly. It is also similar to the lscpu output, where the CPU status is
shown in CONFIGURED/STATE columns.
Also, older scripts that list offline memory typically use:
lsmem | grep offline
and
chmem -e <SIZE> would work as usual, where <SIZE> specifies amount of
memory to set online.
chmem changes would look like:
chmem -c 128M -m 1 : configure memory with memmap-on-memory enabled
chmem -g 128M : deconfigure memory
chmem -e 128M : optionally configure (if supported by architecture) and
always online memory
chmem -d 128M : offline and optionally deconfigure memory (if supported
by architecture)
> > * Configure memory
> > echo 1 > /sys/firmware/memory/memory16/config
>
> The granularity here is also memory_block_size_bytes(), correct?
Yes, correct.
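Since the granularity is one memory block, a block number maps directly to a physical address range. The following is a minimal sketch of that arithmetic, assuming the 128 MiB block size visible in the lsmem examples; `block_range` is a hypothetical helper name, and on a real system the block size would be read from /sys/devices/system/memory/block_size_bytes rather than hard-coded.

```shell
# Map a memory block number to its physical address range.
# $1: block number
# $2: block size in MiB (assumed default: 128, as in the lsmem examples)
block_range() {
    block=$1
    size_mib=${2:-128}
    size=$((size_mib * 1024 * 1024))
    start=$((block * size))
    end=$((start + size - 1))
    printf '0x%08x-0x%08x\n' "$start" "$end"
}

block_range 16   # 0x80000000-0x87ffffff (memory16)
block_range 5    # 0x28000000-0x2fffffff (memory5)
```

Both results match the RANGE column that lsmem reports for blocks 16 and 5 in the cover letter.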
> > lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
> > RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
> > 0x00000000-0x7fffffff 2G online 0-15 yes no
> > 0x80000000-0x87ffffff 128M offline 16 yes yes
> > 0x88000000-0xffffffff 1.9G offline 17-31 no yes
> >
> > * Deconfigure memory
> > echo 0 > /sys/firmware/memory/memory16/config
> > lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
> > RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
> > 0x00000000-0x7fffffff 2G online 0-15 yes no
> > 0x80000000-0xffffffff 2G offline 16-31 no yes
> >
> > * Enable memmap_on_memory and online it.
> > (Deconfigure first)
> > echo 0 > /sys/devices/system/memory/memory5/online
> > echo 0 > /sys/firmware/memory/memory5/config
> >
> > lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
> > RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
> > 0x00000000-0x27ffffff 640M online 0-4 yes no
> > 0x28000000-0x2fffffff 128M offline 5 no no
> > 0x30000000-0x7fffffff 1.3G online 6-15 yes no
> > 0x80000000-0xffffffff 2G offline 16-31 no yes
> >
> > (Enable memmap_on_memory and online it)
> > echo 1 > /sys/firmware/memory/memory5/memmap_on_memory
> > echo 1 > /sys/firmware/memory/memory5/config
> > echo 1 > /sys/devices/system/memory/memory5/online
>
> I guess the use for memmap_on_memory would now be limited to making hotplug
> more likely to succeed in OOM scenarios.
Yes, memmap-on-memory mainly helps in OOM situations.
However, it also gives the user the flexibility to configure some
memory blocks with memmap-on-memory enabled and some with it disabled
(when the user needs contiguous physical memory across memory blocks).
> > Patch 4 removes the MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers. It
> > is no longer needed. Memory can be brought to accessible state before
> > adding memory now, with runtime (de)configuration of memory.
>
> Nice.
Thank you David
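The deconfigure/reconfigure sequence quoted above could be wrapped in a small helper. This is only a sketch: `reconfigure_block` and the `SYSFS_ROOT` override are hypothetical names introduced for illustration, and the /sys/firmware/memory/memoryX attributes are the interface proposed in this series, not an existing ABI.

```shell
# SYSFS_ROOT is a hypothetical override so the sketch can be tried against
# a scratch directory; on a real system it would stay empty.
SYSFS_ROOT=${SYSFS_ROOT:-}

# Re-add one memory block with the requested memmap_on_memory setting.
# $1: block number
# $2: 1 to enable memmap_on_memory, 0 to disable it
reconfigure_block() {
    fw="$SYSFS_ROOT/sys/firmware/memory/memory$1"
    dev="$SYSFS_ROOT/sys/devices/system/memory/memory$1"

    # Offline and deconfigure first, as in the cover letter example.
    echo 0 > "$dev/online" || return 1
    echo 0 > "$fw/config" || return 1

    # Choose the memmap placement, then configure and online the block.
    echo "$2" > "$fw/memmap_on_memory" || return 1
    echo 1 > "$fw/config" || return 1
    echo 1 > "$dev/online" || return 1
}
```

Calling `reconfigure_block 5 1` would correspond to the "Enable memmap_on_memory and online it" steps, and `reconfigure_block 5 0` to the "Disable" variant. The helper stops at the first failing write so that a block is never configured with a setting that could not be applied.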
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/4] Support dynamic (de)configuration of memory
2025-10-07 17:56 ` Sumanth Korikkar
@ 2025-10-07 19:35 ` David Hildenbrand
2025-10-08 6:05 ` Sumanth Korikkar
0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2025-10-07 19:35 UTC (permalink / raw)
To: Sumanth Korikkar
Cc: Andrew Morton, linux-mm, LKML, linux-s390, Gerald Schaefer,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev
On 07.10.25 19:56, Sumanth Korikkar wrote:
>>> With the new interface, s390 will not add all possible hotplug memory in
>>> advance, like before, to make it visible in sysfs for online/offline
>>> actions. Instead, before memory block can be set online, it has to be
>>> configured via a new interface in /sys/firmware/memory/memoryX/config,
>>> which makes s390 similar to others. i.e. Adding of hotpluggable memory is
>>> controlled by the user instead of adding it at boottime.
>>
>> Before I dig into the details, will onlining/offling still trigger
>> hypervisor action, or does that now really happen when memory is
>> added/removed?
>>
>> That would be really nice, because it would remove the whole need for
>> "standby" memory, and having to treat hotplugged memory differently under
>> LPAR/z/VM than anywhere else (-> keep it offline).
>
> With this approach, hypervisor actions are triggered only when memory is
> actually added or removed.
>
> Online and offline operations are common code memory hotplug actions and
> the s390 memory notifier actions are none/minimal.
Very nice.
>
>>> s390 kernel sysfs interface to configure/deconfigure memory with
>>> memmap_on_memory (with upcoming lsmem changes):
>>> * Initial memory layout:
>>> lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
>>> RANGE SIZE STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
>>> 0x00000000-0x7fffffff 2G online 0-15 yes no
>>> 0x80000000-0xffffffff 2G offline 16-31 no yes
>>
>> Could we instead modify "STATE" to reflect that it is "not added" / "not
>> configured" / "disabled" etc?
>>
>> Like
>>
>> lsmem -o RANGE,SIZE,STATE,BLOCK,MEMMAP_ON_MEMORY
>> RANGE SIZE STATE BLOCK
>> 0x00000000-0x7fffffff 2G online 0-15
>> 0x80000000-0xffffffff 2G disabled 16-31
>>
>> Or is that an attempt to maintain backwards compatibility?
>
> Mostly. Also, similar to lscpu output, where CPU status shows
> CONFIGURED/STATE column.
Care to share an example output? I only have an s390x VM with 2 CPUs and
no way to configure/deconfigure.
>
> Also, older scripts to get list of offline memory typically use:
> lsmem | grep offline
>
> and
>
> chmem -e <SIZE> would work as usual, where <SIZE> specifies amount of
> memory to set online.
>
> chmem changes would look like:
> chmem -c 128M -m 1 : configure memory with memmap-on-memory enabled
> chmem -g 128M : deconfigure memory
I wonder if the above two are really required. I would expect most/all
users to simply keep using -e / -d.
Sure, there might be some corner cases, but I would assume most people
will not want to care about memmap-on-memory with the new model.
> chmem -e 128M : optionally configure (if supported by architecture) and
> always online memory
> chmem -d 128M : offline and optionally deconfigure memory (if supported
> by architecture)
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/4] Support dynamic (de)configuration of memory
2025-10-07 19:35 ` David Hildenbrand
@ 2025-10-08 6:05 ` Sumanth Korikkar
2025-10-08 8:02 ` David Hildenbrand
0 siblings, 1 reply; 20+ messages in thread
From: Sumanth Korikkar @ 2025-10-08 6:05 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, linux-mm, LKML, linux-s390, Gerald Schaefer,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev
> Care to share an example output? I only have a s390x VM with 2 CPUs and no
> way to configure/deconfigure.
lscpu -e
CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2 ONLINE CONFIGURED POLARIZATION ADDRESS
0 0 0 0 0 0 0:0:0 yes yes vert-medium 0
1 0 0 0 0 0 1:1:1 yes yes vert-medium 1
2 0 0 0 0 1 2:2:2 yes yes vert-low 2
3 0 0 0 0 1 3:3:3 yes yes vert-low 3
# chcpu -d 2-3
CPU 2 disabled
CPU 3 disabled
# chcpu -g 2
CPU 2 deconfigured
# chcpu -c 2
CPU 2 configured
# chcpu -e 2-3
CPU 2 enabled
CPU 3 enabled
> > chmem changes would look like:
> > chmem -c 128M -m 1 : configure memory with memmap-on-memory enabled
> > chmem -g 128M : deconfigure memory
>
> I wonder if the above two are really required. I would expect most/all users
> to simply keep using -e / -d.
>
> Sure, there might be some corner cases, but I would assume most people to
> not want to care about memmap-on-memory with the new model.
I believe this remains very beneficial for customers in the following
scenario:
1) Initial memory layout:
4 GB configured online
512 GB standby
If memory_hotplug.memmap_on_memory=Y is set in the kernel command line:
Suppose the user requires more memory and onlines 256 GB. With
memmap-on-memory enabled, this likely succeeds by default.
Later, the user needs 256 GB of contiguous physical memory across memory
blocks. Then, the user can still configure those memory blocks with
memmap-on-memory disabled and online them.
2) If the administrator forgets to configure
memory_hotplug.memmap_on_memory=Y, the following steps can be taken:
Rescue from OOM situations: configure with memmap-on-memory enabled, online it.
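To put rough numbers on this scenario, the size of the struct page metadata can be estimated as below. This is a sketch under stated assumptions: 4 KiB pages, a 64-byte struct page (typical for 64-bit kernels but configuration dependent), and sizes read as binary units; `memmap_mib` is a hypothetical helper name.

```shell
# Estimate the struct page metadata (memmap) needed for a given amount
# of memory. Input and output are both in MiB.
# Assumptions: 4 KiB pages, 64 bytes per struct page.
memmap_mib() {
    mem_mib=$1
    page_size=4096
    struct_page_size=64
    pages=$((mem_mib * 1024 * 1024 / page_size))
    echo $((pages * struct_page_size / 1024 / 1024))
}

memmap_mib 128               # 2    (one 128 MiB memory block)
memmap_mib $((256 * 1024))   # 4096 (256 GiB of hotplugged memory)
```

Under these assumptions, onlining 256 GiB without memmap-on-memory needs about 4 GiB of metadata, which is the entire online memory in the layout above.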
Thank you,
Sumanth
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/4] Support dynamic (de)configuration of memory
2025-10-08 6:05 ` Sumanth Korikkar
@ 2025-10-08 8:02 ` David Hildenbrand
2025-10-08 9:12 ` Heiko Carstens
2025-10-08 9:13 ` Sumanth Korikkar
0 siblings, 2 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-10-08 8:02 UTC (permalink / raw)
To: Sumanth Korikkar
Cc: Andrew Morton, linux-mm, LKML, linux-s390, Gerald Schaefer,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev
On 08.10.25 08:05, Sumanth Korikkar wrote:
>> Care to share an example output? I only have a s390x VM with 2 CPUs and no
>> way to configure/deconfigure.
>
> lscpu -e
> CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2 ONLINE CONFIGURED POLARIZATION ADDRESS
> 0 0 0 0 0 0 0:0:0 yes yes vert-medium 0
> 1 0 0 0 0 0 1:1:1 yes yes vert-medium 1
> 2 0 0 0 0 1 2:2:2 yes yes vert-low 2
> 3 0 0 0 0 1 3:3:3 yes yes vert-low 3
>
> # chcpu -d 2-3
> CPU 2 disabled
> CPU 3 disabled
> # chcpu -g 2
> CPU 2 deconfigured
> # chcpu -c 2
> CPU 2 configured
> # chcpu -e 2-3
> CPU 2 enabled
> CPU 3 enabled
Makes sense, thanks!
>
>>> chmem changes would look like:
>>> chmem -c 128M -m 1 : configure memory with memmap-on-memory enabled
>>> chmem -g 128M : deconfigure memory
>>
>> I wonder if the above two are really required. I would expect most/all users
>> to simply keep using -e / -d.
>>
>> Sure, there might be some corner cases, but I would assume most people to
>> not want to care about memmap-on-memory with the new model.
>
> I believe this remains very beneficial for customers in the following
> scenario:
>
> 1) Initial memory layout:
> 4 GB configured online
> 512 GB standby
>
> If memory_hotplug.memmap_on_memory=Y is set in the kernel command line:
> Suppose user requires more memory and onlines 256 GB. With memmap-on-memory
> enabled, this likely succeeds by default.
>
> Later, the user needs 256 GB of contiguous physical memory across memory
> blocks. Then, the user can still configure those memory blocks with
> memmap-on-memory disabled and online it.
>
> 2) If the administrator forgets to configure
> memory_hotplug.memmap_on_memory=Y, the following steps can be taken:
> Rescue from OOM situations: configure with memmap-on-memory enabled, online it.
That's my point: I don't consider either very likely to be used by
actual admins.
I guess in (1) it really only is a problem with very big memory blocks.
Assuming a memory block is just 128 MiB (or even 1 GiB), you can
add+online them individually. Once you succeeded with the first one
(very likely), the other ones will follow.
Sure, if you are so low on memory that you cannot even add a single
memory block, then memmap-on-memory makes sense.
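The incremental approach described above can be sketched as a loop that configures and onlines one block before touching the next. This is an illustration only: `add_online_incrementally` and `SYSFS_ROOT` are hypothetical names, and the `config` attribute is the interface proposed in this series.

```shell
# SYSFS_ROOT is a hypothetical override for trying the sketch against a
# scratch directory; on a real system it would stay empty.
SYSFS_ROOT=${SYSFS_ROOT:-}

# Configure and online blocks $1..$2 one at a time, stopping at the
# first failure. Each block that comes online adds free memory that can
# back the memmap of the next one.
add_online_incrementally() {
    n=$1
    while [ "$n" -le "$2" ]; do
        echo 1 > "$SYSFS_ROOT/sys/firmware/memory/memory$n/config" || return 1
        echo 1 > "$SYSFS_ROOT/sys/devices/system/memory/memory$n/online" || return 1
        n=$((n + 1))
    done
}
```

With 128 MiB blocks, only the first step has to find memmap space in the memory that is already online, which is why the first add is the critical one.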
But note that memmap-on-memory was added to handle hotplug of large
chunks of memory (large DIMM/NVDIMM, large CXL device) in one go,
without the chance to add+online individual memory blocks incrementally.
That's also the reason why I didn't care so far to implement
memmap-on-memory support for virtio-mem: as we add+online individual
(small) memory blocks, the implementation effort for supporting
memmap_on_memory was so far not warranted.
(it's a bit trickier for virtio-mem to implement :) )
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/4] Support dynamic (de)configuration of memory
2025-10-08 8:02 ` David Hildenbrand
@ 2025-10-08 9:12 ` Heiko Carstens
2025-10-08 9:43 ` David Hildenbrand
2025-10-08 9:13 ` Sumanth Korikkar
1 sibling, 1 reply; 20+ messages in thread
From: Heiko Carstens @ 2025-10-08 9:12 UTC (permalink / raw)
To: David Hildenbrand
Cc: Sumanth Korikkar, Andrew Morton, linux-mm, LKML, linux-s390,
Gerald Schaefer, Vasily Gorbik, Alexander Gordeev
On Wed, Oct 08, 2025 at 10:02:26AM +0200, David Hildenbrand wrote:
> On 08.10.25 08:05, Sumanth Korikkar wrote:
> > > > chmem changes would look like:
> > > > chmem -c 128M -m 1 : configure memory with memmap-on-memory enabled
> > > > chmem -g 128M : deconfigure memory
> > >
> > > I wonder if the above two are really required. I would expect most/all users
> > > to simply keep using -e / -d.
> > >
> > > Sure, there might be some corner cases, but I would assume most people to
> > > not want to care about memmap-on-memory with the new model.
...
> > 2) If the administrator forgets to configure
> > memory_hotplug.memmap_on_memory=Y, the following steps can be taken:
> > Rescue from OOM situations: configure with memmap-on-memory enabled, online it.
>
> That's my point: I don't consider either very likely to be used by actual
> admins.
But does it really hurt to add those options? If they are really needed,
then all of a sudden admins would have to deal with the
architecture-specific sysfs layout, so the very rare emergency case
becomes even more complicated.
Given that these tools exist so that people don't have to deal with
such details, I'm much in favor of adding those options.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/4] Support dynamic (de)configuration of memory
2025-10-08 9:12 ` Heiko Carstens
@ 2025-10-08 9:43 ` David Hildenbrand
0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-10-08 9:43 UTC (permalink / raw)
To: Heiko Carstens
Cc: Sumanth Korikkar, Andrew Morton, linux-mm, LKML, linux-s390,
Gerald Schaefer, Vasily Gorbik, Alexander Gordeev
On 08.10.25 11:12, Heiko Carstens wrote:
> On Wed, Oct 08, 2025 at 10:02:26AM +0200, David Hildenbrand wrote:
>> On 08.10.25 08:05, Sumanth Korikkar wrote:
>>>>> chmem changes would look like:
>>>>> chmem -c 128M -m 1 : configure memory with memmap-on-memory enabled
>>>>> chmem -g 128M : deconfigure memory
>>>>
>>>> I wonder if the above two are really required. I would expect most/all users
>>>> to simply keep using -e / -d.
>>>>
>>>> Sure, there might be some corner cases, but I would assume most people to
>>>> not want to care about memmap-on-memory with the new model.
>
> ...
>
>>> 2) If the administrator forgets to configure
>>> memory_hotplug.memmap_on_memory=Y, the following steps can be taken:
>>> Rescue from OOM situations: configure with memmap-on-memory enabled, online it.
>>
>> That's my point: I don't consider either very likely to be used by actual
>> admins.
>
> But does it really hurt to add those options?
Oh, I don't think so.
I was just a bit surprised to see it in the first version of this,
because it felt to me like this is something to be added later on top
quite easily/cleanly.
In particular, patch #2 would get a lot lighter also in terms of
documentation.
So no strong opinion about adding it, but maybe we can just split it
into a separate patch and focus patch #2 on the real magic?
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/4] Support dynamic (de)configuration of memory
2025-10-08 8:02 ` David Hildenbrand
2025-10-08 9:12 ` Heiko Carstens
@ 2025-10-08 9:13 ` Sumanth Korikkar
2025-10-08 9:33 ` David Hildenbrand
1 sibling, 1 reply; 20+ messages in thread
From: Sumanth Korikkar @ 2025-10-08 9:13 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, linux-mm, LKML, linux-s390, Gerald Schaefer,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev
> > > I wonder if the above two are really required. I would expect most/all users
> > > to simply keep using -e / -d.
> > >
> > > Sure, there might be some corner cases, but I would assume most people to
> > > not want to care about memmap-on-memory with the new model.
> >
> > I believe this remains very beneficial for customers in the following
> > scenario:
> >
> > 1) Initial memory layout:
> > 4 GB configured online
> > 512 GB standby
> >
> > If memory_hotplug.memmap_on_memory=Y is set in the kernel command line:
> > Suppose user requires more memory and onlines 256 GB. With memmap-on-memory
> > enabled, this likely succeeds by default.
> >
> > Later, the user needs 256 GB of contiguous physical memory across memory
> > blocks. Then, the user can still configure those memory blocks with
> > memmap-on-memory disabled and online it.
> >
> > 2) If the administrator forgets to configure
> > memory_hotplug.memmap_on_memory=Y, the following steps can be taken:
> > Rescue from OOM situations: configure with memmap-on-memory enabled, online it.
>
> That's my point: I don't consider either very likely to be used by actual
> admins.
>
> I guess in (1) it really only is a problem with very big memory blocks.
> Assuming a memory block is just 128 MiB (or even 1 GiB), you can add+online
> them individually. Once you succeeded with the first one (very likely), the
> other ones will follow.
>
> Sure, if you are so low on memory that you cannot even a single memory
> block, then memmap-on-memory makes sense.
>
> But note that memmap-on-memory was added to handle hotplug of large chunks
> of memory (large DIMM/NVDIMM, large CXL device) in one go, without the
> chance to add+online individual memory blocks incrementally.
Interesting. Thanks David.
Heiko suggested that the memory increment size could also be up to
64GB. In that case, it might be useful.
https://lore.kernel.org/all/20250521142149.11483C95-hca@linux.ibm.com/
> That's also the reason why I didn't care so far to implement
> memmap-on-memory support for virito-mem: as we add+online individual (small)
> emmory blocks, the implementation effort for supporting memmap_on_memory was
> so far not warranted.
>
> (it's a bit trickier for virtio-mem to implement :) )
>
> --
> Cheers
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 0/4] Support dynamic (de)configuration of memory
2025-10-08 9:13 ` Sumanth Korikkar
@ 2025-10-08 9:33 ` David Hildenbrand
0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-10-08 9:33 UTC (permalink / raw)
To: Sumanth Korikkar
Cc: Andrew Morton, linux-mm, LKML, linux-s390, Gerald Schaefer,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev
On 08.10.25 11:13, Sumanth Korikkar wrote:
>>>> I wonder if the above two are really required. I would expect most/all users
>>>> to simply keep using -e / -d.
>>>>
>>>> Sure, there might be some corner cases, but I would assume most people to
>>>> not want to care about memmap-on-memory with the new model.
>>>
>>> I believe this remains very beneficial for customers in the following
>>> scenario:
>>>
>>> 1) Initial memory layout:
>>> 4 GB configured online
>>> 512 GB standby
>>>
>>> If memory_hotplug.memmap_on_memory=Y is set in the kernel command line:
>>> Suppose user requires more memory and onlines 256 GB. With memmap-on-memory
>>> enabled, this likely succeeds by default.
>>>
>>> Later, the user needs 256 GB of contiguous physical memory across memory
>>> blocks. Then, the user can still configure those memory blocks with
>>> memmap-on-memory disabled and online it.
>>>
>>> 2) If the administrator forgets to configure
>>> memory_hotplug.memmap_on_memory=Y, the following steps can be taken:
>>> Rescue from OOM situations: configure with memmap-on-memory enabled, online it.
>>
>> That's my point: I don't consider either very likely to be used by actual
>> admins.
>>
>> I guess in (1) it really only is a problem with very big memory blocks.
>> Assuming a memory block is just 128 MiB (or even 1 GiB), you can add+online
>> them individually. Once you succeeded with the first one (very likely), the
>> other ones will follow.
>>
>> Sure, if you are so low on memory that you cannot even a single memory
>> block, then memmap-on-memory makes sense.
>>
>> But note that memmap-on-memory was added to handle hotplug of large chunks
>> of memory (large DIMM/NVDIMM, large CXL device) in one go, without the
>> chance to add+online individual memory blocks incrementally.
>
> Interesting. Thanks David.
>
> Heiko suggested that memory increment size could also be upto
> 64GB. In that case, it might be useful.
Yeah, rings a bell. But that would not be the 4GiB scenario you shared :)
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 20+ messages in thread