* [PATCH v4 1/3] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes
2026-01-13 8:14 [PATCH v4 0/3] mm: fix oom-killer not being invoked when demotion is enabled Akinobu Mita
@ 2026-01-13 8:14 ` Akinobu Mita
2026-01-13 9:30 ` Pratyush Brahma
2026-01-13 8:14 ` [PATCH v4 2/3] mm: numa_emu: add document for NUMA emulation Akinobu Mita
2026-01-13 8:14 ` [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier Akinobu Mita
2 siblings, 1 reply; 6+ messages in thread
From: Akinobu Mita @ 2026-01-13 8:14 UTC (permalink / raw)
To: akinobu.mita
Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, bingjiao, jonathan.cameron, pratyush.brahma
This makes it possible to create memory tiers using fake NUMA nodes
generated by NUMA emulation.
The "numa_emulation.adistance=" kernel cmdline option allows you to set
the abstract distance for each NUMA node.
For example, you can create two fake nodes, each in a different memory
tier by booting with "numa=fake=2 numa_emulation.adistance=576,704".
Here, the abstract distances of node0 and node1 are set to 576 and 704,
respectively.
Each memory tier covers an abstract distance chunk size of 128. Thus,
nodes with abstract distances between 512 and 639 are classified into the
same memory tier, and nodes with abstract distances between 640 and 767
are classified into the next slower memory tier.
The abstract distance of fake nodes not specified in the parameter will be
the default DRAM abstract distance of 576.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
---
v4:
- remove unnecessary include of linux/node.h, suggested by Jonathan Cameron
- include linux/notifier.h for the notifier_block, suggested by Jonathan Cameron
- fix typo in abstract distance value in the commit log
v2:
- fix the explanation about cmdline parameter in the commit log
mm/numa_emulation.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/mm/numa_emulation.c b/mm/numa_emulation.c
index 703c8fa05048..2d05e61570cc 100644
--- a/mm/numa_emulation.c
+++ b/mm/numa_emulation.c
@@ -6,6 +6,9 @@
#include <linux/errno.h>
#include <linux/topology.h>
#include <linux/memblock.h>
+#include <linux/memory-tiers.h>
+#include <linux/module.h>
+#include <linux/notifier.h>
#include <linux/numa_memblks.h>
#include <asm/numa.h>
#include <acpi/acpi_numa.h>
@@ -344,6 +347,27 @@ static int __init setup_emu2phys_nid(int *dfl_phys_nid)
return max_emu_nid;
}
+static int adistance[MAX_NUMNODES];
+module_param_array(adistance, int, NULL, 0400);
+MODULE_PARM_DESC(adistance, "Abstract distance values for each NUMA node");
+
+static int emu_calculate_adistance(struct notifier_block *self,
+				   unsigned long nid, void *data)
+{
+	if (adistance[nid]) {
+		int *adist = data;
+
+		*adist = adistance[nid];
+		return NOTIFY_STOP;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block emu_adist_nb = {
+	.notifier_call = emu_calculate_adistance,
+	.priority = INT_MIN,
+};
+
/**
* numa_emulation - Emulate NUMA nodes
* @numa_meminfo: NUMA configuration to massage
@@ -532,6 +556,8 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
}
}
+ register_mt_adistance_algorithm(&emu_adist_nb);
+
/* free the copied physical distance table */
memblock_free(phys_dist, phys_size);
return;
--
2.43.0
* Re: [PATCH v4 1/3] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes
2026-01-13 8:14 ` [PATCH v4 1/3] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes Akinobu Mita
@ 2026-01-13 9:30 ` Pratyush Brahma
0 siblings, 0 replies; 6+ messages in thread
From: Pratyush Brahma @ 2026-01-13 9:30 UTC (permalink / raw)
To: akinobu.mita
Cc: Liam.Howlett, akpm, apopple, axelrasmussen, bingjiao, byungchul,
david, gourry, hannes, jonathan.cameron, joshua.hahnjy,
linux-cxl, linux-kernel, linux-mm, lorenzo.stoakes,
matthew.brost, mhocko, pratyush.brahma, rakie.kim, rppt,
shakeel.butt, surenb, vbabka, weixugc, ying.huang, yuanchu,
zhengqi.arch, ziy
Reviewed-by: Pratyush Brahma <pratyush.brahma@oss.qualcomm.com>
* [PATCH v4 2/3] mm: numa_emu: add document for NUMA emulation
2026-01-13 8:14 [PATCH v4 0/3] mm: fix oom-killer not being invoked when demotion is enabled Akinobu Mita
2026-01-13 8:14 ` [PATCH v4 1/3] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes Akinobu Mita
@ 2026-01-13 8:14 ` Akinobu Mita
2026-01-13 9:32 ` Pratyush Brahma
2026-01-13 8:14 ` [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier Akinobu Mita
2 siblings, 1 reply; 6+ messages in thread
From: Akinobu Mita @ 2026-01-13 8:14 UTC (permalink / raw)
To: akinobu.mita
Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, bingjiao, jonathan.cameron, pratyush.brahma
Add a document with a brief explanation of NUMA emulation and how to use
the newly added "numa_emulation.adistance=" kernel cmdline parameter.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
---
v4:
- fix typo in abstract distance value
- add information about supported architectures in numa_emulation.rst
v2:
- added in v2
Documentation/mm/index.rst | 1 +
Documentation/mm/numa_emulation.rst | 31 +++++++++++++++++++++++++++++
2 files changed, 32 insertions(+)
create mode 100644 Documentation/mm/numa_emulation.rst
diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
index 7aa2a8886908..7d628edd6a17 100644
--- a/Documentation/mm/index.rst
+++ b/Documentation/mm/index.rst
@@ -24,6 +24,7 @@ see the :doc:`admin guide <../admin-guide/mm/index>`.
page_cache
shmfs
oom
+ numa_emulation
Unsorted Documentation
======================
diff --git a/Documentation/mm/numa_emulation.rst b/Documentation/mm/numa_emulation.rst
new file mode 100644
index 000000000000..81f15ea68022
--- /dev/null
+++ b/Documentation/mm/numa_emulation.rst
@@ -0,0 +1,31 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+NUMA emulation
+==============
+
+NUMA emulation is currently supported on x86, arm64, and RISC-V architectures.
+If CONFIG_NUMA_EMU is enabled, you can create fake NUMA nodes with
+the ``numa=fake=`` kernel cmdline option.
+See Documentation/admin-guide/kernel-parameters.txt and
+Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst for more information.
+
+
+Multiple Memory Tiers Creation
+==============================
+
+The "numa_emulation.adistance=" kernel cmdline option allows you to set
+the abstract distance for each NUMA node.
+
+For example, you can create two fake nodes, each in a different memory
+tier by booting with "numa=fake=2 numa_emulation.adistance=576,704".
+Here, the abstract distances of node0 and node1 are set to 576 and 704,
+respectively.
+
+Each memory tier covers an abstract distance chunk size of 128. Thus,
+nodes with abstract distances between 512 and 639 are classified into the
+same memory tier, and nodes with abstract distances between 640 and 767
+are classified into the next slower memory tier.
+
+The abstract distance of fake nodes not specified in the parameter will be
+the default DRAM abstract distance of 576.
--
2.43.0
* Re: [PATCH v4 2/3] mm: numa_emu: add document for NUMA emulation
2026-01-13 8:14 ` [PATCH v4 2/3] mm: numa_emu: add document for NUMA emulation Akinobu Mita
@ 2026-01-13 9:32 ` Pratyush Brahma
0 siblings, 0 replies; 6+ messages in thread
From: Pratyush Brahma @ 2026-01-13 9:32 UTC (permalink / raw)
To: akinobu.mita
Cc: Liam.Howlett, akpm, apopple, axelrasmussen, bingjiao, byungchul,
david, gourry, hannes, jonathan.cameron, joshua.hahnjy,
linux-cxl, linux-kernel, linux-mm, lorenzo.stoakes,
matthew.brost, mhocko, pratyush.brahma, rakie.kim, rppt,
shakeel.butt, surenb, vbabka, weixugc, ying.huang, yuanchu,
zhengqi.arch, ziy
Reviewed-by: Pratyush Brahma <pratyush.brahma@oss.qualcomm.com>
* [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
2026-01-13 8:14 [PATCH v4 0/3] mm: fix oom-killer not being invoked when demotion is enabled Akinobu Mita
2026-01-13 8:14 ` [PATCH v4 1/3] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes Akinobu Mita
2026-01-13 8:14 ` [PATCH v4 2/3] mm: numa_emu: add document for NUMA emulation Akinobu Mita
@ 2026-01-13 8:14 ` Akinobu Mita
2 siblings, 0 replies; 6+ messages in thread
From: Akinobu Mita @ 2026-01-13 8:14 UTC (permalink / raw)
To: akinobu.mita
Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, ziy,
matthew.brost, joshua.hahnjy, rakie.kim, byungchul, gourry,
ying.huang, apopple, bingjiao, jonathan.cameron, pratyush.brahma
On systems with multiple memory-tiers consisting of DRAM and CXL memory,
the OOM killer is not invoked properly.
Here's the command to reproduce:
$ sudo swapoff -a
$ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
--memrate-rd-mbs 1 --memrate-wr-mbs 1
The total memory usage is the number of workers specified with the
--memrate option multiplied by the buffer size specified with the
--memrate-bytes option, so adjust these values until the product exceeds
the total size of the installed DRAM and CXL memory.
If swap is disabled, you can usually expect the OOM killer to terminate
the stress-ng process when memory usage approaches the installed memory
size.
However, if multiple memory-tiers exist (multiple
/sys/devices/virtual/memory_tiering/memory_tier<N> directories exist) and
/sys/kernel/mm/numa/demotion_enabled is true, the OOM killer will not be
invoked and the system will become inoperable, regardless of whether MGLRU
is enabled or not.
This issue can be reproduced using NUMA emulation even on systems with
only DRAM. You can create two fake memory tiers by booting a single-node
system with the "numa=fake=2 numa_emulation.adistance=576,704" kernel
parameters.
This happens because memory allocations do not directly trigger the
oom-killer: the kernel assumes that if the target node has an underlying
memory tier, its memory can always be reclaimed by demotion.
This change avoids the issue by not attempting to demote if the
underlying node has less free memory than the minimum watermark, so that
the oom-killer is triggered directly from memory allocations.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
---
v4:
- add a code comment in can_demote()
v3:
- rebase to linux-next (next-20260108), where demotion target has changed
from node id to node mask.
v2:
- describe reproducibility with !mglru in the commit log
- removed unnecessary consideration for scan control when checking demotion_nid watermarks
mm/vmscan.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f35afc5093dc..f980c533c778 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -358,7 +358,22 @@ static bool can_demote(int nid, struct scan_control *sc,
 	/* Filter out nodes that are not in cgroup's mems_allowed. */
 	mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
-	return !nodes_empty(allowed_mask);
+	if (nodes_empty(allowed_mask))
+		return false;
+
+	/* Check if there is enough free memory in the demotion target */
+	for_each_node_mask(nid, allowed_mask) {
+		int z;
+		struct zone *zone;
+		struct pglist_data *pgdat = NODE_DATA(nid);
+
+		for_each_managed_zone_pgdat(zone, pgdat, z, MAX_NR_ZONES - 1) {
+			if (zone_watermark_ok(zone, 0, min_wmark_pages(zone),
+						ZONE_MOVABLE, 0))
+				return true;
+		}
+	}
+	return false;
}
static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
--
2.43.0