* [PATCH 0/2] mm: fix oom-killer not being invoked when demotion is enabled
@ 2025-12-08 9:40 Akinobu Mita
2025-12-08 9:40 ` [PATCH 1/2] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes Akinobu Mita
2025-12-08 9:40 ` [PATCH 2/2] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier Akinobu Mita
0 siblings, 2 replies; 7+ messages in thread
From: Akinobu Mita @ 2025-12-08 9:40 UTC (permalink / raw)
To: akinobu.mita
Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb
On systems with multiple memory-tiers consisting of DRAM and CXL memory,
the OOM killer is not invoked properly.
This patchset includes a fix for the issue and an enhancement that makes
the issue reproducible with numa emulation.
Link: https://lore.kernel.org/lkml/20251022135735.246203-1-akinobu.mita@gmail.com/
Akinobu Mita (2):
  mm: memory-tiers, numa_emu: enable to create memory tiers using fake
    numa nodes
  mm/vmscan: don't demote if there is not enough free memory in the
    lower memory tier

 mm/numa_emulation.c | 26 ++++++++++++++++++++++++++
 mm/vmscan.c         | 15 ++++++++++++++-
 2 files changed, 40 insertions(+), 1 deletion(-)
--
2.43.0

* [PATCH 1/2] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes
2025-12-08 9:40 [PATCH 0/2] mm: fix oom-killer not being invoked when demotion is enabled Akinobu Mita
@ 2025-12-08 9:40 ` Akinobu Mita
2025-12-16 20:24 ` Andrew Morton
2025-12-08 9:40 ` [PATCH 2/2] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier Akinobu Mita
1 sibling, 1 reply; 7+ messages in thread
From: Akinobu Mita @ 2025-12-08 9:40 UTC (permalink / raw)
To: akinobu.mita
Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb

This makes it possible to create memory tiers using fake numa nodes
generated by numa emulation.

The new "numa_emulation.adistance" kernel parameter allows you to set the
abstract distance for each NUMA node.

For example, if the system is booted with the parameters
"numa=fake=2 numa_emulation.adistance=576,704", it will configure memory
tiers with node0 having the default DRAM adistance value and node1 having
a lower adistance value.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
---
 mm/numa_emulation.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/mm/numa_emulation.c b/mm/numa_emulation.c
index 703c8fa05048..a4266da21344 100644
--- a/mm/numa_emulation.c
+++ b/mm/numa_emulation.c
@@ -6,6 +6,9 @@
 #include <linux/errno.h>
 #include <linux/topology.h>
 #include <linux/memblock.h>
+#include <linux/memory-tiers.h>
+#include <linux/module.h>
+#include <linux/node.h>
 #include <linux/numa_memblks.h>
 #include <asm/numa.h>
 #include <acpi/acpi_numa.h>
@@ -344,6 +347,27 @@ static int __init setup_emu2phys_nid(int *dfl_phys_nid)
 	return max_emu_nid;
 }

+static int adistance[MAX_NUMNODES];
+module_param_array(adistance, int, NULL, 0400);
+MODULE_PARM_DESC(adistance, "Abstract distance values for each NUMA node");
+
+static int emu_calculate_adistance(struct notifier_block *self,
+				   unsigned long nid, void *data)
+{
+	if (adistance[nid]) {
+		int *adist = data;
+
+		*adist = adistance[nid];
+		return NOTIFY_STOP;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block emu_adist_nb = {
+	.notifier_call = emu_calculate_adistance,
+	.priority = INT_MIN,
+};
+
 /**
  * numa_emulation - Emulate NUMA nodes
  * @numa_meminfo: NUMA configuration to massage
@@ -532,6 +556,8 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 		}
 	}

+	register_mt_adistance_algorithm(&emu_adist_nb);
+
 	/* free the copied physical distance table */
 	memblock_free(phys_dist, phys_size);
 	return;
--
2.43.0
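
The override semantics of the callback in this patch can be sketched in
user space. The following is only a minimal model under stated
assumptions, not the kernel code: MODEL_MAX_NODES, enum model_result,
model_adistance and model_calculate_adistance() are hypothetical names
invented for illustration, and the bounds check is an addition that the
patch itself does not contain. It shows that a node with a non-zero table
entry gets that entry as its abstract distance, while a node whose entry
is 0 (not given on the command line) is left for other adistance
algorithms to handle.

```c
#include <stddef.h>

#define MODEL_MAX_NODES 4	/* hypothetical stand-in for MAX_NUMNODES */

/* Return codes of this model only; the patch uses NOTIFY_STOP / NOTIFY_OK. */
enum model_result {
	MODEL_FALL_THROUGH,	/* plays the role of NOTIFY_OK */
	MODEL_OVERRIDE,		/* plays the role of NOTIFY_STOP */
};

/* As if booted with numa_emulation.adistance=576,704; other entries stay 0. */
static const int model_adistance[MODEL_MAX_NODES] = { 576, 704 };

/*
 * Write the configured abstract distance for @nid to @adist if one was
 * given. Leave @adist untouched otherwise. The range check on @nid is an
 * assumption of this model and is not part of the patch.
 */
static enum model_result model_calculate_adistance(size_t nid, int *adist)
{
	if (nid < MODEL_MAX_NODES && model_adistance[nid]) {
		*adist = model_adistance[nid];
		return MODEL_OVERRIDE;
	}
	return MODEL_FALL_THROUGH;
}
```

In this model, nodes 0 and 1 are overridden with 576 and 704, and any
other node falls through with its previous value intact.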

* Re: [PATCH 1/2] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes
2025-12-08 9:40 ` [PATCH 1/2] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes Akinobu Mita
@ 2025-12-16 20:24 ` Andrew Morton
2025-12-17 13:55 ` Akinobu Mita
0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2025-12-16 20:24 UTC (permalink / raw)
To: Akinobu Mita
Cc: linux-cxl, linux-kernel, linux-mm, axelrasmussen, yuanchu,
weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb

On Mon, 8 Dec 2025 18:40:27 +0900 Akinobu Mita <akinobu.mita@gmail.com> wrote:

> This makes it possible to create memory tiers using fake numa nodes
> generated by numa emulation.
>
> The new "numa_emulation.adistance" kernel parameter allows you to set the
> abstract distance for each NUMA node.
>
> For example, if the system is booted with the parameters
> "numa=fake=2 numa_emulation.adistance=576,704", it will configure memory
> tiers with node0 having the default DRAM adistance value and node1 having
> a lower adistance value.

Confusing. I'd have thought that this commandline would give node0 a
distance of 576 and node1 a distance of 704? But the text talks about
some third "default" distance, of unknown value.

Can we please clear all this up?

Also, we have little documentation for this stuff.
fake-numa-for-cpusets.rst and kernel-parameters.txt. Can you please
find somewhere appropriate to document this new user-facing feature?
Maybe a new Documentation file?

* Re: [PATCH 1/2] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes
2025-12-16 20:24 ` Andrew Morton
@ 2025-12-17 13:55 ` Akinobu Mita
0 siblings, 0 replies; 7+ messages in thread
From: Akinobu Mita @ 2025-12-17 13:55 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-cxl, linux-kernel, linux-mm, axelrasmussen, yuanchu,
weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb

On Wed, Dec 17, 2025 at 5:24 Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 8 Dec 2025 18:40:27 +0900 Akinobu Mita <akinobu.mita@gmail.com> wrote:
>
> > This makes it possible to create memory tiers using fake numa nodes
> > generated by numa emulation.
> >
> > The new "numa_emulation.adistance" kernel parameter allows you to set the
> > abstract distance for each NUMA node.
> >
> > For example, if the system is booted with the parameters
> > "numa=fake=2 numa_emulation.adistance=576,704", it will configure memory
> > tiers with node0 having the default DRAM adistance value and node1 having
> > a lower adistance value.
>
> Confusing. I'd have thought that this commandline would give node0 a
> distance of 576 and node1 a distance of 704? But the text talks about
> some third "default" distance, of unknown value.
>
> Can we please clear all this up?

The DRAM abstract distance is defined by MEMTIER_ADISTANCE_DRAM in
linux/memory-tiers.h and has a value of 576.

Each memory tier covers an abstract distance chunk size of 128, so nodes
with abstract distances between 512 and 639 are classified into the DRAM
tier.

Here, the abstract distances of node0 and node1 are set to 576 and 704,
respectively, so they are classified into different tiers.

> Also, we have little documentation for this stuff.
> fake-numa-for-cpusets.rst and kernel-parameters.txt. Can you please
> find somewhere appropriate to document this new user-facing feature?
> Maybe a new Documentation file?

Looks good.
I'll create a new Documentation/mm/numa_emulation.rst and document at
least this new parameter.
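
The tier classification described in this reply can be checked with a
few lines of arithmetic. This is a sketch under stated assumptions: the
macro values mirror what the reply states (a chunk size of 128 and a DRAM
abstract distance of 576), MEMTIER_CHUNK_BITS is assumed to be 7 to match
that chunk size, and adistance_to_chunk() and same_tier() are
hypothetical helpers written for this illustration, not kernel functions.

```c
/* Values as described in the reply; MEMTIER_CHUNK_BITS = 7 is assumed. */
#define MEMTIER_CHUNK_BITS	7
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)	/* 128 */
#define MEMTIER_ADISTANCE_DRAM	\
	((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))	/* 576 */

/* Hypothetical helper: index of the 128-wide chunk containing @adistance. */
static int adistance_to_chunk(int adistance)
{
	return adistance >> MEMTIER_CHUNK_BITS;
}

/* Hypothetical helper: nonzero if both distances land in the same chunk. */
static int same_tier(int a, int b)
{
	return adistance_to_chunk(a) == adistance_to_chunk(b);
}
```

With these definitions, 576 falls in the chunk covering 512 to 639 and
704 falls in the next chunk covering 640 to 767, so
"numa_emulation.adistance=576,704" places node0 and node1 in different
tiers, whereas two values inside the same 128-wide chunk would not.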

* [PATCH 2/2] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
2025-12-08 9:40 [PATCH 0/2] mm: fix oom-killer not being invoked when demotion is enabled Akinobu Mita
2025-12-08 9:40 ` [PATCH 1/2] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes Akinobu Mita
@ 2025-12-08 9:40 ` Akinobu Mita
2025-12-17 12:55 ` Vlastimil Babka
1 sibling, 1 reply; 7+ messages in thread
From: Akinobu Mita @ 2025-12-08 9:40 UTC (permalink / raw)
To: akinobu.mita
Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb

On systems with multiple memory-tiers consisting of DRAM and CXL memory,
the OOM killer is not invoked properly.

Here's the command to reproduce:

$ sudo swapoff -a
$ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
    --memrate-rd-mbs 1 --memrate-wr-mbs 1

The memory usage is the number of workers specified with the --memrate
option multiplied by the buffer size specified with the --memrate-bytes
option, so please adjust it so that it exceeds the total size of the
installed DRAM and CXL memory.

If swap is disabled, you can usually expect the OOM killer to terminate
the stress-ng process when memory usage approaches the installed memory
size.

However, if multiple memory-tiers exist (multiple
/sys/devices/virtual/memory_tiering/memory_tier<N> directories exist),
and /sys/kernel/mm/numa/demotion_enabled is true and
/sys/kernel/mm/lru_gen/min_ttl_ms is 0, the OOM killer will not be invoked
and the system will become inoperable.

This issue can be reproduced using NUMA emulation even on systems with
only DRAM. You can create two-fake memory-tiers by booting a single-node
system with "numa=fake=2 numa_emulation.adistance=576,704" kernel
parameters.

The reason for this issue is that if the target node for allocation has
an underlying memory tier, it is always assumed that it can be reclaimed
via demotion.

So this change avoids this issue by not attempting to demote if the
demoting node has less free memory than the minimum watermark.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
---
 mm/vmscan.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fddd168a9737..f4748f258294 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -356,7 +356,20 @@ static bool can_demote(int nid, struct scan_control *sc,
 		return false;

 	/* If demotion node isn't in the cgroup's mems_allowed, fall back */
-	return mem_cgroup_node_allowed(memcg, demotion_nid);
+	if (mem_cgroup_node_allowed(memcg, demotion_nid)) {
+		int z;
+		struct zone *zone;
+		struct pglist_data *pgdat = NODE_DATA(demotion_nid);
+		unsigned int highest_zoneidx = sc ? sc->reclaim_idx : MAX_NR_ZONES - 1;
+		int order = sc ? sc->order : 0;
+
+		for_each_managed_zone_pgdat(zone, pgdat, z, highest_zoneidx) {
+			if (zone_watermark_ok(zone, order, min_wmark_pages(zone),
+					      highest_zoneidx, 0))
+				return true;
+		}
+	}
+	return false;
 }

 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
--
2.43.0
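
The decision that this patch adds to can_demote() can be modelled in
user space as follows. This is a deliberately simplified sketch, not the
kernel logic: struct model_zone and model_can_demote() are hypothetical
names, and the comparison "free pages above the min watermark" stands in
for zone_watermark_ok(), ignoring the allocation order, lowmem reserves
and allocation flags that the real check takes into account.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the per-zone state the check reads. */
struct model_zone {
	unsigned long free_pages;	/* free pages in the zone */
	unsigned long min_wmark;	/* min watermark of the zone */
};

/*
 * Demotion is considered possible only if the demotion target node is
 * allowed for the cgroup AND at least one of its zones still has free
 * memory above the min watermark. Before the patch, only the first
 * condition was checked.
 */
static bool model_can_demote(bool node_allowed,
			     const struct model_zone *zones, size_t nr_zones)
{
	size_t i;

	if (!node_allowed)
		return false;

	for (i = 0; i < nr_zones; i++) {
		if (zones[i].free_pages > zones[i].min_wmark)
			return true;
	}
	return false;
}
```

In this model, a lower tier whose zones are all at or below their min
watermark no longer counts as a place to demote to, so reclaim stops
assuming that memory can always be freed by demotion.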

* Re: [PATCH 2/2] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
2025-12-08 9:40 ` [PATCH 2/2] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier Akinobu Mita
@ 2025-12-17 12:55 ` Vlastimil Babka
2025-12-17 14:30 ` Akinobu Mita
0 siblings, 1 reply; 7+ messages in thread
From: Vlastimil Babka @ 2025-12-17 12:55 UTC (permalink / raw)
To: Akinobu Mita
Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
lorenzo.stoakes, Liam.Howlett, rppt, surenb

On 12/8/25 10:40, Akinobu Mita wrote:
> On systems with multiple memory-tiers consisting of DRAM and CXL memory,
> the OOM killer is not invoked properly.
>
> Here's the command to reproduce:
>
> $ sudo swapoff -a
> $ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
>     --memrate-rd-mbs 1 --memrate-wr-mbs 1
>
> The memory usage is the number of workers specified with the --memrate
> option multiplied by the buffer size specified with the --memrate-bytes
> option, so please adjust it so that it exceeds the total size of the
> installed DRAM and CXL memory.
>
> If swap is disabled, you can usually expect the OOM killer to terminate
> the stress-ng process when memory usage approaches the installed memory
> size.
>
> However, if multiple memory-tiers exist (multiple
> /sys/devices/virtual/memory_tiering/memory_tier<N> directories exist),
> and /sys/kernel/mm/numa/demotion_enabled is true and
> /sys/kernel/mm/lru_gen/min_ttl_ms is 0, the OOM killer will not be invoked

Does this mean only mglru has the problem, or !mglru too?
Also is min_ttl_ms = 0 a sensible setting? What happens without it?
If !mglru doesn't have this problem, how does the fix affect it?

> and the system will become inoperable.
>
> This issue can be reproduced using NUMA emulation even on systems with
> only DRAM. You can create two-fake memory-tiers by booting a single-node
> system with "numa=fake=2 numa_emulation.adistance=576,704" kernel
> parameters.
>
> The reason for this issue is that if the target node for allocation has
> an underlying memory tier, it is always assumed that it can be reclaimed
> via demotion.
>
> So this change avoids this issue by not attempting to demote if the
> demoting node has less free memory than the minimum watermark.
>
> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
> ---
>  mm/vmscan.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fddd168a9737..f4748f258294 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -356,7 +356,20 @@ static bool can_demote(int nid, struct scan_control *sc,
>  		return false;
>
>  	/* If demotion node isn't in the cgroup's mems_allowed, fall back */
> -	return mem_cgroup_node_allowed(memcg, demotion_nid);
> +	if (mem_cgroup_node_allowed(memcg, demotion_nid)) {
> +		int z;
> +		struct zone *zone;
> +		struct pglist_data *pgdat = NODE_DATA(demotion_nid);
> +		unsigned int highest_zoneidx = sc ? sc->reclaim_idx : MAX_NR_ZONES - 1;
> +		int order = sc ? sc->order : 0;
> +
> +		for_each_managed_zone_pgdat(zone, pgdat, z, highest_zoneidx) {
> +			if (zone_watermark_ok(zone, order, min_wmark_pages(zone),
> +					      highest_zoneidx, 0))
> +				return true;
> +		}
> +	}
> +	return false;
>  }
>
>  static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,

* Re: [PATCH 2/2] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
2025-12-17 12:55 ` Vlastimil Babka
@ 2025-12-17 14:30 ` Akinobu Mita
0 siblings, 0 replies; 7+ messages in thread
From: Akinobu Mita @ 2025-12-17 14:30 UTC (permalink / raw)
To: Vlastimil Babka
Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
lorenzo.stoakes, Liam.Howlett, rppt, surenb

On Wed, Dec 17, 2025 at 21:55 Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/8/25 10:40, Akinobu Mita wrote:
> > On systems with multiple memory-tiers consisting of DRAM and CXL memory,
> > the OOM killer is not invoked properly.
> >
> > Here's the command to reproduce:
> >
> > $ sudo swapoff -a
> > $ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
> >     --memrate-rd-mbs 1 --memrate-wr-mbs 1
> >
> > The memory usage is the number of workers specified with the --memrate
> > option multiplied by the buffer size specified with the --memrate-bytes
> > option, so please adjust it so that it exceeds the total size of the
> > installed DRAM and CXL memory.
> >
> > If swap is disabled, you can usually expect the OOM killer to terminate
> > the stress-ng process when memory usage approaches the installed memory
> > size.
> >
> > However, if multiple memory-tiers exist (multiple
> > /sys/devices/virtual/memory_tiering/memory_tier<N> directories exist),
> > and /sys/kernel/mm/numa/demotion_enabled is true and
> > /sys/kernel/mm/lru_gen/min_ttl_ms is 0, the OOM killer will not be invoked
>
> Does this mean only mglru has the problem, or !mglru too?

!mglru has the problem, too.

> Also is min_ttl_ms = 0 a sensible setting? What happens without it?

Setting min_ttl_ms = 1 or a longer value will cause kswapd to trigger the
oom-killer. However, when the stress-ng devshm test was run in addition
to the above test to increase the load, the system remained inoperable
with only the oom-killer by kswapd.

> If !mglru doesn't have this problem, how does the fix affect it?

With this patch, the oom-killer will be triggered directly from memory
allocations, regardless of mglru or not.