oom-killer not invoked on systems with multiple memory-tiers

* oom-killer not invoked on systems with multiple memory-tiers
@ 2025-10-22 13:57 Akinobu Mita
  2025-10-28  7:27 ` Akinobu Mita
  2025-10-28 19:54 ` Shakeel Butt
  0 siblings, 2 replies; 5+ messages in thread
From: Akinobu Mita @ 2025-10-22 13:57 UTC
  To: linux-kernel; +Cc: linux-cxl, linux-mm, akinobu.mita

On systems with multiple memory-tiers consisting of DRAM and CXL memory,
the OOM killer is not invoked properly.

Here's the command to reproduce:

$ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
    --memrate-rd-mbs 1 --memrate-wr-mbs 1

Total memory usage is the number of workers (--memrate) multiplied by the
per-worker buffer size (--memrate-bytes). Adjust these two values so that
the total exceeds the combined size of the installed DRAM and CXL memory.
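
As a rough sizing aid (not part of the original report), the arithmetic can
be sketched in shell. The helper names below are made up for illustration.
The sketch assumes that the "G" suffix means GiB and that MemTotal in
/proc/meminfo covers both DRAM and CXL memory once the CXL memory is onlined
as system RAM.

```sh
#!/bin/sh
# Total buffer size in GiB: number of --memrate workers times --memrate-bytes.
memrate_footprint_gib() {
    workers=$1
    gib_per_worker=$2
    echo $((workers * gib_per_worker))
}

# MemTotal from a meminfo-style file (default /proc/meminfo), rounded down to GiB.
# The file argument is a hypothetical convenience for testing.
mem_total_gib() {
    meminfo=${1:-/proc/meminfo}
    awk '$1 == "MemTotal:" { print int($2 / 1024 / 1024) }' "$meminfo"
}

# Succeeds when the planned footprint exceeds the installed memory.
# Arguments: workers, GiB per worker, optional meminfo-style file.
footprint_exceeds_memory() {
    [ "$(memrate_footprint_gib "$1" "$2")" -gt "$(mem_total_gib "${3:-/proc/meminfo}")" ]
}
```

With the values from the command above, `memrate_footprint_gib 20 10` prints
200, so the command only overcommits a machine with less than 200 GiB of
DRAM plus CXL memory.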

If swap is disabled, you can usually expect the OOM killer to terminate
the stress-ng process when memory usage approaches the installed memory size.

However, if multiple memory-tiers exist (multiple
/sys/devices/virtual/memory_tiering/memory_tier<N> directories exist),
and /sys/kernel/mm/numa/demotion_enabled is true and
/sys/kernel/mm/lru_gen/min_ttl_ms is 0, the OOM killer will not be invoked
and the system will become inoperable.

If /sys/kernel/mm/numa/demotion_enabled is false, or if demotion_enabled
is true but /sys/kernel/mm/lru_gen/min_ttl_ms is set to a non-zero value
such as 1000, the OOM killer will be invoked properly.
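
To record which of these cases a given machine is in, the relevant settings
can be read back from sysfs. The following is a minimal sketch. The function
name show_tiering_config and its optional root-prefix argument are
hypothetical and exist only so the helper can be pointed at a test tree. It
also prints lru_gen/enabled, which shows whether MGLRU is active.

```sh
#!/bin/sh
# Print the settings that the report identifies as relevant.
# $1 is an optional prefix for the sysfs root (empty by default).
show_tiering_config() {
    root=${1:-}
    tiers=0
    # Count the memory_tier<N> directories; an unmatched glob is not a directory.
    for d in "$root"/sys/devices/virtual/memory_tiering/memory_tier*; do
        if [ -d "$d" ]; then
            tiers=$((tiers + 1))
        fi
    done
    echo "memory_tiers=$tiers"
    for f in numa/demotion_enabled lru_gen/enabled lru_gen/min_ttl_ms; do
        path=$root/sys/kernel/mm/$f
        if [ -r "$path" ]; then
            echo "$f=$(cat "$path")"
        else
            echo "$f=unavailable"
        fi
    done
}
```

On a system matching the failing case, the output should show at least two
memory tiers, numa/demotion_enabled=true, and lru_gen/min_ttl_ms=0.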

This issue can be reproduced using NUMA emulation even on systems with
only DRAM. However, to configure multiple memory-tiers using fake nodes,
you must apply the attached patch.

You can create two fake memory-tiers by booting a single-node system with
the following boot options:

numa=fake=2
numa_emulation.default_dram=1,0
numa_emulation.read_latency=100,1000
numa_emulation.write_latency=100,1000
numa_emulation.read_bandwidth=100000,10000
numa_emulation.write_bandwidth=100000,10000

---
 mm/numa_emulation.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/mm/numa_emulation.c b/mm/numa_emulation.c
index 703c8fa05048..b1c283b99038 100644
--- a/mm/numa_emulation.c
+++ b/mm/numa_emulation.c
@@ -6,6 +6,9 @@
 #include <linux/errno.h>
 #include <linux/topology.h>
 #include <linux/memblock.h>
+#include <linux/memory-tiers.h>
+#include <linux/module.h>
+#include <linux/node.h>
 #include <linux/numa_memblks.h>
 #include <asm/numa.h>
 #include <acpi/acpi_numa.h>
@@ -344,6 +347,46 @@ static int __init setup_emu2phys_nid(int *dfl_phys_nid)
 	return max_emu_nid;
 }

+static bool default_dram[MAX_NUMNODES];
+module_param_array(default_dram, bool, NULL, 0400);
+
+static unsigned int read_latency[MAX_NUMNODES];
+module_param_array(read_latency, uint, NULL, 0400);
+
+static unsigned int write_latency[MAX_NUMNODES];
+module_param_array(write_latency, uint, NULL, 0400);
+
+static unsigned int read_bandwidth[MAX_NUMNODES];
+module_param_array(read_bandwidth, uint, NULL, 0400);
+
+static unsigned int write_bandwidth[MAX_NUMNODES];
+module_param_array(write_bandwidth, uint, NULL, 0400);
+
+static int emu_calculate_adistance(struct notifier_block *self,
+				unsigned long nid, void *data)
+{
+	struct access_coordinate perf = {
+		.read_bandwidth = read_bandwidth[nid],
+		.write_bandwidth = write_bandwidth[nid],
+		.read_latency = read_latency[nid],
+		.write_latency = write_latency[nid],
+	};
+	int *adist = data;
+
+	if (default_dram[nid])
+		mt_set_default_dram_perf(nid, &perf, "numa_emu");
+
+	if (mt_perf_to_adistance(&perf, adist))
+		return NOTIFY_OK;
+
+	return NOTIFY_STOP;
+}
+
+static struct notifier_block emu_adist_nb = {
+	.notifier_call = emu_calculate_adistance,
+	.priority = INT_MIN,
+};
+
 /**
  * numa_emulation - Emulate NUMA nodes
  * @numa_meminfo: NUMA configuration to massage
@@ -532,6 +575,8 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 		}
 	}

+	register_mt_adistance_algorithm(&emu_adist_nb);
+
 	/* free the copied physical distance table */
 	memblock_free(phys_dist, phys_size);
 	return;
-- 
2.43.0

* Re: oom-killer not invoked on systems with multiple memory-tiers
  2025-10-22 13:57 oom-killer not invoked on systems with multiple memory-tiers Akinobu Mita
@ 2025-10-28  7:27 ` Akinobu Mita
  2025-10-28 19:54 ` Shakeel Butt
  1 sibling, 0 replies; 5+ messages in thread
From: Akinobu Mita @ 2025-10-28  7:27 UTC
  To: akinobu.mita
  Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
	weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	lorenzo.stoakes

lru_gen_age_node() checks the reclaimability of a node using
lruvec_is_reclaimable() when min_ttl_ms is non-zero, but if min_ttl_ms
is zero, it unconditionally determines that the node is reclaimable.

I think we should still call lruvec_is_reclaimable() to check
reclaimability even if min_ttl_ms is zero.
Otherwise, we'll miss the opportunity to trigger an OOM kill in
lru_gen_age_node(), which will cause the problem I reported.
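
Until a kernel fix is available, this suggests a workaround for systems
running MGLRU: set min_ttl_ms to a non-zero value so that the
lruvec_is_reclaimable() check is performed. A minimal sketch follows. The
function name set_min_ttl_ms and its optional root-prefix argument are
hypothetical, and writing the real knob requires root.

```sh
#!/bin/sh
# Write a non-zero value to lru_gen/min_ttl_ms.
# $1: value in milliseconds; $2: optional sysfs root prefix (for testing only).
set_min_ttl_ms() {
    value=$1
    knob=${2:-}/sys/kernel/mm/lru_gen/min_ttl_ms
    if [ "$value" -le 0 ]; then
        echo "min_ttl_ms must be non-zero for this workaround" >&2
        return 1
    fi
    if [ ! -w "$knob" ]; then
        echo "cannot write $knob (MGLRU not available, or not root?)" >&2
        return 1
    fi
    echo "$value" > "$knob"
}
```

For example, `set_min_ttl_ms 1000` corresponds to the non-zero setting
mentioned in the original report.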

https://lore.kernel.org/linux-mm/20251022135735.246203-1-akinobu.mita@gmail.com/

The attached patch makes the change described above and works around
the problem.

---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c922bad2b8fd..e1a88ed4e0b0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4148,7 +4148,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
 	unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
-	bool reclaimable = !min_ttl;
+	bool reclaimable = false;

 	VM_WARN_ON_ONCE(!current_is_kswapd());

-- 
2.43.0

* Re: oom-killer not invoked on systems with multiple memory-tiers
  2025-10-22 13:57 oom-killer not invoked on systems with multiple memory-tiers Akinobu Mita
  2025-10-28  7:27 ` Akinobu Mita
@ 2025-10-28 19:54 ` Shakeel Butt
  2025-10-30  0:51   ` Akinobu Mita
  1 sibling, 1 reply; 5+ messages in thread
From: Shakeel Butt @ 2025-10-28 19:54 UTC
  To: Akinobu Mita; +Cc: linux-kernel, linux-cxl, linux-mm

Hi Akinobu,

On Wed, Oct 22, 2025 at 10:57:35PM +0900, Akinobu Mita wrote:
> On systems with multiple memory-tiers consisting of DRAM and CXL memory,
> the OOM killer is not invoked properly.

Thanks for the report. Can you try to reproduce this with the traditional
LRU, i.e. without MGLRU? I just want to see whether this is an MGLRU-only
issue or a more general one.


* Re: oom-killer not invoked on systems with multiple memory-tiers
  2025-10-28 19:54 ` Shakeel Butt
@ 2025-10-30  0:51   ` Akinobu Mita
  2025-11-14  8:43     ` Akinobu Mita
  0 siblings, 1 reply; 5+ messages in thread
From: Akinobu Mita @ 2025-10-30  0:51 UTC
  To: Shakeel Butt; +Cc: linux-kernel, linux-cxl, linux-mm

On Wed, Oct 29, 2025 at 4:54 Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> Hi Akinobu,
>
> Thanks for the report. Can you try to repro this with traditional LRU
> i.e. not MGLRU? I just want to see if this is MGLRU only issue or more
> general.

The problem was also reproduced with traditional LRU.
Here is a summary of the conditions for reproduction:

* Reproducible (the system becomes inoperable)
  * demotion_enabled=true, lru_gen/enabled=0x0007, min_ttl_ms=0
  * demotion_enabled=true, lru_gen/enabled=0x0000
* Not reproducible (the OOM killer is invoked)
  * demotion_enabled=false
  * demotion_enabled=true, lru_gen/enabled=0x0007, min_ttl_ms=1
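
The results above can be condensed into a small predicate. This only
restates the observed outcomes and is not a description of kernel logic. The
function name expect_hang and its argument order are invented for
illustration. Only the lru_gen/enabled values 0x0000 and 0x0007 were tested,
and min_ttl_ms is assumed to be irrelevant when MGLRU is disabled.

```sh
#!/bin/sh
# Succeeds (returns 0) if the reported results predict a hang rather than
# an OOM kill.
# Arguments: demotion_enabled (true|false), lru_gen/enabled, min_ttl_ms.
expect_hang() {
    demotion=$1
    lru_gen=$2
    min_ttl_ms=$3

    # Without demotion the OOM killer was always invoked.
    [ "$demotion" = "true" ] || return 1

    # Traditional LRU (MGLRU disabled): the hang was reproduced.
    case "$lru_gen" in
        0x0000|0) return 0 ;;
    esac

    # MGLRU enabled: the hang was reproduced only with min_ttl_ms=0.
    [ "$min_ttl_ms" -eq 0 ]
}
```

For example, `expect_hang true 0x0007 0` succeeds, while
`expect_hang true 0x0007 1` and `expect_hang false 0x0007 0` fail.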


* Re: oom-killer not invoked on systems with multiple memory-tiers
  2025-10-30  0:51   ` Akinobu Mita
@ 2025-11-14  8:43     ` Akinobu Mita
  0 siblings, 0 replies; 5+ messages in thread
From: Akinobu Mita @ 2025-11-14  8:43 UTC
  To: akinobu.mita
  Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
	weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb

Regarding the issue I reported, where processes that exhaust memory are not
terminated by the OOM killer when demotion_enabled is true: this can be avoided
by not demoting when the target node does not have enough free memory.

Specifically, the patch below changes can_demote(), which decides whether
demotion is possible, to also check whether the target node has enough free
memory (for example, whether free pages are above the min watermark).
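
For experimenting from user space, a rough analogue of this check can be
built from /proc/zoneinfo. The sketch below is an approximation only: it
compares each zone's free page count with its min watermark and ignores the
allocation order, the lowmem reserves, and the highest_zoneidx limit that
zone_watermark_ok() takes into account. The function name
node_above_min_wmark and its optional file argument are hypothetical.

```sh
#!/bin/sh
# Succeeds if at least one managed zone of node $1 has more free pages
# than its min watermark. $2 optionally names a zoneinfo-style file.
node_above_min_wmark() {
    awk -v nid="$1" '
        # A "Node <N>, zone <name>" line starts a new zone section.
        $1 == "Node" { cur = $2; sub(/,/, "", cur); free_pages = 0; min_wmark = 0 }
        cur != nid   { next }
        $1 == "pages" && $2 == "free" { free_pages = $3 }
        $1 == "min"                   { min_wmark = $2 }
        # "managed" follows "free" and "min" within a zone section.
        $1 == "managed" && $2 > 0 && free_pages > min_wmark { ok = 1 }
        # Exit status 0 when a suitable zone was found, 1 otherwise.
        END { exit !ok }
    ' "${2:-/proc/zoneinfo}"
}
```

Running `node_above_min_wmark 1` on a two-node system would then indicate
whether node 1 still looks like a viable demotion target under this
simplified rule.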

Does this change make sense?

---
 mm/vmscan.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8890f4b58673..63d751e54b08 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -356,7 +356,20 @@ static bool can_demote(int nid, struct scan_control *sc,
 		return false;
 
 	/* If demotion node isn't in the cgroup's mems_allowed, fall back */
-	return mem_cgroup_node_allowed(memcg, demotion_nid);
+	if (mem_cgroup_node_allowed(memcg, demotion_nid)) {
+		int z;
+		struct zone *zone;
+		struct pglist_data *pgdat = NODE_DATA(demotion_nid);
+		unsigned int highest_zoneidx = sc ? sc->reclaim_idx : MAX_NR_ZONES - 1;
+		int order = sc ? sc->order : 0;
+
+		for_each_managed_zone_pgdat(zone, pgdat, z, highest_zoneidx) {
+			if (zone_watermark_ok(zone, order, min_wmark_pages(zone),
+						highest_zoneidx, 0))
+				return true;
+		}
+	}
+	return false;
 }
 
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
-- 
2.43.0


