linux-mm.kvack.org archive mirror
* [PATCH v3 0/3] mm: fix oom-killer not being invoked when demotion is enabled
@ 2026-01-08 10:15 Akinobu Mita
  2026-01-08 10:15 ` [PATCH v3 1/3] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes Akinobu Mita
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Akinobu Mita @ 2026-01-08 10:15 UTC (permalink / raw)
  To: akinobu.mita
  Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
	weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, bingjiao

On systems with multiple memory-tiers consisting of DRAM and CXL memory,
the OOM killer is not invoked properly.

This patchset includes a fix for the issue and an enhancement that makes
the issue reproducible with numa emulation.

v3:
- rebase to linux-next (next-20260108), where demotion target has changed
  from node id to node mask.

v2:
- fix the explanation about cmdline parameter in the commit log
- add document for NUMA emulation
- describe reproducibility with !mglru in the commit log
- remove unnecessary consideration of scan control when checking demotion_nid watermarks

v1:
- https://lore.kernel.org/linux-mm/20251208094028.214949-1-akinobu.mita@gmail.com/T/

original report:
- https://lore.kernel.org/lkml/20251022135735.246203-1-akinobu.mita@gmail.com/T

Akinobu Mita (3):
  mm: memory-tiers, numa_emu: enable to create memory tiers using fake
    numa nodes
  mm: numa_emu: add document for NUMA emulation
  mm/vmscan: don't demote if there is not enough free memory in the
    lower memory tier

 Documentation/mm/index.rst          |  1 +
 Documentation/mm/numa_emulation.rst | 30 +++++++++++++++++++++++++++++
 mm/numa_emulation.c                 | 26 +++++++++++++++++++++++++
 mm/vmscan.c                         | 16 ++++++++++++++-
 4 files changed, 72 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/mm/numa_emulation.rst

-- 
2.43.0




* [PATCH v3 1/3] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes
  2026-01-08 10:15 [PATCH v3 0/3] mm: fix oom-killer not being invoked when demotion is enabled Akinobu Mita
@ 2026-01-08 10:15 ` Akinobu Mita
  2026-01-08 15:47   ` Jonathan Cameron
  2026-01-09  4:43   ` Pratyush Brahma
  2026-01-08 10:15 ` [PATCH v3 2/3] mm: numa_emu: add document for NUMA emulation Akinobu Mita
  2026-01-08 10:15 ` [PATCH v3 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier Akinobu Mita
  2 siblings, 2 replies; 19+ messages in thread
From: Akinobu Mita @ 2026-01-08 10:15 UTC (permalink / raw)
  To: akinobu.mita
  Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
	weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, bingjiao

This makes it possible to create memory tiers using fake numa nodes
generated by numa emulation.

The "numa_emulation.adistance=" kernel cmdline option allows you to set
the abstract distance for each NUMA node.

For example, you can create two fake nodes, each in a different memory
tier by booting with "numa=fake=2 numa_emulation.adistance=576,704".
Here, the abstract distances of node0 and node1 are set to 576 and 706,
respectively.

Each memory tier covers an abstract distance chunk size of 128. Thus,
nodes with abstract distances between 512 and 639 are classified into the
same memory tier, and nodes with abstract distances between 640 and 767
are classified into the next slower memory tier.

The abstract distance of fake nodes not specified in the parameter will
be the default DRAM abstract distance of 576.
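
After boot, the resulting tiers can be checked via sysfs.  With the
example above, you should see something like the following (the tier
directory names assume the usual adistance/128 derived IDs, so the
exact names are illustrative):

  $ grep . /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
  /sys/devices/virtual/memory_tiering/memory_tier4/nodelist:0
  /sys/devices/virtual/memory_tiering/memory_tier5/nodelist:1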

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
---
v2:
- fix the explanation about cmdline parameter in the commit log

 mm/numa_emulation.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/mm/numa_emulation.c b/mm/numa_emulation.c
index 703c8fa05048..a4266da21344 100644
--- a/mm/numa_emulation.c
+++ b/mm/numa_emulation.c
@@ -6,6 +6,9 @@
 #include <linux/errno.h>
 #include <linux/topology.h>
 #include <linux/memblock.h>
+#include <linux/memory-tiers.h>
+#include <linux/module.h>
+#include <linux/node.h>
 #include <linux/numa_memblks.h>
 #include <asm/numa.h>
 #include <acpi/acpi_numa.h>
@@ -344,6 +347,27 @@ static int __init setup_emu2phys_nid(int *dfl_phys_nid)
 	return max_emu_nid;
 }
 
+static int adistance[MAX_NUMNODES];
+module_param_array(adistance, int, NULL, 0400);
+MODULE_PARM_DESC(adistance, "Abstract distance values for each NUMA node");
+
+static int emu_calculate_adistance(struct notifier_block *self,
+				unsigned long nid, void *data)
+{
+	if (adistance[nid]) {
+		int *adist = data;
+
+		*adist = adistance[nid];
+		return NOTIFY_STOP;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block emu_adist_nb = {
+	.notifier_call = emu_calculate_adistance,
+	.priority = INT_MIN,
+};
+
 /**
  * numa_emulation - Emulate NUMA nodes
  * @numa_meminfo: NUMA configuration to massage
@@ -532,6 +556,8 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 		}
 	}
 
+	register_mt_adistance_algorithm(&emu_adist_nb);
+
 	/* free the copied physical distance table */
 	memblock_free(phys_dist, phys_size);
 	return;
-- 
2.43.0




* [PATCH v3 2/3] mm: numa_emu: add document for NUMA emulation
  2026-01-08 10:15 [PATCH v3 0/3] mm: fix oom-killer not being invoked when demotion is enabled Akinobu Mita
  2026-01-08 10:15 ` [PATCH v3 1/3] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes Akinobu Mita
@ 2026-01-08 10:15 ` Akinobu Mita
  2026-01-08 15:51   ` Jonathan Cameron
  2026-01-08 10:15 ` [PATCH v3 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier Akinobu Mita
  2 siblings, 1 reply; 19+ messages in thread
From: Akinobu Mita @ 2026-01-08 10:15 UTC (permalink / raw)
  To: akinobu.mita
  Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
	weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, bingjiao

Add a document with a brief explanation of numa emulation and how to use
the newly added "numa_emulation.adistance=" kernel cmdline parameter.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
---
v2:
- added in v2

 Documentation/mm/index.rst          |  1 +
 Documentation/mm/numa_emulation.rst | 30 +++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)
 create mode 100644 Documentation/mm/numa_emulation.rst

diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
index 7aa2a8886908..7d628edd6a17 100644
--- a/Documentation/mm/index.rst
+++ b/Documentation/mm/index.rst
@@ -24,6 +24,7 @@ see the :doc:`admin guide <../admin-guide/mm/index>`.
    page_cache
    shmfs
    oom
+   numa_emulation
 
 Unsorted Documentation
 ======================
diff --git a/Documentation/mm/numa_emulation.rst b/Documentation/mm/numa_emulation.rst
new file mode 100644
index 000000000000..dce9f607c031
--- /dev/null
+++ b/Documentation/mm/numa_emulation.rst
@@ -0,0 +1,30 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+NUMA emulation
+==============
+
+If CONFIG_NUMA_EMU is enabled, you can create fake NUMA nodes with
+the ``numa=fake=`` kernel cmdline option.
+See Documentation/admin-guide/kernel-parameters.txt and
+Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst for more information.
+
+
+Multiple Memory Tiers Creation
+==============================
+
+The "numa_emulation.adistance=" kernel cmdline option allows you to set
+the abstract distance for each NUMA node.
+
+For example, you can create two fake nodes, each in a different memory
+tier by booting with "numa=fake=2 numa_emulation.adistance=576,704".
+Here, the abstract distances of node0 and node1 are set to 576 and 706,
+respectively.
+
+Each memory tier covers an abstract distance chunk size of 128. Thus,
+nodes with abstract distances between 512 and 639 are classified into the
+same memory tier, and nodes with abstract distances between 640 and 767
+are classified into the next slower memory tier.
+
+The abstract distance of fake nodes not specified in the parameter will
+be the default DRAM abstract distance of 576.
-- 
2.43.0




* [PATCH v3 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
  2026-01-08 10:15 [PATCH v3 0/3] mm: fix oom-killer not being invoked when demotion is enabled Akinobu Mita
  2026-01-08 10:15 ` [PATCH v3 1/3] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes Akinobu Mita
  2026-01-08 10:15 ` [PATCH v3 2/3] mm: numa_emu: add document for NUMA emulation Akinobu Mita
@ 2026-01-08 10:15 ` Akinobu Mita
  2026-01-08 19:00   ` Andrew Morton
  2026-01-09 16:07   ` Gregory Price
  2 siblings, 2 replies; 19+ messages in thread
From: Akinobu Mita @ 2026-01-08 10:15 UTC (permalink / raw)
  To: akinobu.mita
  Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
	weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, bingjiao

On systems with multiple memory-tiers consisting of DRAM and CXL memory,
the OOM killer is not invoked properly.

Here's the command to reproduce:

$ sudo swapoff -a
$ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
    --memrate-rd-mbs 1 --memrate-wr-mbs 1

The memory usage is the number of workers specified with the --memrate
option multiplied by the buffer size specified with the --memrate-bytes
option, so please adjust it so that it exceeds the total size of the
installed DRAM and CXL memory.

If swap is disabled, you can usually expect the OOM killer to terminate
the stress-ng process when memory usage approaches the installed memory
size.

However, if multiple memory-tiers exist (multiple
/sys/devices/virtual/memory_tiering/memory_tier<N> directories exist) and
/sys/kernel/mm/numa/demotion_enabled is true, the OOM killer will not be
invoked and the system will become inoperable, regardless of whether MGLRU
is enabled or not.

This issue can be reproduced using NUMA emulation even on systems with
only DRAM.  You can create two fake memory-tiers by booting a single-node
system with "numa=fake=2 numa_emulation.adistance=576,704" kernel
parameters.

The reason for this issue is that memory allocations do not directly
trigger the oom-killer: reclaim assumes that if the target node has a
lower memory tier beneath it, memory can always be freed by demotion.

So this change avoids the issue by not attempting to demote if the nodes
in the lower tier have less free memory than the minimum watermark; the
oom-killer is then triggered directly from memory allocations.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
---
v3:
- rebase to linux-next (next-20260108), where demotion target has changed
  from node id to node mask.

v2:
- describe reproducibility with !mglru in the commit log
- remove unnecessary consideration of scan control when checking demotion_nid watermarks

 mm/vmscan.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a34cf784e131..9a4b12ef6b53 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -358,7 +358,21 @@ static bool can_demote(int nid, struct scan_control *sc,
 
 	/* Filter out nodes that are not in cgroup's mems_allowed. */
 	mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
-	return !nodes_empty(allowed_mask);
+	if (nodes_empty(allowed_mask))
+		return false;
+
+	for_each_node_mask(nid, allowed_mask) {
+		int z;
+		struct zone *zone;
+		struct pglist_data *pgdat = NODE_DATA(nid);
+
+		for_each_managed_zone_pgdat(zone, pgdat, z, MAX_NR_ZONES - 1) {
+			if (zone_watermark_ok(zone, 0, min_wmark_pages(zone),
+						ZONE_MOVABLE, 0))
+				return true;
+		}
+	}
+	return false;
 }
 
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
-- 
2.43.0




* Re: [PATCH v3 1/3] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes
  2026-01-08 10:15 ` [PATCH v3 1/3] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes Akinobu Mita
@ 2026-01-08 15:47   ` Jonathan Cameron
  2026-01-10  3:47     ` Akinobu Mita
  2026-01-09  4:43   ` Pratyush Brahma
  1 sibling, 1 reply; 19+ messages in thread
From: Jonathan Cameron @ 2026-01-08 15:47 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
	weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, bingjiao

On Thu,  8 Jan 2026 19:15:33 +0900
Akinobu Mita <akinobu.mita@gmail.com> wrote:

> This makes it possible to create memory tiers using fake numa nodes
> generated by numa emulation.
> 
> The "numa_emulation.adistance=" kernel cmdline option allows you to set
> the abstract distance for each NUMA node.
> 
> For example, you can create two fake nodes, each in a different memory
> tier by booting with "numa=fake=2 numa_emulation.adistance=576,704".
> Here, the abstract distances of node0 and node1 are set to 576 and 706,
> respectively.
> 
> Each memory tier covers an abstract distance chunk size of 128. Thus,
> nodes with abstract distances between 512 and 639 are classified into the
> same memory tier, and nodes with abstract distances between 640 and 767
> are classified into the next slower memory tier.
> 
> The abstract distance of fake nodes not specified in the parameter will
> be the default DRAM abstract distance of 576.
> 
> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
> ---
> v2:
> - fix the explanation about cmdline parameter in the commit log
A couple of comments on includes, with those resolved LGTM.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> 
>  mm/numa_emulation.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/mm/numa_emulation.c b/mm/numa_emulation.c
> index 703c8fa05048..a4266da21344 100644
> --- a/mm/numa_emulation.c
> +++ b/mm/numa_emulation.c
> @@ -6,6 +6,9 @@
>  #include <linux/errno.h>
>  #include <linux/topology.h>
>  #include <linux/memblock.h>
> +#include <linux/memory-tiers.h>
> +#include <linux/module.h>
> +#include <linux/node.h>

I can't immediately spot why the new code needs node.h

Should also include
linux/notifier.h for the notifier_block definition.

>  #include <linux/numa_memblks.h>
>  #include <asm/numa.h>
>  #include <acpi/acpi_numa.h>
> @@ -344,6 +347,27 @@ static int __init setup_emu2phys_nid(int *dfl_phys_nid)
>  	return max_emu_nid;
>  }
>  
> +static int adistance[MAX_NUMNODES];
> +module_param_array(adistance, int, NULL, 0400);
> +MODULE_PARM_DESC(adistance, "Abstract distance values for each NUMA node");
> +
> +static int emu_calculate_adistance(struct notifier_block *self,
> +				unsigned long nid, void *data)
> +{
> +	if (adistance[nid]) {
> +		int *adist = data;
> +
> +		*adist = adistance[nid];
> +		return NOTIFY_STOP;
> +	}
> +	return NOTIFY_OK;
> +}
> +
> +static struct notifier_block emu_adist_nb = {
> +	.notifier_call = emu_calculate_adistance,
> +	.priority = INT_MIN,
> +};
> +
>  /**
>   * numa_emulation - Emulate NUMA nodes
>   * @numa_meminfo: NUMA configuration to massage
> @@ -532,6 +556,8 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
>  		}
>  	}
>  
> +	register_mt_adistance_algorithm(&emu_adist_nb);
> +
>  	/* free the copied physical distance table */
>  	memblock_free(phys_dist, phys_size);
>  	return;




* Re: [PATCH v3 2/3] mm: numa_emu: add document for NUMA emulation
  2026-01-08 10:15 ` [PATCH v3 2/3] mm: numa_emu: add document for NUMA emulation Akinobu Mita
@ 2026-01-08 15:51   ` Jonathan Cameron
  0 siblings, 0 replies; 19+ messages in thread
From: Jonathan Cameron @ 2026-01-08 15:51 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
	weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, bingjiao

On Thu,  8 Jan 2026 19:15:34 +0900
Akinobu Mita <akinobu.mita@gmail.com> wrote:

> Add a document with a brief explanation of numa emulation and how to use
> the newly added "numa_emulation.adistance=" kernel cmdline parameter.
> 
> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
I'm not sure on the benefit of a NUMA emulation doc that only really covers
the new stuff in your series.  Feels like a more thorough doc with everything
in one place would be better.  Still, what you have here is good to have so
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> ---
> v2:
> - added in v2
> 
>  Documentation/mm/index.rst          |  1 +
>  Documentation/mm/numa_emulation.rst | 30 +++++++++++++++++++++++++++++
>  2 files changed, 31 insertions(+)
>  create mode 100644 Documentation/mm/numa_emulation.rst
> 
> diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> index 7aa2a8886908..7d628edd6a17 100644
> --- a/Documentation/mm/index.rst
> +++ b/Documentation/mm/index.rst
> @@ -24,6 +24,7 @@ see the :doc:`admin guide <../admin-guide/mm/index>`.
>     page_cache
>     shmfs
>     oom
> +   numa_emulation
>  
>  Unsorted Documentation
>  ======================
> diff --git a/Documentation/mm/numa_emulation.rst b/Documentation/mm/numa_emulation.rst
> new file mode 100644
> index 000000000000..dce9f607c031
> --- /dev/null
> +++ b/Documentation/mm/numa_emulation.rst
> @@ -0,0 +1,30 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==============
> +NUMA emulation
> +==============
> +
> +If CONFIG_NUMA_EMU is enabled, you can create fake NUMA nodes with
> +the ``numa=fake=`` kernel cmdline option.
> +See Documentation/admin-guide/kernel-parameters.txt and
> +Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst for more information.
> +
> +
> +Multiple Memory Tiers Creation
> +==============================
> +
> +The "numa_emulation.adistance=" kernel cmdline option allows you to set
> +the abstract distance for each NUMA node.
> +
> +For example, you can create two fake nodes, each in a different memory
> +tier by booting with "numa=fake=2 numa_emulation.adistance=576,704".
> +Here, the abstract distances of node0 and node1 are set to 576 and 706,
> +respectively.
> +
> +Each memory tier covers an abstract distance chunk size of 128. Thus,
> +nodes with abstract distances between 512 and 639 are classified into the
> +same memory tier, and nodes with abstract distances between 640 and 767
> +are classified into the next slower memory tier.
> +
> +The abstract distance of fake nodes not specified in the parameter will
> +be the default DRAM abstract distance of 576.




* Re: [PATCH v3 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
  2026-01-08 10:15 ` [PATCH v3 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier Akinobu Mita
@ 2026-01-08 19:00   ` Andrew Morton
  2026-01-09 16:07   ` Gregory Price
  1 sibling, 0 replies; 19+ messages in thread
From: Andrew Morton @ 2026-01-08 19:00 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: linux-cxl, linux-kernel, linux-mm, axelrasmussen, yuanchu,
	weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, bingjiao,
	David Rientjes

On Thu,  8 Jan 2026 19:15:35 +0900 Akinobu Mita <akinobu.mita@gmail.com> wrote:

> On systems with multiple memory-tiers consisting of DRAM and CXL memory,
> the OOM killer is not invoked properly.
> 
> Here's the command to reproduce:
> 
> $ sudo swapoff -a
> $ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
>     --memrate-rd-mbs 1 --memrate-wr-mbs 1
> 
> The memory usage is the number of workers specified with the --memrate
> option multiplied by the buffer size specified with the --memrate-bytes
> option, so please adjust it so that it exceeds the total size of the
> installed DRAM and CXL memory.
> 
> If swap is disabled, you can usually expect the OOM killer to terminate
> the stress-ng process when memory usage approaches the installed memory
> size.
> 
> However, if multiple memory-tiers exist (multiple
> /sys/devices/virtual/memory_tiering/memory_tier<N> directories exist) and
> /sys/kernel/mm/numa/demotion_enabled is true, the OOM killer will not be
> invoked and the system will become inoperable, regardless of whether MGLRU
> is enabled or not.
> 
> This issue can be reproduced using NUMA emulation even on systems with
> only DRAM.  You can create two fake memory-tiers by booting a single-node
> system with "numa=fake=2 numa_emulation.adistance=576,704" kernel
> parameters.
> 
> The reason for this issue is that memory allocations do not directly
> trigger the oom-killer: reclaim assumes that if the target node has a
> lower memory tier beneath it, memory can always be freed by demotion.
> 
> So this change avoids the issue by not attempting to demote if the nodes
> in the lower tier have less free memory than the minimum watermark; the
> oom-killer is then triggered directly from memory allocations.
> 

Thanks.

An oom-killer fix which doesn't touch mm/oom_kill.c.  Hopefully
David/Shakeel/Michal can take a look.

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -358,7 +358,21 @@ static bool can_demote(int nid, struct scan_control *sc,
>  
>  	/* Filter out nodes that are not in cgroup's mems_allowed. */
>  	mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
> -	return !nodes_empty(allowed_mask);
> +	if (nodes_empty(allowed_mask))
> +		return false;
> +
> +	for_each_node_mask(nid, allowed_mask) {
> +		int z;
> +		struct zone *zone;
> +		struct pglist_data *pgdat = NODE_DATA(nid);
> +
> +		for_each_managed_zone_pgdat(zone, pgdat, z, MAX_NR_ZONES - 1) {
> +			if (zone_watermark_ok(zone, 0, min_wmark_pages(zone),
> +						ZONE_MOVABLE, 0))
> +				return true;
> +		}
> +	}
> +	return false;
>  }

It would be nice to have a code comment in here to explain to readers
why we're doing this.
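
Perhaps something along these lines (just a sketch; wording up to you):

	/*
	 * If no eligible demotion target node has free memory above its
	 * min watermark, demotion cannot make progress.  Return false so
	 * that reclaim stops relying on demotion and the allocating path
	 * can fall through to the OOM killer instead.
	 */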



* Re: [PATCH v3 1/3] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes
  2026-01-08 10:15 ` [PATCH v3 1/3] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes Akinobu Mita
  2026-01-08 15:47   ` Jonathan Cameron
@ 2026-01-09  4:43   ` Pratyush Brahma
  2026-01-10  4:03     ` Akinobu Mita
  1 sibling, 1 reply; 19+ messages in thread
From: Pratyush Brahma @ 2026-01-09  4:43 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
	weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, bingjiao


On 1/8/2026 3:45 PM, Akinobu Mita wrote:
> This makes it possible to create memory tiers using fake numa nodes
> generated by numa emulation.
>
> The "numa_emulation.adistance=" kernel cmdline option allows you to set
> the abstract distance for each NUMA node.
>
> For example, you can create two fake nodes, each in a different memory
> tier by booting with "numa=fake=2 numa_emulation.adistance=576,704".
> Here, the abstract distances of node0 and node1 are set to 576 and 706,
You mention 704 in the cmdline but then mention 706 in the following text.
Please correct the typo. Btw I am not entirely sure if this example is
required in the commit text here. The Documentation seems to be the right
place for this.
> respectively.
>
> Each memory tier covers an abstract distance chunk size of 128. Thus,
> nodes with abstract distances between 512 and 639 are classified into the
> same memory tier, and nodes with abstract distances between 640 and 767
> are classified into the next slower memory tier.
>
> The abstract distance of fake nodes not specified in the parameter will
> be the default DRAM abstract distance of 576.
>
> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
> ---
> v2:
> - fix the explanation about cmdline parameter in the commit log
>
>   mm/numa_emulation.c | 26 ++++++++++++++++++++++++++
>   1 file changed, 26 insertions(+)
>
> diff --git a/mm/numa_emulation.c b/mm/numa_emulation.c
> index 703c8fa05048..a4266da21344 100644
> --- a/mm/numa_emulation.c
> +++ b/mm/numa_emulation.c
> @@ -6,6 +6,9 @@
>   #include <linux/errno.h>
>   #include <linux/topology.h>
>   #include <linux/memblock.h>
> +#include <linux/memory-tiers.h>
> +#include <linux/module.h>
> +#include <linux/node.h>
>   #include <linux/numa_memblks.h>
>   #include <asm/numa.h>
>   #include <acpi/acpi_numa.h>
> @@ -344,6 +347,27 @@ static int __init setup_emu2phys_nid(int *dfl_phys_nid)
>   	return max_emu_nid;
>   }
>   
> +static int adistance[MAX_NUMNODES];
> +module_param_array(adistance, int, NULL, 0400);
> +MODULE_PARM_DESC(adistance, "Abstract distance values for each NUMA node");
> +
> +static int emu_calculate_adistance(struct notifier_block *self,
> +				unsigned long nid, void *data)
> +{
> +	if (adistance[nid]) {
> +		int *adist = data;
> +
> +		*adist = adistance[nid];
> +		return NOTIFY_STOP;
> +	}
> +	return NOTIFY_OK;
> +}
> +
> +static struct notifier_block emu_adist_nb = {
> +	.notifier_call = emu_calculate_adistance,
> +	.priority = INT_MIN,
> +};
> +
>   /**
>    * numa_emulation - Emulate NUMA nodes
>    * @numa_meminfo: NUMA configuration to massage
> @@ -532,6 +556,8 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
>   		}
>   	}
>   
> +	register_mt_adistance_algorithm(&emu_adist_nb);
> +
>   	/* free the copied physical distance table */
>   	memblock_free(phys_dist, phys_size);
>   	return;
Best Regards
Pratyush



* Re: [PATCH v3 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
  2026-01-08 10:15 ` [PATCH v3 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier Akinobu Mita
  2026-01-08 19:00   ` Andrew Morton
@ 2026-01-09 16:07   ` Gregory Price
  2026-01-10 13:55     ` Akinobu Mita
  1 sibling, 1 reply; 19+ messages in thread
From: Gregory Price @ 2026-01-09 16:07 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
	weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, bingjiao

> +	for_each_node_mask(nid, allowed_mask) {
> +		int z;
> +		struct zone *zone;
> +		struct pglist_data *pgdat = NODE_DATA(nid);
> +
> +		for_each_managed_zone_pgdat(zone, pgdat, z, MAX_NR_ZONES - 1) {
> +			if (zone_watermark_ok(zone, 0, min_wmark_pages(zone),
> +						ZONE_MOVABLE, 0))

Why does this only check zone movable?  

Also, would this also limit pressure-signal to invoke reclaim when
there is still swap space available?  Should demotion not be a pressure
source for triggering harder reclaim?

~Gregory



* Re: [PATCH v3 1/3] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes
  2026-01-08 15:47   ` Jonathan Cameron
@ 2026-01-10  3:47     ` Akinobu Mita
  0 siblings, 0 replies; 19+ messages in thread
From: Akinobu Mita @ 2026-01-10  3:47 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
	weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, bingjiao

On Fri, 9 Jan 2026 at 0:47, Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
>
> On Thu,  8 Jan 2026 19:15:33 +0900
> Akinobu Mita <akinobu.mita@gmail.com> wrote:
>
> > This makes it possible to create memory tiers using fake numa nodes
> > generated by numa emulation.
> >
> > The "numa_emulation.adistance=" kernel cmdline option allows you to set
> > the abstract distance for each NUMA node.
> >
> > For example, you can create two fake nodes, each in a different memory
> > tier by booting with "numa=fake=2 numa_emulation.adistance=576,704".
> > Here, the abstract distances of node0 and node1 are set to 576 and 706,
> > respectively.
> >
> > Each memory tier covers an abstract distance chunk size of 128. Thus,
> > nodes with abstract distances between 512 and 639 are classified into the
> > same memory tier, and nodes with abstract distances between 640 and 767
> > are classified into the next slower memory tier.
> >
> > The abstract distance of fake nodes not specified in the parameter will
> > be the default DRAM abstract distance of 576.
> >
> > Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
> > ---
> > v2:
> > - fix the explanation about cmdline parameter in the commit log
> A couple of comments on includes, with those resolved LGTM.
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>
> >
> >  mm/numa_emulation.c | 26 ++++++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> >
> > diff --git a/mm/numa_emulation.c b/mm/numa_emulation.c
> > index 703c8fa05048..a4266da21344 100644
> > --- a/mm/numa_emulation.c
> > +++ b/mm/numa_emulation.c
> > @@ -6,6 +6,9 @@
> >  #include <linux/errno.h>
> >  #include <linux/topology.h>
> >  #include <linux/memblock.h>
> > +#include <linux/memory-tiers.h>
> > +#include <linux/module.h>
> > +#include <linux/node.h>
>
> I can't immediately spot why the new code needs node.h

The first version used the access_coordinate struct, but that is no longer
used, so including linux/node.h is no longer necessary.

>
> Should also include
> linux/notifier.h for the notifier_block definition.

I will fix these includes in the next version.



* Re: [PATCH v3 1/3] mm: memory-tiers, numa_emu: enable to create memory tiers using fake numa nodes
  2026-01-09  4:43   ` Pratyush Brahma
@ 2026-01-10  4:03     ` Akinobu Mita
  0 siblings, 0 replies; 19+ messages in thread
From: Akinobu Mita @ 2026-01-10  4:03 UTC (permalink / raw)
  To: Pratyush Brahma
  Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
	weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, bingjiao

On Fri, 9 Jan 2026 at 13:44, Pratyush Brahma <pratyush.brahma@oss.qualcomm.com> wrote:
>
>
> On 1/8/2026 3:45 PM, Akinobu Mita wrote:
> > This makes it possible to create memory tiers using fake numa nodes
> > generated by numa emulation.
> >
> > The "numa_emulation.adistance=" kernel cmdline option allows you to set
> > the abstract distance for each NUMA node.
> >
> > For example, you can create two fake nodes, each in a different memory
> > tier by booting with "numa=fake=2 numa_emulation.adistance=576,704".
> > Here, the abstract distances of node0 and node1 are set to 576 and 706,
> You mention 704 in the cmdline but then mention 706 in the following text.

Thanks for pointing that out. It was an obvious typo.

> Please correct the typo. Btw I am not entirely sure if this example is
> required
> in the commit text here. The Documentation seems to the right place for
> this.

This example is also included in the next 2/3 documentation patch.
The same typo exists there, so I'll fix it.



* Re: [PATCH v3 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
  2026-01-09 16:07   ` Gregory Price
@ 2026-01-10 13:55     ` Akinobu Mita
  2026-01-27 20:24       ` Gregory Price
  0 siblings, 1 reply; 19+ messages in thread
From: Akinobu Mita @ 2026-01-10 13:55 UTC (permalink / raw)
  To: gourry
  Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
	weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, bingjiao

On Sat, 10 Jan 2026 at 1:08, Gregory Price <gourry@gourry.net> wrote:
>
> > +     for_each_node_mask(nid, allowed_mask) {
> > +             int z;
> > +             struct zone *zone;
> > +             struct pglist_data *pgdat = NODE_DATA(nid);
> > +
> > +             for_each_managed_zone_pgdat(zone, pgdat, z, MAX_NR_ZONES - 1) {
> > +                     if (zone_watermark_ok(zone, 0, min_wmark_pages(zone),
> > +                                             ZONE_MOVABLE, 0))
>
> Why does this only check zone movable?

Here, zone_watermark_ok() checks the free memory for all zones from 0 to
MAX_NR_ZONES - 1.
There is no strong reason to pass ZONE_MOVABLE as the highest_zoneidx
argument every time zone_watermark_ok() is called; I can change it if an
appropriate value is found.
In v1, highest_zoneidx was "sc ? sc->reclaim_idx : MAX_NR_ZONES - 1"
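
(For context, highest_zoneidx here only selects which lowmem_reserve
entry __zone_watermark_ok() subtracts from the free pages:

	if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
		return false;

so the choice of ZONE_MOVABLE just decides how much of each zone's
reserve is left out of the check.)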

> Also, would this also limit pressure-signal to invoke reclaim when
> there is still swap space available?  Should demotion not be a pressure
> source for triggering harder reclaim?

Since can_reclaim_anon_pages() checks whether there is free space on the swap
device before checking with can_demote(), I think the negative impact of this
change will be small. However, since I have not been able to confirm the
behavior when a swap device is available, I would like to correctly understand
the impact.
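
For reference, the ordering I mean looks roughly like this (condensed
from mm/vmscan.c; the exact signatures in linux-next may differ):

	static inline bool can_reclaim_anon_pages(...)
	{
		/* swap space, if any, is considered first ... */
		if (get_nr_swap_pages() > 0)
			return true;
		...
		/* ... and demotion is only the fallback */
		return can_demote(nid, sc, memcg);
	}

so a false return from can_demote() should only matter once swap is
already full.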



* Re: [PATCH v3 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
  2026-01-10 13:55     ` Akinobu Mita
@ 2026-01-27 20:24       ` Gregory Price
  2026-01-27 23:28         ` Bing Jiao
                           ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Gregory Price @ 2026-01-27 20:24 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
	weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, bingjiao

On Sat, Jan 10, 2026 at 10:55:02PM +0900, Akinobu Mita wrote:
> On Sat, 10 Jan 2026 at 1:08, Gregory Price <gourry@gourry.net> wrote:
> >
> > > +     for_each_node_mask(nid, allowed_mask) {
> > > +             int z;
> > > +             struct zone *zone;
> > > +             struct pglist_data *pgdat = NODE_DATA(nid);
> > > +
> > > +             for_each_managed_zone_pgdat(zone, pgdat, z, MAX_NR_ZONES - 1) {
> > > +                     if (zone_watermark_ok(zone, 0, min_wmark_pages(zone),
> > > +                                             ZONE_MOVABLE, 0))
> >
> > Why does this only check zone movable?
> 
> Here, zone_watermark_ok() checks the free memory for all zones from 0 to
> MAX_NR_ZONES - 1.
> There is no strong reason to pass ZONE_MOVABLE as the highest_zoneidx
> argument every time zone_watermark_ok() is called; I can change it if an
> appropriate value is found.
> In v1, highest_zoneidx was "sc ? sc->reclaim_idx : MAX_NR_ZONES - 1"
> 
> > Also, would this also limit pressure-signal to invoke reclaim when
> > there is still swap space available?  Should demotion not be a pressure
> > source for triggering harder reclaim?
> 
> Since can_reclaim_anon_pages() checks whether there is free space on the swap
> device before checking with can_demote(), I think the negative impact of this
> change will be small. However, since I have not been able to confirm the
> behavior when a swap device is available, I would like to correctly understand
> the impact.

Something else is going on here

See demote_folio_list and alloc_demote_folio

static unsigned int demote_folio_list(struct list_head *demote_folios,
                                      struct pglist_data *pgdat,
                                      struct mem_cgroup *memcg)
{
        struct migration_target_control mtc = {
                .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
                        __GFP_NOMEMALLOC | GFP_NOWAIT,
        };
}

static struct folio *alloc_demote_folio(struct folio *src,
                unsigned long private)
{
	/* Only attempt to demote to the preferred node */
        mtc->nmask = NULL;
        mtc->gfp_mask |= __GFP_THISNODE;
        dst = alloc_migration_target(src, (unsigned long)mtc);
        if (dst)
                return dst;

	/* Now attempt to demote to any node in the lower tier */
        mtc->gfp_mask &= ~__GFP_THISNODE;
        mtc->nmask = allowed_mask;
        return alloc_migration_target(src, (unsigned long)mtc);
}


/*
* %__GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
*/


You basically shouldn't be hitting any reclaim behavior at all, and if
the target nodes are actually under various watermarks, you should be
getting allocation failures and quick-outs from the demotion logic.

i.e. you should be seeing OOM happen

When I dug in far enough I found this:

static struct folio *alloc_demote_folio(struct folio *src,
                unsigned long private)
{
...
        dst = alloc_migration_target(src, (unsigned long)mtc);
}

struct folio *alloc_migration_target(struct folio *src, unsigned long private)
{
        
...
        if (folio_test_hugetlb(src)) {
                struct hstate *h = folio_hstate(src);

                gfp_mask = htlb_modify_alloc_mask(h, gfp_mask);
                return alloc_hugetlb_folio_nodemask(h, nid, ...)
	}
}

static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
{
        gfp_t modified_mask = htlb_alloc_mask(h);

        /* Some callers might want to enforce node */
        modified_mask |= (gfp_mask & __GFP_THISNODE);

        modified_mask |= (gfp_mask & __GFP_NOWARN);

        return modified_mask;
}

/* Movability of hugepages depends on migration support. */
static inline gfp_t htlb_alloc_mask(struct hstate *h)
{
        gfp_t gfp = __GFP_COMP | __GFP_NOWARN;

        gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;

        return gfp;
}

#define GFP_USER        (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_HIGHUSER    (GFP_USER | __GFP_HIGHMEM)
#define GFP_HIGHUSER_MOVABLE    (GFP_HIGHUSER | __GFP_MOVABLE | __GFP_SKIP_KASAN)


If we try to move a hugepage, we start including __GFP_RECLAIM again -
regardless of whether HIGHUSER_MOVABLE or HIGHUSER is used.


Any chance you are using hugetlb on this system?  This looks like a
clear bug, but it may not be what you're experiencing.

~Gregory



* Re: [PATCH v3 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
  2026-01-27 20:24       ` Gregory Price
@ 2026-01-27 23:28         ` Bing Jiao
  2026-01-27 23:43           ` Gregory Price
  2026-01-28  9:56         ` Michal Hocko
  2026-01-29  0:44         ` Akinobu Mita
  2 siblings, 1 reply; 19+ messages in thread
From: Bing Jiao @ 2026-01-27 23:28 UTC (permalink / raw)
  To: Gregory Price
  Cc: Akinobu Mita, linux-cxl, linux-kernel, linux-mm, akpm,
	axelrasmussen, yuanchu, weixugc, hannes, david, mhocko,
	zhengqi.arch, shakeel.butt, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb

On Tue, Jan 27, 2026 at 03:24:36PM -0500, Gregory Price wrote:
> On Sat, Jan 10, 2026 at 10:55:02PM +0900, Akinobu Mita wrote:
> > Since can_reclaim_anon_pages() checks whether there is free space on the swap
> > device before checking with can_demote(), I think the negative impact of this
> > change will be small. However, since I have not been able to confirm the
> > behavior when a swap device is available, I would like to correctly understand
> > the impact.
>
> Something else is going on here
>
> See demote_folio_list and alloc_demote_folio
>
> static unsigned int demote_folio_list(struct list_head *demote_folios,
>                                       struct pglist_data *pgdat,
>                                       struct mem_cgroup *memcg)
> {
>         struct migration_target_control mtc = {
>                 .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
>                         __GFP_NOMEMALLOC | GFP_NOWAIT,
>         };
> }
>
> static struct folio *alloc_demote_folio(struct folio *src,
>                 unsigned long private)
> {
> 	/* Only attempt to demote to the preferred node */
>         mtc->nmask = NULL;
>         mtc->gfp_mask |= __GFP_THISNODE;
>         dst = alloc_migration_target(src, (unsigned long)mtc);
>         if (dst)
>                 return dst;
>
> 	/* Now attempt to demote to any node in the lower tier */
>         mtc->gfp_mask &= ~__GFP_THISNODE;
>         mtc->nmask = allowed_mask;
>         return alloc_migration_target(src, (unsigned long)mtc);
> }
>
>
> /*
> * %__GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
> */
>
>
> You basically shouldn't be hitting any reclaim behavior at all, and if
> the target nodes are actually under various watermarks, you should be
> getting allocation failures and quick-outs from the demotion logic.

Hi, Gregory, hope you are doing well.

I observed that during the allocation of a large folio,
alloc_migration_target() clears __GFP_RECLAIM but subsequently applies
GFP_TRANSHUGE. Given that GFP_TRANSHUGE includes __GFP_DIRECT_RECLAIM,
I am wondering if this triggers a form of reclamation that should be
avoided during demotion.

struct folio *alloc_migration_target(struct folio *src, unsigned long private)
...
	if (folio_test_large(src)) {
		/*
		 * clear __GFP_RECLAIM to make the migration callback
		 * consistent with regular THP allocations.
		 */
		gfp_mask &= ~__GFP_RECLAIM;
		gfp_mask |= GFP_TRANSHUGE;
		order = folio_order(src);
	}

#define GFP_TRANSHUGE	(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)

Best,
Bing



* Re: [PATCH v3 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
  2026-01-27 23:28         ` Bing Jiao
@ 2026-01-27 23:43           ` Gregory Price
  0 siblings, 0 replies; 19+ messages in thread
From: Gregory Price @ 2026-01-27 23:43 UTC (permalink / raw)
  To: Bing Jiao
  Cc: Akinobu Mita, linux-cxl, linux-kernel, linux-mm, akpm,
	axelrasmussen, yuanchu, weixugc, hannes, david, mhocko,
	zhengqi.arch, shakeel.butt, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb

On Tue, Jan 27, 2026 at 11:28:59PM +0000, Bing Jiao wrote:
> Hi, Gregory, hope you are doing well.
> 
> I observed that during the allocation of a large folio,
> alloc_migration_target() clears __GFP_RECLAIM but subsequently applies
> GFP_TRANSHUGE. Given that GFP_TRANSHUGE includes __GFP_DIRECT_RECLAIM,
> I am wondering if this triggers a form of reclamation that should be
> avoided during demotion.
> 
> struct folio *alloc_migration_target(struct folio *src, unsigned long private)
> ...
> 	if (folio_test_large(src)) {
> 		/*
> 		 * clear __GFP_RECLAIM to make the migration callback
> 		 * consistent with regular THP allocations.
> 		 */
> 		gfp_mask &= ~__GFP_RECLAIM;
> 		gfp_mask |= GFP_TRANSHUGE;
> 		order = folio_order(src);
> 	}
> 
> #define GFP_TRANSHUGE	(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
> 

I think the answer is that the demotion code is a mess and no one
actually knows what it's doing.  We probably need to rework this
entirely because we've now found at least 2 paths which clear and then
re-add reclaim.

~Gregory



* Re: [PATCH v3 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
  2026-01-27 20:24       ` Gregory Price
  2026-01-27 23:28         ` Bing Jiao
@ 2026-01-28  9:56         ` Michal Hocko
  2026-01-28 14:21           ` Gregory Price
  2026-01-29  0:44         ` Akinobu Mita
  2 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2026-01-28  9:56 UTC (permalink / raw)
  To: Gregory Price
  Cc: Akinobu Mita, linux-cxl, linux-kernel, linux-mm, akpm,
	axelrasmussen, yuanchu, weixugc, hannes, david, zhengqi.arch,
	shakeel.butt, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, bingjiao

On Tue 27-01-26 15:24:36, Gregory Price wrote:
> On Sat, Jan 10, 2026 at 10:55:02PM +0900, Akinobu Mita wrote:
> > On Sat, 10 Jan 2026 at 1:08, Gregory Price <gourry@gourry.net> wrote:
> > >
> > > > +     for_each_node_mask(nid, allowed_mask) {
> > > > +             int z;
> > > > +             struct zone *zone;
> > > > +             struct pglist_data *pgdat = NODE_DATA(nid);
> > > > +
> > > > +             for_each_managed_zone_pgdat(zone, pgdat, z, MAX_NR_ZONES - 1) {
> > > > +                     if (zone_watermark_ok(zone, 0, min_wmark_pages(zone),
> > > > +                                             ZONE_MOVABLE, 0))
> > >
> > > Why does this only check zone movable?
> > 
> > Here, zone_watermark_ok() checks the free memory for all zones from 0 to
> > MAX_NR_ZONES - 1.
> > There is no strong reason to pass ZONE_MOVABLE as the highest_zoneidx
> > argument every time zone_watermark_ok() is called; I can change it if an
> > appropriate value is found.
> > In v1, highest_zoneidx was "sc ? sc->reclaim_idx : MAX_NR_ZONES - 1"
> > 
> > > Also, would this also limit pressure-signal to invoke reclaim when
> > > there is still swap space available?  Should demotion not be a pressure
> > > source for triggering harder reclaim?
> > 
> > Since can_reclaim_anon_pages() checks whether there is free space on the swap
> > device before checking with can_demote(), I think the negative impact of this
> > change will be small. However, since I have not been able to confirm the
> > behavior when a swap device is available, I would like to correctly understand
> > the impact.
> 
> Something else is going on here
> 
> See demote_folio_list and alloc_demote_folio
> 
> static unsigned int demote_folio_list(struct list_head *demote_folios,
>                                       struct pglist_data *pgdat,
>                                       struct mem_cgroup *memcg)
> {
>         struct migration_target_control mtc = {
>                 .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
>                         __GFP_NOMEMALLOC | GFP_NOWAIT,
>         };
> }
> 
> static struct folio *alloc_demote_folio(struct folio *src,
>                 unsigned long private)
> {
> 	/* Only attempt to demote to the preferred node */
>         mtc->nmask = NULL;
>         mtc->gfp_mask |= __GFP_THISNODE;
>         dst = alloc_migration_target(src, (unsigned long)mtc);
>         if (dst)
>                 return dst;
> 
> 	/* Now attempt to demote to any node in the lower tier */
>         mtc->gfp_mask &= ~__GFP_THISNODE;
>         mtc->nmask = allowed_mask;
>         return alloc_migration_target(src, (unsigned long)mtc);
> }
> 
> 
> /*
> * %__GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
> */
> 
> 
> You basically shouldn't be hitting any reclaim behavior at all, and if

This will trigger kswapd so there will be background reclaim demoting
from those lower tiers.

> the target nodes are actually under various watermarks, you should be
> getting allocation failures and quick-outs from the demotion logic.
> 
> i.e. you should be seeing OOM happen
> 
> When I dug in far enough I found this:
> 
> static struct folio *alloc_demote_folio(struct folio *src,
>                 unsigned long private)
> {
> ...
>         dst = alloc_migration_target(src, (unsigned long)mtc);
> }
> 
> struct folio *alloc_migration_target(struct folio *src, unsigned long private)
> {
>         
> ...
>         if (folio_test_hugetlb(src)) {
>                 struct hstate *h = folio_hstate(src);
> 
>                 gfp_mask = htlb_modify_alloc_mask(h, gfp_mask);
>                 return alloc_hugetlb_folio_nodemask(h, nid, ...)
> 	}
> }
> 
> static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
> {
>         gfp_t modified_mask = htlb_alloc_mask(h);
> 
>         /* Some callers might want to enforce node */
>         modified_mask |= (gfp_mask & __GFP_THISNODE);
> 
>         modified_mask |= (gfp_mask & __GFP_NOWARN);
> 
>         return modified_mask;
> }
> 
> /* Movability of hugepages depends on migration support. */
> static inline gfp_t htlb_alloc_mask(struct hstate *h)
> {
>         gfp_t gfp = __GFP_COMP | __GFP_NOWARN;
> 
>         gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
> 
>         return gfp;
> }
> 
> #define GFP_USER        (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
> #define GFP_HIGHUSER    (GFP_USER | __GFP_HIGHMEM)
> #define GFP_HIGHUSER_MOVABLE    (GFP_HIGHUSER | __GFP_MOVABLE | __GFP_SKIP_KASAN)
> 
> 
> If we try to move a hugepage, we start including __GFP_RECLAIM again -
> regardless of whether HIGHUSER_MOVABLE or HIGHUSER is used.
> 
> 
> Any chance you are using hugetlb on this system?  This looks like a
> clear bug, but it may not be what you're experiencing.

Hugetlb pages are not sitting on LRU lists so they are not participating
in the demotion.

Or maybe I missed your point.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v3 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
  2026-01-28  9:56         ` Michal Hocko
@ 2026-01-28 14:21           ` Gregory Price
  2026-01-28 21:14             ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Gregory Price @ 2026-01-28 14:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Akinobu Mita, linux-cxl, linux-kernel, linux-mm, akpm,
	axelrasmussen, yuanchu, weixugc, hannes, david, zhengqi.arch,
	shakeel.butt, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, bingjiao

On Wed, Jan 28, 2026 at 10:56:44AM +0100, Michal Hocko wrote:
> >                 .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
> >                         __GFP_NOMEMALLOC | GFP_NOWAIT,
> >         };
> 
> This will trigger kswapd so there will be background reclaim demoting
> from those lower tiers.
> 

given the node is full kswapd will be running, but the above line masks
~__GFP_RECLAIM so it's not supposed to trigger either reclaim path.

> > Any chance you are using hugetlb on this system?  This looks like a
> > clear bug, but it may not be what you're experiencing.
> 
> Hugetlb pages are not sitting on LRU lists so they are not participating
> in the demotion.
> 

I noted in the v4 thread (responded there too) this was the case.
https://lore.kernel.org/linux-mm/aXksUiwYGwad5JvC@gourry-fedora-PF4VCD3F/

But since then we found another path through this code that adds
reclaim back on as well - and i wouldn't be surprised to find more.

the bigger issue is that this fix can cause inversions in transient
pressure situations - and in fact the current code will cause inversions
instead of waiting for reclaim to clear out lower nodes.

The reality is this code probably needs a proper look and detangling.

This has been on my back-burner for a while - I've wanted to sink the
actual demotion code into memory-tiers.c and provide something like:

... mt_demote_folios(src_nid, folio_list)
{
	/* apply some demotion policy here */
}

~Gregory



* Re: [PATCH v3 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
  2026-01-28 14:21           ` Gregory Price
@ 2026-01-28 21:14             ` Michal Hocko
  0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2026-01-28 21:14 UTC (permalink / raw)
  To: Gregory Price
  Cc: Akinobu Mita, linux-cxl, linux-kernel, linux-mm, akpm,
	axelrasmussen, yuanchu, weixugc, hannes, david, zhengqi.arch,
	shakeel.butt, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, bingjiao

On Wed 28-01-26 09:21:45, Gregory Price wrote:
> On Wed, Jan 28, 2026 at 10:56:44AM +0100, Michal Hocko wrote:
> > >                 .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
> > >                         __GFP_NOMEMALLOC | GFP_NOWAIT,
> > >         };
> > 
> > This will trigger kswapd so there will be background reclaim demoting
> > from those lower tiers.
> > 
> 
> given the node is full kswapd will be running, but the above line masks
> ~__GFP_RECLAIM so it's not supposed to trigger either reclaim path.

Yeah, my bad, I haven't looked carefully enough.
 
> > > Any chance you are using hugetlb on this system?  This looks like a
> > > clear bug, but it may not be what you're experiencing.
> > 
> > Hugetlb pages are not sitting on LRU lists so they are not participating
> > in the demotion.
> > 
> 
> I noted in the v4 thread (responded there too) this was the case.
> https://lore.kernel.org/linux-mm/aXksUiwYGwad5JvC@gourry-fedora-PF4VCD3F/
> 
> But since then we found another path through this code that adds
> reclaim back on as well - and i wouldn't be surprised to find more.
> 
> the bigger issue is that this fix can cause inversions in transient
> pressure situations - and in fact the current code will cause inversions
> instead of waiting for reclaim to clear out lower nodes.
> 
> The reality is this code probably needs a proper look and detangling.

Agreed!

> This has been on my back-burner for a while - i've wanted to sink the 
> actual demotion code into memory-tiers.c and provide something like:
> 
> ... mt_demote_folios(src_nid, folio_list)
> {
> 	/* apply some demotion policy here */
> }
> 
> ~Gregory

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v3 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
  2026-01-27 20:24       ` Gregory Price
  2026-01-27 23:28         ` Bing Jiao
  2026-01-28  9:56         ` Michal Hocko
@ 2026-01-29  0:44         ` Akinobu Mita
  2 siblings, 0 replies; 19+ messages in thread
From: Akinobu Mita @ 2026-01-29  0:44 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-cxl, linux-kernel, linux-mm, akpm, axelrasmussen, yuanchu,
	weixugc, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, bingjiao

On Wed, 28 Jan 2026 at 5:24, Gregory Price <gourry@gourry.net> wrote:
>
> On Sat, Jan 10, 2026 at 10:55:02PM +0900, Akinobu Mita wrote:
> > On Sat, 10 Jan 2026 at 1:08, Gregory Price <gourry@gourry.net> wrote:
> > >
> > > > +     for_each_node_mask(nid, allowed_mask) {
> > > > +             int z;
> > > > +             struct zone *zone;
> > > > +             struct pglist_data *pgdat = NODE_DATA(nid);
> > > > +
> > > > +             for_each_managed_zone_pgdat(zone, pgdat, z, MAX_NR_ZONES - 1) {
> > > > +                     if (zone_watermark_ok(zone, 0, min_wmark_pages(zone),
> > > > +                                             ZONE_MOVABLE, 0))
> > >
> > > Why does this only check zone movable?
> >
> > Here, zone_watermark_ok() checks the free memory for all zones from 0 to
> > MAX_NR_ZONES - 1.
> > There is no strong reason to pass ZONE_MOVABLE as the highest_zoneidx
> > argument every time zone_watermark_ok() is called; I can change it if an
> > appropriate value is found.
> > In v1, highest_zoneidx was "sc ? sc->reclaim_idx : MAX_NR_ZONES - 1"
> >
> > > Also, would this also limit pressure-signal to invoke reclaim when
> > > there is still swap space available?  Should demotion not be a pressure
> > > source for triggering harder reclaim?
> >
> > Since can_reclaim_anon_pages() checks whether there is free space on the swap
> > device before checking with can_demote(), I think the negative impact of this
> > change will be small. However, since I have not been able to confirm the
> > behavior when a swap device is available, I would like to correctly understand
> > the impact.
>
> Something else is going on here
>
> See demote_folio_list and alloc_demote_folio
>
> static unsigned int demote_folio_list(struct list_head *demote_folios,
>                                       struct pglist_data *pgdat,
>                                       struct mem_cgroup *memcg)
> {
>         struct migration_target_control mtc = {
>                 .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
>                         __GFP_NOMEMALLOC | GFP_NOWAIT,
>         };
> }
>
> static struct folio *alloc_demote_folio(struct folio *src,
>                 unsigned long private)
> {
>         /* Only attempt to demote to the preferred node */
>         mtc->nmask = NULL;
>         mtc->gfp_mask |= __GFP_THISNODE;
>         dst = alloc_migration_target(src, (unsigned long)mtc);
>         if (dst)
>                 return dst;
>
>         /* Now attempt to demote to any node in the lower tier */
>         mtc->gfp_mask &= ~__GFP_THISNODE;
>         mtc->nmask = allowed_mask;
>         return alloc_migration_target(src, (unsigned long)mtc);
> }
>
>
> /*
> * %__GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
> */
>
>
> You basically shouldn't be hitting any reclaim behavior at all, and if
> the target nodes are actually under various watermarks, you should be
> getting allocation failures and quick-outs from the demotion logic.
>
> i.e. you should be seeing OOM happen
>
> When I dug in far enough I found this:
>
> static struct folio *alloc_demote_folio(struct folio *src,
>                 unsigned long private)
> {
> ...
>         dst = alloc_migration_target(src, (unsigned long)mtc);
> }
>
> struct folio *alloc_migration_target(struct folio *src, unsigned long private)
> {
>
> ...
>         if (folio_test_hugetlb(src)) {
>                 struct hstate *h = folio_hstate(src);
>
>                 gfp_mask = htlb_modify_alloc_mask(h, gfp_mask);
>                 return alloc_hugetlb_folio_nodemask(h, nid, ...)
>         }
> }
>
> static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
> {
>         gfp_t modified_mask = htlb_alloc_mask(h);
>
>         /* Some callers might want to enforce node */
>         modified_mask |= (gfp_mask & __GFP_THISNODE);
>
>         modified_mask |= (gfp_mask & __GFP_NOWARN);
>
>         return modified_mask;
> }
>
> /* Movability of hugepages depends on migration support. */
> static inline gfp_t htlb_alloc_mask(struct hstate *h)
> {
>         gfp_t gfp = __GFP_COMP | __GFP_NOWARN;
>
>         gfp |= hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE : GFP_HIGHUSER;
>
>         return gfp;
> }
>
> #define GFP_USER        (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
> #define GFP_HIGHUSER    (GFP_USER | __GFP_HIGHMEM)
> #define GFP_HIGHUSER_MOVABLE    (GFP_HIGHUSER | __GFP_MOVABLE | __GFP_SKIP_KASAN)
>
>
> If we try to move a hugepage, we start including __GFP_RECLAIM again -
> regardless of whether HIGHUSER_MOVABLE or HIGHUSER is used.
>
>
> Any chance you are using hugetlb on this system?  This looks like a
> clear bug, but it may not be what you're experiencing.

In the case where I reproduced the issue, alloc_demote_folio() failed
almost every time, but both folio_test_hugetlb() and folio_test_large()
were always false for the folio passed to alloc_migration_target().


