* [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory
@ 2025-04-20 19:40 SeongJae Park
  2025-04-20 19:40 ` [PATCH 1/7] mm/damon/core: introduce damos quota goal metrics for memory node utilization SeongJae Park
                   ` (8 more replies)
  0 siblings, 9 replies; 13+ messages in thread
From: SeongJae Park @ 2025-04-20 19:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Jonathan Corbet, damon, kernel-team, linux-doc,
	linux-kernel, linux-mm

Utilizing DAMON for memory tiering usually requires manual tuning and/or
tedious controls.  Let it self-tune the hotness and coldness thresholds
for promotion and demotion, aiming for high utilization of upper memory
tiers, by introducing new DAMOS quota goal metrics representing the used
and the free memory ratios of specific NUMA nodes.  Also introduce a
sample DAMON module that demonstrates how the new feature can be used for
memory tiering use cases.

Backgrounds
===========

A type of tiered memory system exposes the memory tiers as NUMA nodes.
A straightforward page placement strategy for such systems is placing
access-hot and cold pages on upper and lower tiers, respectively,
pursuing higher utilization of the upper tiers.  Since access temperature
can be dynamic, periodically finding and migrating hot and cold pages to
the proper tiers (promotion and demotion) is also required.  The Linux
kernel provides several features for such dynamic and transparent page
placement.

Page Faults and LRU
-------------------

One widely known way is using NUMA balancing in tiering mode (a.k.a.
NUMAB-2) together with the reclaim-based demotion feature.  In this
setup, NUMAB-2 finds hot pages using access-check-purpose page faults
(a.k.a. prot_none) and promotes them in each process's context, until
there are no more pages to promote, or the upper tier is filled up and
memory pressure happens.  In the latter case, the LRU-based reclaim logic
wakes up in response to the memory pressure and demotes cold pages to
lower tiers in asynchronous (kswapd) and/or synchronous (direct reclaim)
ways.

DAMON
-----

Yet another available solution is using DAMOS with the migrate_hot and
migrate_cold DAMOS actions for promotion and demotion, respectively.  To
make it optimal, users need to specify the aggressiveness and the access
temperature thresholds for promotion and demotion in a balance that
results in high utilization of upper tiers.  The number of parameters is
not small, and the optimal values depend on the characteristics of the
underlying hardware and the workload.  As a result, it often requires
manual, time-consuming, and repetitive tuning of the DAMOS schemes for
given workload and system combinations.

Self-tuned DAMON-based Memory Tiering
=====================================

To solve such manual tuning problems, DAMOS provides aim-oriented,
feedback-driven self-tuning of quotas.  Using the feature, we design a
self-tuned DAMON-based memory tiering for general multi-tier memory
systems.

For each memory tier node, if it has a lower tier, run a DAMOS scheme
that demotes cold pages of the node, auto-tuning the aggressiveness while
aiming for a given amount of free space on the node.  The free space is
for keeping the headroom that avoids significant memory pressure during
upper-tier memory usage spikes, and for promoting hot pages from the
lower tier.

For each memory tier node, if it has an upper tier, run a DAMOS scheme
that promotes hot pages of the current node to the upper tier,
auto-tuning the aggressiveness while aiming for a high utilization ratio
of the upper tier.  The target ratio is to ensure higher tiers are
utilized as much as possible.  It should match the headroom of the
demotion scheme, but with a slight overlap, to ensure promotion and
demotion are not entirely stopped.

The aim-oriented aggressiveness auto-tuning of DAMOS is already
available.  Hence, to implement such a tiering solution, only new quota
goal metrics for the utilization and free space ratios of a specific NUMA
node need to be developed.

Discussions
===========

The design raises the discussion points below.

Expected Behaviors
------------------

The system will let the upper-tier memory node accommodate as much hot
data as possible.  If the total amount of the data is less than the
top-tier memory's promotion/demotion target utilization, all the data
will simply be placed on the top tier.  The promotion scheme will do
nothing since there is no data to promote.  The demotion scheme will also
do nothing since the free space ratio of the top tier is higher than the
goal.

Only when the amount of data is larger than the top tier's target
utilization will the demotion scheme demote cold pages and ensure the
headroom free space.  Since the promotion and demotion schemes for a
single node have a small overlap at their target utilization and free
space goals, promotions and demotions will continue working with a
moderate aggressiveness level.  It will keep the data placed according to
access hotness under dynamic access patterns, while minimizing the
migration overhead.

In any case, each node will keep the headroom free space, and the upper
tiers will be utilized as much as possible.

Ease of Use
-----------

Users still need to set the target utilization and free space ratios, but
these are easier to set.  We argue 99.7 % utilization and 0.5 % free
space ratios can be good default values.  They can be easily adjusted
based on the desired headroom size of the given use case.  Users are also
still required to provide the minimum coldness and hotness thresholds.
Together with the monitoring intervals auto-tuning[2], DAMON will always
show a meaningful amount of hot and cold memory.  And DAMOS quota's
prioritization mechanism will make good decisions as long as the source
information is that rich.  Hence, users can set the minimum criteria very
naively.  We believe any observed access, and no observed access, within
the last aggregation interval are sufficient as the minimum hot and cold
region criteria, respectively.

General Tiered Memory Setup Applicability
-----------------------------------------

The design can be applied to any number of tiers having any performance
characteristics, as long as they can be hierarchically ordered.  Hence,
applying the system to a different tiered memory system will be
straightforward.  Note that this assumes a setup with only a single CPU
NUMA node.  Because today's DAMON is not aware of which CPU made each
access, applying this to systems having multiple CPU NUMA nodes can be
complicated.  We are planning to extend DAMON for the use case, but
that's out of the scope of this patch series.

How To Use
----------

Users can implement the auto-tuned DAMON-based memory tiering using the
DAMON sysfs interface.  It can be easily done using a DAMON user-space
tool such as damo.  The evaluation results section below shows an example
DAMON user-space tool command for that.
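
For example, below is a rough sketch of the sysfs sequence for setting
such a promotion scheme's quota goal, assuming a kdamond having a single
context and a single migrate_hot scheme has already been constructed (the
index-0 paths are for illustration):

    # cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/quotas/goals
    # echo 1 > nr_goals
    # echo node_mem_used_bp > 0/target_metric
    # echo 9970 > 0/target_value    # aim for 99.7% used memory
    # echo 0 > 0/nid                # of NUMA node 0
    # echo commit_schemes_quota_goals > /sys/kernel/mm/damon/admin/kdamonds/0/state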

For wider and simpler deployment, having a kernel module that sets up and
runs the DAMOS schemes via the DAMON kernel API can be useful.  The
module can enable the memory tiering at boot time via a kernel command
line parameter, or at run time with a single command.  This patch series
implements a sample DAMON kernel module that shows how such a module can
be implemented.
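
For example, assuming the sample module that this series implements is
built into the kernel and its node address range parameters are set, it
could be turned on at run time like below (the parameter file path is an
assumption derived from the sample's source file name):

    # echo Y > /sys/module/mtier/parameters/enable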

Comparison To Page Faults and LRU-based Approaches
--------------------------------------------------

The existing page-fault-based promotion (NUMAB-2) does hot page detection
and migration in the process context.  When there are many pages to
promote, it can block the progress of the application's real work.  DAMOS
works in an asynchronous worker thread, so it doesn't block the real
work.

NUMAB-2 doesn't provide a way to control the aggressiveness of promotion
other than the maximum amount of pages to promote per given time window.
If hot pages are found, promotions can happen at the upper-bound speed,
regardless of the upper tier's memory pressure.  If the maximum speed is
not well set for the given workload, it can result in slow promotion or
unnecessary memory pressure.  Self-tuned DAMON-based memory tiering
alleviates the problem by adjusting the speed based on the current
utilization of the upper tier.

LRU-based demotion can be triggered in both asynchronous (kswapd) and
synchronous (direct reclaim) ways.  Other than the way of finding cold
pages, asynchronous LRU-based demotion and DAMON-based demotion have no
big difference.  DAMON-based demotion can make a better balance with
DAMON-based promotion, though.  LRU-based demotion can do better than
DAMON-based demotion when the tier is under significant memory pressure.
It would be wise to use DAMON-based demotion as the proactive and primary
one, while utilizing LRU-based demotion together as a fast backup
solution.

Evaluation
==========

In short, under a setup that requires fast and frequent promotions,
self-tuned DAMON-based memory tiering's hot pages promotion improves
performance by about 4.42 %.  We believe this shows self-tuned
DAMON-based promotion's effectiveness.  Meanwhile, NUMAB-2's hot pages
promotion degrades the performance by about 7.34 %.  We suspect the
degradation is mostly due to NUMAB-2's synchronous nature, which can
block the application's progress; this highlights the advantage of the
DAMON-based solution's asynchronous nature.

Note that the test was done with the RFC version of this patch series.
We didn't run it again since this patch series has no meaningful change
after the RFC, while the test takes a pretty long time.

Setup
-----

Hardware.  Use a machine equipped with a 250 GiB DRAM memory tier and a
50 GiB CXL memory tier.  The tiers are exposed as NUMA nodes 0 and 1,
respectively.

Kernel.  Use Linux kernel v6.13, modified as follows.  Add all DAMON
patches available on the mm tree as of 2025-03-15, and this patch series.
Also modify it to ignore memory policy system calls (e.g.,
set_mempolicy()), to avoid bad effects from the application's
optimizations that assume traditional NUMA systems.

Workload.  Use a modified version of the Taobench benchmark[3] that is
available in the DCPerf benchmark suite.  It represents an in-memory
caching workload.  We set its 'memsize', 'warmup_time', and 'test_time'
parameters to 340 GiB, 2,500 seconds, and 1,440 seconds, respectively.
The parameters are chosen to ensure the workload uses more memory than
the DRAM tier provides.  Its RSS under these parameters grows to 270 GiB
within the warmup time.

It turned out the workload has a very static access pattern.  Only about
13 % of the RSS is frequently accessed from beginning to end.  Hence
promotion shows no meaningful performance difference regardless of
different designs and implementations.  We therefore modify the kernel to
periodically demote up to 10 GiB of hot pages and promote up to 10 GiB of
cold pages once per minute.  The intention is to simulate periodic access
pattern changes.  The hotness and coldness thresholds are set very
naively, so that it is more like a random access pattern change than a
strict hot/cold pages exchange.  This is why we call the workload
"modified".  It is implemented as two DAMOS schemes, each running on an
asynchronous thread.  It can be reproduced with the DAMON user-space tool
like below.

    # ./damo start \
        --ops paddr --numa_node 0 --monitoring_intervals 10s 200s 200s \
            --damos_action migrate_hot 1 \
            --damos_quota_interval 60s --damos_quota_space 10G \
        --ops paddr --numa_node 1 --monitoring_intervals 10s 200s 200s \
            --damos_action migrate_cold 0 \
            --damos_quota_interval 60s --damos_quota_space 10G \
        --nr_schemes 1 1 --nr_targets 1 1 --nr_ctxs 1 1

System configurations.  Use the following variant system configurations.

- Baseline.  No memory tiering features are turned on.
- Numab_tiering.  On the baseline, enable NUMAB-2 and reclaim-based
  demotion.  In detail, the following commands are executed:
  echo 2 > /proc/sys/kernel/numa_balancing;
  echo 1 > /sys/kernel/mm/numa/demotion_enabled;
  echo 7 > /proc/sys/vm/zone_reclaim_mode
- DAMON_tiering.  On the baseline, utilize the self-tuned DAMON-based
  memory tiering implementation via the DAMON user-space tool.  It
  utilizes two kernel threads, namely a promotion thread and a demotion
  thread.  The demotion thread monitors the access pattern of the DRAM
  node using DAMON with auto-tuned monitoring intervals aiming for a 4%
  DAMON-observed access ratio, and demotes the coldest pages at up to 200
  MiB per second, aiming for 0.5% free space on the DRAM node.  The
  promotion thread monitors the CXL node using the same intervals
  auto-tuning, and promotes hot pages in the same way, but aiming for
  99.7% utilization of the DRAM node.  Because DAMON provides only
  best-effort accuracy, add young page DAMOS filters that allow only
  young pages when promoting and reject all young pages when demoting.
  It can be reproduced with the DAMON user-space tool like below.

    # ./damo start \
        --numa_node 0 --monitoring_intervals_goal 4% 3 5ms 10s \
            --damos_action migrate_cold 1 --damos_access_rate 0% 0% \
            --damos_apply_interval 1s \
            --damos_quota_interval 1s --damos_quota_space 200MB \
            --damos_quota_goal node_mem_free_bp 0.5% 0 \
            --damos_filter reject young \
        --numa_node 1 --monitoring_intervals_goal 4% 3 5ms 10s \
            --damos_action migrate_hot 0 --damos_access_rate 5% max \
            --damos_apply_interval 1s \
            --damos_quota_interval 1s --damos_quota_space 200MB \
            --damos_quota_goal node_mem_used_bp 99.7% 0 \
            --damos_filter allow young \
            --damos_nr_quota_goals 1 1 --damos_nr_filters 1 1 \
        --nr_targets 1 1 --nr_schemes 1 1 --nr_ctxs 1 1

Measurement Results
-------------------

On each system configuration, run the modified version of Taobench and
collect the 'score'.  'score' is a metric calculated and provided by
Taobench to represent the performance of the run on the system.  To
handle measurement errors, repeat the measurement five times.  The
results are as below.

    Config         Score   Stdev   (%)     Normalized
    Baseline       1.6165  0.0319  1.9764  1.0000
    Numab_tiering  1.4976  0.0452  3.0209  0.9264
    DAMON_tiering  1.6881  0.0249  1.4767  1.0443

The 'Config' column shows the system config of the measurement.  The
'Score' column shows the average of the five 'score' measurements on the
system config.  The 'Stdev' column shows the standard deviation of the
five measured scores.  The '(%)' column shows the 'Stdev' to 'Score'
ratio in percentage.  Finally, the 'Normalized' column shows the averaged
score values of the configs, normalized to that of 'Baseline'.
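
For example, DAMON_tiering's 'Normalized' value is its average score
divided by that of Baseline, i.e., 1.6881 / 1.6165 =~ 1.0443.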

The periodic hot pages demotion and cold pages promotion that were
conducted to simulate a dynamic access pattern were started from the
beginning of the workload.  This resulted in the DRAM tier utilization
always staying under the watermark, and hence no real demotion happened
in any test run.  This means the above results show no difference between
LRU-based and DAMON-based demotions.  Only the difference between NUMAB-2
and DAMON-based promotions is represented in the results.

The Numab_tiering config degraded the performance by about 7.36 %.  We
suspect this happened because NUMAB-2's synchronous promotion was
blocking Taobench's real work progress.

The DAMON_tiering config improved the performance by about 4.43 %.  We
believe this shows the effectiveness of DAMON-based promotion, which
didn't block Taobench's real work progress due to its asynchronous
nature.  This also means DAMON's monitoring results are accurate enough
to provide a visible amount of improvement.

Evaluation Limitations
----------------------

As mentioned above, this evaluation shows only a comparison of promotion
mechanisms.  DAMON-based tiering is nonetheless recommended to be used
together with reclaim-based demotion, as a faster backup under
significant memory pressure.

From some perspective, the modified version of Taobench may seem to
distort the picture too much.  It would be better to evaluate with a more
realistic workload, or with more finely tuned micro benchmarks.

Patch Sequence
==============

The first patch (patch 1) implements the two new quota goal metrics in
the core layer and exposes them via the DAMON core kernel API.  The
second and third ones (patches 2 and 3) further link them to the DAMON
sysfs interface.  The three following patches (patches 4-6) document the
new feature and sysfs file in the design, usage, and ABI documents.  The
final one (patch 7) implements a working version of a self-tuned
DAMON-based memory tiering solution, in an incomplete but
easy-to-understand form, as a kernel module under the samples/damon/
directory.

References
==========

[1] https://lore.kernel.org/20231112195602.61525-1-sj@kernel.org/
[2] https://lore.kernel.org/20250303221726.484227-1-sj@kernel.org
[3] https://github.com/facebookresearch/DCPerf/blob/main/packages/tao_bench/README.md

SeongJae Park (7):
  mm/damon/core: introduce damos quota goal metrics for memory node
    utilization
  mm/damon/sysfs-schemes: implement file for quota goal nid parameter
  mm/damon/sysfs-schemes: connect damos_quota_goal nid with core layer
  Docs/mm/damon/design: document node_mem_{used,free}_bp
  Docs/admin-guide/mm/damon/usage: document 'nid' file
  Docs/ABI/damon: document nid file
  samples/damon: implement a DAMON module for memory tiering

 .../ABI/testing/sysfs-kernel-mm-damon         |   6 +
 Documentation/admin-guide/mm/damon/usage.rst  |  12 +-
 Documentation/mm/damon/design.rst             |  13 +-
 include/linux/damon.h                         |   6 +
 mm/damon/core.c                               |  27 +++
 mm/damon/sysfs-schemes.c                      |  40 ++++-
 samples/damon/Kconfig                         |  13 ++
 samples/damon/Makefile                        |   1 +
 samples/damon/mtier.c                         | 167 ++++++++++++++++++
 9 files changed, 274 insertions(+), 11 deletions(-)
 create mode 100644 samples/damon/mtier.c


base-commit: 449d17baba9648a901928d38eee56f914b39248e
-- 
2.39.5



* [PATCH 1/7] mm/damon/core: introduce damos quota goal metrics for memory node utilization
  2025-04-20 19:40 [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory SeongJae Park
@ 2025-04-20 19:40 ` SeongJae Park
  2025-04-20 19:40 ` [PATCH 2/7] mm/damon/sysfs-schemes: implement file for quota goal nid parameter SeongJae Park
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: SeongJae Park @ 2025-04-20 19:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: SeongJae Park, damon, kernel-team, linux-kernel, linux-mm

The used and free memory ratios of specific NUMA nodes can be useful
inputs for the aggressiveness self-tuning feedback loop of NUMA-specific
DAMOS schemes.  Implement DAMOS quota goal metrics for such self-tuned
schemes.
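
For example, on a node having 250 GiB of total memory of which 200 GiB is
used, the used and free memory ratio metrics would read 200 / 250 * 10000
= 8000 bp and 2000 bp, respectively.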

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 include/linux/damon.h    |  6 ++++++
 mm/damon/core.c          | 27 +++++++++++++++++++++++++++
 mm/damon/sysfs-schemes.c |  2 ++
 3 files changed, 35 insertions(+)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 47e36e6ea203..a4011726cb3b 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -145,6 +145,8 @@ enum damos_action {
  *
  * @DAMOS_QUOTA_USER_INPUT:	User-input value.
  * @DAMOS_QUOTA_SOME_MEM_PSI_US:	System level some memory PSI in us.
+ * @DAMOS_QUOTA_NODE_MEM_USED_BP:	MemUsed ratio of a node.
+ * @DAMOS_QUOTA_NODE_MEM_FREE_BP:	MemFree ratio of a node.
  * @NR_DAMOS_QUOTA_GOAL_METRICS:	Number of DAMOS quota goal metrics.
  *
  * Metrics equal to larger than @NR_DAMOS_QUOTA_GOAL_METRICS are unsupported.
@@ -152,6 +154,8 @@ enum damos_action {
 enum damos_quota_goal_metric {
 	DAMOS_QUOTA_USER_INPUT,
 	DAMOS_QUOTA_SOME_MEM_PSI_US,
+	DAMOS_QUOTA_NODE_MEM_USED_BP,
+	DAMOS_QUOTA_NODE_MEM_FREE_BP,
 	NR_DAMOS_QUOTA_GOAL_METRICS,
 };
 
@@ -161,6 +165,7 @@ enum damos_quota_goal_metric {
  * @target_value:	Target value of @metric to achieve with the tuning.
  * @current_value:	Current value of @metric.
  * @last_psi_total:	Last measured total PSI
+ * @nid:		Node id.
  * @list:		List head for siblings.
  *
  * Data structure for getting the current score of the quota tuning goal.  The
@@ -179,6 +184,7 @@ struct damos_quota_goal {
 	/* metric-dependent fields */
 	union {
 		u64 last_psi_total;
+		int nid;
 	};
 	struct list_head list;
 };
diff --git a/mm/damon/core.c b/mm/damon/core.c
index f0c1676f0599..587fb9a4fef8 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -1889,6 +1889,29 @@ static inline u64 damos_get_some_mem_psi_total(void)
 
 #endif	/* CONFIG_PSI */
 
+#ifdef CONFIG_NUMA
+static __kernel_ulong_t damos_get_node_mem_bp(
+		struct damos_quota_goal *goal)
+{
+	struct sysinfo i;
+	__kernel_ulong_t numerator;
+
+	si_meminfo_node(&i, goal->nid);
+	if (goal->metric == DAMOS_QUOTA_NODE_MEM_USED_BP)
+		numerator = i.totalram - i.freeram;
+	else	/* DAMOS_QUOTA_NODE_MEM_FREE_BP */
+		numerator = i.freeram;
+	return numerator * 10000 / i.totalram;
+}
+#else
+static __kernel_ulong_t damos_get_node_mem_bp(
+		struct damos_quota_goal *goal)
+{
+	return 0;
+}
+#endif
+
+
 static void damos_set_quota_goal_current_value(struct damos_quota_goal *goal)
 {
 	u64 now_psi_total;
@@ -1902,6 +1925,10 @@ static void damos_set_quota_goal_current_value(struct damos_quota_goal *goal)
 		goal->current_value = now_psi_total - goal->last_psi_total;
 		goal->last_psi_total = now_psi_total;
 		break;
+	case DAMOS_QUOTA_NODE_MEM_USED_BP:
+	case DAMOS_QUOTA_NODE_MEM_FREE_BP:
+		goal->current_value = damos_get_node_mem_bp(goal);
+		break;
 	default:
 		break;
 	}
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 23b562df0839..98108f082178 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -942,6 +942,8 @@ struct damos_sysfs_quota_goal {
 static const char * const damos_sysfs_quota_goal_metric_strs[] = {
 	"user_input",
 	"some_mem_psi_us",
+	"node_mem_used_bp",
+	"node_mem_free_bp",
 };
 
 static struct damos_sysfs_quota_goal *damos_sysfs_quota_goal_alloc(void)
-- 
2.39.5



* [PATCH 2/7] mm/damon/sysfs-schemes: implement file for quota goal nid parameter
  2025-04-20 19:40 [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory SeongJae Park
  2025-04-20 19:40 ` [PATCH 1/7] mm/damon/core: introduce damos quota goal metrics for memory node utilization SeongJae Park
@ 2025-04-20 19:40 ` SeongJae Park
  2025-04-20 19:40 ` [PATCH 3/7] mm/damon/sysfs-schemes: connect damos_quota_goal nid with core layer SeongJae Park
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: SeongJae Park @ 2025-04-20 19:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: SeongJae Park, damon, kernel-team, linux-kernel, linux-mm

The DAMOS_QUOTA_NODE_MEM_{USED,FREE}_BP quota goal metrics require a node
id parameter.  However, there is no DAMON user ABI for setting it.
Implement a DAMON sysfs file for that, named 'nid', under the quota goal
directory.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/damon/sysfs-schemes.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 98108f082178..7681ed293b62 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -936,6 +936,7 @@ struct damos_sysfs_quota_goal {
 	enum damos_quota_goal_metric metric;
 	unsigned long target_value;
 	unsigned long current_value;
+	int nid;
 };
 
 /* This should match with enum damos_action */
@@ -1016,6 +1017,28 @@ static ssize_t current_value_store(struct kobject *kobj,
 	return err ? err : count;
 }
 
+static ssize_t nid_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct damos_sysfs_quota_goal *goal = container_of(kobj, struct
+			damos_sysfs_quota_goal, kobj);
+
+	/* todo: return error if the goal is not using nid */
+
+	return sysfs_emit(buf, "%d\n", goal->nid);
+}
+
+static ssize_t nid_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct damos_sysfs_quota_goal *goal = container_of(kobj, struct
+			damos_sysfs_quota_goal, kobj);
+	int err = kstrtoint(buf, 0, &goal->nid);
+
+	/* feed callback should check existence of this file and read value */
+	return err ? err : count;
+}
+
 static void damos_sysfs_quota_goal_release(struct kobject *kobj)
 {
 	/* or, notify this release to the feed callback */
@@ -1031,10 +1054,14 @@ static struct kobj_attribute damos_sysfs_quota_goal_target_value_attr =
 static struct kobj_attribute damos_sysfs_quota_goal_current_value_attr =
 		__ATTR_RW_MODE(current_value, 0600);
 
+static struct kobj_attribute damos_sysfs_quota_goal_nid_attr =
+		__ATTR_RW_MODE(nid, 0600);
+
 static struct attribute *damos_sysfs_quota_goal_attrs[] = {
 	&damos_sysfs_quota_goal_target_metric_attr.attr,
 	&damos_sysfs_quota_goal_target_value_attr.attr,
 	&damos_sysfs_quota_goal_current_value_attr.attr,
+	&damos_sysfs_quota_goal_nid_attr.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(damos_sysfs_quota_goal);
-- 
2.39.5



* [PATCH 3/7] mm/damon/sysfs-schemes: connect damos_quota_goal nid with core layer
  2025-04-20 19:40 [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory SeongJae Park
  2025-04-20 19:40 ` [PATCH 1/7] mm/damon/core: introduce damos quota goal metrics for memory node utilization SeongJae Park
  2025-04-20 19:40 ` [PATCH 2/7] mm/damon/sysfs-schemes: implement file for quota goal nid parameter SeongJae Park
@ 2025-04-20 19:40 ` SeongJae Park
  2025-04-20 19:40 ` [PATCH 4/7] Docs/mm/damon/design: document node_mem_{used,free}_bp SeongJae Park
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: SeongJae Park @ 2025-04-20 19:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: SeongJae Park, damon, kernel-team, linux-kernel, linux-mm

The value of the DAMON sysfs interface file for the DAMOS quota goal's
node id is not passed to the core layer.  Implement the link.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 mm/damon/sysfs-schemes.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 7681ed293b62..729fe5f1ef30 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -2149,8 +2149,17 @@ static int damos_sysfs_add_quota_score(
 				sysfs_goal->target_value);
 		if (!goal)
 			return -ENOMEM;
-		if (sysfs_goal->metric == DAMOS_QUOTA_USER_INPUT)
+		switch (sysfs_goal->metric) {
+		case DAMOS_QUOTA_USER_INPUT:
 			goal->current_value = sysfs_goal->current_value;
+			break;
+		case DAMOS_QUOTA_NODE_MEM_USED_BP:
+		case DAMOS_QUOTA_NODE_MEM_FREE_BP:
+			goal->nid = sysfs_goal->nid;
+			break;
+		default:
+			break;
+		}
 		damos_add_quota_goal(quota, goal);
 	}
 	return 0;
-- 
2.39.5



* [PATCH 4/7] Docs/mm/damon/design: document node_mem_{used,free}_bp
  2025-04-20 19:40 [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory SeongJae Park
                   ` (2 preceding siblings ...)
  2025-04-20 19:40 ` [PATCH 3/7] mm/damon/sysfs-schemes: connect damos_quota_goal nid with core layer SeongJae Park
@ 2025-04-20 19:40 ` SeongJae Park
  2025-04-20 19:40 ` [PATCH 5/7] Docs/admin-guide/mm/damon/usage: document 'nid' file SeongJae Park
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: SeongJae Park @ 2025-04-20 19:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Jonathan Corbet, damon, kernel-team, linux-doc,
	linux-kernel, linux-mm

Add a description of the DAMOS quota goal metrics for NUMA node
utilization to the DAMON design document.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 Documentation/mm/damon/design.rst | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index f12d33749329..728bf5754343 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -550,10 +550,10 @@ aggressiveness (the quota) of the corresponding scheme.  For example, if DAMOS
 is under achieving the goal, DAMOS automatically increases the quota.  If DAMOS
 is over achieving the goal, it decreases the quota.
 
-The goal can be specified with three parameters, namely ``target_metric``,
-``target_value``, and ``current_value``.  The auto-tuning mechanism tries to
-make ``current_value`` of ``target_metric`` be same to ``target_value``.
-Currently, two ``target_metric`` are provided.
+The goal can be specified with four parameters, namely ``target_metric``,
+``target_value``, ``current_value`` and ``nid``.  The auto-tuning mechanism
+tries to make ``current_value`` of ``target_metric`` be same to
+``target_value``.
 
 - ``user_input``: User-provided value.  Users could use any metric that they
   has interest in for the value.  Use space main workload's latency or
@@ -565,6 +565,11 @@ Currently, two ``target_metric`` are provided.
   in microseconds that measured from last quota reset to next quota reset.
   DAMOS does the measurement on its own, so only ``target_value`` need to be
   set by users at the initial time.  In other words, DAMOS does self-feedback.
+- ``node_mem_used_bp``: Specific NUMA node's used memory ratio in bp (1/10,000).
+- ``node_mem_free_bp``: Specific NUMA node's free memory ratio in bp (1/10,000).
+
+``nid`` is required only for ``node_mem_used_bp`` and ``node_mem_free_bp``,
+to point to the specific NUMA node of interest.
 
 To know how user-space can set the tuning goal metric, the target value, and/or
 the current value via :ref:`DAMON sysfs interface <sysfs_interface>`, refer to
-- 
2.39.5



* [PATCH 5/7] Docs/admin-guide/mm/damon/usage: document 'nid' file
  2025-04-20 19:40 [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory SeongJae Park
                   ` (3 preceding siblings ...)
  2025-04-20 19:40 ` [PATCH 4/7] Docs/mm/damon/design: document node_mem_{used,free}_bp SeongJae Park
@ 2025-04-20 19:40 ` SeongJae Park
  2025-04-20 19:40 ` [PATCH 6/7] Docs/ABI/damon: document nid file SeongJae Park
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: SeongJae Park @ 2025-04-20 19:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, Jonathan Corbet, damon, kernel-team, linux-doc,
	linux-kernel, linux-mm

Add a description of the 'nid' file, which is optionally used for
specific DAMOS quota goal metrics such as node_mem_{used,free}_bp, to the
DAMON usage document.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 Documentation/admin-guide/mm/damon/usage.rst | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index ced2013db3df..d960aba72b82 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -81,7 +81,7 @@ comma (",").
     │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes
     │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
     │ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals
-    │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value
+    │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value,nid
     │ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low
     │ │ │ │ │ │ │ :ref:`{core_,ops_,}filters <sysfs_filters>`/nr_filters
     │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max
@@ -390,11 +390,11 @@ number (``N``) to the file creates the number of child directories named ``0``
 to ``N-1``.  Each directory represents each goal and current achievement.
 Among the multiple feedback, the best one is used.
 
-Each goal directory contains three files, namely ``target_metric``,
-``target_value`` and ``current_value``.  Users can set and get the three
-parameters for the quota auto-tuning goals that specified on the :ref:`design
-doc <damon_design_damos_quotas_auto_tuning>` by writing to and reading from each
-of the files.  Note that users should further write
+Each goal directory contains four files, namely ``target_metric``,
+``target_value``, ``current_value`` and ``nid``.  Users can set and get the
+four parameters for the quota auto-tuning goals that are specified in the
+:ref:`design doc <damon_design_damos_quotas_auto_tuning>` by writing to and
+reading from each of the files.  Note that users should further write
 ``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond
 directory <sysfs_kdamond>` to pass the feedback to DAMON.
 
-- 
2.39.5



* [PATCH 6/7] Docs/ABI/damon: document nid file
  2025-04-20 19:40 [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory SeongJae Park
                   ` (4 preceding siblings ...)
  2025-04-20 19:40 ` [PATCH 5/7] Docs/admin-guide/mm/damon/usage: document 'nid' file SeongJae Park
@ 2025-04-20 19:40 ` SeongJae Park
  2025-04-20 19:40 ` [PATCH 7/7] samples/damon: implement a DAMON module for memory tiering SeongJae Park
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: SeongJae Park @ 2025-04-20 19:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: SeongJae Park, damon, kernel-team, linux-kernel, linux-mm

Add a description of the 'nid' file, which is optionally used for
specific DAMOS quota goal metrics such as node_mem_{used,free}_bp, to the
DAMON sysfs ABI document.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 Documentation/ABI/testing/sysfs-kernel-mm-damon | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon
index 293197f180ad..5697ab154c1f 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-damon
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon
@@ -283,6 +283,12 @@ Contact:	SeongJae Park <sj@kernel.org>
 Description:	Writing to and reading from this file sets and gets the current
 		value of the goal metric.
 
+What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/nid
+Date:		Apr 2025
+Contact:	SeongJae Park <sj@kernel.org>
+Description:	Writing to and reading from this file sets and gets the nid
+		parameter of the goal.
+
 What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/weights/sz_permil
 Date:		Mar 2022
 Contact:	SeongJae Park <sj@kernel.org>
-- 
2.39.5



* [PATCH 7/7] samples/damon: implement a DAMON module for memory tiering
  2025-04-20 19:40 [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory SeongJae Park
                   ` (5 preceding siblings ...)
  2025-04-20 19:40 ` [PATCH 6/7] Docs/ABI/damon: document nid file SeongJae Park
@ 2025-04-20 19:40 ` SeongJae Park
  2025-04-20 19:47 ` [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory SeongJae Park
  2025-05-02  7:38 ` Yunjeong Mun
  8 siblings, 0 replies; 13+ messages in thread
From: SeongJae Park @ 2025-04-20 19:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: SeongJae Park, damon, kernel-team, linux-kernel, linux-mm

Implement a sample DAMON module that shows how self-tuned DAMON-based
memory tiering can be written.  It is a sample, since the purpose is to
give an idea about how it can be implemented and how it performs, rather
than to be used in general production setups.  Especially, it supports
only two-tier memory setups having only one CPU-attached NUMA node.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 samples/damon/Kconfig  |  13 ++++
 samples/damon/Makefile |   1 +
 samples/damon/mtier.c  | 167 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 181 insertions(+)
 create mode 100644 samples/damon/mtier.c

diff --git a/samples/damon/Kconfig b/samples/damon/Kconfig
index 564c49ed69a2..cbf96fd8a8bf 100644
--- a/samples/damon/Kconfig
+++ b/samples/damon/Kconfig
@@ -27,4 +27,17 @@ config SAMPLE_DAMON_PRCL
 
 	  If unsure, say N.
 
+config SAMPLE_DAMON_MTIER
+	bool "DAMON sample module for memory tiering"
+	depends on DAMON && DAMON_PADDR
+	help
+	  This builds the DAMON sample module for memory tiering.
+
+	  The module assumes the system is constructed with two NUMA nodes,
+	  which appear as local and remote nodes to all CPUs.  For example,
+	  node0 is for DDR5 DRAMs connected via DIMM, while node1 is for DDR4
+	  DRAMs connected via CXL.
+
+	  If unsure, say N.
+
 endmenu
diff --git a/samples/damon/Makefile b/samples/damon/Makefile
index 7f155143f237..72f68cbf422a 100644
--- a/samples/damon/Makefile
+++ b/samples/damon/Makefile
@@ -2,3 +2,4 @@
 
 obj-$(CONFIG_SAMPLE_DAMON_WSSE) += wsse.o
 obj-$(CONFIG_SAMPLE_DAMON_PRCL) += prcl.o
+obj-$(CONFIG_SAMPLE_DAMON_MTIER) += mtier.o
diff --git a/samples/damon/mtier.c b/samples/damon/mtier.c
new file mode 100644
index 000000000000..3390b7d30c26
--- /dev/null
+++ b/samples/damon/mtier.c
@@ -0,0 +1,167 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * memory tiering: migrate cold pages in node 0 and hot pages in node 1 to
+ * node 1 and node 0, respectively.  Adjust the hotness/coldness thresholds,
+ * aiming for a resulting 99.6 % node 0 utilization ratio.
+ */
+
+#define pr_fmt(fmt) "damon_sample_mtier: " fmt
+
+#include <linux/damon.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+static unsigned long node0_start_addr __read_mostly;
+module_param(node0_start_addr, ulong, 0600);
+
+static unsigned long node0_end_addr __read_mostly;
+module_param(node0_end_addr, ulong, 0600);
+
+static unsigned long node1_start_addr __read_mostly;
+module_param(node1_start_addr, ulong, 0600);
+
+static unsigned long node1_end_addr __read_mostly;
+module_param(node1_end_addr, ulong, 0600);
+
+static int damon_sample_mtier_enable_store(
+		const char *val, const struct kernel_param *kp);
+
+static const struct kernel_param_ops enable_param_ops = {
+	.set = damon_sample_mtier_enable_store,
+	.get = param_get_bool,
+};
+
+static bool enable __read_mostly;
+module_param_cb(enable, &enable_param_ops, &enable, 0600);
+MODULE_PARM_DESC(enable, "Enable or disable DAMON_SAMPLE_MTIER");
+
+static struct damon_ctx *ctxs[2];
+
+static struct damon_ctx *damon_sample_mtier_build_ctx(bool promote)
+{
+	struct damon_ctx *ctx;
+	struct damon_target *target;
+	struct damon_region *region;
+	struct damos *scheme;
+	struct damos_quota_goal *quota_goal;
+	struct damos_filter *filter;
+
+	ctx = damon_new_ctx();
+	if (!ctx)
+		return NULL;
+	/*
+	 * auto-tune sampling and aggregation intervals, aiming for a 4%
+	 * DAMON-observed access ratio, keeping sampling interval in [5ms, 10s].
+	 */
+	ctx->attrs.intervals_goal = (struct damon_intervals_goal) {
+		.access_bp = 400, .aggrs = 3,
+		.min_sample_us = 5000, .max_sample_us = 10000000,
+	};
+	if (damon_select_ops(ctx, DAMON_OPS_PADDR))
+		goto free_out;
+
+	target = damon_new_target();
+	if (!target)
+		goto free_out;
+	damon_add_target(ctx, target);
+	region = damon_new_region(
+			promote ? node1_start_addr : node0_start_addr,
+			promote ? node1_end_addr : node0_end_addr);
+	if (!region)
+		goto free_out;
+	damon_add_region(region, target);
+
+	scheme = damon_new_scheme(
+			/* access pattern */
+			&(struct damos_access_pattern) {
+				.min_sz_region = PAGE_SIZE,
+				.max_sz_region = ULONG_MAX,
+				.min_nr_accesses = promote ? 1 : 0,
+				.max_nr_accesses = promote ? UINT_MAX : 0,
+				.min_age_region = 0,
+				.max_age_region = UINT_MAX},
+			/* action */
+			promote ? DAMOS_MIGRATE_HOT : DAMOS_MIGRATE_COLD,
+			1000000,	/* apply interval (1s) */
+			&(struct damos_quota){
+				/* 200 MiB per second at most */
+				.reset_interval = 1000,
+				.sz = 200 * 1024 * 1024,
+				/* ignore size of region when prioritizing */
+				.weight_sz = 0,
+				.weight_nr_accesses = 100,
+				.weight_age = 100,
+			},
+			&(struct damos_watermarks){},
+			promote ? 0 : 1);	/* migrate target node id */
+	if (!scheme)
+		goto free_out;
+	damon_set_schemes(ctx, &scheme, 1);
+	quota_goal = damos_new_quota_goal(
+			promote ? DAMOS_QUOTA_NODE_MEM_USED_BP :
+			DAMOS_QUOTA_NODE_MEM_FREE_BP,
+			promote ? 9970 : 50);
+	if (!quota_goal)
+		goto free_out;
+	quota_goal->nid = 0;
+	damos_add_quota_goal(&scheme->quota, quota_goal);
+	filter = damos_new_filter(DAMOS_FILTER_TYPE_YOUNG, true, promote);
+	if (!filter)
+		goto free_out;
+	damos_add_filter(scheme, filter);
+	return ctx;
+free_out:
+	damon_destroy_ctx(ctx);
+	return NULL;
+}
+
+static int damon_sample_mtier_start(void)
+{
+	struct damon_ctx *ctx;
+
+	ctx = damon_sample_mtier_build_ctx(true);
+	if (!ctx)
+		return -ENOMEM;
+	ctxs[0] = ctx;
+	ctx = damon_sample_mtier_build_ctx(false);
+	if (!ctx) {
+		damon_destroy_ctx(ctxs[0]);
+		return -ENOMEM;
+	}
+	ctxs[1] = ctx;
+	return damon_start(ctxs, 2, true);
+}
+
+static void damon_sample_mtier_stop(void)
+{
+	damon_stop(ctxs, 2);
+	damon_destroy_ctx(ctxs[0]);
+	damon_destroy_ctx(ctxs[1]);
+}
+
+static int damon_sample_mtier_enable_store(
+		const char *val, const struct kernel_param *kp)
+{
+	bool enabled = enable;
+	int err;
+
+	err = kstrtobool(val, &enable);
+	if (err)
+		return err;
+
+	if (enable == enabled)
+		return 0;
+
+	if (enable)
+		return damon_sample_mtier_start();
+	damon_sample_mtier_stop();
+	return 0;
+}
+
+static int __init damon_sample_mtier_init(void)
+{
+	return 0;
+}
+
+module_init(damon_sample_mtier_init);
-- 
2.39.5



* Re: [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory
  2025-04-20 19:40 [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory SeongJae Park
                   ` (6 preceding siblings ...)
  2025-04-20 19:40 ` [PATCH 7/7] samples/damon: implement a DAMON module for memory tiering SeongJae Park
@ 2025-04-20 19:47 ` SeongJae Park
  2025-05-02  7:38 ` Yunjeong Mun
  8 siblings, 0 replies; 13+ messages in thread
From: SeongJae Park @ 2025-04-20 19:47 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Jonathan Corbet, damon, kernel-team, linux-doc,
	linux-kernel, linux-mm

On Sun, 20 Apr 2025 12:40:23 -0700 SeongJae Park <sj@kernel.org> wrote:

> Utilizing DAMON for memory tiering usually requires manual tuning and/or
> tedious controls.  Let it self-tune hotness and coldness thresholds for
> promotion and demotion aiming high utilization of high memory tiers, by
> introducing new DAMOS quota goal metrics representing the used and the
> free memory ratios of specific NUMA nodes.  And introduce a sample DAMON
> module that demonstrates how the new feature can be used for memory
> tiering use cases.
[...]
> References
> ==========
> 
> [1] https://lore.kernel.org/20231112195602.61525-1-sj@kernel.org/
> [2] https://lore.kernel.org/20250303221726.484227-1-sj@kernel.org
> [3] https://github.com/facebookresearch/DCPerf/blob/main/packages/tao_bench/README.md

Forgot to add the below, sorry.

Revision History
================

Changes from RFC
(https://lore.kernel.org/20250320053937.57734-1-sj@kernel.org)
- Wordsmith commit messages
- Add documentations

[...]



* Re: [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory
  2025-04-20 19:40 [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory SeongJae Park
                   ` (7 preceding siblings ...)
  2025-04-20 19:47 ` [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory SeongJae Park
@ 2025-05-02  7:38 ` Yunjeong Mun
  2025-05-02 15:49   ` SeongJae Park
  8 siblings, 1 reply; 13+ messages in thread
From: Yunjeong Mun @ 2025-05-02  7:38 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Jonathan Corbet, damon, kernel-team, linux-doc, linux-kernel,
	linux-mm, Andrew Morton, kernel_team

Hi SeongJae, thanks for your helpful auto-tuning patchset, which improves
the ease of use of DAMON on tiered memory systems.  I have tested the
demotion mechanism with a microbenchmark and would like to share the
results.

On Sun, 20 Apr 2025 12:40:23 -0700 SeongJae Park <sj@kernel.org> wrote:
[..snip..]

> Utilizing DAMON for memory tiering usually requires manual tuning and/ 
> Evaluation Limitations
> ----------------------
> 
> As mentioned above, this evaluation shows only comparison of promotion
> mechanisms.  DAMON-based tiering is recommended to be used together with
> reclaim-based demotion as a faster backup under significant memory
> pressure, though.
> 
> From some perspective, the modified version of Taobench may seem to
> distort the picture too much.  It would be better to evaluate with a more
> realistic workload, or with more finely tuned micro benchmarks.
> 

Hardware. 
- Node 0: 512GB DRAM
- Node 1: 0GB (memoryless)
- Node 2: 96GB CXL memory

Kernel
- RFC patchset on top of v6.14-rc7 
https://lore.kernel.org/damon/20250320053937.57734-1-sj@kernel.org/

Workload
- The microbenchmark creates hot and cold regions based on the specified
  parameters:
  $ ./hot_cold 1g 100g
It repetitively performs memset on the 1GB hot region, but performs memset
only once on the 100GB cold region.

DAMON setup
- My intention is to demote most of the cold memory regions from node 0 to
node 2.  So, damo is started with the below YAML configuration:
...
# damo v2.7.2 from https://git.kernel.org/pub/scm/linux/kernel/git/sj/damo.git/
   schemes:
   - action: migrate_cold
      target_nid: 2
...
      apply_interval_us: 0
      quotas:
        time_ms: 0 s
        sz_bytes: 0 GiB
        reset_interval_ms: 6 s
        goals:
        - metric: node_mem_free_bp 
          target_value: 99%
          nid: 0
          current_value: 1
        effective_sz_bytes: 0 B
...

Results
I've run the hot_cold benchmark for approximately 2 days, and have monitored 
the memory usage of each node as follows:

$ numastat -c -p hot_cold
Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Node 2 Node 3  Total
---------------  ------ ------ ------ ------ ------
2689746 (watch)       2      0      0      1      3
2690067 (hot_col 100122      0   3303      0 103426
3770656 (watch)       0      0      0      1      1
3770657 (sh)          2      0      0      0      2
---------------  ------ ------ ------ ------ ------
Total            100127      0   3303      1 103432

I expected that most of the cold data on node 0 would be demoted to node 2, but it wasn't.
In this situation, DAMON's variables are displayed as follows:

[2067202.863431] totalram 131938449 free 84504526 used 47433923 numerator 84504526
[2067202.863446] goal->current_value: 6404
[2067202.863452] score: 6468
[2067202.863455] quota->esz: 1844674407370955

`score` 6468 means the goal hasn't been achieved yet, and the
`quota->esz`, which specifies the aggressiveness of the demotion action,
has reached ULONG_MAX.  However, the demotion has not occurred.
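
(For reference, 1844674407370955 equals ULONG_MAX / 10000, so the
effective size quota indeed seems to have hit its computed maximum.)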

[..snip..]

I think there may be some errors or misunderstandings in my experiment.
I would be grateful for any insights or feedback you might have regarding
these results.

Best Regards,
Yunjeong




* Re: [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory
  2025-05-02  7:38 ` Yunjeong Mun
@ 2025-05-02 15:49   ` SeongJae Park
  2025-05-08  9:28     ` Yunjeong Mun
  0 siblings, 1 reply; 13+ messages in thread
From: SeongJae Park @ 2025-05-02 15:49 UTC (permalink / raw)
  To: Yunjeong Mun
  Cc: SeongJae Park, Jonathan Corbet, damon, kernel-team, linux-doc,
	linux-kernel, linux-mm, Andrew Morton, kernel_team

Hi Yunjeong,

On Fri,  2 May 2025 16:38:48 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:

> Hi SeongJae, thanks for your helpful auto-tuning patchset, which improves
> the ease of use of DAMON on tiered memory systems.  I have tested the
> demotion mechanism with a microbenchmark and would like to share the results.

Thank you for sharing your test result!

[...]
> Hardware. 
> - Node 0: 512GB DRAM
> - Node 1: 0GB (memoryless)
> - Node 2: 96GB CXL memory
> 
> Kernel
> - RFC patchset on top of v6.14-rc7 
> https://lore.kernel.org/damon/20250320053937.57734-1-sj@kernel.org/
> 
> Workload
> - Microbenchmark creates hot and cold regions based on the specified parameters.
>   $ ./hot_cold 1g 100g
> It repetitively performs memset on a 1GB hot region, but only performs memset
> once on a 100GB cold region. 
> 
> DAMON setup
> - My intention is to demote most of all regions of cold memory from node 0 to 
> node 2. So, damo start with below yaml configuration:
> ...
> # damo v2.7.2 from https://git.kernel.org/pub/scm/linux/kernel/git/sj/damo.git/
>    schemes:
>    - action: migrate_cold
>       target_nid: 2
> ...
>       apply_interval_us: 0
>       quotas:
>         time_ms: 0 s
>         sz_bytes: 0 GiB
>         reset_interval_ms: 6 s
>         goals:
>         - metric: node_mem_free_bp 
>           target_value: 99%
>           nid: 0
>           current_value: 1
>         effective_sz_bytes: 0 B
> ...

Sharing DAMON parameters you used can be helpful, thank you!  Can you further
share full parameters?  I'm especially interested in how the parameters for
monitoring targets and migrate_cold scheme's target access pattern, and if
there are other DAMON contexts or DAMOS schemes running together.

> 
> Results
> I've run the hot_cold benchmark for approximately 2 days, and have monitored 
> the memory usage of each node as follows:
> 
> $ numastat -c -p hot_cold
> Per-node process memory usage (in MBs)
> PID              Node 0 Node 1 Node 2 Node 3  Total
> ---------------  ------ ------ ------ ------ ------
> 2689746 (watch)       2      0      0      1      3
> 2690067 (hot_col 100122      0   3303      0 103426
> 3770656 (watch)       0      0      0      1      1
> 3770657 (sh)          2      0      0      0      2
> ---------------  ------ ------ ------ ------ ------
> Total            100127      0   3303      1 103432
> 
> I expected that most of cold data from node 0 would be demoted to node 2, but it isn't.
> In this situation, DAMON's variables are displayed as follows:
> 
> [2067202.863431] totalram 131938449 free 84504526 used 47433923 numerator 84504526
> [2067202.863446] goal->current_value: 6404
> [2067202.863452] score: 6468
> [2067202.863455] quota->esz: 1844674407370955
> 
> `score` 6468 means the goal hasn't been achieved yet, and the `quota->esz`, 
> which specifies the aggressiveness of the  demotion action, has reached 
> ULONG_MAX. However, the demotion has not occured.

Yes, as you interpret, it seems the auto-tuning is working as designed,
but the migration is not successfully happening.  I'm curious whether
migration is tried but failing.  DAMOS stats[1] may let us know that.
Can you check and share those?
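
For reference, below is a rough sketch for reading the stats via the
DAMON sysfs interface, assuming index-0 kdamond, context, and scheme
paths (the indices are for illustration):

    # echo update_schemes_stats > /sys/kernel/mm/damon/admin/kdamonds/0/state
    # cat /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/stats/nr_tried
    # cat /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/stats/sz_applied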

> 
> [..snip..]
> 
> I think there may be some errors or misunderstanding in my experiment.
> I would be grateful for any insights or feedback you might have regarding these
> results.

I don't have a clear idea at the moment, sorry.  It would be helpful if
you could share the things I asked about above.

Also, it seems you suspect the auto-tuning as one of the root causes.
I'm curious whether you tried some different tests (e.g., the same one
without the auto-tuning) and whether they gave you some theories.  If so,
could you please share those?

[1] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#statistics


Thanks,
SJ

[...]



* Re: [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory
  2025-05-02 15:49   ` SeongJae Park
@ 2025-05-08  9:28     ` Yunjeong Mun
  2025-05-08 16:35       ` SeongJae Park
  0 siblings, 1 reply; 13+ messages in thread
From: Yunjeong Mun @ 2025-05-08  9:28 UTC (permalink / raw)
  To: SeongJae Park
  Cc: honggyu.kim, Jonathan Corbet, damon, kernel-team, linux-doc,
	linux-kernel, linux-mm, Andrew Morton, kernel_team

Hi Seongjae, I'm sorry for the delayed response due to the holidays.

On Fri,  2 May 2025 08:49:49 -0700 SeongJae Park <sj@kernel.org> wrote:
> Hi Yunjeong,
> 
> On Fri,  2 May 2025 16:38:48 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:
> 
> > > Hi SeongJae, thanks for your helpful auto-tuning patchset, which improves
> > > the ease of use of DAMON on tiered memory systems.  I have tested the
> > > demotion mechanism with a microbenchmark and would like to share the results.
> 
> Thank you for sharing your test result!
> 
> [...]
> > Hardware. 
> > - Node 0: 512GB DRAM
> > - Node 1: 0GB (memoryless)
> > - Node 2: 96GB CXL memory
> > 
> > Kernel
> > - RFC patchset on top of v6.14-rc7 
> > https://lore.kernel.org/damon/20250320053937.57734-1-sj@kernel.org/
> > 
> > Workload
> > - Microbenchmark creates hot and cold regions based on the specified parameters.
> >   $ ./hot_cold 1g 100g
> > It repetitively performs memset on a 1GB hot region, but only performs memset
> > once on a 100GB cold region. 
> > 
> > DAMON setup
> > - My intention is to demote most of all regions of cold memory from node 0 to 
> > node 2. So, damo start with below yaml configuration:
> > ...
> > # damo v2.7.2 from https://git.kernel.org/pub/scm/linux/kernel/git/sj/damo.git/
> >    schemes:
> >    - action: migrate_cold
> >       target_nid: 2
> > ...
> >       apply_interval_us: 0
> >       quotas:
> >         time_ms: 0 s
> >         sz_bytes: 0 GiB
> >         reset_interval_ms: 6 s
> >         goals:
> >         - metric: node_mem_free_bp 
> >           target_value: 99%
> >           nid: 0
> >           current_value: 1
> >         effective_sz_bytes: 0 B
> > ...
> 
> Sharing DAMON parameters you used can be helpful, thank you!  Can you further
> share full parameters?  I'm especially interested in how the parameters for
> monitoring targets and migrate_cold scheme's target access pattern, and if
> there are other DAMON contexts or DAMOS schemes running together.
> 

Actually, I realized that the 'regions' field in my YAML configuration
was incorrect.  I had been using a configuration file that was created on
another server, not the testing server.  As a result, the scheme was
applied to the wrong region, causing the results to appear confusing.
I've fixed the issue and confirmed that the demotion occurred
successfully.  I'm sorry for any confusion this may have caused.

After fixing it up, Honggyu and I tested this patch again.  I would like
to share two issues: 1) slow start of the action, and 2) the action does
not stop even when the target is achieved.  Below are the test
configurations:

Hardware
- node 0: 64GB DRAM
- node 1: 0GB (memoryless)
- node 2: 96GB CXL memory

Kernel
- This patchset on top of v6.15-rc4

Workload: a microbenchmark that runs `mmap` and `memset` once over the
given size in GB
$ ./mmap 50

DAMON setup: just one context and one scheme.
    ...
    schemes:
    - action: migrate_cold
      target_nid: 2
      access_pattern:
        sz_bytes:
          min: 4.000 KiB
          max: max
        nr_accesses:
          min: 0 %
          max: 0 %
        age:
          min: 10 s
          max: max
      apply_interval_us: 0
      quotas:
        time_ms: 0 s
        sz_bytes: 0 GiB
        reset_interval_ms: 20 s
        goals:
        - metric: node_mem_free_bp
          target_value: 50%
          nid: 0
          current_value: 1
     ...

The two issues mentioned above are both caused by the calculation logic
of `quota->esz`, which both grows and shrinks only gradually.

Slow start: 50GB of data is allocated on node 0, and the demotion first
occurs after about 15 minutes.  This is because `quota->esz` grows slowly
even when the `current` is lower than the `target`.

Not stop: the `target` is to maintain 50% free space on node 0, which we
expect to be about 32GB.  However, it demoted more than intended,
maintaining about 90% free space as follows:

  Per-node process memory usage (in MBs)
  PID           Node 0 Node 1 Node 2 Total
  ------------  ------ ------ ------ -----
  1182 (watch)       2      0      0     2
  1198 (mmap)     7015      0  44187 51201
  ------------  ------ ------ ------ -----
  Total           7017      0  44187 51204

This is because the `esz` decreases slowly after achieving the `target`.
In the end, the demotion occurred more excessively than intended.

We believe that as the difference between `target` and `current` increases,
the `esz` should be raised more rapidly to increase the aggressiveness of the
action. In the current implementation, the `esz` remains low even when the
`current` is below the `target`, leading to the slow start issue. Also, there
is a not-stop issue where a high `esz` persists (decreasing only slowly) even
in an over-achieved state.
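
To illustrate the direction we have in mind, below is a rough sketch with a
hypothetical helper (made-up name and signature; not the current kernel logic
nor a proposed patch):

    /*
     * Hypothetical sketch: adjust the effective quota in proportion to
     * how far the goal's current value (`cur`) is from its target
     * (`tgt`), in both directions.
     */
    static unsigned long next_esz(unsigned long esz, unsigned long cur,
                    unsigned long tgt, unsigned long esz_max)
    {
            unsigned long step;

            if (cur < tgt) {        /* under-achieved: grow quickly */
                    step = esz * (tgt - cur) / tgt + 1;
                    return esz + step < esz_max ? esz + step : esz_max;
            }
            /* over-achieved: shrink just as aggressively */
            step = esz * (cur - tgt) / tgt + 1;
            return step < esz ? esz - step : 1;
    }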

> 
> Yes, as you interpret it, the auto-tuning seems to be working as designed,
> but migration is not successfully happening.  I'm curious if migration is
> tried but failing.  DAMOS stats[1] may let us know that.  Can you check and
> share those?
> 

Thank you for providing the DAMOS stats information. I will use it in future
analysis with DAMON. I would appreciate any feedback you might have on the new
results.

Best Regards,
Yunjeong

[..snip..]


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory
  2025-05-08  9:28     ` Yunjeong Mun
@ 2025-05-08 16:35       ` SeongJae Park
  0 siblings, 0 replies; 13+ messages in thread
From: SeongJae Park @ 2025-05-08 16:35 UTC (permalink / raw)
  To: Yunjeong Mun
  Cc: SeongJae Park, honggyu.kim, Jonathan Corbet, damon, kernel-team,
	linux-doc, linux-kernel, linux-mm, Andrew Morton, kernel_team

On Thu,  8 May 2025 18:28:27 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:

> Hi Seongjae, I'm sorry for the delayed response due to the holidays.

No worries, hope you had a good break :)

> 
> On Fri,  2 May 2025 08:49:49 -0700 SeongJae Park <sj@kernel.org> wrote:
> > Hi Yunjeong,
> > 
> > On Fri,  2 May 2025 16:38:48 +0900 Yunjeong Mun <yunjeong.mun@sk.com> wrote:
> > 
> > > Hi SeongJae, thanks for your helpful auto-tuning patchset, which improves
> > > the ease of use of DAMON on tiered memory systems. I have tested the demotion
> > > mechanism with a microbenchmark and would like to share the result.
> > 
> > Thank you for sharing your test result!
> > 
> > [...]
> > > Hardware
> > > - Node 0: 512GB DRAM
> > > - Node 1: 0GB (memoryless)
> > > - Node 2: 96GB CXL memory
> > > 
> > > Kernel
> > > - RFC patchset on top of v6.14-rc7 
> > > https://lore.kernel.org/damon/20250320053937.57734-1-sj@kernel.org/
> > > 
> > > Workload
> > > - The microbenchmark creates hot and cold regions based on the specified
> > >   parameters.
> > >   $ ./hot_cold 1g 100g
> > > It repetitively performs memset on a 1GB hot region, but performs memset
> > > only once on a 100GB cold region.
> > > 
> > > DAMON setup
> > > - My intention is to demote most of the cold memory regions from node 0 to
> > > node 2. So, I ran damo start with the below YAML configuration:
> > > ...
> > > # damo v2.7.2 from https://git.kernel.org/pub/scm/linux/kernel/git/sj/damo.git/
> > >    schemes:
> > >    - action: migrate_cold
> > >       target_nid: 2
> > > ...
> > >       apply_interval_us: 0
> > >       quotas:
> > >         time_ms: 0 s
> > >         sz_bytes: 0 GiB
> > >         reset_interval_ms: 6 s
> > >         goals:
> > >         - metric: node_mem_free_bp 
> > >           target_value: 99%
> > >           nid: 0
> > >           current_value: 1
> > >         effective_sz_bytes: 0 B
> > > ...
> > 
> > Sharing the DAMON parameters you used can be helpful, thank you!  Can you
> > further share the full parameters?  I'm especially interested in the
> > parameters for monitoring targets and the migrate_cold scheme's target
> > access pattern, and whether there are other DAMON contexts or DAMOS schemes
> > running together.
> > 
> 
> Actually, I realized that the 'regions' field in my YAML configuration is 
> incorrect. I've been using a configuration file that was created on another
> server, not the testing server.

To my understanding, you use the YAML configuration because the DAMON
user-space tool didn't provide a good interface for setting up multiple
kdamonds.  Starting from v2.7.5, the DAMON user-space tool supports setting up
multiple kdamonds from the command line, and it supports setting target
regions as NUMA nodes (--numa_node).  Using those might be a better option for
you.
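
For example, monitoring target regions covering NUMA node 0 could be set from
the command line along these lines (schematic only; the scheme options are
elided, and exact option names depend on the damo version):

    # damo start --numa_node 0 ...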

> As a result, the scheme was applied to the
> wrong region, causing the results to appear confusing. I've fixed the issue
> and confirmed that the demotion occurred successfully. I'm sorry for any
> confusion this may have caused.

Glad to hear that the issue is fixed.

> 
> After fixing it up, Honggyu and I tested this patchset again. I would like to
> share two issues: 1) the action starts slowly, and 2) the action does not stop
> even when the target is achieved. Below are the test configurations:
> 
> Hardware
> - node 0: 64GB DRAM
> - node 1: 0GB (memoryless)
> - node 2: 96GB CXL memory
> 
> Kernel
> - This patchset on top of v6.15-rc4
> 
> Workload: a microbenchmark that `mmap`s a region of the given size in GB and
> `memset`s it once
> $ ./mmap 50
> 
> DAMON setup: just one context and one scheme.
>     ...
>     schemes:
>     - action: migrate_cold
>       target_nid: 2
>       access_pattern:
>         sz_bytes:
>           min: 4.000 KiB
>           max: max
>         nr_accesses:
>           min: 0 %
>           max: 0 %
>         age:
>           min: 10 s
>           max: max
>       apply_interval_us: 0
>       quotas:
>         time_ms: 0 s
>         sz_bytes: 0 GiB
>         reset_interval_ms: 20 s
>         goals:
>         - metric: node_mem_free_bp
>           target_value: 50%
>           nid: 0
>           current_value: 1
>      ...
> 
> The two issues mentioned above are both caused by the calculation logic of
> `quota->esz`, which changes too gradually in both directions.
> 
> Slow start: 50GB of data is allocated on node 0, but the first demotion occurs
> only after about 15 minutes. This is because `quota->esz` grows slowly even
> when the `current` is lower than the `target`.

This is an intended design, to avoid taking unnecessary actions for merely
temporary access patterns.  For realistic workloads, which have their own time
scales, I think some delay is not a big problem.  I agree 15 minutes is too
long, though.  But the speed also depends on reset_interval_ms.  The quota
grows by up to 100% once per reset_interval_ms.  The minimum quota size is 1
byte, so it takes at least 12 reset_interval_ms for the size quota to reach a
single 4K page.  Because reset_interval_ms is 20 seconds in this setup, 12
reset_interval_ms is four minutes (240 seconds).
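
As a quick worked example of the arithmetic above (an illustrative snippet,
not the kernel code):

    #include <stdio.h>

    int main(void)
    {
            /* Start from the 1 byte minimum and grow by at most 100% once
             * per reset_interval_ms; reaching a single 4K page takes 12
             * doublings, since 2^12 == 4096. */
            unsigned long esz = 1;
            int intervals = 0;

            while (esz < 4096) {
                    esz *= 2;
                    intervals++;
            }
            /* with 20 second reset_interval_ms: "12 intervals -> 240 s" */
            printf("%d intervals -> %d s\n", intervals, intervals * 20);
            return 0;
    }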

My intended use of reset_interval_ms is setting it just long enough to reduce
unnecessary quota calculation overhead.  From my perspective, 20 seconds feels
too long.  Is there a reason to set it so long?  If there is no reason, I'd
recommend starting with a 1 second reset_interval_ms and adjusting it for your
setup if that doesn't work.

And I realize this would be better documented.  I will try to clarify this in
the documentation when I get time.  Please feel free to submit a patch if you
find time before I do :)

> 
> Not stopping: the `target` is to maintain 50% free space on node 0, which we
> expect to be about 32GB. However, it demoted more than intended, maintaining
> about 90% free space as follows:
> 
>   Per-node process memory usage (in MBs)
>   PID           Node 0 Node 1 Node 2 Total
>   ------------  ------ ------ ------ -----
>   1182 (watch)       2      0      0     2
>   1198 (mmap)     7015      0  44187 51201
>   ------------  ------ ------ ------ -----
>   Total           7017      0  44187 51204
> 
> This is because the `esz` decreases only slowly after the `target` is
> achieved. In the end, demotion occurred more excessively than intended.
> 
> We believe that as the difference between `target` and `current` increases,
> the `esz` should be raised more rapidly to increase the aggressiveness of the
> action. In the current implementation, the `esz` remains low even when the
> `current` is below the `target`, leading to the slow start issue. Also, there
> is a not-stop issue where a high `esz` persists (decreasing only slowly) even
> in an over-achieved state.

This is yet another intended design.  The aim-oriented quota auto-tuning
feature assumes there is an ideal amount of quota that fits the current
situation, and that the ideal amount can dynamically change.  For example,
proactively reclaiming cold memory while aiming at a modest level of memory
pressure.

For this case, I think you should have another scheme for promotion.  Please
refer to the design and example implementation of the sample module.  Or, do
you have a special reason to use only a demotion scheme, as in this setup?  If
so, please share it.

If you really need a feature that turns DAMOS on and off for a given
situation, DAMOS watermarks may be the right feature to look at.  You could
also override the tuned quota from user space.  So you could monitor the free
size of the given NUMA node and immediately set the tuned quota to zero, or
just remove the scheme.

Again, this might be due to the poor documentation.  Sorry about that, and
thank you for helping me find it.  I'll try to make the documentation better.

> 
> > 
> > Yes, as you interpret it, the auto-tuning seems to be working as designed,
> > but migration is not successfully happening.  I'm curious if migration is
> > tried but failing.  DAMOS stats[1] may let us know that.  Can you check and
> > share those?
> > 
> 
> Thank you for providing the DAMOS stats information.
> I will use it in future analysis with DAMON.

Maybe the easiest way to monitor it is
'damo report access --tried_regions_of X Y Z --style temperature-sz-hist'.

> I would appreciate any feedback you might have on the new results.

I hope my replies above help a bit, and I'm looking forward to hearing about
anything I missed, or your special reasons for your setup if you have any.


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-05-08 16:35 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-20 19:40 [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory SeongJae Park
2025-04-20 19:40 ` [PATCH 1/7] mm/damon/core: introduce damos quota goal metrics for memory node utilization SeongJae Park
2025-04-20 19:40 ` [PATCH 2/7] mm/damon/sysfs-schemes: implement file for quota goal nid parameter SeongJae Park
2025-04-20 19:40 ` [PATCH 3/7] mm/damon/sysfs-schemes: connect damos_quota_goal nid with core layer SeongJae Park
2025-04-20 19:40 ` [PATCH 4/7] Docs/mm/damon/design: document node_mem_{used,free}_bp SeongJae Park
2025-04-20 19:40 ` [PATCH 5/7] Docs/admin-guide/mm/damon/usage: document 'nid' file SeongJae Park
2025-04-20 19:40 ` [PATCH 6/7] Docs/ABI/damon: document nid file SeongJae Park
2025-04-20 19:40 ` [PATCH 7/7] samples/damon: implement a DAMON module for memory tiering SeongJae Park
2025-04-20 19:47 ` [PATCH 0/7] mm/damon: auto-tune DAMOS for NUMA setups including tiered memory SeongJae Park
2025-05-02  7:38 ` Yunjeong Mun
2025-05-02 15:49   ` SeongJae Park
2025-05-08  9:28     ` Yunjeong Mun
2025-05-08 16:35       ` SeongJae Park

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox