[PATCH v4 0/2] mm/damon: validate min_region

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 0/2] mm/damon: validate min_region_size to be power of 2
@ 2026-04-10  4:42 Liew Rui Yan
  2026-04-10  4:42 ` [PATCH v4 1/2] mm/damon/lru_sort: " Liew Rui Yan
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Liew Rui Yan @ 2026-04-10  4:42 UTC (permalink / raw)
  To: SeongJae Park; +Cc: Quanmin Yan, damon, linux-mm, Liew Rui Yan

Problem
=======
When a user sets an invalid 'addr_unit' (e.g., 3) via
DAMON_LRU_SORT/DAMON_RECLAIM, 'min_region_sz' becomes a non-power-of-2
value. This value eventually reaches damon_commit_ctx(), which does:

    dst->maybe_corrupted = true;
    if (!is_power_of_2(src->min_region_sz))
        return -EINVAL;

Although -EINVAL is returned, 'maybe_corrupted' is already set. The
running kdamond observers this flag and terminates unexpectedly.

"Unexpected termination" here means the kdamond exits without any user
request (e.g., not by writing 'N' to 'enabled').

User Impact
===========
Once kdamond terminates this way, it cannot be restarted via sysfs
because:

1. DAMON_LRU_SORT/DAMON_RECLAIM is built into the kernel, so it cannot
   be unloaded and reloaded at runtime.
2. Writing 'N' to 'enabled' fails because kdamond no longer exists;
   Writing 'Y' does nothing, as 'enabled' is already Y.

Solution
========
Add an early validation in damon_lru_sort_apply_parameters()
/damon_reclaim_apply_parameters() to check 'min_region_sz' before any
state change occurs. If it is non-power-of-2, return -EINVAL immediately,
preventing 'maybe_corrupted' from being set.

Patch 1 fixes the issue for DAMON_LRU_SORT.
Patch 2 fixes the issue for DAMON_RECLAIM.

Changes from v3
(https://lore.kernel.org/20260403052837.58063-1-aethernet65535@gmail.com)
- Improve commit message: clarify "unexpected termination".
- Add detailed User Impact with reason why kdamond cannot be restarted.
Changes from v2
(https://lore.kernel.org/20260402053756.26606-1-aethernet65535@gmail.com)
- Split the patch into two per-module patches.
- Add Fixes: and Cc: stable tags.
- Elaborate user impact and reproduction steps.
Changes from v1
(https://lore.kernel.org/20260331073231.30060-1-aethernet65535@gmail.com)
- Fix memory leak issue.
Changes from first attempt
(https://lore.kernel.org/20260327062627.66426-1-aethernet65535@gmail.com)
- Renamed the subject.
- Validate min_region_sz rather than addr_unit.

Liew Rui Yan (2):
  mm/damon/lru_sort: validate min_region_size to be power of 2
  mm/damon/reclaim: validate min_region_size to be power of 2

 mm/damon/lru_sort.c | 5 +++++
 mm/damon/reclaim.c  | 5 +++++
 2 files changed, 10 insertions(+)

-- 
2.53.0

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v4 1/2] mm/damon/lru_sort: validate min_region_size to be power of 2
  2026-04-10  4:42 [PATCH v4 0/2] mm/damon: validate min_region_size to be power of 2 Liew Rui Yan
@ 2026-04-10  4:42 ` Liew Rui Yan
  2026-04-10  9:40   ` (sashiko review) " Liew Rui Yan
  2026-04-10 13:56   ` SeongJae Park
  2026-04-10  4:42 ` [PATCH v4 2/2] mm/damon/reclaim: " Liew Rui Yan
  2026-04-10 14:05 ` [PATCH v4 0/2] mm/damon: " SeongJae Park
  2 siblings, 2 replies; 15+ messages in thread
From: Liew Rui Yan @ 2026-04-10  4:42 UTC (permalink / raw)
  To: SeongJae Park; +Cc: Quanmin Yan, damon, linux-mm, Liew Rui Yan, stable

Problem
=======
When a user sets an invalid 'addr_unit' (e.g., 3) via
DAMON_LRU_SORT, 'min_region_sz' becomes a non-power-of-2
value. This value eventually reaches damon_commit_ctx(), which does:

    dst->maybe_corrupted = true;
    if (!is_power_of_2(src->min_region_sz))
        return -EINVAL;

Although -EINVAL is returned, 'maybe_corrupted' is already set. The
running kdamond observers this flag and terminates unexpectedly.

"Unexpected termination" here means the kdamond exits without any user
request (e.g., not by writing 'N' to 'enabled').

User Impact
===========
Once kdamond terminates this way, it cannot be restarted via sysfs
because:

1. DAMON_LRU_SORT is built into the kernel, so it cannot be unloaded and
   reloaded at runtime.
2. Writing 'N' to 'enabled' fails because kdamond no longer exists;
   Writing 'Y' does nothing, as 'enabled' is already Y.

Reproduction
============
1. Enable DAMON_LRU_SORT
2. Set addr_unit=3
3. Commit inputs via 'commit_inputs'
4. Observe kdamond termination

Solution
========
Add an early validation in damon_lru_sort_apply_parameters() to check
'min_region_sz' before any state change occurs. If it is non-power-of-2,
return -EINVAL immediately, preventing 'maybe_corrupted' from being set.

Fixes: 2e0fe9245d6b ("mm/damon/lru_sort: support addr_unit for DAMON_LRU_SORT")
Cc: <stable@vger.kernel.org> # 6.18.x
Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com>
---
 mm/damon/lru_sort.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c
index 554559d72976..3fd176ef9d9c 100644
--- a/mm/damon/lru_sort.c
+++ b/mm/damon/lru_sort.c
@@ -294,6 +294,11 @@ static int damon_lru_sort_apply_parameters(void)
 	param_ctx->addr_unit = addr_unit;
 	param_ctx->min_region_sz = max(DAMON_MIN_REGION_SZ / addr_unit, 1);

+	if (!is_power_of_2(param_ctx->min_region_sz)) {
+		err = -EINVAL;
+		goto out;
+	}
+
 	if (!damon_lru_sort_mon_attrs.sample_interval) {
 		err = -EINVAL;
 		goto out;
-- 
2.53.0

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v4 2/2] mm/damon/reclaim: validate min_region_size to be power of 2
  2026-04-10  4:42 [PATCH v4 0/2] mm/damon: validate min_region_size to be power of 2 Liew Rui Yan
  2026-04-10  4:42 ` [PATCH v4 1/2] mm/damon/lru_sort: " Liew Rui Yan
@ 2026-04-10  4:42 ` Liew Rui Yan
  2026-04-10 10:08   ` (sashiko review) " Liew Rui Yan
  2026-04-10 13:57   ` SeongJae Park
  2026-04-10 14:05 ` [PATCH v4 0/2] mm/damon: " SeongJae Park
  2 siblings, 2 replies; 15+ messages in thread
From: Liew Rui Yan @ 2026-04-10  4:42 UTC (permalink / raw)
  To: SeongJae Park; +Cc: Quanmin Yan, damon, linux-mm, Liew Rui Yan, stable

Problem
=======
When a user sets an invalid 'addr_unit' (e.g., 3) via
DAMON_RECLAIM, 'min_region_sz' becomes a non-power-of-2
value. This value eventually reaches damon_commit_ctx(), which does:

    dst->maybe_corrupted = true;
    if (!is_power_of_2(src->min_region_sz))
        return -EINVAL;

Although -EINVAL is returned, 'maybe_corrupted' is already set. The
running kdamond observers this flag and terminates unexpectedly.

"Unexpected termination" here means the kdamond exits without any user
request (e.g., not by writing 'N' to 'enabled').

User Impact
===========
Once kdamond terminates this way, it cannot be restarted via sysfs
because:

1. DAMON_RECLAIM is built into the kernel, so it cannot be unloaded and
   reloaded at runtime.
2. Writing 'N' to 'enabled' fails because kdamond no longer exists;
   Writing 'Y' does nothing, as 'enabled' is already Y.

Reproduction
============
1. Enable DAMON_RECLAIM
2. Set addr_unit=3
3. Commit inputs via 'commit_inputs'
4. Observe kdamond termination

Solution
========
Add an early validation in damon_reclaim_apply_parameters() to check
'min_region_sz' before any state change occurs. If it is non-power-of-2,
return -EINVAL immediately, preventing 'maybe_corrupted' from being set.

Fixes: 7db551fcfb2a ("mm/damon/reclaim: support addr_unit for DAMON_RECLAIM")
Cc: <stable@vger.kernel.org> # 6.18.x
Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com>
---
 mm/damon/reclaim.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c
index 86da14778658..2747eef5919d 100644
--- a/mm/damon/reclaim.c
+++ b/mm/damon/reclaim.c
@@ -204,6 +204,11 @@ static int damon_reclaim_apply_parameters(void)
 	param_ctx->addr_unit = addr_unit;
 	param_ctx->min_region_sz = max(DAMON_MIN_REGION_SZ / addr_unit, 1);

+	if (!is_power_of_2(param_ctx->min_region_sz)) {
+		err = -EINVAL;
+		goto out;
+	}
+
 	if (!damon_reclaim_mon_attrs.aggr_interval) {
 		err = -EINVAL;
 		goto out;
-- 
2.53.0

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: (sashiko review) [PATCH v4 1/2] mm/damon/lru_sort: validate min_region_size to be power of 2
  2026-04-10  4:42 ` [PATCH v4 1/2] mm/damon/lru_sort: " Liew Rui Yan
@ 2026-04-10  9:40   ` Liew Rui Yan
  2026-04-10 13:55     ` SeongJae Park
  2026-04-10 13:56   ` SeongJae Park
  1 sibling, 1 reply; 15+ messages in thread
From: Liew Rui Yan @ 2026-04-10  9:40 UTC (permalink / raw)
  To: aethernet65535; +Cc: damon, linux-mm, sj

Sashiko found some issues.

Forwarding full sashiko review in a reply format with my 
inline comments below, for sharing details of my view and
doing followup discussions via mails if needed.

> # review url: https://sashiko.dev/#/patchset/20260410044259.95877-2-aethernet65535@gmail.com
> # start of sashiko.dev inline review
> commit bc048f9325d07815d621e80f3c40408c0210af75
> Author: Liew Rui Yan <aethernet65535@gmail.com>
> 
> mm/damon/lru_sort: validate min_region_size to be power of 2
> 
> This commit adds early validation for min_region_sz in
> damon_lru_sort_apply_parameters(). If the user configures an addr_unit that
> produces a non-power-of-2 min_region_sz, it immediately returns -EINVAL to
> avoid setting maybe_corrupted and unexpectedly terminating kdamond.
> 
> > diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c
> > index 554559d729760..3fd176ef9d9c1 100644
> > --- a/mm/damon/lru_sort.c
> > +++ b/mm/damon/lru_sort.c
> > @@ -294,6 +294,11 @@ static int damon_lru_sort_apply_parameters(void)
> >  	param_ctx->addr_unit = addr_unit;
> >  	param_ctx->min_region_sz = max(DAMON_MIN_REGION_SZ / addr_unit, 1);
> >  
> > +	if (!is_power_of_2(param_ctx->min_region_sz)) {
> > +		err = -EINVAL;
> > +		goto out;
> > +	}
> > +
> 
> Is the DAMON sysfs interface also susceptible to this regression?
> 
> When a user configures addr_unit via sysfs and writes 'on' to the state file,
> damon_sysfs_turn_damon_on() builds the context. The min_region_sz is
> calculated directly from addr_unit without a power-of-two check, and then
> damon_start() is called, bypassing the damon_commit_ctx() safeguard.
> 
> Since the DAMON core uses min_region_sz as an alignment mask in macros like
> ALIGN() and ALIGN_DOWN(), can this cause memory corruption or infinite loops
> in functions like damon_fill_regions_holes() if a non-power-of-2 value is
> provided via sysfs?

This issue has been discussed before [1]. While it's a valid concern,
it's outside the scope of this backport patch. SeongJae Park has
indicated he will work on this.

> 
> This patch works around damon_commit_ctx() failing due to !is_power_of_2()
> by performing the check early, but does damon_commit_ctx() still have a
> regression leading to unexpected termination if other allocations fail?

Yes, unexpected termination still exists.

This patch provides a minimal workaround for the specific case of
non-power-of-2 'min_region_sz', which is suitable for backport.

> 
> If damon_commit_ctx() fails midway, for example returning -ENOMEM from
> damon_commit_schemes(), it leaves dst->maybe_corrupted = true and bails
> out without rolling back the partial updates.
> 
> Even the DAMON sysfs interface, which attempts to mitigate this using a
> test_ctx dry-run, could experience a time-of-check to time-of-use race
> where the test allocation succeeds but the final commit allocation fails.
> 
> Since these modules operate under memory pressure, if an -ENOMEM occurs
> during parameter updates, will the running reclaim daemon permanently
> observe maybe_corrupted = true and unexpectedly terminate itself?

You're correct. If damon_commit_ctx() fails midway, it leaves
'maybe_corrupted=true', causing kdamond to terminate permanently. This
is a core error handling issue that needs to be fixed, but it's outside
the scope of this backport patch.

> 
> >  	if (!damon_lru_sort_mon_attrs.sample_interval) {
> >  		err = -EINVAL;
> >  		goto out;
> 
> 
> # end of sashiko.dev inline review
> # review url: https://sashiko.dev/#/patchset/20260410044259.95877-2-aethernet65535@gmail.com

# hkml [1] generated a draft of this mail.  You can regenerate
# this using below command:
#
#     hkml patch sashiko_dev --for_forwarding \
#             20260410044259.95877-2-aethernet65535@gmail.com
#
# [1] https://github.com/sjp38/hackermail

[1] https://lore.kernel.org/20260403155530.64647-1-sj@kernel.org

Best regards,
Rui Yan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: (sashiko review) [PATCH v4 2/2] mm/damon/reclaim: validate min_region_size to be power of 2
  2026-04-10  4:42 ` [PATCH v4 2/2] mm/damon/reclaim: " Liew Rui Yan
@ 2026-04-10 10:08   ` Liew Rui Yan
  2026-04-10 13:44     ` SeongJae Park
  2026-04-10 13:57   ` SeongJae Park
  1 sibling, 1 reply; 15+ messages in thread
From: Liew Rui Yan @ 2026-04-10 10:08 UTC (permalink / raw)
  To: aethernet65535; +Cc: damon, linux-mm, sj

Sashiko found some issues.

Forwarding sashiko review in a reply format with my inline
comments below, for details of my view and doing discussions
via mails if needed.

> # review url: https://sashiko.dev/#/patchset/20260410044259.95877-3-aethernet65535@gmail.com
> # start of sashiko.dev inline review
> commit 7a816b41f4b321b6a513227bef0e50f3bc1e86f1
> Author: Liew Rui Yan <aethernet65535@gmail.com>
> 
> mm/damon/reclaim: validate min_region_size to be power of 2
> 
> This commit adds an early validation in damon_reclaim_apply_parameters() to
> check if min_region_sz is a power of 2. It attempts to prevent kdamond from
> terminating unexpectedly by catching invalid values before they reach
> damon_commit_ctx().
> 
> > diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c
> > index 86da147786583..2747eef5919d1 100644
> > --- a/mm/damon/reclaim.c
> > +++ b/mm/damon/reclaim.c
> > @@ -204,6 +204,11 @@ static int damon_reclaim_apply_parameters(void)
> >  	param_ctx->addr_unit = addr_unit;
> >  	param_ctx->min_region_sz = max(DAMON_MIN_REGION_SZ / addr_unit, 1);
> 
> Is it possible for a concurrent sysfs write to change addr_unit between these
> two reads, allowing an invalid value to bypass the new validation?
> 
> The global variable addr_unit is read twice without READ_ONCE(), once to set
> param_ctx->addr_unit, and again to calculate param_ctx->min_region_sz.
> 
> If addr_unit is changed from a non-power-of-2 to a power-of-2 between the two
> reads, the validation below might succeed using the new value, while the
> invalid addr_unit is committed to the active context.

I believe this is possible, because the current call to
damon_reclaim_apply_parameters() is asynchronous and does not hold the
'kernel_param_lock' during execution, so this situation should be
possible.

However, I'm still unsure whether your suggested changes should be
included in this patch.

> 
> > +	if (!is_power_of_2(param_ctx->min_region_sz)) {
> > +		err = -EINVAL;
> > +		goto out;
> > +	}
> > +
> 
> Does this early check mask a structural issue in the core API?
> 
> Looking at damon_commit_ctx() in mm/damon/core.c:
> 
> int damon_commit_ctx(struct damon_ctx *dst, struct damon_ctx *src)
> {
>         ...
>         dst->maybe_corrupted = true;
>         if (!is_power_of_2(src->min_region_sz))
>                 return -EINVAL;
>         ...
> 
> If a different caller passes an invalid min_region_sz, it will still trigger
> this same trap.
> 
> Could the root cause be addressed by moving the maybe_corrupted flag update
> below the validation checks inside damon_commit_ctx()?
> 
> Are there other ways damon_commit_ctx() could leave the context corrupted?
> 
> If a user dynamically enables features that require memory allocation,
> damon_commit_ctx() calls damon_commit_schemes() which attempts to allocate
> memory via damos_commit_quota_goals().
> 
> If this allocation fails with -ENOMEM, it returns an error but leaves
> dst->maybe_corrupted = true on the active context, permanently terminating
> kdamond.
> 

I discussed this in my previous reply. But I realized there might have
been some misundertanding in my previous explanation. What I wanted to
say is that there might indeed be some issues with 'maybe_corrupted' in
damon_commit_ctx() that need adjustment, but that's not within the scope
of this patch.

> Can a concurrent sysfs write cause a divide-by-zero panic in this same
> function?
> 
> Looking further down in damon_reclaim_apply_parameters():
> 
>         if (quota_mem_pressure_us) {
>                 goal = damos_new_quota_goal(DAMOS_QUOTA_SOME_MEM_PSI_US,
>                                 quota_mem_pressure_us);
> 
> The global module parameter quota_mem_pressure_us is read twice without
> READ_ONCE().
> 
> If a concurrent write changes its value to 0 immediately after the if
> condition passes, a quota goal with a target_value of 0 is added.
> 
> When DAMON evaluates the quota score in damos_quota_score(), it uses the
> macro mult_frac(goal->current_value, 10000, goal->target_value). Dividing
> by a goal->target_value of 0 will cause a kernel panic.

I tried to reproduce this issue, but without success. 

However, based on the code, I believe the problem exists because I
didn't find any checks preventing 'goal->target_value=0'. Please let me
know if I've missed anything.

> 
> A similar sequence exists for quota_autotune_feedback.
> 
> >  	if (!damon_reclaim_mon_attrs.aggr_interval) {
> >  		err = -EINVAL;
> >  		goto out;
> >  	}
> 
> 
> # end of sashiko.dev inline review
> # review url: https://sashiko.dev/#/patchset/20260410044259.95877-3-aethernet65535@gmail.com

# hkml [1] generated a draft of this mail.  You can regenerate
# this using below command:
#
#     hkml patch sashiko_dev --for_forwarding \
#             20260410044259.95877-3-aethernet65535@gmail.com
#
# [1] https://github.com/sjp38/hackermail

Best regards,
Rui Yan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: (sashiko review) [PATCH v4 2/2] mm/damon/reclaim: validate min_region_size to be power of 2
  2026-04-10 10:08   ` (sashiko review) " Liew Rui Yan
@ 2026-04-10 13:44     ` SeongJae Park
  0 siblings, 0 replies; 15+ messages in thread
From: SeongJae Park @ 2026-04-10 13:44 UTC (permalink / raw)
  To: Liew Rui Yan; +Cc: SeongJae Park, damon, linux-mm

On Fri, 10 Apr 2026 18:08:22 +0800 Liew Rui Yan <aethernet65535@gmail.com> wrote:

> Sashiko found some issues.
> 
> Forwarding sashiko review in a reply format with my inline
> comments below, for details of my view and doing discussions
> via mails if needed.
> 
> > # review url: https://sashiko.dev/#/patchset/20260410044259.95877-3-aethernet65535@gmail.com
> > # start of sashiko.dev inline review
> > commit 7a816b41f4b321b6a513227bef0e50f3bc1e86f1
> > Author: Liew Rui Yan <aethernet65535@gmail.com>
> > 
> > mm/damon/reclaim: validate min_region_size to be power of 2
> > 
> > This commit adds an early validation in damon_reclaim_apply_parameters() to
> > check if min_region_sz is a power of 2. It attempts to prevent kdamond from
> > terminating unexpectedly by catching invalid values before they reach
> > damon_commit_ctx().
> > 
> > > diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c
> > > index 86da147786583..2747eef5919d1 100644
> > > --- a/mm/damon/reclaim.c
> > > +++ b/mm/damon/reclaim.c
> > > @@ -204,6 +204,11 @@ static int damon_reclaim_apply_parameters(void)
> > >  	param_ctx->addr_unit = addr_unit;
> > >  	param_ctx->min_region_sz = max(DAMON_MIN_REGION_SZ / addr_unit, 1);
> > 
> > Is it possible for a concurrent sysfs write to change addr_unit between these
> > two reads, allowing an invalid value to bypass the new validation?
> > 
> > The global variable addr_unit is read twice without READ_ONCE(), once to set
> > param_ctx->addr_unit, and again to calculate param_ctx->min_region_sz.
> > 
> > If addr_unit is changed from a non-power-of-2 to a power-of-2 between the two
> > reads, the validation below might succeed using the new value, while the
> > invalid addr_unit is committed to the active context.
> 
> I believe this is possible, because the current call to
> damon_reclaim_apply_parameters() is asynchronous and does not hold the
> 'kernel_param_lock' during execution, so this situation should be
> possible.
> 
> However, I'm still unsure whether your suggested changes should be
> included in this patch.

We already addressed this with by defining it as an expected behavior that the
user should avoid.  And the queued patch [2] from Liew will make commits
synchronous and fix this issue together for the mainline.  Let me know if I'm
missing something.

> 
> > 
> > > +	if (!is_power_of_2(param_ctx->min_region_sz)) {
> > > +		err = -EINVAL;
> > > +		goto out;
> > > +	}
> > > +
> > 
> > Does this early check mask a structural issue in the core API?
> > 
> > Looking at damon_commit_ctx() in mm/damon/core.c:
> > 
> > int damon_commit_ctx(struct damon_ctx *dst, struct damon_ctx *src)
> > {
> >         ...
> >         dst->maybe_corrupted = true;
> >         if (!is_power_of_2(src->min_region_sz))
> >                 return -EINVAL;
> >         ...
> > 
> > If a different caller passes an invalid min_region_sz, it will still trigger
> > this same trap.
> > 
> > Could the root cause be addressed by moving the maybe_corrupted flag update
> > below the validation checks inside damon_commit_ctx()?
> > 
> > Are there other ways damon_commit_ctx() could leave the context corrupted?
> > 
> > If a user dynamically enables features that require memory allocation,
> > damon_commit_ctx() calls damon_commit_schemes() which attempts to allocate
> > memory via damos_commit_quota_goals().
> > 
> > If this allocation fails with -ENOMEM, it returns an error but leaves
> > dst->maybe_corrupted = true on the active context, permanently terminating
> > kdamond.
> > 
> 
> I discussed this in my previous reply. But I realized there might have
> been some misundertanding in my previous explanation. What I wanted to
> say is that there might indeed be some issues with 'maybe_corrupted' in
> damon_commit_ctx() that need adjustment, but that's not within the scope
> of this patch.

I agree.

> 
> > Can a concurrent sysfs write cause a divide-by-zero panic in this same
> > function?
> > 
> > Looking further down in damon_reclaim_apply_parameters():
> > 
> >         if (quota_mem_pressure_us) {
> >                 goal = damos_new_quota_goal(DAMOS_QUOTA_SOME_MEM_PSI_US,
> >                                 quota_mem_pressure_us);
> > 
> > The global module parameter quota_mem_pressure_us is read twice without
> > READ_ONCE().
> > 
> > If a concurrent write changes its value to 0 immediately after the if
> > condition passes, a quota goal with a target_value of 0 is added.
> > 
> > When DAMON evaluates the quota score in damos_quota_score(), it uses the
> > macro mult_frac(goal->current_value, 10000, goal->target_value). Dividing
> > by a goal->target_value of 0 will cause a kernel panic.
> 
> I tried to reproduce this issue, but without success. 
> 
> However, based on the code, I believe the problem exists because I
> didn't find any checks preventing 'goal->target_value=0'. Please let me
> know if I've missed anything.

As I mentioned above, it is unallowed user inputs [1], and Liew's queued patch
[2] will fix this for mainline.

> 
> > 
> > A similar sequence exists for quota_autotune_feedback.
> > 
> > >  	if (!damon_reclaim_mon_attrs.aggr_interval) {
> > >  		err = -EINVAL;
> > >  		goto out;
> > >  	}
> > 
> > 
> > # end of sashiko.dev inline review
> > # review url: https://sashiko.dev/#/patchset/20260410044259.95877-3-aethernet65535@gmail.com
> 
> # hkml [1] generated a draft of this mail.  You can regenerate
> # this using below command:
> #
> #     hkml patch sashiko_dev --for_forwarding \
> #             20260410044259.95877-3-aethernet65535@gmail.com
> #
> # [1] https://github.com/sjp38/hackermail
> 
> Best regards,
> Rui Yan

[1] https://lkml.kernel.org/r/20260329153052.46657-2-sj@kernel.org
[2] https://lore.kernel.org/20260402155733.77050-4-sj@kernel.org


Thanks,
SJ


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: (sashiko review) [PATCH v4 1/2] mm/damon/lru_sort: validate min_region_size to be power of 2
  2026-04-10  9:40   ` (sashiko review) " Liew Rui Yan
@ 2026-04-10 13:55     ` SeongJae Park
  2026-04-10 16:46       ` Liew Rui Yan
  0 siblings, 1 reply; 15+ messages in thread
From: SeongJae Park @ 2026-04-10 13:55 UTC (permalink / raw)
  To: Liew Rui Yan; +Cc: SeongJae Park, damon, linux-mm

On Fri, 10 Apr 2026 17:40:04 +0800 Liew Rui Yan <aethernet65535@gmail.com> wrote:

> Sashiko found some issues.

Clarifying what is the following action for the sashiko finding would be
helpful.  For example, if you say "Sashiko found some issues, so I will post a
new versiuon soon", and if I'm busy, I can save my time by saying "Ok, then I
will skip or delaying review of this version and wait for your next review".

Seems you think Sashiko found issues are not worthy to post a new version,
though, so I'm commenting below.

> 
> Forwarding full sashiko review in a reply format with my 
> inline comments below, for sharing details of my view and
> doing followup discussions via mails if needed.
> 
> > # review url: https://sashiko.dev/#/patchset/20260410044259.95877-2-aethernet65535@gmail.com
> > # start of sashiko.dev inline review
> > commit bc048f9325d07815d621e80f3c40408c0210af75
> > Author: Liew Rui Yan <aethernet65535@gmail.com>
> > 
> > mm/damon/lru_sort: validate min_region_size to be power of 2
> > 
> > This commit adds early validation for min_region_sz in
> > damon_lru_sort_apply_parameters(). If the user configures an addr_unit that
> > produces a non-power-of-2 min_region_sz, it immediately returns -EINVAL to
> > avoid setting maybe_corrupted and unexpectedly terminating kdamond.
> > 
> > > diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c
> > > index 554559d729760..3fd176ef9d9c1 100644
> > > --- a/mm/damon/lru_sort.c
> > > +++ b/mm/damon/lru_sort.c
> > > @@ -294,6 +294,11 @@ static int damon_lru_sort_apply_parameters(void)
> > >  	param_ctx->addr_unit = addr_unit;
> > >  	param_ctx->min_region_sz = max(DAMON_MIN_REGION_SZ / addr_unit, 1);
> > >  
> > > +	if (!is_power_of_2(param_ctx->min_region_sz)) {
> > > +		err = -EINVAL;
> > > +		goto out;
> > > +	}
> > > +
> > 
> > Is the DAMON sysfs interface also susceptible to this regression?
> > 
> > When a user configures addr_unit via sysfs and writes 'on' to the state file,
> > damon_sysfs_turn_damon_on() builds the context. The min_region_sz is
> > calculated directly from addr_unit without a power-of-two check, and then
> > damon_start() is called, bypassing the damon_commit_ctx() safeguard.
> > 
> > Since the DAMON core uses min_region_sz as an alignment mask in macros like
> > ALIGN() and ALIGN_DOWN(), can this cause memory corruption or infinite loops
> > in functions like damon_fill_regions_holes() if a non-power-of-2 value is
> > provided via sysfs?
> 
> This issue has been discussed before [1]. While it's a valid concern,
> it's outside the scope of this backport patch. SeongJae Park has
> indicated he will work on this.

That's correct.  It's on my todo list.

> 
> > 
> > This patch works around damon_commit_ctx() failing due to !is_power_of_2()
> > by performing the check early, but does damon_commit_ctx() still have a
> > regression leading to unexpected termination if other allocations fail?
> 
> Yes, unexpected termination still exists.
> 
> This patch provides a minimal workaround for the specific case of
> non-power-of-2 'min_region_sz', which is suitable for backport.

I agree.

> 
> > 
> > If damon_commit_ctx() fails midway, for example returning -ENOMEM from
> > damon_commit_schemes(), it leaves dst->maybe_corrupted = true and bails
> > out without rolling back the partial updates.
> > 
> > Even the DAMON sysfs interface, which attempts to mitigate this using a
> > test_ctx dry-run, could experience a time-of-check to time-of-use race
> > where the test allocation succeeds but the final commit allocation fails.
> > 
> > Since these modules operate under memory pressure, if an -ENOMEM occurs
> > during parameter updates, will the running reclaim daemon permanently
> > observe maybe_corrupted = true and unexpectedly terminate itself?
> 
> You're correct. If damon_commit_ctx() fails midway, it leaves
> 'maybe_corrupted=true', causing kdamond to terminate permanently. This
> is a core error handling issue that needs to be fixed, but it's outside
> the scope of this backport patch.

Agreed.  This was unclear to me in previous disucssions, though.  I still agree
it is out of the scope of this patch.  But now I think we need to let users
force-restart.  Adding this to my todo list.

> 
> > 
> > >  	if (!damon_lru_sort_mon_attrs.sample_interval) {
> > >  		err = -EINVAL;
> > >  		goto out;
> > 
> > 
> > # end of sashiko.dev inline review
> > # review url: https://sashiko.dev/#/patchset/20260410044259.95877-2-aethernet65535@gmail.com


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 1/2] mm/damon/lru_sort: validate min_region_size to be power of 2
  2026-04-10  4:42 ` [PATCH v4 1/2] mm/damon/lru_sort: " Liew Rui Yan
  2026-04-10  9:40   ` (sashiko review) " Liew Rui Yan
@ 2026-04-10 13:56   ` SeongJae Park
  1 sibling, 0 replies; 15+ messages in thread
From: SeongJae Park @ 2026-04-10 13:56 UTC (permalink / raw)
  To: Liew Rui Yan; +Cc: SeongJae Park, Quanmin Yan, damon, linux-mm, stable

On Fri, 10 Apr 2026 12:42:58 +0800 Liew Rui Yan <aethernet65535@gmail.com> wrote:

> Problem
> =======
> When a user sets an invalid 'addr_unit' (e.g., 3) via
> DAMON_LRU_SORT, 'min_region_sz' becomes a non-power-of-2
> value. This value eventually reaches damon_commit_ctx(), which does:
> 
>     dst->maybe_corrupted = true;
>     if (!is_power_of_2(src->min_region_sz))
>         return -EINVAL;
> 
> Although -EINVAL is returned, 'maybe_corrupted' is already set. The
> running kdamond observers this flag and terminates unexpectedly.
> 
> "Unexpected termination" here means the kdamond exits without any user
> request (e.g., not by writing 'N' to 'enabled').
> 
> User Impact
> ===========
> Once kdamond terminates this way, it cannot be restarted via sysfs
> because:
> 
> 1. DAMON_LRU_SORT is built into the kernel, so it cannot be unloaded and
>    reloaded at runtime.
> 2. Writing 'N' to 'enabled' fails because kdamond no longer exists;
>    Writing 'Y' does nothing, as 'enabled' is already Y.
> 
> Reproduction
> ============
> 1. Enable DAMON_LRU_SORT
> 2. Set addr_unit=3
> 3. Commit inputs via 'commit_inputs'
> 4. Observe kdamond termination
> 
> Solution
> ========
> Add an early validation in damon_lru_sort_apply_parameters() to check
> 'min_region_sz' before any state change occurs. If it is non-power-of-2,
> return -EINVAL immediately, preventing 'maybe_corrupted' from being set.
> 
> Fixes: 2e0fe9245d6b ("mm/damon/lru_sort: support addr_unit for DAMON_LRU_SORT")
> Cc: <stable@vger.kernel.org> # 6.18.x
> Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com>

Reviewed-by: SeongJae Park <sj@kernel.org>


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 2/2] mm/damon/reclaim: validate min_region_size to be power of 2
  2026-04-10  4:42 ` [PATCH v4 2/2] mm/damon/reclaim: " Liew Rui Yan
  2026-04-10 10:08   ` (sashiko review) " Liew Rui Yan
@ 2026-04-10 13:57   ` SeongJae Park
  1 sibling, 0 replies; 15+ messages in thread
From: SeongJae Park @ 2026-04-10 13:57 UTC (permalink / raw)
  To: Liew Rui Yan; +Cc: SeongJae Park, Quanmin Yan, damon, linux-mm, stable

On Fri, 10 Apr 2026 12:42:59 +0800 Liew Rui Yan <aethernet65535@gmail.com> wrote:

> Problem
> =======
> When a user sets an invalid 'addr_unit' (e.g., 3) via
> DAMON_RECLAIM, 'min_region_sz' becomes a non-power-of-2
> value. This value eventually reaches damon_commit_ctx(), which does:
> 
>     dst->maybe_corrupted = true;
>     if (!is_power_of_2(src->min_region_sz))
>         return -EINVAL;
> 
> Although -EINVAL is returned, 'maybe_corrupted' is already set. The
> running kdamond observers this flag and terminates unexpectedly.
> 
> "Unexpected termination" here means the kdamond exits without any user
> request (e.g., not by writing 'N' to 'enabled').
> 
> User Impact
> ===========
> Once kdamond terminates this way, it cannot be restarted via sysfs
> because:
> 
> 1. DAMON_RECLAIM is built into the kernel, so it cannot be unloaded and
>    reloaded at runtime.
> 2. Writing 'N' to 'enabled' fails because kdamond no longer exists;
>    Writing 'Y' does nothing, as 'enabled' is already Y.
> 
> Reproduction
> ============
> 1. Enable DAMON_RECLAIM
> 2. Set addr_unit=3
> 3. Commit inputs via 'commit_inputs'
> 4. Observe kdamond termination
> 
> Solution
> ========
> Add an early validation in damon_reclaim_apply_parameters() to check
> 'min_region_sz' before any state change occurs. If it is non-power-of-2,
> return -EINVAL immediately, preventing 'maybe_corrupted' from being set.
> 
> Fixes: 7db551fcfb2a ("mm/damon/reclaim: support addr_unit for DAMON_RECLAIM")
> Cc: <stable@vger.kernel.org> # 6.18.x
> Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com>

Reviewed-by: SeongJae Park <sj@kernel.org>


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 0/2] mm/damon: validate min_region_size to be power of 2
  2026-04-10  4:42 [PATCH v4 0/2] mm/damon: validate min_region_size to be power of 2 Liew Rui Yan
  2026-04-10  4:42 ` [PATCH v4 1/2] mm/damon/lru_sort: " Liew Rui Yan
  2026-04-10  4:42 ` [PATCH v4 2/2] mm/damon/reclaim: " Liew Rui Yan
@ 2026-04-10 14:05 ` SeongJae Park
  2 siblings, 0 replies; 15+ messages in thread
From: SeongJae Park @ 2026-04-10 14:05 UTC (permalink / raw)
  To: Liew Rui Yan; +Cc: SeongJae Park, Quanmin Yan, damon, linux-mm, Andrew Morton

On Fri, 10 Apr 2026 12:42:57 +0800 Liew Rui Yan <aethernet65535@gmail.com> wrote:

> Problem
> =======
> When a user sets an invalid 'addr_unit' (e.g., 3) via
> DAMON_LRU_SORT/DAMON_RECLAIM, 'min_region_sz' becomes a non-power-of-2
> value. This value eventually reaches damon_commit_ctx(), which does:
> 
>     dst->maybe_corrupted = true;
>     if (!is_power_of_2(src->min_region_sz))
>         return -EINVAL;
> 
> Although -EINVAL is returned, 'maybe_corrupted' is already set. The
> running kdamond observers this flag and terminates unexpectedly.
> 
> "Unexpected termination" here means the kdamond exits without any user
> request (e.g., not by writing 'N' to 'enabled').
> 
> User Impact
> ===========
> Once kdamond terminates this way, it cannot be restarted via sysfs
> because:
> 
> 1. DAMON_LRU_SORT/DAMON_RECLAIM is built into the kernel, so it cannot
>    be unloaded and reloaded at runtime.
> 2. Writing 'N' to 'enabled' fails because kdamond no longer exists;
>    Writing 'Y' does nothing, as 'enabled' is already Y.
> 
> Solution
> ========
> Add an early validation in damon_lru_sort_apply_parameters()
> /damon_reclaim_apply_parameters() to check 'min_region_sz' before any
> state change occurs. If it is non-power-of-2, return -EINVAL immediately,
> preventing 'maybe_corrupted' from being set.
> 
> Patch 1 fixes the issue for DAMON_LRU_SORT.
> Patch 2 fixes the issue for DAMON_RECLAIM.
> 
> Changes from v3
> (https://lore.kernel.org/20260403052837.58063-1-aethernet65535@gmail.com)
> - Improve commit message: clarify "unexpected termination".
> - Add detailed User Impact with reason why kdamond cannot be restarted.

Thank you for making the change.  Looks good to me.  I gave my 'Reviewed-by:'
for the two patches as replies.

Andrew, this series is for hotfixes that deserve Cc-ing stable@.  Could you
please pick this up into mm.git?  I don't think these are super-urgent and I
have some capacities to manage this myself, though.  So if you prefer to, I can
stash this in my tree for now and repost after next rc1.


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: (sashiko review) [PATCH v4 1/2] mm/damon/lru_sort: validate min_region_size to be power of 2
  2026-04-10 13:55     ` SeongJae Park
@ 2026-04-10 16:46       ` Liew Rui Yan
  2026-04-10 17:00         ` SeongJae Park
  0 siblings, 1 reply; 15+ messages in thread
From: Liew Rui Yan @ 2026-04-10 16:46 UTC (permalink / raw)
  To: sj; +Cc: aethernet65535, damon, linux-mm

On Fri, 10 Apr 2026 06:55:00 -0700 SeongJae Park <sj@kernel.org> wrote:

> On Fri, 10 Apr 2026 17:40:04 +0800 Liew Rui Yan <aethernet65535@gmail.com> wrote:
> 
> > Sashiko found some issues.
> 
> Clarifying what is the following action for the sashiko finding would be
> helpful.  For example, if you say "Sashiko found some issues, so I will post a
> new versiuon soon", and if I'm busy, I can save my time by saying "Ok, then I
> will skip or delaying review of this version and wait for your next review".

Okay, I will keep that in mind next time I foward a Sashiko's Review.
Thank you for you suggestion.

> [...]
> 
> Agreed.  This was unclear to me in previous disucssions, though.  I still agree
> it is out of the scope of this patch.  But now I think we need to let users
> force-restart.  Adding this to my todo list.

Just to make sure - is this the same issue that my recent RFC patch [1]
aims to address? I want to make sure we're not duplicating efforts.

I'm still actively working on that patch, and I plan to send the next
version next week. I've been holding off because I didn't want to send
multiple patches in parallel.

[1] https://lore.kernel.org/20260330164347.12772-1-aethernet65535@gmail.com

Best regards,
Rui Yan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: (sashiko review) [PATCH v4 1/2] mm/damon/lru_sort: validate min_region_size to be power of 2
  2026-04-10 16:46       ` Liew Rui Yan
@ 2026-04-10 17:00         ` SeongJae Park
  2026-04-10 23:24           ` SeongJae Park
  2026-04-11  0:04           ` Liew Rui Yan
  0 siblings, 2 replies; 15+ messages in thread
From: SeongJae Park @ 2026-04-10 17:00 UTC (permalink / raw)
  To: Liew Rui Yan; +Cc: SeongJae Park, damon, linux-mm

On Sat, 11 Apr 2026 00:46:10 +0800 Liew Rui Yan <aethernet65535@gmail.com> wrote:

> On Fri, 10 Apr 2026 06:55:00 -0700 SeongJae Park <sj@kernel.org> wrote:
> 
> > On Fri, 10 Apr 2026 17:40:04 +0800 Liew Rui Yan <aethernet65535@gmail.com> wrote:
> > 
> > > Sashiko found some issues.
> > 
> > Clarifying what is the following action for the sashiko finding would be
> > helpful.  For example, if you say "Sashiko found some issues, so I will post a
> > new versiuon soon", and if I'm busy, I can save my time by saying "Ok, then I
> > will skip or delaying review of this version and wait for your next review".
> 
> Okay, I will keep that in mind next time I foward a Sashiko's Review.
> Thank you for you suggestion.
> 
> > [...]
> > 
> > Agreed.  This was unclear to me in previous disucssions, though.  I still agree
> > it is out of the scope of this patch.  But now I think we need to let users
> > force-restart.  Adding this to my todo list.
> 
> Just to make sure - is this the same issue that my recent RFC patch [1]
> aims to address? I want to make sure we're not duplicating efforts.

Hmm, this makes me confused about how we ended up working on this series, then.

> 
> I'm still actively working on that patch, and I plan to send the next
> version next week. I've been holding off because I didn't want to send
> multiple patches in parallel.

Ok, seems I dropped a ball.  I was working like AI bot that only works with
limited and nearly fresh context for each mail that on my inbox.  I will work
on making another thing for tracking this kind of parallel works with good
context.  But since I already dropped the ball for this, I'd like to make sure
we are on the same page.  So I'd suggest below.

1. Let's hold this patch series.  Andrew, please don't merge this for now until
   this discussion is completed.
2. Please summarize your parallel works in progress with the context about how
   you decided to do that in the way, with summaries of our previous
   discussions.

Could you please do those, Liew?

> 
> [1] https://lore.kernel.org/20260330164347.12772-1-aethernet65535@gmail.com


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: (sashiko review) [PATCH v4 1/2] mm/damon/lru_sort: validate min_region_size to be power of 2
  2026-04-10 17:00         ` SeongJae Park
@ 2026-04-10 23:24           ` SeongJae Park
  2026-04-11  0:04           ` Liew Rui Yan
  1 sibling, 0 replies; 15+ messages in thread
From: SeongJae Park @ 2026-04-10 23:24 UTC (permalink / raw)
  To: SeongJae Park; +Cc: Liew Rui Yan, damon, linux-mm, Andrew Morton

+ Andrew

On Fri, 10 Apr 2026 10:00:50 -0700 SeongJae Park <sj@kernel.org> wrote:
[...]
> 1. Let's hold this patch series.  Andrew, please don't merge this for now until
>    this discussion is completed.

For this.


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: (sashiko review) [PATCH v4 1/2] mm/damon/lru_sort: validate min_region_size to be power of 2
  2026-04-10 17:00         ` SeongJae Park
  2026-04-10 23:24           ` SeongJae Park
@ 2026-04-11  0:04           ` Liew Rui Yan
  2026-04-11 15:38             ` SeongJae Park
  1 sibling, 1 reply; 15+ messages in thread
From: Liew Rui Yan @ 2026-04-11  0:04 UTC (permalink / raw)
  To: sj; +Cc: aethernet65535, damon, linux-mm

On Fri, 10 Apr 2026 10:00:50 -0700 SeongJae Park <sj@kernel.org> wrote:

> On Sat, 11 Apr 2026 00:46:10 +0800 Liew Rui Yan <aethernet65535@gmail.com> wrote:
> 
> > On Fri, 10 Apr 2026 06:55:00 -0700 SeongJae Park <sj@kernel.org> wrote:
> > 
> > > On Fri, 10 Apr 2026 17:40:04 +0800 Liew Rui Yan <aethernet65535@gmail.com> wrote:
> > > 
> > > [...]
> > > 
> > > Agreed.  This was unclear to me in previous disucssions, though.  I still agree
> > > it is out of the scope of this patch.  But now I think we need to let users
> > > force-restart.  Adding this to my todo list.
> > 
> > Just to make sure - is this the same issue that my recent RFC patch [1]
> > aims to address? I want to make sure we're not duplicating efforts.
> 
> Hmm, this makes me confused about how we ended up working on this series, then.
> 
> > 
> > I'm still actively working on that patch, and I plan to send the next
> > version next week. I've been holding off because I didn't want to send
> > multiple patches in parallel.
> 
> Ok, seems I dropped a ball.  I was working like AI bot that only works with
> limited and nearly fresh context for each mail that on my inbox.  I will work
> on making another thing for tracking this kind of parallel works with good
> context.  But since I already dropped the ball for this, I'd like to make sure
> we are on the same page.  So I'd suggest below.
> 
> 1. Let's hold this patch series.  Andrew, please don't merge this for now until
>    this discussion is completed.
> 2. Please summarize your parallel works in progress with the context about how
>    you decided to do that in the way, with summaries of our previous
>    discussions.
> 
> Could you please do those, Liew?

TL;DR - There is no dependency between my parallel (2) works. I never
intended to merge these two works into a single series because they
address the problem at different level.

Yes, here is the summary of my parallel works and how I decided to
structure them.

My current parallel works
=========================
Series A - min_region_sz power-of-2 validation (this patch)
- Status: You gave Reviewed-by, asked Andrew to merge.
- Scope: Small, standalone fix for specific issue.
- Dependency: None

Series B - reset parameters (enabled/kdamond_pid) on unexpected
termination (RFC [1])
- Status: Preparing V2.
- Scope: Reset 'enabled' and 'kdamond_pid' when kdamond terminated
  unexpectedly.
- Dependency: None

Summary of Series B
===================
The design evolved through our discussion:
1. Extended 'struct damon_ctx' with 'thread_status' pointers
   - SJ pointed out: "This feels too much extension of core API for a
     problem that can more simply be fixed."

2. Alternative proposals discussed:
   - Option 1: Termination callback in core API (Too heavy for backport)
   - Option 2: Override '.get' operator for parameters (code duplication)
   - Option 3: On-demamd correction in enabled_store() (passive)

3. Final direction comfirmed:
   - SJ suggested: "Can't we catch damon_commit_ctx() failure from the
     calling place?"
   - We agreed: Simple, backportable fix is the priority.
   - Decision: Reset 'enabled' and 'kdamond_pid' immediately when
     damon_commit_ctx() fails, add a fallback in the damon_turn() 'N'
     path.

[1] https://lore.kernel.org/20260330164347.12772-1-aethernet65535@gmail.com

Best regards,
Rui Yan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: (sashiko review) [PATCH v4 1/2] mm/damon/lru_sort: validate min_region_size to be power of 2
  2026-04-11  0:04           ` Liew Rui Yan
@ 2026-04-11 15:38             ` SeongJae Park
  0 siblings, 0 replies; 15+ messages in thread
From: SeongJae Park @ 2026-04-11 15:38 UTC (permalink / raw)
  To: Liew Rui Yan; +Cc: SeongJae Park, damon, linux-mm

On Sat, 11 Apr 2026 08:04:58 +0800 Liew Rui Yan <aethernet65535@gmail.com> wrote:

> On Fri, 10 Apr 2026 10:00:50 -0700 SeongJae Park <sj@kernel.org> wrote:
> 
> > On Sat, 11 Apr 2026 00:46:10 +0800 Liew Rui Yan <aethernet65535@gmail.com> wrote:
> > 
> > > On Fri, 10 Apr 2026 06:55:00 -0700 SeongJae Park <sj@kernel.org> wrote:
> > > 
> > > > On Fri, 10 Apr 2026 17:40:04 +0800 Liew Rui Yan <aethernet65535@gmail.com> wrote:
> > > > 
> > > > [...]
> > > > 
> > > > Agreed.  This was unclear to me in previous disucssions, though.  I still agree
> > > > it is out of the scope of this patch.  But now I think we need to let users
> > > > force-restart.  Adding this to my todo list.
> > > 
> > > Just to make sure - is this the same issue that my recent RFC patch [1]
> > > aims to address? I want to make sure we're not duplicating efforts.
> > 
> > Hmm, this makes me confused about how we ended up working on this series, then.
> > 
> > > 
> > > I'm still actively working on that patch, and I plan to send the next
> > > version next week. I've been holding off because I didn't want to send
> > > multiple patches in parallel.
> > 
> > Ok, seems I dropped a ball.  I was working like AI bot that only works with
> > limited and nearly fresh context for each mail that on my inbox.  I will work
> > on making another thing for tracking this kind of parallel works with good
> > context.  But since I already dropped the ball for this, I'd like to make sure
> > we are on the same page.  So I'd suggest below.
> > 
> > 1. Let's hold this patch series.  Andrew, please don't merge this for now until
> >    this discussion is completed.
> > 2. Please summarize your parallel works in progress with the context about how
> >    you decided to do that in the way, with summaries of our previous
> >    discussions.
> > 
> > Could you please do those, Liew?
> 
> TL;DR - There is no dependency between my parallel (2) works. I never
> intended to merge these two works into a single series because they
> address the problem at different level.
> 
> Yes, here is the summary of my parallel works and how I decided to
> structure them.
> 
> My current parallel works
> =========================
> Series A - min_region_sz power-of-2 validation (this patch)
> - Status: You gave Reviewed-by, asked Andrew to merge.
> - Scope: Small, standalone fix for specific issue.
> - Dependency: None
> 
> Series B - reset parameters (enabled/kdamond_pid) on unexpected
> termination (RFC [1])
> - Status: Preparing V2.
> - Scope: Reset 'enabled' and 'kdamond_pid' when kdamond terminated
>   unexpectedly.
> - Dependency: None
> 
> Summary of Series B
> ===================
> The design evolved through our discussion:
> 1. Extended 'struct damon_ctx' with 'thread_status' pointers
>    - SJ pointed out: "This feels too much extension of core API for a
>      problem that can more simply be fixed."
> 
> 2. Alternative proposals discussed:
>    - Option 1: Termination callback in core API (Too heavy for backport)
>    - Option 2: Override '.get' operator for parameters (code duplication)
>    - Option 3: On-demamd correction in enabled_store() (passive)
> 
> 3. Final direction comfirmed:
>    - SJ suggested: "Can't we catch damon_commit_ctx() failure from the
>      calling place?"
>    - We agreed: Simple, backportable fix is the priority.
>    - Decision: Reset 'enabled' and 'kdamond_pid' immediately when
>      damon_commit_ctx() fails, add a fallback in the damon_turn() 'N'
>      path.
> 
> [1] https://lore.kernel.org/20260330164347.12772-1-aethernet65535@gmail.com

Thank you for sharing these.  But seems even this context is incomplete from my
perspective...

The Real Context
----------------

The real start of the context is probably the question [1] that you asked on
2026-03-18.  At that time, we found committing wrong parameters stops DAMON
without an user-visible error.  We agreed DAMON being killed is no problem but
the absence of user-visible error is a minor user experience problem that
better to be improved.  So you planned to make the user experience improvement.

Apparently series A is the followup of that.  The first version was posted on
2026-03-26, according to the changelog of this patch series.  I can show I was
saying the patch was confusing on the thread.

While the review of the confusing series A was ongoing, you posted [3] the RFC
of series B on 2026-03-31.  This patch reported one important finding: the
silent DAMON stop is not just a minor user experience issue but a serious bug
because it cannot be started again.

And this made things much more confusing. There are multiple ways to trigger
the issue.  Wrong addr_unit commit is just one of the ways.  We discussed the
way to fix the real issue.  So, by looking back this, I think you should
prioritized series B from the point, or make suer series A is only for the
minor user experience improvement.  Or asked me what to prioritize.  You didn't
and I missed the fact that you are also working on series A.

But you just continued posting new versions of series A.  I was wrongly
thinking that is still the minor user experience improvement.  I still think
the patches were implicitly saying so.  I'm not sure if it was intentional or
not.  But definitely it was confusing me.  On 2026-04-02, I started feeling I'm
missing some of the contexts, and asked you more clarification of user impacts
[4] and full history [5].  You posted v3 [6] right after I asked the question,
even without answering the question.  Only from this point it became clear
sereis A is not just a minor user experience improvement but a critical bug
fix.  Now I think you should clarify this can also fix one trigger point of the
critical bug but the real fix is work in progress, and this is till only a
minor user experience improvement.  But because you didn't, and my poor memory
is volatile, at this point I was thinking this all the work you are doing for
the series B-exposed bug.

In the response to the sashiko review on this thread, therefore, I was thinking
you are thinking the wrong addr_unit commit is the only way to trigger the bug.
I didn't want to ask you to work from the beginning to fix the entire bug, so I
was saying fixing the real bug for all exploit points is out of the scope of
this bug.

But now it is turned out that you were aware of the other ways to trigger the
bug, and didn't transparentl and explicitly exposing that.

So I withdraw what I told on the reply to the sashiko review.  And appreciate
sashiko developers again for giving us a chance to finding this.

Next Steps
----------

So, what to do?  Please prioritize series B, if you still willing to do.  It is
ok to keep doing series A, but only as the minor user experience improvement.
Clearly explain the whole context you are aware of.  Don't Cc stable@ for
series A, as it is only an incomplete fix of it.  The fix of the one trigger
point is just a side effect.

And For Future Contributions
----------------------------

Liew, I really appreciate your contributions.  You found and shared important
DAMON bugs.  But apparently your communication has many rooms to improve, at
least for poort DAMON maintainer who have to work with only limited resources
including the poort and volatile memory.  I find I was asking clarifications to
your mails multiple times.  I have to say it was even frustrating sometimes and
definitely took quite amount of my resource.  Meanwhile, from my uncautiously
biased perspective, you were only adding more traffic and confusions.

This makes me disappointed and even suspect your intention...  I really hate
myself suspecting someone.  But we are in the world of bad actors that now
gained the power of AI.  We actually had a suspicious case from DAMON
contributors recently, do you remember?

I want to still believe you, so I will do so.  Please feel free to keep
contributing to DAMON as long as that's what you want to do.  But, please aware
of the fact that DAMON maintainer has poor volatile memory and working with
limited resource, and try to make every conversation super clear and
transparent, from next time.

[1] https://lore.kernel.org/20260318153731.97470-1-aethernet65535@gmail.com
[2] https://lore.kernel.org/20260327062627.66426-1-aethernet65535@gmail.com
[3] https://lore.kernel.org/20260330164347.12772-1-aethernet65535@gmail.com
[4] https://lore.kernel.org/20260402140314.74600-1-sj@kernel.org
[5] https://lore.kernel.org/20260402152915.75294-1-sj@kernel.org
[6] https://lore.kernel.org/20260403052837.58063-1-aethernet65535@gmail.com

Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2026-04-11 15:38 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-10  4:42 [PATCH v4 0/2] mm/damon: validate min_region_size to be power of 2 Liew Rui Yan
2026-04-10  4:42 ` [PATCH v4 1/2] mm/damon/lru_sort: " Liew Rui Yan
2026-04-10  9:40   ` (sashiko review) " Liew Rui Yan
2026-04-10 13:55     ` SeongJae Park
2026-04-10 16:46       ` Liew Rui Yan
2026-04-10 17:00         ` SeongJae Park
2026-04-10 23:24           ` SeongJae Park
2026-04-11  0:04           ` Liew Rui Yan
2026-04-11 15:38             ` SeongJae Park
2026-04-10 13:56   ` SeongJae Park
2026-04-10  4:42 ` [PATCH v4 2/2] mm/damon/reclaim: " Liew Rui Yan
2026-04-10 10:08   ` (sashiko review) " Liew Rui Yan
2026-04-10 13:44     ` SeongJae Park
2026-04-10 13:57   ` SeongJae Park
2026-04-10 14:05 ` [PATCH v4 0/2] mm/damon: " SeongJae Park

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox