From: Stepanov Anatoly <stepanov.anatoly@huawei.com>
To: SeongJae Park <sj@kernel.org>, <gutierrez.asier@huawei-partners.com>
Cc: <artem.kuzin@huawei.com>, <wangkefeng.wang@huawei.com>,
<yanquanmin1@huawei.com>, <zuoze1@huawei.com>,
<damon@lists.linux.dev>, <akpm@linux-foundation.org>,
<linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v1 1/1] mm/damon: support MADV_COLLAPSE via DAMOS_COLLAPSE scheme action
Date: Tue, 31 Mar 2026 13:46:53 +0300 [thread overview]
Message-ID: <15f60a0e-a21b-472a-ae76-8437e9859e15@huawei.com> (raw)
In-Reply-To: <20260331013109.66590-1-sj@kernel.org>
On 3/31/2026 4:31 AM, SeongJae Park wrote:
> Hello Asier,
>
> On Mon, 30 Mar 2026 14:57:58 +0000 <gutierrez.asier@huawei-partners.com> wrote:
>
>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>>
>> This patch set introces a new action: DAMOS_COLLAPSE.
>>
>> For DAMOS_HUGEPAGE and DAMOS_NOHUGEPAGE to work, khugepaged should be
>> working, since it relies on hugepage_madvise to add a new slot. This
>> slot should be picked up by khugepaged and eventually collapse (or
>> not, if we are using DAMOS_NOHUGEPAGE) the pages. If THP is not
>> enabled, khugepaged will not be working, and therefore no collapse
>> will happen.
>
> I should raised this in a previous version, sorry. But, that is only a half of
> the picture. That is, khugepaged is not the single THP allocator for
> MADV_HUGEPAGE. IIUC, MADV_HUGEPAGE-applied region also allocates huge pages in
> page fault time. According to the man page,
>
> The kernel will regularly scan the areas marked as huge page candidates
> to replace them with huge pages. The kernel will also allocate huge pages
> directly when the region is naturally aligned to the huge page size (see
> posix_memalign(2)).
>
I think key difference between DAMOS_HUGEPAGE and DAMOS_COLLAPSE is the granularity.
In DAMOS_HUGEPAGE case, the granularity is always VMA, even if the hot region is narrow.
It's true for both page-fault based collapse and khugepaged collapse.
With DAMOS_COLLAPSE we can cover cases, when there's large VMA, for example,
which contains some hot VA region inside, so we can collapse just that region, not the whole VMA.
> I think the description is better to be wordsmithed or clarified. Maybe just
> pointing the MADV_COLLAPSE intro commit (7d8faaf15545 ("mm/madvise: introduce
> MADV_COLLAPSE sync hugepage collapse")) for the rationale could also be a good
> approach, as the aimed goal of DAMOS_COLLAPSE is not different from
> MADV_COLLAPSE.
>
>>
>> DAMOS_COLLAPSE eventually calls madvise_collapse, which will collapse
>> the address range synchronously.
>>
>> This new action may be required to support autotuning with hugepage
>> as a goal[1].
>>
>> [1]: https://lore.kernel.org/damon/20260313000816.79933-1-sj@kernel.org/
>>
>> ---------
>> Benchmarks:
>
> I recently heard some tools could think above line as the commentary
> area [1] separation line. Please use ==== like separator instead. For
> example,
>
> Benchmarks
> ==========
>
>>
>> Tests were performed in an ARM physical server with MariaDB 10.5 and
>> sysbench. Read only benchmark was perform with uniform row hitting,
>> which means that all rows will be access with equal probability.
>>
>> T n, D h: THP set to never, DAMON action set to hugepage
>> T m, D h: THP set to madvise, DAMON action set to hugepage
>> T n, D c: THP set to never, DAMON action set to collapse
>>
>> Memory consumption. Lower is better.
>>
>> +------------------+----------+----------+----------+
>> | | T n, D h | T m, D h | T n, D c |
>> +------------------+----------+----------+----------+
>> | Total memory use | 2.07 | 2.09 | 2.07 |
>> | Huge pages | 0 | 1.3 | 1.25 |
>> +------------------+----------+----------+----------+
>>
>> Performance in TPS (Transactions Per Second). Higher is better.
>>
>> T n, D h: 18324.57
>> T n, D h 18452.69
>
> "T m, D h" ?
>
>> T n, D c: 18432.17
>>
>> Performance counter
>>
>> I got the number of L1 D/I TLB accesses and the number a D/I TLB
>> accesses that triggered a page walk. I divided the second by the
>> first to get the percentage of page walkes per TLB access. The
>> lower the better.
>>
>> +---------------+--------------+--------------+--------------+
>> | | T n, D h | T m, D h | T n, D c |
>> +---------------+--------------+--------------+--------------+
>> | L1 DTLB | 127248242753 | 125431020479 | 125327001821 |
>> | L1 ITLB | 80332558619 | 79346759071 | 79298139590 |
>> | DTLB walk | 75011087 | 52800418 | 55895794 |
>> | ITLB walk | 71577076 | 71505137 | 67262140 |
>> | DTLB % misses | 0.058948623 | 0.042095183 | 0.044599961 |
>> | ITLB % misses | 0.089100954 | 0.090117275 | 0.084821839 |
>> +---------------+--------------+--------------+--------------+
>>
>> - We can see that DAMOS "hugepage" action works only when THP is set
>> to madvise. "collapse" action works even when THP is set to never.
>
> Make sense.
>
>> - Performance for "collapse" action is slightly lower than "hugepage"
>> action and THP madvise.
>
> It would be good to add your theory about from where the difference comes. I
> suspect that's mainly because "hugepage" setup was allocating more THP?
>
>> - Memory consumption is slighly lower for "collapse" than "hugepage"
>> with THP madvise. This is due to the khugepage collapses all VMAs,
>> while "collapse" action only collapses the VMAs in the hot region.
>
> But you use thp=madvise, not thp=always? So only hot regions, which
> DAMOS_HUGEPAGE applied, could use THP. It is same to DAMOS_COLLAPSE use case,
> isn't it?
>
> I'd rather suspect the natural-aligned region huge page allocation of
> DAMOS_HUGEPAGE as a reason of this difference. That is, DAMOS_HUGEPAGE applied
> regions can allocate hugepages in the fault time, on multiple user threads.
> Meanwhile, DAMOS_COLLAPSE should be executed by the single kdamond (if you
> utilize only single kdamond). This might resulted in DAMOS_HUGEPAGE allocating
> more huge pages faster than DAMOS_COLLAPSE?
>
>> - There is an improvement in THP utilization when collapse through
>> "hugepage" or "collapse" actions are triggered.
>
> Could you clarify which data point is showing this? Maybe "Huge pages" /
> "Total memory use" ? And why? I again suspect the fault time huge pages
> allocation.
>
>> - "collapse" action is performance synchronously, which means that
>> page collapses happen earlier and more rapidly.
>
> But these test results are not showing it clearly. Rather, the results is
> saying "hugepage" was able to make more huge pages than "collapse". Still the
> above sentence makes sense when we say about "collapsing" operations. But,
> this test is not showing it clearly. I think we should make it clear the
> limitation of this test.
>
>> This can be
>> useful or not, depending on the scenario.
>>
>> Collapse action just adds a new option to chose the correct system
>> balance.
>
> That's a fair point. I believe we also discussed pros and cons of
> MADV_COLLAPSE, and concluded MADV_COLLAPSE is worthy to be added. For
> DAMOS_COLLAPSE, I don't think we have to do that again.
>
>>
>> Changes
>> ---------
>> RFC v2 -> v1:
>> Fixed a missing comma in the selftest python stript
>> Added performance benchmarks
>>
>> RFC v1 -> RFC v2:
>> Added benchmarks
>> Added damos_filter_type documentation for new action to fix kernel-doc
>
> Please put changelog in the commentary area, and consider adding links to the
> previous revisions [1].
>
>>
>> Signed-off-by: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>> ---
>
> Code looks good to me. Nonetheless I'd hope above commit message and benchmark
> results analysis be more polished and/or clarified.
>
> [1] https://docs.kernel.org/process/submitting-patches.html#commentary
>
>
> Thanks,
> SJ
--
Anatoly Stepanov, Huawei
next prev parent reply other threads:[~2026-03-31 10:47 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-30 14:57 gutierrez.asier
2026-03-30 23:43 ` (sashiko review) " SeongJae Park
2026-03-31 0:01 ` SeongJae Park
2026-03-31 1:31 ` SeongJae Park
2026-03-31 10:46 ` Stepanov Anatoly [this message]
2026-03-31 10:50 ` Stepanov Anatoly
2026-04-01 0:54 ` SeongJae Park
2026-03-31 15:15 ` Gutierrez Asier
2026-04-01 0:59 ` SeongJae Park
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=15f60a0e-a21b-472a-ae76-8437e9859e15@huawei.com \
--to=stepanov.anatoly@huawei.com \
--cc=akpm@linux-foundation.org \
--cc=artem.kuzin@huawei.com \
--cc=damon@lists.linux.dev \
--cc=gutierrez.asier@huawei-partners.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=sj@kernel.org \
--cc=wangkefeng.wang@huawei.com \
--cc=yanquanmin1@huawei.com \
--cc=zuoze1@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox