linux-mm.kvack.org archive mirror
From: Dev Jain <dev.jain@arm.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: akpm@linux-foundation.org, david@redhat.com, kas@kernel.org,
	willy@infradead.org, hughd@google.com, ziy@nvidia.com,
	baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com,
	npache@redhat.com, ryan.roberts@arm.com, baohua@kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] mm: Enable khugepaged to operate on non-writable VMAs
Date: Thu, 4 Sep 2025 09:34:35 +0530
Message-ID: <6abceb3f-e988-45cd-800f-2b54f9a92e26@arm.com>
In-Reply-To: <9536873f-08ea-4a60-bbec-3e7a832dc0e1@lucifer.local>

On 04/09/25 2:04 am, Lorenzo Stoakes wrote:
> On Wed, Sep 03, 2025 at 11:16:34AM +0530, Dev Jain wrote:
>> Currently khugepaged does not collapse a region which does not have a
>> single writable page. This is wasteful since non-writable VMAs mapped by
> As discussed elsewhere in the thread, you really need to clarify that you
> mean the PTE is writable. This is far too vague otherwise.

Okay.

>
>> the application won't benefit from THP collapse. Therefore, remove this
>> restriction and allow khugepaged to collapse a VMA with arbitrary
>> protections.
> It's weird, the history of this. It looks like we were super conservative
> at first, and then introduced this 'at least one PTE writable' thing in
> commit 10359213d05a ("mm: incorporate read-only pages into transparent huge
> pages"), but it doesn't really explain why you even need (at least) a
> writable page.
>
> Perhaps a pre-PAE thing... (David?) we already do the refcount stuff
> though, so it's hard to understand.
>
> It seems the main case for anon where it'd matter is swapped in pages
> read-faulting for a R/W mapping (as read-faulting R/W mappings would just
> get you the zero page which vm_normal_page() would exclude anyway).
>
> But not sure why we'd be reticent to collapse those anyway... you'd just
> change R/W bit on PMD instead of PTE?
>
> Yeah it's bizarre.
>
> I can't really see why your change shouldn't be done...
>
>
>> Along with this, currently MADV_COLLAPSE does not perform a collapse on a
>> non-writable VMA, and this restriction is nowhere to be found on the
>> manpage - the restriction itself sounds wrong to me since the user knows
> I'm not sure why a man page would talk about PTE scanning implementation
> details?


Sure, the manpage shouldn't talk about that, but the consequence of this
PTE scanning implementation is that a read-only VMA won't be collapsed,
so the manpage should have at least talked about mapping protections.
So a user doing a PROT_READ mapping and then doing madvise(MADV_COLLAPSE)
will receive -EINVAL which is extremely bizarre.
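
For illustration, here is a minimal userspace sketch of that scenario. It
is not part of the patch. It assumes 4K base pages with a 2 MiB PMD, and
it defines MADV_COLLAPSE by hand in case the libc headers predate it. The
helper names (align_up, collapse_readonly_region, PMD_SIZE_ASSUMED) are
made up for the example. The result depends on the kernel version and THP
configuration: if the fault path already installed a huge page, the call
may simply succeed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* value from the Linux uapi headers */
#endif

/* Assumption: 2 MiB PMD, i.e. 4K base pages. Adjust for other configs. */
#define PMD_SIZE_ASSUMED (2UL << 20)

/* Round addr up to the next multiple of align (must be a power of two). */
uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
	assert(align && !(align & (align - 1)));
	return (addr + align - 1) & ~(align - 1);
}

/*
 * Populate one PMD-sized anonymous region, make it read-only, then ask
 * for a collapse. Returns 0 on success, or the errno of the first call
 * that failed.
 */
int collapse_readonly_region(void)
{
	size_t len = 2 * PMD_SIZE_ASSUMED;	/* over-allocate so we can align */
	char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *start;
	int err = 0;

	if (map == MAP_FAILED)
		return errno;

	start = (char *)align_up((uintptr_t)map, PMD_SIZE_ASSUMED);
	memset(start, 1, PMD_SIZE_ASSUMED);	/* fault in real pages */

	/* Drop write permission, so every PTE in the range is read-only. */
	if (mprotect(start, PMD_SIZE_ASSUMED, PROT_READ) ||
	    madvise(start, PMD_SIZE_ASSUMED, MADV_COLLAPSE))
		err = errno;

	munmap(map, len);
	return err;
}
```

Calling collapse_readonly_region() from a small main() and printing
strerror() of the result should be enough to compare kernels with and
without the patch.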

>
> But I guess as you say you're thinking specifically of a read-only VMA that
> naturally has read-only PTEs as a result...
>
>> the protection of the memory it has mapped, so collapsing read-only
>> memory via madvise() should be a choice of the user which shouldn't
>> be overriden by the kernel.
> NIT: overriden -> overridden.
>
>> On an arm64 machine, an average of 5% improvement is seen on some mmtests
>> benchmarks, particularly hackbench, with a maximum improvement of 12%.
> Nice!
>
> Is this on a bare metal machine, or a VM? I think it's important to clarify
> details like this.
>
> Please state precisely what you tested this on.

I am guessing these benchmarks run in a container but I'll clarify this.

>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Can't find any problem with this, and doesn't really seem like it'd be
> problematic so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thanks.

>
>> ---
>> RFC->v1:
>> Drop writable references from tracepoints
>>
>> RFC:
>> https://lore.kernel.org/all/20250901074817.73012-1-dev.jain@arm.com/
>>
>> I can see performance improvements on mmtests run on an arm64 machine
>> comparing with 6.17-rc2. (I) denotes statistically significant improvement,
>> (R) denotes statistically significant regression (Please ignore the
>> numbers in the middle column):
> Let's drop the numbers in the middle column then please, this is going into the
> commit log, let's not put extraneous information there.

I'll go study some Unix commands to drop that middle column :)

>
>> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>> | mmtests/hackbench                  | process-pipes-1 (seconds)                                |                 0.145 |                   -0.06% |
>> |                                    | process-pipes-4 (seconds)                                |                0.4335 |                   -0.27% |
>> |                                    | process-pipes-7 (seconds)                                |                 0.823 |              (I) -12.13% |
>> |                                    | process-pipes-12 (seconds)                               |    1.3538333333333334 |               (I) -5.32% |
>> |                                    | process-pipes-21 (seconds)                               |    1.8971666666666664 |               (I) -2.87% |
>> |                                    | process-pipes-30 (seconds)                               |    2.5023333333333335 |               (I) -3.39% |
>> |                                    | process-pipes-48 (seconds)                               |                3.4305 |               (I) -5.65% |
>> |                                    | process-pipes-79 (seconds)                               |     4.245833333333334 |               (I) -6.74% |
>> |                                    | process-pipes-110 (seconds)                              |     5.114833333333333 |               (I) -6.26% |
>> |                                    | process-pipes-141 (seconds)                              |                6.1885 |               (I) -4.99% |
>> |                                    | process-pipes-172 (seconds)                              |     7.231833333333334 |               (I) -4.45% |
>> |                                    | process-pipes-203 (seconds)                              |     8.393166666666668 |               (I) -3.65% |
>> |                                    | process-pipes-234 (seconds)                              |     9.487499999999999 |               (I) -3.45% |
>> |                                    | process-pipes-256 (seconds)                              |    10.316166666666666 |               (I) -3.47% |
>> |                                    | process-sockets-1 (seconds)                              |                 0.289 |                    2.13% |
>> |                                    | process-sockets-4 (seconds)                              |    0.7596666666666666 |                    1.02% |
>> |                                    | process-sockets-7 (seconds)                              |    1.1663333333333334 |                   -0.26% |
>> |                                    | process-sockets-12 (seconds)                             |    1.8641666666666665 |                   -1.24% |
>> |                                    | process-sockets-21 (seconds)                             |    3.0773333333333333 |                    0.01% |
>> |                                    | process-sockets-30 (seconds)                             |                4.2405 |                   -0.15% |
>> |                                    | process-sockets-48 (seconds)                             |     6.459666666666666 |                    0.15% |
>> |                                    | process-sockets-79 (seconds)                             |    10.156833333333333 |                    1.45% |
>> |                                    | process-sockets-110 (seconds)                            |    14.317833333333333 |                   -1.64% |
>> |                                    | process-sockets-141 (seconds)                            |               20.8735 |               (I) -4.27% |
>> |                                    | process-sockets-172 (seconds)                            |    26.205333333333332 |                    0.30% |
>> |                                    | process-sockets-203 (seconds)                            |    31.298000000000002 |                   -1.71% |
>> |                                    | process-sockets-234 (seconds)                            |    36.104000000000006 |                   -1.94% |
>> |                                    | process-sockets-256 (seconds)                            |     39.44016666666667 |                   -0.71% |
>> |                                    | thread-pipes-1 (seconds)                                 |   0.17550000000000002 |                    0.66% |
>> |                                    | thread-pipes-4 (seconds)                                 |   0.44716666666666666 |                    1.66% |
>> |                                    | thread-pipes-7 (seconds)                                 |                0.7345 |                   -0.17% |
>> |                                    | thread-pipes-12 (seconds)                                |     1.405833333333333 |               (I) -4.12% |
>> |                                    | thread-pipes-21 (seconds)                                |    2.0113333333333334 |               (I) -2.13% |
>> |                                    | thread-pipes-30 (seconds)                                |    2.6648333333333336 |               (I) -3.78% |
>> |                                    | thread-pipes-48 (seconds)                                |    3.6341666666666668 |               (I) -5.77% |
>> |                                    | thread-pipes-79 (seconds)                                |                4.4085 |               (I) -5.31% |
>> |                                    | thread-pipes-110 (seconds)                               |     5.374666666666666 |               (I) -6.12% |
>> |                                    | thread-pipes-141 (seconds)                               |     6.385666666666666 |               (I) -4.00% |
>> |                                    | thread-pipes-172 (seconds)                               |     7.403000000000001 |               (I) -3.01% |
>> |                                    | thread-pipes-203 (seconds)                               |     8.570333333333332 |               (I) -2.62% |
>> |                                    | thread-pipes-234 (seconds)                               |     9.719166666666666 |               (I) -2.00% |
>> |                                    | thread-pipes-256 (seconds)                               |    10.552833333333334 |               (I) -2.30% |
>> |                                    | thread-sockets-1 (seconds)                               |                0.3065 |                (R) 2.39% |
>> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>>
>> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>> | mmtests/sysbench-mutex             | sysbenchmutex-1 (usec)                                   |    194.38333333333333 |                   -0.02% |
>> |                                    | sysbenchmutex-4 (usec)                                   |               200.875 |                   -0.02% |
>> |                                    | sysbenchmutex-7 (usec)                                   |    201.23000000000002 |                    0.00% |
>> |                                    | sysbenchmutex-12 (usec)                                  |    201.77666666666664 |                    0.12% |
>> |                                    | sysbenchmutex-21 (usec)                                  |                203.03 |                   -0.40% |
>> |                                    | sysbenchmutex-30 (usec)                                  |               203.285 |                    0.08% |
>> |                                    | sysbenchmutex-48 (usec)                                  |    231.30000000000004 |                    2.59% |
>> |                                    | sysbenchmutex-79 (usec)                                  |               362.075 |                   -0.80% |
>> |                                    | sysbenchmutex-110 (usec)                                 |     516.8233333333334 |                   -3.87% |
>> |                                    | sysbenchmutex-128 (usec)                                 |     593.3533333333334 |               (I) -4.46% |
>> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
> This is nice, but is clearly hugely exceeding the column width we should have in commit messages.
>
> Let me use emacs's nice features to make life easy for you :) -

Oh you did it for me, thank you so much!

>
> +-------------------------+--------------------------------+---------------+
> | mmtests/hackbench       | process-pipes-1 (seconds)      |        -0.06% |
> |                         | process-pipes-4 (seconds)      |        -0.27% |
> |                         | process-pipes-7 (seconds)      |   (I) -12.13% |
> |                         | process-pipes-12 (seconds)     |    (I) -5.32% |
> |                         | process-pipes-21 (seconds)     |    (I) -2.87% |
> |                         | process-pipes-30 (seconds)     |    (I) -3.39% |
> |                         | process-pipes-48 (seconds)     |    (I) -5.65% |
> |                         | process-pipes-79 (seconds)     |    (I) -6.74% |
> |                         | process-pipes-110 (seconds)    |    (I) -6.26% |
> |                         | process-pipes-141 (seconds)    |    (I) -4.99% |
> |                         | process-pipes-172 (seconds)    |    (I) -4.45% |
> |                         | process-pipes-203 (seconds)    |    (I) -3.65% |
> |                         | process-pipes-234 (seconds)    |    (I) -3.45% |
> |                         | process-pipes-256 (seconds)    |    (I) -3.47% |
> |                         | process-sockets-1 (seconds)    |         2.13% |
> |                         | process-sockets-4 (seconds)    |         1.02% |
> |                         | process-sockets-7 (seconds)    |        -0.26% |
> |                         | process-sockets-12 (seconds)   |        -1.24% |
> |                         | process-sockets-21 (seconds)   |         0.01% |
> |                         | process-sockets-30 (seconds)   |        -0.15% |
> |                         | process-sockets-48 (seconds)   |         0.15% |
> |                         | process-sockets-79 (seconds)   |         1.45% |
> |                         | process-sockets-110 (seconds)  |        -1.64% |
> |                         | process-sockets-141 (seconds)  |    (I) -4.27% |
> |                         | process-sockets-172 (seconds)  |         0.30% |
> |                         | process-sockets-203 (seconds)  |        -1.71% |
> |                         | process-sockets-234 (seconds)  |        -1.94% |
> |                         | process-sockets-256 (seconds)  |        -0.71% |
> |                         | thread-pipes-1 (seconds)       |         0.66% |
> |                         | thread-pipes-4 (seconds)       |         1.66% |
> |                         | thread-pipes-7 (seconds)       |        -0.17% |
> |                         | thread-pipes-12 (seconds)      |    (I) -4.12% |
> |                         | thread-pipes-21 (seconds)      |    (I) -2.13% |
> |                         | thread-pipes-30 (seconds)      |    (I) -3.78% |
> |                         | thread-pipes-48 (seconds)      |    (I) -5.77% |
> |                         | thread-pipes-79 (seconds)      |    (I) -5.31% |
> |                         | thread-pipes-110 (seconds)     |    (I) -6.12% |
> |                         | thread-pipes-141 (seconds)     |    (I) -4.00% |
> |                         | thread-pipes-172 (seconds)     |    (I) -3.01% |
> |                         | thread-pipes-203 (seconds)     |    (I) -2.62% |
> |                         | thread-pipes-234 (seconds)     |    (I) -2.00% |
> |                         | thread-pipes-256 (seconds)     |    (I) -2.30% |
> |                         | thread-sockets-1 (seconds)     |     (R) 2.39% |
> +-------------------------+--------------------------------+---------------+
>
> +-------------------------+--------------------------------+---------------+
> | mmtests/sysbench-mutex  | sysbenchmutex-1 (usec)         |        -0.02% |
> |                         | sysbenchmutex-4 (usec)         |        -0.02% |
> |                         | sysbenchmutex-7 (usec)         |         0.00% |
> |                         | sysbenchmutex-12 (usec)        |         0.12% |
> |                         | sysbenchmutex-21 (usec)        |        -0.40% |
> |                         | sysbenchmutex-30 (usec)        |         0.08% |
> |                         | sysbenchmutex-48 (usec)        |         2.59% |
> |                         | sysbenchmutex-79 (usec)        |        -0.80% |
> |                         | sysbenchmutex-110 (usec)       |        -3.87% |
> |                         | sysbenchmutex-128 (usec)       |    (I) -4.46% |
> +-------------------------+--------------------------------+---------------+
>
>
>>   mm/khugepaged.c | 9 ++-------
>>   1 file changed, 2 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 4ec324a4c1fe..a0f1df2a7ae6 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -676,9 +676,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>   			writable = true;
>>   	}
>>
>> -	if (unlikely(!writable)) {
>> -		result = SCAN_PAGE_RO;
>> -	} else if (unlikely(cc->is_khugepaged && !referenced)) {
>> +	if (unlikely(cc->is_khugepaged && !referenced)) {
>>   		result = SCAN_LACK_REFERENCED_PAGE;
>>   	} else {
>>   		result = SCAN_SUCCEED;
>> @@ -1421,9 +1419,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>   		     mmu_notifier_test_young(vma->vm_mm, _address)))
>>   			referenced++;
>>   	}
>> -	if (!writable) {
>> -		result = SCAN_PAGE_RO;
>> -	} else if (cc->is_khugepaged &&
>> +	if (cc->is_khugepaged &&
>>   		   (!referenced ||
>>   		    (unmapped && referenced < HPAGE_PMD_NR / 2))) {
>>   		result = SCAN_LACK_REFERENCED_PAGE;
>> @@ -2830,7 +2826,6 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>   		case SCAN_PMD_NULL:
>>   		case SCAN_PTE_NON_PRESENT:
>>   		case SCAN_PTE_UFFD_WP:
>> -		case SCAN_PAGE_RO:
>>   		case SCAN_LACK_REFERENCED_PAGE:
>>   		case SCAN_PAGE_NULL:
>>   		case SCAN_PAGE_COUNT:
>> --
>> 2.30.2
>>
> I guess you delay the final cleanup so you can combine it with tracepoint
> removal in next patch, not really sure why they're separate but meh not a
> big deal.
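
To make the behavioural change in the hpage_collapse_scan_pmd() hunk
easier to see, here is a small userspace model of the result selection
before and after the patch. It is only a sketch built from the diff
context above: the enum is redeclared locally with arbitrary values,
HPAGE_PMD_NR is assumed to be 512 (4K base pages), and the function names
scan_pmd_result_old/new are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the kernel's scan results (values are arbitrary). */
enum scan_result {
	SCAN_SUCCEED,
	SCAN_PAGE_RO,
	SCAN_LACK_REFERENCED_PAGE,
};

#define HPAGE_PMD_NR 512	/* assumption: 4K base pages, 2 MiB PMD */

/* Result selection before the patch: no writable PTE means SCAN_PAGE_RO. */
enum scan_result scan_pmd_result_old(bool is_khugepaged, bool writable,
				     int referenced, int unmapped)
{
	assert(referenced >= 0 && unmapped >= 0);

	if (!writable)
		return SCAN_PAGE_RO;
	if (is_khugepaged &&
	    (!referenced || (unmapped && referenced < HPAGE_PMD_NR / 2)))
		return SCAN_LACK_REFERENCED_PAGE;
	return SCAN_SUCCEED;
}

/* Result selection after the patch: the writable check is gone. */
enum scan_result scan_pmd_result_new(bool is_khugepaged,
				     int referenced, int unmapped)
{
	assert(referenced >= 0 && unmapped >= 0);

	if (is_khugepaged &&
	    (!referenced || (unmapped && referenced < HPAGE_PMD_NR / 2)))
		return SCAN_LACK_REFERENCED_PAGE;
	return SCAN_SUCCEED;
}
```

In this model a MADV_COLLAPSE caller (is_khugepaged == false) scanning a
fully read-only range gets SCAN_PAGE_RO from the old function and
SCAN_SUCCEED from the new one, while the khugepaged referenced-page
heuristic is unchanged.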
