On 04/09/25 2:04 am, Lorenzo Stoakes wrote:
> On Wed, Sep 03, 2025 at 11:16:34AM +0530, Dev Jain wrote:
>> Currently khugepaged does not collapse a region which does not have a
>> single writable page. This is wasteful since non-writable VMAs mapped by
> As discussed elsewhere in the thread, you really need to clarify that you
> mean the PTE is writable. This is far too vague otherwise.

Okay.

>
>> the application won't benefit from THP collapse. Therefore, remove this
>> restriction and allow khugepaged to collapse a VMA with arbitrary
>> protections.
> It's weird the history of this, it looks like we were super conservative
> at first, and then introduced this 'at least one PTE writable' thing in
> commit 10359213d05a ("mm: incorporate read-only pages into transparent huge
> pages"), but it doesn't really explain why you even need (at least) a
> writable page.
>
> Perhaps a pre-PAE thing... (David?) we already do the refcount stuff
> though, so it's hard to understand.
>
> It seems the main case for anon where it'd matter is swapped in pages
> read-faulting for a R/W mapping (as read-faulting R/W mappings would just
> get you the zero page which vm_normal_page() would exclude anyway).
>
> But not sure why we'd be reticent to collapse those anyway... you'd just
> change R/W bit on PMD instead of PTE?
>
> Yeah it's bizarre.
>
> I can't really see why your change shouldn't be done...
>
>
>> Along with this, currently MADV_COLLAPSE does not perform a collapse on a
>> non-writable VMA, and this restriction is nowhere to be found on the
>> manpage - the restriction itself sounds wrong to me since the user knows
> I'm not sure why a man page would talk about PTE scanning implementation
> details?

Sure, the manpage shouldn't talk about that, but the consequence of this
PTE scanning implementation is that a read-only VMA won't be collapsed,
so the manpage should at least have talked about mapping protections.
So a user doing a PROT_READ mapping and then doing madvise(MADV_COLLAPSE)
will receive -EINVAL, which is extremely bizarre.

>
> But I guess as you say you're thinking specifically of a read-only VMA that
> naturally has read-only PTEs as a result...
>
>> the protection of the memory it has mapped, so collapsing read-only
>> memory via madvise() should be a choice of the user which shouldn't
>> be overriden by the kernel.
> NIT: overriden -> overridden.
>
>> On an arm64 machine, an average of 5% improvement is seen on some mmtests
>> benchmarks, particularly hackbench, with a maximum improvement of 12%.
> Nice!
>
> Is this on a bare metal machine, or a VM? I think it's important to clarify
> details like this.
>
> Please state precisely what you tested this on.

I am guessing these benchmarks run in a container but I'll clarify this.

>
>> Signed-off-by: Dev Jain
> Can't find any problem with this, and doesn't really seem like it'd be
> problematic so:
>
> Reviewed-by: Lorenzo Stoakes

Thanks.

>
>> ---
>> RFC->v1:
>> Drop writable references from tracepoints
>>
>> RFC:
>> https://lore.kernel.org/all/20250901074817.73012-1-dev.jain@arm.com/
>>
>> I can see performance improvements on mmtests run on an arm64 machine
>> comparing with 6.17-rc2. (I) denotes statistically significant improvement,
>> (R) denotes statistically significant regression (Please ignore the
>> numbers in the middle column):
> Let's drop the numbers in the middle column then please, this is going into the
> commit log, let's not put extraneous information there.

I'll go study some Unix commands to drop that middle column :)

>
>> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>> | mmtests/hackbench                  | process-pipes-1 (seconds)                                | 0.145                 | -0.06%                   |
>> |                                    | process-pipes-4 (seconds)                                | 0.4335                | -0.27%                   |
>> |                                    | process-pipes-7 (seconds)                                | 0.823                 | (I) -12.13%              |
>> |                                    | process-pipes-12 (seconds)                               | 1.3538333333333334    | (I) -5.32%               |
>> |                                    | process-pipes-21 (seconds)                               | 1.8971666666666664    | (I) -2.87%               |
>> |                                    | process-pipes-30 (seconds)                               | 2.5023333333333335    | (I) -3.39%               |
>> |                                    | process-pipes-48 (seconds)                               | 3.4305                | (I) -5.65%               |
>> |                                    | process-pipes-79 (seconds)                               | 4.245833333333334     | (I) -6.74%               |
>> |                                    | process-pipes-110 (seconds)                              | 5.114833333333333     | (I) -6.26%               |
>> |                                    | process-pipes-141 (seconds)                              | 6.1885                | (I) -4.99%               |
>> |                                    | process-pipes-172 (seconds)                              | 7.231833333333334     | (I) -4.45%               |
>> |                                    | process-pipes-203 (seconds)                              | 8.393166666666668     | (I) -3.65%               |
>> |                                    | process-pipes-234 (seconds)                              | 9.487499999999999     | (I) -3.45%               |
>> |                                    | process-pipes-256 (seconds)                              | 10.316166666666666    | (I) -3.47%               |
>> |                                    | process-sockets-1 (seconds)                              | 0.289                 | 2.13%                    |
>> |                                    | process-sockets-4 (seconds)                              | 0.7596666666666666    | 1.02%                    |
>> |                                    | process-sockets-7 (seconds)                              | 1.1663333333333334    | -0.26%                   |
>> |                                    | process-sockets-12 (seconds)                             | 1.8641666666666665    | -1.24%                   |
>> |                                    | process-sockets-21 (seconds)                             | 3.0773333333333333    | 0.01%                    |
>> |                                    | process-sockets-30 (seconds)                             | 4.2405                | -0.15%                   |
>> |                                    | process-sockets-48 (seconds)                             | 6.459666666666666     | 0.15%                    |
>> |                                    | process-sockets-79 (seconds)                             | 10.156833333333333    | 1.45%                    |
>> |                                    | process-sockets-110 (seconds)                            | 14.317833333333333    | -1.64%                   |
>> |                                    | process-sockets-141 (seconds)                            | 20.8735               | (I) -4.27%               |
>> |                                    | process-sockets-172 (seconds)                            | 26.205333333333332    | 0.30%                    |
>> |                                    | process-sockets-203 (seconds)                            | 31.298000000000002    | -1.71%                   |
>> |                                    | process-sockets-234 (seconds)                            | 36.104000000000006    | -1.94%                   |
>> |                                    | process-sockets-256 (seconds)                            | 39.44016666666667     | -0.71%                   |
>> |                                    | thread-pipes-1 (seconds)                                 | 0.17550000000000002   | 0.66%                    |
>> |                                    | thread-pipes-4 (seconds)                                 | 0.44716666666666666   | 1.66%                    |
>> |                                    | thread-pipes-7 (seconds)                                 | 0.7345                | -0.17%                   |
>> |                                    | thread-pipes-12 (seconds)                                | 1.405833333333333     | (I) -4.12%               |
>> |                                    | thread-pipes-21 (seconds)                                | 2.0113333333333334    | (I) -2.13%               |
>> |                                    | thread-pipes-30 (seconds)                                | 2.6648333333333336    | (I) -3.78%               |
>> |                                    | thread-pipes-48 (seconds)                                | 3.6341666666666668    | (I) -5.77%               |
>> |                                    | thread-pipes-79 (seconds)                                | 4.4085                | (I) -5.31%               |
>> |                                    | thread-pipes-110 (seconds)                               | 5.374666666666666     | (I) -6.12%               |
>> |                                    | thread-pipes-141 (seconds)                               | 6.385666666666666     | (I) -4.00%               |
>> |                                    | thread-pipes-172 (seconds)                               | 7.403000000000001     | (I) -3.01%               |
>> |                                    | thread-pipes-203 (seconds)                               | 8.570333333333332     | (I) -2.62%               |
>> |                                    | thread-pipes-234 (seconds)                               | 9.719166666666666     | (I) -2.00%               |
>> |                                    | thread-pipes-256 (seconds)                               | 10.552833333333334    | (I) -2.30%               |
>> |                                    | thread-sockets-1 (seconds)                               | 0.3065                | (R) 2.39%                |
>> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>>
>> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>> | mmtests/sysbench-mutex             | sysbenchmutex-1 (usec)                                   | 194.38333333333333    | -0.02%                   |
>> |                                    | sysbenchmutex-4 (usec)                                   | 200.875               | -0.02%                   |
>> |                                    | sysbenchmutex-7 (usec)                                   | 201.23000000000002    | 0.00%                    |
>> |                                    | sysbenchmutex-12 (usec)                                  | 201.77666666666664    | 0.12%                    |
>> |                                    | sysbenchmutex-21 (usec)                                  | 203.03                | -0.40%                   |
>> |                                    | sysbenchmutex-30 (usec)                                  | 203.285               | 0.08%                    |
>> |                                    | sysbenchmutex-48 (usec)                                  | 231.30000000000004    | 2.59%                    |
>> |                                    | sysbenchmutex-79 (usec)                                  | 362.075               | -0.80%                   |
>> |                                    | sysbenchmutex-110 (usec)                                 | 516.8233333333334     | -3.87%                   |
>> |                                    | sysbenchmutex-128 (usec)                                 | 593.3533333333334     | (I) -4.46%               |
>> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
> This is nice, but is clearly hugely exceeding the column width we should have
> in commit messages.
>
> Let me use emacs's nice features to make life easy for you :)

Oh, you did it for me, thank you so much!

>
> +-------------------------+--------------------------------+---------------+
> | mmtests/hackbench       | process-pipes-1 (seconds)      | -0.06%        |
> |                         | process-pipes-4 (seconds)      | -0.27%        |
> |                         | process-pipes-7 (seconds)      | (I) -12.13%   |
> |                         | process-pipes-12 (seconds)     | (I) -5.32%    |
> |                         | process-pipes-21 (seconds)     | (I) -2.87%    |
> |                         | process-pipes-30 (seconds)     | (I) -3.39%    |
> |                         | process-pipes-48 (seconds)     | (I) -5.65%    |
> |                         | process-pipes-79 (seconds)     | (I) -6.74%    |
> |                         | process-pipes-110 (seconds)    | (I) -6.26%    |
> |                         | process-pipes-141 (seconds)    | (I) -4.99%    |
> |                         | process-pipes-172 (seconds)    | (I) -4.45%    |
> |                         | process-pipes-203 (seconds)    | (I) -3.65%    |
> |                         | process-pipes-234 (seconds)    | (I) -3.45%    |
> |                         | process-pipes-256 (seconds)    | (I) -3.47%    |
> |                         | process-sockets-1 (seconds)    | 2.13%         |
> |                         | process-sockets-4 (seconds)    | 1.02%         |
> |                         | process-sockets-7 (seconds)    | -0.26%        |
> |                         | process-sockets-12 (seconds)   | -1.24%        |
> |                         | process-sockets-21 (seconds)   | 0.01%         |
> |                         | process-sockets-30 (seconds)   | -0.15%        |
> |                         | process-sockets-48 (seconds)   | 0.15%         |
> |                         | process-sockets-79 (seconds)   | 1.45%         |
> |                         | process-sockets-110 (seconds)  | -1.64%        |
> |                         | process-sockets-141 (seconds)  | (I) -4.27%    |
> |                         | process-sockets-172 (seconds)  | 0.30%         |
> |                         | process-sockets-203 (seconds)  | -1.71%        |
> |                         | process-sockets-234 (seconds)  | -1.94%        |
> |                         | process-sockets-256 (seconds)  | -0.71%        |
> |                         | thread-pipes-1 (seconds)       | 0.66%         |
> |                         | thread-pipes-4 (seconds)       | 1.66%         |
> |                         | thread-pipes-7 (seconds)       | -0.17%        |
> |                         | thread-pipes-12 (seconds)      | (I) -4.12%    |
> |                         | thread-pipes-21 (seconds)      | (I) -2.13%    |
> |                         | thread-pipes-30 (seconds)      | (I) -3.78%    |
> |                         | thread-pipes-48 (seconds)      | (I) -5.77%    |
> |                         | thread-pipes-79 (seconds)      | (I) -5.31%    |
> |                         | thread-pipes-110 (seconds)     | (I) -6.12%    |
> |                         | thread-pipes-141 (seconds)     | (I) -4.00%    |
> |                         | thread-pipes-172 (seconds)     | (I) -3.01%    |
> |                         | thread-pipes-203 (seconds)     | (I) -2.62%    |
> |                         | thread-pipes-234 (seconds)     | (I) -2.00%    |
> |                         | thread-pipes-256 (seconds)     | (I) -2.30%    |
> |                         | thread-sockets-1 (seconds)     | (R) 2.39%     |
> +-------------------------+--------------------------------+---------------+
>
> +-------------------------+--------------------------------+---------------+
> | mmtests/sysbench-mutex  | sysbenchmutex-1 (usec)         | -0.02%        |
> |                         | sysbenchmutex-4 (usec)         | -0.02%        |
> |                         | sysbenchmutex-7 (usec)         | 0.00%         |
> |                         | sysbenchmutex-12 (usec)        | 0.12%         |
> |                         | sysbenchmutex-21 (usec)        | -0.40%        |
> |                         | sysbenchmutex-30 (usec)        | 0.08%         |
> |                         | sysbenchmutex-48 (usec)        | 2.59%         |
> |                         | sysbenchmutex-79 (usec)        | -0.80%        |
> |                         | sysbenchmutex-110 (usec)       | -3.87%        |
> |                         | sysbenchmutex-128 (usec)       | (I) -4.46%    |
> +-------------------------+--------------------------------+---------------+
>
>
>>  mm/khugepaged.c | 9 ++-------
>>  1 file changed, 2 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 4ec324a4c1fe..a0f1df2a7ae6 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -676,9 +676,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>  			writable = true;
>>  	}
>>  
>> -	if (unlikely(!writable)) {
>> -		result = SCAN_PAGE_RO;
>> -	} else if (unlikely(cc->is_khugepaged && !referenced)) {
>> +	if (unlikely(cc->is_khugepaged && !referenced)) {
>>  		result = SCAN_LACK_REFERENCED_PAGE;
>>  	} else {
>>  		result = SCAN_SUCCEED;
>> @@ -1421,9 +1419,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>  		     mmu_notifier_test_young(vma->vm_mm, _address)))
>>  			referenced++;
>>  	}
>> -	if (!writable) {
>> -		result = SCAN_PAGE_RO;
>> -	} else if (cc->is_khugepaged &&
>> +	if (cc->is_khugepaged &&
>>  		   (!referenced ||
>>  		    (unmapped && referenced < HPAGE_PMD_NR / 2))) {
>>  		result = SCAN_LACK_REFERENCED_PAGE;
>> @@ -2830,7 +2826,6 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>  		case SCAN_PMD_NULL:
>>  		case SCAN_PTE_NON_PRESENT:
>>  		case SCAN_PTE_UFFD_WP:
>> -		case SCAN_PAGE_RO:
>>  		case SCAN_LACK_REFERENCED_PAGE:
>>  		case SCAN_PAGE_NULL:
>>  		case SCAN_PAGE_COUNT:
>> --
>> 2.30.2
>>
> I guess you delay the final cleanup so you can combine it with tracepoint
> removal in next patch, not really sure why they're separate but meh not a
> big deal.
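
For reference, the MADV_COLLAPSE case discussed in this thread (a mapping whose PTEs are all read-only) can be exercised from userspace with something like the sketch below. It is only an illustration and not part of the patch. collapse_readonly_region() is a made-up helper name, the 2 MiB PMD size assumes 4K base pages, and the fallback MADV_COLLAPSE value is the one from the uapi headers. Going by the discussion, a kernel without this change would be expected to return EINVAL here, while a patched kernel should be able to return 0. The result also depends on THP being available, and if the fault path already installed a huge page (e.g. THP set to "always") there is nothing left to collapse and the check says little.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* value from the Linux uapi headers */
#endif

/* Assumption: 2 MiB PMD size (4K base pages); other configurations differ. */
#define HPAGE_SIZE (2UL << 20)

/*
 * Hypothetical helper: populate a PMD-sized anonymous region, make it
 * read-only, then request a synchronous collapse of that region.
 * Returns 0 on success, or the errno of the call that failed.
 */
int collapse_readonly_region(void)
{
	size_t len = 2 * HPAGE_SIZE;	/* over-allocate so we can align */
	int ret = 0;
	char *map, *aligned;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return errno;

	/* Round up to the next PMD boundary inside the mapping. */
	aligned = (char *)(((uintptr_t)map + HPAGE_SIZE - 1) &
			   ~(uintptr_t)(HPAGE_SIZE - 1));
	assert((uintptr_t)aligned % HPAGE_SIZE == 0);

	/* Write-fault the range so it is backed by real pages, not the zero page. */
	memset(aligned, 0xaa, HPAGE_SIZE);

	/* Drop write permission: the VMA and its PTEs are now read-only. */
	if (mprotect(aligned, HPAGE_SIZE, PROT_READ)) {
		ret = errno;
		goto out;
	}

	if (madvise(aligned, HPAGE_SIZE, MADV_COLLAPSE))
		ret = errno;
out:
	munmap(map, len);
	return ret;
}
```

Calling the helper from a small main() and printing strerror() of the return value on an unpatched and a patched kernel should be enough to compare the two behaviours.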