On 28/10/25 3:10 am, David Hildenbrand wrote:
> On 10.06.25 05:50, Dev Jain wrote:
>> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
>> are painted with the contig bit, then ptep_get() will iterate through
>> all 16 entries to collect a/d bits. Hence this optimization will
>> result in a 16x reduction in the number of ptep_get() calls. Next,
>> ptep_get_and_clear() will eventually call contpte_try_unfold() on
>> every contig block, thus flushing the TLB for the complete large
>> folio range. Instead, use get_and_clear_full_ptes() so as to elide
>> TLBIs on each contig block, and only do them on the starting and
>> ending contig blocks.
>>
>> For split folios, there will be no pte batching; nr_ptes will be 1.
>> For pagetable splitting, the ptes will still point to the same large
>> folio; for arm64, this results in the optimization described above,
>> and for other arches (including the general case), a minor
>> improvement is expected due to a reduction in the number of function
>> calls.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> mm/mremap.c | 39 ++++++++++++++++++++++++++++++++-------
>> 1 file changed, 32 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 180b12225368..18b215521ada 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>> return pte;
>> }
>> +static int mremap_folio_pte_batch(struct vm_area_struct *vma,
>> +		unsigned long addr, pte_t *ptep, pte_t pte, int max_nr)
>> +{
>> + const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> + struct folio *folio;
>> +
>> + if (max_nr == 1)
>> + return 1;
>> +
>> + folio = vm_normal_folio(vma, addr, pte);
>> + if (!folio || !folio_test_large(folio))
>> + return 1;
>> +
>> + return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
>> + NULL, NULL);
>> +}
>
> Dev, I think there is another bug hiding in here. That function ignores
> the writable bit, which is not what you need here, in particular for
> anonymous folios in some cases.
>
> Later, set_ptes() could end up marking ptes writable that were not
> writable before, which is bad (at least for anonymous folios, maybe
> also for pagecache folios).
>
> I think you really must respect the writable bit through something like
> FPB_RESPECT_WRITE.
>
> I patched out the "pte_batch_hint(ptep, pte) == 1" check we have
> upstream to make it reproduce on x86_64, but the following reproducer
> should likely reproduce on aarch64 without further kernel
> modifications.
You are right, thanks! I recall that during the mremap/mprotect work I
had completely forgotten that batching by default ignores the writable
bit, and only remembered it during the last version of the mprotect
series :(

Thanks for providing a reproducer; for some reason I am unable to
reproduce it on my machine as-is.

(Writing from the Gmail app; it will probably mess up the HTML.)

I think leaving only a single page unshared is not sufficient. Try with
16, such that they will be pte-cont and make the initial hint check
happy.