From: Lance Yang <lance.yang@linux.dev>
To: "David Hildenbrand (Arm)" <david@kernel.org>
Cc: Liam.Howlett@oracle.com, akpm@linux-foundation.org,
baohua@kernel.org, dev.jain@arm.com, harry.yoo@oracle.com,
jannh@google.com, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, lorenzo.stoakes@oracle.com, riel@surriel.com,
stable@kernel.org, vbabka@kernel.org
Subject: Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios
Date: Thu, 26 Feb 2026 18:28:23 +0800 [thread overview]
Message-ID: <dfd37e2c-d939-4682-a693-ae3941358fea@linux.dev> (raw)
In-Reply-To: <ece450e0-8e19-42bd-9a7a-fdfc7ab785e6@kernel.org>
On 2026/2/26 18:06, David Hildenbrand (Arm) wrote:
> On 2/26/26 08:09, Lance Yang wrote:
>>
>> On Tue, Feb 24, 2026 at 05:01:50PM +0100, David Hildenbrand (Arm) wrote:
>>> On 2/24/26 12:43, Lorenzo Stoakes wrote:
>>>>
>>>> Sorry I misread the original mail rushing through this is old... so this is less
>>>> pressing than I thought (for some reason I thought it was merged last cycle...!)
>>>> but it's a good example of how stuff can go unnoticed for a while.
>>>>
>>>> In that case maybe a revert is a bit much and we just want the simplest possible
>>>> fix for backporting.
>>>
>>> Dev volunteered to un-messify some of the stuff here. In particular, to
>>> extend batching to all cases, not just some hand-selected ones.
>>>
>>> Support for file folios is on the way.
>>>
>>>>
>>>> But is the proposed 'just assume wrprotect' sensible? David?
>>>
>>> In general, I think so. If PTEs were writable, they certainly have
>>> PAE set. The write-fault handler can fully recover from that (as PAE is
>>> set). If it's ever a performance problem (doubt), we can revisit.
>>>
>>> I'm wondering whether we should just perform the wrprotect earlier:
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 0f00570d1b9e..19b875ee3fad 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -2150,6 +2150,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>
>>> /* Nuke the page table entry. */
>>> pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
>>> +
>>> + /*
>>> + * Our batch might include writable and read-only
>>> + * PTEs. When we have to restore the mapping, just
>>> + * assume read-only to not accidentally upgrade
>>> + * write permissions for PTEs that must not be
>>> + * writable.
>>> + */
>>> + pteval = pte_wrprotect(pteval);
>>> +
>>> /*
>>> * We clear the PTE but do not flush so potentially
>>> * a remote CPU could still be writing to the folio
>>>
>>>
>>> Given that nobody asks for writability (pte_write()) later.
>>>
>>> Or does someone care?
>>>
>>> Staring at set_tlb_ubc_flush_pending()->pte_accessible() I am
>>> not 100% sure. Could pte_wrprotect() turn a PTE inaccessible on some
>>> architecture (write-only)? I don't think so.
>>>
>>>
>>> We have the following options:
>>>
>>> 1) pte_wrprotect(): fake that all was read-only.
>>>
>>> Either we do it like Dev suggests, or we do it as above early.
>>>
>>> The downside is that any code that might later want to know "was
>>> this possibly writable" would get that information. Well, it wouldn't
>>> get that information reliably *today* already (and that sounds a bit shaky).
>>
>> Makes sense to me :)
>>
>>> 2) Tell batching logic to honor pte_write()
>>>
>>> Sounds suboptimal for some cases that really don't care in the future.
>>>
>>> 3) Tell batching logic to tell us if any pte was writable: FPB_MERGE_WRITE
>>>
>>> ... then we know for sure whether any PTE was writable and we could
>>>
>>> (a) Pass it as we did before around to all checks, like pte_accessible().
>>>
>>> (b) Have an explicit restore PTE where we play save.
>>>
>>>
>>> I raised to Dev in private that softdirty handling is also shaky, as we
>>> batch over that. Meaning that we could lose or gain softdirty PTE bits in
>>> a batch.
>>
>> I guess we won't lose soft_dirty bits - only gain them (false positive):
>>
>> 1) get_and_clear_ptes() merges dirty bits from all PTEs via pte_mkdirty()
>> 2) pte_mkdirty() atomically sets both _PAGE_DIRTY and _PAGE_SOFT_DIRTY on
>> all architectures that support soft_dirty (x86, s390, powerpc, riscv)
>> 3) set_ptes() uses pte_advance_pfn() which keeps all flags intact
>>
>> So if any PTE in the batch was dirty, all PTEs become soft_dirty after
>> restore.
>
> PTEs can be softdirty without being dirty. That over-complicates the
> situation.
Ah, it's even trickier then :D
prev parent reply other threads:[~2026-02-26 10:28 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-02-24 11:09 Dev Jain
2026-02-24 11:31 ` Lorenzo Stoakes
2026-02-24 11:43 ` Lorenzo Stoakes
2026-02-24 16:01 ` David Hildenbrand (Arm)
2026-02-25 5:11 ` Dev Jain
2026-02-26 10:21 ` David Hildenbrand (Arm)
2026-02-26 10:27 ` Dev Jain
2026-02-26 7:01 ` Barry Song
2026-02-26 10:09 ` David Hildenbrand (Arm)
2026-02-26 10:20 ` Barry Song
2026-02-26 7:09 ` Lance Yang
2026-02-26 10:06 ` David Hildenbrand (Arm)
2026-02-26 10:28 ` Lance Yang [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=dfd37e2c-d939-4682-a693-ae3941358fea@linux.dev \
--to=lance.yang@linux.dev \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=baohua@kernel.org \
--cc=david@kernel.org \
--cc=dev.jain@arm.com \
--cc=harry.yoo@oracle.com \
--cc=jannh@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=riel@surriel.com \
--cc=stable@kernel.org \
--cc=vbabka@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox