* Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios
2026-02-24 16:01 ` David Hildenbrand (Arm)
@ 2026-02-25 5:11 ` Dev Jain
2026-02-26 10:21 ` David Hildenbrand (Arm)
2026-02-26 7:01 ` Barry Song
2026-02-26 7:09 ` Lance Yang
2 siblings, 1 reply; 13+ messages in thread
From: Dev Jain @ 2026-02-25 5:11 UTC (permalink / raw)
To: David Hildenbrand (Arm), Lorenzo Stoakes
Cc: akpm, riel, Liam.Howlett, vbabka, harry.yoo, jannh, baohua,
linux-mm, linux-kernel, stable
On 24/02/26 9:31 pm, David Hildenbrand (Arm) wrote:
> On 2/24/26 12:43, Lorenzo Stoakes wrote:
>> On Tue, Feb 24, 2026 at 11:31:24AM +0000, Lorenzo Stoakes wrote:
>>> Thanks Dev.
>>>
>>> Andrew - why was commit 354dffd29575 ("mm: support batched unmap for lazyfree
>>> large folios during reclamation") merged?
>>>
>>> It had enormous amounts of review commentary at
>>> https://lore.kernel.org/all/146b4cb1-aa1e-4519-9e03-f98cfb1135d2@redhat.com/ and
>>> no tags, this should be a signal to wait for a respin _at least_, and really if
>>> late in cycle suggests it should wait a cycle.
>>>
>>> I've said going forward I'm going to check THP series for tags and if not
>>> present NAK if they hit mm-stable, I guess I'll extend that to rmap also.
>>
>> Sorry I misread the original mail rushing through this is old... so this is less
>> pressing than I thought (for some reason I thought it was merged last cycle...!)
>> but it's a good example of how stuff can go unnoticed for a while.
>>
>> In that case maybe a revert is a bit much and we just want the simplest possible
>> fix for backporting.
>
> Dev volunteered to un-messify some of the stuff here. In particular, to
> extend batching to all cases, not just some hand-selected ones.
>
> Support for file folios is on the way.
Typo - anonymous non-lazyfree folios : )
>
>>
>> But is the proposed 'just assume wrprotect' sensible? David?
>
> In general, I think so. If PTEs were writable, they certainly have
> PAE set. The write-fault handler can fully recover from that (as PAE is
> set). If it's ever a performance problem (doubt), we can revisit.
>
> I'm wondering whether we should just perform the wrprotect earlier:
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0f00570d1b9e..19b875ee3fad 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2150,6 +2150,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>
> /* Nuke the page table entry. */
> pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
> +
> + /*
> + * Our batch might include writable and read-only
> + * PTEs. When we have to restore the mapping, just
> + * assume read-only to not accidentally upgrade
> + * write permissions for PTEs that must not be
> + * writable.
> + */
> + pteval = pte_wrprotect(pteval);
> +
> /*
> * We clear the PTE but do not flush so potentially
> * a remote CPU could still be writing to the folio
>
>
> Given that nobody asks for writability (pte_write()) later.
>
> Or does someone care?
>
> Staring at set_tlb_ubc_flush_pending()->pte_accessible() I am
> not 100% sure. Could pte_wrprotect() turn a PTE inaccessible on some
> architecture (write-only)? I don't think so.
>
>
> We have the following options:
>
> 1) pte_wrprotect(): fake that all was read-only.
>
> Either we do it like Dev suggests, or we do it as above early.
>
> The downside is that any code that might later want to know "was
> this possibly writable" would get that information. Well, it wouldn't
> get that information reliably *today* already (and that sounds a bit shaky).
I would vote for this, since if we were to follow the current patch,
the extension to anon folios will make it worse (pte_wrprotect at 5 places
- the 3 additional places being in the if conditions consisting of
folio_dup_swap, arch_unmap_one, folio_try_share_anon_rmap_pte)
The downside being that if we fail in this rmap path, the ptes are all
write-protected. But then the page is already there - the fault is going
to be processed fast.
>
> 2) Tell batching logic to honor pte_write()
>
> Sounds suboptimal for some cases that really don't care in the future.
>
> 3) Tell batching logic to tell us if any pte was writable: FPB_MERGE_WRITE
>
> ... then we know for sure whether any PTE was writable and we could
Well, we don't need this? The problem here is that we are making a decision
on the basis of the writability of the *first* pte of the batch - so if
the first pte is writable, only then we have the problem we have been
talking about.
We could have had a FPB_MERGE_WRPROTECT (which I know, is totally
incompatible with FPB_MERGE_WRITE) - that would tell whether at least one
pte in the patch was non-writable, in which case we will be able to avoid
the restoration of the entire batch to writeprotected if all the ptes
were writable (which I am assuming is the common case). But of course this
is not possible to do with the current shape of folio_pte_batch_flags. We
will have to revert the FPB_MERGE_* stuff to just collect the "at least one
is writable, at least one is dirty, at least one is young, at least one is
non-writable" etc information from the function and let the caller handle
it. That will kill all the work you did in simplifying that function :)
>
> (a) Pass it as we did before around to all checks, like pte_accessible().
>
> (b) Have an explicit restore PTE where we play save.
>
>
> I raised to Dev in private that softdirty handling is also shaky, as we
> batch over that. Meaning that we could lose or gain softdirty PTE bits in
> a batch.
>
> For lazyfree and file folios it doesn't really matter I guess. But it will
> matter once we unlock it for all anon folios.
>
>
> 1) sounds simplest, 3) sounds cleanest long-term.
>
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios
2026-02-25 5:11 ` Dev Jain
@ 2026-02-26 10:21 ` David Hildenbrand (Arm)
2026-02-26 10:27 ` Dev Jain
0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-26 10:21 UTC (permalink / raw)
To: Dev Jain, Lorenzo Stoakes
Cc: akpm, riel, Liam.Howlett, vbabka, harry.yoo, jannh, baohua,
linux-mm, linux-kernel, stable
On 2/25/26 06:11, Dev Jain wrote:
>
>
> On 24/02/26 9:31 pm, David Hildenbrand (Arm) wrote:
>> On 2/24/26 12:43, Lorenzo Stoakes wrote:
>>>
>>> Sorry I misread the original mail rushing through this is old... so this is less
>>> pressing than I thought (for some reason I thought it was merged last cycle...!)
>>> but it's a good example of how stuff can go unnoticed for a while.
>>>
>>> In that case maybe a revert is a bit much and we just want the simplest possible
>>> fix for backporting.
>>
>> Dev volunteered to un-messify some of the stuff here. In particular, to
>> extend batching to all cases, not just some hand-selected ones.
>>
>> Support for file folios is on the way.
>
> Typo - anonymous non-lazyfree folios : )
Heh, no, not what I meant. We do have file folio support on the way (see
the other patch set).
>
>>
>>>
>>> But is the proposed 'just assume wrprotect' sensible? David?
>>
>> In general, I think so. If PTEs were writable, they certainly have
>> PAE set. The write-fault handler can fully recover from that (as PAE is
>> set). If it's ever a performance problem (doubt), we can revisit.
>>
>> I'm wondering whether we should just perform the wrprotect earlier:
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 0f00570d1b9e..19b875ee3fad 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2150,6 +2150,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>
>> /* Nuke the page table entry. */
>> pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
>> +
>> + /*
>> + * Our batch might include writable and read-only
>> + * PTEs. When we have to restore the mapping, just
>> + * assume read-only to not accidentally upgrade
>> + * write permissions for PTEs that must not be
>> + * writable.
>> + */
>> + pteval = pte_wrprotect(pteval);
>> +
>> /*
>> * We clear the PTE but do not flush so potentially
>> * a remote CPU could still be writing to the folio
>>
>>
>> Given that nobody asks for writability (pte_write()) later.
>>
>> Or does someone care?
>>
>> Staring at set_tlb_ubc_flush_pending()->pte_accessible() I am
>> not 100% sure. Could pte_wrprotect() turn a PTE inaccessible on some
>> architecture (write-only)? I don't think so.
>>
>>
>> We have the following options:
>>
>> 1) pte_wrprotect(): fake that all was read-only.
>>
>> Either we do it like Dev suggests, or we do it as above early.
>>
>> The downside is that any code that might later want to know "was
>> this possibly writable" would get that information. Well, it wouldn't
>> get that information reliably *today* already (and that sounds a bit shaky).
>
> I would vote for this, since if we were to follow the current patch,
> the extension to anon folios will make it worse (pte_wrprotect at 5 places
> - the 3 additional places being in the if conditions consisting of
> folio_dup_swap, arch_unmap_one, folio_try_share_anon_rmap_pte)
> The downside being that if we fail in this rmap path, the ptes are all
> write-protected. But then the page is already there - the fault is going
> to be processed fast.
Right, we should only have a single "revert pte", and not have to redo
that from multiple locations.
>
>>
>> 2) Tell batching logic to honor pte_write()
>>
>> Sounds suboptimal for some cases that really don't care in the future.
As per discussion with Barry, we might just want to do that now as an
easy and obviously correct fix.
It's a shame we stop being able to use folio_pte_batch() and have to
create an inlined version.
>>
>> 3) Tell batching logic to tell us if any pte was writable: FPB_MERGE_WRITE
>>
>> ... then we know for sure whether any PTE was writable and we could
>
> Well, we don't need this? The problem here is that we are making a decision
> on the basis of the writability of the *first* pte of the batch - so if
> the first pte is writable, only then we have the problem we have been
> talking about.
That's what I was referring above as "being shaky".
Some code has to be taught that "there is something writable here, so
assume it was accessible in a certain way", other code has to be taught
that "there is something read-only here, so make sure you don't
accidentally make something writable".
One way to handle it is to say that "the resulting pte is writable, so
assume it was accessible", to then say "but just assume it is read-only
as we are not sure whether everything is writable".
>
> We could have had a FPB_MERGE_WRPROTECT (which I know, is totally
> incompatible with FPB_MERGE_WRITE) - that would tell whether at least one
> pte in the patch was non-writable, in which case we will be able to avoid
> the restoration of the entire batch to writeprotected if all the ptes
> were writable (which I am assuming is the common case). But of course this
> is not possible to do with the current shape of folio_pte_batch_flags. We
> will have to revert the FPB_MERGE_* stuff to just collect the "at least one
> is writable, at least one is dirty, at least one is young, at least one is
> non-writable" etc information from the function and let the caller handle
> it. That will kill all the work you did in simplifying that function :)
Yeah, let's not go down that path. :)
To fix what we currently have in the tree, probably we should really
just set FPB_RESPECT_WRITE|FPB_RESPECT_SOFT_DIRTY, saying that this is
"obviously correct", and revisit it once we expect more cases where
batching over these PTEs would provide more value.
For lazyfree, likely it doesn't make a difference.
--
Cheers,
David
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios
2026-02-26 10:21 ` David Hildenbrand (Arm)
@ 2026-02-26 10:27 ` Dev Jain
0 siblings, 0 replies; 13+ messages in thread
From: Dev Jain @ 2026-02-26 10:27 UTC (permalink / raw)
To: David Hildenbrand (Arm), Lorenzo Stoakes
Cc: akpm, riel, Liam.Howlett, vbabka, harry.yoo, jannh, baohua,
linux-mm, linux-kernel, stable
On 26/02/26 3:51 pm, David Hildenbrand (Arm) wrote:
> On 2/25/26 06:11, Dev Jain wrote:
>>
>>
>> On 24/02/26 9:31 pm, David Hildenbrand (Arm) wrote:
>>> On 2/24/26 12:43, Lorenzo Stoakes wrote:
>>>>
>>>> Sorry I misread the original mail rushing through this is old... so this is less
>>>> pressing than I thought (for some reason I thought it was merged last cycle...!)
>>>> but it's a good example of how stuff can go unnoticed for a while.
>>>>
>>>> In that case maybe a revert is a bit much and we just want the simplest possible
>>>> fix for backporting.
>>>
>>> Dev volunteered to un-messify some of the stuff here. In particular, to
>>> extend batching to all cases, not just some hand-selected ones.
>>>
>>> Support for file folios is on the way.
>>
>> Typo - anonymous non-lazyfree folios : )
>
> Heh, no, not what I meant. We do have file folio support on the way (see
> the other patch set).
Ah I thought that got merged already : )
>
>>
>>>
>>>>
>>>> But is the proposed 'just assume wrprotect' sensible? David?
>>>
>>> In general, I think so. If PTEs were writable, they certainly have
>>> PAE set. The write-fault handler can fully recover from that (as PAE is
>>> set). If it's ever a performance problem (doubt), we can revisit.
>>>
>>> I'm wondering whether we should just perform the wrprotect earlier:
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 0f00570d1b9e..19b875ee3fad 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -2150,6 +2150,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>
>>> /* Nuke the page table entry. */
>>> pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
>>> +
>>> + /*
>>> + * Our batch might include writable and read-only
>>> + * PTEs. When we have to restore the mapping, just
>>> + * assume read-only to not accidentally upgrade
>>> + * write permissions for PTEs that must not be
>>> + * writable.
>>> + */
>>> + pteval = pte_wrprotect(pteval);
>>> +
>>> /*
>>> * We clear the PTE but do not flush so potentially
>>> * a remote CPU could still be writing to the folio
>>>
>>>
>>> Given that nobody asks for writability (pte_write()) later.
>>>
>>> Or does someone care?
>>>
>>> Staring at set_tlb_ubc_flush_pending()->pte_accessible() I am
>>> not 100% sure. Could pte_wrprotect() turn a PTE inaccessible on some
>>> architecture (write-only)? I don't think so.
>>>
>>>
>>> We have the following options:
>>>
>>> 1) pte_wrprotect(): fake that all was read-only.
>>>
>>> Either we do it like Dev suggests, or we do it as above early.
>>>
>>> The downside is that any code that might later want to know "was
>>> this possibly writable" would get that information. Well, it wouldn't
>>> get that information reliably *today* already (and that sounds a bit shaky).
>>
>> I would vote for this, since if we were to follow the current patch,
>> the extension to anon folios will make it worse (pte_wrprotect at 5 places
>> - the 3 additional places being in the if conditions consisting of
>> folio_dup_swap, arch_unmap_one, folio_try_share_anon_rmap_pte)
>> The downside being that if we fail in this rmap path, the ptes are all
>> write-protected. But then the page is already there - the fault is going
>> to be processed fast.
>
> Right, we should only have a single "revert pte", and not have to redo
> that from multiple locations.
>
>>
>>>
>>> 2) Tell batching logic to honor pte_write()
>>>
>>> Sounds suboptimal for some cases that really don't care in the future.
>
> As per discussion with Barry, we might just want to do that now as an
> easy and obviously correct fix.
>
> It's a shame we stop being able to use folio_pte_batch() and have to
> create an inlined version.
>
>>>
>>> 3) Tell batching logic to tell us if any pte was writable: FPB_MERGE_WRITE
>>>
>>> ... then we know for sure whether any PTE was writable and we could
>>
>> Well, we don't need this? The problem here is that we are making a decision
>> on the basis of the writability of the *first* pte of the batch - so if
>> the first pte is writable, only then we have the problem we have been
>> talking about.
>
> That's what I was referring above as "being shaky".
>
> Some code has to be taught that "there is something writable here, so
> assume it was accessible in a certain way", other code has to be taught
> that "there is something read-only here, so make sure you don't
> accidentally make something writable".
>
> One way to handle it is to say that "the resulting pte is writable, so
> assume it was accessible", to then say "but just assume it is read-only
> as we are not sure whether everything is writable".
>
>>
>> We could have had a FPB_MERGE_WRPROTECT (which I know, is totally
>> incompatible with FPB_MERGE_WRITE) - that would tell whether at least one
>> pte in the patch was non-writable, in which case we will be able to avoid
>> the restoration of the entire batch to writeprotected if all the ptes
>> were writable (which I am assuming is the common case). But of course this
>> is not possible to do with the current shape of folio_pte_batch_flags. We
>> will have to revert the FPB_MERGE_* stuff to just collect the "at least one
>> is writable, at least one is dirty, at least one is young, at least one is
>> non-writable" etc information from the function and let the caller handle
>> it. That will kill all the work you did in simplifying that function :)
> Yeah, let's not go down that path. :)
>
> To fix what we currently have in the tree, probably we should really
> just set FPB_RESPECT_WRITE|FPB_RESPECT_SOFT_DIRTY, saying that this is
> "obviously correct", and revisit it once we expect more cases where
> batching over these PTEs would provide more value.
Yup makes sense, I'll do this.
>
> For lazyfree, likely it doesn't make a difference.
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios
2026-02-24 16:01 ` David Hildenbrand (Arm)
2026-02-25 5:11 ` Dev Jain
@ 2026-02-26 7:01 ` Barry Song
2026-02-26 10:09 ` David Hildenbrand (Arm)
2026-02-26 7:09 ` Lance Yang
2 siblings, 1 reply; 13+ messages in thread
From: Barry Song @ 2026-02-26 7:01 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Lorenzo Stoakes, Dev Jain, akpm, riel, Liam.Howlett, vbabka,
harry.yoo, jannh, linux-mm, linux-kernel, stable
On Wed, Feb 25, 2026 at 12:01 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 2/24/26 12:43, Lorenzo Stoakes wrote:
> > On Tue, Feb 24, 2026 at 11:31:24AM +0000, Lorenzo Stoakes wrote:
> >> Thanks Dev.
> >>
> >> Andrew - why was commit 354dffd29575 ("mm: support batched unmap for lazyfree
> >> large folios during reclamation") merged?
> >>
> >> It had enormous amounts of review commentary at
> >> https://lore.kernel.org/all/146b4cb1-aa1e-4519-9e03-f98cfb1135d2@redhat.com/ and
> >> no tags, this should be a signal to wait for a respin _at least_, and really if
> >> late in cycle suggests it should wait a cycle.
> >>
> >> I've said going forward I'm going to check THP series for tags and if not
> >> present NAK if they hit mm-stable, I guess I'll extend that to rmap also.
> >
> > Sorry I misread the original mail rushing through this is old... so this is less
> > pressing than I thought (for some reason I thought it was merged last cycle...!)
> > but it's a good example of how stuff can go unnoticed for a while.
> >
> > In that case maybe a revert is a bit much and we just want the simplest possible
> > fix for backporting.
Apologies for the mess I caused, and thanks to Dev for catching this bug.
>
> Dev volunteered to un-messify some of the stuff here. In particular, to
> extend batching to all cases, not just some hand-selected ones.
>
> Support for file folios is on the way.
>
> >
> > But is the proposed 'just assume wrprotect' sensible? David?
>
> In general, I think so. If PTEs were writable, they certainly have
> PAE set. The write-fault handler can fully recover from that (as PAE is
> set). If it's ever a performance problem (doubt), we can revisit.
>
> I'm wondering whether we should just perform the wrprotect earlier:
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0f00570d1b9e..19b875ee3fad 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2150,6 +2150,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>
> /* Nuke the page table entry. */
> pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
> +
> + /*
> + * Our batch might include writable and read-only
> + * PTEs. When we have to restore the mapping, just
> + * assume read-only to not accidentally upgrade
> + * write permissions for PTEs that must not be
> + * writable.
> + */
> + pteval = pte_wrprotect(pteval);
> +
> /*
> * We clear the PTE but do not flush so potentially
> * a remote CPU could still be writing to the folio
>
>
> Given that nobody asks for writability (pte_write()) later.
>
> Or does someone care?
>
> Staring at set_tlb_ubc_flush_pending()->pte_accessible() I am
> not 100% sure. Could pte_wrprotect() turn a PTE inaccessible on some
> architecture (write-only)? I don't think so.
>
>
> We have the following options:
>
> 1) pte_wrprotect(): fake that all was read-only.
>
> Either we do it like Dev suggests, or we do it as above early.
>
> The downside is that any code that might later want to know "was
> this possibly writable" would get that information. Well, it wouldn't
> get that information reliably *today* already (and that sounds a bit shaky).
>
> 2) Tell batching logic to honor pte_write()
>
> Sounds suboptimal for some cases that really don't care in the future.
I'm still curious what the downside would be to applying the
simple fix instead of introducing more "hacks". As I assume,
cases where a folio has both writable and non-writable PTEs
are not common?
diff --git a/mm/rmap.c b/mm/rmap.c
index bff8f222004e..48ad3435593a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1955,7 +1955,7 @@ static inline unsigned int
folio_unmap_pte_batch(struct folio *folio,
if (userfaultfd_wp(vma))
return 1;
- return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
+ return folio_pte_batch_flags(folio, NULL, pvmw->pte, &pte,
max_nr, FPB_RESPECT_WRITE);
}
>
> 3) Tell batching logic to tell us if any pte was writable: FPB_MERGE_WRITE
>
> ... then we know for sure whether any PTE was writable and we could
>
> (a) Pass it as we did before around to all checks, like pte_accessible().
>
> (b) Have an explicit restore PTE where we play save.
>
>
> I raised to Dev in private that softdirty handling is also shaky, as we
> batch over that. Meaning that we could lose or gain softdirty PTE bits in
> a batch.
>
> For lazyfree and file folios it doesn't really matter I guess. But it will
> matter once we unlock it for all anon folios.
>
>
> 1) sounds simplest, 3) sounds cleanest long-term.
Thanks
Barry
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios
2026-02-26 7:01 ` Barry Song
@ 2026-02-26 10:09 ` David Hildenbrand (Arm)
2026-02-26 10:20 ` Barry Song
0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-26 10:09 UTC (permalink / raw)
To: Barry Song
Cc: Lorenzo Stoakes, Dev Jain, akpm, riel, Liam.Howlett, vbabka,
harry.yoo, jannh, linux-mm, linux-kernel, stable
On 2/26/26 08:01, Barry Song wrote:
> On Wed, Feb 25, 2026 at 12:01 AM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
>>
>> On 2/24/26 12:43, Lorenzo Stoakes wrote:
>>>
>>> Sorry I misread the original mail rushing through this is old... so this is less
>>> pressing than I thought (for some reason I thought it was merged last cycle...!)
>>> but it's a good example of how stuff can go unnoticed for a while.
>>>
>>> In that case maybe a revert is a bit much and we just want the simplest possible
>>> fix for backporting.
>
> Apologies for the mess I caused, and thanks to Dev for catching this bug.
>
>>
>> Dev volunteered to un-messify some of the stuff here. In particular, to
>> extend batching to all cases, not just some hand-selected ones.
>>
>> Support for file folios is on the way.
>>
>>>
>>> But is the proposed 'just assume wrprotect' sensible? David?
>>
>> In general, I think so. If PTEs were writable, they certainly have
>> PAE set. The write-fault handler can fully recover from that (as PAE is
>> set). If it's ever a performance problem (doubt), we can revisit.
>>
>> I'm wondering whether we should just perform the wrprotect earlier:
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 0f00570d1b9e..19b875ee3fad 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2150,6 +2150,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>
>> /* Nuke the page table entry. */
>> pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
>> +
>> + /*
>> + * Our batch might include writable and read-only
>> + * PTEs. When we have to restore the mapping, just
>> + * assume read-only to not accidentally upgrade
>> + * write permissions for PTEs that must not be
>> + * writable.
>> + */
>> + pteval = pte_wrprotect(pteval);
>> +
>> /*
>> * We clear the PTE but do not flush so potentially
>> * a remote CPU could still be writing to the folio
>>
>>
>> Given that nobody asks for writability (pte_write()) later.
>>
>> Or does someone care?
>>
>> Staring at set_tlb_ubc_flush_pending()->pte_accessible() I am
>> not 100% sure. Could pte_wrprotect() turn a PTE inaccessible on some
>> architecture (write-only)? I don't think so.
>>
>>
>> We have the following options:
>>
>> 1) pte_wrprotect(): fake that all was read-only.
>>
>> Either we do it like Dev suggests, or we do it as above early.
>>
>> The downside is that any code that might later want to know "was
>> this possibly writable" would get that information. Well, it wouldn't
>> get that information reliably *today* already (and that sounds a bit shaky).
>>
>> 2) Tell batching logic to honor pte_write()
>>
>> Sounds suboptimal for some cases that really don't care in the future.
>
> I'm still curious what the downside would be to applying the
> simple fix instead of introducing more "hacks". As I assume,
> cases where a folio has both writable and non-writable PTEs
> are not common?
With "in the future" I thought about file folios, where I'd assume ti
could happen more often.
For lazyfree, I agree.
In the end, batching as much as possible is nice, but obviously, once it
gets too shaky in corner cases we might not care that much.
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index bff8f222004e..48ad3435593a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1955,7 +1955,7 @@ static inline unsigned int
> folio_unmap_pte_batch(struct folio *folio,
> if (userfaultfd_wp(vma))
> return 1;
>
> - return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
> + return folio_pte_batch_flags(folio, NULL, pvmw->pte, &pte,
> max_nr, FPB_RESPECT_WRITE);
> }
If we already go for this approach assume we should then just set
FPB_RESPECT_SOFT_DIRTY as well and have it all handled properly.
--
Cheers,
David
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios
2026-02-26 10:09 ` David Hildenbrand (Arm)
@ 2026-02-26 10:20 ` Barry Song
0 siblings, 0 replies; 13+ messages in thread
From: Barry Song @ 2026-02-26 10:20 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Lorenzo Stoakes, Dev Jain, akpm, riel, Liam.Howlett, vbabka,
harry.yoo, jannh, linux-mm, linux-kernel, stable
On Thu, Feb 26, 2026 at 6:09 PM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 2/26/26 08:01, Barry Song wrote:
> > On Wed, Feb 25, 2026 at 12:01 AM David Hildenbrand (Arm)
> > <david@kernel.org> wrote:
> >>
> >> On 2/24/26 12:43, Lorenzo Stoakes wrote:
> >>>
> >>> Sorry I misread the original mail rushing through this is old... so this is less
> >>> pressing than I thought (for some reason I thought it was merged last cycle...!)
> >>> but it's a good example of how stuff can go unnoticed for a while.
> >>>
> >>> In that case maybe a revert is a bit much and we just want the simplest possible
> >>> fix for backporting.
> >
> > Apologies for the mess I caused, and thanks to Dev for catching this bug.
> >
> >>
> >> Dev volunteered to un-messify some of the stuff here. In particular, to
> >> extend batching to all cases, not just some hand-selected ones.
> >>
> >> Support for file folios is on the way.
> >>
> >>>
> >>> But is the proposed 'just assume wrprotect' sensible? David?
> >>
> >> In general, I think so. If PTEs were writable, they certainly have
> >> PAE set. The write-fault handler can fully recover from that (as PAE is
> >> set). If it's ever a performance problem (doubt), we can revisit.
> >>
> >> I'm wondering whether we should just perform the wrprotect earlier:
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 0f00570d1b9e..19b875ee3fad 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -2150,6 +2150,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>
> >> /* Nuke the page table entry. */
> >> pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
> >> +
> >> + /*
> >> + * Our batch might include writable and read-only
> >> + * PTEs. When we have to restore the mapping, just
> >> + * assume read-only to not accidentally upgrade
> >> + * write permissions for PTEs that must not be
> >> + * writable.
> >> + */
> >> + pteval = pte_wrprotect(pteval);
> >> +
> >> /*
> >> * We clear the PTE but do not flush so potentially
> >> * a remote CPU could still be writing to the folio
> >>
> >>
> >> Given that nobody asks for writability (pte_write()) later.
> >>
> >> Or does someone care?
> >>
> >> Staring at set_tlb_ubc_flush_pending()->pte_accessible() I am
> >> not 100% sure. Could pte_wrprotect() turn a PTE inaccessible on some
> >> architecture (write-only)? I don't think so.
> >>
> >>
> >> We have the following options:
> >>
> >> 1) pte_wrprotect(): fake that all was read-only.
> >>
> >> Either we do it like Dev suggests, or we do it as above early.
> >>
> >> The downside is that any code that might later want to know "was
> >> this possibly writable" would get that information. Well, it wouldn't
> >> get that information reliably *today* already (and that sounds a bit shaky).
> >>
> >> 2) Tell batching logic to honor pte_write()
> >>
> >> Sounds suboptimal for some cases that really don't care in the future.
> >
> > I'm still curious what the downside would be to applying the
> > simple fix instead of introducing more "hacks". As I assume,
> > cases where a folio has both writable and non-writable PTEs
> > are not common?
>
> With "in the future" I thought about file folios, where I'd assume ti
> could happen more often.
>
> For lazyfree, I agree.
>
> In the end, batching as much as possible is nice, but obviously, once it
> gets too shaky in corner cases we might not care that much.
Assuming 90% of folios have consistent PTEs, perhaps we don’t
need to worry too much about the remaining 10% of inconsistent
folios. We’ve already gained performance benefits for the
consistent 90%, and while the remaining 10% may not receive the
full batch, they are still getting some batching.
I don’t have the exact number, but it’s likely 90% or higher :-)
>
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index bff8f222004e..48ad3435593a 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1955,7 +1955,7 @@ static inline unsigned int
> > folio_unmap_pte_batch(struct folio *folio,
> > if (userfaultfd_wp(vma))
> > return 1;
> >
> > - return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
> > + return folio_pte_batch_flags(folio, NULL, pvmw->pte, &pte,
> > max_nr, FPB_RESPECT_WRITE);
> > }
>
> If we already go for this approach assume we should then just set
> FPB_RESPECT_SOFT_DIRTY as well and have it all handled properly.
I would vote for this, as supporting those inconsistent PTE
cases could become an endless and painful task :-)
Thanks
Barry
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios
2026-02-24 16:01 ` David Hildenbrand (Arm)
2026-02-25 5:11 ` Dev Jain
2026-02-26 7:01 ` Barry Song
@ 2026-02-26 7:09 ` Lance Yang
2026-02-26 10:06 ` David Hildenbrand (Arm)
2 siblings, 1 reply; 13+ messages in thread
From: Lance Yang @ 2026-02-26 7:09 UTC (permalink / raw)
To: david
Cc: Liam.Howlett, akpm, baohua, dev.jain, harry.yoo, jannh,
linux-kernel, linux-mm, lorenzo.stoakes, riel, stable, vbabka,
Lance Yang
On Tue, Feb 24, 2026 at 05:01:50PM +0100, David Hildenbrand (Arm) wrote:
>On 2/24/26 12:43, Lorenzo Stoakes wrote:
>> On Tue, Feb 24, 2026 at 11:31:24AM +0000, Lorenzo Stoakes wrote:
>>> Thanks Dev.
>>>
>>> Andrew - why was commit 354dffd29575 ("mm: support batched unmap for lazyfree
>>> large folios during reclamation") merged?
>>>
>>> It had enormous amounts of review commentary at
>>> https://lore.kernel.org/all/146b4cb1-aa1e-4519-9e03-f98cfb1135d2@redhat.com/ and
>>> no tags, this should be a signal to wait for a respin _at least_, and really if
>>> late in cycle suggests it should wait a cycle.
>>>
>>> I've said going forward I'm going to check THP series for tags and if not
>>> present NAK if they hit mm-stable, I guess I'll extend that to rmap also.
>>
>> Sorry I misread the original mail rushing through this is old... so this is less
>> pressing than I thought (for some reason I thought it was merged last cycle...!)
>> but it's a good example of how stuff can go unnoticed for a while.
>>
>> In that case maybe a revert is a bit much and we just want the simplest possible
>> fix for backporting.
>
>Dev volunteered to un-messify some of the stuff here. In particular, to
>extend batching to all cases, not just some hand-selected ones.
>
>Support for file folios is on the way.
>
>>
>> But is the proposed 'just assume wrprotect' sensible? David?
>
>In general, I think so. If PTEs were writable, they certainly have
>PAE set. The write-fault handler can fully recover from that (as PAE is
>set). If it's ever a performance problem (doubt), we can revisit.
>
>I'm wondering whether we should just perform the wrprotect earlier:
>
>diff --git a/mm/rmap.c b/mm/rmap.c
>index 0f00570d1b9e..19b875ee3fad 100644
>--- a/mm/rmap.c
>+++ b/mm/rmap.c
>@@ -2150,6 +2150,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>
> /* Nuke the page table entry. */
> pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
>+
>+ /*
>+ * Our batch might include writable and read-only
>+ * PTEs. When we have to restore the mapping, just
>+ * assume read-only to not accidentally upgrade
>+ * write permissions for PTEs that must not be
>+ * writable.
>+ */
>+ pteval = pte_wrprotect(pteval);
>+
> /*
> * We clear the PTE but do not flush so potentially
> * a remote CPU could still be writing to the folio
>
>
>Given that nobody asks for writability (pte_write()) later.
>
>Or does someone care?
>
>Staring at set_tlb_ubc_flush_pending()->pte_accessible() I am
>not 100% sure. Could pte_wrprotect() turn a PTE inaccessible on some
>architecture (write-only)? I don't think so.
>
>
>We have the following options:
>
>1) pte_wrprotect(): fake that all was read-only.
>
>Either we do it like Dev suggests, or we do it as above early.
>
>The downside is that any code that might later want to know "was
>this possibly writable" would get that information. Well, it wouldn't
>get that information reliably *today* already (and that sounds a bit shaky).
Makes sense to me :)
>2) Tell batching logic to honor pte_write()
>
>Sounds suboptimal for some cases that really don't care in the future.
>
>3) Tell batching logic to tell us if any pte was writable: FPB_MERGE_WRITE
>
>... then we know for sure whether any PTE was writable and we could
>
>(a) Pass it as we did before around to all checks, like pte_accessible().
>
>(b) Have an explicit restore PTE where we play save.
>
>
>I raised to Dev in private that softdirty handling is also shaky, as we
>batch over that. Meaning that we could lose or gain softdirty PTE bits in
>a batch.
I guess we won't lose soft_dirty bits - only gain them (false positive):
1) get_and_clear_ptes() merges dirty bits from all PTEs via pte_mkdirty()
2) pte_mkdirty() atomically sets both _PAGE_DIRTY and _PAGE_SOFT_DIRTY on
all architectures that support soft_dirty (x86, s390, powerpc, riscv)
3) set_ptes() uses pte_advance_pfn() which keeps all flags intact
So if any PTE in the batch was dirty, all PTEs become soft_dirty after
restore.
>For lazyfree and file folios it doesn't really matter I guess. But it will
>matter once we unlock it for all anon folios.
>
>
>1) sounds simplest, 3) sounds cleanest long-term.
>
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios
2026-02-26 7:09 ` Lance Yang
@ 2026-02-26 10:06 ` David Hildenbrand (Arm)
2026-02-26 10:28 ` Lance Yang
0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-26 10:06 UTC (permalink / raw)
To: Lance Yang
Cc: Liam.Howlett, akpm, baohua, dev.jain, harry.yoo, jannh,
linux-kernel, linux-mm, lorenzo.stoakes, riel, stable, vbabka
On 2/26/26 08:09, Lance Yang wrote:
>
> On Tue, Feb 24, 2026 at 05:01:50PM +0100, David Hildenbrand (Arm) wrote:
>> On 2/24/26 12:43, Lorenzo Stoakes wrote:
>>>
>>> Sorry I misread the original mail rushing through this is old... so this is less
>>> pressing than I thought (for some reason I thought it was merged last cycle...!)
>>> but it's a good example of how stuff can go unnoticed for a while.
>>>
>>> In that case maybe a revert is a bit much and we just want the simplest possible
>>> fix for backporting.
>>
>> Dev volunteered to un-messify some of the stuff here. In particular, to
>> extend batching to all cases, not just some hand-selected ones.
>>
>> Support for file folios is on the way.
>>
>>>
>>> But is the proposed 'just assume wrprotect' sensible? David?
>>
>> In general, I think so. If PTEs were writable, they certainly have
>> PAE set. The write-fault handler can fully recover from that (as PAE is
>> set). If it's ever a performance problem (doubt), we can revisit.
>>
>> I'm wondering whether we should just perform the wrprotect earlier:
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 0f00570d1b9e..19b875ee3fad 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2150,6 +2150,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>
>> /* Nuke the page table entry. */
>> pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
>> +
>> + /*
>> + * Our batch might include writable and read-only
>> + * PTEs. When we have to restore the mapping, just
>> + * assume read-only to not accidentally upgrade
>> + * write permissions for PTEs that must not be
>> + * writable.
>> + */
>> + pteval = pte_wrprotect(pteval);
>> +
>> /*
>> * We clear the PTE but do not flush so potentially
>> * a remote CPU could still be writing to the folio
>>
>>
>> Given that nobody asks for writability (pte_write()) later.
>>
>> Or does someone care?
>>
>> Staring at set_tlb_ubc_flush_pending()->pte_accessible() I am
>> not 100% sure. Could pte_wrprotect() turn a PTE inaccessible on some
>> architecture (write-only)? I don't think so.
>>
>>
>> We have the following options:
>>
>> 1) pte_wrprotect(): fake that all was read-only.
>>
>> Either we do it like Dev suggests, or we do it as above early.
>>
>> The downside is that any code that might later want to know "was
>> this possibly writable" would get that information. Well, it wouldn't
>> get that information reliably *today* already (and that sounds a bit shaky).
>
> Makes sense to me :)
>
>> 2) Tell batching logic to honor pte_write()
>>
>> Sounds suboptimal for some cases that really don't care in the future.
>>
>> 3) Tell batching logic to tell us if any pte was writable: FPB_MERGE_WRITE
>>
>> ... then we know for sure whether any PTE was writable and we could
>>
>> (a) Pass it as we did before around to all checks, like pte_accessible().
>>
>> (b) Have an explicit restore PTE where we play save.
>>
>>
>> I raised to Dev in private that softdirty handling is also shaky, as we
>> batch over that. Meaning that we could lose or gain softdirty PTE bits in
>> a batch.
>
> I guess we won't lose soft_dirty bits - only gain them (false positive):
>
> 1) get_and_clear_ptes() merges dirty bits from all PTEs via pte_mkdirty()
> 2) pte_mkdirty() atomically sets both _PAGE_DIRTY and _PAGE_SOFT_DIRTY on
> all architectures that support soft_dirty (x86, s390, powerpc, riscv)
> 3) set_ptes() uses pte_advance_pfn() which keeps all flags intact
>
> So if any PTE in the batch was dirty, all PTEs become soft_dirty after
> restore.
PTEs can be softdirty without being dirty. That over-complicates the
situation.
--
Cheers,
David
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios
2026-02-26 10:06 ` David Hildenbrand (Arm)
@ 2026-02-26 10:28 ` Lance Yang
0 siblings, 0 replies; 13+ messages in thread
From: Lance Yang @ 2026-02-26 10:28 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Liam.Howlett, akpm, baohua, dev.jain, harry.yoo, jannh,
linux-kernel, linux-mm, lorenzo.stoakes, riel, stable, vbabka
On 2026/2/26 18:06, David Hildenbrand (Arm) wrote:
> On 2/26/26 08:09, Lance Yang wrote:
>>
>> On Tue, Feb 24, 2026 at 05:01:50PM +0100, David Hildenbrand (Arm) wrote:
>>> On 2/24/26 12:43, Lorenzo Stoakes wrote:
>>>>
>>>> Sorry I misread the original mail rushing through this is old... so this is less
>>>> pressing than I thought (for some reason I thought it was merged last cycle...!)
>>>> but it's a good example of how stuff can go unnoticed for a while.
>>>>
>>>> In that case maybe a revert is a bit much and we just want the simplest possible
>>>> fix for backporting.
>>>
>>> Dev volunteered to un-messify some of the stuff here. In particular, to
>>> extend batching to all cases, not just some hand-selected ones.
>>>
>>> Support for file folios is on the way.
>>>
>>>>
>>>> But is the proposed 'just assume wrprotect' sensible? David?
>>>
>>> In general, I think so. If PTEs were writable, they certainly have
>>> PAE set. The write-fault handler can fully recover from that (as PAE is
>>> set). If it's ever a performance problem (doubt), we can revisit.
>>>
>>> I'm wondering whether we should just perform the wrprotect earlier:
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 0f00570d1b9e..19b875ee3fad 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -2150,6 +2150,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>
>>> /* Nuke the page table entry. */
>>> pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
>>> +
>>> + /*
>>> + * Our batch might include writable and read-only
>>> + * PTEs. When we have to restore the mapping, just
>>> + * assume read-only to not accidentally upgrade
>>> + * write permissions for PTEs that must not be
>>> + * writable.
>>> + */
>>> + pteval = pte_wrprotect(pteval);
>>> +
>>> /*
>>> * We clear the PTE but do not flush so potentially
>>> * a remote CPU could still be writing to the folio
>>>
>>>
>>> Given that nobody asks for writability (pte_write()) later.
>>>
>>> Or does someone care?
>>>
>>> Staring at set_tlb_ubc_flush_pending()->pte_accessible() I am
>>> not 100% sure. Could pte_wrprotect() turn a PTE inaccessible on some
>>> architecture (write-only)? I don't think so.
>>>
>>>
>>> We have the following options:
>>>
>>> 1) pte_wrprotect(): fake that all was read-only.
>>>
>>> Either we do it like Dev suggests, or we do it as above early.
>>>
>>> The downside is that any code that might later want to know "was
>>> this possibly writable" would get that information. Well, it wouldn't
>>> get that information reliably *today* already (and that sounds a bit shaky).
>>
>> Makes sense to me :)
>>
>>> 2) Tell batching logic to honor pte_write()
>>>
>>> Sounds suboptimal for some cases that really don't care in the future.
>>>
>>> 3) Tell batching logic to tell us if any pte was writable: FPB_MERGE_WRITE
>>>
>>> ... then we know for sure whether any PTE was writable and we could
>>>
>>> (a) Pass it as we did before around to all checks, like pte_accessible().
>>>
>>> (b) Have an explicit restore PTE where we play save.
>>>
>>>
>>> I raised to Dev in private that softdirty handling is also shaky, as we
>>> batch over that. Meaning that we could lose or gain softdirty PTE bits in
>>> a batch.
>>
>> I guess we won't lose soft_dirty bits - only gain them (false positive):
>>
>> 1) get_and_clear_ptes() merges dirty bits from all PTEs via pte_mkdirty()
>> 2) pte_mkdirty() atomically sets both _PAGE_DIRTY and _PAGE_SOFT_DIRTY on
>> all architectures that support soft_dirty (x86, s390, powerpc, riscv)
>> 3) set_ptes() uses pte_advance_pfn() which keeps all flags intact
>>
>> So if any PTE in the batch was dirty, all PTEs become soft_dirty after
>> restore.
>
> PTEs can be softdirty without being dirty. That over-complicates the
> situation.
Ah, it's even trickier then :D
^ permalink raw reply [flat|nested] 13+ messages in thread