Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Lance Yang <lance.yang@linux.dev>
To: david@kernel.org
Cc: Liam.Howlett@oracle.com, akpm@linux-foundation.org,
	baohua@kernel.org, dev.jain@arm.com, harry.yoo@oracle.com,
	jannh@google.com, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, lorenzo.stoakes@oracle.com, riel@surriel.com,
	stable@kernel.org, vbabka@kernel.org,
	Lance Yang <lance.yang@linux.dev>
Subject: Re: [PATCH] mm/rmap: fix incorrect pte restoration for lazyfree folios
Date: Thu, 26 Feb 2026 15:09:40 +0800	[thread overview]
Message-ID: <20260226070940.96226-1-lance.yang@linux.dev> (raw)
In-Reply-To: <36e676b4-dc6f-45f7-b885-8685227ac6a8@kernel.org>


On Tue, Feb 24, 2026 at 05:01:50PM +0100, David Hildenbrand (Arm) wrote:
>On 2/24/26 12:43, Lorenzo Stoakes wrote:
>> On Tue, Feb 24, 2026 at 11:31:24AM +0000, Lorenzo Stoakes wrote:
>>> Thanks Dev.
>>>
>>> Andrew - why was commit 354dffd29575 ("mm: support batched unmap for lazyfree
>>> large folios during reclamation") merged?
>>>
>>> It had enormous amounts of review commentary at
>>> https://lore.kernel.org/all/146b4cb1-aa1e-4519-9e03-f98cfb1135d2@redhat.com/ and
>>> no tags, this should be a signal to wait for a respin _at least_, and really if
>>> late in cycle suggests it should wait a cycle.
>>>
>>> I've said going forward I'm going to check THP series for tags and if not
>>> present NAK if they hit mm-stable, I guess I'll extend that to rmap also.
>> 
>> Sorry I misread the original mail rushing through this is old... so this is less
>> pressing than I thought (for some reason I thought it was merged last cycle...!)
>> but it's a good example of how stuff can go unnoticed for a while.
>> 
>> In that case maybe a revert is a bit much and we just want the simplest possible
>> fix for backporting.
>
>Dev volunteered to un-messify some of the stuff here. In particular, to
>extend batching to all cases, not just some hand-selected ones.
>
>Support for file folios is on the way.
>
>> 
>> But is the proposed 'just assume wrprotect' sensible? David?
>
>In general, I think so. If PTEs were writable, they certainly have
>PAE set. The write-fault handler can fully recover from that (as PAE is
>set). If it's ever a performance problem (doubt), we can revisit.
>
>I'm wondering whether we should just perform the wrprotect earlier:
>
>diff --git a/mm/rmap.c b/mm/rmap.c
>index 0f00570d1b9e..19b875ee3fad 100644
>--- a/mm/rmap.c
>+++ b/mm/rmap.c
>@@ -2150,6 +2150,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> 
>                        /* Nuke the page table entry. */
>                        pteval = get_and_clear_ptes(mm, address, pvmw.pte, nr_pages);
>+
>+                       /*
>+                        * Our batch might include writable and read-only
>+                        * PTEs. When we have to restore the mapping, just
>+                        * assume read-only to not accidentally upgrade
>+                        * write permissions for PTEs that must not be
>+                        * writable.
>+                        */
>+                       pteval = pte_wrprotect(pteval);
>+
>                        /*
>                         * We clear the PTE but do not flush so potentially
>                         * a remote CPU could still be writing to the folio
>
>
>Given that nobody asks for writability (pte_write()) later.
>
>Or does someone care?
>
>Staring at set_tlb_ubc_flush_pending()->pte_accessible() I am
>not 100% sure. Could pte_wrprotect() turn a PTE inaccessible on some
>architecture (write-only)? I don't think so.
>
>
>We have the following options:
>
>1) pte_wrprotect(): fake that all was read-only.
>
>Either we do it like Dev suggests, or we do it as above early.
>
>The downside is that any code that might later want to know "was
>this possibly writable" would get that information. Well, it wouldn't
>get that information reliably *today* already (and that sounds a bit shaky).

Makes sense to me :)

>2) Tell batching logic to honor pte_write()
>
>Sounds suboptimal for some cases that really don't care in the future.
>
>3) Tell batching logic to tell us if any pte was writable: FPB_MERGE_WRITE
>
>... then we know for sure whether any PTE was writable and we could
>
>(a) Pass it as we did before around to all checks, like pte_accessible().
>
>(b) Have an explicit restore PTE where we play save.
>
>
>I raised to Dev in private that softdirty handling is also shaky, as we
>batch over that. Meaning that we could lose or gain softdirty PTE bits in
>a batch.

I guess we won't lose soft_dirty bits - only gain them (false positive):

1) get_and_clear_ptes() merges dirty bits from all PTEs via pte_mkdirty()
2) pte_mkdirty() atomically sets both _PAGE_DIRTY and _PAGE_SOFT_DIRTY on
all architectures that support soft_dirty (x86, s390, powerpc, riscv)
3) set_ptes() uses pte_advance_pfn() which keeps all flags intact

So if any PTE in the batch was dirty, all PTEs become soft_dirty after
restore.

>For lazyfree and file folios it doesn't really matter I guess. But it will
>matter once we unlock it for all anon folios.
>
>
>1) sounds simplest, 3) sounds cleanest long-term.
>

next prev parent reply	other threads:[~2026-02-26  7:10 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-24 11:09 Dev Jain
2026-02-24 11:31 ` Lorenzo Stoakes
2026-02-24 11:43   ` Lorenzo Stoakes
2026-02-24 16:01     ` David Hildenbrand (Arm)
2026-02-25  5:11       ` Dev Jain
2026-02-26 10:21         ` David Hildenbrand (Arm)
2026-02-26 10:27           ` Dev Jain
2026-02-26  7:01       ` Barry Song
2026-02-26 10:09         ` David Hildenbrand (Arm)
2026-02-26 10:20           ` Barry Song
2026-02-26  7:09       ` Lance Yang [this message]
2026-02-26 10:06         ` David Hildenbrand (Arm)
2026-02-26 10:28           ` Lance Yang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260226070940.96226-1-lance.yang@linux.dev \
    --to=lance.yang@linux.dev \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=david@kernel.org \
    --cc=dev.jain@arm.com \
    --cc=harry.yoo@oracle.com \
    --cc=jannh@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=riel@surriel.com \
    --cc=stable@kernel.org \
    --cc=vbabka@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox