From: Zi Yan <ziy@nvidia.com>
To: "David Hildenbrand (Arm)" <david@kernel.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>,
Song Liu <songliubraving@fb.com>, Chris Mason <clm@fb.com>,
David Sterba <dsterba@suse.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
Andrew Morton <akpm@linux-foundation.org>,
Lorenzo Stoakes <ljs@kernel.org>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Nico Pache <npache@redhat.com>,
Ryan Roberts <ryan.roberts@arm.com>, Dev Jain <dev.jain@arm.com>,
Barry Song <baohua@kernel.org>, Lance Yang <lance.yang@linux.dev>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>, Shuah Khan <shuah@kernel.org>,
<linux-btrfs@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
<linux-fsdevel@vger.kernel.org>, <linux-mm@kvack.org>,
<linux-kselftest@vger.kernel.org>
Subject: Re: [PATCH 7.2 v2 02/12] mm/khugepaged: add folio dirty check after try_to_unmap_flush()
Date: Thu, 16 Apr 2026 22:09:05 -0400
Message-ID: <C83282EC-F3C3-44FB-A8A4-615F67F9CEEC@nvidia.com>
In-Reply-To: <842D3AD2-32A0-41E1-AC09-08EC793EAF7F@nvidia.com>
On 14 Apr 2026, at 11:55, Zi Yan wrote:
> On 14 Apr 2026, at 6:38, David Hildenbrand (Arm) wrote:
>
>> On 4/13/26 21:20, Zi Yan wrote:
>>> This check ensures the correctness of collapse read-only THPs for FSes
>>> after READ_ONLY_THP_FOR_FS is enabled by default for all FSes supporting
>>> PMD THP pagecache.
>>>
>>> READ_ONLY_THP_FOR_FS only supports read-only fd and uses mapping->nr_thps
>>> and inode->i_writecount to prevent any write to read-only to-be-collapsed
>>> folios. In upcoming commits, READ_ONLY_THP_FOR_FS will be removed and the
>>> aforementioned mechanism will go away too. To ensure khugepaged functions
>>> as expected after the changes, roll back if any folio is dirty after
>>> try_to_unmap_flush(), since a dirty folio means this read-only folio
>>> received writes via mmap, which can happen between try_to_unmap() and
>>> try_to_unmap_flush() through cached TLB entries, and khugepaged does not
>>> support collapsing writable pagecache folios.
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>> mm/khugepaged.c | 18 ++++++++++++++++++
>>> 1 file changed, 18 insertions(+)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index d2f0acd2dac2..ec609e53082e 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -2121,6 +2121,24 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>>> */
>>> try_to_unmap_flush();
>>>
>>> + /*
>>> + * At this point, all folios are locked, unmapped, and all cached
>>> + * mappings in TLBs are flushed. No one else is able to write to these
>>> + * folios, since
>>> + * 1. writes via FS ops require folio locks (see write_begin_get_folio());
>>> + * 2. writes via mmap require taking a fault and locking folio locks.
>>> + *
>>
>> maybe simplify to "folios, since that would require taking the folio lock first."
>
> Sure.
>
>>
>>> + * khugepaged only works for read-only fd, make sure all folios are
>>> + * clean, since writes via mmap can happen between try_to_unmap() and
>>> + * try_to_unmap_flush() via cached TLB entries.
>>
>>
>> IIRC, after successful try_to_unmap() the PTE dirty bit would be synced to the
>> folio. That's what you care about, not about any stale TLB entries.
>
> Right. I missed the PTE dirty bit to folio dirtiness part.
>
>>
>> The important part is that the
>>
>> So can't we simply test for dirty folios after the refcount check (where
>> we made sure the folio is no longer mapped)?
>>
>
> I think so.

After more thought, this still works, but the reasoning is more subtle.
After try_to_unmap(), the PTE dirty bit, if set, is transferred to the folio
dirty flag. The code bails out if the folio is dirty, so a clean folio means
the removed PTEs were clean, and thus any TLB entries caching them have the
dirty bit clear. Any write to the folio after try_to_unmap() triggers a page
table walk and a page fault due to the invalid PTEs, and the faulting writer
cannot proceed because collapse_file() is holding the folio lock. As a
result, we can skip marking the folio dirty after the collapse.

If we were to allow a dirty folio after try_to_unmap(), we would need to mark
the folio dirty after the collapse, like shmem does. A potential optimization
would be to mark the after-collapse folio dirty only if any old folio is
dirty after try_to_unmap().
>
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index b2ac28ddd480..920e16067134 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -2089,6 +2089,14 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr,
>> goto out_unlock;
>> }
>>
>> + /* ... */
>> + if (!is_shmem && folio_test_dirty(folio)) {
>> + result = SCAN_PAGE_DIRTY_OR_WRITEBACK;
>> + xas_unlock_irq(&xas);
>> + folio_putback_lru(folio);
>> + goto out_unlock;
>> + }
>> +
>> /*
>> * Accumulate the folios that are being collapsed.
>>
>>
>> I guess we don't have to recheck folio_test_writeback() ?
>
> Right, writeback needs to take the folio lock and we are holding it, so
> others can only dirty the folio but cannot initiate writeback.
>
> Best Regards,
> Yan, Zi
--
Best Regards,
Yan, Zi