From: Usama Arif <usamaarif642@gmail.com>
To: Zi Yan <ziy@nvidia.com>, Kiryl Shutsemau <kas@kernel.org>,
lorenzo.stoakes@oracle.com
Cc: Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
linux-mm@kvack.org, hannes@cmpxchg.org, riel@surriel.com,
shakeel.butt@linux.dev, baohua@kernel.org, dev.jain@arm.com,
baolin.wang@linux.alibaba.com, npache@redhat.com,
Liam.Howlett@oracle.com, ryan.roberts@arm.com, vbabka@suse.cz,
lance.yang@linux.dev, linux-kernel@vger.kernel.org,
kernel-team@meta.com
Subject: Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
Date: Tue, 3 Feb 2026 14:07:25 -0800
Message-ID: <05d5918f-b61b-4091-b8c6-20eebfffc3c4@gmail.com>
In-Reply-To: <63D23D5F-AF35-4199-B52E-DFFC16DFDF91@nvidia.com>
On 02/02/2026 08:01, Zi Yan wrote:
> On 2 Feb 2026, at 5:44, Kiryl Shutsemau wrote:
>
>> On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
>>> For page table management, PUD THPs need to pre-deposit page tables
>>> that will be used when the huge page is later split. When a PUD THP
>>> is allocated, we cannot know in advance when or why it might need to
>>> be split (COW, partial unmap, reclaim), but we need page tables ready
>>> for that eventuality. Similar to how PMD THPs deposit a single PTE
>>> table, PUD THPs deposit a PMD table which itself contains deposited
>>> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
>>> infrastructure and a new pud_huge_pmd field in ptdesc to store the
>>> deposited PMD.
>>>
>>> The deposited PMD tables are stored as a singly-linked stack using only
>>> page->lru.next as the link pointer. A doubly-linked list using the
>>> standard list_head mechanism would cause memory corruption: list_del()
>>> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
>>> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
>>> tables have their own deposited PTE tables stored in pmd_huge_pte,
>>> poisoning lru.prev would corrupt the PTE table list and cause crashes
>>> when withdrawing PTE tables during split. PMD THPs don't have this
>>> problem because their deposited PTE tables don't have sub-deposits.
>>> Using only lru.next avoids the overlap entirely.
>>>
>>> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
>>> have. The page_vma_mapped_walk() function is extended to recognize and
>>> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
>>> flag tells the unmap path to split PUD THPs before proceeding, since
>>> there is no PUD-level migration entry format - the split converts the
>>> single PUD mapping into individual PTE mappings that can be migrated
>>> or swapped normally.
>>>
>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>> ---
>>> include/linux/huge_mm.h | 5 +++
>>> include/linux/mm.h | 19 ++++++++
>>> include/linux/mm_types.h | 5 ++-
>>> include/linux/pgtable.h | 8 ++++
>>> include/linux/rmap.h | 7 ++-
>>> mm/huge_memory.c | 8 ++++
>>> mm/internal.h | 3 ++
>>> mm/page_vma_mapped.c | 35 +++++++++++++++
>>> mm/pgtable-generic.c | 83 ++++++++++++++++++++++++++++++++++
>>> mm/rmap.c | 96 +++++++++++++++++++++++++++++++++++++---
>>> 10 files changed, 260 insertions(+), 9 deletions(-)
>>>
>
> <snip>
>
>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>> index d3aec7a9926ad..2047558ddcd79 100644
>>> --- a/mm/pgtable-generic.c
>>> +++ b/mm/pgtable-generic.c
>>> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>>> }
>>> #endif
>>>
>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>> +/*
>>> + * Deposit page tables for PUD THP.
>>> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
>>> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
>>> + *
>>> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
>>> + * list_head. This is because lru.prev (offset 16) overlaps with
>>> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
>>> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.
>>
>> This is ugly.
>>
>> Sounds like you want to use llist_node/head instead of list_head for this.
>>
>> You might able to avoid taking the lock in some cases. Note that
>> pud_lockptr() is mm->page_table_lock as of now.
>
> I agree. I used llist_node/head in my implementation[1] and it works.
> I have an illustration at[2] to show the concept. Feel free to reuse the code.
>
>
> [1] https://lore.kernel.org/all/20200928193428.GB30994@casper.infradead.org/
> [2] https://normal.zone/blog/2021-01-04-linux-1gb-thp-2/#new-mechanism
>
> Best Regards,
> Yan, Zi

Ah, I should have looked at your patches more closely! I started out using just lru
with list_add()/list_del(), which of course corrupted the list, and it took me
way more time than I would like to admit to debug what was going on! The diagrams
in your second link are really useful; I ended up drawing those by hand to debug
the corruption issue. I will point to that link in the next series :)

How about something like the below diff on top of this patch? (It does not
include the comment changes that I will make everywhere.)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 26a38490ae2e1..3653e24ce97d7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -99,6 +99,9 @@ struct page {
 				struct list_head buddy_list;
 				struct list_head pcp_list;
 				struct llist_node pcp_llist;
+
+				/* PMD pagetable deposit head */
+				struct llist_node pgtable_deposit_head;
 			};
 			struct address_space *mapping;
 			union {
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 2047558ddcd79..764f14d0afcbb 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -215,9 +215,7 @@ void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,

 	assert_spin_locked(pud_lockptr(mm, pudp));

-	/* Push onto stack using only lru.next as the link */
-	pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);
-	pud_huge_pmd(pudp) = pmd_page;
+	llist_add(&pmd_page->pgtable_deposit_head, (struct llist_head *)&pud_huge_pmd(pudp));
 }

 /*
@@ -227,16 +225,16 @@ void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
  */
 pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
 {
+	struct llist_node *node;
 	pgtable_t pmd_page;

 	assert_spin_locked(pud_lockptr(mm, pudp));

-	pmd_page = pud_huge_pmd(pudp);
-	if (!pmd_page)
+	node = llist_del_first((struct llist_head *)&pud_huge_pmd(pudp));
+	if (!node)
 		return NULL;

-	/* Pop from stack - lru.next points to next PMD page (or NULL) */
-	pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;
+	pmd_page = llist_entry(node, struct page, pgtable_deposit_head);

 	return page_address(pmd_page);
 }
Also, Zi, is it OK if I add your Co-developed-by tag to this patch in future
revisions? I didn't want to do that without your explicit approval.